Age | Commit message (Collapse) | Author |
|
|
|
|
|
|
|
|
|
Skip over successive whitespace in search string.
Make android use text_search.c
|
|
|
|
|
|
This should (pretty much) give us enough to write a mupdftoraster
equivalent of gstoraster.
|
|
To prepare for color management, we have to make the device colorspaces
per-context and able to be overridden by users.
|
|
|
|
No font support (just font names are sent through).
No group support.
No shading support.
No image mask support.
Line art, text position/size, bitmaps, clipping all seem to work
though.
|
|
Allow us to get an image as a png in a buffer.
|
|
|
|
|
|
|
|
Rename fz_new_output_buffer to be fz_new_output_with_buffer.
Rename fz_new_output_file to be fz_new_output_with_file.
This is more consistent with other functions such as
fz_new_pixmap_with_data.
|
|
Add some more consts's and use void *'s where appropriate.
|
|
Add configuration functions to control the hints set on a given device.
Use this to set whether image data is captured or not in the text
extraction process.
Also update the display list device to respect the device hints during
playback.
|
|
Extract the core of fz_write_png so that it can work to an fz_output *
rather than a FILE *. fz_write_png continues to work as before, but now
we can output to buffer to.
|
|
|
|
Extract such records as part of the text device.
|
|
In order to be able to output images (either in the pdfwrite device or
in the html conversion), we need to be able to get to the original
compressed data stream (or else we're going to end up recompressing
images). To do that, we need to expose all of the contents of pdf_image
into fz_image, so it makes sense to just amalgamate the two.
This has knock on effects for the creation of indexed colorspaces,
requiring some of that logic to be moved.
Also, we need to make xps use the same structures; this means pushing
PNG and TIFF support into the decoding code. Also we need to be able
to load just the headers from PNG/TIFF/JPEGs as xps doesn't include
dimension/resolution information.
Also, separate out all the fz_image stuff into fitz/res_image.c rather
than having it in res_pixmap.
|
|
This actually turned out to be far easier than I'd feared; remove the
explicit check that stopped this working, and ensure that we pass the
correct value in for the 'indexed' param.
Add a function to check for colorspaces being indexed. Bit nasty that
this requires a strcmp...
|
|
|
|
Update fz_text_analysis function to look for 'regions'; use this to
spot columns etc. Spot columns/width/alignment info.
"Intelligently" merge lines based on this.
Update html output to make use of this extra information.
|
|
Rework the text extraction structures - the broad strokes are similar
but we now hold more information at each stage to enable us to perform
more detailed analysis on the structure of the page.
We now hold:
fz_text_char's (the position, ucs value, and style of each char).
fz_text_span's (sets of chars that share the same baseline/transform,
with no more than an expected amount of whitespace between each char).
fz_text_line's (sets of spans that share the same baseline (more or
less, allowing for super/subscript, but possibly with a larger than
expected amount of whitespace).
fz_text_block's (sets of lines that follow one another)
After fz_text_analysis is called, we hope to have fz_text_blocks split
such that each block is a paragraph.
This new implementation has the same restrictions as the current
implementation it replaces, namely that chars are only considered for
addition onto the most recent span at the moment, but this revised form
is designed to allow more easy extension, and for this restriction to
be lifted.
Also add simple paragraph splitting based on finding the most common
'line distance' in blocks.
When we add spans together to collate them into lines, we record the
'horizontal' and 'vertical' spacing between them. (Not actually
horizontal or vertical, so much as 'in the direction of writing' and
'perpendicular to the direction of writing').
The 'horizontal' value enables us to more correctly output spaces when
converting to (say) html later.
The 'vertical' value enables us to spot subscripts and superscripts etc,
as well as small changes in the baseline due to style changes. We are
careful to base the baseline comparison on the baseline for the line,
not the baseline for the previous span, as otherwise superscripts/
subscripts on the end of the line affect what we match next.
Also, we are less tolerant of vertical shifts after a large gap. This
avoids false positives where different columns just happen to almost
line up.
|
|
|
|
This requires a slight change to the device interface.
Callers that use fz_begin_tile will see no change (and no caching
will be done). We add a new fz_begin_tile_id function that takes an
extra 'id' parameter, and returns 0 or 1. If the id is 0 then the
function behaves exactly as fz_being_tile does, and always returns 0.
The PDF and XPS code continues to call the old (uncached) version.
The display list code however generates a unique id for every
BEGIN_TILE node, and passes this in.
If the id is non zero, then it is taken to be a unique identifier
for this tile; the implementer of the fz_begin_tile_id entry point
can choose to use this to implement caching. If it chooses to ignore
the id (and do no caching), it returns 0.
If the device implements caching, then it can check on entry for a
previously rendered tile with the appropriate matrix and a matching id.
If it finds one, then it returns 1. It is the callers responsibility
to then skip over all the device calls that would usually happen to
render the tiles (i.e. to skip forward to the matching 'END_TILE'
operation).
|
|
|
|
When running under Windows, replace fopen with our own fopen_utf8
that converts from utf8 to unicode before calling the unicode
version of fopen.
|
|
Use of the bbox device to derive the area of the display list can lead
to bad results because of heuristics used to handle corners of stroked
paths.
|
|
|
|
Thanks to zeniko.
|
|
|
|
|
|
Also, move fz_is_infinite_rect and fz_is_empty_rect to be a static
inline rather than a macro. (Static inlines are preferred over
macros by at least one customers).
We appear to be calling them with bboxes too, so add fz_is_infinite_bbox
and fz_is_empty_bbox to solve this.
|
|
This is faster on ARM in particular. The primary changes involve
fz_matrix, fz_rect and fz_bbox.
Rather than passing 'fz_rect r' into a function, we now consistently
pass 'const fz_rect *r'. Where a rect is passed in and modified, we
miss the 'const' off. Where possible, we return the pointer to the
modified structure to allow 'chaining' of expressions.
The basic upshot of this work is that we do far fewer copies of
rectangle/matrix structures, and all the copies we do are explicit.
This has opened the way to other optimisations, also performed in
this commit.
Rather than using expressions like:
fz_concat(fz_scale(sx, sy), fz_translate(tx, ty))
we now have fz_pre_{scale,translate,rotate} functions. These
can be implemented much more efficiently than doing the fully
fledged matrix multiplication that fz_concat requires.
We add fz_rect_{min,max} functions to return pointers to the
min/max points of a rect. These can be used to in transformations
to directly manipulate values.
With a little casting in the path transformation code we can avoid
more needless copying.
We rename fz_widget_bbox to the more consistent fz_bound_widget.
|
|
Various functions in the code output to FILE *, when there are times
we'd like them to output to other things, such as fz_buffers.
Add an fz_output type, together with fz_printf to allow things to
output to this.
|
|
|
|
|
|
|
|
It used to be called fz_bbox_covering_rect. It does exact rounding outwards
of a rect, so that the resulting irect will always cover the entire
area of the input rect.
Use fz_round_rect for fuzzy rounding where near-integer values are
rounded inwards.
|
|
Inside the renderer we often deal with integer sized areas, for
pixmaps and scissoring regions. Use a new fz_irect type in these
places.
|
|
|
|
|
|
Simple patch to replace const char * with char *. I made the patch
myself, but I suspect it's extremely close to the one submitted
by Evgeniy A Dushistov, who reported the bug - many thanks!
|
|
While investigating samples_mupdf_001/2599.pdf.asan.58.1778, a leak
showed up while cleaning the file, due to not dropping an object in
an error case.
mutool clean -dif samples_mupdf_001/2599.pdf.asan.58.1778 leak.pdf
Simple Fix. Also extend PDF writing so that it can cope with skipping
errors so we at least get something out at the end.
Problem found in a test file supplied by Mateusz "j00ru" Jurczyk and
Gynvael Coldwind of the Google Security Team using Address Sanitizer.
Many thanks!
|
|
Two problems with tiling are fixed here.
Firstly, if the tiling bounds are huge, the 'patch' region (the region
we are writing into), can overflow, causing a SEGV due to the paint code
being very confused by pixmaps that go from just under INT_MAX to just
over INT_MIN. Fix this by checking explicitly for overflow in these
bounds.
If the tiles are stupidly huge, but the scissor is small, we can end up
looping many more times than we need to. We fix mapping the scissor
region back through the inverse transform, and intersecting this
with the pattern area.
Problem found in 4201.pdf.SIGSEGV.622.3560, a test file supplied by
Mateusz "j00ru" Jurczyk and Gynvael Coldwind of the Google Security
Team using Address Sanitizer. Many thanks!
|
|
When calculating the bbox for draw_glyph, if the x and y origins of
the glyph are extreme (too large to fit in an int), we get overflows
of the bbox; empty bboxes are transformed to large ones.
The fix is to introduce an fz_translate_bbox function that checks for
such things.
Also, we update various bbox/rect functions to check for empty bboxes
before they check for infinite ones (as a bbox of x0=0 x1=0 y0=0 y1=-1
will be detected both as infinite and empty).
Problem found in 2485.pdf.SIGSEGV.2a.1652, a test file supplied by
Mateusz "j00ru" Jurczyk and Gynvael Coldwind of the Google Security
Team using Address Sanitizer. Many thanks!
|
|
Whenever we have an error while pushing a gstate, we run the risk of
getting confused over how many pops we need etc.
With this commit we introduce some checking at the dev_null level that
attempts to make this behaviour consistent.
Any caller may now assume that calling an operation that pushes a clip
will always succeed. This means the only error cleanup they need to
do is to ensure that if they have pushed a clip (or begun a group, or
a mask etc) is to pop it too.
Any callee may now assume that if it throws an error during the call
to a device entrypoint that would create a group/clip/mask then no more
calls will be forthcoming until after the caller has completely finished
with that group.
This is achieved by the dev_null layer (the layer that indirects from
device calls through the device structure to the function pointers)
swallowing errors and regurgitating them later as required. A count is
kept of the number of pushes that have happened since an error
occurred during a push (including that initial one). When this count
reaches zero, the original error is regurgitated. This allows the
caller to keep the cookie correctly updated.
|