Age | Commit message (Collapse) | Author |
|
One for the raw span extraction pass, one for paragraph sorting,
and another for HTML output.
|
|
|
|
2 more memory problems pointed out by mhfan - many thanks.
In the text device, run through the line height list to it's length,
not to it's capacity.
In the X11 image code, when copying data unchanged, copy whole ints,
not just the first quarter of the bytes.
|
|
Thanks to mhfan for the reports.
|
|
Rename fz_new_output_buffer to be fz_new_output_with_buffer.
Rename fz_new_output_file to be fz_new_output_with_file.
This is more consistent with other functions such as
fz_new_pixmap_with_data.
|
|
JPEGs and PNGs are left unchanged. Any other image gets stored as a
PNG and sent as a data URL.
|
|
Add configuration functions to control the hints set on a given device.
Use this to set whether image data is captured or not in the text
extraction process.
Also update the display list device to respect the device hints during
playback.
|
|
Extract such records as part of the text device.
|
|
In order to be able to output images (either in the pdfwrite device or
in the html conversion), we need to be able to get to the original
compressed data stream (or else we're going to end up recompressing
images). To do that, we need to expose all of the contents of pdf_image
into fz_image, so it makes sense to just amalgamate the two.
This has knock on effects for the creation of indexed colorspaces,
requiring some of that logic to be moved.
Also, we need to make xps use the same structures; this means pushing
PNG and TIFF support into the decoding code. Also we need to be able
to load just the headers from PNG/TIFF/JPEGs as xps doesn't include
dimension/resolution information.
Also, separate out all the fz_image stuff into fitz/res_image.c rather
than having it in res_pixmap.
|
|
The div/spans still use table style rendering, but it's simpler
code (and html) this way.
|
|
|
|
|
|
|
|
Update fz_text_analysis function to look for 'regions'; use this to
spot columns etc. Spot columns/width/alignment info.
"Intelligently" merge lines based on this.
Update html output to make use of this extra information.
|
|
If a line starts with a recognised unicode bullet char, then split
the paragraph there. Don't use this lines separation from the previous
line to determine paragraph line step.
Also attempt to spot numbered list items (digits or roman numerals).
The digits/roman numerals code is disabled by default, as while it
worked, later commits made it less useful - but it may be worth
reinstating later.
|
|
Rework the text extraction structures - the broad strokes are similar
but we now hold more information at each stage to enable us to perform
more detailed analysis on the structure of the page.
We now hold:
fz_text_char's (the position, ucs value, and style of each char).
fz_text_span's (sets of chars that share the same baseline/transform,
with no more than an expected amount of whitespace between each char).
fz_text_line's (sets of spans that share the same baseline (more or
less, allowing for super/subscript, but possibly with a larger than
expected amount of whitespace).
fz_text_block's (sets of lines that follow one another)
After fz_text_analysis is called, we hope to have fz_text_blocks split
such that each block is a paragraph.
This new implementation has the same restrictions as the current
implementation it replaces, namely that chars are only considered for
addition onto the most recent span at the moment, but this revised form
is designed to allow more easy extension, and for this restriction to
be lifted.
Also add simple paragraph splitting based on finding the most common
'line distance' in blocks.
When we add spans together to collate them into lines, we record the
'horizontal' and 'vertical' spacing between them. (Not actually
horizontal or vertical, so much as 'in the direction of writing' and
'perpendicular to the direction of writing').
The 'horizontal' value enables us to more correctly output spaces when
converting to (say) html later.
The 'vertical' value enables us to spot subscripts and superscripts etc,
as well as small changes in the baseline due to style changes. We are
careful to base the baseline comparison on the baseline for the line,
not the baseline for the previous span, as otherwise superscripts/
subscripts on the end of the line affect what we match next.
Also, we are less tolerant of vertical shifts after a large gap. This
avoids false positives where different columns just happen to almost
line up.
|
|
|
|
This is faster on ARM in particular. The primary changes involve
fz_matrix, fz_rect and fz_bbox.
Rather than passing 'fz_rect r' into a function, we now consistently
pass 'const fz_rect *r'. Where a rect is passed in and modified, we
miss the 'const' off. Where possible, we return the pointer to the
modified structure to allow 'chaining' of expressions.
The basic upshot of this work is that we do far fewer copies of
rectangle/matrix structures, and all the copies we do are explicit.
This has opened the way to other optimisations, also performed in
this commit.
Rather than using expressions like:
fz_concat(fz_scale(sx, sy), fz_translate(tx, ty))
we now have fz_pre_{scale,translate,rotate} functions. These
can be implemented much more efficiently than doing the fully
fledged matrix multiplication that fz_concat requires.
We add fz_rect_{min,max} functions to return pointers to the
min/max points of a rect. These can be used to in transformations
to directly manipulate values.
With a little casting in the path transformation code we can avoid
more needless copying.
We rename fz_widget_bbox to the more consistent fz_bound_widget.
|
|
Better tolerate long horizontal spaces without breaking lines.
|
|
Send blocks as paragraphs, rather than lines. Send lines as spans.
|
|
Various functions in the code output to FILE *, when there are times
we'd like them to output to other things, such as fz_buffers.
Add an fz_output type, together with fz_printf to allow things to
output to this.
|
|
Thanks to zeniko for these.
Use otf as extension for opentype fonts.
fz_clampi should take ints, not floats!
Fix typo in prototype.
Squash unwanted warning.
Remove magic number in favour of #define.
Reset generation numbers when renumbering.
|
|
Instead of using macros for min/max/abs/clamp, we move to using
inline functions. These are more typesafe, and should produce
equivalent code on compilers that support inline (i.e. pretty much
everything we care about these days).
People can always do their own macro versions if they prefer.
|
|
Don't reset the size of arrays until we have successfully resized them.
|
|
The default page userspace transform changed to a top-down coordinate
space, and I forgot this detail when updating the text device branch.
Also remove the final block sorting pass to give preference to the original
PDF text order.
|
|
|
|
|
|
last character was across style changes.
|
|
|
|
Also tidy up the taking of fz_context *'s, and hide an unwanted indent
param.
|
|
Fix a couple of silly problems (one gccism, and one windows specific
bug).
|
|
Debug printing functions: debug -> print.
Accessors: get noun attribute -> noun attribute.
Find -> lookup when the returned value is not reference counted.
pixmap_with_rect -> pixmap_with_bbox.
We are reserving the word "find" to mean lookups that give ownership
of objects to the caller. Lookup is used in other places where the
ownership is not transferred, or simple values are returned.
The rename is done by the sed script in scripts/rename3.sed
|
|
|
|
Add some function documentation to fitz.h.
Add fz_ prefix to runetochar, chartorune, runelen etc. Change
fz_runetochar to avoid passing unnecessary pointer.
|
|
Attempt to separate public API from internal functions.
|
|
We only open one instance of freetype per document. We therefore
have to ensure that only 1 call to it takes place at a time. We
introduce a lock for this purpose (FZ_LOCK_FREETYPE), and arrange
to take/release it as required.
We also update the font context so it is properly shared.
|
|
|
|
The new fz_malloc_struct(A,B) macro allocates sizeof(B) bytes using
fz_malloc, and then passes the resultant pointer to Memento_label
to label it with "B".
This costs nothing in non-memento builds, but gives much nicer
listings of leaked blocks when memento is enabled.
|
|
|
|
The existing code uses recursion for text span handling. With sufficiently
many chained spans we get stack overflow.
Simple fixes to use a loop.
|
|
|
|
|
|
Huge pervasive change to lots of files, adding a context for exception
handling and allocation.
In time we'll move more statics into there.
Also fix some for(i = 0; i < function(...); i++) calls.
|
|
The run-together words are dead! Long live the underscores!
The postscript inspired naming convention of using all run-together
words has served us well, but it is now time for more readable code.
In this commit I have also added the sed script, rename.sed, that I used
to convert the source. Use it on your patches and application code.
|
|
freetype.
|
|
|
|
vertical values.
|
|
|
|
and change the signature of fz_realloc to match.
|
|
|