path: root/fitz/dev_text.c
AgeCommit message (Collapse)Author
2013-04-30Split dev_text into three parts.Tor Andersson
One for the raw span extraction pass, one for paragraph sorting, and another for HTML output.
2013-04-30Move device hint functions to a more appropriate source file.Tor Andersson
2013-04-29Bug 693939: Fix memory problems.Robin Watts
2 more memory problems pointed out by mhfan - many thanks. In the text device, run through the line height list to its length, not to its capacity. In the X11 image code, when copying data unchanged, copy whole ints, not just the first quarter of the bytes.
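Both problems follow familiar patterns, which the sketch below tries to illustrate. The height_list type and both function names are hypothetical and are not taken from the MuPDF source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical growable list; the field names are illustrative only. */
typedef struct {
	float *items; /* storage for cap entries */
	size_t len;   /* entries actually in use */
	size_t cap;   /* entries allocated */
} height_list;

/* Sum the valid entries. Looping to cap instead of len would read
 * slots beyond the live data, which may be uninitialised. */
static float sum_heights(const height_list *list)
{
	float total = 0.0f;
	size_t i;

	assert(list->len <= list->cap);
	for (i = 0; i < list->len; i++)
		total += list->items[i];
	return total;
}

/* Copy n ints. The byte count is n * sizeof(int); passing n alone
 * copies only the first quarter of the data when int is 4 bytes. */
static void copy_ints(int *dst, const int *src, size_t n)
{
	memcpy(dst, src, n * sizeof(int));
}
```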
2013-04-29Fix various leaks in the dev_text device.Robin Watts
Thanks to mhfan for the reports.
2013-04-26Rename functions for consistency.Robin Watts
Rename fz_new_output_buffer to be fz_new_output_with_buffer. Rename fz_new_output_file to be fz_new_output_with_file. This is more consistent with other functions such as fz_new_pixmap_with_data.
2013-04-26Add image output for HTML.Robin Watts
JPEGs and PNGs are left unchanged. Any other image gets stored as a PNG and sent as a data URL.
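As a rough illustration of sending image bytes as a data URL, the sketch below base64-encodes a buffer and prefixes it with a MIME type. make_data_url is a hypothetical helper, not a MuPDF function, and it skips size-overflow checks for brevity.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char b64[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Return a malloc'd "data:<mime>;base64,<payload>" string, or NULL on
 * allocation failure. The caller frees the result. */
static char *make_data_url(const char *mime, const unsigned char *data, size_t len)
{
	size_t head = strlen("data:") + strlen(mime) + strlen(";base64,");
	size_t body = 4 * ((len + 2) / 3); /* 4 output chars per 3 input bytes */
	char *url = malloc(head + body + 1);
	char *p;
	size_t i;

	if (!url)
		return NULL;
	p = url + sprintf(url, "data:%s;base64,", mime);
	for (i = 0; i < len; i += 3) {
		/* Pack up to three bytes into a 24-bit group. */
		unsigned v = (unsigned)data[i] << 16;
		if (i + 1 < len) v |= (unsigned)data[i + 1] << 8;
		if (i + 2 < len) v |= (unsigned)data[i + 2];
		*p++ = b64[(v >> 18) & 63];
		*p++ = b64[(v >> 12) & 63];
		/* Missing input bytes are represented by '=' padding. */
		*p++ = (i + 1 < len) ? b64[(v >> 6) & 63] : '=';
		*p++ = (i + 2 < len) ? b64[v & 63] : '=';
	}
	*p = '\0';
	return url;
}
```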
2013-04-26Hint enabling/disabling for devices.Robin Watts
Add configuration functions to control the hints set on a given device. Use this to set whether image data is captured or not in the text extraction process. Also update the display list device to respect the device hints during playback.
2013-04-25Tweak fz_text_page to include image records.Robin Watts
Extract such records as part of the text device.
2013-04-11Move pdf_image to fz_image.Robin Watts
In order to be able to output images (either in the pdfwrite device or in the html conversion), we need to be able to get to the original compressed data stream (or else we're going to end up recompressing images). To do that, we need to expose all of the contents of pdf_image into fz_image, so it makes sense to just amalgamate the two. This has knock on effects for the creation of indexed colorspaces, requiring some of that logic to be moved. Also, we need to make xps use the same structures; this means pushing PNG and TIFF support into the decoding code. Also we need to be able to load just the headers from PNG/TIFF/JPEGs as xps doesn't include dimension/resolution information. Also, separate out all the fz_image stuff into fitz/res_image.c rather than having it in res_pixmap.
2013-03-26Reflow: Move from html output using tables to html output using div/spanRobin Watts
The div/spans still use table style rendering, but it's simpler code (and html) this way.
2013-03-26Spot indents.Robin Watts
2013-03-26Add superscript and subscript handling.Robin Watts
2013-03-26Simple dehyphenation support.Robin Watts
2013-03-26Text region analysis.Robin Watts
Update fz_text_analysis function to look for 'regions'; use this to spot columns etc. Spot columns/width/alignment info. "Intelligently" merge lines based on this. Update html output to make use of this extra information.
2013-03-26Add simple bullet point detection to paragraph analysis.Robin Watts
If a line starts with a recognised unicode bullet char, then split the paragraph there. Don't use this line's separation from the previous line to determine paragraph line step. Also attempt to spot numbered list items (digits or roman numerals). The digits/roman numerals code is disabled by default, as while it worked, later commits made it less useful - but it may be worth reinstating later.
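A minimal sketch of the bullet test, assuming each line is summarised by its first code point. The set of bullet characters here is an assumption chosen for illustration; the commit does not list which ones it recognises, and both function names are hypothetical.

```c
#include <stddef.h>

/* Assumed set: MIDDLE DOT, BULLET, TRIANGULAR BULLET, HYPHEN BULLET,
 * BULLET OPERATOR, WHITE BULLET. */
static int is_bullet(int ucs)
{
	static const int bullets[] = {
		0x00B7, 0x2022, 0x2023, 0x2043, 0x2219, 0x25E6
	};
	size_t i;

	for (i = 0; i < sizeof bullets / sizeof bullets[0]; i++)
		if (ucs == bullets[i])
			return 1;
	return 0;
}

/* first_chars[i] is the first code point of line i. Every line after
 * the first that opens with a bullet starts a new paragraph. */
static size_t count_bullet_paragraphs(const int *first_chars, size_t n)
{
	size_t paragraphs = n ? 1 : 0;
	size_t i;

	for (i = 1; i < n; i++)
		if (is_bullet(first_chars[i]))
			paragraphs++;
	return paragraphs;
}
```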
2013-03-26Rework text extraction structures.Robin Watts
Rework the text extraction structures - the broad strokes are similar but we now hold more information at each stage to enable us to perform more detailed analysis on the structure of the page.

We now hold:
fz_text_char's (the position, ucs value, and style of each char).
fz_text_span's (sets of chars that share the same baseline/transform, with no more than an expected amount of whitespace between each char).
fz_text_line's (sets of spans that share the same baseline (more or less, allowing for super/subscript), but possibly with a larger than expected amount of whitespace).
fz_text_block's (sets of lines that follow one another).

After fz_text_analysis is called, we hope to have fz_text_blocks split such that each block is a paragraph.

This new implementation has the same restrictions as the current implementation it replaces, namely that chars are only considered for addition onto the most recent span at the moment, but this revised form is designed to allow easier extension, and for this restriction to be lifted.

Also add simple paragraph splitting based on finding the most common 'line distance' in blocks.

When we add spans together to collate them into lines, we record the 'horizontal' and 'vertical' spacing between them. (Not actually horizontal or vertical, so much as 'in the direction of writing' and 'perpendicular to the direction of writing'.) The 'horizontal' value enables us to more correctly output spaces when converting to (say) html later. The 'vertical' value enables us to spot subscripts and superscripts etc, as well as small changes in the baseline due to style changes.

We are careful to base the baseline comparison on the baseline for the line, not the baseline for the previous span, as otherwise superscripts/subscripts on the end of the line affect what we match next. Also, we are less tolerant of vertical shifts after a large gap. This avoids false positives where different columns just happen to almost line up.
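The 'most common line distance' idea can be sketched as follows. This is an illustration under assumptions, not the code from this commit: the gaps array, the tolerance parameter and both function names are hypothetical.

```c
#include <stddef.h>

/* Return the gap that has the most neighbours within tol of it.
 * Quadratic, which is acceptable for the handful of lines in a block. */
static float most_common_gap(const float *gaps, size_t n, float tol)
{
	float best = 0.0f;
	size_t best_count = 0;
	size_t i, j;

	for (i = 0; i < n; i++) {
		size_t count = 0;
		for (j = 0; j < n; j++) {
			float d = gaps[j] - gaps[i];
			if (d < 0.0f)
				d = -d;
			if (d <= tol)
				count++;
		}
		if (count > best_count) {
			best_count = count;
			best = gaps[i];
		}
	}
	return best;
}

/* gaps[i] is the distance between line i and line i+1. A gap clearly
 * larger than the usual step is treated as a paragraph break. */
static size_t count_paragraphs(const float *gaps, size_t n, float tol)
{
	float step = most_common_gap(gaps, n, tol);
	size_t paragraphs = 1;
	size_t i;

	for (i = 0; i < n; i++)
		if (gaps[i] > step + tol)
			paragraphs++;
	return paragraphs;
}
```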
2013-02-11Fix problem with text selection caused by 0399332d54Paul Gardiner
2013-02-06Change to pass structures by reference rather than value.Robin Watts
This is faster on ARM in particular. The primary changes involve fz_matrix, fz_rect and fz_bbox.

Rather than passing 'fz_rect r' into a function, we now consistently pass 'const fz_rect *r'. Where a rect is passed in and modified, we miss the 'const' off. Where possible, we return the pointer to the modified structure to allow 'chaining' of expressions.

The basic upshot of this work is that we do far fewer copies of rectangle/matrix structures, and all the copies we do are explicit.

This has opened the way to other optimisations, also performed in this commit. Rather than using expressions like:

fz_concat(fz_scale(sx, sy), fz_translate(tx, ty))

we now have fz_pre_{scale,translate,rotate} functions. These can be implemented much more efficiently than doing the fully fledged matrix multiplication that fz_concat requires.

We add fz_rect_{min,max} functions to return pointers to the min/max points of a rect. These can be used in transformations to directly manipulate values.

With a little casting in the path transformation code we can avoid more needless copying.

We rename fz_widget_bbox to the more consistent fz_bound_widget.
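To show why a dedicated pre-translate can be cheaper than a general concatenation, here is a sketch using a hypothetical mat type. The names and the matrix convention are assumptions made for this example and are not copied from the fitz headers.

```c
/* Hypothetical affine matrix mapping (x, y) to
 * (a*x + c*y + e, b*x + d*y + f). */
typedef struct { float a, b, c, d, e, f; } mat;

/* General product: the result applies 'one' first, then 'two'.
 * Takes pointers and returns dst so calls can be chained. */
static mat *mat_concat(mat *dst, const mat *one, const mat *two)
{
	mat r;

	r.a = one->a * two->a + one->b * two->c;
	r.b = one->a * two->b + one->b * two->d;
	r.c = one->c * two->a + one->d * two->c;
	r.d = one->c * two->b + one->d * two->d;
	r.e = one->e * two->a + one->f * two->c + two->e;
	r.f = one->e * two->b + one->f * two->d + two->f;
	*dst = r; /* written last, so dst may alias either input */
	return dst;
}

/* Translate first, then apply m. Only e and f change, so this needs
 * four multiplies instead of the twelve used by mat_concat. */
static mat *mat_pre_translate(mat *m, float tx, float ty)
{
	m->e += tx * m->a + ty * m->c;
	m->f += tx * m->b + ty * m->d;
	return m;
}
```

Because both functions return the pointer they modified, a call such as mat_pre_translate(mat_pre_translate(&m, 1, 0), 0, 1) works without any temporary copies.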
2013-02-06Tweak text extraction block creation.Robin Watts
Better tolerate long horizontal spaces without breaking lines.
2013-02-05Tweak HTML output.Robin Watts
Send blocks as paragraphs, rather than lines. Send lines as spans.
2013-02-04Add fz_output, and make output functions use it.Robin Watts
Various functions in the code output to FILE *, when there are times we'd like them to output to other things, such as fz_buffers. Add an fz_output type, together with fz_printf to allow things to output to this.
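One common way to build such an abstraction is a small struct holding a write callback, with a printf-style helper on top. The sketch below shows the idea only; out, fixed_buf and out_printf are hypothetical names and do not reflect the real fz_output layout.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical output sink: a callback plus its private state. */
typedef struct out_s out;
struct out_s {
	void *opaque;
	int (*write)(out *o, const char *data, size_t n);
};

/* Backend 1: write to a FILE *. */
static int write_file(out *o, const char *data, size_t n)
{
	return fwrite(data, 1, n, (FILE *)o->opaque) == n ? 0 : -1;
}

/* Backend 2: append to a fixed-size memory buffer. */
typedef struct { char data[256]; size_t len; } fixed_buf;

static int write_buf(out *o, const char *data, size_t n)
{
	fixed_buf *b = o->opaque;

	if (n > sizeof b->data - b->len)
		return -1; /* would not fit */
	memcpy(b->data + b->len, data, n);
	b->len += n;
	return 0;
}

/* Format into a temporary, then hand the bytes to whichever backend
 * is installed. Returns 0 on success, -1 on error or truncation. */
static int out_printf(out *o, const char *fmt, ...)
{
	char tmp[256];
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(tmp, sizeof tmp, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= sizeof tmp)
		return -1;
	return o->write(o, tmp, (size_t)n);
}
```

Callers only ever see out_printf, so switching from a file to a buffer means installing a different callback rather than editing every print site.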
2012-11-29Bug 693463: Various small fixes.Robin Watts
Thanks to zeniko for these. Use otf as extension for opentype fonts. fz_clampi should take ints, not floats! Fix typo in prototype. Squash unwanted warning. Remove magic number in favour of #define. Reset generation numbers when renumbering.
2012-07-05Move to static inline functions from macros.Robin Watts
Instead of using macros for min/max/abs/clamp, we move to using inline functions. These are more typesafe, and should produce equivalent code on compilers that support inline (i.e. pretty much everything we care about these days). People can always do their own macro versions if they prefer.
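A short illustration of the type-safety point, using hypothetical names rather than the actual fitz helpers. The inline function evaluates each argument exactly once, whereas the macro may evaluate an argument twice.

```c
/* Inline versions: arguments are typed and evaluated once. */
static inline int my_maxi(int a, int b)
{
	return a > b ? a : b;
}

static inline int my_clampi(int i, int lo, int hi)
{
	return i < lo ? lo : (i > hi ? hi : i);
}

/* Macro version for comparison: whichever argument wins the
 * comparison is evaluated a second time, so side effects such as
 * ++n happen twice. */
#define MACRO_MAX(a, b) ((a) > (b) ? (a) : (b))
```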
2012-04-05Fix potential problems on malloc failure.Robin Watts
Don't reset the size of arrays until we have successfully resized them.
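The pattern described here can be sketched with plain realloc as follows. int_array and int_array_push are hypothetical; the real code uses the fitz allocators, which are not shown.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
	int *items;
	size_t len;
	size_t cap;
} int_array;

/* Append a value, growing the storage if needed. Returns 0 on
 * success and -1 on failure, in which case the array is unchanged. */
static int int_array_push(int_array *a, int value)
{
	if (a->len == a->cap) {
		size_t new_cap = a->cap ? a->cap * 2 : 4;
		int *p;

		if (new_cap > SIZE_MAX / sizeof *p)
			return -1;
		p = realloc(a->items, new_cap * sizeof *p);
		if (!p)
			return -1; /* items and cap still describe the old block */
		a->items = p;
		a->cap = new_cap; /* recorded only once the resize has succeeded */
	}
	a->items[a->len++] = value;
	return 0;
}
```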
2012-03-19Fix typo in text device where lines would group into blocks too eagerly.Tor Andersson
The default page userspace transform changed to a top-down coordinate space, and I forgot this detail when updating the text device branch. Also remove the final block sorting pass to give preference to the original PDF text order.
2012-03-19Don't create empty spans and lines in the text device.Tor Andersson
2012-03-14Some fixes to the new text device, courtesy of Zeniko.Tor Andersson
2012-03-14Put 'lastchar' into the text device struct to remember what the last character was across style changes.Tor Andersson
2012-03-14Fix memory leaks in style sheet handling of the new text device.Tor Andersson
2012-03-13Make fz_print functions all take a FILE *.Robin Watts
Also tidy up the taking of fz_context *'s, and hide an unwanted indent param.
2012-03-13Fix building on windows.Robin Watts
Fix a couple of silly problems (one gccism, and one windows specific bug).
2012-03-13Rename some functions and accessors to be more consistent.Tor Andersson
Debug printing functions: debug -> print. Accessors: get noun attribute -> noun attribute. Find -> lookup when the returned value is not reference counted. pixmap_with_rect -> pixmap_with_bbox. We are reserving the word "find" to mean lookups that give ownership of objects to the caller. Lookup is used in other places where the ownership is not transferred, or simple values are returned. The rename is done by the sed script in scripts/rename3.sed
2012-03-12Create style sheet and group extracted text into blocks, lines and spans.Tor Andersson
2012-03-07More release tidyups.Robin Watts
Add some function documentation to fitz.h. Add fz_ prefix to runetochar, chartorune, runelen etc. Change fz_runetochar to avoid passing unnecessary pointer.
2012-03-06Split fitz.h/mupdf.h into internal/external headers.Robin Watts
Attempt to separate public API from internal functions.
2012-02-13Add locking around freetype calls.Robin Watts
We only open one instance of freetype per document. We therefore have to ensure that only 1 call to it takes place at a time. We introduce a lock for this purpose (FZ_LOCK_FREETYPE), and arrange to take/release it as required. We also update the font context so it is properly shared.
2012-02-03Be consistent about passing a fz_context in path/text/shade functions.Tor Andersson
2011-12-16Add fz_malloc_struct, and make code use it.Robin Watts
The new fz_malloc_struct(A,B) macro allocates sizeof(B) bytes using fz_malloc, and then passes the resultant pointer to Memento_label to label it with "B". This costs nothing in non-memento builds, but gives much nicer listings of leaked blocks when memento is enabled.
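The labelling trick relies on the preprocessor's stringify operator. The sketch below uses a hypothetical NEW_STRUCT macro and a TRACK_ALLOCS switch in place of fz_malloc_struct and Memento, whose real definitions are not reproduced here.

```c
#include <stdlib.h>

#ifdef TRACK_ALLOCS
/* Debug build: remember the type name of the latest allocation. */
static const char *last_label;

static void *label_block(void *p, const char *label)
{
	if (p)
		last_label = label;
	return p;
}
#else
/* Normal build: the label disappears at compile time. */
#define label_block(p, label) (p)
#endif

/* Allocate sizeof(type) bytes and label the block with the type's
 * name, produced by the # stringify operator. */
#define NEW_STRUCT(type) ((type *)label_block(malloc(sizeof(type)), #type))

/* Example type used to demonstrate the macro. */
typedef struct { int x; float y; } demo_span;
```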
2011-11-25Merge branch 'master' into contextRobin Watts
2011-11-17Fix bug 692627: stack overflows in text handling.Robin Watts
The existing code uses recursion for text span handling. With sufficiently many chained spans we get stack overflow. Simple fixes to use a loop.
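A minimal sketch of the change on a hypothetical linked list of spans. The recursive form uses one stack frame per span unless the compiler happens to optimise it, while the loop uses constant stack space however long the chain is.

```c
#include <stddef.h>

/* Hypothetical chained span; only the fields needed here. */
typedef struct span span;
struct span {
	span *next;
	int len;
};

/* Recursive walk: stack depth grows with the length of the chain. */
static int total_len_recursive(const span *s)
{
	return s ? s->len + total_len_recursive(s->next) : 0;
}

/* Iterative walk: same result, constant stack usage. */
static int total_len_iterative(const span *s)
{
	int total = 0;

	for (; s; s = s->next)
		total += s->len;
	return total;
}
```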
2011-09-21Add warning context.Tor Andersson
2011-09-21Rename malloc functions for arrays (fz_calloc and fz_realloc).Tor Andersson
2011-09-15Add context to mupdf.Robin Watts
Huge pervasive change to lots of files, adding a context for exception handling and allocation. In time we'll move more statics into there. Also fix some for(i = 0; i < function(...); i++) calls.
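The commit does not say why the for(i = 0; i < function(...); i++) calls were changed. One likely reason is that the bound is re-evaluated on every iteration, which the sketch below makes visible with a hypothetical item_count function that counts its own calls.

```c
/* Number of times item_count has been called. */
static int calls;

/* Hypothetical stand-in for a function used as a loop bound. */
static int item_count(void)
{
	calls++;
	return 5;
}

/* Bound evaluated before every iteration, including the final
 * failing test. */
static int sum_naive(void)
{
	int i, total = 0;

	for (i = 0; i < item_count(); i++)
		total += i;
	return total;
}

/* Bound evaluated once and kept in a local. */
static int sum_hoisted(void)
{
	int i, total = 0;
	int n = item_count();

	for (i = 0; i < n; i++)
		total += i;
	return total;
}
```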
2011-04-04Le Roi est mort, vive le Roi!Tor Andersson
The run-together words are dead! Long live the underscores! The postscript inspired naming convention of using all run-together words has served us well, but it is now time for more readable code. In this commit I have also added the sed script, rename.sed, that I used to convert the source. Use it on your patches and application code.
2011-02-23Remove fthint workaround for DynaLab fonts, since that is now a part of freetype.Tor Andersson
2011-02-18Make pdfdraw -tt output valid XML.Tor Andersson
2011-02-08Use horizontal metrics to create text boxes instead of guessing at bad vertical values.Tor Andersson
2011-02-03Various patches from SumatraPDF.Tor Andersson
2011-01-27Add fz_calloc function to check for integer overflow when allocating arrays, and change the signature of fz_realloc to match.Tor Andersson
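An overflow-checked array allocator is usually written along the following lines. checked_alloc_array and checked_realloc_array are hypothetical names; the actual fz_calloc and fz_realloc signatures are not reproduced here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate count elements of size bytes each. Returns NULL if
 * count * size would overflow size_t, or if malloc fails. */
static void *checked_alloc_array(size_t count, size_t size)
{
	if (size != 0 && count > SIZE_MAX / size)
		return NULL;
	return malloc(count * size);
}

/* Resize with the same (count, size) signature and the same check.
 * On failure the original block is left untouched. */
static void *checked_realloc_array(void *p, size_t count, size_t size)
{
	if (size != 0 && count > SIZE_MAX / size)
		return NULL;
	return realloc(p, count * size);
}
```

Taking the count and element size separately is what makes the check possible: once the caller has multiplied them, an overflow can no longer be detected.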
2010-07-26Fix bug where storage capacity of 0 or 1 was not taken care of.Sebastian Rasmussen