summaryrefslogtreecommitdiff
path: root/source/pdf/pdf-interpret.c
AgeCommit message (Collapse)Author
2015-03-24Rework handling of PDF names for speed and memory.Robin Watts
Currently, every PDF name is allocated in a pdf_obj structure, and comparisons are done using strcmp. Given that we can predict most of the PDF names we'll use in a given file, this seems wasteful. The pdf_obj type is opaque outside the pdf-object.c file, so we can abuse it slightly without anyone outside knowing. We collect a sorted list of names used in PDF (resources/pdf/names.txt), and we add a utility (namedump) that preprocesses this into 2 header files. The first (include/mupdf/pdf/pdf-names-table.h, included as part of include/mupdf/pdf/object.h), defines a set of "PDF_NAME_xxxx" entries. These are pdf_obj *'s that callers can use to mean "A PDF object that means literal name 'xxxx'" The second (source/pdf/pdf-name-impl.h) is a C array of names. We therefore update the code so that rather than passing "xxxx" to functions (such as pdf_dict_gets(...)) we now pass PDF_NAME_xxxx (to pdf_dict_get(...)). This is a fairly natural (if widespread) change. The pdf_dict_getp (and sibling) functions that take a path (e.g. "foo/bar/baz") are therefore supplemented with equivalents that take a list (pdf_dict_getl(... , PDF_NAME_foo, PDF_NAME_bar, PDF_NAME_baz, NULL)). The actual implementation of this relies on the fact that small pointer values are never valid values. For a given pdf_obj *p, if NULL < (intptr_t)p < PDF_NAME__LIMIT then p is a literal entry in the name table. This enables us to do fast pointer compares and to skip expensive strcmps. Also, bring "null", "true" and "false" into the same style as PDF names. Rather than using full pdf_obj structures for null/true/false, use special pointer values just above the PDF_NAME_ table. This saves memory and makes comparisons easier.
2015-03-24Don't pass interpreter context to pdf_processor opcode callbacks.Tor Andersson
Update buffer and filter processors. Filter both colors and stroke states. Move OCG hiding logic into interpreter.
2015-02-17Add ctx parameter and remove embedded contexts for API regularity.Tor Andersson
Purge several embedded contexts: Remove embedded context in fz_output. Remove embedded context in fz_stream. Remove embedded context in fz_device. Remove fz_rebind_stream (since it is no longer necessary). Remove embedded context in svg_device. Remove embedded context in XML parser. Add ctx argument to fz_document functions. Remove embedded context in fz_document. Remove embedded context in pdf_document. Remove embedded context in pdf_obj. Make fz_page independent of fz_document in the interface. We shouldn't need to pass the document to all functions handling a page. If a page is tied to the source document, it's redundant; otherwise it's just pointless. Fix reference counting oddity in fz_new_image_from_pixmap.
2015-02-17Rename fz_close_* and fz_free_* to fz_drop_*.Tor Andersson
Rename fz_close to fz_drop_stream. Rename fz_close_archive to fz_drop_archive. Rename fz_close_output to fz_drop_output. Rename fz_free_* to fz_drop_*. Rename pdf_free_* to pdf_drop_*. Rename xps_free_* to xps_drop_*.
2014-09-17Fix whitespace.Tor Andersson
2014-09-09test-device: Abort interpretation when color found.Robin Watts
Add a new class of errors and use them to abort interpretation when the test device detects a color page.
2014-03-19Add routine to clean pdf content streams for pages.Robin Watts
New routine to filter the content streams for pages, xobjects, type3 charprocs, patterns etc. The filtered streams are guaranteed to be properly matched with q/Q's, and to not have changed the top level ctm. Additionally we remove (some) repeated settings of colors etc. This filtering can be extended to be smarter later. The idea of this is to both repair after editing, and to leave the streams in a form that can be easily appended to. This is preparatory to work on Bates numbering and Watermarking. Currently the streams produced are uncompressed.
2014-03-18Fix operator buffering of inline images.Robin Watts
Previously pdf_process buffer did not understand inline images. In order to make this work without needlessly duplicating complex code from within pdf-op-run, the parsing of inline images has been moved to happen in pdf-interpret.c. When the op_table entry for BI is called it now expects the inline image to be in csi->img and the dictionary object to be in csi->obj. To make this work, we have had to improve the handling of inline images in general. While non-inline images have been loaded and held in memory in their compressed form and only decoded when required, until now we have always loaded and decoded inline images immediately. This has been due to the difficulty in knowing how many bytes of data to read from the stream - we know the length of the stream once uncompressed, but relating this to the compressed length is hard. To cure this we introduce a new type of filter stream, a 'leecher'. We insert a leecher stream before we build the filters required to decode the image. We then read and discard the appropriate number of uncompressed bytes from the filters. This pulls the compressed data through the leecher stream, which stores it in an fz_buffer. Thus images are now always held in their compressed forms in memory. The pdf-op-run implementation is now trivial. The only real complexity in the pdf-op-buffer implementation is the need to ensure that the /Filter entry in the dictionary object matches the exact point at which we backstopped the decompression.
2014-03-17Ensure that BDC operators get both params.Robin Watts
Currently, when parsing, each time we encounter a name, we throw away the last name we had. BDC operators are called with: /Name <object> BDC If the <object> is a name, we lose the original /Name. To fix this, parsing a name when we already have a name will cause the name to be stored as an object. This has various knock on effects throughout the code to read from csi->obj rather than csi->name. Also, ensure that when cleaning, we collect a list of the object names in our new resources dictionary.
2014-03-04Fix potential leak of pdf_process state.Robin Watts
In the event that an annot is hidden or invisible, the pdf_process would never be freed. Solve that here. Thanks to Simon for spotting this!
2014-03-04Add pdf_process interface.Robin Watts
Currently the only processing we can do of PDF pages is to run them through an fz_device. We introduce new "pdf_process" functionality here to enable us to do more things. We define a pdf_processor structure with a set of function pointers in, one per PDF operator, together with functions for processing xobjects etc. The guts of pdf_run_page_contents and pdf_run_annot operations are then extracted to give pdf_process_page_contents and pdf_process_annot, and the originals implemented in terms of these. This commit contains just one instance of a pdf_processor, namely the "run" processor, which contains the original code refactored. The graphical state (and device pointer) is now part of private data to the run operator set, rather than being in pdf_csi.
2014-02-28Fix potential illegal access.Robin Watts
pdf_flush_text can cause the list of gstates to be extended. This can in turn cause them to move in memory. This means that any gstate pointers already held can be invalidated. Update the code to allow for this.
2014-02-11fix memory leak in 08c632046474e72f2e08e54f31e31a343808f6cbSimon Bünzli
2014-02-10Move rdb and file entries into pdf_csi.Robin Watts
This makes every pdf_run_XX operator function have the same function type. This paves the way for future changes in this area.
2014-02-10Tweak handling of PDF arrays during text object operator stream parsing.Robin Watts
Acrobat honours Tc and Tw operators found during parsing TJ arrays. We update the code here to cope. Possibly to completely match we should honour other operators too, but this will do for now. This maintains the behaviour of tests_private/pdf/sumatra/916_-_invalid_argument_to_TJ.pdf 916.pdf and improves the behaviour in general.
2014-01-17Bug 694899: Avoid using invalid gstate pointer.Robin Watts
When we call pdf_begin_group, this can go away and do lots of drawing. This can result in the gstate stack growing, which can involve a realloc. Any gstate pointer we are holding must therefore be recalculated after such a call. The neatest way to do this is to get pdf_begin_group to return the gstate pointer, thus making it hard to forget to do. This solves: e2a1dda5393f4cb8a446fd8edd9d94f9_asan_heap-uaf_b938cf_2075_2393.pdf Thanks to Mateusz Jurczyk and Gynvael Coldwind of the Google Security Team for providing the example files.
2014-01-16Bug 694894: Avoid throwing away an object while in use.Robin Watts
When we call to execute a pattern, we clear out the pdf_csi (the interpreter state). This involves clearing the stack and throwing away the record of the object we have just parsed. Unfortunately, when filling glyphs with a pattern, that object is still in use. We therefore amend the pdf_run_contents_stream to safely stash the object away and restore it afterwards. This solves this problem, and protects us against any other similar problems that might also arise. This solves: b8e2b57991896bf8120215cfbf7b54bb_asan_heap-uaf_86064f_2362_2587.pdf Thanks to Mateusz Jurczyk and Gynvael Coldwind of the Google Security Team for providing the example files.
2013-11-28Bug 694127: Valgrind fix for pdf_decode_cmapRobin Watts
A poorly formed string can cause us to overrun the end of the buffer. Now we check the end of the string at each stage to avoid this.
2013-11-05Improve stroke state function names that take the dash array length.Tor Andersson
2013-10-14Handle stroke+fill operations with transparency/blendmodes.Robin Watts
When stroking and filling in a single operation, we are supposed to form the complete stroke+fill image, then blend it back, rather than filling and blending, then stroking and blending. This only matters during transparency, or with non-normal blend modes. We fix MuPDF to push a knockout group when doing such operations.
2013-10-10Use the 'rect' param to fz_clip_path.Robin Watts
fz_clip_path takes a rect parameter, but all the callers of it use NULL. In most cases they have a perfectly reasonable value that they could pass to hand anyway. Update the code to pass this value, which saves the need for the scissor stack keeping code to recalculate it.
2013-09-27fix bug 694618Simon Bünzli
For Separation and DeviceN colorspaces, the initial color value is 1.0 for all components instead of 0.0 as for most other colorspaces. The current initialization in pdf_set_colorspace initializes for CMYK which happens to work for all non-tint colorspaces.
2013-09-10correctly set indexed colors in pdf_set_colorSimon Bünzli
Required for 1879_-_Indexed_colors_wrongly_converted.pdf Also, removing broken code in the same place (where mat->v[] is overwritten right after being set in the L*a*b* case).
2013-09-04Fix memory leak due to softmasks.Robin Watts
A user (sebblonline) reports a problem when rendering the 300Meg document downloadable from: http://www.mitsubishicarbide.com/EU/de/product/epaper/index.html (7th icon on the bottom). Memento builds report that softmask objects are leaked. Tracing these it appears that the handling of softmasks in pdf_run_xobject is not quite nested right. Fixing this reveals another problem with clipping paths being doubly removed. Adding a gsave/grestore pair solves this and leaves us with a well behaved program.
2013-08-28default OCGs to visibleSimon Bünzli
By default an OCG is supposed to be visible (for a testcase, see 2011 - ocg without ocgs invisible.pdf). Also, the default visibility value can be overwritten in both ways, so that pdf_is_hidden_ocg must check the state both for being "OFF" and "ON" (testcase was 2066 - ocg not printed.pdf rendered with event="Print").
2013-08-28fix memory leaksSimon Bünzli
* If fz_alpha_from_gray throws in fz_render_t3_glyph, then glyph is leaked. * If fz_new_image throws in pdf_load_image_imp, then colorspace and mask are leaked. * pdf_copy_pattern_gstate overwrites font and softmask without dropping them first.
2013-07-19Initial work on progressive loadingRobin Watts
We are testing this using a new -p flag to mupdf that sets a bitrate at which data will appear to arrive progressively as time goes on. For example: mupdf -p 102400 pdf_reference17.pdf Details of the scheme used here are presented in docs/progressive.txt
2013-06-27Move to using a flags bit rather than "Dirty" dict entries.Robin Watts
Correct the naming scheme for pdf_obj_xxx functions.
2013-06-25Rid the world of "pdf_document *xref".Robin Watts
For historical reasons lots of the code uses "xref" when talking about a pdf document. Now pdf_xref is a separate type this has become confusing, so replace 'xref' with 'doc' for clarity.
2013-06-25Update pdf_obj's to have a pdf_document field.Robin Watts
Remove the fz_context field to avoid the structure growing.
2013-06-20Rearrange source files.Tor Andersson