Age | Commit message (Collapse) | Author |
|
|
|
|
|
Move the definition of the structure contents into new fitz-imp.h
file. Make all code outside of fitz access the buffer through the
defined API.
Add a convenience API for people that want to get buffers as
null terminated C strings.
|
|
|
|
|
|
|
|
The generation number is only needed for decryption, and is assumed
to be zero or irrelevant for all other uses.
Store the original object number and generation in the xref slot, so
that we can decrypt them even when the objects have been renumbered,
without needing to pass the original object number around through
the stream loading APIs.
|
|
If thirdparty/luratech is populated then this decoder will be preferred
over jbig2dec (even if both are present).
|
|
Split compressed images (images based on a compressed buffer)
and pixmap images (images based on a pixmap) out into separate
subclasses.
|
|
Update the core fz_get_pixmap_from_image code to allow fetching
a subarea of a pixmap. We pass in the required subarea, together
with the transformation matrix for the whole image.
On return, we have a pixmap at least as big as was requested,
and the transformation matrix is updated to map the supplied
area to the correct place on the screen.
The draw device is updated to use this as required. Everywhere
else passes NULLs in, and so gets unchanged behaviour.
The standard 'get_pixmap' function has been updated to decode
just the required areas of the bitmaps.
This means that banded rendering of pages will decode just the
image subareas that are required for each band, limiting the
memory use. The downside to this is that each band will redecode
the image again to extract just the section we want.
The image subareas are put into the fz_store in the same way
as full images. Currently image areas in the store are only
matched when they match exactly; subareas are not identified
as being able to use existing images.
|
|
|
|
When sanitizing a file, while cleaning with decompression, I was
seeing a flate problem reported.
The issue is that pdf_open_filter was passing pdf_open_raw_filter
the orig_num as both num and orig_num. This was causing us to
find an fz_buffer attached to the (wrong) xref entry and to open
that instead of the underlying stream.
The fix is to propogate num a bit further.
|
|
|
|
|
|
Ensure that subsampling and caching happen in the generic image
code, not in the specific.
Previously, the subsampling happened only for images that were
decoded from streams. Images that were loaded direct were never
subsampled and hence were always cached at full size. After this
change both classes of image are correctly subsampled, and
the subsampled version kept in the cache.
This produces various image diffs in the cluster, none of which
are noticable to the naked eye.
|
|
If FZ_LARGEFILE is defined when building, MuPDF uses 64bit offsets
for files; this allows us to open streams larger than 2Gig.
The downsides to this are that:
* The xref entries are larger.
* All PDF ints are held as 64bit things rather than 32bit things
(to cope with /Prev entries, hint stream offsets etc).
* All file positions are stored as 64bits rather than 32.
The implementation works by detecting FZ_LARGEFILE. Some #ifdeffery
in fitz/system.h sets fz_off_t to either int or int64_t as appropriate,
and sets defines for fz_fopen, fz_fseek, fz_ftell etc as required.
These call the fseeko64 etc functions on linux (and so define
_LARGEFILE64_SOURCE) and the explicit 64bit functions on windows.
|
|
pdf_load_image_stream is supposed to return a buffer containing the
uncompressed stream from an object (or, in the case of image streams
where an fz_compression_params structure is supplied, a stream
decompressed up to the point of the image format compression).
We have an optimisation in pdf_load_image_stream to allow it to
return the existing buffer from a cached object rather than
reloading it again, but as bug 695549 points out, this breaks in
the case where the cached stream is compressed.
The suggested fix by the bug reporter (Stefan Klein) would work
in that it would stop compressed streams being returned as
uncompressed ones, but it is not perfect as it could lead to
several copies of shortstoppable image streams being loaded (and
for streams with null or empty array filters being mistaken for
compressed ones).
The fix here solves these cases too.
|
|
Currently, every PDF name is allocated in a pdf_obj structure, and
comparisons are done using strcmp. Given that we can predict most
of the PDF names we'll use in a given file, this seems wasteful.
The pdf_obj type is opaque outside the pdf-object.c file, so we can
abuse it slightly without anyone outside knowing.
We collect a sorted list of names used in PDF (resources/pdf/names.txt),
and we add a utility (namedump) that preprocesses this into 2 header
files.
The first (include/mupdf/pdf/pdf-names-table.h, included as part of
include/mupdf/pdf/object.h), defines a set of "PDF_NAME_xxxx"
entries. These are pdf_obj *'s that callers can use to mean "A PDF
object that means literal name 'xxxx'"
The second (source/pdf/pdf-name-impl.h) is a C array of names.
We therefore update the code so that rather than passing "xxxx" to
functions (such as pdf_dict_gets(...)) we now pass PDF_NAME_xxxx (to
pdf_dict_get(...)). This is a fairly natural (if widespread) change.
The pdf_dict_getp (and sibling) functions that take a path (e.g.
"foo/bar/baz") are therefore supplemented with equivalents that
take a list (pdf_dict_getl(... , PDF_NAME_foo, PDF_NAME_bar,
PDF_NAME_baz, NULL)).
The actual implementation of this relies on the fact that small
pointer values are never valid values. For a given pdf_obj *p,
if NULL < (intptr_t)p < PDF_NAME__LIMIT then p is a literal
entry in the name table.
This enables us to do fast pointer compares and to skip expensive
strcmps.
Also, bring "null", "true" and "false" into the same style as PDF names.
Rather than using full pdf_obj structures for null/true/false, use
special pointer values just above the PDF_NAME_ table. This saves
memory and makes comparisons easier.
|
|
Update buffer and filter processors.
Filter both colors and stroke states.
Move OCG hiding logic into interpreter.
|
|
Purge several embedded contexts:
Remove embedded context in fz_output.
Remove embedded context in fz_stream.
Remove embedded context in fz_device.
Remove fz_rebind_stream (since it is no longer necessary).
Remove embedded context in svg_device.
Remove embedded context in XML parser.
Add ctx argument to fz_document functions.
Remove embedded context in fz_document.
Remove embedded context in pdf_document.
Remove embedded context in pdf_obj.
Make fz_page independent of fz_document in the interface.
We shouldn't need to pass the document to all functions handling a page.
If a page is tied to the source document, it's redundant; otherwise it's
just pointless.
Fix reference counting oddity in fz_new_image_from_pixmap.
|
|
Rename fz_close to fz_drop_stream.
Rename fz_close_archive to fz_drop_archive.
Rename fz_close_output to fz_drop_output.
Rename fz_free_* to fz_drop_*.
Rename pdf_free_* to pdf_drop_*.
Rename xps_free_* to xps_drop_*.
|
|
The recent change to holding pdf xrefs in a sparse format has resulted
in a significant decrease in speed (x10). Malc points out that some of
this (2x) can be recovered simply by making pdf_cache_object return the
entry which it found the object in.
This saves us having to immediately call pdf_get_xref_entry again
afterwards.
I am still thinking about ways to try and get the remaining time back.
|
|
|
|
In load_sample_func, the stream is not closed and thus leaked if one of
the fz_read_byte or fz_read_bits calls throws (which might happen e.g.
on a Deflate data error).
In pdf_load_compressed_inline_image, the allocated buffer is not freed
if one of the stream initializers or the tile creation throws
(fz_open_leecher does not take ownership of the stream).
|
|
Return the null object rather than throwing an exception when parsing
indirect object references with negative object numbers.
Do range check for object numbers (1 .. length) when object numbers
are used instead.
Object number 0 is not a valid object number. It must always be 'free'.
|
|
After rushing to get the fix for a crash in, I realised the
routine could be simplified a bit.
|
|
Michael spotted that double closing an fz_stream on an inline image
does bad things. Simple fix is not to double close.
|
|
Previously pdf_process buffer did not understand inline images.
In order to make this work without needlessly duplicating complex code
from within pdf-op-run, the parsing of inline images has been moved to
happen in pdf-interpret.c. When the op_table entry for BI is called
it now expects the inline image to be in csi->img and the dictionary
object to be in csi->obj.
To make this work, we have had to improve the handling of inline images
in general. While non-inline images have been loaded and held in
memory in their compressed form and only decoded when required, until
now we have always loaded and decoded inline images immediately. This
has been due to the difficulty in knowing how many bytes of data to
read from the stream - we know the length of the stream once
uncompressed, but relating this to the compressed length is hard.
To cure this we introduce a new type of filter stream, a 'leecher'.
We insert a leecher stream before we build the filters required to
decode the image. We then read and discard the appropriate number
of uncompressed bytes from the filters. This pulls the compressed
data through the leecher stream, which stores it in an fz_buffer.
Thus images are now always held in their compressed forms in memory.
The pdf-op-run implementation is now trivial. The only real complexity
in the pdf-op-buffer implementation is the need to ensure that the
/Filter entry in the dictionary object matches the exact point at
which we backstopped the decompression.
|
|
In the existing code, if build_filter fails, chain will be freed. If
pdf_array_get fails however, it will leak.
Rectify this. No specific bug or example file, just observation arising
from discussions about previous commit.
|
|
pdf_open_raw_renumbered_stream and pdf_open_image_stream both have the
same issue that 98a111c8e49916f8f5ac21d11f4627540f9ddd49 fixes.
|
|
When constructing a filter chain, we pass ownership of 'chain' inwards.
This means we need to be careful not to double close chain.
This fixes:
5df97f8539d31745f1c45cc9e1468825_asan_heap-oob_a59afe_1862_225.pdf
a736faf6f4a34b7ad8eff207ba52aa57_asan_heap-oob_a59dd9_5744_4860.pdf
Thanks to Mateusz Jurczyk and Gynvael Coldwind of the Google Security
Team for providing the fuzzing files.
|
|
Certain optimized documents use a rather large common symbol dictionary
for all JBIG2 images. Caching these JBIG2Globals speeds up loading and
rendering of such documents.
|
|
This is required e.g. for 2314 - jpeg tables in tiff.xps.
This folds fz_open_resized_dct back into fz_open_dct instead of adding
further variations for calls with and without the jpegtables argument.
|
|
If opening a filter in pdf_open_crypt throws, the stream is closed in
the used fz_open_* method and thus mustn't be closed again.
|
|
We are testing this using a new -p flag to mupdf that sets a bitrate at
which data will appear to arrive progressively as time goes on. For
example:
mupdf -p 102400 pdf_reference17.pdf
Details of the scheme used here are presented in docs/progressive.txt
|
|
For historical reasons lots of the code uses "xref" when talking about
a pdf document. Now pdf_xref is a separate type this has become
confusing, so replace 'xref' with 'doc' for clarity.
|
|
|