path: root/source/pdf/pdf-lex.c
Age  Commit message  Author
2018-08-28  Truncate name tokens that are too long.  (Tor Andersson)
2018-08-10  Treat invalid and truncated hex string characters as '0'.  (Tor Andersson)
2018-07-06  Add debug functionality to show lexed stream contents.  (Robin Watts)
If you define DUMP_LEXER_STREAM, then the lexer dumps the input that it reads from the stream.
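For illustration, a minimal sketch of what such a compile-time switch might look like; the wrapper function and its use of stdio are stand-ins, only the DUMP_LEXER_STREAM define is taken from the commit message.

    #include <stdio.h>

    /* Hypothetical byte-fetching wrapper: when DUMP_LEXER_STREAM is
     * defined, every byte the lexer reads is echoed to stderr so the
     * lexed input can be inspected. */
    static int lexer_getc(FILE *stm)
    {
        int c = fgetc(stm);
    #ifdef DUMP_LEXER_STREAM
        if (c != EOF)
            fputc(c, stderr);
    #endif
        return c;
    }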
2018-01-31  Return error token if strings are unterminated.  (Tor Andersson)
2018-01-31  Return PDF_TOK_ERROR when encountering isolated '>' and ')' characters.  (Tor Andersson)
Also return PDF_TOK_ERROR instead of swallowing the opening quote of a string in pdf_lex_no_string, and fix the repair code so that it no longer skips an extra byte whenever it scans an error token.
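A rough sketch of the first part of this change, using a simplified cursor and token set rather than MuPDF's real fz_stream-based lexer:

    typedef enum { TOK_ERROR, TOK_CLOSE_DICT } tok;

    /* '>' is only legal as the start of ">>"; an isolated '>' (like a
     * stray ')') is reported as an error token instead of being
     * silently swallowed. */
    static tok lex_gt(const char **p)
    {
        if (**p == '>')
        {
            (*p)++;
            return TOK_CLOSE_DICT;
        }
        return TOK_ERROR;
    }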
2017-12-13  Fix 698785: Catch malformed numbers in PDF lexical scanner.  (Tor Andersson)
Return error tokens when parsing numbers with trailing garbage rather than ignoring the extra characters. Also handle error tokens more gracefully in array and dictionary parsing. Treat error tokens as the 'null' keyword and continue parsing.
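As an illustration of the first half of that idea, a hedged sketch of validating a fully scanned number token (the helper name and buffer handling are invented for this example):

    #include <ctype.h>

    /* A number token may contain only digits, signs and a decimal
     * point; anything else (e.g. "0.5GARBAGE") makes the whole token an
     * error token instead of silently dropping the trailing characters. */
    static int is_valid_number_token(const char *tok)
    {
        int i;
        for (i = 0; tok[i]; i++)
            if (!isdigit((unsigned char)tok[i]) &&
                tok[i] != '+' && tok[i] != '-' && tok[i] != '.')
                return 0;
        return i > 0;
    }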
2017-11-01  Use int64_t for public file API offsets.  (Tor Andersson)
Don't bother with LARGEFILE conditional compilation -- always expose 64-bit file offsets in our public API.
2017-10-05  Remove shadowed variables.  (Sebastian Rasmussen)
2017-09-22  Skip to next whitespace character instead of aborting when repairing PDF.  (Tor Andersson)
2017-06-28  Throw on overly long PDF names.  (Sebastian Rasmussen)
The architectural limit is 127 bytes according to the PDF specification.
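A minimal sketch of how such a length cap can be enforced while the lexer accumulates a name; the buffer handling and helper name are illustrative, only the 127-byte limit comes from the commit message and the PDF specification:

    #define PDF_NAME_MAX 127

    /* Append one decoded character to the name buffer, failing once the
     * specification's 127-byte limit is exceeded so the caller can
     * throw an error instead of truncating silently. */
    static int append_name_char(char *buf, int *len, int c)
    {
        if (*len >= PDF_NAME_MAX)
            return -1;
        buf[(*len)++] = (char)c;
        return 0;
    }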
2017-05-27  Bug 697947: Handle illegal hex codes in PDF names.  (Sebastian Rasmussen)
Prior to PDF 1.2, # in PDF names is a regular character. PDF 1.2 and later treat # as an escape character introducing a two-digit hex character code. Previously, illegal hex codes, e.g. #BX, were partially parsed as escaped hex codes with the illegal remainder parsed as regular characters. Now illegal hex codes are handled as consisting entirely of regular characters. Note that character code 0 is also considered an illegal hex code.
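A sketch of the decoding rule described above, assuming the lexer already has the two characters following '#'; the helper names are illustrative:

    /* Map a hex digit to its value, or -1 if it is not a hex digit. */
    static int hex_val(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /* Decode "#hl" inside a name.  Return the decoded byte, or -1 when
     * the escape is illegal (a bad digit, as in "#BX", or code 0), in
     * which case the caller treats '#', 'h' and 'l' as regular name
     * characters. */
    static int decode_name_escape(int h, int l)
    {
        int hi = hex_val(h), lo = hex_val(l);
        if (hi < 0 || lo < 0 || (hi == 0 && lo == 0))
            return -1;
        return hi * 16 + lo;
    }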
2017-05-27  Handle extremely long PDF names.  (Sebastian Rasmussen)
Previously the parser would cut these names short and then parse the remainder as a separate name.
2017-04-27  Include required system headers.  (Tor Andersson)
2017-03-22  Rename fz_putc/puts/printf to fz_write_*.  (Tor Andersson)
Rename fz_write to fz_write_data. Rename fz_write_buffer_* and fz_buffer_printf to fz_append_*. Be consistent in naming: fz_write_* functions write to fz_output, and fz_append_* functions append to fz_buffer. Update documentation.
2017-03-01  Bug 697620: Avoid clash with "isprint".  (Robin Watts)
2017-01-17  pdf: Convert non-printable keywords into PDF_TOK_ERROR.  (Tor Andersson)
All known keywords are printable. Converting non-printable keywords into error tokens means we don't try to print garbage when showing error messages about unknown tokens.
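A sketch of the kind of check this implies, written with a plain ASCII range check rather than ctype's isprint (cf. the 2017-03-01 commit above); the function name is illustrative:

    /* Keywords are made of printable, non-whitespace ASCII; a candidate
     * containing any other byte is turned into PDF_TOK_ERROR so error
     * messages never echo raw garbage. */
    static int keyword_is_printable(const unsigned char *s, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            if (s[i] < 0x21 || s[i] > 0x7e)
                return 0;
        return 1;
    }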
2016-02-03  Move pdf's lex_number routine over to use fast atof.  (Robin Watts)
Spot (broken) values that will require special 'acrobat compatible' handling and use the old code for that.
2016-01-15  pdf: Consume entire token before lexing numbers.  (Tor Andersson)
"0.00-70" should be parsed as one token, not the two tokens we produced before.
2016-01-08  Tweak lex_number to avoid (or minimise) underflow.  (Robin Watts)
Keeps operations in the int domain as long as possible, and only resorts to floats if required.
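A sketch of that strategy, assuming a plain digit string; the real lex_number also deals with signs, overlong fractions and the Acrobat-compatible fallback mentioned elsewhere in this log:

    #include <limits.h>

    /* Accumulate all digits into one integer and count how many of them
     * came after the decimal point, then do a single division at the
     * end.  Staying in the int domain as long as possible avoids the
     * repeated 0.1 multiplications that lose precision or underflow. */
    static double parse_simple_number(const char *p)
    {
        long v = 0;
        int frac = 0, seen_dot = 0;
        double div = 1;

        for (; *p; p++)
        {
            if (*p == '.' && !seen_dot)
                seen_dot = 1;
            else if (*p >= '0' && *p <= '9' && v <= (LONG_MAX - 9) / 10)
            {
                v = v * 10 + (*p - '0');
                frac += seen_dot;
            }
            else
                break;
        }
        while (frac-- > 0)
            div *= 10;
        return v / div;
    }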
2015-12-15  Rename fz_buffer_cat to fz_append_buffer.  (Tor Andersson)
2015-10-02  Bug 696131: Detect some overflow conditions.  (Robin Watts)
When lexing a number, do NOT check for overflow; checking would cause loss of data in some files, and the current implementation matches Acrobat. When lexing a startxref offset, do check for overflow, and throw an error if it occurs.
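A sketch of the startxref side of this, accumulating the offset as a 64-bit value and reporting overflow to the caller (which would then throw); the helper name is illustrative:

    #include <stdint.h>

    /* Parse a decimal startxref offset.  Returns 0 on success and -1 on
     * overflow, so the caller can raise an error instead of silently
     * wrapping around. */
    static int parse_xref_offset(const char *p, int64_t *out)
    {
        int64_t v = 0;
        for (; *p >= '0' && *p <= '9'; p++)
        {
            int d = *p - '0';
            if (v > (INT64_MAX - d) / 10)
                return -1;
            v = v * 10 + d;
        }
        *out = v;
        return 0;
    }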
2015-05-15  Support pdf files larger than 2Gig.  (Robin Watts)
If FZ_LARGEFILE is defined when building, MuPDF uses 64bit offsets for files; this allows us to open streams larger than 2Gig. The downsides to this are that:
* The xref entries are larger.
* All PDF ints are held as 64bit things rather than 32bit things (to cope with /Prev entries, hint stream offsets etc).
* All file positions are stored as 64bits rather than 32.
The implementation works by detecting FZ_LARGEFILE. Some #ifdeffery in fitz/system.h sets fz_off_t to either int or int64_t as appropriate, and sets defines for fz_fopen, fz_fseek, fz_ftell etc as required. These call the fseeko64 etc functions on linux (and so define _LARGEFILE64_SOURCE) and the explicit 64bit functions on windows.
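A sketch of the kind of #ifdeffery the commit describes for fitz/system.h; the exact per-platform mappings are simplified and only illustrate the shape of the change:

    #include <stdio.h>
    #include <stdint.h>

    #ifdef FZ_LARGEFILE
    typedef int64_t fz_off_t;
    #define fz_fseek fseeko64   /* Linux: also needs _LARGEFILE64_SOURCE */
    #define fz_ftell ftello64
    #else
    typedef int fz_off_t;
    #define fz_fseek fseek
    #define fz_ftell ftell
    #endif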
2015-02-17  Add ctx parameter and remove embedded contexts for API regularity.  (Tor Andersson)
Purge several embedded contexts:
* Remove embedded context in fz_output.
* Remove embedded context in fz_stream.
* Remove embedded context in fz_device.
* Remove fz_rebind_stream (since it is no longer necessary).
* Remove embedded context in svg_device.
* Remove embedded context in XML parser.
* Add ctx argument to fz_document functions.
* Remove embedded context in fz_document.
* Remove embedded context in pdf_document.
* Remove embedded context in pdf_obj.
Make fz_page independent of fz_document in the interface. We shouldn't need to pass the document to all functions handling a page. If a page is tied to the source document, it's redundant; otherwise it's just pointless. Fix reference counting oddity in fz_new_image_from_pixmap.
2015-01-20  Don't decode '8' and '9' as octal digits.  (Simon Bünzli)
At https://github.com/sumatrapdfreader/sumatrapdf/issues/66 there's a document which contains a string (\358) which is parsed as (\360) with the 8 overflowing instead of as (\0358) with the 8 being the first character after the octal escape. This patch restricts octal digits to '0' to '7' to fix that issue.
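A minimal sketch of the corrected octal-escape loop; the cursor handling is simplified compared with the real string lexer:

    /* Read at most three octal digits after a backslash, stopping at
     * the first character outside '0'..'7', so "\358" decodes to the
     * character with octal code 35 followed by a literal '8'. */
    static int read_octal_escape(const char **pp)
    {
        const char *p = *pp;
        int v = 0, i;
        for (i = 0; i < 3 && *p >= '0' && *p <= '7'; i++, p++)
            v = v * 8 + (*p - '0');
        *pp = p;
        return v;
    }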
2014-09-02  Add fz_snprintf and use it for formatting floating point numbers.  (Tor Andersson)
2014-01-02  Improve PDF repair logic.  (Robin Watts)
When we meet a broken PDF file, we attempt to repair it. We do this by reading tokens from the file and attempting to interpret them as a normal PDF stream. Unfortunately, if the file is corrupt enough so that we start to read from the middle of a stream, and we happen to hit an '(' character, we can go into string reading mode. We can then end up skipping over vast swathes of file that we could otherwise repair.

We fix this here by using a new version of the pdf_lex function that refuses to ever return a string. This means we may take more time over skipping things than we did before, but are less likely to skip stuff.

We also tweak other parts of the pdf repair logic here. If we hit a badly formed piece of data, clear the num/gen we have stored so that the next plausible piece we get does not get assigned to a random object number.
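A sketch of the central idea of pdf_lex_no_string, reduced to a single decision; the token names and structure are invented for this example:

    typedef enum { TOK_EOF, TOK_ERROR, TOK_OTHER } repair_tok;

    /* During repair, never enter string-reading mode: a '(' becomes an
     * error token, so a stray parenthesis inside corrupt stream data
     * cannot swallow a large chunk of otherwise repairable file. */
    static repair_tok classify_for_repair(int c)
    {
        if (c == EOF)
            return TOK_EOF;
        if (c == '(')
            return TOK_ERROR;
        return TOK_OTHER;
    }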
2013-09-24  Bug 694557: Fix infinite loop in pdf_lex.  (Robin Watts)
When we read a '>' during lexing, we try to read another char to see if it's another '>'. If not, we warn that it's unexpected, put the char back and retry. Putting the char back fails if the '>' was the last char in the stream as we will then have read EOF. We then loop and reread the '>' resulting in an infinite loop. Simple fix is to check for EOF.
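A sketch of the fix, using stdio in place of MuPDF's stream API:

    #include <stdio.h>

    /* After reading '>', peek at the next byte.  Only push it back if
     * it is not EOF: ungetting at end of stream made the lexer re-read
     * the same '>' forever.  Returns 1 for ">>", 0 for an isolated '>'. */
    static int lex_close_dict(FILE *stm)
    {
        int c = fgetc(stm);
        if (c == '>')
            return 1;
        if (c != EOF)
            ungetc(c, stm);
        return 0;
    }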
2013-06-20  Rearrange source files.  (Tor Andersson)