diff options
Diffstat (limited to 'docs/coding-progressive.html')
-rw-r--r-- | docs/coding-progressive.html | 378 |
1 files changed, 378 insertions, 0 deletions
diff --git a/docs/coding-progressive.html b/docs/coding-progressive.html new file mode 100644 index 00000000..fb9ea78d --- /dev/null +++ b/docs/coding-progressive.html @@ -0,0 +1,378 @@ +<!DOCTYPE html> +<html> +<head> +<title>MuPDF Progressive Loading</title> +<link rel="stylesheet" href="style.css" type="text/css"> +<meta name="viewport" content="width=device-width, initial-scale=1"> +</head> + +<body> + +<header> +<h1>MuPDF Progressive Loading</h1> +</header> + +<nav> +<a href="http://mupdf.com/index.html">ABOUT</a> +<a href="http://mupdf.com/news.html">NEWS</a> +<a href="index.html">DOCUMENTATION</a> +<a href="http://mupdf.com/downloads/">DOWNLOAD</a> +<a href="http://git.ghostscript.com/?p=mupdf.git;a=summary">SOURCE</a> +<a href="https://bugs.ghostscript.com/">BUGS</a> +</nav> + +<article> + +<p> +How to do progressive loading with MuPDF. + +<h2>What is progressive loading?</h2> + + +<p> +The idea of progressive loading is that as you download a PDF file +into a browser, you can display the pages as they appear. + +<p> +MuPDF can make use of 2 different mechanisms to achieve this. The +first relies on the file being "linearized", the second relies on +the caller of MuPDF having fine control over the http fetch and on +the server supporting byte-range fetches. + +<p> +For optimum performance a file should be both linearized and be +available over a byte-range supporting link, but benefits can still +be had with either one of these alone. + +<h2>Progressive download using "linearized" files</h2> + +<p> +Adobe defines "linearized" PDFs as being ones that have both a +specific layout of objects and a small amount of extra +information to help avoid seeking within a file. The stated aim +is to deliver the first page of a document in advance of the whole +document downloading, whereupon subsequent pages will become +available. Adobe also refers to these as "Optimized for fast web +view" or "Web Optimized". + +<p> +In fact, the standard outlines (poorly) a mechanism by which 'hints' +can be included that enable the subsequent pages to be found within +the file too. Unfortunately this is very poorly supported with +many tools, and so the hints have to be treated with suspicion. + +<p> +MuPDF will attempt to use hints if they are available, but will also +use a linear search of the file to discover pages if not. This means +that the first page will be displayed quickly, and then subsequent +ones will appear with 'incomplete' renderings that improve over time +as more and more resources are gradually delivered. + +<p> +Essentially the file starts with a slightly modified header, and the +first object in the file is a special one (the linearization object) +that a) indicates that the file is linearized, and b) gives some +useful information (like the number of pages in the file etc). + +<p> +This object is then followed by all the objects required for the +first page, then the "hint stream", then sets of object for each +subsequent page in turn, then shared objects required for those +pages, then various other random things. + +<p> +[Yes, really. While page 1 is sent with all the objects that it +uses, shared or otherwise, subsequent pages do not get shared +resources until after all the unshared page objects have been +sent.] + +<h2>The Hint Stream</h2> + +<p> +Adobe intended Hint Stream to be useful to facilitate the display +of subsequent pages, but it has never used it. Consequently you +can't trust people to write it properly - indeed Adobe outputs +something that doesn't quite conform to the spec. + +<p> +Consequently very few people actually use it. MuPDF will use it +after sanity checking the values, and should cope with illegal/ +incorrect values. + +<h2>So how does MuPDF handle progressive loading?</h2> + +<p> +MuPDF has made various extensions to its mechanisms for handling +progressive loading. + +<ul> + <li> Progressive streams + + <p> + At its lowest level MuPDF reads file data from a fz_stream, + using the fz_open_document_with_stream call. (fz_open_document + is implemented by calling this). We have extended the fz_stream + slightly, giving the system a way to ask for meta information + (or perform meta operations) on a stream. + + <p> + Using this mechanism MuPDF can query: + + <ul> + <li> whether a stream is progressive or not (i.e. whether the + entire stream is accessible immediately) + <li> what the length of a stream should ultimately be (which an + http fetcher should know from the Content-Length header), + </ul> + + <p> + With this information MuPDF can decide whether to use its normal + object reading code, or whether to make use of a linearized + object. Knowing the length enables us to check with the length + value given in the linearized object - if these differ, the + assumption is that an incremental save has taken place, thus the + file is no longer linearized. + + <p> + When data is pulled from a progressive stream, if we attempt to + read data that is not currently available, the stream should + throw a FZ_ERROR_TRYLATER error. This particular error code + will be interpreted by the caller as an indication that it + should retry the parsing of the current objects at a later time. + + <p> + When a MuPDF call is made on a progressive stream, such as + fz_open_document_with_stream, or fz_load_page, the caller should + be prepared to handle a FZ_ERROR_TRYLATER error as meaning that + more data is required before it can continue. No indication is + directly given as to exactly how much more data is required, but + as the caller will be implementing the progressive fz_stream + that it has passed into MuPDF to start with, it can reasonably + be expected to figure out an estimate for itself. + + <li> Cookie + + <p> + Once a page has been loaded, if its contents are to be 'run' + as normal (using e.g. fz_run_page) any error (such as failing + to read a font, or an image, or even a content stream belonging + to the page) will result in a rendering that aborts with an + FZ_ERROR_TRYLATER error. The caller can catch this and display + a placeholder instead. + + <p> + If each pages data was entirely self-contained and sent in + sequence this would perhaps be acceptable, with each page + appearing one after the other. Unfortunately, the linearization + procedure as laid down by Adobe does NOT do this: objects shared + between multiple pages (other than the first) are not sent with + the pages themselves, but rather AFTER all the pages have been + sent. + + <p> + This means that a document that has a title page, then contents + that share a font used on pages 2 onwards, will not be able to + correctly display page 2 until after the font has arrived in + the file, which will not be until all the page data has been + sent. + + <p> + To mitigate against this, MuPDF provides a way whereby callers + can indicate that they are prepared to accept an 'incomplete' + rendering of the file (perhaps with missing images, or with + substitute fonts). + + <p> + Callers prepared to tolerate such renderings should set the + 'incomplete_ok' flag in the cookie, then call fz_run_page etc + as normal. If a FZ_ERROR_TRYLATER error is thrown at any point + during the page rendering, the error will be swallowed, the + 'incomplete' field in the cookie will become non-zero and + rendering will continue. When control returns to the caller + the caller can check the value of the 'incomplete' field and + know that the rendering it received is not authoritative. + +</ul> + +<h2>Progressive loading using byte range requests</h2> + +<p> +If the caller has control over the http fetch, then it is possible +to use byte range requests to fetch the document 'out of order'. +This enables non-linearized files to be progressively displayed as +they download, and fetches complete renderings of pages earlier than +would otherwise be the case. This process requires no changes within +MuPDF itself, but rather in the way the progressive stream learns +from the attempts MuPDF makes to fetch data. + +<p> +Consider for example, an attempt to fetch a hypothetical file from +a server. + +<ul> + <li><p> + The initial http request for the document is sent with a "Range:" + header to pull down the first (say) 4k of the file. + + <li><p> + As soon as we get the header in from this initial request, we can + respond to meta stream operations to give the length, and whether + byte requests are accepted. + + <ul> + <li><p> + If the header indicates that byte ranges are acceptable the + stream proceeds to go into a loop fetching chunks of the file + at a time (not necessarily in-order). Otherwise the server + will ignore the Range: header, and just serve the whole file. + + <li><p> + If the header indicates a content-length, the stream returns + that. + </ul> + + <li><p> + MuPDF can then decide how to proceed based upon these flags and + whether the file is linearized or not. (If the file contains a + linearized object, and the content length matches, then the file + is considered to be linear, otherwise it is not). + + <p> + If the file is linear: + + <ul> + <li><p> + We proceed to read objects out of the file as it downloads. + This will provide us the first page and all its resources. It + will also enable us to read the hint streams (if present). + + <li><p> + Once we have read the hint streams, we unpack (and sanity + check) them to give us a map of where in the file each object + is predicted to live, and which objects are required for each + page. If any of these values are out of range, we treat the + file as if there were no hint streams. + + <li><p> + If we have hints, any attempt to load a subsequent page will + cause MuPDF to attempt to read exactly the objects required. + This will cause a sequence of seeks in the fz_stream followed + by reads. If the stream does not have the data to satisfy that + request yet, the stream code should remember the location that + was fetched (and fetch that block in the background so that + future retries will succeed) and should raise an + FZ_ERROR_TRYLATER error. + + <p> + [Typically therefore when we jump to a page in a linear file + on a byte request capable link, we will quickly see a rough + rendering, which will improve fairly fast as images and fonts + arrive.] + + <li><p> + Regardless of whether we have hints or byte requests, on every + fz_load_page call MuPDF will attempt to process more of the + stream (that is assumed to be being downloaded in the + background). As linearized files are guaranteed to have pages + in order, pages will gradually become available. In the absence + of byte requests and hints however, we have no way of getting + resources early, so the renderings for these pages will remain + incomplete until much more of the file has arrived. + + <p> + [Typically therefore when we jump to a page in a linear file + on a non byte request capable link, we will see a rough + rendering for that page as soon as data arrives for it (which + will typically take much longer than would be the case with + byte range capable downloads), and that will improve much more + slowly as images and fonts may not appear until almost the + whole file has arrived.] + + <li><p> + When the whole file has arrived, then we will attempt to read + the outlines for the file. + </ul> + + <p> + For a non-linearized PDF on a byte request capable stream: + + <ul> + <li><p> + MuPDF will immediately seek to the end of the file to attempt + to read the trailer. This will fail with a FZ_ERROR_TRYLATER + due to the data not being here yet, but the stream code should + remember that this data is required and it should be prioritized + in the background fetch process. + + <li><p> + Repeated attempts to open the stream should eventually succeed + therefore. As MuPDF jumps through the file trying to read first + the xrefs, then the page tree objects, then the page contents + themselves etc, the background fetching process will be driven + by the attempts to read the file in the foreground. + + <p> + [Typically therefore the opening of a non-linearized file will + be slower than a linearized one, as the xrefs/page trees for a + non-linear file can be 20%+ of the file data. Once past this + initial point however, pages and data can be pulled from the + file almost as fast as with a linearized file.] + </ul> + + <p> + For a non-linearized PDF on a non-byte request capable stream: + + <ul> + <li><p> + MuPDF will immediately seek to the end of the file to attempt + to read the trailer. This will fail with a FZ_ERROR_TRYLATER + due to the data not being here yet. Subsequent retries will + continue to fail until the whole file has arrived, whereupon + the whole file will be instantly available. + + <p> + [This is the worst case situation - nothing at all can be + displayed until the entire file has downloaded.] + </ul> + + <p> + A typical structure for a fetcher process (see curl-stream.c in + mupdf-curl as an example) might therefore look like this: + + <ul> + <li><p> + We consider the file as an (initially empty) buffer which we are + filling by making requests. In order to ensure that we make + maximum use of our download link, we ensure that whenever + one request finishes, we immediately launch another. Further, to + avoid the overheads for the request/response headers being too + large, we may want to divide the file into 'chunks', perhaps 4 or 32k + in size. + + <li><p> + We can then have a receiver process that sits there in a loop + requesting chunks to fill this buffer. In the absence of + any other impetus the receiver should request the next 'chunk' + of data from the file that it does not yet have, following the last + fill point. Initially we start the fill point at the beginning of + the file, but this will move around based on the requests made of + the progressive stream. + + <li><p> + Whenever MuPDF attempts to read from the stream, we check to see if + we have data for this area of the file already. If we do, we can + return it. If not, we remember this as the next "fill point" for our + receiver process and throw a FZ_ERROR_TRYLATER error. + </ul> + +</ul> + +</article> + +<footer> +<a href="http://artifex.com"><img src="artifex-logo.png" align="right"></a> +Copyright © 2006-2017 Artifex Software Inc. +</footer> + +</body> +</html> |