Organize docs into HTML files.

author: Tor Andersson <tor.andersson@artifex.com> 2017-04-07 16:18:53 +0200
committer: Tor Andersson <tor.andersson@artifex.com> 2017-04-13 13:52:16 +0200
commit: c88941abc6f0fe91a41dc35dcaa1874d4de2c429 (patch)
tree: 51ff32d773f80219922a0458ec76f401ba21e138 /docs/coding-progressive.html
parent: 9c805cc9f934cd12e89014db8ad70e3191cdaf2d (diff)
download: mupdf-c88941abc6f0fe91a41dc35dcaa1874d4de2c429.tar.xz
1 files changed, 378 insertions, 0 deletions
diff --git a/docs/coding-progressive.html b/docs/coding-progressive.html
new file mode 100644
index 00000000..fb9ea78d
--- /dev/null
+++ b/docs/coding-progressive.html
@@ -0,0 +1,378 @@
+<!DOCTYPE html>
+<html>
+<head>
+<title>MuPDF Progressive Loading</title>
+<link rel="stylesheet" href="style.css" type="text/css">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+</head>
+
+<body>
+
+<header>
+<h1>MuPDF Progressive Loading</h1>
+</header>
+
+<nav>
+<a href="http://mupdf.com/index.html">ABOUT</a>
+<a href="http://mupdf.com/news.html">NEWS</a>
+<a href="index.html">DOCUMENTATION</a>
+<a href="http://mupdf.com/downloads/">DOWNLOAD</a>
+<a href="http://git.ghostscript.com/?p=mupdf.git;a=summary">SOURCE</a>
+<a href="https://bugs.ghostscript.com/">BUGS</a>
+</nav>
+
+<article>
+
+<p>
+How to do progressive loading with MuPDF.
+
+<h2>What is progressive loading?</h2>
+
+
+<p>
+The idea of progressive loading is that as you download a PDF file
+into a browser, you can display pages as soon as their data arrives.
+
+<p>
+MuPDF can make use of two different mechanisms to achieve this. The
+first relies on the file being "linearized"; the second relies on
+the caller of MuPDF having fine control over the HTTP fetch and on
+the server supporting byte-range fetches.
+
+<p>
+For optimum performance a file should be both linearized and be
+available over a byte-range supporting link, but benefits can still
+be had with either one of these alone.
+
+<h2>Progressive download using "linearized" files</h2>
+
+<p>
+Adobe defines "linearized" PDFs as being ones that have both a
+specific layout of objects and a small amount of extra
+information to help avoid seeking within a file. The stated aim
+is to deliver the first page of a document in advance of the whole
+document downloading, whereupon subsequent pages will become
+available. Adobe also refers to these as "Optimized for fast web
+view" or "Web Optimized".
+
+<p>
+In fact, the standard outlines (poorly) a mechanism by which 'hints'
+can be included that enable the subsequent pages to be found within
+the file too. Unfortunately this is very poorly supported by
+many tools, and so the hints have to be treated with suspicion.
+
+<p>
+MuPDF will attempt to use hints if they are available, but will also
+use a linear search of the file to discover pages if not. This means
+that the first page will be displayed quickly, and then subsequent
+ones will appear with 'incomplete' renderings that improve over time
+as more and more resources are gradually delivered.
+
+<p>
+Essentially the file starts with a slightly modified header, and the
+first object in the file is a special one (the linearization object)
+that a) indicates that the file is linearized, and b) gives some
+useful information (like the number of pages in the file etc).
+
+<p>
+This object is then followed by all the objects required for the
+first page, then the "hint stream", then sets of objects for each
+subsequent page in turn, then shared objects required for those
+pages, then various other miscellaneous objects.
+
+<p>
+[Yes, really. While page 1 is sent with all the objects that it
+uses, shared or otherwise, subsequent pages do not get shared
+resources until after all the unshared page objects have been
+sent.]
+
+<h2>The Hint Stream</h2>
+
+<p>
+Adobe intended the hint stream to facilitate the display of
+subsequent pages, but has never used it itself. Consequently you
+can't trust people to write it properly - indeed Adobe outputs
+something that doesn't quite conform to the spec.
+
+<p>
+As a result, very few people actually use it. MuPDF will use it
+after sanity checking the values, and should cope with illegal or
+incorrect values.
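+
+<p>
+As an illustration of the kind of sanity check described above, the
+following is a minimal sketch in C, not MuPDF's actual code. The
+page_hint record and the hints_are_sane function are hypothetical
+names invented for this example; the only idea taken from the text is
+that a hint pointing outside the file causes the whole table to be
+discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one page's hint; not MuPDF's internal layout. */
typedef struct {
	int64_t offset; /* predicted byte offset of the page's first object */
	int64_t length; /* predicted number of bytes of page objects */
} page_hint;

/* Return 1 if every hint lies inside the file, or 0 if the table
 * should be discarded and the file treated as having no hints. */
int hints_are_sane(const page_hint *hints, size_t count, int64_t file_length)
{
	size_t i;

	assert(hints != NULL || count == 0);
	if (file_length <= 0)
		return 0;
	for (i = 0; i < count; i++)
	{
		if (hints[i].offset < 0 || hints[i].length < 0)
			return 0;
		if (hints[i].offset > file_length)
			return 0;
		/* Compare by subtraction so offset + length cannot overflow. */
		if (hints[i].length > file_length - hints[i].offset)
			return 0;
	}
	return 1;
}
```

+<p>
+A caller would run such a check once after unpacking the hint stream,
+and fall back to a linear search of the file if it returns 0.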
+
+<h2>So how does MuPDF handle progressive loading?</h2>
+
+<p>
+MuPDF has made various extensions to its mechanisms for handling
+progressive loading.
+
+<ul>
+  <li> Progressive streams
+
+    <p>
+    At its lowest level MuPDF reads file data from a fz_stream,
+    using the fz_open_document_with_stream call. (fz_open_document
+    is implemented by calling this). We have extended the fz_stream
+    slightly, giving the system a way to ask for meta information
+    (or perform meta operations) on a stream.
+
+    <p>
+    Using this mechanism MuPDF can query:
+
+    <ul>
+      <li> whether a stream is progressive or not (i.e. whether the
+        entire stream is accessible immediately)
+      <li> what the length of a stream should ultimately be (which an
+        http fetcher should know from the Content-Length header).
+    </ul>
+
+    <p>
+    With this information MuPDF can decide whether to use its normal
+    object reading code, or whether to make use of a linearized
+    object. Knowing the length enables us to check with the length
+    value given in the linearized object - if these differ, the
+    assumption is that an incremental save has taken place, thus the
+    file is no longer linearized.
+
+    <p>
+    When data is pulled from a progressive stream, if we attempt to
+    read data that is not currently available, the stream should
+    throw a FZ_ERROR_TRYLATER error. This particular error code
+    will be interpreted by the caller as an indication that it
+    should retry the parsing of the current objects at a later time.
+
+    <p>
+    When a MuPDF call is made on a progressive stream, such as
+    fz_open_document_with_stream, or fz_load_page, the caller should
+    be prepared to handle a FZ_ERROR_TRYLATER error as meaning that
+    more data is required before it can continue. No indication is
+    directly given as to exactly how much more data is required, but
+    as the caller will be implementing the progressive fz_stream
+    that it has passed into MuPDF to start with, it can reasonably
+    be expected to figure out an estimate for itself.
+
+  <li> Cookie
+
+    <p>
+    Once a page has been loaded, if its contents are to be 'run'
+    as normal (using e.g. fz_run_page) any failure to read data that
+    has not yet arrived (such as a font, an image, or even a content
+    stream belonging to the page) will result in a rendering that
+    aborts with an FZ_ERROR_TRYLATER error. The caller can catch
+    this and display a placeholder instead.
+
+    <p>
+    If each page's data were entirely self-contained and sent in
+    sequence this would perhaps be acceptable, with each page
+    appearing one after the other. Unfortunately, the linearization
+    procedure as laid down by Adobe does NOT do this: objects shared
+    between multiple pages (other than the first) are not sent with
+    the pages themselves, but rather AFTER all the pages have been
+    sent.
+
+    <p>
+    This means that a document that has a title page, then contents
+    that share a font used on pages 2 onwards, will not be able to
+    correctly display page 2 until after the font has arrived in
+    the file, which will not be until all the page data has been
+    sent.
+
+    <p>
+    To mitigate this, MuPDF provides a way whereby callers
+    can indicate that they are prepared to accept an 'incomplete'
+    rendering of the file (perhaps with missing images, or with
+    substitute fonts).
+
+    <p>
+    Callers prepared to tolerate such renderings should set the
+    'incomplete_ok' flag in the cookie, then call fz_run_page etc
+    as normal. If a FZ_ERROR_TRYLATER error is thrown at any point
+    during the page rendering, the error will be swallowed, the
+    'incomplete' field in the cookie will become non-zero and
+    rendering will continue. When control returns to the caller
+    the caller can check the value of the 'incomplete' field and
+    know that the rendering it received is not authoritative.
+
+</ul>
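+
+<p>
+The difference between strict and 'incomplete_ok' rendering described
+above can be modelled with a small self-contained sketch in C. It does
+not call MuPDF: model_cookie, model_run_page and the MODEL_ status
+codes are hypothetical stand-ins, and a returned status code takes the
+place of a thrown FZ_ERROR_TRYLATER. Only the two cookie fields
+discussed above are mirrored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this model. MuPDF itself reports the
 * condition by throwing FZ_ERROR_TRYLATER rather than returning it. */
enum { MODEL_OK = 0, MODEL_TRYLATER = 1 };

/* Mirrors only the two cookie fields discussed in the text. */
typedef struct {
	int incomplete_ok; /* set by caller: partial renderings are acceptable */
	int incomplete;    /* set by renderer: non-zero if anything was skipped */
} model_cookie;

/* resource_ready[i] is 1 if that resource (font, image, ...) has
 * arrived and 0 otherwise. *drawn receives the number of resources
 * that were actually rendered. */
int model_run_page(const int *resource_ready, size_t count,
		model_cookie *cookie, int *drawn)
{
	size_t i;

	assert(drawn != NULL);
	*drawn = 0;
	for (i = 0; i < count; i++)
	{
		if (!resource_ready[i])
		{
			/* Strict mode: abort so the caller can retry later. */
			if (cookie == NULL || !cookie->incomplete_ok)
				return MODEL_TRYLATER;
			/* Tolerant mode: swallow the error, note it, carry on. */
			cookie->incomplete++;
			continue;
		}
		(*drawn)++;
	}
	return MODEL_OK;
}
```

+<p>
+A caller using the tolerant mode would check the 'incomplete' field
+after the call and schedule a re-render once more data has arrived.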
+
+<h2>Progressive loading using byte range requests</h2>
+
+<p>
+If the caller has control over the HTTP fetch, then it is possible
+to use byte range requests to fetch the document 'out of order'.
+This enables non-linearized files to be progressively displayed as
+they download, and yields complete renderings of pages earlier than
+would otherwise be the case. This process requires no changes within
+MuPDF itself, but rather in the way the progressive stream learns
+from the attempts MuPDF makes to fetch data.
+
+<p>
+Consider, for example, an attempt to fetch a hypothetical file from
+a server.
+
+<ul>
+	<li><p>
+	The initial http request for the document is sent with a "Range:"
+	header to pull down the first (say) 4k of the file.
+
+	<li><p>
+	As soon as the response headers for this initial request arrive, we
+	can respond to meta stream operations to give the length, and whether
+	byte range requests are accepted.
+
+	<ul>
+		<li><p>
+		If the header indicates that byte ranges are acceptable, the
+		stream enters a loop fetching the file a chunk at a time
+		(not necessarily in order). Otherwise the server
+		will ignore the Range: header, and just serve the whole file.
+
+		<li><p>
+		If the header indicates a content-length, the stream returns
+		that.
+	</ul>
+
+	<li><p>
+	MuPDF can then decide how to proceed based upon these flags and
+	whether the file is linearized or not. (If the file contains a
+	linearized object, and the content length matches, then the file
+	is considered to be linear, otherwise it is not).
+
+	<p>
+	If the file is linear:
+
+	<ul>
+		<li><p>
+		We proceed to read objects out of the file as it downloads.
+		This will provide us with the first page and all its resources. It
+		will also enable us to read the hint streams (if present).
+
+		<li><p>
+		Once we have read the hint streams, we unpack (and sanity
+		check) them to give us a map of where in the file each object
+		is predicted to live, and which objects are required for each
+		page. If any of these values are out of range, we treat the
+		file as if there were no hint streams.
+
+		<li><p>
+		If we have hints, any attempt to load a subsequent page will
+		cause MuPDF to attempt to read exactly the objects required.
+		This will cause a sequence of seeks in the fz_stream followed
+		by reads. If the stream does not have the data to satisfy that
+		request yet, the stream code should remember the location that
+		was fetched (and fetch that block in the background so that
+		future retries will succeed) and should raise an
+		FZ_ERROR_TRYLATER error.
+
+		<p>
+		[Typically therefore when we jump to a page in a linear file
+		on a byte request capable link, we will quickly see a rough
+		rendering, which will improve fairly fast as images and fonts
+		arrive.]
+
+		<li><p>
+		Regardless of whether we have hints or byte requests, on every
+		fz_load_page call MuPDF will attempt to process more of the
+		stream (which is assumed to be downloading in the
+		background). As linearized files are guaranteed to have pages
+		in order, pages will gradually become available. In the absence
+		of byte requests and hints however, we have no way of getting
+		resources early, so the renderings for these pages will remain
+		incomplete until much more of the file has arrived.
+
+		<p>
+		[Typically therefore when we jump to a page in a linear file
+		on a non byte request capable link, we will see a rough
+		rendering for that page as soon as data arrives for it (which
+		will typically take much longer than would be the case with
+		byte range capable downloads), and that will improve much more
+		slowly as images and fonts may not appear until almost the
+		whole file has arrived.]
+
+		<li><p>
+		When the whole file has arrived, then we will attempt to read
+		the outlines for the file.
+	</ul>
+
+	<p>
+	For a non-linearized PDF on a byte request capable stream:
+
+	<ul>
+		<li><p>
+		MuPDF will immediately seek to the end of the file to attempt
+		to read the trailer. This will fail with a FZ_ERROR_TRYLATER
+		due to the data not being here yet, but the stream code should
+		remember that this data is required and it should be prioritized
+		in the background fetch process.
+
+		<li><p>
+		Repeated attempts to open the stream should therefore eventually
+		succeed. As MuPDF jumps through the file trying to read first
+		the xrefs, then the page tree objects, then the page contents
+		themselves etc, the background fetching process will be driven
+		by the attempts to read the file in the foreground.
+
+		<p>
+		[Typically therefore the opening of a non-linearized file will
+		be slower than a linearized one, as the xrefs/page trees for a
+		non-linear file can be 20%+ of the file data. Once past this
+		initial point however, pages and data can be pulled from the
+		file almost as fast as with a linearized file.]
+	</ul>
+
+	<p>
+	For a non-linearized PDF on a non-byte request capable stream:
+
+	<ul>
+		<li><p>
+		MuPDF will immediately seek to the end of the file to attempt
+		to read the trailer. This will fail with a FZ_ERROR_TRYLATER
+		due to the data not being here yet. Subsequent retries will
+		continue to fail until the whole file has arrived, whereupon
+		the whole file will be instantly available.
+
+		<p>
+		[This is the worst case situation - nothing at all can be
+		displayed until the entire file has downloaded.]
+	</ul>
+
+	<p>
+	A typical structure for a fetcher process (see curl-stream.c in
+	mupdf-curl as an example) might therefore look like this:
+
+	<ul>
+		<li><p>
+		We consider the file as an (initially empty) buffer which we are
+		filling by making requests. In order to ensure that we make
+		maximum use of our download link, we ensure that whenever
+		one request finishes, we immediately launch another. Further, to
+		avoid the overheads for the request/response headers being too
+		large, we may want to divide the file into 'chunks', perhaps 4k or 32k
+		in size.
+
+		<li><p>
+		We can then have a receiver process that sits there in a loop
+		requesting chunks to fill this buffer. In the absence of
+		any other impetus the receiver should request the next 'chunk'
+		of data from the file that it does not yet have, following the last
+		fill point. Initially we start the fill point at the beginning of
+		the file, but this will move around based on the requests made of
+		the progressive stream.
+
+		<li><p>
+		Whenever MuPDF attempts to read from the stream, we check to see if
+		we have data for this area of the file already. If we do, we can
+		return it. If not, we remember this as the next "fill point" for our
+		receiver process and throw a FZ_ERROR_TRYLATER error.
+	</ul>
+
+</ul>
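+
+<p>
+The fetcher structure outlined above can be sketched as a small
+bookkeeping model in C. This is not the code from curl-stream.c: the
+chunk_map type, its functions, and the CHUNK_SIZE and MAX_CHUNKS
+values are assumptions made for illustration. The model tracks only
+which chunks have arrived and where the fill point is; it stores no
+file data and performs no network requests, and a READ_TRYLATER
+return value stands in for throwing FZ_ERROR_TRYLATER.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK_SIZE = 4096, MAX_CHUNKS = 64 }; /* illustrative values */
enum { READ_OK = 0, READ_TRYLATER = 1 };     /* hypothetical status codes */

typedef struct {
	unsigned char have[MAX_CHUNKS]; /* 1 once that chunk has been received */
	size_t chunk_count;             /* number of chunks in the file */
	size_t fill_point;              /* chunk the receiver should try next */
} chunk_map;

/* Start with an empty buffer and the fill point at the start of the file. */
void chunk_map_init(chunk_map *m, size_t file_length)
{
	assert(file_length <= (size_t)MAX_CHUNKS * CHUNK_SIZE);
	memset(m, 0, sizeof *m);
	m->chunk_count = (file_length + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* Receiver side: choose the first missing chunk at or after the fill
 * point, wrapping round to the start of the file. Returns chunk_count
 * when the whole file has arrived. */
size_t chunk_map_next_request(const chunk_map *m)
{
	size_t i;

	for (i = 0; i < m->chunk_count; i++)
	{
		size_t c = (m->fill_point + i) % m->chunk_count;
		if (!m->have[c])
			return c;
	}
	return m->chunk_count;
}

/* Receiver side: record that a requested chunk has been delivered. */
void chunk_map_received(chunk_map *m, size_t chunk)
{
	assert(chunk < m->chunk_count);
	m->have[chunk] = 1;
}

/* Stream side: can bytes [offset, offset + len) be served right now?
 * If not, move the fill point to the first missing chunk so that the
 * receiver fetches it next. */
int chunk_map_read(chunk_map *m, size_t offset, size_t len)
{
	size_t first, last, c;

	if (len == 0)
		return READ_OK;
	first = offset / CHUNK_SIZE;
	last = (offset + len - 1) / CHUNK_SIZE;
	assert(last < m->chunk_count);
	for (c = first; c <= last; c++)
	{
		if (!m->have[c])
		{
			m->fill_point = c;
			return READ_TRYLATER;
		}
	}
	return READ_OK;
}
```

+<p>
+In this model a failed read of the trailer at the end of the file
+moves the fill point there, so the next request the receiver issues
+is for the last chunk rather than the second one.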
+
+</article>
+
+<footer>
+<a href="http://artifex.com"><img src="artifex-logo.png" align="right"></a>
+Copyright &copy; 2006-2017 Artifex Software Inc.
+</footer>
+
+</body>
+</html>