MuPDF Progressive Loading

How to do progressive loading with MuPDF.

What is progressive loading?

The idea of progressive loading is that, as a PDF file downloads into a browser, its pages can be displayed as soon as their data arrives.

MuPDF can make use of two different mechanisms to achieve this. The first relies on the file being "linearized"; the second relies on the caller of MuPDF having fine control over the HTTP fetch and on the server supporting byte-range requests.

For optimum performance a file should be both linearized and available over a link that supports byte-range requests, but benefits can still be had with either one of these alone.

Progressive download using "linearized" files

Adobe defines "linearized" PDFs as being ones that have both a specific layout of objects and a small amount of extra information to help avoid seeking within a file. The stated aim is to deliver the first page of a document in advance of the whole document downloading, whereupon subsequent pages will become available. Adobe also refers to these as "Optimized for fast web view" or "Web Optimized".

In fact, the standard outlines (poorly) a mechanism by which 'hints' can be included that enable the subsequent pages to be found within the file too. Unfortunately this mechanism is very poorly supported by many tools, and so the hints have to be treated with suspicion.

MuPDF will attempt to use hints if they are available, but will also use a linear search of the file to discover pages if not. This means that the first page will be displayed quickly, and then subsequent ones will appear with 'incomplete' renderings that improve over time as more and more resources are gradually delivered.

Essentially the file starts with a slightly modified header, and the first object in the file is a special one (the linearization object) that a) indicates that the file is linearized, and b) gives some useful information (such as the number of pages in the file).
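As a rough illustration of what sits at the start of such a file, the sketch below scans the first 1024 bytes for a linearization dictionary and pulls out the page count (/N) and the file length (/L). This is not how MuPDF parses the file: the function name and the sample bytes are made up for illustration, and a real parser would tokenise the object properly rather than rely on regular expressions.

```python
import re

def read_linearization_info(head: bytes):
    """Return {'pages': N, 'length': L} if `head` looks linearized, else None.

    `head` is the start of the file; only the first 1024 bytes are examined.
    """
    head = head[:1024]
    if not head.startswith(b"%PDF-"):
        return None
    # The first object in the file should be a dictionary with a /Linearized key.
    m = re.search(rb"\d+\s+\d+\s+obj\s*<<(.*?)>>", head, re.DOTALL)
    if m is None or b"/Linearized" not in m.group(1):
        return None
    body = m.group(1)
    info = {}
    for key, name in ((b"N", "pages"), (b"L", "length")):
        found = re.search(rb"/" + key + rb"\s+(\d+)", body)
        if found:
            info[name] = int(found.group(1))
    return info

# Hand-written bytes standing in for the start of a linearized file.
sample = (b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
          b"1 0 obj\n<< /Linearized 1 /L 54321 /H [ 800 120 ] /O 4 "
          b"/E 4000 /N 3 /T 54000 >>\nendobj\n")
print(read_linearization_info(sample))
```

A file whose first object is an ordinary dictionary (a catalog, say) makes the function return None, which a caller could take as "not linearized".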

This object is then followed by all the objects required for the first page, then the "hint stream", then sets of objects for each subsequent page in turn, then shared objects required for those pages, then various other miscellaneous objects.

[Yes, really. While page 1 is sent with all the objects that it uses, shared or otherwise, subsequent pages do not get shared resources until after all the unshared page objects have been sent.]

The Hint Stream

Adobe intended the hint stream to facilitate the display of subsequent pages, but has never used it itself. Consequently you can't trust people to write it properly; indeed, Adobe's own output doesn't quite conform to the spec.

As a result, very few people actually use it. MuPDF will use it after sanity checking the values, and should cope with illegal or incorrect values.
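To give a feel for what "sanity checking" might involve, here is a minimal sketch. It is illustrative only and does not describe MuPDF's actual checks: it assumes the hint tables have already been decoded into per-page (offset, length) pairs, and both function names are hypothetical. The point is simply that untrusted hints are validated against something known (here, the file length), and that a failed check means falling back to the linear search rather than failing outright.

```python
def usable_page_hints(hints, file_length):
    """hints: list of (offset, length) pairs, one per page (assumed already decoded).

    Returns True only if every hinted range is non-empty and lies inside the file.
    """
    for offset, length in hints:
        if offset < 0 or length <= 0:
            return False
        if offset + length > file_length:
            return False
    return True

def locate_page(page, hints, file_length):
    """Return the hinted (offset, length) for `page`, or None to mean
    'do not trust the hints; fall back to a linear search'."""
    if not usable_page_hints(hints, file_length) or not 0 <= page < len(hints):
        return None
    return hints[page]

good = [(100, 50), (150, 80)]
bad = [(100, 50), (990, 80)]   # second range runs past the end of a 1000-byte file
print(locate_page(1, good, 1000))
print(locate_page(1, bad, 1000))
```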

So how does MuPDF handle progressive loading?

Various extensions have been made to MuPDF's mechanisms for handling progressive loading.

Progressive loading using byte range requests

If the caller has control over the HTTP fetch, then it is possible to use byte-range requests to fetch the document 'out of order'. This enables non-linearized files to be progressively displayed as they download, and delivers complete renderings of pages earlier than would otherwise be the case. This process requires no changes within MuPDF itself, but rather in the way the progressive stream learns from the attempts MuPDF makes to fetch data.
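One way to picture a progressive stream that "learns" from failed reads is sketched below. Everything in it is hypothetical: the class name ProgressiveStream, the TryLater exception (a stand-in for whatever "data not available yet" signal the reader uses), the 4096-byte block size, and the in-memory fetch_range callback, which takes the place of an HTTP request carrying a "Range: bytes=start-end" header. It is a sketch of the idea, not of any real implementation.

```python
class TryLater(Exception):
    """Raised when the requested bytes have not been downloaded yet."""

class ProgressiveStream:
    BLOCK = 4096  # granularity of range requests (arbitrary choice)

    def __init__(self, length, fetch_range):
        self.length = length
        self.fetch_range = fetch_range  # callable(start, end_inclusive) -> bytes
        self.blocks = {}                # block index -> downloaded bytes
        self.wanted = []                # blocks the reader asked for and missed

    def read(self, offset, size):
        end = min(offset + size, self.length)
        first, last = offset // self.BLOCK, (end - 1) // self.BLOCK
        missing = [b for b in range(first, last + 1) if b not in self.blocks]
        if missing:
            # Remember what the reader wanted so that it is fetched next.
            for b in missing:
                if b not in self.wanted:
                    self.wanted.append(b)
            raise TryLater(offset)
        data = b"".join(self.blocks[b] for b in range(first, last + 1))
        skip = offset - first * self.BLOCK
        return data[skip:skip + (end - offset)]

    def fetch_next(self):
        """Fetch one block: a wanted one if any, else the first missing
        block in file order. Returns the Range value used, or None when done."""
        while self.wanted and self.wanted[0] in self.blocks:
            self.wanted.pop(0)
        if self.wanted:
            b = self.wanted.pop(0)
        else:
            total = (self.length + self.BLOCK - 1) // self.BLOCK
            remaining = [i for i in range(total) if i not in self.blocks]
            if not remaining:
                return None
            b = remaining[0]
        start = b * self.BLOCK
        end = min(start + self.BLOCK, self.length) - 1
        self.blocks[b] = self.fetch_range(start, end)
        return "bytes=%d-%d" % (start, end)

# 16384 bytes standing in for a PDF held on a server.
document = bytes(range(256)) * 64
requests = []

def fetch_range(start, end):
    requests.append((start, end))
    return document[start:end + 1]

stream = ProgressiveStream(len(document), fetch_range)
try:
    stream.read(9000, 10)         # the reader seeks somewhere not yet downloaded
except TryLater:
    pass
header = stream.fetch_next()      # the missed block is fetched before anything else
tail = stream.read(9000, 10)      # the retried read now succeeds
```

The reader's failed read is what drives the download order: the block containing offset 9000 is requested first, and only once nothing is outstanding does the stream go back to fetching the file from the start.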

Consider, for example, an attempt to fetch a hypothetical file from a server.