Basic Structure of Portable Document Format (PDF)

Cover Image Credit: https://www.techradar.com/in/best/best-pdf-reader-for-mac

Portable Document Format (PDF) is the go-to standard when you want to ensure your document looks exactly the same everywhere, regardless of which device opens the file.

PDF was standardized as ISO 32000-1 in 2008. The PDF specification used to write this article is available here.

Before we begin, let’s clarify some of the terms that will be used in this article:

  • Objects — These are not objects in the OOP sense, but the basic data types of the PDF standard: null, boolean, integer, real, name, string, array, dictionary and stream (9 types when integer and real numbers are counted separately)

  • Dictionary — An unordered collection of key-value pairs, delimited by << at the beginning and >> at the end

  • Indirect Objects — Objects labeled with an identifier so that other objects can refer to them by reference

  • Direct Objects — Objects that appear inline and are obtained directly

  • Conforming Reader — Application that parses a PDF file according to the PDF Standard

To follow along with the article, you can open a PDF file in a text editor and inspect its structure.

Structure of PDF File

For simplicity, a PDF file can be viewed as being made up of 4 parts: Header, Body, Cross-Reference Table and Trailer.

Overall Structure of PDF File

Header

The first line of a PDF file is the Header. It denotes the version of the PDF file in the following format:

  • %PDF- followed by the version number in the form of 1.N

However, beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (reached through the Root entry of the Trailer), if present, is used instead of the Header version, provided it specifies a later version.

Furthermore, if a PDF file contains binary data (which is likely, as most modern PDF files contain stream objects of some sort), the Header line shall be immediately followed by a comment line, starting with %, that contains at least four binary characters (character codes of 128 or greater). This marker helps file-transfer tools treat the file as binary rather than text.
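The Header rules above can be checked with a few lines of Python. This is only a minimal sketch: the function name parse_header and the sample bytes are made up for illustration, and it assumes the Header sits at the very start of the data.

```python
import re


def parse_header(data: bytes):
    """Return ((major, minor), has_binary_marker) for the start of a PDF file."""
    lines = data.splitlines()
    if not lines:
        raise ValueError("empty input")

    # The first line must look like %PDF-1.N
    match = re.match(rb"%PDF-(\d)\.(\d+)", lines[0])
    if match is None:
        raise ValueError("missing %PDF- header")
    version = (int(match.group(1)), int(match.group(2)))

    # The second line, if it is a comment holding at least four bytes
    # with a value of 128 or greater, marks the file as binary.
    has_marker = False
    if len(lines) > 1 and lines[1].startswith(b"%"):
        high_bytes = sum(1 for byte in lines[1][1:] if byte >= 128)
        has_marker = high_bytes >= 4
    return version, has_marker


# Hypothetical first bytes of a file, not taken from a real document.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(parse_header(sample))
```

The sketch only reports what the Header says; it does not look at the Version entry in the catalog dictionary, which may override it.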

Body

The Body of a PDF file consists of Indirect Objects representing the contents of a document. Each Indirect Object starts with a unique object identifier that allows other objects to refer to it. The identifier is made up of the following:

  • Object Number (Positive Integer, can be in any arbitrary order)

  • Generation Number (Non-negative Integer)

An Indirect Object can be referred to from elsewhere by an Indirect Reference, which consists of the Object Number, the Generation Number and the keyword R (for example, 4 0 R).

After the identifier comes the keyword obj, which marks the start of the object, and the object ends with the keyword endobj. Whatever lies in between is the object itself, which can be any of the object types listed earlier; very often it is a Dictionary of key-value pairs that describe the object.

Following is a simple example of an Indirect Object

Sample Indirect Object
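To make the identifier, the obj and endobj keywords and Indirect References more concrete, here is a rough Python sketch. The sample object is hypothetical, and the regular expressions are a simplification: a real parser has to tokenize the file properly, because stream data or strings can contain text that looks like an object or a reference.

```python
import re

# Hypothetical Indirect Object, not taken from a real file.
SAMPLE = b"""4 0 obj
<< /Type /Page /Parent 2 0 R /Contents 5 0 R >>
endobj
"""

# "<object number> <generation number> obj ... endobj"
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "<object number> <generation number> R"
REFERENCE_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the raw bytes of the object."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJECT_RE.finditer(data)
    }


def find_references(body: bytes) -> list:
    """List the (object number, generation number) pairs referenced in body."""
    return [(int(num), int(gen)) for num, gen in REFERENCE_RE.findall(body)]


objects = find_objects(SAMPLE)
print(objects)
print(find_references(objects[(4, 0)]))
```

Here object 4 0 is a Dictionary that points at two other objects, 2 0 and 5 0, which is how the objects in the Body end up linked together.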

Simply put, the Body of a PDF file is a tree of objects linked together by Indirect References, all ultimately reachable from the Root Object (the catalog dictionary named by the Root entry in the Trailer).

Cross-Reference Table

The Cross-Reference Table is a list of byte offsets pointing to the Indirect Objects. A conforming reader uses it as a lookup table to access a particular object quickly when needed.

The format for entries in Cross-Reference Table can be summarized as follows:

  • Each entry has the format nnnnnnnnnn ggggg n eol, a total of 20 bytes

  • nnnnnnnnnn is a 10-digit byte offset, padded with leading zeros, counted from the beginning of the file to the beginning of the object

  • ggggg is a 5-digit generation number

  • n is the keyword for an in-use entry; f is the keyword for a free entry

  • eol is a 2-character end-of-line sequence (SP CR, SP LF or CR LF)

The Cross-Reference Table always starts with a special entry for object number 0, which is always free and has a generation number of 65535. It typically appears as 0000000000 65535 f.
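Because every entry is exactly 20 bytes, the fields can be read by position. The sketch below is illustrative only: the names XrefEntry, parse_xref_entry and parse_xref_subsection are made up, the sample offsets are hypothetical, and it assumes the entries are preceded by a line giving the first object number and the number of entries, terminated by a single LF.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # byte offset for in-use entries
    generation: int  # generation number
    in_use: bool     # True for keyword n, False for keyword f


def parse_xref_entry(entry: bytes) -> XrefEntry:
    """Parse one 20-byte entry of the form 'nnnnnnnnnn ggggg n eol'."""
    if len(entry) != 20:
        raise ValueError("a cross-reference entry must be 20 bytes long")
    keyword = entry[17:18]
    if keyword not in (b"n", b"f"):
        raise ValueError("entry keyword must be n or f")
    return XrefEntry(int(entry[0:10]), int(entry[11:16]), keyword == b"n")


def parse_xref_subsection(data: bytes) -> dict:
    """Parse a 'first count' line followed by count 20-byte entries."""
    header, _, rest = data.partition(b"\n")
    first, count = (int(field) for field in header.split())
    return {
        first + i: parse_xref_entry(rest[20 * i : 20 * (i + 1)])
        for i in range(count)
    }


# Hypothetical subsection covering objects 0 to 2 (eol is SP LF here).
sample = (
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
)
table = parse_xref_subsection(sample)
print(table[0])
print(table[1])
```

With the table in hand, a reader can jump straight to byte 17 to read object 1 instead of scanning the file for it.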

Trailer

The Trailer gives the location of the Cross-Reference Table and of certain special objects. A conforming reader should read a PDF file from its end, so it can reach the Cross-Reference Table and other Indirect Objects quickly without parsing the entire file.

The Trailer is essentially another Dictionary of key-value pairs. It is introduced by the keyword trailer and followed by the keyword startxref, the byte offset of the last Cross-Reference Section, and the end-of-file marker %%EOF.
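Reading from the end works because a PDF file finishes with the keyword startxref, the byte offset of the last Cross-Reference Section, and the marker %%EOF. The following Python sketch shows the idea; the function name find_startxref, the 1024-byte tail size and the tiny sample file are all assumptions made for illustration.

```python
def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    # Only the end of the file is inspected, never the whole Body.
    tail = data[-tail_size:]
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker near the end of the file")
    marker = tail.rfind(b"startxref", 0, eof)
    if marker == -1:
        raise ValueError("no startxref keyword before %%EOF")
    # The offset is the first token between startxref and %%EOF.
    return int(tail[marker + len(b"startxref") : eof].split()[0])


# Hypothetical minimal file: a Header, one Cross-Reference Section, a Trailer.
header = b"%PDF-1.4\n"
sample = (
    header
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n" + str(len(header)).encode("ascii") + b"\n%%EOF\n"
)
offset = find_startxref(sample)
print(offset, sample[offset : offset + 4])
```

The returned offset points at the keyword xref, so the reader can go directly to the Cross-Reference Section.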

Following is an example of a Trailer dictionary.

The Trailer dictionary has the following keys:

  • /Size — Total number of entries in Cross-Reference Table

  • /Prev — Only present if the file has more than one Cross-Reference Section. It is the byte offset from the beginning of the file to the beginning of the previous Cross-Reference Section

  • /Root — Catalog dictionary for the PDF (Must be an indirect reference)

  • /Encrypt — Encryption dictionary; if it exists, the PDF file is encrypted (Can be a dictionary or an indirect reference)

  • /Info — Information dictionary containing general information on the file (Must be an indirect reference)

  • /ID — Array of two byte-strings constituting a file identifier (If the Encrypt entry exists, then this entry is required and shall be direct objects)
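The keys above can be pulled out of a simple Trailer with a small Python sketch. Everything here is illustrative: the sample values are hypothetical, the name parse_trailer is made up, and the regular expression only understands integers, Indirect References and flat arrays, so a Trailer holding a nested dictionary (for example a direct /Encrypt dictionary) would need a real tokenizer.

```python
import re

# Hypothetical Trailer, with made-up values.
SAMPLE_TRAILER = b"""trailer
<< /Size 22
   /Root 2 0 R
   /Info 1 0 R
   /ID [<81b14aafa313db63dbd6f981e49f94f4> <81b14aafa313db63dbd6f981e49f94f4>]
>>
startxref
18799
%%EOF
"""

# A key followed by an Indirect Reference, an integer, or a flat array.
ENTRY_RE = re.compile(rb"/(\w+)\s+(\d+\s+\d+\s+R|\d+|\[[^\]]*\])")


def parse_trailer(data: bytes) -> dict:
    """Return the Trailer dictionary entries as strings keyed by name."""
    start = data.rfind(b"trailer")
    if start == -1:
        raise ValueError("no trailer keyword found")
    opening = data.index(b"<<", start)
    closing = data.index(b">>", opening)
    body = data[opening + 2 : closing]
    return {
        key.decode("ascii"): b" ".join(value.split()).decode("ascii")
        for key, value in ENTRY_RE.findall(body)
    }


trailer = parse_trailer(SAMPLE_TRAILER)
print(trailer)
# /Root is expected to be an Indirect Reference.
print(trailer["Root"].endswith(" R"))
```

From here a reader would look up object 2 0 in the Cross-Reference Table to load the catalog dictionary and start walking the document.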

Conclusion

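Looking at the overall structure of a PDF file, we can see that the PDF standard is somewhat complicated. Much of that complexity, such as the Cross-Reference Table and reading the file from its end, exists so that a reader can access objects quickly and memory-efficiently without parsing the entire file.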