
Basic Structure of Portable Document Format (PDF)

PDF Cover Image source: TechRadar

Portable Document Format (PDF) is the go-to standard when you want to ensure your document looks exactly the same everywhere, regardless of which device opens the file.

PDF was standardized as ISO 32000 in 2008. The PDF specification used to write this article is available here.

Before we begin, let’s clarify some of the terms that will be used in this article

To follow along with the article, you can open a PDF file in a text editor and inspect its structure.

Structure of PDF File

For simplicity, a PDF file can be viewed as having four parts: the Header, the Body, the Cross-Reference Table and the Trailer.

Overall Structure of PDF File

The first line of a PDF file is the Header. It identifies the version of the PDF specification the file conforms to, in the form %PDF-M.N (for example, %PDF-1.7).

However, beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry of the Trailer), if present and later than the version in the Header, is used instead of the Header.

Furthermore, if a PDF file contains binary data (which is likely, as most modern PDF files contain stream objects of some sort), the Header line shall be immediately followed by a comment line containing at least four binary characters (character codes of 128 or greater).
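To make the two Header rules concrete, here is a minimal Python sketch that reads the version and checks for the binary comment line. The function name parse_pdf_header and the sample bytes are made up for illustration, and the sketch assumes the Header sits at the very start of the data, which real-world files do not always guarantee.

```python
import re

# Matches "%PDF-M.N" at the very start of the first line.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_pdf_header(data: bytes):
    """Return (version, has_binary_marker) from the first two lines of a PDF."""
    lines = data[:1024].splitlines()  # the Header sits at the start of the file
    if not lines:
        raise ValueError("empty input")
    match = HEADER_RE.match(lines[0])
    if match is None:
        raise ValueError("first line is not a PDF header")
    version = f"{int(match.group(1))}.{int(match.group(2))}"

    # The binary marker is a comment line ('%') holding at least four
    # bytes with character codes of 128 or greater.
    has_binary_marker = False
    if len(lines) > 1 and lines[1].startswith(b"%"):
        high_bytes = sum(1 for byte in lines[1][1:] if byte >= 128)
        has_binary_marker = high_bytes >= 4
    return version, has_binary_marker


# Hypothetical sample: a 1.7 Header followed by a four-byte binary comment.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(parse_pdf_header(sample))  # ('1.7', True)
```

Note that this only reads the Header. A full reader would still compare the result with the Version entry in the catalog dictionary, as described above.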

Body

The Body of a PDF file consists of Indirect Objects representing the contents of the document. Each Indirect Object starts with a unique object identifier that allows other objects to refer to it. The identifier is made up of an object number and a generation number.

An Indirect Object can be referred to from elsewhere by an Indirect Reference, which consists of the object number, the generation number and the keyword R (for example, 4 0 R).

The identifier is followed by the keyword obj, which marks the start of the object, and the object ends with the keyword endobj. Everything in between is the object's content, most commonly a dictionary of key-value pairs that describes the object.

The following is a simple example of an Indirect Object:

Sample Indirect Object
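As a rough illustration of the identifier, the obj/endobj keywords and Indirect References, the Python sketch below pulls them out of a small object with regular expressions. The sample object and the function name parse_indirect_object are hypothetical and are not the object shown in the original figure. The regular expressions are a simplification: they would break on stream data that happens to contain the bytes endobj, so treat this as a reading aid rather than a parser.

```python
import re

# "<object number> <generation number> obj ... endobj"
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "<object number> <generation number> R"
REFERENCE_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def parse_indirect_object(data: bytes):
    """Extract the identifier, raw content and references of the first object."""
    match = OBJECT_RE.search(data)
    if match is None:
        raise ValueError("no indirect object found")
    content = match.group(3).strip()
    references = [
        (int(number), int(generation))
        for number, generation in REFERENCE_RE.findall(content)
    ]
    return {
        "object_number": int(match.group(1)),
        "generation_number": int(match.group(2)),
        "content": content,
        "references": references,
    }


# Hypothetical page object that refers to objects 2 and 5.
sample = b"""4 0 obj
<< /Type /Page /Parent 2 0 R /Contents 5 0 R >>
endobj
"""
parsed = parse_indirect_object(sample)
print(parsed["object_number"], parsed["generation_number"])  # 4 0
print(parsed["references"])  # [(2, 0), (5, 0)]
```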

Simply put, the Body of a PDF file is a tree of objects linked together, ultimately rooted at the Root object (the catalog dictionary referenced by the Root entry in the Trailer).

Cross-Reference Table

The Cross-Reference Table lists the byte offset of each Indirect Object within the file. A conforming reader uses it as a lookup table to jump directly to a given object when needed.

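The format of the entries in the Cross-Reference Table can be summarized as follows: each entry consists of a 10-digit number, a 5-digit generation number and the keyword n (in use) or f (free). For an in-use entry the 10-digit number is the byte offset of the object; for a free entry it is the object number of the next free object.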

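The first entry in the Cross-Reference Table is always for object number 0, which is always free and has a generation number of 65535. In a file with no other free objects, it reads 0000000000 65535 f.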
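To see how a reader might turn such a table into a lookup structure, here is a short Python sketch. It assumes the classic layout from ISO 32000-1: the keyword xref, a line giving the first object number and the number of entries, then one entry per line. It handles a single subsection only, and the function name parse_xref_table and the offsets in the sample are invented for illustration.

```python
def parse_xref_table(text: str):
    """Map object number -> (10-digit field, generation number, in_use)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    # Subsection header: first object number and number of entries.
    first_object, count = (int(part) for part in lines[1].split())
    entries = {}
    for index, line in enumerate(lines[2:2 + count]):
        field, generation, keyword = line.split()
        if keyword not in ("n", "f"):
            raise ValueError(f"unexpected entry keyword: {keyword!r}")
        entries[first_object + index] = (int(field), int(generation), keyword == "n")
    return entries


# Hypothetical table with the special entry for object 0 and two objects in use.
sample = """xref
0 3
0000000000 65535 f
0000000017 00000 n
0000000081 00000 n
"""
table = parse_xref_table(sample)
print(table[0])  # (0, 65535, False)
print(table[1])  # (17, 0, True)
```

With the table in a dictionary, fetching object 2 becomes a matter of seeking to byte 81 instead of scanning the file.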

Trailer

The Trailer gives the location of the Cross-Reference Table and of certain special objects. A conforming reader should read a PDF file from its end, so it can reach the Cross-Reference Table and other Indirect Objects quickly without parsing the entire file.
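Reading from the end can be sketched in a few lines of Python. The sketch relies on two details from ISO 32000-1: the file ends with the marker %%EOF, and just before it the keyword startxref is followed by the byte offset of the Cross-Reference Table. The function name find_xref_offset, the 1024-byte tail size and the offset 142 in the sample are assumptions made for illustration.

```python
def find_xref_offset(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[-tail_size:]  # only the end of the file is inspected
    position = tail.rfind(b"startxref")
    if position == -1:
        raise ValueError("startxref keyword not found near the end of the file")
    tokens = tail[position + len(b"startxref"):].split()
    if not tokens:
        raise ValueError("no offset after startxref")
    return int(tokens[0])


# Hypothetical file tail; the body of the file is elided on purpose.
sample = (
    b"%PDF-1.7\n"
    b"... objects and cross-reference table omitted ...\n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    b"startxref\n142\n%%EOF\n"
)
print(find_xref_offset(sample))  # 142
```

Only the last kilobyte is scanned, which is why a reader can locate the table in a very large file without touching most of it.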

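The Trailer is essentially another dictionary of key-value pairs. It is introduced by the keyword trailer and followed by the keyword startxref, the byte offset of the Cross-Reference Table and the end-of-file marker %%EOF.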

The following is an example of a Trailer dictionary:
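Since the original example is not reproduced here, the Python sketch below uses a made-up Trailer and extracts two of its entries: Size, which per ISO 32000-1 is the total number of entries in the Cross-Reference Table, and Root, the Indirect Reference to the catalog dictionary. The function name read_trailer_keys is hypothetical, and the regular expressions ignore the other Trailer keys and any nested dictionaries.

```python
import re

SIZE_RE = re.compile(rb"/Size\s+(\d+)")
ROOT_RE = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R\b")


def read_trailer_keys(data: bytes):
    """Return the Size value and the Root reference from the last Trailer."""
    start = data.rfind(b"trailer")
    if start == -1:
        raise ValueError("trailer keyword not found")
    # The dictionary lies between 'trailer' and 'startxref'.
    end = data.find(b"startxref", start)
    section = data[start:end] if end != -1 else data[start:]
    size = SIZE_RE.search(section)
    root = ROOT_RE.search(section)
    return {
        "Size": int(size.group(1)) if size else None,
        "Root": (int(root.group(1)), int(root.group(2))) if root else None,
    }


# Hypothetical Trailer: three table entries, catalog stored in object 1.
sample = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n142\n%%EOF\n"
print(read_trailer_keys(sample))  # {'Size': 3, 'Root': (1, 0)}
```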

The Trailer dictionary has the following keys:

Conclusion

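Looking at the overall structure of a PDF file, we can see that the format is somewhat complicated, but much of that complexity serves speed and memory efficiency: the Trailer and the Cross-Reference Table let a reader jump straight to the objects it needs instead of parsing the whole file.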

