Cover Image Credit: https://www.techradar.com/in/best/best-pdf-reader-for-mac
Portable Document Format (PDF) is the go-to standard when you want to ensure your document looks exactly the same everywhere, regardless of which device opens the file.
PDF was standardized as ISO 32000 in 2008. The PDF specification that is used to write this article is available here.
Before we begin, let's clarify some of the terms that will be used in this article:
Objects: Not objects in the OOP sense, but the basic data types of the PDF standard. There are 9 types of objects (null, boolean, integer, real, name, string, array, dictionary and stream)
Dictionary: An unordered collection of key-value pairs, delimited by << at the beginning and >> at the end
Indirect Objects: Objects that are referred to by reference
Direct Objects: Objects that appear inline and are obtained directly
Conforming Reader: An application that parses a PDF file according to the PDF standard
To follow along with the article, you could open a PDF file using a text editor and try to view the structure of the PDF file.
Structure of PDF File
For simplicity, a PDF file can be viewed as being made up of 4 parts: the header, the body, the cross-reference table and the trailer.
Header
The first line of a PDF file is the header. It denotes the version of the PDF file in the format %PDF- followed by a version number of the form 1.N, for example %PDF-1.7.
However, beginning with PDF 1.4, the Version entry in the document's catalog dictionary (located via the Root entry of the trailer), if present, is used instead of the version given in the header.
Furthermore, if a PDF file contains binary data (which is very likely, as most modern PDF files contain stream objects of some sort), the header line shall be immediately followed by a comment line containing at least four binary characters (character codes of 128 or greater).
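To make the two header rules concrete, here is a minimal Python sketch that reads the version from the first line and checks whether the second line is a comment with at least four high bytes. The function name parse_header and the sample bytes are assumptions made for illustration; a real reader would also have to consult the Version entry of the catalog.

```python
import re

# Matches the header comment, e.g. b"%PDF-1.7".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_header(data: bytes):
    """Return ((major, minor), has_binary_marker) for the start of a PDF file."""
    lines = data.splitlines()
    if not lines:
        raise ValueError("empty input")
    match = HEADER_RE.match(lines[0])
    if match is None:
        raise ValueError("missing %PDF- header")
    version = (int(match.group(1)), int(match.group(2)))

    # The second line should be a comment with at least four bytes >= 128.
    has_binary_marker = False
    if len(lines) > 1 and lines[1].startswith(b"%"):
        high_bytes = sum(1 for byte in lines[1][1:] if byte >= 128)
        has_binary_marker = high_bytes >= 4
    return version, has_binary_marker


# Hypothetical first two lines of a PDF 1.7 file (not taken from a real document).
sample_header = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
print(parse_header(sample_header))  # ((1, 7), True)
```

The binary comment line carries no meaning for the document itself; it only signals to tools that the file should be treated as binary rather than text.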
Body
The body of a PDF file consists of indirect objects representing the contents of the document. Each indirect object starts with a unique object identifier that allows other objects to refer to it. The identifier is made up of the following:
Object Number (Positive Integer, can be in any arbitrary order)
Generation Number (Non-negative Integer)
An indirect object can be referred to from elsewhere by an indirect reference, which consists of the object number, the generation number and the keyword R (for example, 4 0 R).
The identifier is followed by the keyword obj, which marks the start of the object, and the object is closed by the keyword endobj. Whatever lies in between is the object's value, very often a dictionary of key-value pairs that describes the object.
The following is a simple example of an indirect object.
[Figure: Sample Indirect Object]
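As a rough illustration of identifiers and indirect references, the Python sketch below pulls indirect objects out of a small, hypothetical body and lists the references one of them contains. The sample bytes and the helper names find_objects and find_references are assumptions for this example. Plain regular expressions are enough here only because the sample has no streams or strings that could contain the keywords; a real parser needs a proper tokenizer.

```python
import re

# "<object number> <generation number> obj ... endobj"
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "<object number> <generation number> R"
REFERENCE_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_objects(body: bytes):
    """Map (object number, generation number) to the raw bytes of each object."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJECT_RE.finditer(body)
    }


def find_references(raw: bytes):
    """List the (object number, generation number) pairs referenced in raw."""
    return [(int(n), int(g)) for n, g in REFERENCE_RE.findall(raw)]


# Hypothetical body with two indirect objects; object 1 refers to object 2.
sample_body = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
"""

objects = find_objects(sample_body)
print(sorted(objects))                   # [(1, 0), (2, 0)]
print(find_references(objects[(1, 0)]))  # [(2, 0)]
```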
Simply put, the body of a PDF file is a tree of objects linked together, all ultimately reachable from the root object (the catalog dictionary, identified by the Root entry in the trailer).
Cross-Reference Table
The cross-reference table is basically a list of byte offsets pointing to the indirect objects. A conforming reader uses the cross-reference table as a lookup table to access specific objects quickly when needed.
Each entry in the cross-reference table has the format nnnnnnnnnn ggggg n eol, a total of 20 bytes:
nnnnnnnnnn is a 10-digit byte offset in the decoded stream
ggggg is a 5-digit generation number
n is the keyword for an in-use entry, or f is the keyword for a free entry
eol is a 2-character end-of-line sequence (like CR LF)
The cross-reference table always starts with a special entry, 0000000000 65535 f, which never changes: it belongs to object number 0, which is always free.
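Because every entry is exactly 20 bytes, it can be decoded by slicing at fixed positions. The following Python sketch shows one way to do that; XrefEntry and parse_xref_entry are hypothetical names chosen for this example, and the second sample entry is made up.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # the 10-digit field (byte offset for in-use entries)
    generation: int  # the 5-digit generation number
    in_use: bool     # True for keyword n, False for keyword f


def parse_xref_entry(entry: bytes) -> XrefEntry:
    """Parse one 20-byte cross-reference entry."""
    if len(entry) != 20:
        raise ValueError("a cross-reference entry must be exactly 20 bytes")
    offset = int(entry[0:10])       # bytes 0-9
    generation = int(entry[11:16])  # bytes 11-15, after one space
    keyword = entry[17:18]          # byte 17, followed by the 2-byte eol
    if keyword not in (b"n", b"f"):
        raise ValueError("expected keyword n or f")
    return XrefEntry(offset, generation, keyword == b"n")


print(parse_xref_entry(b"0000000000 65535 f\r\n"))
print(parse_xref_entry(b"0000000017 00000 n\r\n"))
```

The fixed width is what makes the table useful as a lookup structure: a reader can jump straight to the entry for a given object number by multiplying by 20 instead of scanning the table.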
Trailer
The trailer gives the location of the cross-reference table and of certain special objects. A conforming reader reads a PDF file from its end, so it can reach the cross-reference table and other indirect objects quickly without parsing the entire file.
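Reading from the end can be sketched in a few lines of Python. In the PDF specification, the last lines of a file hold the keyword startxref, the byte offset of the cross-reference table and the %%EOF marker, so a reader only needs to inspect the tail of the file. The function name find_startxref, the 1024-byte window and the sample bytes are assumptions made for this example.

```python
def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset stored after the last startxref keyword."""
    tail = data[-tail_size:]
    marker = tail.rfind(b"startxref")
    if marker == -1:
        raise ValueError("startxref not found near the end of the file")
    # The offset is the first token after the keyword.
    tokens = tail[marker + len(b"startxref"):].split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("no byte offset after startxref")
    return int(tokens[0])


# Hypothetical end of a PDF file; the offset 123 is made up.
sample_tail = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n123\n%%EOF\n"
print(find_startxref(sample_tail))  # 123
```

With that offset in hand, a reader can seek directly to the cross-reference table and from there to any object it needs.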
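The trailer is basically another key-value dictionary. It is introduced by the keyword trailer, followed by the dictionary enclosed in << and >>, then the keyword startxref, the byte offset of the last cross-reference section, and the end-of-file marker %%EOF.
[Figure: example of a trailer dictionary]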
The trailer dictionary has the following keys:
/Size: Total number of entries in the cross-reference table
/Prev: Only used if there is more than one cross-reference section. It is the byte offset in the decoded stream from the beginning of the file to the beginning of the previous cross-reference section
/Root: Catalog dictionary for the PDF (must be an indirect reference)
/Encrypt: Encryption dictionary. If it exists, the PDF file is encrypted (can be a dictionary or an indirect reference)
/Info: Information dictionary containing general information on the file (must be an indirect reference)
/ID: Array of two byte-strings constituting a file identifier (if the Encrypt entry exists, this entry is required and the strings shall be direct objects)
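To tie the keys together, here is a small Python sketch that extracts a few of them from a flat trailer dictionary and enforces the rule that an encrypted file must carry an ID. The sample trailer, the name summarize_trailer and the regular-expression approach are all assumptions for illustration; nested dictionaries or unusual spacing would need a real parser.

```python
import re


def summarize_trailer(trailer: bytes) -> dict:
    """Extract a few well-known entries from a flat trailer dictionary."""

    def integer(key: bytes):
        m = re.search(rb"/" + key + rb"\s+(\d+)", trailer)
        return int(m.group(1)) if m else None

    def reference(key: bytes):
        m = re.search(rb"/" + key + rb"\s+(\d+)\s+(\d+)\s+R\b", trailer)
        return (int(m.group(1)), int(m.group(2))) if m else None

    summary = {
        "size": integer(b"Size"),
        "prev": integer(b"Prev"),
        "root": reference(b"Root"),
        "info": reference(b"Info"),
        "encrypted": b"/Encrypt" in trailer,
        "has_id": re.search(rb"/ID\s*\[", trailer) is not None,
    }
    # An Encrypt entry makes the ID entry mandatory.
    if summary["encrypted"] and not summary["has_id"]:
        raise ValueError("encrypted file without an /ID entry")
    return summary


# Hypothetical trailer dictionary; the numbers and identifiers are made up.
sample_trailer = b"<< /Size 22 /Root 2 0 R /Info 1 0 R /ID [<0a1b2c3d> <0a1b2c3d>] >>"
print(summarize_trailer(sample_trailer))
```

A prev value of None in the result means the file has only one cross-reference section.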
Conclusion
Looking at the overall structure of a PDF file, we can see that the PDF standard is somewhat complicated, and that much of its design is aimed at letting a reader reach any object quickly without loading the whole file into memory.