by Thom Parker of WindJack Solutions.
Copyright © 2004 by WindJack Solutions

Navigate the Internal Structure of a PDF Document

This article discusses how you can use PDF CanOpener to gain a better understanding of how the low level PDF Objects are grouped together to form the higher level structures that make up a PDF Document.

A PDF Document is a many-layered thing; that is, it has many layers of abstraction. You can look at one from many different perspectives, each with its own advantages and disadvantages. At the lowest level, the PDF File contains the raw document data. Next up, the COS Layer organizes this data into a tree of simple objects. At the PD layer, these simple objects are put together to implement useful intermediate-level structures like Fonts and Images. These are in turn organized into higher-level constructs like Annotations and Pages. Some of these objects are also used to impose logical structure, like paragraphs and article threads. And there are more layers still.

Each of these layers of abstraction has its own independent set of rules. For example, a file can be perfectly legal at the file-format level and still not contain any useful objects. The COS Object Tree may contain many objects that do not contribute to the document display or are completely unintelligible to Acrobat, but still form a legal object tree. Knowing how to navigate these structures is essential to any PDF-related development effort. PDF CanOpener makes this task easier than it has ever been by providing a concurrent display of the document's COS Object Tree and navigation tools that show the relationship between objects in the tree and graphics displayed on the screen.

PDF File Structure:

The PDF File Format is text with some binary data mixed in. If you open a PDF file in a text editor you'll see the raw objects that define the structure and content of the document. Explicit object definitions are prefixed with text that looks like '12 0 obj', where 12 is the object number and 0 is the generation number. This is called an Indirect Object since it can be referenced by its number. You will also see objects without this prefix. These are called Direct Objects and are always contained inside other objects. A container object references an indirect object with the syntax '12 0 R', which includes the object defined with '12 0 obj'. There are only 8 low-level, or COS, object types (not counting the special null object).

The first 5 are scalar (single value) types:

Integer - in the file as a number without a decimal point.
Boolean - in the file as the text 'true' or 'false'.
Real Number - in the file as a number with a decimal point.
Name - in the file as '/text', that is, a forward slash, '/', followed by some text, with no white space or punctuation allowed.
String - in the file as either '(...characters...)' or '<...hexadecimal character codes...>' .

The next 3 are container types:

Dictionary - in the file as '<<...other objects...>>' . Dictionary entries are always in pairs, a Name Object followed by any other object type.
Array - in the file as '[...other objects...]'. Just a list of other object types.
Stream - in the file as '20 0 obj<<...stream attribute objs...>>stream...binary data...endstream' . This is the most complex type. It's actually a Dictionary Object mated with a string of bytes. The Dictionary contains information necessary for accessing the data in the string of bytes. Streams are always indirect objects, so they always begin with an object reference.
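To make the list above concrete, here is a minimal Python sketch that guesses the COS type of a single token from the way it is written in the file. The function name classify_cos_token is made up for this illustration, and the checks are deliberately rough: they look only at the outer delimiters and do not handle Streams, nested escapes, or the null object.

```python
import re


def classify_cos_token(token: str) -> str:
    """Guess the COS type of one token from its written form (rough sketch)."""
    token = token.strip()
    if token in ("true", "false"):
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+", token):
        return "Integer"  # a number without a decimal point
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", token):
        return "Real"  # a number with a decimal point
    # Check '<<' before '<' so a Dictionary is not mistaken for a hex String.
    if token.startswith("<<") and token.endswith(">>"):
        return "Dictionary"
    if token.startswith("[") and token.endswith("]"):
        return "Array"
    if token.startswith("/"):
        return "Name"
    if token.startswith("(") and token.endswith(")"):
        return "String"  # literal string
    if token.startswith("<") and token.endswith(">"):
        return "String"  # hexadecimal string
    raise ValueError(f"unrecognized token: {token!r}")


for sample in ["12", "3.14", "true", "/Type", "(Hello)", "<48656C6C6F>",
               "[0 0 612 792]", "<< /Type /Page >>"]:
    print(sample, "->", classify_cos_token(sample))
```

A real parser has to tokenize the whole file and track nesting, so treat this only as a reading aid for the syntax described above.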

Getting tired yet? Looking at the raw data in the file is really not very useful. The structure of a PDF File does not match the structure of the PDF Document it describes. For the most part the file looks like a list of unconnected object definitions. To get a sense of how it all fits together you need to look at it another way.
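One way to see the hidden connections is to scan the raw data for 'N G obj' definitions and 'N G R' references and record which object points to which. The Python sketch below does this for a tiny hand-written fragment; the RAW sample and the reference_graph helper are invented for this illustration. It assumes plain, uncompressed object definitions and will be fooled by strings or streams that happen to contain similar text.

```python
import re

# A tiny hand-written fragment in the style of a PDF file body (illustrative only).
RAW = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Count 1 /Kids [3 0 R] >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj
"""

# 'N G obj ... endobj' is an indirect object definition.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# 'N G R' is a reference to an indirect object.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def reference_graph(data: bytes) -> dict:
    """Map each object number to the object numbers it references."""
    graph = {}
    for definition in OBJ_RE.finditer(data):
        number = int(definition.group(1))
        body = definition.group(3)
        graph[number] = [int(ref.group(1)) for ref in REF_RE.finditer(body)]
    return graph


print(reference_graph(RAW))  # {1: [2], 2: [3], 3: [2]}
```

Even in this three-object fragment the list order says nothing about the structure; only the references do, and following them by hand gets tedious quickly.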

The COS Object Tree:

The COS Object Tree is the real meat of a PDF Document. Except for some security info and the Info dictionary (neither of which contributes to the document display), everything in the document is in the COS Object Tree. Many of the nodes in this tree are the root nodes for a higher-level object type. Typically this root node will be a COS Dictionary. Two notable exceptions are the XObject, which has a COS Stream as the root, and the Rectangle Object, which uses the COS Array as its root object. Let's take a look at some of these objects. For the purposes of this article I'll be using screen shots of PDF CanOpener to display different parts of the COS Object Tree. This discussion will focus on the most important part of the tree, the pages. Articles on navigating other parts of the COS Object Tree will come later.

If you have PDF CanOpener just activate it and it will show you the root of the current Document (if you don't have it, download the free trial copy and install it now). Expand this node to get a look at the first level of objects in the document. Already this is easier than looking at the raw data in the file.

There are a lot of objects here, but for a minimal PDF Document all you need is the "Page Tree," the root of which is the first Pages object. The functions of some of the other objects, like OpenAction and PageLayout, are obvious from their names; others are a bit cryptic until you know more about the PDF Spec. What they all have in common is that they encapsulate properties that are global to this document and extend its basic functionality. The entries shown above are fairly typical and all of them, and a few more, are explained in the PDF Spec, except for the last two. I used PDF CanOpener to create these custom objects and placed them in the document root. PDF CanOpener uses the Acrobat SDK, so you can do the same. Acrobat will save them with the document even though it has no idea what they are. It does this because they are part of a well-formed object tree and it doesn't have to know about them. It just ignores what it can't use, but preserves these tree nodes as long as they don't interfere with anything else it wants to do. All third-party PDF editing and viewing tools should do the same.

This object tree is a very powerful and flexible thing. It allows a PDF file to be both backward and forward compatible. A PDF file's capabilities can be extended not just by Adobe, but by anyone who defines a custom object type and adds it to the COS Object Tree. Of course you also have to write the software to take advantage of the new objects, and many people have, so some of the things you may see in an object tree will not be in the PDF Specification.

Objects defined by the PDF Spec. tend to have a similar structure. This similarity is most evident in PDF Objects that were added to the spec at the same time. For example, expand the Pages Dictionary.

There are 3 members: Type, Count, and Kids. Most, but not all, PD layer objects will have a Type member. In this case it's "Pages" for a Pages Object. The Count member indicates how many total pages there are in this branch of the Page Tree. The Kids Array is a list of either Pages or Page Objects. The Pages Objects are the intermediate nodes of the Page Tree and the Page Objects are the leaves. The depth of the tree depends on how many pages are in a document. This document has 3 pages, so it has a single-level Page Tree. Acrobat starts adding layers of Pages Objects to this tree as the number of pages in a document grows, mostly keeping the number of objects in the Kids Array between 4 and 10 to create a balanced tree.
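The relationship between Kids and Count can be sketched in a few lines of Python. Here the Page Tree is modeled with plain nested dictionaries rather than real COS objects; the Label key and the helper names collect_pages and counts_are_consistent are made up for this illustration.

```python
def collect_pages(node: dict) -> list:
    """Return the Page leaves under a node, in document order."""
    if node["Type"] == "Page":
        return [node]
    pages = []
    for kid in node["Kids"]:
        pages.extend(collect_pages(kid))
    return pages


def counts_are_consistent(node: dict) -> bool:
    """Check that every Pages node's Count equals the number of leaves below it."""
    if node["Type"] == "Page":
        return True
    if node["Count"] != len(collect_pages(node)):
        return False
    return all(counts_are_consistent(kid) for kid in node["Kids"])


# A two-level tree: one intermediate Pages node with two leaves, plus one direct leaf.
page_tree = {
    "Type": "Pages",
    "Count": 3,
    "Kids": [
        {
            "Type": "Pages",
            "Count": 2,
            "Kids": [
                {"Type": "Page", "Label": "A"},
                {"Type": "Page", "Label": "B"},
            ],
        },
        {"Type": "Page", "Label": "C"},
    ],
}

print([page["Label"] for page in collect_pages(page_tree)])  # ['A', 'B', 'C']
print(counts_are_consistent(page_tree))  # True
```

Note that Count is the number of leaf pages in the whole branch, not the length of the Kids Array, which is why the root above has 2 Kids but a Count of 3.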

The Page Object encapsulates all the information needed to draw a single document page. Expand a Page Object.

As with the Pages Object, the Page dictionary has a Type member. The main content of a page is stored in the Contents member. This Stream Object contains a list of PDF marking (drawing) operators. The marks these operators produce are drawn on the page first, followed by the Annotations listed in the Annots Array. The visible coordinate space (or clipping area) for the page is defined by a "Rectangle" Object. The Rectangle Object is an array of 4 fixed point numbers: the user coordinates for the left, bottom, right, and top sides of the rectangle. This page has two Rectangle Objects, MediaBox and CropBox. The intersection of these two rectangles determines the visible drawing area of the page. There are 3 other types of rectangular objects that a page might have: the BleedBox, TrimBox, and ArtBox. These objects have special printing functions and don't affect the visible display of the page in Acrobat.
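As a small worked example, the intersection of a MediaBox and a CropBox can be computed directly from the two 4-number arrays. This Python sketch assumes both rectangles are already in [left, bottom, right, top] order; the function name visible_area and the sample values are invented for illustration.

```python
def visible_area(media_box, crop_box):
    """Intersect two [left, bottom, right, top] rectangles.

    Returns the overlapping rectangle, or None if the two do not overlap.
    """
    left = max(media_box[0], crop_box[0])
    bottom = max(media_box[1], crop_box[1])
    right = min(media_box[2], crop_box[2])
    top = min(media_box[3], crop_box[3])
    if left >= right or bottom >= top:
        return None
    return [left, bottom, right, top]


# A letter-size MediaBox with a CropBox inset by half an inch (36 points) per side.
print(visible_area([0, 0, 612, 792], [36, 36, 576, 756]))  # [36, 36, 576, 756]

# A CropBox that sticks out past the MediaBox is cut back to it.
print(visible_area([0, 0, 612, 792], [-10, -10, 300, 900]))  # [0, 0, 300, 792]
```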

The Content Stream may contain references to a number of resources it needs during the drawing process. These resources are, of course, listed in the Resources Dictionary. If the page does not have a Resources Dictionary of its own, the page content can also obtain resources from the Resources Dictionary of the page's parent in the Page Tree, although this is rare. Open the Resources dictionary.
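The fallback to a parent's Resources Dictionary amounts to walking up the Parent links until a node with a Resources entry is found. The following Python sketch models Page Tree nodes as plain dictionaries; the function name find_resources and the sample font entries are assumptions made for this illustration.

```python
def find_resources(node):
    """Return the nearest Resources Dictionary, starting at a page and walking up."""
    while node is not None:
        if "Resources" in node:
            return node["Resources"]
        node = node.get("Parent")  # move to the parent Pages node, if there is one
    return None


root = {"Type": "Pages", "Resources": {"Font": {"F1": "Helvetica"}}}
page_with_own = {"Type": "Page", "Parent": root,
                 "Resources": {"Font": {"F2": "Courier"}}}
page_without = {"Type": "Page", "Parent": root}

print(find_resources(page_with_own))  # {'Font': {'F2': 'Courier'}}
print(find_resources(page_without))   # {'Font': {'F1': 'Helvetica'}}
```

The first dictionary found is used as a whole; the sketch does not merge a page's resources with those of its ancestors.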

This dictionary contains lists of each type of resource referenced in the content stream. Except for the "Pattern" Resource, this one contains every type of resource defined in the PDF Spec. Each of these resources is used to control some drawing attribute in a section of the page content.

The ColorSpace and Font have obvious functions. However, they can both be horrendously complex internally.
XObjects are a way of both abstracting drawing elements out of the stream and representing raster images. They are very useful in page content, and are also used as the graphical representation of all Annotations.
The ProcSet list tells Acrobat what operator procedure sets will be needed to render this content stream in PostScript.
The ExtGState object contains parameters for line width and style as well as a whole host of parameters for controlling the fine details of how things are drawn.
The Shading object provides a way of controlling color transitions across a drawing area.

Each of these resource objects, except for the ProcSet, is a complex construction of both the simple COS object types and higher-level objects. For example, expand the XObject list. If the page you're looking at doesn't have one then just follow along with this example. The root object for the XObject is a COS Stream. What makes it an XObject are the entries in its Attributes Dictionary. Open up one of the XObjects and take a look inside the Attributes Dictionary.

A couple of these entries are already familiar. The Type entry, of course, has the value "XObject" and the Resources Dictionary has the same format as the Resources Dictionary in the Page Object. In fact, it can contain more XObject resources that contain other XObject resources.
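Because a Resources Dictionary can hold XObjects whose own Resources hold further XObjects, following them is naturally recursive. The Python sketch below measures how deep that nesting goes. XObjects are modeled as plain dictionaries, the names nesting_depth, leaf, middle, and outer are invented for this illustration, and the structure is assumed to contain no cycles.

```python
def nesting_depth(xobject: dict) -> int:
    """Count how many levels of XObjects are nested under (and including) this one."""
    nested = xobject.get("Resources", {}).get("XObject", {})
    if not nested:
        return 1  # no further XObjects: bottom of the drawing stack
    return 1 + max(nesting_depth(child) for child in nested.values())


leaf = {"Type": "XObject", "Resources": {}}
middle = {"Type": "XObject", "Resources": {"XObject": {"Fm1": leaf}}}
outer = {"Type": "XObject", "Resources": {"XObject": {"Fm0": middle, "Fm2": leaf}}}

print(nesting_depth(leaf))   # 1
print(nesting_depth(outer))  # 3
```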

XObjects come in two flavors, "Form" and "Image," indicated by the value of the Subtype entry. This one is a "Form" XObject, which means that the stream data has a format identical to the page content stream, but unlike the page content, its resources are inside its attribute dictionary. An XObject's resources can also be in the page's Resources dictionary, but the PDF Specification discourages this practice, so it is not common.

XObjects have their own coordinate space, which is clipped by the BBox entry. The Matrix entry provides the transformation between the XObject's coordinate space and the coordinate space of the containing object. In this case that is the page's drawing area, called the User Space in the PDF Spec.
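A PDF matrix is written as six numbers, [a b c d e f], and maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The short Python sketch below applies such a matrix to a point in the XObject's coordinate space; the function name form_to_user_space and the sample matrices are made up for this illustration.

```python
def form_to_user_space(matrix, x, y):
    """Apply a six-number PDF matrix [a, b, c, d, e, f] to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


identity = [1, 0, 0, 1, 0, 0]      # leaves coordinates unchanged
shift = [1, 0, 0, 1, 100, 200]     # moves the XObject 100 right and 200 up
double = [2, 0, 0, 2, 0, 0]        # scales the XObject to twice its size

print(form_to_user_space(identity, 10, 20))  # (10, 20)
print(form_to_user_space(shift, 10, 20))     # (110, 220)
print(form_to_user_space(double, 10, 20))    # (20, 40)
```

Applying the same function to the corners of the BBox shows where the XObject's clipping rectangle lands on the page.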

We have traversed the COS Object Tree from the Document Root down one path through several high-level objects until we reached the last object on this branch. The XObject shown above is on the bottom of the drawing stack since its resources are empty. If they weren't empty we could keep pushing down through the objects in the Resources until we found the last thing that Acrobat needed to draw this object on the page, which is what Acrobat has to do. Sometimes it seems as if the hierarchy is endless. Partly this is because with PDF CanOpener you are looking at the lowest level of abstraction in the PDF Document. A higher-level view would hide the endless parameters some of the objects have.

Acrobat can also build some very complex structures for what seem like simple things. As an example, create a Text Box Annotation on a document and then look at its COS Object Tree. It is represented in the Annotation list as a single Annotation. Now right click on it in the page and set the status. Refresh the page's Annotation list in the PDF CanOpener display. There are now 3 Annotations associated with the original Text Box. Change the status again and it becomes 5 Annotations. Acrobat uses these extra Annotations to save the status history, adding a huge number of COS Objects to the tree in the process.

Sometimes you may have to really look hard for the data you want. The last example shows the depths to which some data is buried. This screen shot is of a Multimedia Object that plays a QuickTime movie. The Multimedia Object lives inside of an Annotation. With the PDF Spec alone it would be very difficult to understand the structure of this object and locate the actual movie data. PDF CanOpener provides the visualization of the COS Object Tree you need to get to the bottom of the document, literally.

The inspiration for this image was provided by Leonard Rosenthal, the PDF Guru Extraordinaire.

We hope this material was helpful to you. If you have any questions or comments for us or want more info on PDF CanOpener, please send email to info@windjack.com.

Check back regularly for new articles.
