Lossless Methods: Optimization of Document Content
The first article of our PDF Optimization In-depth Series is available here.
This blog post was first published on the GdPicture blog.
The PDF format is interactive.
During the life cycle of a PDF document, different people use features such as forms, annotations, attachments, and more.
Archived PDF documents are generally not used in the same way as documents circulating in a collaborative context. The data generated by these interactive features is no longer useful once the document is archived.
Even if the document is still in circulation, it can be worthwhile to optimize it before distribution, for example to stay under the attachment size limit of some platforms or to speed up opening on phones and tablets.
Deleting content deemed unnecessary
The most obvious approach here is to remove the interactive content that is not required by the audience of the document.
At the same time, once the document is archived, some of the data in the file may no longer be useful at all. The best candidates for such a lossless optimization are:
- File attachments: attached files definitely increase the file size. Removing those that are not relevant for viewing does not affect the document's content.
- Bookmarks, hyperlinks: these elements are convenient and allow easy navigation, but they are not essential to view the PDF file correctly.
- Annotations, form fields, JavaScript actions: when such elements are no longer in use, their contents can be deleted from the document.
- Page thumbnails: thumbnail images stored inside the document enable faster navigation. If they are removed, viewers still render them on the fly.
- Metadata: this can sometimes be very bulky, since it can include any type of data, such as photos, files, and more. Be more careful here, because metadata may contain information that is useful for indexing the document.
- Color profiles: this content is intended for the printing chain and is used by printers.
You can select which content you will remove depending on your needs.
These options are independent of each other.
The GdPicturePDFReducer class makes this selection easy and straightforward.
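To make the independence of these choices concrete, here is a minimal sketch in Python. It does not use or reproduce the GdPicturePDFReducer API. It models a document as nested dictionaries keyed by the standard PDF names (`/Outlines`, `/Metadata`, `/AcroForm`, `/OutputIntents`, `/Annots`, `/Thumb`), and the `RemovalOptions` class and `strip_content` function are hypothetical names invented for this illustration.

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass
class RemovalOptions:
    """Hypothetical switches; each one can be set independently."""
    attachments: bool = False
    bookmarks: bool = False
    hyperlinks: bool = False
    annotations: bool = False
    form_fields: bool = False
    javascript: bool = False
    thumbnails: bool = False
    metadata: bool = False
    color_profiles: bool = False


def _should_drop(annot: dict, opts: RemovalOptions) -> bool:
    """Hyperlinks and form widgets are annotation subtypes in PDF."""
    subtype = annot.get("/Subtype")
    if subtype == "/Link":
        return opts.hyperlinks
    if subtype == "/Widget":
        return opts.form_fields
    return opts.annotations


def strip_content(doc: dict, opts: RemovalOptions) -> dict:
    """Return a copy of the toy document without the selected content."""
    out = deepcopy(doc)
    catalog = out["catalog"]
    names = catalog.get("/Names", {})

    # Document-level entries.
    if opts.attachments:
        names.pop("/EmbeddedFiles", None)
    if opts.javascript:
        names.pop("/JavaScript", None)
    if opts.bookmarks:
        catalog.pop("/Outlines", None)
    if opts.form_fields:
        catalog.pop("/AcroForm", None)
    if opts.metadata:
        catalog.pop("/Metadata", None)
    if opts.color_profiles:
        catalog.pop("/OutputIntents", None)

    # Page-level entries.
    for page in out["pages"]:
        if opts.thumbnails:
            page.pop("/Thumb", None)
        kept = [a for a in page.get("/Annots", []) if not _should_drop(a, opts)]
        if kept:
            page["/Annots"] = kept
        else:
            page.pop("/Annots", None)
    return out


# Toy document: values are placeholders, not real PDF objects.
doc = {
    "catalog": {
        "/Outlines": "outline tree",
        "/Metadata": "XMP packet",
        "/Names": {"/EmbeddedFiles": ["report.xlsx"]},
    },
    "pages": [
        {
            "/Thumb": "thumbnail image",
            "/Annots": [{"/Subtype": "/Link"}, {"/Subtype": "/Text"}],
        },
    ],
}

# Remove bookmarks, thumbnails, and hyperlinks; keep everything else.
slim = strip_content(
    doc, RemovalOptions(bookmarks=True, thumbnails=True, hyperlinks=True)
)
print(sorted(slim["catalog"]))
print(slim["pages"][0])
```

Because every switch defaults to off, metadata and attachments survive in this run, which matches the advice above to treat metadata with more care.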
Deleting unused objects and unused content
Version 1.4 of the PDF specification introduces support for incremental updates.
It is a method for saving changes to a PDF document without completely rewriting it. The content of the document is updated gradually, without the need to regenerate existing data. The changes are appended to the end of the file, leaving the original content unaltered.
This type of save is widely used in collaborative management systems to allow commenting, annotating, and modifying data very quickly.
The goal is to preserve earlier versions of the document within the same file.
The downside of this approach is that the file keeps growing, even when data is deleted.
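The growth described above can be illustrated with a small Python sketch. It uses placeholder bytes rather than a valid PDF, and the helper names `append_revision` and `count_revisions` are invented for this example. The only real PDF detail it relies on is that every save, full or incremental, ends with a `%%EOF` marker.

```python
def append_revision(pdf_bytes: bytes, changes: bytes) -> bytes:
    """Model an incremental save: old bytes stay, new bytes go at the end."""
    return pdf_bytes + changes + b"%%EOF\n"


def count_revisions(pdf_bytes: bytes) -> int:
    """Estimate the number of saves by counting end-of-file markers.

    Heuristic only: the marker bytes could also occur inside embedded data.
    """
    return pdf_bytes.count(b"%%EOF")


# Placeholder content, not a valid PDF: a two-page document saved once.
original = b"%PDF-1.4\n(page 1 content)\n(page 2 content)\n%%EOF\n"

# "Deleting" page 2 appends a new page tree that no longer lists it.
# The bytes of page 2 are still present in the file.
after_delete = append_revision(original, b"(new page tree: page 1 only)\n")

print(len(original), len(after_delete))          # the file grew
print(count_revisions(after_delete))             # number of save passes
print(b"(page 2 content)" in after_delete)       # deleted data is still there
```

The deleted page is unreachable for a viewer, but its data still contributes to the file size until the document is rewritten from scratch.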
For example, if a user deletes a page from an existing document, parts of that page's content can remain in the newly produced file. It is also not uncommon to end up with files containing unused objects even when no incremental save has been performed. This happens quite often when a source document is split: shared document resources are carried over into the single-page files, even when a given page does not use them.
In such a context, it can be very effective to regenerate the resulting document, as long as:
- Preservation of previous versions is not required.
- The file does not contain an electronic certificate intended to validate the integrity of the document, since rewriting the file would invalidate it.
For such purposes, the GdPicturePDF::RemoveUnusedResources() method helps optimize the document without loss. It eliminates redundant resources and removes all objects that are no longer used in the generated document.
The file size reduction in such cases can be significant, while the document looks exactly the same to its users.
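The general idea of dropping unused objects can be sketched as a reachability walk over the object graph. The Python snippet below is a generic mark-and-sweep illustration, not a description of how GdPicturePDF::RemoveUnusedResources() works internally. The object numbers, the graph layout, and the function names are all hypothetical.

```python
from typing import Dict, List, Set


def reachable_objects(objects: Dict[int, List[int]], root: int) -> Set[int]:
    """Collect every object number reachable from the root by references."""
    seen: Set[int] = set()
    stack = [root]
    while stack:
        obj_id = stack.pop()
        if obj_id in seen or obj_id not in objects:
            continue
        seen.add(obj_id)
        stack.extend(objects[obj_id])
    return seen


def remove_unused(objects: Dict[int, List[int]], root: int) -> Dict[int, List[int]]:
    """Keep only the objects that the document still refers to."""
    live = reachable_objects(objects, root)
    return {obj_id: refs for obj_id, refs in objects.items() if obj_id in live}


# Toy object graph: object number -> object numbers it references.
objects = {
    1: [2],     # catalog -> page tree
    2: [3],     # page tree -> page 1 only (page 2 was deleted)
    3: [5],     # page 1 -> font
    4: [5, 6],  # deleted page 2 -> font and image, no longer listed anywhere
    5: [],      # font, still used by page 1
    6: [],      # image, used only by the deleted page
}

cleaned = remove_unused(objects, root=1)
print(sorted(cleaned))  # → [1, 2, 3, 5]
```

Objects 4 and 6 disappear because nothing reachable from the catalog points to them, while the shared font stays because page 1 still uses it. The visible content is unchanged, which is why this kind of cleanup is lossless.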
In the next article, we will talk about another lossless method: compression of streams and fonts.
Loïc & Elodie