Why we ditched PDF notarization

Published in S1SEVEN, Jan 9, 2023

Introduction

In June 2022, we at S1SEVEN decided to ditch a feature of our product we were proud of and had previously promoted in conversations with customers, prospects, and partners:

As an issuer of material certificates, I want to notarize the PDF rendering of digital material certificates on the Blockchain so that all PDF recipients can easily verify the integrity of its content.

Let’s set aside what “notarization of a PDF” means for now; it will be explained along the way. Dropping the feature wasn’t an easy decision: it resulted from two years of thinking, discussions, experiments, and failures. In this post, we want to share the essential insights that helped us make it.

To start with, let’s look back to the early days of PDF.

Idea of PDF

John Warnock, the co-founder of Adobe, was working on a project named “Camelot” in the early 1990s:

This project’s goal is to solve a fundamental problem that confronts today’s companies. The problem concerns our ability to communicate visual material between different computer applications and systems. The issue is that most programs print to a wide range of printers, but there is no universal way to communicate and view this printed information electronically.

He solved the problem successfully by developing the Portable Document Format, or PDF for short. It became an open standard in 2008 when the International Organization for Standardization published it as ISO 32000-1:2008. PDF 2.0 followed as ISO 32000-2:2017, which was revised as ISO 32000-2:2020.

As you are undoubtedly aware, PDF has become a massive success — Adobe counts more than 250 billion PDF documents opened in their Acrobat Reader application annually!

Application of PDF

S1SEVEN’s mission is to digitize material quality certificates, also called mill inspection reports, certificates of analysis, or similar. They are part of the delivery of materials such as steel, aluminum, copper, or industrial plastics, and they are regulated in Europe by the standard EN 10204:2004.

When the predecessor of EN 10204, DIN 50049, was published in 1951, nobody could have imagined that sometime in the future, paper would be replaced by some form of “electronic document.” In those days, storing large quantities of electronic data was expensive and unreliable, and stored data was hard to access. For decades, material certificates were created with typewriters, stamped and signed, and sent to customers by mail.

EN 10204 was published originally in 1995, obviously still relying on paper and mail. Computers were still relatively rare, and the exchange of data via networks was in its infancy. But computers and word processors started to replace typewriters, so material certificates were created on computers, printed on paper, and sent by mail. The 2004 revision of the standard explicitly allows certificates to be transmitted electronically without sending a copy by mail. With PDF and email available, the obvious step was to create PDF documents and send them by email instead of printing and mailing them:

The PDF document looks exactly like the printout.
It is not easy to tamper with.
Everybody can open it using free applications.

Using PDF was the most natural migration path from mail to electronic transmission: cheap and almost frictionless.

Limitations of PDF

Computers have become much more powerful in the last two decades, and society has started to understand the value of data. British scientist Clive Humby stated in 2006, “Data is the new oil.” The idea became famous a decade later when The Economist ran the headline “The world’s most valuable resource is no longer oil, but data.”

The new problem: certificates contain a large quantity of data, but computers can’t access it directly; it has to be extracted. The tedious and ultimately unfeasible approach is to extract the data manually; the smarter one is to use computers. In the last two years, we have learned of many initiatives to extract data from material quality certificates with the help of OCR, ML, and AI. The general insights are always the same:

Each supplier creates a document with a slightly different structure.
Each supplier may reference the same measurements slightly differently, which must be accounted for.
Manual reconciliation is still required due to recognition errors.
Even if standards for the structure and content exist, such as EN 10168 for steel certificates, everybody applies these in different ways.

Even after a substantial amount of money has been spent, the results are usually meager.

Let us summarize the situation as follows: PDFs are very efficient at representing data visually in a way that humans can process in seconds, but they are hard for machines to process. To make material data easily and affordably accessible to everybody, new technology has to be applied.

JavaScript Object Notation — JSON

In the short history of computing machines, many technologies to exchange data between computers have been developed. Today’s standard is JSON:

JSON is a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate.

Its development started in 1999, in the early days of browsers, and it evolved over the years until it became an ISO standard in 2017. We chose JSON to define certificate data structures because it is simple and supported by many programming languages and tools.

The simplicity of JSON brings a challenge of its own. Technically, JSON is a convention for writing structured data in a text file. This can be done with a text editor or even Microsoft Word, so anybody can manipulate a JSON file.

This is great from a usability point of view. Still, it has one major drawback: looking at the file alone, it is impossible to determine whether it has been altered, accidentally or intentionally. The question to be answered was how to make JSON documents immutable or, rephrased, how to detect changes. This problem has existed in one form or another since the early days of computing. The paper How To Time-Stamp a Digital Document gives a neat introduction to the ideas and challenges involved. Nowadays, all components are available thanks to modern cryptography:

Cryptographic hash algorithms calculate a collision-resistant checksum: identical inputs always result in the same hash value, and it is practically infeasible to find two different inputs with the same hash value. Any modification can be detected by simply calculating the hash and comparing it with previous results.
Blockchain, a recent development utilizing cryptographic tools, serves as an immutable medium to store the hash calculated by the issuer of the material certificate.
Any receiver of the JSON material certificate along the value chain can detect any modification simply by calculating the hash and querying for the hash on the Blockchain. If it doesn’t exist, the file was modified at some point (or was never notarized in the first place).
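The three components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: SHA-256 is assumed as the hash algorithm, a plain in-memory set stands in for the Blockchain, and the certificate fields are made up.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the Blockchain: a set of notarized hashes.
notarized_hashes = set()


def notarize(document: bytes) -> str:
    """Issuer side: calculate the hash and record it."""
    digest = sha256_hex(document)
    notarized_hashes.add(digest)
    return digest


def verify(document: bytes) -> bool:
    """Receiver side: recalculate the hash and look it up."""
    return sha256_hex(document) in notarized_hashes


# Made-up certificate content, for illustration only.
certificate = json.dumps(
    {"material": "S355", "yield_strength_mpa": 355}
).encode("utf-8")
notarize(certificate)

# Changing a single value yields a hash that was never recorded.
tampered = certificate.replace(b"355}", b"455}")

print(verify(certificate))  # True
print(verify(tampered))     # False
```

Note that the hash is calculated over the exact bytes of the file, so even a change in whitespace produces a different hash.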

Migration from PDF to JSON

After finding a way to make changes to electronic data detectable, we started to think about a migration path from PDFs to JSON documents for issuers and receivers of material quality certificates. It should be as natural as the switch from paper printouts to PDFs was 20 years earlier, so that the industry can quickly realize the gains of machine-readable data.

The solution is simple:

Create a PDF certificate document from the JSON certificate file.
Continue to send the PDF certificate document and the JSON certificate by email.
Open source the library that renders JSON documents as PDFs so that everybody can audit the code and verify the correctness of the renderings.

The benefits are on both sides:

The issuer can migrate to a new process by sending a nice-looking PDF with the same information as in the past. The JSON is just a second attachment to the email.
The receiver continues to get a PDF certificate document that fulfills all current regulatory requirements.
The receiver can use or ignore the JSON certificate file.

At this point, let us emphasize that companies are keen to get the JSON; the PDFs are only documentation going into the long-term archive to comply with current regulations. More important for us, we have found a solution that allows the industry to migrate from PDF to JSON with as little cost and friction as the move from paper to PDF 20 years earlier.

Improving PDF

Having solved the migration challenge, we came up with the idea of calculating a hash of the PDF document and storing it on the Blockchain in the same way as the hash of the JSON document. We assumed that this would improve the “quality” of the PDF document:

A PDF document would become effectively “immutable” as modifications could be detected by simply calculating the hash of the PDF document at hand and looking it up on the Blockchain. If the calculated hash is not found, the file contains a modification that would be challenging to detect otherwise.

In addition, PDFs would be secured against attacks described in recent reports such as Processing Dangerous Paths – On Security and Privacy of the Portable Document Format. The implementation is straightforward: calculate the PDF hash and store it on a blockchain. Storing the hash of the PDF together with the hash of the JSON even creates a perfect link between the two. The owner of both the JSON and PDF documents can verify the integrity of both by calculating the hashes again, looking them up on the Blockchain, and verifying the link between the documents.
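A minimal sketch of this linked notarization could look as follows. It assumes SHA-256 and uses a dictionary as a hypothetical stand-in for the Blockchain; the function names and the placeholder documents are illustrative and not taken from our tools.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical ledger: maps a JSON hash to the PDF hash stored with it.
ledger = {}


def notarize_pair(json_bytes: bytes, pdf_bytes: bytes) -> None:
    """Issuer side: store both hashes together so they are linked."""
    ledger[sha256_hex(json_bytes)] = sha256_hex(pdf_bytes)


def verify_pair(json_bytes: bytes, pdf_bytes: bytes):
    """Receiver side: return (json_ok, pdf_ok).

    pdf_ok is True only if the PDF hash is the one stored with this JSON.
    """
    json_hash = sha256_hex(json_bytes)
    json_ok = json_hash in ledger
    pdf_ok = json_ok and ledger[json_hash] == sha256_hex(pdf_bytes)
    return json_ok, pdf_ok


# Placeholder bytes standing in for real documents.
json_doc = b'{"material": "S355"}'
pdf_doc = b"bytes of the PDF created by the issuer"
notarize_pair(json_doc, pdf_doc)

print(verify_pair(json_doc, pdf_doc))  # (True, True)

# A PDF with different bytes fails even though the JSON still verifies.
regenerated_pdf = b"bytes of a PDF created again later"
print(verify_pair(json_doc, regenerated_pdf))  # (True, False)
```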

We thought this was a great feature to implement, adding value for the users. However, at some point, we realized that we had overlooked something.

New challenges

We open-sourced the tools to render a PDF document so that anybody can audit the code, and developers can use the code to build new solutions on top of our developments. As a side effect, we enable everybody to create a PDF from a JSON document anytime they want. However, the hash of such a PDF document will differ from the notarized one because the PDF file contains additional data: the most obvious example is the creation time of the PDF.

Suppose we try to verify the JSON and the new PDF document. The result will be disappointing: the verification of the JSON will succeed, and the verification of the PDF will fail for the reason given above. This doesn’t make sense from a user’s perspective: the PDF was generated from the JSON! The frustrated user will compare the contents of the JSON document with what they find in the PDF, and the data matches! What’s going on?
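The effect is easy to reproduce in a simplified model. The renderer below is hypothetical and does not produce a real PDF; it only mimics the fact that the output contains the certificate content plus a creation timestamp (written here in the style of a PDF /CreationDate entry). SHA-256 is assumed as the hash algorithm.

```python
import hashlib
from datetime import datetime, timezone


def fake_render_pdf(certificate_json: str, created_at: datetime) -> bytes:
    """Hypothetical renderer: NOT a real PDF, only content plus a timestamp."""
    creation_date = created_at.strftime("D:%Y%m%d%H%M%SZ")
    return f"{certificate_json}\n/CreationDate ({creation_date})".encode("utf-8")


certificate_json = '{"material": "S355"}'

# The same JSON rendered on two different days.
first = fake_render_pdf(
    certificate_json, datetime(2022, 6, 1, 9, 0, tzinfo=timezone.utc)
)
second = fake_render_pdf(
    certificate_json, datetime(2022, 6, 2, 9, 0, tzinfo=timezone.utc)
)

pdf_hashes_match = (
    hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest()
)
print(pdf_hashes_match)  # False
```

The visible content is identical in both renderings, yet the files differ in a few bytes, which is enough to produce a completely different hash.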

The dear reader might consider this a hypothetical problem, but everybody knows this will happen. It might happen only rarely, which is even worse because humans, users and support staff alike, tend to forget the reasons for technical issues. Every time it happens, a new search for the cause will start, frustrating everyone involved.

The development team of S1SEVEN has the clear objective of reducing support effort as much as possible by delivering well-designed and tested solutions and making users happy. So this was another challenge to tackle.

We came up with what we thought would be an uncomplicated process:

Extract an image of each page of the PDF document (PNG, as it is lossless).
Calculate a hash of each image.
Build a Merkle tree from the hashes.
Store the Merkle tree root together with the hash of the JSON on the Blockchain.
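The hashing part of this process can be sketched as follows. The sketch assumes SHA-256 and one common Merkle tree convention (the last node is duplicated when a level has an odd number of nodes); both are assumptions for illustration, and the page images are placeholder bytes rather than real PNG files.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes):
    """Calculate the Merkle root of a list of leaf hashes.

    Assumed convention: if a level has an odd number of nodes,
    the last node is paired with itself.
    """
    if not leaf_hashes:
        raise ValueError("at least one leaf hash is required")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)
        ]
    return level[0]


# Placeholder bytes standing in for the PNG image of each page.
page_images = [b"png-bytes-page-1", b"png-bytes-page-2", b"png-bytes-page-3"]

root = merkle_root([sha256(image) for image in page_images])
print(root.hex())
```

A change in any single page image changes its leaf hash and therefore the root, so storing only the root is enough to detect a modification on any page.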

The process is built on the assumption that the images extracted from the PDF document will always be the same. Sounds straightforward. Unfortunately, it is not.

Technical problems

It starts with the selection of a library that offers the functionality to extract images from PDFs:

Which ones are available?
Which functionality do they offer?
Which ones are maintained currently?
Which ones are likely to be supported for a long time?

We tried pdf2image, which depends on well-known open-source libraries (GraphicsMagick and Ghostscript). It worked until we noticed that the renderings differed depending on the OS. Finally, we chose pdf.js from Mozilla, which does not directly depend on OS libraries. This library seemed the most promising one based on the criteria listed above, and it worked very well.

But over time, it became clear that the extracted images can differ with each new version of the library; it is simply not the objective of the Mozilla developers to keep the results consistent. The only option would be to fork a library or develop our own to deliver the same images consistently! This would be a difficult task that we are not even sure we could tackle. It might even be impossible, because we can’t control the results of PDF rendering libraries either: everybody can fork our tools using different versions of PDF renderers or build their own.

Conclusion

We could not find a solution for PDF notarization that met our quality and cost requirements. Still, we use one of the outcomes in the automated tests of our PDF generation: image extraction and hashing detect breaking changes. If you can benefit from that functionality, we would like to draw your attention to https://github.com/s1seven/schema-tools/blob/main/packages/generate-pdf/test/generate-pdf.spec.ts. Steal like a great artist and give it a star. We would appreciate that!

Finally, we realized that no one was trying to make paper better than it was when the future materialized. We should not try to improve PDF either, especially as we are building the future of material certificates and their data exchange. We believe digital material certificates in JSON format will become the standard the same way PDF became the standard in 2004, because they open up so many new possibilities that everybody wants to benefit from.

As the author of this article, I want to express my deepest gratitude to Edouard Maleix for trying very hard to tackle the challenges, the S1SEVEN team for comprehensive cooperation in this matter, and our customers for accepting the status quo on the merits of the details shared here.

References

https://materialidentity.org
https://github.com/s1seven
https://npmjs.com
https://s1seven.com