How Doctrine leverages Puppeteer to export content to PDF

Published in Inside Doctrine

5 min read · Sep 26, 2023

Doctrine’s mission is to make the law more accessible and understandable. Although we do our best to provide accurate information at the right time, lawyers and legal professionals sometimes prefer to read a paper version. Doctrine has offered the possibility of exporting court decisions as PDFs since the company was founded. In this article, we look at the different solutions we used over the years before Puppeteer’s 1.0 release in 2018, and at how we use Puppeteer today.

Disclaimer: There are several solutions on the market for generating PDFs, and this article mentions some of them, but does not make a comparison to help you choose. We only present our experience in our particular context.

The beginning with a single export

To let users consult content as a PDF, we wanted the exported version to be as close as possible to the print version, without having to code a second version.

As browsers evolved, we started looking at what could be done without the intervention of a server. We quickly came across the open source solution html2pdf.js, which uses html2canvas and jsPDF under the hood.

We managed to generate a decision page quite quickly, as the mechanism is very simple:

target a container element
rewrite all of its nodes, with their styles, into a canvas injected into the page
export the canvas as images
create a file from these images, using the browser’s Blob and URL.createObjectURL features

.from() -> .toContainer() -> .toCanvas() -> .toImg() -> .toPdf() -> .save()
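To make the chain above more concrete, here is a minimal sketch of how html2pdf.js can be asked to save an element. It is an illustration rather than our production code: the `html2pdf` factory is passed in as a parameter (in a page, it would be the global exposed by the library), and the helper names `buildExportOptions` and `exportElementAsPdf`, as well as the option values, are assumptions made for the example.

```javascript
// Options understood by html2pdf.js; the values are illustrative assumptions.
function buildExportOptions(filename) {
  return {
    margin: 10, // expressed in the jsPDF unit below
    filename,
    image: { type: "jpeg", quality: 0.95 },
    html2canvas: { scale: 2 }, // render at 2x for sharper page images
    jsPDF: { unit: "mm", format: "a4", orientation: "portrait" },
  };
}

// `html2pdf` is the factory exposed by html2pdf.js.
// `element` is the container to export (hypothetically, the decision's root node).
function exportElementAsPdf(html2pdf, element, filename) {
  // .save() runs the intermediate steps (container, canvas, image, PDF) itself.
  return html2pdf().set(buildExportOptions(filename)).from(element).save();
}
```

In a page, this could be called as `exportElementAsPdf(window.html2pdf, document.querySelector("#decision"), "decision.pdf")`, where `#decision` is a hypothetical selector.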

We now have a PDF file that we can download. Unfortunately, as described above, it contains only images of the content, so users can’t select or highlight text, something lawyers value highly.

We therefore looked for alternative solutions.

Headless browsers to the rescue

We were looking for a solution that would allow us to keep our existing HTML code without having to create a dedicated export page.

We were at the very beginning of the company’s history, and we were still looking to do things as simply as possible in order to move quickly.

We started by looking at the reference headless browser of the time, PhantomJS. This scriptable headless browser, based on the Qt WebKit engine, allowed us to directly expose an endpoint that took a URL and returned a PDF.

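const phantom = require("phantom"); // assuming the "phantom" npm wrapper

async function exportWithPhantom(url, outputFile) {
  const instance = await phantom.create();
  const page = await instance.createPage();
  // Configure the paper before loading the page.
  await page.property("paperSize", {
    format: "A4",
    orientation: "portrait",
    margin: "1.5cm",
  });

  await page.open(url);
  await page.render(outputFile); // a .pdf extension selects PDF output
  await instance.exit(); // stop the PhantomJS process
}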

Unfortunately, we wanted to display content in two columns to condense the amount of information displayed, and PhantomJS did not allow this.

That’s why we looked at wkhtmltopdf, a similar tool based on the same engine, where a small CSS hack made the two-column layout possible.

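const fs = require("fs");
const wkhtmltopdf = require("wkhtmltopdf"); // assuming the "wkhtmltopdf" npm wrapper

wkhtmltopdf(url, {
  pageSize: "A4",
  enableJavascript: true,
  javascriptDelay: 500, // wait 500 ms for scripts to finish before rendering
  printMediaType: true, // apply the print stylesheet
}).pipe(fs.createWriteStream(outputFile));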

We kept this solution in production for 2 years, until a newcomer arrived on the market. Its arrival intrigued us, because it prompted the maintainer of PhantomJS to step down from the project.

The era of Chrome headless

In 2017, Google shipped a headless mode in Chrome, making it possible to run the browser in automated testing and server environments where you don’t need a visible UI shell.
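As a rough illustration of what headless mode makes possible, Chrome can print a page to PDF from the command line with its `--headless` and `--print-to-pdf` flags. The sketch below wraps that call in Node.js. The binary name `google-chrome` is an assumption that depends on the platform, and the helper names are ours.

```javascript
const { execFile } = require("node:child_process");

// Build the arguments for a one-off "print to PDF" run of headless Chrome.
function headlessPrintArgs(url, outputFile) {
  return ["--headless", `--print-to-pdf=${outputFile}`, url];
}

// `chromeBinary` is an assumption: the executable name varies by platform.
function printToPdf(url, outputFile, chromeBinary = "google-chrome") {
  return new Promise((resolve, reject) => {
    execFile(chromeBinary, headlessPrintArgs(url, outputFile), (error) =>
      error ? reject(error) : resolve(outputFile)
    );
  });
}
```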

Puppeteer, a Node library that provides a high-level API to automate Chrome, followed, with version 1.0 arriving in early 2018.

The combination of these tools immediately aroused our curiosity, especially as Node.js is one of the key technologies at Doctrine. We decided to stop using solutions based on the Qt WebKit engine.

Even without the UI part, headless Chrome is a rather bulky and resource-hungry piece of software, so we tried to isolate it from the rest of our web app.
From past experience, we also knew that the content export function, while important, was not used all that often.
We therefore decided to host Puppeteer in a serverless setup, on an AWS Lambda function, in order to benefit from pay-per-use pricing.

Now let’s break down the steps involved in generating the file from a click in the interface.

The first step is to call our server with the resource to be generated. At this stage we check whether the resource is part of the available content, as not all our pages have a print version. As generation has a cost, we also check that the request respects the user’s quotas.
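These two checks can be sketched as a small guard that runs before any generation is paid for. This is a simplified illustration: the list of exportable page types, the field names and the reason codes are all hypothetical.

```javascript
// Hypothetical list of page types that have a print version.
const EXPORTABLE_TYPES = new Set(["decision", "legislation"]);

// Decide whether an export request may proceed.
// `resourceType`, `exportsUsed` and `quota` are illustrative names.
function checkExportRequest({ resourceType, exportsUsed, quota }) {
  if (!EXPORTABLE_TYPES.has(resourceType)) {
    return { allowed: false, reason: "no-print-version" };
  }
  if (exportsUsed >= quota) {
    return { allowed: false, reason: "quota-exceeded" };
  }
  return { allowed: true };
}
```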

The server then invokes the Lambda function with the URL to visit.
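With the AWS SDK for JavaScript (v3), such an invocation could look like the sketch below. We assume here an asynchronous invocation (`InvocationType: "Event"`), since the function reports back once it is done. The function name, the `jobId` identifier and the payload shape are hypothetical.

```javascript
// Build the parameters for an asynchronous Lambda invocation.
// `functionName` and `jobId` are hypothetical; the payload shape is an assumption.
function buildInvokeParams(functionName, jobId, url) {
  return {
    FunctionName: functionName,
    InvocationType: "Event", // fire and forget: the server does not wait for the PDF
    Payload: Buffer.from(JSON.stringify({ jobId, url })),
  };
}

async function invokeExport(functionName, jobId, url) {
  // Loaded lazily so the helper above stays usable without the AWS SDK installed.
  const { LambdaClient, InvokeCommand } = require("@aws-sdk/client-lambda");
  const client = new LambdaClient({});
  await client.send(new InvokeCommand(buildInvokeParams(functionName, jobId, url)));
}
```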
The Lambda function starts Chrome in headless mode, visits the URL and exports it as a PDF. Under the hood, this is the same feature as Chrome’s “Print to PDF”, which means that the export and the browser’s print output render in the same way.
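A minimal sketch of this step with Puppeteer is shown below. The PDF options and the `renderPdf` helper are assumptions for the example, and packaging Chrome for AWS Lambda is deliberately left out.

```javascript
// PDF settings passed to Puppeteer's page.pdf(); the values are assumptions.
const PDF_OPTIONS = {
  format: "A4",
  printBackground: true,
  margin: { top: "1.5cm", right: "1.5cm", bottom: "1.5cm", left: "1.5cm" },
};

// `puppeteer` can be injected; by default the npm package is loaded lazily.
async function renderPdf(url, puppeteer = require("puppeteer")) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until the network is quiet so that the content has loaded.
    await page.goto(url, { waitUntil: "networkidle0" });
    // page.pdf() renders with the print CSS media type, like the print dialog.
    return await page.pdf(PDF_OPTIONS);
  } finally {
    await browser.close(); // always release the Chrome process
  }
}
```

Closing the browser in a `finally` block matters in a serverless context, where a leaked Chrome process would keep consuming memory in a reused execution environment.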

At the end of execution, the function sends the content to an S3 bucket and notifies the server that it has completed its task.
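The end of the function could be sketched as follows with the AWS SDK v3. How the server is notified is not detailed here, so the HTTP callback is purely an assumption for illustration, as are the bucket name, the key naming scheme and the `callbackUrl` parameter.

```javascript
// Object key under which a generated PDF is stored (hypothetical naming scheme).
function pdfObjectKey(jobId) {
  return `exports/${jobId}.pdf`;
}

// `bucket` and `callbackUrl` are assumptions made for this example.
async function storeAndNotify({ bucket, jobId, pdf, callbackUrl }) {
  const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
  const key = pdfObjectKey(jobId);
  await new S3Client({}).send(
    new PutObjectCommand({
      Bucket: bucket,
      Key: key,
      Body: pdf,
      ContentType: "application/pdf",
    })
  );
  // Global fetch is available in Node.js 18 and later.
  await fetch(callbackUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ jobId, key }),
  });
  return key;
}
```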

Some content is large, so we use a polling system to display progress in the interface. Once we have received the function’s notification, we’re able to transmit the file to the browser for downloading.
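On the client side, such a polling loop can be sketched like this. `fetchStatus` is a hypothetical async function that asks the server for the job state and returns an object of the form `{ status, progress }`. The status values, interval and attempt limit are assumptions.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll `fetchStatus` until the job is done, reporting progress on the way.
async function pollUntilDone(
  fetchStatus,
  { intervalMs = 1000, maxAttempts = 60, onProgress = () => {} } = {}
) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const job = await fetchStatus();
    if (job.status === "done") return job;
    if (job.status === "failed") throw new Error("PDF generation failed");
    onProgress(job.progress); // e.g. update a progress bar
    await sleep(intervalMs);
  }
  throw new Error("PDF generation timed out");
}
```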

You may be skeptical about the performance of such a complex system. It sure takes a long time to open a browser, and even on a powerful laptop, exporting a PDF of a hundred pages can take a few minutes.

If we look at the stats for the last 12 months, we see that the wait time at P95 is around 5s, meaning that 95% of exports complete within that time. That’s a long time for a transaction, but with the feedback loop enabled by polling, it’s tolerable.
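For readers less familiar with percentiles, here is a small sketch of how a P95 can be computed from a list of measured durations, using the nearest-rank method. It is only an illustration of the metric, not the way our monitoring computes it.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}
```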

If you want to go further, you can test the code or see the output generated by these 4 versions in this repo.

Conclusion

We’ve been using this system for 5 years now and we’re very satisfied with what it enables us to do:

code a single version of each page and have a consistent experience between print and export
export a list of resources as a zip archive, with the PDFs generated in parallel
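The parallel generation mentioned in the last point can be sketched as follows. `generatePdf` is a hypothetical async function that returns the PDF for a given URL, for example by invoking one Lambda function per resource. Building the zip archive itself is left out of this example.

```javascript
// Generate several PDFs concurrently and separate successes from failures.
async function generateAll(urls, generatePdf) {
  const results = await Promise.allSettled(urls.map((url) => generatePdf(url)));
  const files = [];
  const failures = [];
  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      files.push({ url: urls[index], pdf: result.value });
    } else {
      failures.push({ url: urls[index], error: result.reason });
    }
  });
  return { files, failures };
}
```

Using `Promise.allSettled` rather than `Promise.all` means that one failed document does not discard the ones that were generated successfully.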

We may have to revisit this implementation in the future to improve performance or reduce costs, but we’re confident that this system will enable us to do so smoothly and transparently for our users.

If you are interested in helping us with our future challenges, please feel free to apply for our job offers.


Written by Samuel Martineau