HTML to PDF Conversion with Headless Chrome using Go

Andrew Zamler-Carhart
Published in Compass True North
May 1, 2018

Thanks to Matthew Molnar for developing this service.

The Compass platform has several products which generate content that we want to export from HTML to PDF. For example, our Marketing Center product lets users design print-ready marketing materials such as flyers, brochures, postcards and notecards. They are designed in HTML / JavaScript / CSS, and need to be converted to a PDF that we can send to our print fulfillment vendor.

Motivation

Previously, we were using a SaaS product called DocRaptor for this purpose. We decided to build our own internal service for several reasons:

  • Rendering Fidelity: DocRaptor uses the PrinceXML library, which renders HTML somewhat differently than a browser and doesn’t support the latest HTML + CSS standards. This was by far the main reason — we wanted to render the PDF as close as possible to what the user was seeing.
  • Performance: by keeping requests inside our own cloud infrastructure, we could avoid sending several megabytes of data across the open internet.
  • Cost: we could avoid paying DocRaptor’s monthly fee.
  • Security: keeping requests inside our own cloud means we avoid sending our data to third parties and avoid exposing it in flight.

Headless Chrome Server Architecture

At a high level, we created an Export Service to abstract HTML to PDF conversion. It is written in Go and controls Chrome in headless mode. Chrome 59 and above has a headless mode that lets you run the browser programmatically, including a feature that prints to PDF. It renders pages like Chrome because it is Chrome, just without the user interface.

We use a Go library that wraps the Chrome DevTools Protocol for controlling the browser, and a tool called pm2 for making sure that Chrome stays running on the server. Our service has an endpoint that takes the URL of a web page, tells Chrome to navigate to that page, waits for the page to load (this was the hardest part!), prints the page to PDF, and returns the PDF in binary form.

Controlling Headless Chrome using Go

You can launch Headless Chrome from the command line, but how can you communicate with it while it is running? Enter the Chrome DevTools Protocol (CDP). It allows external programs to communicate with a running instance of Chrome to inspect, profile and control it. Headless Chrome supports tabs just like regular Chrome; CDP offers complete control over opening and closing tabs.

Even though the official library for controlling Headless Chrome is Puppeteer, written for Node.js, Go is one of our first-class server languages, so we decided to implement the backend service in Go. We needed a library that speaks CDP, and chose github.com/mafredri/cdp, which provides a convenient Go wrapper for all CDP functions, such as navigating to a page and printing to PDF.

Our service will communicate with Chrome on the remote debugging port. It’s simply a matter of starting the Chrome executable with the --headless flag, and specifying the remote debugging port with --remote-debugging-port=9222.
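As a quick way to check that Chrome is listening on the debugging port, the sketch below queries the DevTools HTTP endpoint /json/version and reads the WebSocket URL that CDP clients connect to. It uses only the Go standard library; the helper names parseVersion and fetchVersion are made up for this example, and the port is assumed to be 9222 as above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// versionInfo holds the fields of Chrome's /json/version response used here.
type versionInfo struct {
	Browser              string `json:"Browser"`
	ProtocolVersion      string `json:"Protocol-Version"`
	WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

// parseVersion decodes the JSON body returned by /json/version.
func parseVersion(body []byte) (versionInfo, error) {
	var v versionInfo
	err := json.Unmarshal(body, &v)
	return v, err
}

// fetchVersion asks a running Chrome for its version over the debugging port.
func fetchVersion(baseURL string) (versionInfo, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/json/version")
	if err != nil {
		return versionInfo{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return versionInfo{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return versionInfo{}, err
	}
	return parseVersion(body)
}

func main() {
	// Assumes Chrome was started with --remote-debugging-port=9222.
	v, err := fetchVersion("http://localhost:9222")
	if err != nil {
		fmt.Println("Chrome is not reachable:", err)
		return
	}
	fmt.Println("browser:", v.Browser)
	fmt.Println("websocket:", v.WebSocketDebuggerURL)
}
```

A check like this can double as a health probe before the service accepts export requests.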

Export Service Workflow

The complete workflow for creating a PDF, including how the Export service and Chrome interact with other components, runs as follows:

  1. The client makes a request to the frontend that it would like a PDF of the current page.
  2. The frontend makes a request to the Export service.
  3. The Export service tells Chrome to load the page from the frontend and create a PDF. That’s the focus of the rest of this article!
  4. Chrome returns the PDF data to the Export service.
  5. The Export service saves the PDF data to S3, which returns the URL where the file has been saved.
  6. The Export service returns the PDF URL to the frontend.
  7. The client downloads the PDF at the given URL from S3.

Code Walkthrough

The process for creating a PDF works like this:

  1. Connect to Chrome
  2. Open a new tab
  3. Connect to the tab
  4. Defer closing the tab
  5. Load the page
  6. Wait for the response to complete
  7. Print the PDF

Here is a simplified version of the function for creating a PDF. For brevity, errors are not handled in these examples (but you should handle them in your code).
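The seven steps above can be outlined roughly as follows. This is an illustrative sketch, not the service's actual code: Browser and Tab are hypothetical interfaces standing in for the calls a real service would make through a CDP library such as github.com/mafredri/cdp, and the fake implementations exist only so the example runs without Chrome. Unlike the shortened listings in this article, the sketch does propagate errors.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Tab and Browser are hypothetical abstractions over a CDP client.
type Tab interface {
	Navigate(ctx context.Context, url string) error
	WaitForLoad(ctx context.Context) error
	PrintToPDF(ctx context.Context) ([]byte, error)
	Close() error
}

type Browser interface {
	// NewTab opens a blank tab and connects to it (steps 2 and 3).
	NewTab(ctx context.Context) (Tab, error)
}

// createPDF assumes the caller has already connected to Chrome (step 1).
func createPDF(ctx context.Context, b Browser, url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	tab, err := b.NewTab(ctx)
	if err != nil {
		return nil, fmt.Errorf("open tab: %w", err)
	}
	// Step 4: the deferred close runs on every return path, so failed
	// exports do not leave tabs open in the long-running Chrome process.
	defer tab.Close()

	if err := tab.Navigate(ctx, url); err != nil { // step 5
		return nil, fmt.Errorf("navigate: %w", err)
	}
	if err := tab.WaitForLoad(ctx); err != nil { // step 6
		return nil, fmt.Errorf("wait for load: %w", err)
	}
	pdf, err := tab.PrintToPDF(ctx) // step 7
	if err != nil {
		return nil, fmt.Errorf("print: %w", err)
	}
	return pdf, nil
}

// fakeTab and fakeBrowser are in-memory stand-ins used only for the demo.
type fakeTab struct {
	failLoad bool
	closed   bool
}

func (t *fakeTab) Navigate(ctx context.Context, url string) error { return nil }
func (t *fakeTab) WaitForLoad(ctx context.Context) error {
	if t.failLoad {
		return errors.New("page did not finish loading")
	}
	return nil
}
func (t *fakeTab) PrintToPDF(ctx context.Context) ([]byte, error) { return []byte("%PDF-1.4"), nil }
func (t *fakeTab) Close() error                                   { t.closed = true; return nil }

type fakeBrowser struct{ tab *fakeTab }

func (b fakeBrowser) NewTab(ctx context.Context) (Tab, error) { return b.tab, nil }

// renderWithFake runs createPDF against the fake and reports whether the
// tab was closed afterwards.
func renderWithFake(failLoad bool) (pdf []byte, closed bool, err error) {
	tab := &fakeTab{failLoad: failLoad}
	pdf, err = createPDF(context.Background(), fakeBrowser{tab: tab}, "https://example.com/flyer", 30*time.Second)
	return pdf, tab.closed, err
}

func main() {
	pdf, closed, err := renderWithFake(false)
	fmt.Println(len(pdf), closed, err)
}
```

Keeping the browser behind an interface also makes the workflow testable without a running Chrome instance.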

Authentication & Working With Multiple Tabs

The export server needs to load a page on behalf of an existing logged-in user, a page that normally should not be available to anonymous users. So we needed a way to transfer the current user’s session to the export service. Normally we use a session token stored in a cookie to identify the current user. In this case, we pass the session token as part of the request to the export service, and the service sets the cookie in Chrome before making the request to the frontend.

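// Hostname() omits any port, which a cookie domain must not include.
cookieArgs := network.NewSetCookieArgs(cookieName, cookieValue).
    SetDomain(urlParsed.Hostname())
// The error is ignored here for brevity; check it in real code.
_, _ = c.Network.SetCookie(ctx, cookieArgs)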

Once we got that working, we ran into another issue when testing the new export service for concurrency. We realized that when we have multiple requests coming to the server on behalf of different users, Chrome was sharing the session information between the tabs. This is the expected behavior for web users, since when opening a new tab on the same website, you want to still be logged in. However, for serving requests on behalf of different users, this was a blocking issue. Rather than opening a tab to the desired page, the solution was to create a new blank tab, then set the cookies, then navigate to the page.

Page Size

Chrome assumes that the page size is 8.5 x 11 inches (US Letter), and does not automatically pull the page size from the page’s print styles. If our document has a different size, the dimensions need to be passed to the server as part of the API request, and then passed to Chrome, which expects them in inches.

printToPDFArgs := page.NewPrintToPDFArgs().
    SetPaperWidth(width).
    SetPaperHeight(height)
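Because Chrome takes the paper width and height in inches, a service that accepts sizes in other units has to convert them first. The sketch below shows one way to do that, using the CSS absolute-length ratios (1in = 2.54cm = 72pt = 96px). The toInches helper and the idea of accepting strings like "148mm" are assumptions made for this example; the article does not say what format the real API uses.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// inchesPer maps a two-letter unit suffix to its size in inches.
var inchesPer = map[string]float64{
	"in": 1,
	"mm": 1 / 25.4,
	"cm": 1 / 2.54,
	"pt": 1.0 / 72,
	"px": 1.0 / 96,
}

// toInches parses a length such as "210mm" or "8.5in" and returns inches.
func toInches(length string) (float64, error) {
	s := strings.TrimSpace(length)
	if len(s) < 3 {
		return 0, fmt.Errorf("invalid length %q", length)
	}
	value, unit := s[:len(s)-2], s[len(s)-2:]
	factor, ok := inchesPer[unit]
	if !ok {
		return 0, fmt.Errorf("unknown unit in %q", length)
	}
	n, err := strconv.ParseFloat(value, 64)
	// Reject parse failures, NaN, infinities, zero and negative sizes.
	if err != nil || !(n > 0) || math.IsInf(n, 0) {
		return 0, fmt.Errorf("invalid length %q", length)
	}
	return n * factor, nil
}

func main() {
	// A6 postcard in landscape orientation.
	w, errW := toInches("148mm")
	h, errH := toInches("105mm")
	if errW != nil || errH != nil {
		fmt.Println("bad page size:", errW, errH)
		return
	}
	fmt.Printf("paper size: %.3f x %.3f in\n", w, h)
}
```

The resulting values are what would be handed to SetPaperWidth and SetPaperHeight.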

Waiting for Page Resources to Finish Loading

One of the biggest challenges we needed to resolve was how to definitively know that a page has been fully loaded. The app is written in Angular 1.5, and we need to account for the Angular loading lifecycle. The browser loads some images, fonts and other resources asynchronously, so the page may not be ready for printing even after Chrome reports that it is done loading the page. For example, we have some Angular components that pull in their own images, and those get loaded by JavaScript after the initial page has been reported as loaded.

We solved this by modifying our web application code. Components that need extra time for loading declare themselves by adding an attribute:

this.$element[0].setAttribute('loadable-component', '');

When a component is done loading, it fires an event:

this.$scope.$emit('LoadableComponentReady');

The page that contains the components finds the loadable components, waits for them to finish loading, and then fires an event when they are all done loading.

When the Go service runs, it injects a small script into the client page which returns a promise that resolves when image loading is complete.
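To make the idea concrete, here is a rough sketch of what such an injected script and the corresponding CDP command could look like. The JavaScript is illustrative and is not the script the service actually injects: it only waits for images already present in the document, so it complements rather than replaces the component events described above. The Go code builds a Runtime.evaluate command with awaitPromise set, which asks Chrome to reply only after the promise settles; the type and function names are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// waitForImagesJS resolves once every <img> currently in the document has
// either loaded or failed. Illustrative only.
const waitForImagesJS = `new Promise((resolve) => {
  const pending = Array.from(document.images).filter((img) => !img.complete);
  if (pending.length === 0) {
    resolve(true);
    return;
  }
  let remaining = pending.length;
  const done = () => {
    remaining -= 1;
    if (remaining === 0) resolve(true);
  };
  for (const img of pending) {
    img.addEventListener('load', done, { once: true });
    img.addEventListener('error', done, { once: true });
  }
})`

// evaluateParams mirrors a subset of the Runtime.evaluate parameters.
type evaluateParams struct {
	Expression    string `json:"expression"`
	AwaitPromise  bool   `json:"awaitPromise"`
	ReturnByValue bool   `json:"returnByValue"`
}

// command is the JSON envelope sent to Chrome over the CDP WebSocket.
type command struct {
	ID     int            `json:"id"`
	Method string         `json:"method"`
	Params evaluateParams `json:"params"`
}

// newAwaitCommand builds a Runtime.evaluate command that makes Chrome wait
// for the promise produced by script before it replies.
func newAwaitCommand(id int, script string) ([]byte, error) {
	return json.Marshal(command{
		ID:     id,
		Method: "Runtime.evaluate",
		Params: evaluateParams{
			Expression:    script,
			AwaitPromise:  true,
			ReturnByValue: true,
		},
	})
}

func main() {
	msg, err := newAwaitCommand(1, waitForImagesJS)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(msg))
}
```

A CDP client library normally builds this message for you; spelling it out shows that the waiting happens inside Chrome, so the Go side simply blocks on a single reply.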

Asynchronous Execution

All requests to our frontend in production are routed through a reverse proxy that has a request timeout of 30 seconds. When processing very large web pages, it may take longer than that to create a PDF, which could lead to request timeout errors.

We have a lightweight digital asset manager that stores URLs of image resources. When it makes a request to the export server to create a PDF, it can run synchronously or asynchronously. Normally, the request hangs until the PDF has been created and saved, and the PDF URL is returned in the response. When background mode is turned on, it creates a resource without a URL and returns immediately with a 201 Created status.

The client can then poll the resource once per second until the URL is available, even if it takes longer than 30 seconds. In this example, the createAsset() function creates the PDF resource, and then the getAsset() function polls for updates until the URL is available. We still set a timeout of 120 seconds to avoid infinite recursion.
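The polling logic can be sketched as below. The article's createAsset() and getAsset() live in the client; this version is a Go approximation that uses a loop with a deadline instead of recursion. The asset type, the pollAsset helper and the example URL are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// asset is a hypothetical stand-in for the PDF resource. URL stays empty
// until the export has finished.
type asset struct {
	ID  string
	URL string
}

var errTimeout = errors.New("timed out waiting for the PDF URL")

// pollAsset calls get, sleeping interval between attempts, until the asset
// has a URL or the timeout would be exceeded.
func pollAsset(get func() (asset, error), interval, timeout time.Duration) (asset, error) {
	deadline := time.Now().Add(timeout)
	for {
		a, err := get()
		if err != nil {
			return asset{}, err
		}
		if a.URL != "" {
			return a, nil
		}
		// Stop if the next attempt would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return asset{}, errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate a resource whose URL appears on the third poll.
	calls := 0
	get := func() (asset, error) {
		calls++
		if calls < 3 {
			return asset{ID: "a1"}, nil
		}
		return asset{ID: "a1", URL: "https://example.com/a1.pdf"}, nil
	}

	// The article polls once per second with a 120-second limit; shorter
	// values are used here so the demo finishes quickly.
	a, err := pollAsset(get, 10*time.Millisecond, time.Second)
	fmt.Println(a.URL, calls, err)
}
```

Because each poll is a short request of its own, none of them comes close to the proxy's 30-second limit, even when the export itself takes longer.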

Deployment Concerns

Keeping Chrome running

For performance reasons, it’s not practical to start an instance of Chrome every time we want to print a page. For the same reason that you only run one copy of Chrome on your computer, we can keep one copy of Chrome running on the server.

We chose the pm2 process manager to launch Chrome and keep it running in the background. For example, to launch Chrome in headless mode on macOS using pm2, you could run:

pm2 start /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --headless \
    --remote-debugging-port=9222 \
    --disable-gpu \
    --disable-translate \
    --disable-extensions \
    --disable-background-networking \
    --safebrowsing-disable-auto-update \
    --disable-sync \
    --disable-default-apps \
    --hide-scrollbars \
    --metrics-recording-only \
    --mute-audio \
    --no-first-run

The most important argument is of course --headless. You’ll notice that many features like extensions and translation are turned off because we don’t need them. If you’re curious about the other options, here’s an explanation of all of Chrome’s command line switches.

Service Security

In theory, the client could specify the URL of any page on the internet. An attacker could even specify the URL of a local file such as file:///etc/passwd and then “print” it to a PDF. Yikes! For security, the service will only load URLs matching a specific pattern.
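A check along these lines might look like the sketch below. The article does not say what pattern the service uses, so the rules here (HTTPS only, no embedded credentials, host under a hypothetical .example.com domain) are assumptions chosen for illustration. Parsing the URL and comparing the hostname is generally safer than matching the raw string, which is easy to fool with inputs like https://app.example.com.evil.net/.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// allowedHostSuffix is a hypothetical allowlist entry. The leading dot means
// only subdomains match, not the bare domain.
const allowedHostSuffix = ".example.com"

// validateTargetURL rejects anything except HTTPS URLs on allowed hosts.
func validateTargetURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse url: %w", err)
	}
	// Blocks file://, http://, chrome:// and similar schemes.
	if u.Scheme != "https" {
		return errors.New("only https URLs are allowed")
	}
	// Blocks tricks such as https://app.example.com@evil.net/.
	if u.User != nil {
		return errors.New("credentials in the URL are not allowed")
	}
	host := strings.ToLower(u.Hostname())
	if !strings.HasSuffix(host, allowedHostSuffix) {
		return fmt.Errorf("host %q is not allowed", host)
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"https://app.example.com/flyers/42",
		"file:///etc/passwd",
		"https://app.example.com.evil.net/",
	} {
		fmt.Println(raw, "->", validateTargetURL(raw))
	}
}
```

Validation of the requested URL does not cover redirects or subresources the page itself loads, so it is best combined with network-level restrictions on the export server.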

Handling SSL errors

Our website uses HTTPS for security, but the servers we use for internal testing do not have valid SSL certificates. When loading a page with an invalid certificate, the browser normally displays a warning message that can be bypassed. However, it is not possible to click past this message with Headless Chrome. Instead, we can tell Chrome to ignore these errors.
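One possible way to do this, not necessarily the one the service uses, is Chrome's --ignore-certificate-errors startup switch, enabled only for test environments. The sketch below builds the argument list accordingly; chromeArgs and its parameters are hypothetical. Older headless builds reportedly did not honor this switch, in which case certificate errors can instead be handled through the Security domain of the DevTools Protocol.

```go
package main

import (
	"fmt"
	"strconv"
)

// chromeArgs returns the switches used to start Headless Chrome.
// ignoreCertErrors should only be true for internal test environments.
func chromeArgs(debugPort int, ignoreCertErrors bool) []string {
	args := []string{
		"--headless",
		"--remote-debugging-port=" + strconv.Itoa(debugPort),
	}
	if ignoreCertErrors {
		// Disables certificate validation for every page Chrome loads,
		// so it must never be enabled in production.
		args = append(args, "--ignore-certificate-errors")
	}
	return args
}

func main() {
	fmt.Println("production:", chromeArgs(9222, false))
	fmt.Println("testing:   ", chromeArgs(9222, true))
}
```

Driving the switch from configuration keeps the insecure behavior out of production by default.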

Local Developer Experience

Running the Server Locally for Development

With DocRaptor, we were sending an HTML string to their service, while Chrome expects to receive the URL of a web page. You can think of this as the difference between passing data by copy and by reference. This makes the API request simpler because the URL is much shorter than an HTML string. However, this has some implications as far as reachability goes.

In production, the web server has a publicly accessible hostname, so it’s possible for the backend to make a request to the frontend. In local development, when a developer is running both the export service and the frontend locally, they can connect on localhost. However, we run into problems when a frontend developer wants to connect to an instance of the export server in the cloud while running the frontend on their computer. The developer’s computer is typically not reachable from the cloud because it does not have a public hostname and is behind a firewall.

To solve this problem, we use a service called ngrok. The developer runs a small program on their computer that creates a tunnel to the ngrok service. Requests to a public hostname are routed through the tunnel and forwarded to a particular port on the local computer. This effectively makes the local frontend reachable on the public internet.

Visual Debugging

So it turns out that Headless Chrome can have a head after all! While it doesn’t have a window of its own on the computer where it is running, you can point a browser to the debugging port (http://localhost:9222) and view the open tabs. When you click on a tab you’ll see a live preview, and you can even use the developer tools to view the console and debug the page. (The service normally closes a tab when it is done creating the PDF, but you can skip that when you’re debugging.)

Conclusion

As you can see, using the Chrome DevTools Protocol to create a PDF using Headless Chrome is fairly straightforward, but there are a number of issues that need to be resolved in order to have a fully operational service. Hopefully you’ve learned something useful about writing a Go service and the inner workings of Chrome.

Appendix: Other Approaches for HTML-PDF Conversion

Here are some other libraries worth considering. Some use Headless Chrome, and most are based on Node.js.
