Word to PDF + Serverless = 💕
How should one tackle the general problem of converting Word documents to PDF at scale?
My earlier article spoke primarily to JavaScript developers; it showed how docx templating is now the easiest way to create PDFs in JavaScript. You generate a docx, then use our docx-wasm to convert it to PDF.
Here I’ll show you how serverless enables conversion from docx at scale: think 1 million PDFs per hour, easily and at low cost. Although we’ll again use node.js here, I can say as a Java developer myself that this serverless approach also works well for developers of other persuasions.
This is a sneak peek at more of what I will cover in my upcoming talk at the PDF Association conference in Seattle in June. See you there?
Conversion is CPU intensive
A characteristic of docx to PDF conversion is that it is inherently CPU intensive. You can basically convert one input file per CPU core at a time. Try to do 2, and it will just take twice as long. “How long” depends on the complexity of the document. Number of pages is a rough proxy for complexity, but what it contains also matters: think tables, images, page breaks, table of contents etc.
Historically, more cores meant more servers, so you scaled by throwing more servers at the problem. But this is expensive: in hardware, in software licenses, and in managing complexity (aka people). And either you over-provisioned the hardware (throwing money away), or you risked overload (i.e. delayed or lost jobs).
Serverless to the rescue!
I’m not going to explain the basics here; suffice to say that all the major cloud providers (AWS, Azure, Google etc) have serverless/FaaS (function as a service) offerings now, and there are also more open/portable alternatives, such as the Serverless Framework, OpenWhisk, and options for serverless on Kubernetes.
Here we’ll use AWS Lambda, but our approach is readily transferable to the other serverless environments. This is largely because these other environments also support functions written in node.js.
Cutting to the chase, here’s the sort of scalability you can expect when performing conversions on Lambda:
Here we created 200,000 PDFs. It took 10 mins.
A single PDF took around 1 second, and AWS fired up 591 function instances in total to do the job. Firing up an instance running our function takes about 8 seconds, so there are 591 dots just under the 10,000ms mark (that time includes an actual conversion).
Further below, I explain how I used SQS to generate this load. The main point to make here, though, is how easy (and cheap!) it is to fire up those 590 “cores”. We didn’t have to do anything, really; AWS Lambda did it all for us. The essence of serverless is that you don’t have to think about servers.
How does it work?
Basically, what happens is that our docx-to-PDF function executes in response to some trigger event. In AWS, there is a long list of possible trigger events; notable ones include a REST API call, an S3 “object created” event, an SQS message, and Step Functions.
We’ll want our trigger event to tell the function what docx to convert, and what to do with the result. Since AWS imposes limits on the size of an invocation payload, our design will be around an input docx in some S3 bucket, with the resulting PDF to be written to S3.
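To make that shape concrete, here’s a minimal sketch of what such a handler can look like for the S3 “object created” case. This isn’t the code from the repo: the OUTPUT_BUCKET environment variable and the convertToPdf helper (sketched further down) are placeholders I’ve made up for illustration.

```js
// Sketch of a Lambda handler for an S3 "object created" trigger.
// Not the repo's actual code: OUTPUT_BUCKET and ./convert are placeholders.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
// Hypothetical local module wrapping docx-wasm (see the sketch further down)
const { convertToPdf } = require('./convert');

exports.handler = async (event) => {
  // The S3 event tells us which object was just created
  const record = event.Records[0];
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

  // Fetch the input docx from S3
  const docx = await s3.getObject({ Bucket: bucket, Key: key }).promise();

  // Convert it to PDF
  const pdf = await convertToPdf(docx.Body);

  // Write the PDF to a dedicated output bucket, or alongside the input
  await s3.putObject({
    Bucket: process.env.OUTPUT_BUCKET || bucket,
    Key: key.replace(/\.docx$/i, '.pdf'),
    Body: Buffer.from(pdf),
    ContentType: 'application/pdf'
  }).promise();
};
```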
How to tell the function which S3 docx object to convert?
In the sample code at https://github.com/NativeDocuments/docx-to-pdf-on-AWS-Lambda, we support triggering when the docx object is created in S3, when we receive a Step Functions event, or when an SQS message identifies the object. How the input docx and output PDF are identified depends on the trigger event:
See the README for details. The S3 object created event is useful for dev testing, but the others are better for real serverless applications, because they give you more control over which document is converted and where the PDF is written. If you need some other Lambda trigger, adding it should be straightforward: just clone or fork the GitHub repo to get started.
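For illustration only, here’s roughly how a handler can tell those trigger types apart, using the documented AWS event shapes. The message field names (sourceBucket, sourceKey and friends) are made up for this sketch; the README documents the format the real function expects.

```js
// Illustrative only: distinguish S3, SQS and Step Functions invocations.
// The payload field names (sourceBucket, sourceKey, ...) are hypothetical;
// see the repo README for the actual format.
function identifyInputOutput(event) {
  if (event.Records && event.Records[0].eventSource === 'aws:s3') {
    // S3 object created: write the PDF alongside the input docx
    const { bucket, object } = event.Records[0].s3;
    return {
      sourceBucket: bucket.name,
      sourceKey: object.key,
      targetBucket: bucket.name,
      targetKey: object.key.replace(/\.docx$/i, '.pdf')
    };
  }
  if (event.Records && event.Records[0].eventSource === 'aws:sqs') {
    // SQS: the message body carries the buckets and keys explicitly
    return JSON.parse(event.Records[0].body);
  }
  // Step Functions: the state input is passed through as the event itself
  return event;
}
```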
Once the Lambda has the docx, performing the actual conversion is a simple API call using docx-wasm. The PDF is then written to an S3 bucket, with the key determined by the trigger event (see the README).
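The exact calls are in the GitHub repo; the sketch below just follows the general pattern of the published docx-wasm Node samples (init with your developer credentials, load the document, export PDF), so treat the names as approximate and the repo as authoritative.

```js
// Approximate sketch of the docx-wasm conversion call, following the
// pattern of Native Documents' Node samples; check the repo for the
// exact, current API. Credentials are read from placeholder env vars.
const api = require('@nativedocuments/docx-wasm');

api.init({
  ND_DEV_ID: process.env.ND_DEV_ID,         // your developer id
  ND_DEV_SECRET: process.env.ND_DEV_SECRET, // your developer secret
  ENVIRONMENT: 'NODE',
  LAZY_INIT: true                           // initialise on first use
});

async function convertToPdf(docxBuffer) {
  const engine = await api.engine();         // start the WASM engine
  const doc = await engine.load(docxBuffer); // load the input docx
  const pdfArrayBuffer = await doc.exportPDF();
  await doc.close();
  return pdfArrayBuffer;
}

module.exports = { convertToPdf };
```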
To try it now, there’s an app in the AWS Serverless Application Repository: https://serverlessrepo.aws.amazon.com/applications/arn:aws:serverlessrepo:us-east-1:992364115735:applications~docx-to-pdf (or just search for “docx-to-pdf” there). Or you can get the source from GitHub.
To finish off this article, let me touch on 3 points.
Load Testing
A simple way to generate an S3 created event is to copy/paste a docx in the S3 web console. The good thing about this is that once you have a docx in S3, you don’t need to upload it again. You can simply copy/paste the object to generate the event. The same thing happens if you copy/paste a folder in S3: you get an event for each docx in it.
That’s handy for basic testing (say, with a directory of 1000 documents).
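If you’d rather script that copy/paste trick than click around the console, something along these lines re-copies every docx under a prefix onto itself, firing an “object created” event per document. This isn’t in the repo; the bucket and prefix are whatever you use.

```js
// Re-copy every docx under a prefix onto itself, firing an
// "object created" event per document (bucket/prefix are placeholders).
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function retriggerAll(bucket, prefix) {
  let token;
  do {
    const page = await s3.listObjectsV2({
      Bucket: bucket, Prefix: prefix, ContinuationToken: token
    }).promise();
    for (const obj of page.Contents.filter(o => o.Key.endsWith('.docx'))) {
      await s3.copyObject({
        Bucket: bucket,
        CopySource: `${bucket}/${obj.Key}`,
        Key: obj.Key,
        MetadataDirective: 'REPLACE' // required when copying an object onto itself
      }).promise();
    }
    token = page.NextContinuationToken;
  } while (token);
}
```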
But it’s not suitable for triggering the lambda 200,000 times quickly. How to do that?
SQS helped here. When the Lambda finished a conversion, I had it generate an SQS event, which triggered the Lambda to run another conversion. Actually, not one SQS event but 9 or 10, so each Lambda effectively kicks off 9 or 10 additional conversions. For example, in generation 1 there are 9 conversion requests; in generation 2 there are 10 x 9 = 90; and so on, so that in generation 5 there are 90,000. The SQS message contains a value saying which generation it belongs to, and after a set number of generations (say, 5) we stop generating events.
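The real generator is just throwaway test code, but the idea fits in a few lines: when a conversion finishes, enqueue roughly ten more requests tagged with the next generation number, and refuse to go past the cap. The queue URL and message fields here are placeholders, not the repo’s format.

```js
// Sketch of the self-amplifying load generator: after each conversion,
// enqueue ~10 more requests tagged with the next generation, stopping
// at MAX_GENERATION. QUEUE_URL and message fields are placeholders.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

const MAX_GENERATION = 5; // stop after this many generations
const FAN_OUT = 10;       // ~10 new requests per finished conversion

async function fanOut(message) {
  const generation = (message.generation || 1) + 1;
  if (generation > MAX_GENERATION) return; // stop the recursion here

  const entries = [];
  for (let i = 0; i < FAN_OUT; i++) {
    entries.push({
      Id: String(i),
      MessageBody: JSON.stringify({ ...message, generation })
    });
  }
  // sendMessageBatch accepts up to 10 messages per call
  await sqs.sendMessageBatch({
    QueueUrl: process.env.QUEUE_URL,
    Entries: entries
  }).promise();
}
```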
I found this to be a quick and easy way to generate load: Lambda itself generates the load, with the help of SQS. Note that AWS does caution you to avoid recursive code, saying:
This could lead to unintended volume of function invocations and escalated costs. If you do accidentally do so, set the function concurrent execution limit to 0 immediately to throttle all invocations to the function, while you update the code.
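For what it’s worth, that emergency brake is a single API call (or a couple of clicks in the console). The function name below is a placeholder for your deployed conversion function.

```js
// Emergency brake: throttle all invocations of the function to zero.
// 'docx-to-pdf' is a placeholder function name.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.putFunctionConcurrency({
  FunctionName: 'docx-to-pdf',
  ReservedConcurrentExecutions: 0
}).promise().then(() => console.log('Function throttled'));
```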
Lambda Memory (MB) setting
You should pay attention to your Lambda Memory config setting, since it affects how much you pay, and how long things take.
I mentioned earlier that conversion to PDF is CPU intensive. But it doesn’t use much memory (at least as Native Documents has implemented it). Unfortunately, cloud providers invariably tie CPU and memory together: want more CPU? You have to take more RAM. With Lambda, they just have a slider for RAM, and it is this which determines how much CPU you get.
Since more RAM (and therefore CPU) costs more per 100ms of execution time, ideally we want the sweet spot where the job gets done faster and we pay less. Confirming this is a matter of experimentation with your particular documents, but experience suggests 2048MB gives you enough CPU that conversion is both faster and cheaper than at lower settings. Be aware that docx-wasm uses a minimum of around 500MB RAM, and WebAssembly under Node can use a maximum of 2GB RAM, so those are effectively your lower and upper bounds.
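The memory size is just function configuration, so while you experiment you can change it from the console, your SAM/Serverless template, or the SDK. For example (function name again a placeholder):

```js
// Set the function's memory (and hence CPU share) to 2048 MB.
// 'docx-to-pdf' is a placeholder function name.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

lambda.updateFunctionConfiguration({
  FunctionName: 'docx-to-pdf',
  MemorySize: 2048
}).promise().then(cfg => console.log('Memory now', cfg.MemorySize, 'MB'));
```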
Cloud APIs and Sensitive Documents don’t mix
Companies tend to be worried — rightly — about the fate of documents sent across the Internet to some third-party API endpoint.
Performing the conversions yourself on AWS Lambda can help you sleep easy at night. You control the endpoint, and you also get easy scalability that isn’t opaque (how much can that third-party endpoint really scale, anyway?).
What’s not to like? :-)