PDF Generation With AWS Lambda

Published in

Tata 1mg Technology

9 min read · Oct 5, 2020

Serverless computing is a good way to use cloud resources. It lets services scale without the overhead of provisioning and managing servers, and at reduced cost. We deployed one such service, a PDF generation service, on AWS Lambda, and we discuss it in this article. PDFs are generated in many domains to create reports, bills, invoices and so on, and the process can consume significant CPU and memory. To make it more efficient and scalable, we took a serverless approach to generating PDFs.

The serverless pattern encourages developing well-defined units of business logic without deciding how they are deployed or scaled. It frees the user from deployment concerns, billing is based on the execution of your program, and capacity scales automatically with traffic.

Serverless comes mainly in two types. BaaS (Backend as a Service) incorporates third-party, cloud-hosted applications and services, typically used by single-page web apps or mobile apps. FaaS (Function as a Service) runs programs in stateless compute containers that are event-triggered and may last for only one invocation. AWS Lambda is one of the most popular FaaS platforms at present.

AWS Lambda follows the serverless architecture, also known as serverless computing or Function as a Service (FaaS).

To explain why we chose AWS Lambda for PDF generation, we take the example of generating a shipping invoice on each order delivery.

CPU Utilization: On each order delivery, the user is sent a shipping invoice PDF. On average, the order service receives about 20-30K PDF generation requests per day. Generating each PDF on the backend service is CPU-expensive, which leads to high CPU consumption and high request latency. Under high load, PDF generation also caused service restarts. To decouple PDF generation from the backend server, we moved it to AWS Lambda. The PDF file is stored in an AWS S3 bucket and the file link is then shared with the user over email. In this way we distributed the load by shifting the process to AWS Lambda.

Fig 1.1 EC2 Backend Service CPU Utilisation Metric

After we released the PDF generation service on 28th April 2020, the maximum CPU utilisation of the backend server dropped by ~40%. This followed from decoupling the PDF generation logic from the backend and generating PDFs through the PDF generation service on AWS Lambda.

Cost Effective: If the only goal were to distribute the load, this code could have been deployed on an EC2 instance instead of AWS Lambda. Here is why we did not do that.

In terms of scalability and cost, Lambda provided an efficient solution. AWS Lambda scales automatically, which saves the effort of scaling and managing EC2 instances. Even the smallest EC2 instance, t2.nano, would be costlier for two reasons. First, we would need an ALB (Application Load Balancer) to balance load between instances, which adds to the cost. Second, traffic is not always evenly distributed, so we would need more EC2 instances than planned, and every running instance ties up its allocated memory. Lambda handles load balancing internally, so scaling adds no extra cost for it. The following shows the AWS Lambda cost calculation based on inputs observed in production metrics and our Lambda settings.

With ~0.6 million requests per month, an invocation duration of 2.5 seconds each and 3008 MB of allocated memory, the calculated cost comes to 76.50 USD per month.
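The shape of that calculation can be sketched as follows. This is only an illustration: the two price constants are assumed on-demand rates and are not taken from this article, and the inputs are the rounded figures quoted above, so the result lands in the same range as the quoted amount without necessarily matching it.

```javascript
// Rough monthly Lambda cost estimate.
// ASSUMPTION: both rates below are illustrative on-demand prices,
// not values taken from the article; check the current AWS price list.
const PRICE_PER_GB_SECOND = 0.0000166667; // assumed USD per GB-second
const PRICE_PER_MILLION_REQUESTS = 0.2;   // assumed USD per 1M requests

function estimateMonthlyCost({ requests, durationSeconds, memoryMb }) {
  // Lambda bills compute as (memory in GB) x (execution time in seconds)
  const gbSeconds = requests * durationSeconds * (memoryMb / 1024);
  const computeCost = gbSeconds * PRICE_PER_GB_SECOND;
  const requestCost = (requests / 1e6) * PRICE_PER_MILLION_REQUESTS;
  return { gbSeconds, computeCost, requestCost, total: computeCost + requestCost };
}

// Rounded inputs quoted in the article
const estimate = estimateMonthlyCost({
  requests: 600000,
  durationSeconds: 2.5,
  memoryMb: 3008,
});
console.log(`~${estimate.total.toFixed(2)} USD per month`);
```

Almost all of the cost comes from the GB-second term, which is why invocation duration and allocated memory are the two settings worth tuning.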

Serverless: PDF generation usually runs as a background task in backend services. Under high load, a large number of background tasks accumulates and causes CPU spikes. This can be managed by queuing the tasks and running them at intervals, but overall CPU consumption stays high, and picking up a heavy PDF generation task increases CPU utilization further. Since PDF generation can be decoupled from the backend server and used as a common utility, it can be deployed as a FaaS service. To achieve this, we integrated it with API Gateway and S3 so that it serves requests from multiple services. Currently, multiple services at 1mg use the PDF generation service.

Fig 1.4 PDF Generation Service Flow Diagram

Generating a PDF follows these steps: take an HTML template as input, render it into an HTML string, and pass that string to the PDF library, which converts it into a PDF file. The code renders Jinja-style templates to HTML (using Nunjucks) and then converts the HTML string to PDF. To keep the PDF generation service generic, it is templatized: each backend service passes its template's S3 location and the dynamic data to the Lambda function through the API.
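To make the request contract concrete, here is a sketch of what a calling service might send. The key names are the ones read by the handler shown later in this article; every value (bucket names, object keys, template fields) is a made-up placeholder, and the `missingFields` helper is hypothetical rather than part of the service.

```javascript
// Hypothetical request payload. Key names follow the handler in this
// article; all values are placeholders invented for illustration.
const examplePayload = {
  template_s3_bucket_details: {
    BUCKET_NAME: 'example-template-bucket',
    OBJECT_KEY: 'templates/shipping_invoice.html',
  },
  pdf_s3_bucket_details: {
    BUCKET_NAME: 'example-pdf-bucket',
    PDF_FILE_INFO: {
      PATH: 'invoices/order-123.pdf',
      PDF_GENERATION_OPTION: { format: 'A4' }, // optional, overrides the defaults
      PDF_UPLOAD_EXTRA_ARGS: {},               // optional, overrides the defaults
    },
  },
  template_dynamic_data: { order_id: '123', customer_name: 'Jane Doe' },
  version: 1,
  resource_lock_id: 'order-123',
};

// Fields the handler dereferences unconditionally
const REQUIRED_PATHS = [
  'template_s3_bucket_details.BUCKET_NAME',
  'template_s3_bucket_details.OBJECT_KEY',
  'pdf_s3_bucket_details.BUCKET_NAME',
  'pdf_s3_bucket_details.PDF_FILE_INFO.PATH',
];

// Hypothetical helper: returns the dotted paths that are absent or empty
function missingFields(payload) {
  return REQUIRED_PATHS.filter((path) => {
    const value = path
      .split('.')
      .reduce((node, key) => (node == null ? undefined : node[key]), payload);
    return value === undefined || value === null || value === '';
  });
}

console.log(missingFields(examplePayload)); // []
```

A check like this on the caller's side reports a bad request before it costs a Lambda invocation.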

We used the html-pdf library to convert HTML to PDF because:

We pass the HTML content to the library as a buffer and get the PDF back in memory, instead of writing PDFs directly to disk as with PDFKit, which increases I/O operations.
We use templates to keep the PDF generation process generic. Services pass their template's S3 location, which is used for generating the PDF. The templates are Jinja-style templates rendered with Nunjucks, and the rendered HTML is then handed to html-pdf.

The library internally uses PhantomJS, a headless browser. Its executable must be installed on your system; follow the installation steps at https://phantomjs.org/download.html.

Prerequisite:

Node.js 12.x: the AWS Lambda runtime used here is Node.js 12
Serverless: the module used to deploy the service on AWS Lambda

npm i -g serverless

Layer:

A layer is a ZIP archive that lets a Lambda function pull in additional code. You can move runtime dependencies out of your function code by placing them in a layer. Lambda extracts layers into the /opt directory, and the runtimes include paths under /opt so that your function code has access to the libraries in the layers.

service: executables-layer

provider:
  name: aws
  stage: ${opt:stage, 'dev'}
  region: ${opt:region, 'ap-south-1'}

layers:
  pdfGenerator:
    path: executables
    name: pdfGenerator-${self:provider.stage}
    description: Executable binaries required to convert html to pdf

resources:
  Outputs:
    PDFGeneratorLayerExport:
      Value:
        # Serverless names the layer resource <TitleCasedLayerKey>LambdaLayer
        Ref: PdfGeneratorLambdaLayer
      Export:
        Name: PDFGeneratorLayerLayer-${self:provider.stage}

Structure your layer so that function code can access libraries without additional configuration. Once the layer above is deployed, all the executables it contains are accessible to the function under /opt, for example /opt/phantomjs_linux-x86_64.
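A quick way to catch a mis-packaged layer is to verify at start-up that the binary is where the function expects it. The following is a minimal sketch using only Node's built-in modules; `resolveLayerBinary` is a hypothetical helper name, not part of the service.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the full path of a file shipped in a layer,
// or throws a descriptive error if the layer does not contain it.
function resolveLayerBinary(fileName, layerRoot = '/opt') {
  const fullPath = path.join(layerRoot, fileName);
  if (!fs.existsSync(fullPath)) {
    throw new Error(`Layer binary not found: ${fullPath}`);
  }
  return fullPath;
}

// Outside Lambda the file is normally absent, so report instead of crashing
try {
  console.log(resolveLayerBinary('phantomjs_linux-x86_64'));
} catch (error) {
  console.log(error.message);
}
```

Failing early with the full path in the message is easier to debug than a generic "PhantomJS not found" error at the first PDF request.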

Code:

We will create a simple function that takes an HTML template file and the dynamic data for rendering as input and converts them to a PDF. In this example we use Nunjucks, a Jinja-style template engine for JavaScript, and upload the generated PDF file to S3. We use the html-pdf library to convert HTML to PDF. For more information about configuring html-pdf, see https://www.npmjs.com/package/html-pdf.

Let’s get started with the function.

mkdir pdfGenerator
cd pdfGenerator
touch handler.js

Write the following code in the handler.js file.

We import the following libraries for the PDF generation function. The aws-sdk library provides the S3 client, html-pdf converts HTML to PDF, and nunjucks renders the templates.

import pdf from 'html-pdf'
import AWS from 'aws-sdk'
import nunjucks from 'nunjucks'

As you can see in the code, we set some environment variables before the function. These environment variables must be set for PhantomJS and its fonts and shared libraries under /opt to be found. More information is available in the AWS documentation on Lambda environment variables.

process.env.PATH = `${process.env.PATH}:/opt`
process.env.FONTCONFIG_PATH = '/opt'
process.env.LD_LIBRARY_PATH = '/opt'

It is important to initialise the default values of all arguments up front, to avoid exceptions inside the function.

const OUT_PDF_OPTIONS = { format: 'Letter', orientation: 'landscape', border: '15mm', zoomFactor: '0.6' };
const PDF_UPLOAD_ARGS = { ContentType: 'application/pdf', ACL: 'public-read' };
const OUTPUT_PDF_NAME_POSTFIX = '.pdf';
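Because module-level values survive between invocations of a warm Lambda container, defaults like these are best treated as read-only and copied before use. The sketch below illustrates the idea; `DEFAULT_PDF_OPTIONS` and `resolveOptions` are hypothetical names, and the "request options replace the defaults wholesale" rule mirrors what the handler in this article does.

```javascript
// Frozen so that accidental writes cannot leak into later invocations
const DEFAULT_PDF_OPTIONS = Object.freeze({
  format: 'Letter',
  orientation: 'landscape',
  border: '15mm',
  zoomFactor: '0.6',
});

// Hypothetical helper: returns a fresh object the caller may safely mutate.
// Non-empty request options replace the defaults entirely.
function resolveOptions(requestOptions) {
  const hasOverrides = Boolean(requestOptions) && Object.keys(requestOptions).length > 0;
  return hasOverrides ? { ...requestOptions } : { ...DEFAULT_PDF_OPTIONS };
}

const options = resolveOptions(undefined);
options.phantomPath = '/opt/phantomjs_linux-x86_64'; // mutates the copy only
console.log(DEFAULT_PDF_OPTIONS.phantomPath); // undefined
```

The same approach applies to the S3 upload arguments, where the bucket, key and body are added per request.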

One of the best practices for Lambda functions is to initialise all client objects globally. This avoids creating the client on every invocation and so reduces the average invocation duration.

const s3 = new AWS.S3();
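The effect of module-scope initialisation can be illustrated without any AWS dependency. In this sketch `createClient` is a stand-in for an expensive constructor such as `new AWS.S3()`, and both handler names are hypothetical.

```javascript
let clientsCreated = 0;

// Stand-in for an expensive client constructor such as `new AWS.S3()`
function createClient() {
  clientsCreated += 1;
  return { id: clientsCreated };
}

// Module scope: runs once, when the container first loads the module
const sharedClient = createClient();

// Reuses the shared client on every invocation
const reusingHandler = async () => sharedClient.id;

// For comparison: builds a new client on every invocation
const recreatingHandler = async () => createClient().id;

reusingHandler();
reusingHandler();
console.log(clientsCreated); // 1
```

Only invocations that hit a warm container benefit; a cold start still pays the construction cost once.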

Here we format the input data before it is used in the PDF generation process.

// Pick only the fields the PDF generator uses from the incoming payload
const transform_inputs = payload => {
    const {
        template_dynamic_data,
        template_s3_bucket_details,
        pdf_s3_bucket_details,
        version,
        resource_lock_id
    } = payload;
    return {
        template_dynamic_data,
        template_s3_bucket_details,
        pdf_s3_bucket_details,
        version,
        resource_lock_id
    };
};

PDF Generator function:

export const pdfGenerator = async event => {
    try {
        let payload = event;
        let transform_payload = transform_inputs(payload);

        // Template bucket details
        let template_s3_bucket = transform_payload.template_s3_bucket_details.BUCKET_NAME;
        let template_s3_key = transform_payload.template_s3_bucket_details.OBJECT_KEY;

        // Bucket details for storing the generated PDF
        let pdf_bucket = transform_payload.pdf_s3_bucket_details.BUCKET_NAME;
        let pdf_file_info = transform_payload.pdf_s3_bucket_details.PDF_FILE_INFO;
        let pdf_file_path = pdf_file_info.PATH;

        // Reject a path that has an extension other than .pdf
        if (pdf_file_path && pdf_file_path.split('.').length > 1 && !pdf_file_path.endsWith(OUTPUT_PDF_NAME_POSTFIX)) {
            console.log('Incorrect pdf file extension');
            return {
                statusCode: 400,
                body: JSON.stringify({ message: 'Incorrect pdf file extension' })
            };
        }
        // Append .pdf when the path has no extension
        if (pdf_file_path && pdf_file_path.split('.').length == 1) {
            pdf_file_path = pdf_file_path + OUTPUT_PDF_NAME_POSTFIX;
        }
        let pdf_generation_options = pdf_file_info.PDF_GENERATION_OPTION;
        let pdf_upload_extra_args = pdf_file_info.PDF_UPLOAD_EXTRA_ARGS;

        // Dynamic data for rendering PDF
        let render_data = transform_payload.template_dynamic_data;

        // Data for queuing purpose
        let version = transform_payload.version;
        let resource_lock_id = transform_payload.resource_lock_id;

        // Template object
        let Data = await s3.getObject({ Bucket: template_s3_bucket, Key: template_s3_key }).promise();

        // Body is a Buffer, so convert it to a string before converting to PDF
        let html = Data.Body.toString();
        let template = nunjucks.compile(html);

        // Dynamic data rendered into the template
        let content = template.render(render_data);

        // Copy the options so the module-level defaults are never mutated
        let options = { ...OUT_PDF_OPTIONS };
        if (pdf_generation_options && Object.keys(pdf_generation_options).length) {
            options = { ...pdf_generation_options };
        }

        // PDF generation
        let file = await exportHtmlToPdf(content, options);

        // PDF upload to S3 (again working on a copy of the defaults)
        let upload_args = { ...PDF_UPLOAD_ARGS };
        if (pdf_upload_extra_args && Object.keys(pdf_upload_extra_args).length) {
            upload_args = { ...pdf_upload_extra_args };
        }
        upload_args.Bucket = pdf_bucket;
        upload_args.Key = pdf_file_path;
        upload_args.Body = file;
        let file_upload_data = await s3.upload(upload_args).promise();

        // Response formatting
        let message = '';
        let url = '';
        let status;
        if (file_upload_data.Location) {
            status = 200;
            url = file_upload_data.Location;
            message = 'PDF generated successfully';
        } else {
            status = 400;
            message = 'Error in generating pdf';
        }
        let response_message = {
            message: message,
            version: version,
            resource_lock_id: resource_lock_id
        };
        let body = { message: response_message, url: url };
        return {
            statusCode: status,
            body: JSON.stringify(body)
        };
    } catch (error) {
        // JSON.stringify(error) yields "{}" for Error objects, so send the message explicitly
        console.error(error);
        return {
            statusCode: 500,
            body: JSON.stringify({ message: error.message })
        };
    }
};
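The file-extension handling inside pdfGenerator can also be isolated into a small pure function, which makes it easy to test without S3 or PhantomJS. The following is a sketch under that assumption: `normalizePdfPath` is a hypothetical name, it mirrors the handler's dot-based check, and its rejection of an empty path is an addition that the handler above does not make.

```javascript
const PDF_EXTENSION = '.pdf';

// Hypothetical helper mirroring the handler's path checks.
// Returns { ok: true, path } or { ok: false, reason }.
function normalizePdfPath(filePath) {
  if (!filePath) {
    // Addition for illustration: the handler does not reject an empty path
    return { ok: false, reason: 'Missing pdf file path' };
  }
  const hasExtension = filePath.split('.').length > 1;
  if (hasExtension && !filePath.endsWith(PDF_EXTENSION)) {
    return { ok: false, reason: 'Incorrect pdf file extension' };
  }
  return { ok: true, path: hasExtension ? filePath : filePath + PDF_EXTENSION };
}

console.log(normalizePdfPath('invoices/order-123'));     // { ok: true, path: 'invoices/order-123.pdf' }
console.log(normalizePdfPath('invoices/order-123.txt')); // { ok: false, reason: 'Incorrect pdf file extension' }
```

Like the handler, this treats any dot in the path as the start of an extension, so a dot in a folder name also counts.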

exportHtmlToPdf function:

const exportHtmlToPdf = (html, options) => {
    // Copy the options so the caller's object is not mutated, and point
    // html-pdf at the PhantomJS binary shipped in the layer
    const pdf_options = { ...options, phantomPath: '/opt/phantomjs_linux-x86_64' };
    return new Promise((resolve, reject) => {
        pdf.create(html, pdf_options).toBuffer((err, buffer) => {
            if (err) {
                console.log('Error in exportHtmlToPdf');
                reject(err);
            } else {
                resolve(buffer);
            }
        });
    });
};

Another important piece of configuration inside the exportHtmlToPdf function is phantomPath, which is set to /opt/phantomjs_linux-x86_64. Without this path you will get an error saying that PhantomJS was not found.

Our function is now ready. Let us set up the serverless.yml file to deploy it.

touch serverless.yml

Use the following code in serverless.yml

service: pdfGenerator

provider:
  name: aws
  runtime: nodejs12.x
  stage: ${opt:stage, 'dev'}
  region: ${opt:region, 'ap-south-1'}
  environment:
    S3_BUCKET: file-upload-bucket

functions:
  pdfGenerator:
    handler: handler.pdfGenerator
    layers:
      - ${cf:executables-layer-${self:provider.stage}.PDFGeneratorLayerExport}

# serverless optimization
package:
  individually: true

custom:
  webpack:
    webpackConfig: ../webpack.config.js
    includeModules:
      forceExclude:
        - aws-sdk
      packagePath: ../package.json

plugins:
  - serverless-webpack
  - serverless-offline

We are now ready for deployment. Deploy the function with the following command:

sls deploy --stage dev

The function is now ready to use. After deployment you will get an endpoint such as https://xxxxxxxx.execute-api.ap-south-1.amazonaws.com/dev/api/pdfGenerator; the region may be different for you. Invoke the API with any tool you like.
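On the calling side, the object returned by the handler has a statusCode and a JSON string body that carries the PDF URL together with the version and resource_lock_id. A minimal sketch for unpacking it is shown below; `parsePdfResponse` is a hypothetical helper, and how the handler's return value reaches the caller over HTTP depends on the API Gateway integration, which this sketch does not assume.

```javascript
// Hypothetical helper: unpacks the { statusCode, body } object produced
// by the handler in this article, or throws with the reported message.
function parsePdfResponse(response) {
  const body = JSON.parse(response.body);
  if (response.statusCode !== 200 || !body.url) {
    // Success/failure bodies nest the text under message.message;
    // the extension check returns it directly under message
    const detail =
      body.message && typeof body.message === 'object' ? body.message.message : body.message;
    throw new Error(`PDF generation failed (${response.statusCode}): ${detail || 'unknown error'}`);
  }
  return {
    url: body.url,
    version: body.message.version,
    resourceLockId: body.message.resource_lock_id,
  };
}

// Sample response shaped like the handler's success case (values are placeholders)
const sample = {
  statusCode: 200,
  body: JSON.stringify({
    message: { message: 'PDF generated successfully', version: 1, resource_lock_id: 'order-123' },
    url: 'https://example-pdf-bucket.s3.amazonaws.com/invoices/order-123.pdf',
  }),
};
console.log(parsePdfResponse(sample).url);
```

Returning version and resource_lock_id alongside the URL lets the caller match the response to the request that produced it.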

Production Metrics:

Over time we have monitored the Lambda production metrics and refactored the PDF Generator Lambda function to improve the average invocation duration and keep the success rate high. To further improve the efficiency of the Lambda function, we are integrating it with AWS SQS to support a queuing mechanism. Thanks for reading the article; I hope it has helped you. Stay tuned for further articles from 1mg.

Written by Anushka Rustagi