Convert Doc or Docx to pdf using AWS Lambda

Word → PDF with zero administration!

Kuharan Bhowmik
Analytics Vidhya


Table of contents:

Introduction
Idea
Prebuilt Layers
Brotlipy
Final step
Quick walkthrough

Introduction

Most of us have converted a .doc file to PDF on a local machine, but the question here is how to convert on the fly using AWS Lambda, the AWS serverless compute engine, which runs on Amazon Linux. Unlike Windows, which has win32com, Lambda offers no way to install Microsoft Office and use its APIs!

Many of the other libraries I found online use Microsoft Office in the backend to convert to PDF. Microsoft Office also comes with a license cost.

Thankfully there is an open-source tool, LibreOffice, but how do you install it on a serverless system? Without wasting further time, I followed the steps over here to create my own LibreOffice layer. It generated a Brotli-compressed package, lo.tar.br, containing LibreOffice v6.4.0.1. You can also choose to create a gzip package instead.

Additionally, LibreOffice comes with a command-line tool that can be used to convert to PDF.

libreoffice [--accept=accept-string] [--base] [--calc] [--convert-to output_file_extension[:output_filter_name] [--outdir output_dir] file]... [--display display] [--draw] [--global] [--headless] [--help|-h|-?] [--impress] [--invisible] [--infilter="<filter>"] [--math] [--minimized] [-n file]... [--nodefault] [--nolockcheck] [--nologo] [--norestore] [-o file]... [-p file...] [--print-to-file [--printer-name printer_name] [--outdir output_dir] file]... [--pt printername file...] [--show Impress file]... [--unaccept=accept-string] [--terminate_after_init] [--view file]... [--web] [--writer] [file...]

For Docker users: put the file lo.tar.br in the /opt directory. The code takes care of the rest while processing.

# Required for LibreOffice
COPY ./path/lo.tar.br /opt/

Idea

Now the idea was to do the following (roughly):

  1. The layer just adds the /opt/lo.tar.br or /opt/lo.tar.gz file to your Lambda runtime.
  2. During Lambda execution, unpack /opt/lo.tar.br or /opt/lo.tar.gz into the /tmp folder, which has 512 MB of free space.
  3. The LibreOffice binary will then be available at /tmp/instdir/program/soffice.bin
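
The steps above can be sketched for the gzip variant of the package, which the code later in this article does not cover. This is a minimal sketch rather than the layer's official loader: the helper names ensure_unpacked and soffice_path are hypothetical, and it assumes the archive contains a top-level instdir directory, as the /tmp/instdir check in the final code suggests.

```python
import os
import tarfile


def ensure_unpacked(archive_path, target_dir, install_dir):
    """Extract a gzip-compressed tar archive into target_dir once.

    Returns True if extraction happened, False if install_dir already
    existed and the cached copy was reused (warm Lambda container).
    """
    if os.path.isdir(install_dir):
        return False
    with tarfile.open(archive_path, mode="r:gz") as tar:
        tar.extractall(target_dir)
    return True


def soffice_path(install_dir):
    """Path of the LibreOffice binary inside the unpacked directory."""
    return os.path.join(install_dir, "program", "soffice.bin")
```

Inside a Lambda function this would presumably be called as ensure_unpacked("/opt/lo.tar.gz", "/tmp", "/tmp/instdir"), followed by soffice_path("/tmp/instdir").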

There are also some prebuilt layer ARNs that you can use if you don't have time to build your own layer.

Prebuilt Layers

AWS Region: Layer ARN (brotli) or Layer ARN (gzip)

us-east-1: arn:aws:lambda:us-east-1:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:us-east-1:764866452798:layer:libreoffice-gzip:1

eu-west-1: arn:aws:lambda:eu-west-1:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:eu-west-1:764866452798:layer:libreoffice-gzip:1

eu-central-1: arn:aws:lambda:eu-central-1:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:eu-central-1:764866452798:layer:libreoffice-gzip:1

us-west-2: arn:aws:lambda:us-west-2:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:us-west-2:764866452798:layer:libreoffice-gzip:1

us-east-2: arn:aws:lambda:us-east-2:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:us-east-2:764866452798:layer:libreoffice-gzip:1

ap-southeast-2: arn:aws:lambda:ap-southeast-2:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:ap-southeast-2:764866452798:layer:libreoffice-gzip:1

eu-west-2: arn:aws:lambda:eu-west-2:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:eu-west-2:764866452798:layer:libreoffice-gzip:1

ap-southeast-1: arn:aws:lambda:ap-southeast-1:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:ap-southeast-1:764866452798:layer:libreoffice-gzip:1

ap-south-1: arn:aws:lambda:ap-south-1:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:ap-south-1:764866452798:layer:libreoffice-gzip:1

ca-central-1: arn:aws:lambda:ca-central-1:764866452798:layer:libreoffice-brotli:1 or arn:aws:lambda:ca-central-1:764866452798:layer:libreoffice-gzip:1

sa-east-1: arn:aws:lambda:sa-east-1:764866452798:layer:libreoffice-brotli:1
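
Since the ARNs above all follow one pattern, a small helper can build them and attach the layers to a function. Treat this as a sketch under assumptions: libreoffice_layer_arn and attach_layers are hypothetical helper names, the region list is copied from the table above (the gzip entry for sa-east-1 is cut off there, so it is unconfirmed), and I am assuming version 1 is still the one you want. The boto3 call update_function_configuration replaces the function's whole layer list, so pass every layer the function needs.

```python
ACCOUNT_ID = "764866452798"  # publisher account taken from the ARNs above

# Regions listed in the table above; the gzip ARN for sa-east-1 is
# truncated in that table, so treat that combination as unconfirmed.
REGIONS = {
    "us-east-1", "eu-west-1", "eu-central-1", "us-west-2", "us-east-2",
    "ap-southeast-2", "eu-west-2", "ap-southeast-1", "ap-south-1",
    "ca-central-1", "sa-east-1",
}


def libreoffice_layer_arn(region, compression="brotli", version=1):
    """Build a prebuilt LibreOffice layer ARN following the table's pattern."""
    if region not in REGIONS:
        raise ValueError(f"no prebuilt layer listed for region {region!r}")
    if compression not in ("brotli", "gzip"):
        raise ValueError("compression must be 'brotli' or 'gzip'")
    return (
        f"arn:aws:lambda:{region}:{ACCOUNT_ID}"
        f":layer:libreoffice-{compression}:{version}"
    )


def attach_layers(function_name, layer_arns):
    """Set the layers of an existing Lambda function (replaces the old list)."""
    import boto3  # imported lazily so the ARN helper works without boto3

    client = boto3.client("lambda")
    return client.update_function_configuration(
        FunctionName=function_name,
        Layers=list(layer_arns),
    )
```

For example, libreoffice_layer_arn("us-east-1") gives the first brotli ARN in the table.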

Brotlipy

This library contains Python bindings for the reference Brotli encoder/decoder, available here. This allows Python software to use the Brotli compression algorithm directly from Python code.

The next step is to build a layer for brotlipy. I have created one for myself and saved it here for reuse.

Add this layer to the function as well, so that the function has two layers: LibreOffice and brotlipy.

Final step

Here is some quick code in Python 3.8 for the conversion.

import os
import subprocess
import tarfile
from io import BytesIO

import boto3
import brotli

libre_office_install_dir = '/tmp/instdir'


def load_libre_office():
    # /tmp survives between invocations of a warm container, so extract only once.
    if os.path.exists(libre_office_install_dir) and os.path.isdir(libre_office_install_dir):
        print('We have a cached copy of LibreOffice, skipping extraction')
    else:
        print('No cached copy of LibreOffice, extracting tar stream from Brotli file.')
        buffer = BytesIO()
        with open('/opt/lo.tar.br', 'rb') as brotli_file:
            d = brotli.Decompressor()
            # Decompress in 1 KB chunks; a short read means end of file.
            while True:
                chunk = brotli_file.read(1024)
                buffer.write(d.decompress(chunk))
                if len(chunk) < 1024:
                    break
            buffer.seek(0)

        print('Extracting tar stream to /tmp for caching.')
        with tarfile.open(fileobj=buffer) as tar:
            tar.extractall('/tmp')
        print('Done caching LibreOffice!')

    return f'{libre_office_install_dir}/program/soffice.bin'


def download_from_s3(bucket, key, download_path):
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, download_path)


def upload_to_s3(file_path, bucket, key):
    s3 = boto3.client("s3")
    s3.upload_file(file_path, bucket, key)


def convert_word_to_pdf(soffice_path, word_file_path, output_dir):
    conv_cmd = f"{soffice_path} --headless --norestore --invisible --nodefault --nofirststartwizard --nolockcheck --nologo --convert-to pdf:writer_pdf_Export --outdir {output_dir} {word_file_path}"
    response = subprocess.run(conv_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if response.returncode != 0:
        # Retry once before giving up.
        response = subprocess.run(conv_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if response.returncode != 0:
            return False
    return True


def lambda_handler(event, context):
    bucket = "xxxx"
    key = "xxxx/xxxx/xxxx/xxxx/SampleDoc.docx"
    key_prefix, base_name = os.path.split(key)
    download_path = f"/tmp/{base_name}"
    output_dir = "/tmp"

    download_from_s3(bucket, key, download_path)
    soffice_path = load_libre_office()

    is_converted = convert_word_to_pdf(soffice_path, download_path, output_dir)
    if is_converted:
        file_name, _ = os.path.splitext(base_name)
        upload_to_s3(f"{output_dir}/{file_name}.pdf", bucket, f"{key_prefix}/{file_name}.pdf")
        return {"response": "file converted to PDF and available at same S3 location of input key"}
    else:
        return {"response": "cannot convert this document to PDF"}

Quick walkthrough

load_libre_office installs LibreOffice by decompressing it into /tmp if it is not already present, and reuses that copy for as long as the function stays warm.
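
To make the chunked decompression step easier to see in isolation, here is a small sketch of the same loop as a standalone function. It assumes only that the decompressor object has a decompress(bytes) method, which is how brotli.Decompressor is used in the code above. The name stream_decompress is hypothetical, and the demo uses zlib from the standard library in place of Brotli so that it runs without any layer. Unlike the loop above, it stops on an empty read rather than a short one.

```python
import zlib
from io import BytesIO


def stream_decompress(source, decompressor, chunk_size=1024):
    """Read source in chunks, decompress each one, and return a rewound buffer."""
    buffer = BytesIO()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        buffer.write(decompressor.decompress(chunk))
    buffer.seek(0)
    return buffer


# Demo with zlib standing in for Brotli (assumption: same decompress() shape).
payload = b"LibreOffice " * 1000
compressed = BytesIO(zlib.compress(payload))
restored = stream_decompress(compressed, zlib.decompressobj())
print(restored.read() == payload)
```

The returned buffer can be handed straight to tarfile.open(fileobj=...), as load_libre_office does.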

download_from_s3 downloads the Word file to /tmp, convert_word_to_pdf converts it, and upload_to_s3 uploads the result to S3.

convert_word_to_pdf uses a subprocess to run the conversion command, and retries once if the first attempt fails.
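
One thing to watch: building the command as a single string and calling split() breaks as soon as a file name contains a space. A possible variant, sketched here with the same flags as the code above, builds the argument list directly. The names build_convert_command and run_conversion are hypothetical, and the attempts parameter is my own addition to mirror the retry in the original function.

```python
import subprocess


def build_convert_command(soffice_path, input_path, output_dir):
    """Return the conversion command as an argument list (no shell splitting)."""
    return [
        soffice_path,
        "--headless",
        "--norestore",
        "--invisible",
        "--nodefault",
        "--nofirststartwizard",
        "--nolockcheck",
        "--nologo",
        "--convert-to", "pdf:writer_pdf_Export",
        "--outdir", output_dir,
        input_path,  # kept as one argument even if it contains spaces
    ]


def run_conversion(command, attempts=2):
    """Run the command up to `attempts` times; True on the first zero exit code."""
    for _ in range(attempts):
        result = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        if result.returncode == 0:
            return True
    return False
```

With this, a key such as "My Report.docx" reaches LibreOffice as a single path instead of two separate arguments.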

Currently, the code puts the PDF in the same S3 location as the input file, but you can always modify it to write somewhere else.
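
If you do want the PDF to land elsewhere, one option is to compute the output key in a small helper. This is only a sketch: pdf_key_for and its output_prefix parameter are hypothetical names, not part of the code above. It also avoids a leading slash in the key when the input object sits at the bucket root.

```python
import posixpath


def pdf_key_for(source_key, output_prefix=None):
    """Return the S3 key for the converted PDF.

    By default the PDF goes next to the source object; pass output_prefix
    to send it to a different "folder" in the bucket.
    """
    prefix, base_name = posixpath.split(source_key)
    stem, _ = posixpath.splitext(base_name)
    if output_prefix is not None:
        prefix = output_prefix.strip("/")
    pdf_name = f"{stem}.pdf"
    return f"{prefix}/{pdf_name}" if prefix else pdf_name
```

In lambda_handler, the third argument of upload_to_s3 could then be pdf_key_for(key) or, say, pdf_key_for(key, "converted").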

The important part is that the whole deployment package is still only 1.1 KB, since LibreOffice and brotlipy are supplied by the layers.

Hope this helps resolve the big mess. Thank you for reading. Give it a clap if you like it. In case there are issues let me know in the comments below. I’ll be happy to help. Until next time. Bye!
