Running Pyppeteer on AWS Lambda with serverless

Published in

limehome-engineering

3 min read · Oct 14, 2020

At Limehome, I work as a Senior Data Engineer on our Business Intelligence team. For the company to scale, it is crucial to reduce the amount of manual work we have to do, so we automate repetitive tasks as much as possible. Sometimes this involves interacting with a website through a browser, for example to collect some data or to test a feature. We do this with the Puppeteer library.

What is Puppeteer?

Puppeteer is a JavaScript library that lets you remote-control a Chrome browser. It can do most things that a real user could do manually. Taken straight from the readme, here are some scenarios where this might be helpful:

Generate screenshots and PDFs of pages.
Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. “SSR” (Server-Side Rendering)).
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
Capture a timeline trace of your site to help diagnose performance issues.

Pyppeteer

If you’d rather program in Python than in JavaScript, you can use the excellent pyppeteer library, an unofficial Python port of Puppeteer. If you just want to try it out, you can install the library with:

pip install pyppeteer

From there, you can follow the simple example from the readme and you will see a browser launch and execute your commands. The first time you start pyppeteer, it downloads a matching Chromium executable for you.
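To give an idea of what such a first script can look like, here is a minimal sketch in the spirit of the readme example. The function name take_screenshot, the URL and the output path are placeholders I made up for illustration; the pyppeteer calls (launch, newPage, goto, screenshot, close) are the ones the library provides.

```python
import asyncio


async def take_screenshot(url: str, output_path: str) -> None:
    """Open `url` in headless Chromium and save a screenshot to `output_path`."""
    # Imported lazily so this module can be imported without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch()  # downloads Chromium on the very first run
    try:
        page = await browser.newPage()
        await page.goto(url)
        await page.screenshot({"path": output_path})
    finally:
        # Always close the browser, even if navigation fails.
        await browser.close()


# Example call (placeholder URL and path):
# asyncio.run(take_screenshot("https://example.com", "example.png"))
```

Closing the browser in a finally block keeps stray Chromium processes from piling up when a page fails to load.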

Running it on AWS Lambda

If you want to run pyppeteer on AWS Lambda in headless mode, there are a couple of obstacles that you have to overcome.

The binary included in the pyppeteer project won’t work in the Lambda environment, since a couple of shared libraries it needs are missing there.
There is a size limit on how big your deployed Lambda package can be. This means that we can’t just bundle another binary with the deployment and execute it directly.

In the following section, you will find a step-by-step guide on how to use pyppeteer on AWS Lambda with serverless. This setup only works on Python < 3.8. It was tested with the following versions:

pyppeteer==0.2.2
python 3.7
serverless-chrome v1.0.0-55

Step 1:

As I mentioned before, the binary supplied by pyppeteer will not work in the Lambda environment. Luckily, there is another project which offers a compatible binary. You can find it here: serverless-chrome. Because the binary is too big to be included in the deployed package, download this version and upload it to an S3 location that your Lambda is allowed to read from.
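If you prefer to script the upload instead of using the AWS console, a minimal sketch with boto3 could look like the following. The helper name upload_chrome_zip, the bucket and the key are assumptions for illustration; upload_file is the regular boto3 S3 client call. The optional s3_client argument only exists so the function can be exercised without real AWS credentials.

```python
def upload_chrome_zip(zip_path, bucket, key, s3_client=None):
    """Upload the downloaded serverless-chrome zip to S3 and return its URI.

    `bucket` and `key` are placeholders; use a location your Lambda's
    IAM role is allowed to read from.
    """
    if s3_client is None:
        # Imported lazily so the helper can be tested without boto3.
        import boto3

        s3_client = boto3.client("s3")
    s3_client.upload_file(zip_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example call (placeholder names):
# upload_chrome_zip("chrome.zip", "YOUR_BUCKET", "S3_PATH")
```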

Step 2:

The first thing we have to do when executing our Lambda is to download the Chrome binary from S3. You only have to do this if your Lambda is “cold”, meaning it hasn’t run in a while and was killed by AWS. Please note that you cannot write to all paths. I chose /tmp since I have read and write permissions there. Here is some sample code to download the binary and unzip it:


import os
import subprocess
import zipfile

import boto3

s3_bucket = boto3.resource("s3").Bucket("YOUR_BUCKET")
zip_file_path = "/tmp/chrome.zip"
# Only download on a cold start; files in /tmp survive warm invocations.
if not os.path.exists(zip_file_path):
    s3_bucket.download_file("S3_PATH", zip_file_path)
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall("/tmp")
    # Make the extracted binary executable.
    subprocess.run(["chmod", "+x", "/tmp/headless-chromium"], check=True)
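The cold/warm distinction can also be wrapped in a small reusable function. The following is only a sketch of one possible variant, not the code I run in production: ensure_chrome and its download parameter are hypothetical names, and download stands for any callable that writes the zip to the given path (for example a thin wrapper around Bucket.download_file). Unlike the snippet above, it checks for the extracted binary rather than the zip, and it deletes the zip afterwards to free space in /tmp.

```python
import os
import stat
import zipfile


def ensure_chrome(download, work_dir="/tmp", binary_name="headless-chromium"):
    """Return the path to the Chrome binary, fetching it only when missing."""
    binary_path = os.path.join(work_dir, binary_name)
    if os.path.exists(binary_path):
        # Warm start: a previous invocation already unpacked the binary.
        return binary_path

    # Cold start: fetch the archive, unpack it, then remove the archive.
    zip_path = os.path.join(work_dir, "chrome.zip")
    download(zip_path)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(work_dir)
    os.remove(zip_path)

    # Equivalent of `chmod +x` without spawning a subprocess.
    mode = os.stat(binary_path).st_mode
    os.chmod(binary_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return binary_path
```

Calling it twice in the same execution environment triggers only one download, which is the behaviour you want for warm invocations.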

Step 3:

In the next step, we have to configure pyppeteer to use our custom binary instead of the one it would download itself. Just paste the following code snippet and you should be good to go!

from pyppeteer import launch

# This call must run inside a coroutine (async def), since launch() is awaited.
browser = await launch(
    args=[
        "--no-sandbox",
        "--single-process",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--no-zygote",
    ],
    executablePath="/tmp/headless-chromium",
    userDataDir="/tmp",
)
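To show how the launch call might sit inside an actual Lambda function, here is a minimal sketch of a handler. The names fetch_title and handler, and the "url" field of the event, are assumptions for illustration; the launch arguments are the ones from the snippet above, and the sketch assumes the binary has already been unpacked to /tmp/headless-chromium.

```python
import asyncio

# Same flags as in the launch snippet from this step.
CHROME_ARGS = [
    "--no-sandbox",
    "--single-process",
    "--disable-dev-shm-usage",
    "--disable-gpu",
    "--no-zygote",
]


async def fetch_title(url, executable_path="/tmp/headless-chromium"):
    """Open `url` with the custom Chrome binary and return the page title."""
    # Imported lazily so this module can be imported without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(
        args=CHROME_ARGS,
        executablePath=executable_path,
        userDataDir="/tmp",
    )
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.title()
    finally:
        await browser.close()


def handler(event, context):
    """Lambda entry point; event["url"] is a hypothetical input field."""
    # Lambda handlers are synchronous, so drive the coroutine to completion here.
    return {"title": asyncio.run(fetch_title(event["url"]))}
```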

Step 4:

Configure your Lambda to provide enough memory. This step is very important, since the default value is too low for Chrome to start. I went with 2 GB and it worked fine.
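If you want the function to complain early when it is deployed with too little memory, you could add a small guard like the sketch below. It reads AWS_LAMBDA_FUNCTION_MEMORY_SIZE, an environment variable the Lambda runtime sets to the configured memory in MB. The helper names and the 2048 MB default are my own assumptions: 2048 is simply the value that worked for me, not a verified minimum.

```python
import os


def configured_memory_mb(environ=None):
    """Return the configured Lambda memory in MB, or None outside Lambda."""
    env = os.environ if environ is None else environ
    value = env.get("AWS_LAMBDA_FUNCTION_MEMORY_SIZE")
    return int(value) if value is not None else None


def has_enough_memory(min_mb=2048, environ=None):
    """Check the configured memory against `min_mb` (an assumed threshold)."""
    memory = configured_memory_mb(environ)
    # Outside Lambda the variable is unset; treat that as "unknown, proceed".
    return memory is None or memory >= min_mb


# Example use at the top of a handler:
# if not has_enough_memory():
#     raise RuntimeError("Increase the Lambda memory before launching Chrome")
```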

Conclusion

Automating tasks with the help of Puppeteer can be a real productivity gain. Instead of manually browsing websites, we can write a script to do the work. Combining all three technologies was not trivial, as I could not find any documentation on how to do so. But by following the steps above, you can get started quickly and easily.

Written by Raphael Brand