How to web scrape with Puppeteer in Google Cloud Functions

In this article, I will use JavaScript (Node.js) for the code, Yarn as a package manager for Node, and apt-get for OS dependencies.

When you need data from a source that doesn’t provide an API, web scraping is often the only option. One way to do it is to combine Puppeteer with Google Cloud Functions. Puppeteer is a library that uses Chromium to automate browser interactions. However, running a browser is time-consuming and heavy on CPU and memory. So in order to keep your app light, you may want to execute this code in a cloud environment like Google Cloud Functions (the equivalent of AWS Lambda).

Basic configuration

Let’s start by initializing a Node project in a new directory:

$ yarn init -y

Then, from the same project directory, install Puppeteer:

$ yarn add puppeteer

This also downloads a recent version of Chromium that is known to work with the installed Puppeteer release, roughly 200 MB depending on your OS.

In order to test and deploy your functions, you will need to install the Google Cloud SDK and the Google Cloud Functions Emulator. To get the SDK, run the following command (on Ubuntu, once the Cloud SDK package source has been added to apt):

$ sudo apt-get install google-cloud-sdk

This SDK will allow you to deploy your functions. But before that, you will need to test them locally with the functions emulator:

$ yarn global add @google-cloud/functions-emulator --ignore-engines

The --ignore-engines option will very likely be required. Currently, the Google Cloud Functions Emulator is only fully compatible with Node 6. If your Node version is higher than that, Yarn refuses to install the package unless you skip the engine check with this option.

So basically, your project only needs two files:

  • index.js for your JavaScript code
  • package.json for the Puppeteer dependency and your scripts

Here, package.json contains the basic scripts to test your function locally and deploy it:
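A minimal sketch of such a package.json follows. The project name, the Puppeteer version range and the nodejs8 runtime value are assumptions chosen for illustration; adjust them to the versions you actually use.

```json
{
  "name": "puppeteer-cloud-function",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "start": "functions start && functions deploy scrapingExample --trigger-http",
    "deploy": "gcloud functions deploy scrapingExample --trigger-http --runtime nodejs8"
  },
  "dependencies": {
    "puppeteer": "^1.11.0"
  }
}
```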

This file contains the main dependency of this project, puppeteer, and two scripts to test and deploy your function. Both scripts rely on scrapingExample, the name used in the example below with exports.scrapingExample.

  • The deploy script puts your function on a remote cloud environment. --trigger-http exposes the function through an HTTP endpoint, so it runs whenever that URL receives a request. --runtime selects the execution environment (other runtimes, such as Node 6, Go and Python, are available). The complete list of options is in the gcloud functions deploy reference.
  • The start script launches the functions emulator and deploys the function locally with the same --trigger-http flag described above.

The following code is a basic configuration for index.js:
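A minimal sketch of that configuration, built from the description in this section. The user agent string, the viewport size, the h3 selector and the optional launcher parameter are all assumptions of this sketch, not values taken from a real deployment.

```javascript
// Launch options shared by every execution.
const PUPPETEER_OPTIONS = {
  headless: true, // must be true once deployed; set to false to watch locally
  args: ['--disable-gpu', '--no-sandbox', '--timeout=30000'],
};

// Arbitrary desktop user agent, used only for illustration.
const USER_AGENT =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36';

// `launcher` is an assumption of this sketch: it defaults to the real
// puppeteer module, but a stub can be passed in for testing.
const openConnection = async (launcher = require('puppeteer')) => {
  const browser = await launcher.launch(PUPPETEER_OPTIONS);
  const page = await browser.newPage();
  await page.setUserAgent(USER_AGENT);
  await page.setViewport({ width: 1680, height: 1050 });
  return { browser, page };
};

// Tolerates undefined arguments, e.g. when openConnection threw.
const closeConnection = async (page, browser) => {
  if (page) await page.close();
  if (browser) await browser.close();
};

// HTTP entry point. Cloud Functions only passes (req, res), so `launcher`
// stays undefined there and the real puppeteer module is used.
const scrapingExample = async (req, res, launcher) => {
  let browser;
  let page;
  try {
    ({ browser, page } = await openConnection(launcher));
    await page.goto('https://medium.com', { waitUntil: 'load' });
    // 'h3' is a placeholder selector; inspect the real page to pick yours.
    const firstArticleTitle = await page.evaluate(
      () => document.querySelector('h3').innerText
    );
    res.status(200).send(firstArticleTitle);
  } catch (err) {
    res.status(500).send(err.message);
  } finally {
    // Always release the page and the browser, even after an error.
    await closeConnection(page, browser);
  }
};

exports.scrapingExample = scrapingExample;
```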

There is a lot of boilerplate here: the only lines that actually scrape are the few inside scrapingExample that open the page and read the title! However, we’ll go through the rest of the code to understand what happens.

First, we import puppeteer and declare its options:

  • headless is one of the most important options. When you test your function locally, set it to false to see what happens in your browser. Every action of your script will be visible. Nevertheless, you must set it to true before deploying to Google Cloud Functions. Otherwise, the execution will crash because the environment has no display on which to render Chromium’s GUI.
  • args contains a list of useful options. Some of them are pretty explicit, like --disable-gpu or --timeout=30000, and others, like --no-sandbox, are here to prevent crashes in some environments. The complete list of arguments is in the list of Chromium command-line switches.

Then finally comes the code, split into 3 functions:

  • openConnection initializes all the necessary objects to browse with Puppeteer. It also sets a few parameters like the user agent and the viewport, necessary for some websites.
  • closeConnection destroys the objects initialized before and must be called at the end of every execution, regardless of the results of the execution. I’ll explain why in the Tips and tricks section.
  • scrapingExample is the main function, which is going to be called by the functions emulator and deployed in Google Cloud. The exports. before the function name makes it available for Google Cloud Functions. In order to keep this example simple, it only does a simple thing: go to the Medium homepage, get its first article title, and return it.

Interactions with Google Cloud Storage

At some point, you may need persistent data. The execution environment of your Google Cloud Function is not the place for it: storage does exist there, but it is temporary and very limited. To store a large number of files, use a cloud storage service like Google Cloud Storage or AWS S3. Just know that with Google Cloud’s free plan, you cannot send data to another IP, so in this case, forget about Amazon S3 and go for Google Cloud Storage.

There are several ways to upload files to cloud storage. The most elegant one (not always possible) is to download your file (through axios for example) and pipe it to your remote bucket. This way, you never store anything in your Cloud Function environment, and you avoid a lot of potential problems, like available storage or file naming.
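The piping approach can be sketched as follows. The helper name streamToBucket is hypothetical, and the sketch assumes that bucket is a Bucket object from the @google-cloud/storage client, whose file(name).createWriteStream() returns a writable stream.

```javascript
const { pipeline } = require('stream');
const { promisify } = require('util');

const pipelineAsync = promisify(pipeline);

// Pipe any readable stream straight into a bucket object, without touching
// the local disk. Resolves once the upload stream has finished, and rejects
// if either side of the pipe fails.
const streamToBucket = (readable, bucket, destination) =>
  pipelineAsync(readable, bucket.file(destination).createWriteStream());
```

With axios, the readable stream would be the data property of the response returned by axios.get(url, { responseType: 'stream' }).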

But sometimes, piping directly is not possible so you need to store your files in a temporary directory before uploading them. There is a simple way to initialize and use Google Cloud Storage with Puppeteer:

Here, we do several things:

  1. We import Puppeteer and Google Cloud Storage.
  2. We initialize our bucket and Puppeteer.
  3. We allow Puppeteer to download files and we define the storage location. In the context of a Google Cloud Function, you can only write to the /tmp/ directory.
  4. We scrape our file: Puppeteer goes to the page, clicks the link (which downloads the file to /tmp/) and uploads it to Google Cloud Storage.
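Steps 3 and 4 above can be sketched as below. The helper names, the polling strategy used to detect the end of the download, and the parameter names are assumptions of this sketch. It relies on the experimental Page.setDownloadBehavior DevTools command, sent through page.target().createCDPSession(), which may not be available in every Puppeteer version, and it expects bucket to be a Bucket object from @google-cloud/storage.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Poll until the downloaded file shows up on disk, or give up after timeoutMs.
const waitForFile = async (filePath, timeoutMs = 30000, intervalMs = 250) => {
  const deadline = Date.now() + timeoutMs;
  while (!fs.existsSync(filePath)) {
    if (Date.now() > deadline) {
      throw new Error(`Download timed out: ${filePath}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
};

const downloadAndUpload = async (
  page,
  bucket,
  { url, linkSelector, fileName, downloadDir = os.tmpdir() }
) => {
  // Step 3: allow downloads and send them to a writable directory
  // (os.tmpdir() is /tmp on Linux).
  const client = await page.target().createCDPSession();
  await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: downloadDir,
  });

  // Step 4: open the page, click the download link, wait for the file,
  // then upload it to the bucket.
  await page.goto(url, { waitUntil: 'load' });
  await page.click(linkSelector);
  const localPath = path.join(downloadDir, fileName);
  await waitForFile(localPath);
  await bucket.upload(localPath, { destination: fileName });

  // Free the limited temporary storage once the upload is done.
  fs.unlinkSync(localPath);
  return fileName;
};
```

In recent versions of the client library, the bucket would typically be created with new Storage().bucket('my-bucket'), where Storage is imported from @google-cloud/storage and my-bucket is a placeholder name.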

Handling bad website design

As a programmer, it’s a common thing to say it’s someone else’s fault. And when you do web scraping… this may be true! In fact, a website can be very poorly designed at several levels, making it difficult to scrape.

One problem you may encounter is related to page loading. Puppeteer provides several functions to wait for events. For example, after triggering a navigation, you can wait for the new page with await page.waitForNavigation({ waitUntil: 'load' }). However, bad website design can make the next instruction crash because the HTML element you are looking for does not exist yet: some websites fire the load event when the new page contains nothing but a loader element. In that case, it’s preferable to wait for the element itself with await page.waitForSelector('.mySelector'). The good thing about these two functions is that they accept an optional timeout option. This can be useful on websites with a long loading time: the default timeout is 30 seconds, after which the call throws an error.
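A minimal sketch of that pattern, assuming a hypothetical helper named openAndRead: it navigates, waits for the element rather than trusting the load event alone, and applies the same explicit timeout to both steps.

```javascript
// Navigate to `url`, wait until `selector` exists, and return its text.
const openAndRead = async (page, url, selector, timeoutMs = 30000) => {
  await page.goto(url, { waitUntil: 'load', timeout: timeoutMs });
  // 'load' may fire while only a loader is displayed, so wait for the
  // element itself before reading it.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  return page.$eval(selector, (element) => element.textContent.trim());
};
```

For a slow website, a call such as openAndRead(page, url, '.mySelector', 60000) would raise both timeouts to one minute.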

You also need to be careful with navigation links. Sometimes the information you want to scrape won’t be on a page directly accessible by URL. Some websites load data as you navigate, and you may need to reproduce a full “human” browsing session to get the information you need.

Finally, be very precise with your CSS selectors! Some websites use the same id on several elements, which can make your code select the wrong one. When possible, use the child combinator (>) or other combinators to prevent any ambiguity: #results > .title, for example, only matches a .title element that is a direct child of #results.

Tips and tricks

Memory management

Your Google Cloud Function can run out of memory if you are not careful. Puppeteer launches Chromium, and you need to instantiate big objects (like browser or page) to use it. In the Basic configuration example above, closeConnection is called in the finally block. This destroys the objects and frees the memory as you exit the function, whether it succeeded or failed. Many Puppeteer examples don’t destroy anything when an error occurs. After several failed executions, your environment’s memory can then fill up, and the first instruction, puppeteer.launch(PUPPETEER_OPTIONS), will crash.

Debugging

In the Google Cloud Management Console, you have access to logs that give you information about the remote execution of your functions. But for your local logs, you can use:

$ functions logs read

To clear them, just execute (sudo may be required here):

$ functions logs clear

DOM interactions

In order to get information on DOM elements, you can use the Puppeteer function page.evaluate(). Its callback runs inside the page, so it has access to DOM elements (through CSS selectors for example), but not to the rest of your code: a variable or function defined outside evaluate() cannot be used inside of it. To pass data in, add serializable values as extra arguments after the callback.
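As a small illustration, the sketch below passes a selector and a limit into page.evaluate() as extra arguments instead of relying on outer variables. The helper name getTexts and its parameters are hypothetical.

```javascript
// Return the trimmed text of the first `maxItems` elements matching `selector`.
const getTexts = (page, selector, maxItems) =>
  page.evaluate(
    // This callback runs in the browser: only its own arguments are visible.
    (sel, limit) =>
      Array.from(document.querySelectorAll(sel))
        .slice(0, limit)
        .map((element) => element.textContent.trim()),
    selector, // received as `sel`
    maxItems // received as `limit`
  );
```

Both arguments are plain values (a string and a number), so they can be serialized and sent to the page.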

Another problem with page.evaluate() is that it’s hard to debug. In fact, if you try to use console.log inside of it, you won’t see anything in your local logs. To solve this issue, add the following instruction just after you initialize the page object:

page.on('console', obj => console.log(obj.text()));

Using headless

When you test your function locally, you almost always set the headless option to false to see what happens in your browser. But when you deploy your function, headless must be set to true (otherwise it won’t work). This makes headless a perfect candidate for an environment variable.
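One possible way to wire this up is sketched below. The variable name HEADLESS and the helper buildPuppeteerOptions are assumptions of this sketch; the browser stays headless unless the variable is explicitly set to the string false.

```javascript
// Build the launch options from the environment.
// HEADLESS is a hypothetical variable name chosen for this sketch.
const buildPuppeteerOptions = (env = process.env) => ({
  // Headless by default, so a deployment without the variable still works.
  headless: env.HEADLESS !== 'false',
  args: ['--disable-gpu', '--no-sandbox'],
});
```

Locally, you would set HEADLESS=false in your shell before starting the function, and leave the variable unset in the deployed environment.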

Optimization

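Finally, a very easy way to reduce the execution time of your cloud function is to parallelize text inputs in forms. If you have forms to fill, instead of doing several sequential await page.type('.selector', fieldValue) calls, you can group them in a Promise.all. Be careful, though: all page.type calls on one page share the same keyboard and focus, so concurrent calls can interleave characters between fields. Check the resulting field values, or set them all at once inside a single page.evaluate call. Of course, the form must be submitted outside of this Promise.all, once every field has a valid value.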
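The sketch below is not the Promise.all approach itself but a swapped-in alternative: it sets every field in one page.evaluate round trip, which avoids sharing the keyboard between concurrent page.type calls. The helper name fillForm is hypothetical, and the sketch assumes the target website only needs an input event to notice the new values.

```javascript
// `fields` maps a CSS selector to the value to set,
// e.g. { '#email': 'me@example.com', '#name': 'Ada' }.
const fillForm = (page, fields) =>
  page.evaluate((entries) => {
    for (const [selector, value] of entries) {
      const input = document.querySelector(selector);
      if (!input) {
        throw new Error(`No element matches ${selector}`);
      }
      input.value = value;
      // Notify listeners that would normally react to typing.
      input.dispatchEvent(new Event('input', { bubbles: true }));
    }
  }, Object.entries(fields));
```

The form is then submitted afterwards, for example with await page.click on its submit button, once fillForm has resolved.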

I hope you found this article useful! Feel free to give me your feedback and ask any questions :)