How to web scrape with Puppeteer in Google Cloud Functions
In this article, I will use JavaScript (Node.js) for the code, Yarn as the package manager for Node, and apt-get for OS dependencies.
When you need data from a source that doesn’t provide an API, web scraping is often the answer, which is why you should consider Puppeteer combined with Google Cloud Functions. Puppeteer is a library that uses Chromium to automate browser interactions. However, this is a time-consuming process, heavy on CPU and memory. So in order to keep your app light, you may want to run this code in a cloud environment like Google Cloud Functions (the equivalent of AWS Lambda).
Basic configuration
Let’s start by initializing a node project:
$ yarn init -y
Then, `cd` to your new project and install Puppeteer:
$ yarn add puppeteer
This will download the most recent stable version of Chromium to your machine, roughly 200MB depending on your OS.
In order to test and deploy your functions, you will need to install the Google Cloud SDK and the Google Cloud Functions Emulator. To get the SDK, run the following command (on Ubuntu):
$ sudo apt-get install google-cloud-sdk
This SDK will allow you to deploy your functions. But before that, you will need to test them locally with the functions emulator:
$ yarn global add @google-cloud/functions-emulator --ignore-engines
Note: The `@google-cloud/functions-emulator` package seems to be deprecated in favor of the Functions Framework and firebase-tools. For more information, go here.
The `--ignore-engines` option will very likely be required: the Google Cloud Functions Emulator is only fully compatible with Node 6, so if your Node version is higher than that, the installation will fail unless you ignore the engines check with this option.
So basically, your project only needs two files:
- `index.js` for your JavaScript code
- `package.json` for the Puppeteer dependency and your scripts
Here, `package.json` contains the basic scripts to test your function locally and deploy it.
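It can look something like this (a minimal sketch: the nodejs8 runtime and the puppeteer version are plausible placeholders, not the exact original):

```json
{
  "name": "scraping-example",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "deploy": "gcloud functions deploy scrapingExample --trigger-http --runtime nodejs8",
    "start": "functions start && functions deploy scrapingExample --trigger-http"
  },
  "dependencies": {
    "puppeteer": "^1.12.0"
  }
}
```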
This file contains the main dependency of this project, `puppeteer`, and two scripts to test and deploy your function. Both scripts rely on `scrapingExample`, the name used in the example below with `exports.scrapingExample`.
- The `deploy` script is used to put your function on a remote cloud environment. `--trigger-http` associates an HTTP verb (by default POST) with our function. `--runtime` is the runtime used here (others are available, like Node 6, Go, and Python). The complete list of options is available here.
- The `start` script launches the functions emulator and locally deploys the function with the same `--trigger-http` flag described above.
The following code is a basic configuration for `index.js`.
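Here is a minimal sketch of it (the user agent string, the viewport size, and the `h3` selector used for Medium’s first article title are illustrative assumptions):

```javascript
const puppeteer = require('puppeteer');

const PUPPETEER_OPTIONS = {
  headless: true,
  args: ['--disable-gpu', '--timeout=30000', '--no-sandbox'],
};

// Initializes the browser and a page, with a user agent and
// viewport that some websites require
const openConnection = async () => {
  const browser = await puppeteer.launch(PUPPETEER_OPTIONS);
  const page = await browser.newPage();
  await page.setUserAgent(
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
  );
  await page.setViewport({ width: 1680, height: 1050 });
  return { browser, page };
};

// Destroys the objects created by openConnection; called on every
// execution path to free memory (see Tips and tricks)
const closeConnection = async (page, browser) => {
  if (page) await page.close();
  if (browser) await browser.close();
};

exports.scrapingExample = async (req, res) => {
  let browser;
  let page;
  try {
    ({ browser, page } = await openConnection());
    // The actual scraping: go to the Medium homepage and grab
    // the first article title (the h3 selector is an assumption)
    await page.goto('https://medium.com/', { waitUntil: 'networkidle2' });
    const title = await page.evaluate(
      () => document.querySelector('h3').textContent
    );
    res.status(200).send(title);
  } catch (err) {
    res.status(500).send(err.message);
  } finally {
    await closeConnection(page, browser);
  }
};
```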
There is a lot of boilerplate here: the only important lines are the few inside the try block that actually visit the page and grab the title! However, we’ll go through the rest of the code to understand what happens.
First, we import `puppeteer` and declare its options:
- `headless` is one of the most important options. When you test your function locally, set it to `false` to see what happens in your browser: every action of your script will be visible. However, you must set it to `true` before deploying to Google Cloud Functions. Otherwise, the execution will crash because the service cannot run Chromium’s GUI.
- `args` contains a list of useful options. Some of them are pretty explicit, like `--disable-gpu` or `--timeout=30000`, and some others, like `--no-sandbox`, are there to prevent crashes in some environments. The complete list of arguments can be found here.
Then finally comes the code, split into three functions:
- `openConnection` initializes all the objects needed to browse with Puppeteer. It also sets a few parameters, like the user agent and the viewport, that some websites require.
- `closeConnection` destroys the objects initialized before and must be called at the end of every execution, regardless of its result. I’ll explain why in the Tips and tricks section.
- `scrapingExample` is the main function, which is going to be called by the functions emulator and deployed to Google Cloud. The `exports.` before the function name makes it available to Google Cloud Functions. To keep this example simple, it does only one thing: go to the Medium homepage, get its first article title, and return it.
Interactions with Google Cloud Storage
At some point, you may need persistent data. You cannot rely on the execution environment of your Google Cloud Function for that: local storage does exist, but it is temporary and very limited. To store a large number of files, you can use a cloud storage service like Google Cloud Storage or AWS S3. Just know that with Google Cloud’s free plan, you cannot send data to another IP, so in this case, forget about Amazon S3 and go for Google Cloud Storage.
There are several ways to upload files to cloud storage. The most elegant one (not always possible) is to download your file (with axios, for example) and pipe it to your remote bucket. This way, you never store anything in your Cloud Function environment, and you avoid a lot of potential problems, like available storage or file naming. You can see an example of this method here.
But sometimes, piping directly is not possible, so you need to store your files in a temporary directory before uploading them. There is a simple way to initialize and use Google Cloud Storage with Puppeteer.
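The following sketch illustrates the approach (the bucket name, page URL, selectors, and file name are hypothetical, and `Page.setDownloadBehavior` is a private DevTools command that may change between Puppeteer versions):

```javascript
const puppeteer = require('puppeteer');
const { Storage } = require('@google-cloud/storage');

// Hypothetical bucket name
const storage = new Storage();
const bucket = storage.bucket('my-scraping-bucket');

exports.downloadExample = async (req, res) => {
  let browser;
  try {
    browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox'] });
    const page = await browser.newPage();

    // Allow Puppeteer to download files, into /tmp/ since it is the only
    // writable directory in a Cloud Function (private API, may change)
    await page._client.send('Page.setDownloadBehavior', {
      behavior: 'allow',
      downloadPath: '/tmp',
    });

    // Hypothetical page and download link
    await page.goto('https://example.com/documents', { waitUntil: 'load' });
    await page.click('.download-link');
    await page.waitFor(5000); // naive fixed wait for the download to finish

    // Upload the downloaded file (hypothetical name) to the bucket
    await bucket.upload('/tmp/report.pdf');
    res.status(200).send('File uploaded');
  } catch (err) {
    res.status(500).send(err.message);
  } finally {
    if (browser) await browser.close();
  }
};
```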
Here, we do several things:
- We import Puppeteer and Google Cloud Storage.
- We initialize our bucket and Puppeteer.
- We allow Puppeteer to download files and define the storage location. In the context of a Google Cloud Function, you can only write to the `/tmp/` directory.
- We scrape our file: Puppeteer goes to the page, clicks the link (which downloads the file to `/tmp/`), and uploads it to Google Cloud Storage.
Handling bad website design
As a programmer, it’s common to say it’s someone else’s fault. And when you do web scraping… this may actually be true! A website can be poorly designed at several levels, making it difficult to scrape.
One problem you may encounter is related to page loading. Puppeteer provides several functions to wait for events. For example, if you need to navigate to a page and get an element from it, you can use `await page.waitForNavigation({ waitUntil: 'load' })`. However, bad website design can make this instruction crash if you then try to get a nonexistent HTML element on the new page: some websites trigger the `load` event when the new page only contains a loader element. You have to be careful, and it’s sometimes preferable to use `await page.waitForSelector('.mySelector')`. The good thing about these two functions is that they take an optional `timeout` argument, which is useful on websites with a long loading time: the default timeout is 30 seconds, and you can raise it.
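For example, here is a minimal sketch with a hypothetical URL and selector:

```javascript
// Wait for the element itself rather than the 'load' event, with a
// 60-second timeout for slow pages (URL and selector are hypothetical)
await page.goto('https://example.com/slow-page');
await page.waitForSelector('.article-title', { timeout: 60000 });
const title = await page.$eval('.article-title', el => el.textContent);
```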
You also need to be careful with navigation links. Sometimes the information you want to scrape won’t be on a page directly accessible by URL. Some websites load data as you navigate, and you may need to reproduce a full “human” browsing session to get the information you need.
Finally, be very precise with your CSS selectors! Some websites use the same id on several elements, which can make you select the wrong element in your code. When possible, use the child combinator `>` (or other precise selectors) to prevent any ambiguity.
Tips and tricks
Memory management
Your Google Cloud Function can run out of memory if you are not careful. Puppeteer launches Chromium, and you need to instantiate big objects (like `browser` or `page`) to use it. In the Basic configuration example above, you can see that `closeConnection` is called in the `finally` block: this destroys the objects and cleans up memory as you exit the function, whether or not an error occurred. Many Puppeteer examples don’t destroy anything in case of error; after several executions, your environment’s memory can fill up, and the first instruction, `puppeteer.launch(PUPPETEER_OPTIONS)`, will crash.
Debugging
In the Google Cloud Management Console, you have access to logs that give you information about the remote execution of your functions. But for your local logs, you can use:
$ functions logs read
To clear them, just execute (`sudo` may be required here):
$ functions logs clear
DOM interactions
In order to get information on DOM elements, you can use the Puppeteer function `page.evaluate()`. Inside its callback, you have access to DOM elements (through CSS selectors, for example), but not to the rest of your code: a function defined outside `evaluate()` cannot be called inside it. You can, however, pass a serializable object as a second argument after the callback.
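A minimal sketch of passing a value into the callback (the class name is hypothetical):

```javascript
// Only serializable values cross the boundary into the browser context
const field = 'title'; // hypothetical class name
const text = await page.evaluate((selector) => {
  // Runs in the page: the DOM is available, outside functions are not
  return document.querySelector(`.${selector}`).textContent;
}, field);
```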
Another problem with `page.evaluate()` is that it’s hard to debug: if you try to use `console.log` inside it, you won’t see anything in your local logs. To solve this issue, add the following instruction just after you initialize the `page` object:
page.on('console', obj => console.log(obj.text()));
Using headless
When you test your function locally, you almost always set the `headless` option to `false` to see what happens in your browser. But when you deploy your function, `headless` must be set to `true` (otherwise it won’t work). So this is the perfect place to use an environment variable as the value of `headless`.
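For example, a minimal sketch assuming a hypothetical HEADLESS variable:

```javascript
// headless defaults to true; set HEADLESS=false locally to watch the browser
// (the variable name is hypothetical)
const PUPPETEER_OPTIONS = {
  headless: process.env.HEADLESS !== 'false',
  args: ['--no-sandbox'],
};
```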
Optimization
Finally, a very easy way to reduce the execution time of your cloud function is to parallelize text inputs in forms. If you have forms to fill, instead of running several `await page.type('.selector', fieldValue)` calls sequentially, parallelize them in a `Promise.all`. Of course, the form must be submitted outside this `Promise.all` so that all field values are set first.
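A short sketch of this idea, with hypothetical selectors and values:

```javascript
// Fill several fields in parallel instead of awaiting each one in turn
await Promise.all([
  page.type('#firstName', 'John'),
  page.type('#lastName', 'Doe'),
  page.type('#email', 'john.doe@example.com'),
]);
// Submit only once every field value is set
await page.click('#submit');
```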
Sources
- Puppeteer documentation
- Google Cloud SDK documentation
- Google Cloud Functions Quickstart
- GitHub Puppeteer issues: sometimes better than the documentation!
- A list of 30 useful CSS selectors, good to have precise DOM selectors
- Yarn documentation
- Node documentation
- Two other great articles about Puppeteer and Google Cloud Functions: here and here
- My personal gist, containing the code examples of this article
I hope you found this article useful! Feel free to give me your feedback and ask any questions :)