Getting started with Puppeteer and Chrome Headless for Web Scraping

Emad Ehsan
Aug 25, 2017 · 7 min read

[Update]: You can read the Chinese version of this article here.

With Chrome being the market leader in web browsing, Chrome Headless is likely to become the industry leader in automated testing of web applications. So, I have put together this starter guide on web scraping with Chrome Headless.

Puppeteer is the official tool for Chrome Headless, built by the Google Chrome team. Since the official announcement of Chrome Headless, many of the industry-standard libraries for automated testing have been discontinued by their maintainers, including PhantomJS. Selenium IDE for Firefox has also been discontinued due to a lack of maintainers.

Update

TL;DR

Getting Started

Project setup

$ mkdir thal
$ cd thal

Initialize NPM and fill in the necessary details.

$ npm init

Install Puppeteer. It's not yet stable and the repository is updated daily. If you want the latest functionality, you can install it directly from its GitHub repository.

$ npm i --save puppeteer

Puppeteer includes its own Chromium, which is guaranteed to work headless. So each time you install or update Puppeteer, it will download its specific Chromium version.

Coding

Screenshot

If it's your first time using Node 7 or 8, you might be unfamiliar with the async and await keywords. To put async/await in really simple words: an async function returns a Promise, and when that promise resolves it yields the result you asked for. To get that result in a single line, you prefix the call to the async function with await. Save this in index.js inside the project directory.
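A minimal sketch of such a script — it launches the bundled Chromium headless, opens github.com, and saves a screenshot (the target page and the file name are just examples):

const puppeteer = require('puppeteer');

async function run() {
  // launches the bundled Chromium, headless by default
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // navigate and capture a screenshot
  await page.goto('https://github.com');
  await page.screenshot({ path: 'screenshots/github.png' });

  await browser.close();
}

run();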

Also create the screenshots directory.

$ mkdir screenshots

Run the code with

$ node index.js

The screenshot is now saved inside the screenshots/ directory.


Login to GitHub


Our goal is to extract the emails of the GitHub users who show up in a search. Some of them have made their emails publicly visible and some have chosen not to. The thing is, you can't see these emails without logging in. So, let's log in. We will make heavy use of the Puppeteer documentation.

Add a file creds.js in the project root. I highly recommend signing up for a new account with a dummy email, because you might end up getting your account blocked.
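creds.js only needs to export the username and password; the values below are placeholders:

module.exports = {
  username: '<GITHUB_USERNAME>',
  password: '<GITHUB_PASSWORD>'
};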

Add another file, .gitignore, and put the following content inside it:

node_modules/
creds.js

Launch in non-headless mode

const browser = await puppeteer.launch({
  headless: false
});

Let's navigate to the login page:

await page.goto('https://github.com/login');

Open https://github.com/login in your browser. Right-click on the input box below Username or email address and select Inspect. In the developer tools, right-click on the highlighted code and select Copy, then Copy selector.


Paste that value into the following constant:

const USERNAME_SELECTOR = '#login_field';

Repeat the process for the Password input box and the Sign in button. You would have something like the following.
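The exact selectors depend on GitHub's markup when you inspect it, so use whatever Copy selector gives you; at the time of writing they boiled down to roughly this:

const PASSWORD_SELECTOR = '#password';
const BUTTON_SELECTOR = 'input[name="commit"]'; // the "Sign in" submit button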

Logging in

At the top, require the creds.js file:

const CREDS = require('./creds');

Then add this code to the function to fill in the credentials and log in.
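A sketch of that code, using the selector constants from above and the page object we already have (page.click focuses a field, page.keyboard.type types into it):

await page.click(USERNAME_SELECTOR);
await page.keyboard.type(CREDS.username);

await page.click(PASSWORD_SELECTOR);
await page.keyboard.type(CREDS.password);

await page.click(BUTTON_SELECTOR);

// wait for GitHub to log us in and redirect
await page.waitForNavigation();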

Search GitHub

Rearranging the code a bit, we build the search URL from the query we want to run.
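A sketch of that piece; the query john and the URL shape below are just what this guide uses (GitHub may append extra parameters of its own):

const searchQuery = 'john';
const searchUrl = 'https://github.com/search?q=' + searchQuery + '&type=Users';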

Let's navigate to this page and wait a bit to see if it actually searched:

await page.goto(searchUrl);
await page.waitFor(2*1000);

Extract Emails

You can see that I also added LENGTH_SELECTOR_CLASS above. If you look at the page's code inside the developer tools, you will observe that each div with the class user-list-item houses the information about a single user.

Currently, one way to extract text from an element is by using the evaluate method of Page or ElementHandle. After navigating to the page with search results, we will use the page.evaluate method to get the length of the users list on the page. The evaluate method runs the given code inside the browser context.
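For example, assuming LENGTH_SELECTOR_CLASS is the user-list-item class mentioned above, counting the users inside the browser context might look like this:

const LENGTH_SELECTOR_CLASS = 'user-list-item';

let listLength = await page.evaluate((sel) => {
  // runs in the browser: count the user cards on the current page
  return document.getElementsByClassName(sel).length;
}, LENGTH_SELECTOR_CLASS);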

Let's loop through all the listed users and extract their emails. As we loop through the DOM, we have to change the index inside the selectors to point to the next DOM element. So, I've put the string INDEX in the selectors at the place where the index goes as we loop.

The loop and extraction
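A sketch of it, assuming two hypothetical constants LIST_USERNAME_SELECTOR and LIST_EMAIL_SELECTOR that hold the selectors you copied from the developer tools (with the child index replaced by the string INDEX, and the username selector pointing at the user's profile link). Not every user exposes an email, so we skip the ones that don't:

for (let i = 1; i <= listLength; i++) {
  // point the selectors at the i-th user in the list
  let usernameSelector = LIST_USERNAME_SELECTOR.replace('INDEX', i);
  let emailSelector = LIST_EMAIL_SELECTOR.replace('INDEX', i);

  let username = await page.evaluate((sel) => {
    // the profile link points at /<username>, so strip the leading slash
    return document.querySelector(sel).getAttribute('href').replace('/', '');
  }, usernameSelector);

  let email = await page.evaluate((sel) => {
    let element = document.querySelector(sel);
    return element ? element.innerText : null;
  }, emailSelector);

  // not every user has a public email
  if (!email) {
    continue;
  }

  console.log(username, '->', email);

  // TODO save this user
}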

Now if you run the script with node index.js, you will see usernames and their corresponding emails printed.

Go over all the pages


Fun fact: if you compare with the previous screenshot of the page, you will notice that six more johns have joined GitHub in a matter of a few hours.

At the top of the search results page you can see the total number of users matching the query. Copy its selector from the developer tools. We will write a new function below the run function that returns the number of pages we can go through.
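A sketch of such a function, with a hypothetical NUM_USER_SELECTOR standing in for the selector you just copied (GitHub showed 10 users per results page at the time of writing):

async function getNumPages(page) {
  // NUM_USER_SELECTOR is whatever Copy selector gave you for the
  // "12,345 users" heading on the search results page
  let inner = await page.evaluate((sel) => {
    let text = document.querySelector(sel).innerText;
    // the text looks like "12,345 users"
    return text.replace(/,/g, '').replace('users', '').trim();
  }, NUM_USER_SELECTOR);

  let numUsers = parseInt(inner, 10);

  // GitHub lists 10 users per results page
  return Math.ceil(numUsers / 10);
}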

At the bottom of the search results page, if you hover the mouse over the buttons with page numbers, you can see they link to the next pages. The link to the 2nd page of results is https://github.com/search?p=2&q=john&type=Users&utf8=%E2%9C%93. Notice the p=2 query parameter in the URL. This will help us navigate to the next pages.

After wrapping our previous loop in an outer loop that goes through all the pages, the code looks like this.
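Roughly like this, reusing searchUrl, getNumPages and the extraction loop from before (a sketch, not the full script):

let numPages = await getNumPages(page);

for (let h = 1; h <= numPages; h++) {
  // the p query parameter selects the results page
  let pageUrl = searchUrl + '&p=' + h;
  await page.goto(pageUrl);

  let listLength = await page.evaluate((sel) => {
    return document.getElementsByClassName(sel).length;
  }, LENGTH_SELECTOR_CLASS);

  for (let i = 1; i <= listLength; i++) {
    // ... same username / email extraction as in the previous section
  }
}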

Save to MongoDB

$ npm i --save mongoose

MongoDB is a schema-less NoSQL database, but we can make it follow some rules using Mongoose. First we have to create a Model, which is just a representation of a MongoDB collection in code. Create a directory models, create a file user.js inside it, and put the structure of our collection in it. From now on, whatever we insert into the users collection with Mongoose will have to follow this structure.
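A sketch of models/user.js with the three fields this guide stores (they match the object we pass to upsertUser later):

const mongoose = require('mongoose');

// structure of a document in the users collection
let userSchema = new mongoose.Schema({
  username: String,
  email: String,
  dateCrawled: Date
});

module.exports = mongoose.model('User', userSchema);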

Let's now actually insert data. We don't want duplicate emails in our database, so we only insert a user's information if the email is not already present; otherwise we just update the existing information. For this we use Mongoose's Model.findOneAndUpdate method.

At the top of index.js, add the imports:

const mongoose = require('mongoose');
const User = require('./models/user');

Add the following function at the bottom of index.js to upsert (update or insert) the User model.
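A sketch, assuming MongoDB is running locally and the database is called thal (matching the mongo shell commands further below):

function upsertUser(userObj) {
  const DB_URL = 'mongodb://localhost/thal';

  // connect lazily on first use
  if (mongoose.connection.readyState == 0) {
    mongoose.connect(DB_URL);
  }

  // if a user with this email exists, update it; otherwise insert it
  const conditions = { email: userObj.email };
  const options = { upsert: true, new: true, setDefaultsOnInsert: true };

  User.findOneAndUpdate(conditions, userObj, options, (err, result) => {
    if (err) {
      throw err;
    }
  });
}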

Start the MongoDB server. Then put the following code inside the for loops, at the place of the comment // TODO save this user, in order to save each user:

upsertUser({
  username: username,
  email: email,
  dateCrawled: new Date()
});

To check whether users are actually getting saved, open the mongo shell:

$ mongo
> use thal
> db.users.find().pretty()

You will see multiple users added there. This is the crux of this guide.

Conclusion

  • While scraping, you might be halted by GitHub's rate limiting.
  • Another thing I noticed: you cannot go beyond 100 pages of search results on GitHub.

Here is the accompanying GitHub repository containing the complete code.

End note
