An Introduction to Web Scraping with Puppeteer

Learn Puppeteer with me in this article.

I saw a video a few days ago on DevTips where they tried out Puppeteer. I’d never used it myself, but it looked really cool, so I gave it a try and I’m sharing what I’ve learned here.

Prerequisites

This tutorial is beginner friendly; no advanced coding knowledge is required. If you’re following along with the project, the full requirements are listed below in the code section.

All code will be available in a repository on GitHub linked here.

What is Puppeteer?

Before we dive into the code, it’s important to understand what the technology we’re using is and why it exists.

A Headless Browser

Puppeteer comes with Chromium and runs “headless” by default. What is a headless browser? A headless browser is a browser for machines. It has no UI and allows a program — often called a scraper or a crawler — to read and interact with it.

An API

Headless browsers are great and all, but they can be a pain to use sometimes. Puppeteer, however, provides a really nice API or set of functions for interacting with it.

Why use any of this?

There’s so much you can do with Puppeteer and web scraping in general!

  • Run automated tests on a real web page
  • Generate PDFs
  • Take screenshots
  • Grab data from websites and save it
  • Automate boring tasks

For all of this, Puppeteer specifically is perhaps the best tool you can use, in my opinion.

On with the code!

Let’s get started!

Prerequisites

If you’re following along, you’ll need Node.js installed, plus basic knowledge of the command line, JavaScript, and the DOM.

Note: Your scraper code doesn’t have to be perfect. When doing your own projects don’t overthink it.

Project Setup

  1. Make a folder ( name it whatever you like )
  2. Open the folder in your terminal / command prompt
  3. In your terminal, run npm init -y. This will generate a package.json for managing project dependencies.
  4. Then run npm install puppeteer. This installs Puppeteer, which includes Chromium, so don’t be surprised if the download is large.
  5. Finally, open the folder in your favorite code editor and create an index.js file. You’ll also need these folders: screenshots, pdfs, and json, if you’re following my example exactly.
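If you prefer to do the whole setup from the terminal, the steps above boil down to something like this ( the folder name is just an example ):

```shell
# Make a project folder and move into it
mkdir puppeteer-scraping && cd puppeteer-scraping

# Generate a package.json with default answers
npm init -y

# Install Puppeteer ( downloads a bundled Chromium, so it can take a while )
npm install puppeteer

# Output folders used later in this article
mkdir screenshots pdfs json

# The script we'll be writing
touch index.js
```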

A Simple Example

Now let’s try something simple ( but really cool! ) to verify that our setup is working. We’re going to take a screenshot of a web page and generate a PDF file. ( yes this is simple to do )

For most of my examples, I’ll be using scrapethissite.com. You can use any site you want, as long as they allow you to scrape them. Search for their scraping policy and try looking at the site’s /robots.txt file, for example https://medium.com/robots.txt

generate a screenshot and pdf

This is all the code that’s required to start the headless browser, navigate to a web page, then take a screenshot and generate a PDF of it.

generated pdf file
generated screenshot

See the Puppeteer documentation for more information on page.screenshot() and page.pdf().

Screenshots and PDFs are fun, but how does that help me grab data faster?

Those features are good if you want pdfs and screenshots specifically. When you want to grab and possibly manipulate data there are other tools at your disposal.

Grabbing Data — Preparations

Using the same site from the example above, we’ll grab some data and save it to a file. Let’s say in this scenario we only want the team name, year, wins, and losses. The first step is to create some selectors.

A selector is just a path to the data ( think CSS selectors ). We’ll come up with the paths by using our browser’s developer tools. To open them on the page, look for “developer tools” in your browser menu. I’ll be using Chrome, where you can just press Ctrl + Shift + I.

On the site open the elements tab in your developer tools and find what data you want to grab. Take note of its structure, classes, etc.

inspecting the DOM ( click to enlarge )

If there’s a specific, unique element you want to grab, you can just right-click the node and choose “Copy selector”.

Notes for the data I want

  • It’s inside a table
  • The rows with team data have a class named team
  • Inside tr.team are multiple td with the class names: name, year, wins and losses. These contain the data I want.

My Selectors

The selectors I came up with for this example are:

  • Team Row: tr.team
  • Data: tr.team > td.${dataName} ( replace ${dataName} with name, year, wins, or losses )

Read more about CSS selectors here if you’re new to them.

Grabbing Data — In Code

Time to apply this to our code.

grabbing team data

The main part of this is page.evaluate(). It lets us run JavaScript code in the browser context and send any data we want back to our script. This is all it takes to fetch data.

You may have noticed that we have access to the DOM here — this is the very nice and familiar API that Puppeteer provides!

Saving Data to a File

As a final touch, we’ll save this data to a file. In my case, I want the data in JSON format because that’s most easily used with JS.

  1. Load the file system module from Node
  2. Convert the data to JSON with JSON.stringify()
  3. Write the file with fs.writeFile()
save JSON data

More Advanced Scraping

Puppeteer supports more advanced scenarios like scraping single page applications ( SPAs ), simulating user input, running tests, and more. They’re beyond the scope of this tutorial, but you can find examples in the Puppeteer documentation ( listed below ) and also in this other article.

References and Links

If you found this article too difficult then I’d recommend this one. It covers the same stuff, but in more detail.

Thanks for reading! Leave any feedback or questions in the comments below.
