An Introduction to Web Scraping with Puppeteer

Learn Puppeteer with me in this article.

I saw a video a few days ago on DevTips where they tried out Puppeteer. I’d never used it myself, but it looked really cool, so I gave it a try and I’m sharing what I’ve learned here.

Prerequisites

This tutorial is beginner friendly; no advanced coding knowledge is required. If you’re following along with the project, the full requirements are listed below in the code section.

All code will be available in a repository on GitHub linked here.

What is Puppeteer?

Before we dive into the code, it’s important to understand what the technology we’re using is and why it exists.

A Headless Browser

Puppeteer comes with Chromium and runs “headless” by default. What is a headless browser? A headless browser is a browser for machines. It has no UI and allows a program — often called a scraper or a crawler — to read and interact with it.

An API

Headless browsers are great and all, but they can be a pain to use sometimes. Puppeteer, however, provides a really nice API or set of functions for interacting with it.

Why use any of this?

There’s so much you can do with Puppeteer and web scraping in general!

  • Run automated tests on a real web page
  • Generate PDFs
  • Take screenshots
  • Grab data from websites and save it
  • Automate boring tasks

For all of this, Puppeteer specifically is perhaps the best tool you can use, in my opinion.

On with the code!

Let’s get started!

Prerequisites

If you’re following along, you’ll need Node.js installed, plus basic knowledge of the command line, JavaScript, and the DOM.

Note: Your scraper code doesn’t have to be perfect. When doing your own projects don’t overthink it.

Project Setup

  1. Make a folder ( name it whatever you like )
  2. Open the folder in your terminal / command prompt
  3. In your terminal, run npm init -y. This will generate a package.json for managing project dependencies.
  4. Then run npm install puppeteer. This installs Puppeteer, which includes Chromium, so don’t be surprised if the download is large.
  5. Finally, open the folder in your favorite code editor and create an index.js file. You’ll also need these folders: screenshots, pdfs, and json, if you’re following my example exactly.
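If you prefer to do the whole setup from the terminal, the steps above boil down to something like this ( the folder name is just an example ):

```shell
# Make a project folder and move into it
mkdir puppeteer-scraping && cd puppeteer-scraping

# Generate a package.json with default answers
npm init -y

# Install Puppeteer ( downloads a bundled Chromium, so it can take a while )
npm install puppeteer

# Output folders used later in this article
mkdir screenshots pdfs json

# The script we'll be writing
touch index.js
```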

A Simple Example

Now let’s try something simple ( but really cool! ) to verify that our setup is working. We’re going to take a screenshot of a web page and generate a PDF file. ( yes this is simple to do )

For most of my examples, I’ll be using scrapethissite.com. You can use any site you want, as long as they allow you to scrape them. Search for their scraping policy and try looking at the site’s /robots.txt file, for example https://medium.com/robots.txt

generate a screenshot and pdf

This is all the code that’s required to start the headless browser, navigate to a web page, then take a screenshot and generate a PDF of it.

generated pdf file
generated screenshot

See the Puppeteer documentation for more information on page.screenshot() and page.pdf().

Screenshots and PDFs are fun, but how does that help me grab data faster?

Those features are good if you want pdfs and screenshots specifically. When you want to grab and possibly manipulate data there are other tools at your disposal.

Grabbing Data — Preparations

Using the same site from the example above, we’ll grab some data and save it to a file. Let’s say in this scenario we only want the team name, year, wins, and losses. The first step is to create some selectors.

A selector is just a path to the data ( think CSS selectors ). We’ll come up with the paths by using our browser’s developer tools. To open them on the page, look for “developer tools” in your browser menu. I’ll be using Chrome, where you can just press Ctrl + Shift + I.

On the site open the elements tab in your developer tools and find what data you want to grab. Take note of its structure, classes, etc.

inspecting the DOM ( click to enlarge )

If there’s a specific, unique element you want to grab, you can just right-click the node and choose “Copy selector”.

Notes for the data I want

  • It’s inside a table
  • The rows with team data have a class named team
  • Inside tr.team are multiple td with the class names: name, year, wins and losses. These contain the data I want.

My Selectors

The selectors I came up with for this example are:

  • Team Row: tr.team
  • Data: tr.team > td.${dataName} ( replace ${dataName} with name, year, wins, or losses )

Read more about CSS selectors here if you’re new to them.

Grabbing Data — In Code

Time to apply this to our code.

grabbing team data

The main part of this is page.evaluate(). It lets us run JavaScript code in the browser context and send any data we want back to our script. This is all it takes to fetch data.

You may have noticed that we have access to the DOM here — this is the very nice and familiar API that Puppeteer provides!

Saving Data to a File

As a final touch, we’ll save this data to a file. In my case, I want the data in JSON format because that’s most easily used with JS.

  1. Load the file system module from Node
  2. Convert the data to JSON with JSON.stringify()
  3. Write the file with fs.writeFile()
save JSON data

More Advanced Scraping

Puppeteer supports more advanced scenarios like scraping single page applications ( SPAs ), simulating user input, running tests, and more. They’re beyond the scope of this tutorial, but you can find examples in the Puppeteer documentation ( listed below ) and also in this other article.

References and Links

If you found this article too difficult then I’d recommend this one. It covers the same stuff, but in more detail.

Thanks for reading! Leave any feedback or questions in the comments below.
