A Brief Introduction to Web Scraping with Node.js & Puppeteer

Sander Vreeken · The Startup · Jan 1, 2020 · 6 min read

Recently I wanted to use some data displayed on a website for my own web application. After a bit of research, it became clear that this is doable with the Node.js package Puppeteer. Puppeteer can also navigate pages for you, though in this tutorial we will mainly be focusing on the scraping side of the package.

In this tutorial we will see how to scrape data from a webpage that does not offer an (easy-to-use) API.

For this tutorial I assume you have at least a basic knowledge of Node.js, though most people should be able to follow the steps below.
We will use the webpage displaying the agenda of the Amsterdam ArenA, which was recently renamed the Johan Cruijff ArenA (after the famous soccer player passed away in 2016). In this example, I am going to create an array of objects (dictionaries), where each one stores the name, year, month and day of one of the events listed on this website.
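To make that goal concrete, the structure we are aiming for could look roughly like the sketch below; the field names match the scraping code later in this tutorial, while the values are just placeholders.

// Hypothetical shape of the end result; the values below are placeholders, not real events.
const exampleResult = [
    { title: 'Some concert', monthYear: 'January 2020', day: '18' },
    { title: 'Some football match', monthYear: 'February 2020', day: '02' }
];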

Foundation of Puppeteer

The first step is to create a new folder and, inside it, a new JavaScript file. Via the terminal, navigate to your newly created folder and install the Puppeteer package with the line below.

npm install --save puppeteer

Next, open a code editor of your choice and start writing the code necessary to scrape the data shown on the webpage.

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.johancruijffarena.nl/calendar.htm');

    const result = await page.evaluate(() => {
        var data = [];

        var tables = document.querySelectorAll('table');
        data = tables.length;
        return data;
    });

    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value);
});

I would encourage you to manually type out the code above to learn the basic syntax of the package, and thereafter to run your script file via the terminal.

node script.js

If your terminal also provides you with an integer, good job! Let's discuss what the code and the integer mean before we continue with printing actual data to the console.

const puppeteer = require('puppeteer');

Line 1
Here we are including the downloaded package Puppeteer in the script file and therewith in the project.

let scrape = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.johancruijffarena.nl/calendar.htm');

    const result = await page.evaluate(() => {
        var tables = document.querySelectorAll('table');
        var data = tables.length;

        return data;
    });

    browser.close();
    return result;
};

Lines 3-20
A regular arrow function using the async keyword introduced in ES8 (ES2017), which will help a lot in the following lines when we scrape the data.
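As a small aside (not part of the scraper itself), this is how async/await behaves: an async function always returns a promise, and await pauses execution inside the function until the awaited promise has resolved.

// Standalone illustration of async/await, unrelated to Puppeteer.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

let demo = async () => {
    await wait(1000);   // execution pauses here for about one second
    return 'done';      // delivered to the caller via the resolved promise
};

demo().then((value) => console.log(value));   // logs 'done' after roughly one second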

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.johancruijffarena.nl/calendar.htm');

Lines 4–7
From here on we are using the features Puppeteer provides us. A constant named browser is created by launching a (headless) browser, and from it we open a new page, just like you and I would do when we open an internet browser.
Then (you see what I did there?) the browser will go to the page specified and will not continue with the code below this statement before the page has been opened, thanks to the async/await keywords.
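By default Puppeteer launches the browser headless, so you will not see a window. While developing it can be handy to watch what is happening; the options below are regular Puppeteer options and purely optional for this tutorial.

// Optional: launch a visible browser and wait until the network is (almost) idle.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.johancruijffarena.nl/calendar.htm', {
    waitUntil: 'networkidle2'
});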

const result = await page.evaluate(() => {
    var tables = document.querySelectorAll('table');
    var data = tables.length;

    return data;
});

Lines 9–16
In these lines we are using a function offered by Puppeteer, namely page.evaluate. The function we pass to it is executed inside the context of the page, so within it we can use the DOM just as we would in the browser console. We assign its outcome to the constant result, so we can use the returned data later on.
The actual scraping (and I think the fun) starts here! We are saving all table elements of the earlier defined website in the tables variable. I assume that when you use Node.js, you understand the basics of HTML and JavaScript, and therefore this should be doable. The variable named data is then initialised and set to the number of tables collected. This integer is returned and accessible via the result constant.
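Keep in mind that the callback given to page.evaluate runs inside the page, not in your Node.js script, so it cannot see your local variables directly. If you need to hand a value over, page.evaluate accepts extra arguments after the callback; a minimal sketch (the selector here is just an example value):

// The evaluate callback runs in the page, so values from Node.js must be passed in explicitly.
const selector = 'table';   // lives in the Node.js script
const count = await page.evaluate((sel) => {
    return document.querySelectorAll(sel).length;   // sel is received inside the page
}, selector);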

browser.close();
return result;

Lines 18–19
After the scraping part is done, the browser is closed and the result constant discussed above is returned, so it can be used once the scrape function is called.
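One caveat: if anything throws before browser.close() is reached, the (headless) browser keeps running in the background. A common pattern, shown here only as an optional sketch, is to close the browser in a finally block:

// Variation of the scrape function that always closes the browser, even on errors.
let scrape = async () => {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto('https://www.johancruijffarena.nl/calendar.htm');
        return await page.evaluate(() => document.querySelectorAll('table').length);
    } finally {
        await browser.close();   // runs whether or not goto/evaluate succeeded
    }
};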

scrape().then((value) => {
    console.log(value);
});

Lines 22–24
Because scrape is an async function it returns a promise, so the callback passed to then will only execute once the scrape function has finished, at which point the integer returned on line thirteen is logged in the terminal.
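If you prefer, the same call site can be written with async/await instead of then, with a catch so a failed navigation does not go unnoticed; a minimal sketch:

// Equivalent call site using async/await plus basic error handling.
(async () => {
    try {
        const value = await scrape();
        console.log(value);
    } catch (error) {
        console.error('Scraping failed:', error);
    }
})();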

Scraping & Saving Data

Now that we have discussed the basics of this package, you might wonder: why did I choose to select all the table elements of this website?
In my experience with Puppeteer (and web scraping in general) it is easier to take an element which includes all the data you would like to cover, rather than scraping different elements that each contain only part of what you are searching for.
If you open the website and its Developer Tools (in Chrome), you will see that every table element includes a header with a year and month, and a body with n events, each listing a weekday, a day of the month and a title. Let me show you with the following code how convenient this can be when scraping data.

const result = await page.evaluate(() => {
    var data = [];
    var tables = document.querySelectorAll('table');

    for (let a = 0; a < tables.length; a++) {
        // The header cell holding the month and year of this table.
        let monthYear = tables[a].children[0].children[0].children[0].innerText;

        // Loop over every event row in the body of this table.
        for (let b = 0; b < tables[a].children[1].childElementCount; b++) {
            let day = tables[a].children[1].children[b].children[0].children[0].innerText;
            let title = tables[a].children[1].children[b].children[1].children[0].innerText;
            let event = { title, monthYear, day };
            data.push(event);
        }
    }

    return data;
});

Replace the original page.evaluate function we had earlier with the code above (you can copy-paste it, though preferably type it out yourself). Run your script and see what it logs in your terminal.
Great right? Without an API, we are still able to get the data we want by using Puppeteer. Let me guide you through my adjusted code.

var data = [];

First of all, I have created an empty array, which will later be returned containing all the objects (dictionaries) with the scraped data.

for (let a = 0; a < tables.length; a++) {
    let monthYear = tables[a].children[0].children[0].children[0].innerText;

    for (let b = 0; b < tables[a].children[1].childElementCount; b++) {
        let day = tables[a].children[1].children[b].children[0].children[0].innerText;
        let title = tables[a].children[1].children[b].children[1].children[0].innerText;
        let event = { title, monthYear, day };
        data.push(event);
    }
}

In the code above you will see two nested for loops. The first loop iterates over all the tables collected earlier.
If you open your Developer Tools once more, you will see that for every table the year and month text is nested three elements deep: inside a thead tag, then a tr, then a td. In order to dig down to the text we actually want to scrape, you use the children property to enter the next layer. The index you put between the square brackets defines which child element you want to enter (starting at 0, just like with an array), seen from the selected table. After reaching the needed element, the innerText property gives us the text of that element. As the loop covers every single table element, all tables and therewith all months and years will be covered, as shown on the original website.
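If the chain of children indices feels hard to read, the same elements can also be reached with CSS selectors, which are usually easier to maintain. The sketch below belongs inside the first loop and assumes the structure described above, i.e. that the header cell sits in a thead > tr > td and the events in a tbody.

// Equivalent lookups using CSS selectors, assuming a thead > tr > td header and a tbody with event rows.
let monthYear = tables[a].querySelector('thead tr td').innerText;
let rows = tables[a].querySelectorAll('tbody tr');   // one entry per event in that month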

In the second for loop we loop over the individual events within each month. As this number differs per month, the childElementCount property is used, which gives the number of child elements (the event rows) within the table body.
Just like with the children property above, the day of the month and the title are each saved in their own variable.
Eventually every single event is stored in an event object with its title, month and year, and day of the month, and pushed to the data array.

As seen before, this data is returned and thereafter logged to the console, where we find our array with the scraped data.
Of course, this data might not always look super pretty right away, but with some data manipulation (for example splitting monthYear into separate month and year fields, as sketched below) you will be able to present it for whatever purpose you might need it (e.g. your own website).
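As an example of such manipulation: assuming the scraped monthYear text looks something like 'January 2020' (do verify the actual format on the page), you could split it into separate fields inside the then callback at the end of the script, where the scraped array arrives as value.

// Rough post-processing sketch; assumes monthYear looks like 'January 2020'.
const cleaned = value.map(({ title, monthYear, day }) => {
    const [month, year] = monthYear.split(' ');
    return { title, month, year, day };
});

console.log(cleaned);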

Thanks for reading! Make sure you give this post some claps via the button on the left if you enjoyed it and want to see more. I publish articles on web development each week. Please consider leaving a comment and following me here on Medium.
