Having fun with puppeteer JS
A little bit about Puppeteer
I reckon that if you're reading this, you probably already know what Puppeteer is. If so, you can jump straight to the next section; otherwise, allow me to give you a short introduction to Puppeteer.
According to Puppeteer's GitHub repository:
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
So here are the important things we learn from the description above:
- Puppeteer is a NodeJS library
- It helps developers control the Chrome or Chromium browser
- It can control both headless and non-headless Chrome or Chromium
So what are headless and non-headless? The short answer is that headless Chrome means using Chrome without visually opening a browser window, while non-headless is what you're doing right now: interacting with Chrome with the ability to see what's going on, like colors, text and so on.
If you want a longer answer, feel free to have a look here.
What are we doing here
I recently discovered Puppeteer and wanted to spend some time with it. Fortunately, I also recently made a website for practicing the IELTS speaking test alone, and I really needed some cue cards (a cue card is a question card with some cues about the question on it) to add to my database. After searching for a while, I found this website:
https://www.ielts-mentor.com/cue-card-sample
It has all the data that I want, but I need to extract it, so I decided to use Puppeteer to create a scraper that can get all the data from the site, and that's exactly what we're going to do.
Let’s prepare all the ingredients
First, we need something to hold our code; a folder with a nice name should do it! So I created a new folder called scrapers
using the command below.
mkdir scrapers
Because this folder will hold our code and packages, we need a file to record all of them. That's right, a package.json
file! Below is the command to create it.
npm init
After answering all the questions, a new file will be created and we can start installing our packages.
Our main subject here is puppeteer so of course we will install it first.
npm install puppeteer
I also use firebase to store my data so let’s install it too.
npm install firebase
With all of our packages installed let’s move to the next step!
Get to know our target!
So first, let's navigate to the target and see what's there.
HTML structure of the list page
After navigating to our target, we see a list of links; they all lead to different cue cards.
HTML structure of the cue card page
After navigating to one of the links, we can see that the title of the cue card (the question) is located in a span tag inside an h2 tag, and that the list of cues is in a ul with no id or class name on it.
The plan
Now that we know the structure of our target, I have come up with an impeccable plan!
First, look at the URL of the list page when we go to the second page:
https://www.ielts-mentor.com/cue-card-sample?start=20
Notice that there's a new query param called start and its value is 20. Going back to the first list page, we can count a total of 20 links in the list, so we can conclude that start is the offset of the first cue card shown on the page, i.e. the number of cards skipped so far. Thus, we can imagine their SQL query would look like this:
SELECT * FROM cue_card_table LIMIT start_param_value, 20
This SQL query returns 20 cue cards from the table, starting at the cue card at offset start_param_value.
To make sure this is correct, I clicked the link to the 3rd page and the start param went up to 40, which confirms our hypothesis!
https://www.ielts-mentor.com/cue-card-sample?start=40
Now that we know how to navigate between pages, we only need to change the start param to get what we want. On each list page, we take all the links to the cue cards and store them in an array; after that we loop through this array, navigate to each page and extract the cue card's data. When we're done getting the data, we push it into our Firebase database. Simple.
Pseudo code:
list_page = goto(list_page_url)
links = extract_from(list_page)
for each link in links:
    cue_card_page = goto(link)
    cue_card_data = extract_from(cue_card_page)
    save_to_firebase(cue_card_data)
Stop!
Before going any further, I recommend that you skim through Puppeteer's API documentation, which is located here.
Besides that, Emad Ehsan has written an amazing post about getting started with Puppeteer and Chrome Headless for web scraping.
Launch the browser
The first thing to do in any Puppeteer project is to launch the browser.
So basically, our code first launches the browser using the launch function. This function takes an options object, and you can decide whether to use headless mode or not by setting the headless property of that object to true or false.
After that, we open and connect to a new tab using the newPage function. The your code... part is where we do stuff with the tab, like going to a URL and extracting data from the HTML. The last thing that needs to be done is closing the tab and the browser using the close function.
Now go ahead, create a new JS file and paste the code into it.
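The launch-and-close skeleton described above can be sketched like this (a minimal sketch, not the exact snippet from the post; the URL in the comment is only an example):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    // launch() accepts an options object; set headless: false
    // if you want to watch the browser while it works.
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // your code...
    // e.g. await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    await page.close();
    await browser.close();
})();
```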
Please ensure that you're using NodeJS version 7.6.0 or greater, because those versions support async/await, which became part of the language in ES2017.
For those who aren't familiar with async/await: it's a feature introduced in ES2017 that helps handle asynchronous tasks in a much cleaner way. For example, here is how we would handle promises with ES6-style chaining:
delivery_pizza = () => {
    // all numbers are in minutes
    prepare_pizza(3).then(() => {
        console.log("off we go !");
    }).then(() => {
        console.log("starting the bike");
        return start_the_bike(3);
    }).then(() => {
        console.log("move !");
        return traffic_jam(5);
    }).then(() => {
        console.log("Pizza has arrived");
    });
}
Yes, this is something called promise hell. There are a few ways to mitigate it, but to make async tasks easier, ES2017 introduced async/await to handle exactly this kind of code. Here is how the function above is rewritten with async/await:
delivery_pizza = async () => {
    // all numbers are in minutes
    await prepare_pizza(3);
    console.log("off we go !");
    console.log("starting the bike");
    await start_the_bike(3);
    console.log("move !");
    await traffic_jam(5);
    console.log("Pizza has arrived");
}
Much cleaner, isn't it? If you want to know more about this, here is a great post to read.
Get the list of links
In the aforementioned plan, there's a step where we take all the links on the list page and store them in an array; this section describes how that step works.
Go to the page
After we launch our browser and connect to a tab, the next step is to go to the list page, which is achievable with this piece of code:
await page.goto(
'https://www.ielts-mentor.com/cue-card-sample',
{waitUntil: 'networkidle2'}
);
As you can see, the goto function takes a URL and an object. In the code above, I specified an object with the property waitUntil; this is a crucial part of our code because it tells Puppeteer when to consider a page loaded.
If the waitUntil value is load, Puppeteer waits until the load event is fired. If the value is domcontentloaded, it considers the navigation finished when the DOMContentLoaded event is fired. The values networkidle0 and networkidle2 tell Puppeteer to wait until there have been no more than 0 or 2 network requests for at least 500ms, respectively.
Extract data from HTML
First, let's try to extract the data from the browser console.
The initial step is to select the DOM node using its selector string and the document.querySelectorAll function. One tip to quickly obtain the selector string of a DOM node is to use the Copy selector feature in Chrome DevTools.
In the HTML structure of the list page that we saw above, all the links are stored in a table, and each tr contains a td with an <a> residing within it. We can grab the selector of the <a> by using Copy selector.
After that, hit Ctrl + V to paste the clipboard value, which should be:
#adminForm > table > tbody > tr:nth-child(1) > td.list-title > a
Because we need all the links, not just the first one, we can safely remove the :nth-child(1) so that our selector points to every <a> in the table's tbody:
#adminForm > table > tbody > tr > td.list-title > a
Switching to the Console tab, we can write a simple script to select all the <a> elements using the selector above. The querySelectorAll function returns a NodeList, so we need to convert it into an array with the spread operator in order to use the map function. Then we transform every DOM node in the array into its href value.
Now take a look at this code
const cue_card_links = await page.evaluate((selector) => {
    const anchors_node_list = document.querySelectorAll(selector);
    const anchors = [...anchors_node_list];
    return anchors.map(link => link.href);
}, '#adminForm > table > tbody > tr > td.list-title > a');
The evaluate function takes a function followed by a list of arguments to pass to it. In this case, I passed the selector to the function and used the same code that we executed in the browser's console in the previous step. The array of links that the function returns is then assigned to the cue_card_links variable.
Get each cue card data
Once we have all the links to the different cue cards, we can navigate to each link and extract the cue card data.
for (let i = link_start; i < cue_card_links.length; i++) {
    let link = cue_card_links[i];
    await page.goto(link, {waitUntil: 'networkidle2'});
Extract the cue card question
After the navigation is finished, we can retrieve the question first. Repeating the same process, we get the selector of the question:
#main > article > h2:nth-child(5) > span
It seems that they put the question inside a <span> wrapped by an <h2>. However, looking at the HTML structure of other cue cards, some of them store the question not in an <h2> but in an <h3>, so we have to check for that too!
const question = await page.evaluate((selector1, selector2) => {
let question_dom = document.querySelector(selector1);
if (!question_dom) {
question_dom = document.querySelector(selector2);
}
return question_dom.textContent.trim();
},
"#main > article > h2:nth-child(5) > span",
"#main > article > h3:nth-child(5) > span"
);
The function above takes two selectors. First it uses the first selector, which points to the span located inside the h2; if querySelector returns null, that means they put the question inside the span in the h3, and in that case it uses the second selector. Afterwards, the DOM node's textContent is returned and assigned to the question variable.
Get all the cues
Getting all the cues is simple, just like the earlier step where we got all the links to the cue cards. Here is the selector of the ul that holds all the cues:
#main > article > ul:nth-child(8)
Just like the previous step, we can use the querySelectorAll function to get all the DOM nodes and then map over them, but this time we're not using the href property but the textContent property.
const cues = await page.evaluate((selector) => {
    let cue_doms = [...document.querySelectorAll(selector)];
    return cue_doms.map(cue => cue.textContent.trim());
}, "#main > article > ul:nth-child(8) > li");
Saving to firebase
This is the easiest stage of the whole process; with some simple code, pushing data to Firebase is straightforward.
const firebase = require('firebase');

if (!firebase.apps.length) {
    let config = {
        apiKey: "xxxxxxxxxxxxxxxx",
        authDomain: "xxxxxxxxxxxxxxxxxxxx",
        databaseURL: "xxxxxxxxxxxxxxxxxxxx",
        projectId: "xxxxxxxxxxxx",
        storageBucket: "xxxxxxxxxxxxxx",
        messagingSenderId: "xxxxxxxxxxxxxx"
    };
    firebase.initializeApp(config);
}

const db = firebase.database();

let saveToFirebase = (question, cues) => {
    let questionRef = db.ref('/questions');
    let newQuestionRef = questionRef.push();
    let newQuestionKey = newQuestionRef.key;
    newQuestionRef.set({
        question: question,
        cues: cues
    });
    console.log("[#] Success => Id: " + newQuestionKey + " | Title: " + question);
}
Upgrade the code!
When we're done taking all the cue cards on page 1, how do we navigate to page 2? According to our plan, we navigate between pages by modifying the start param. So how do we calculate its value? Here is the formula I came up with to compute the start value from the page number.
start = (current_page_number - 1) * cue_card_per_page
We know that there are 20 cue cards on each page, so if our current page is 1 the start value will equal
start = (1 - 1) * 20 = 0
Similarly, if our page is 2 the value will be
start = (2 - 1) * 20 = 1 * 20 = 20
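The formula translates directly into a tiny helper function (the name startForPage is my own invention, just for illustration):

```javascript
const ITEMS_PER_PAGE = 20;

// Offset of the first cue card on a given 1-based page number.
const startForPage = (page_number) => (page_number - 1) * ITEMS_PER_PAGE;

console.log(startForPage(1)); // 0
console.log(startForPage(2)); // 20
console.log(startForPage(3)); // 40
```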
So now the plan is to loop from our starting page up to the maximum page number, which is 31 in this case.
const page_start = 2;
const max_page = 31;
const item_per_page = 20;

for (let current_page = page_start; current_page <= max_page; current_page++) {
    let start = (current_page - 1) * item_per_page;
    await page.goto(
        'https://www.ielts-mentor.com/cue-card-sample?start=' + start,
        {waitUntil: 'networkidle2'}
    );
Putting them together
Here is the final result of our code.
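As a rough outline, the pieces from this post fit together along these lines (this is my own sketch under the selectors and plan described above, not the author's exact script; the Firebase config placeholders must be filled in with real values):

```javascript
const puppeteer = require('puppeteer');
const firebase = require('firebase');

// Firebase setup (fill in your own config values).
if (!firebase.apps.length) {
    firebase.initializeApp({
        apiKey: "xxxxxxxxxxxxxxxx",
        databaseURL: "xxxxxxxxxxxxxxxxxxxx",
        projectId: "xxxxxxxxxxxx"
    });
}
const db = firebase.database();

const saveToFirebase = (question, cues) => {
    const newQuestionRef = db.ref('/questions').push();
    newQuestionRef.set({ question: question, cues: cues });
    console.log("[#] Success => Id: " + newQuestionRef.key + " | Title: " + question);
};

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    const page_start = 1;
    const max_page = 31;
    const item_per_page = 20;

    for (let current_page = page_start; current_page <= max_page; current_page++) {
        const start = (current_page - 1) * item_per_page;
        await page.goto(
            'https://www.ielts-mentor.com/cue-card-sample?start=' + start,
            { waitUntil: 'networkidle2' }
        );

        // Collect every cue card link on the current list page.
        const cue_card_links = await page.evaluate((selector) => {
            return [...document.querySelectorAll(selector)].map(a => a.href);
        }, '#adminForm > table > tbody > tr > td.list-title > a');

        for (const link of cue_card_links) {
            await page.goto(link, { waitUntil: 'networkidle2' });

            // The question lives in an h2 or, on some pages, an h3.
            const question = await page.evaluate((sel1, sel2) => {
                const dom = document.querySelector(sel1) || document.querySelector(sel2);
                return dom ? dom.textContent.trim() : null;
            },
            '#main > article > h2:nth-child(5) > span',
            '#main > article > h3:nth-child(5) > span');

            const cues = await page.evaluate((selector) => {
                return [...document.querySelectorAll(selector)].map(li => li.textContent.trim());
            }, '#main > article > ul:nth-child(8) > li');

            if (question) saveToFirebase(question, cues);
        }
    }

    await browser.close();
})();
```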
In conclusion
That's how I made a scraper to grab all the data for my website using Puppeteer. I hope this simple scraper can help you get started with Puppeteer, and please do tell me if I can do anything to improve my code. I only found out about Puppeteer about 3 days ago, so it would be extremely helpful if you could leave some tips and tricks down in the comment section.
Goodbye and Merry Christmas :)