Having fun with puppeteer JS
A little bit about Puppeteer
I reckon that if you're reading this, you probably already know what Puppeteer is. If so, you can jump straight to the next section; otherwise, allow me to give you a short introduction to Puppeteer.
According to Puppeteer's GitHub repository:
Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
So here are the important things we learn from the description above:
- Puppeteer is a NodeJS library
- It helps developers control the Chrome or Chromium browser
- It can control both headless and non-headless Chrome or Chromium
So what are headless and non-headless? The short answer is that headless Chrome means using Chrome without visually opening a browser window, while non-headless is what you're doing right now: interacting with Chrome with the ability to see what's going on, like colors, text and so on.
If you want a longer answer, feel free to have a look here.
What are we doing here
I recently discovered Puppeteer and wanted to spend some time with it. Fortunately, I also recently made a website for practicing the IELTS speaking test alone, and I really needed some cue cards (a cue card is a question card with some cues about the question on it) to add to my database. After searching for a while, I found this website:
https://www.ielts-mentor.com/cue-card-sample
It has all the data that I want, but I need to extract it, so I decided to use Puppeteer to create a scraper that can get all the data from the site, and that's exactly what we're going to do.
Let’s prepare all the ingredients
First, we need something to hold our code; a folder with a nice name should do it! So I created a new folder called scrapers
using the command below.
mkdir scrapers
Because this folder will hold our code and packages, we need a file to record all of them. That's right, a package.json
file! Below is the command to create it.
npm init
After answering all the questions, a new file will be created and we can start installing our packages.
Our main subject here is puppeteer so of course we will install it first.
npm install puppeteer
I also use firebase to store my data so let’s install it too.
npm install firebase
With all of our packages installed let’s move to the next step!
Get to know our target!
So first, let's navigate to the target and see what's there.
HTML structure of the list page
After navigating to our target, we see a list of links; they all lead to different cue cards.
HTML structure of the cue card page
After navigating to one of the links, we can see that the title of the cue card (the question) is located in a span tag inside an h2 tag, and that the list of cues is in a ul with no id or class name on it.
The plan
Now that we know the structure of our target, I have come up with an impeccable plan!
First, look at the URL of the list page when we go to the second page:
https://www.ielts-mentor.com/cue-card-sample?start=20
Notice that there's a new query param called start and its value is 20. Going back to the first list page, we can count a total of 20 links in the list, so we can conclude that start is the offset of the first cue card shown on the page, i.e. the number of cards skipped so far. Thus, we can imagine their SQL query would look like this:
SELECT * FROM cue_card_table LIMIT start_param_value, 20
This SQL query returns 20 cue cards from the table, starting at the cue card at offset start_param_value.
To make sure this is correct, I clicked the link to the 3rd page and the start param went up to 40, which confirms our hypothesis!
https://www.ielts-mentor.com/cue-card-sample?start=40
Now that we know how to navigate between pages, we only need to change the start param to get what we want. On each list page, we take all the links to the cue cards and store them in an array; after that we loop through this array, navigate to each page and extract the cue card's data. When we're done getting the data, we push it into our Firebase database. Simple.
Pseudo code:
list_page = goto(list_page_url)
links = extract_from(list_page)
for each link in links:
    cue_card_page = goto(link)
    cue_card_data = extract_from(cue_card_page)
    save_to_firebase(cue_card_data)
Stop!
Before going any further, I recommend that you skim through Puppeteer's API documentation, which is located here.
Besides that, Emad Ehsan has written an amazing post about getting started with Puppeteer and Chrome Headless for web scraping.
Launch the browser
The first thing to do in any Puppeteer project is to launch the browser.
So basically, our code first launches the browser using the launch function. This function takes an options object, and you can decide whether to use headless mode or not by setting the headless property of that object to true or false.
After that, we open and connect to a new tab using the newPage function. The your code... part is where we do stuff with the tab, like going to a URL and extracting data from the HTML. The last thing that needs to be done is closing the tab and the browser using the close function.
Now go ahead, create a new JS file and paste the code into it.
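The launch-and-close skeleton described above can be sketched like this (a minimal sketch, not the exact snippet from the post; the URL in the comment is only an example):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    // launch() accepts an options object; set headless: false
    // if you want to watch the browser while it works.
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // your code...
    // e.g. await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    await page.close();
    await browser.close();
})();
```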
Please ensure that you're using NodeJS version 7.6.0 or greater, because those versions support async/await, which became part of the language in ES2017.
For those who aren't familiar with async/await: it's a feature introduced in ES2017 that helps handle asynchronous tasks in a much cleaner way. For example, here is how we would handle promises with ES6-style chaining:
delivery_pizza = () => {
    // all numbers are in minutes
    prepare_pizza(3).then(() => {
        console.log("off we go !");
    }).then(() => {
        console.log("starting the bike");
        return start_the_bike(3);
    }).then(() => {
        console.log("move !");
        return traffic_jam(5);
    }).then(() => {
        console.log("Pizza has arrived");
    });
}
Yes, this is something called promise hell. There are a few ways to mitigate it, but to make async tasks easier, ES2017 introduced async/await to handle exactly this kind of code. Here is how the function above is rewritten with async/await:
delivery_pizza = async () => {
    // all numbers are in minutes
    await prepare_pizza(3);
    console.log("off we go !");
    console.log("starting the bike");
    await start_the_bike(3);
    console.log("move !");
    await traffic_jam(5);
    console.log("Pizza has arrived");
}
Much cleaner, isn't it? If you want to know more about this, here is a great post to read.
Get the list of links
In the aforementioned plan, there's a step where we take all the links on the list page and store them in an array; this section describes how that step works.
Go to the page
After we launch our browser and connect to a tab, the next step is to go to the list page, which is achievable with this piece of code:
await page.goto(
'https://www.ielts-mentor.com/cue-card-sample',
{waitUntil: 'networkidle2'}
);
As you can see, the goto function takes a URL and an object. In the code above, I specified an object with the property waitUntil; this is a crucial part of our code because it tells Puppeteer when to consider a page loaded.
If the waitUntil value is load, Puppeteer waits until the load event is fired. If the value is domcontentloaded, it considers the navigation finished when the DOMContentLoaded event is fired. The values networkidle0 and networkidle2 tell Puppeteer to wait until there have been no more than 0 or 2 network requests for at least 500ms, respectively.
Extract data from HTML
First, let's try to extract the data from the browser console.
The initial step is to select the DOM node using its selector string and the document.querySelectorAll function. One tip to quickly obtain the selector string of a DOM node is to use the Copy selector feature in Chrome DevTools.
In the HTML structure of the list page that we saw above, all the links are stored in a table, and each tr contains a td with an <a> residing within it. We can grab the selector of the <a> by using Copy selector.
After that, hit Ctrl + V to paste the clipboard value, which should be:
#adminForm > table > tbody > tr:nth-child(1) > td.list-title > a
Because we need all the links, not just the first one, we can safely remove the :nth-child(1) so that our selector points to every <a> in the table's tbody:
#adminForm > table > tbody > tr > td.list-title > a
Switching to the Console tab, we can write a simple script to select all the <a> elements using the selector above. The querySelectorAll function returns a NodeList, so we need to convert it into an array with the spread operator in order to use the map function. Then we transform every DOM node in the array into its href value.
Now take a look at this code
const cue_card_links = await page.evaluate((selector) => {
    const anchors_node_list = document.querySelectorAll(selector);
    const anchors = [...anchors_node_list];
    return anchors.map(link => link.href);
}, '#adminForm > table > tbody > tr > td.list-title > a');
The evaluate function takes a function followed by a list of arguments to pass to it. In this case, I passed the selector to the function and used the same code that we executed in the browser's console in the previous step. The array of links that the function returns is then assigned to the cue_card_links variable.
Get each cue card data
Once we have all the links to the different cue cards, we can navigate to each link and extract the cue card data.
for (let i = link_start; i < cue_card_links.length; i++) {
    let link = cue_card_links[i];
    await page.goto(link, {waitUntil: 'networkidle2'});
Extract the cue card question
After the navigation is finished, we can retrieve the question first. Repeating the same process, we get the selector of the question:
#main > article > h2:nth-child(5) > span
It seems that they put the question inside a <span> wrapped by an <h2>. However, looking at the HTML structure of other cue cards, some of them store the question not in an <h2> but in an <h3>, so we have to check for that too!
const question = await page.evaluate((selector1, selector2) => {
let question_dom = document.querySelector(selector1);
if (!question_dom) {
question_dom = document.querySelector(selector2);
}
return question_dom.textContent.trim();
},
"#main > article > h2:nth-child(5) > span",
"#main > article > h3:nth-child(5) > span"
);
The function above takes two selectors. First it uses the first selector, which points to the span located inside the h2; if querySelector returns null, that means they put the question inside the span in the h3, and in that case it uses the second selector. Afterwards, the DOM node's textContent is returned and assigned to the question variable.
Get all the cues
Getting all the cues is simple, just like the earlier step where we got all the links to the cue cards. Here is the selector of the ul that holds all the cues:
#main > article > ul:nth-child(8)
Just like the previous step, we can use the querySelectorAll function to get all the DOM nodes and then map over them, but this time we're not using the href property but the textContent property.
const cues = await page.evaluate((selector) => {
    let cue_doms = [...document.querySelectorAll(selector)];
    return cue_doms.map(cue => cue.textContent.trim());
}, "#main > article > ul:nth-child(8) > li");
Saving to firebase
This is the easiest stage of the whole process; with some simple code, pushing data to Firebase is straightforward.
const firebase = require('firebase');

if (!firebase.apps.length) {
    let config = {
        apiKey: "xxxxxxxxxxxxxxxx",
        authDomain: "xxxxxxxxxxxxxxxxxxxx",
        databaseURL: "xxxxxxxxxxxxxxxxxxxx",
        projectId: "xxxxxxxxxxxx",
        storageBucket: "xxxxxxxxxxxxxx",
        messagingSenderId: "xxxxxxxxxxxxxx"
    };
    firebase.initializeApp(config);
}

const db = firebase.database();

let saveToFirebase = (question, cues) => {
    let questionRef = db.ref('/questions');
    let newQuestionRef = questionRef.push();
    let newQuestionKey = newQuestionRef.key;
    newQuestionRef.set({
        question: question,
        cues: cues
    });
    console.log("[#] Success => Id: " + newQuestionKey + " | Title: " + question);
}
Upgrade the code!
When we're done taking all the cue cards on page 1, how do we navigate to page 2? According to our plan, we navigate between pages by modifying the start param. So how do we calculate its value? Here is the formula I came up with to compute the start value from the page number.
start = (current_page_number - 1) * cue_card_per_page
We know that there are 20 cue cards on each page, so if our current page is 1 the start value will equal
start = (1 - 1) * 20 = 0
Similarly, if our page is 2 the value will be
start = (2 - 1) * 20 = 1 * 20 = 20
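The formula translates directly into a tiny helper function (the name startForPage is my own invention, just for illustration):

```javascript
const ITEMS_PER_PAGE = 20;

// Offset of the first cue card on a given 1-based page number.
const startForPage = (page_number) => (page_number - 1) * ITEMS_PER_PAGE;

console.log(startForPage(1)); // 0
console.log(startForPage(2)); // 20
console.log(startForPage(3)); // 40
```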
So now the plan is to loop from our starting page up to the maximum page number, which is 31 in this case.
const page_start = 2;
const max_page = 31;
const item_per_page = 20;

for (let current_page = page_start; current_page <= max_page; current_page++) {
    let start = (current_page - 1) * item_per_page;
    await page.goto(
        'https://www.ielts-mentor.com/cue-card-sample?start=' + start,
        {waitUntil: 'networkidle2'}
    );
Putting them together
Here is the final result of our code.
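As a rough outline, the pieces from this post fit together along these lines (this is my own sketch under the selectors and plan described above, not the author's exact script; the Firebase config placeholders must be filled in with real values):

```javascript
const puppeteer = require('puppeteer');
const firebase = require('firebase');

// Firebase setup (fill in your own config values).
if (!firebase.apps.length) {
    firebase.initializeApp({
        apiKey: "xxxxxxxxxxxxxxxx",
        databaseURL: "xxxxxxxxxxxxxxxxxxxx",
        projectId: "xxxxxxxxxxxx"
    });
}
const db = firebase.database();

const saveToFirebase = (question, cues) => {
    const newQuestionRef = db.ref('/questions').push();
    newQuestionRef.set({ question: question, cues: cues });
    console.log("[#] Success => Id: " + newQuestionRef.key + " | Title: " + question);
};

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    const page_start = 1;
    const max_page = 31;
    const item_per_page = 20;

    for (let current_page = page_start; current_page <= max_page; current_page++) {
        const start = (current_page - 1) * item_per_page;
        await page.goto(
            'https://www.ielts-mentor.com/cue-card-sample?start=' + start,
            { waitUntil: 'networkidle2' }
        );

        // Collect every cue card link on the current list page.
        const cue_card_links = await page.evaluate((selector) => {
            return [...document.querySelectorAll(selector)].map(a => a.href);
        }, '#adminForm > table > tbody > tr > td.list-title > a');

        for (const link of cue_card_links) {
            await page.goto(link, { waitUntil: 'networkidle2' });

            // The question lives in an h2 or, on some pages, an h3.
            const question = await page.evaluate((sel1, sel2) => {
                const dom = document.querySelector(sel1) || document.querySelector(sel2);
                return dom ? dom.textContent.trim() : null;
            },
            '#main > article > h2:nth-child(5) > span',
            '#main > article > h3:nth-child(5) > span');

            const cues = await page.evaluate((selector) => {
                return [...document.querySelectorAll(selector)].map(li => li.textContent.trim());
            }, '#main > article > ul:nth-child(8) > li');

            if (question) saveToFirebase(question, cues);
        }
    }

    await browser.close();
})();
```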
In conclusion
That's how I made a scraper to grab all the data for my website using Puppeteer. I hope this simple scraper can help you get started with Puppeteer, and please do tell me if I can do anything to improve my code. I only found out about Puppeteer about 3 days ago, so it would be extremely helpful if you could leave some tips and tricks down in the comment section.
Goodbye and Merry Christmas :)