Web scraping with Node.js, Chrome, & Puppeteer

Ed Huang
Jan 10, 2018 · 4 min read

Ever want to build your own bot? Now it’s easier than ever with Puppeteer!

So I was playing around with Puppeteer and headless Chrome and thought I’d share my simple program. This bot will get all the VINs off the tesla.com/used site. It’s basically three steps.

  1. Navigate to the site
  2. Click on a vehicle link
  3. Grab information off the tabbed page.

We’ll be using async/await for this, so make sure your Node is up to date. I’m using version 8.0.0.

Let’s jump into the code!
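If you’re not sure whether your Node is new enough, here’s a small sketch of a version check. The helper name supportsAsyncAwait is made up for this example; it relies on the fact that async functions work without flags from Node 7.6 onward.

```javascript
// Hypothetical helper (not part of Node or Puppeteer): returns true when
// a Node version string such as '8.0.0' supports async/await natively.
function supportsAsyncAwait(version) {
  const [major, minor] = version.split('.').map(Number);
  // Async functions are enabled by default starting with Node 7.6.
  return major > 7 || (major === 7 && minor >= 6);
}

// process.versions.node holds the running version without a leading 'v'.
if (!supportsAsyncAwait(process.versions.node)) {
  console.error(`Node ${process.versions.node} is too old for async/await; please upgrade.`);
}
```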

1. Navigating to the site

const puppeteer = require('puppeteer');

async function run() {
  // Pass { headless: false } to launch() to see the browser in action.
  let browser = await puppeteer.launch();
  let pages = await browser.pages();
  let page = pages[0];
  await page.goto('https://tesla.com/used');
  // More below...
}

run();

This snippet creates a new browser instance and then returns all the pages (tabs) that belong to it. We’ll grab the page at index 0, which the browser creates on launch, and go to the site with it.

NOTE: The docs say to use browser.newPage(). What I didn’t like about this is that it creates a new tab in your browser, leaving you with two tabs, one of them empty.
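If you’d rather not assume the initial tab is always there, here’s one way to hedge. This is only a sketch: getInitialPage is a name I made up, and it simply falls back to browser.newPage() when browser.pages() comes back empty.

```javascript
// Hypothetical helper: reuse the tab Chrome opens at launch,
// or create a fresh one if the browser reports no open pages.
async function getInitialPage(browser) {
  const pages = await browser.pages();
  return pages.length > 0 ? pages[0] : browser.newPage();
}

// Usage (assumes puppeteer is installed):
//   const browser = await require('puppeteer').launch();
//   const page = await getInitialPage(browser);
//   await page.goto('https://tesla.com/used');
```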

2. Clicking all the cars

So once we land on the page, we want to click on the links of the vehicles.

let carHandles = await page.$$('.vehicle-link');

for (let i = 0; i < carHandles.length; i++) {
  carHandles[i].click();
}

page.$$(selector) returns an array of element handles. From the API docs we see that each handle has a click method; clicking a vehicle link opens a new tab with the car details.

Unfortunately, if you just ran this, it wouldn’t work, because an element needs to be loaded before it can be clicked. Let’s go ahead and handle this.

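We need to wait for the selector to load, using page.waitForSelector, before we grab the handles and click them.

await page.waitForSelector('.vehicle-link');
let carHandles = await page.$$('.vehicle-link');

for (let i = 0; i < carHandles.length; i++) {
  await carHandles[i].click();
  // A new tab is created after each click.
}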
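The wait-then-click pattern can be wrapped into a small reusable function. This is a sketch under my own naming (clickAll is hypothetical); it only uses page.waitForSelector, page.$$ and the handle’s click method, and it awaits each click so they happen one at a time.

```javascript
// Hypothetical helper: wait until at least one match exists,
// then click every matching element in order.
async function clickAll(page, selector) {
  await page.waitForSelector(selector);
  const handles = await page.$$(selector);
  for (const handle of handles) {
    await handle.click();
  }
  // Return how many elements were clicked so callers know
  // how many new tabs to expect.
  return handles.length;
}

// Usage: const linkCount = await clickAll(page, '.vehicle-link');
```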

3. Grab information off the tabbed page.

Next we need to go to each of the newly opened tabs and scrape information off the detail page.

We’ll go ahead and get pages again like above. We’ll see that we now have more tabs opened in the browser.

pages = await browser.pages();

The browser’s pages method gives us an array of all the opened tabs. Unfortunately, as with selectors, we need to make sure that the tabs have been created.

let count = 0;

// Keep polling until every detail tab, plus the initial tab, exists.
while (count < carHandles.length + 1) {
  pages = await browser.pages();
  count = pages.length;
}

The number of tabs opened should be the number of car links plus the initial tab that was opened by the browser. So we keep checking until the tab count is one more than the number of links.
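A polling loop like this spins forever if a tab never opens, so it may be worth adding a timeout and a short pause between checks. Here’s a rough sketch; waitForTabCount, sleep, and the timeoutMs and intervalMs options are all names and defaults I picked for illustration.

```javascript
// Small promise-based pause between polls.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: poll browser.pages() until at least `expected`
// tabs exist, or throw once the deadline passes.
async function waitForTabCount(browser, expected, { timeoutMs = 30000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let pages = await browser.pages();
  while (pages.length < expected) {
    if (Date.now() > deadline) {
      throw new Error(`Timed out: ${pages.length} of ${expected} tabs open`);
    }
    await sleep(intervalMs);
    pages = await browser.pages();
  }
  return pages;
}

// Usage: pages = await waitForTabCount(browser, carHandles.length + 1);
```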

NOTE: You could also use the browser.on('targetcreated') event to detect when a tab has been created. As of now, I’m not sure of any other way to track when all the pages are loaded.
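Here’s roughly what the event-based approach could look like. Treat it as a sketch, not a tested recipe: collectNewPages is a hypothetical helper, the 30 second default timeout is arbitrary, and it assumes a setup where browser.off is available (older Node versions may need removeListener instead). It listens for Puppeteer’s 'targetcreated' event and keeps only targets whose type() is 'page'.

```javascript
// Hypothetical helper: resolve with `expected` newly opened pages,
// or reject if they do not all appear within `timeoutMs`.
function collectNewPages(browser, expected, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const pages = [];

    const timer = setTimeout(() => {
      browser.off('targetcreated', onTarget);
      reject(new Error(`Timed out: ${pages.length} of ${expected} tabs opened`));
    }, timeoutMs);

    async function onTarget(target) {
      // Ignore non-tab targets such as service workers.
      if (target.type() !== 'page') return;
      try {
        pages.push(await target.page());
      } catch (err) {
        clearTimeout(timer);
        browser.off('targetcreated', onTarget);
        reject(err);
        return;
      }
      if (pages.length === expected) {
        clearTimeout(timer);
        browser.off('targetcreated', onTarget);
        resolve(pages);
      }
    }

    browser.on('targetcreated', onTarget);
  });
}
```

The listener has to be registered before the clicks happen, so you would call collectNewPages(browser, carHandles.length) first, click the links, and only then await the returned promise.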

Once we have the opened detailed tabs, we can scrape the information we’re looking for off the page. We’ll use the page.evaluate method to get into the context of the browser, so we can use document.querySelector.

let vins = [];

// pages[0] is the listing tab, so the detail tabs start at index 1.
for (let i = 1; i < pages.length; i++) {
  let vin = await pages[i].evaluate((sel) => {
    return document.querySelector(sel).innerHTML;
  }, VIN_SELECTOR);
  vins.push(vin);
  await pages[i].close();
}

for (let i = 0; i < vins.length; i++) {
  console.log(`${i + 1}: ${vins[i]}`);
}

await browser.close();

VIN_SELECTOR is a variable I created by inspecting the VIN on the detail page. It should look something like this:

const VIN_SELECTOR = '#page > div > div.pane-content-constrain > main > div > div > div > div > section.side > aside > p.extra-info > span:nth-child(3)';

The value itself comes from this expression inside page.evaluate:

document.querySelector(sel).innerHTML;

I return the value to a variable called vin, and then push it into the vins array.
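One thing to watch out for: if the selector doesn’t match anything, document.querySelector returns null and reading innerHTML throws inside the page. A slightly more defensive version might look like the sketch below. extractText is a name I made up, and it reads textContent and trims it, which is my own choice for illustration and not what the original code does.

```javascript
// Hypothetical helper: return the trimmed text of the first element
// matching `selector`, or null when nothing matches.
async function extractText(page, selector) {
  return page.evaluate((sel) => {
    // This callback runs in the browser context, where `document` exists.
    const el = document.querySelector(sel);
    return el ? el.textContent.trim() : null;
  }, selector);
}

// Usage: const vin = await extractText(pages[i], VIN_SELECTOR);
//        if (vin !== null) vins.push(vin);
```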

After that we’ll go ahead and close the tab. Once we have all our vins stored in our array, we can console log them. You should then see all the vins from the used cars.

I hope that helps. Here’s a link to the actual code.
https://gist.github.com/dwerdo/9d6761d203964162561b3349b64c71df

Have fun building bots!

This was inspired by Emad Ehsan’s Getting started with Puppeteer. Check it out for a great tutorial on Puppeteer!

Written by

Ed Huang

Senior Frontend Developer at Auditboard.com, Love Ember.js and all things javascript.
