Creating a Scraper Using Headless Chrome

Assertis Tech Team
5 min read · Oct 6, 2017

As engineers, we are always trying to make our lives easier and more comfortable. Luckily, technology is always on hand to help us in our quest for an easy life.

For some time I have been looking for a way to minimize the time I spend searching for information on the Internet or comparing offers in different online shops. I decided to automate this process, or at least to see what I could learn by trying.

[Image: Traditional hardware]

Headless Chrome

I needed a fast way of programmatically parsing web pages. You might say, “Hey, you can just use a simple HTTP client library for that.” That’s what I did at first, and I failed really fast: many of the web pages I wanted to parse depend entirely on JavaScript to generate their content.

After realising that I needed full browser capabilities, I tried HtmlUnit, a JVM implementation of a “GUI-less browser”. This got me closer to my goal, but its performance and the resources it used to parse a single web page were not acceptable. The main problem was with the web pages themselves: quite often the HTML and JavaScript were not well formed and would throw errors in the console.

This all happened around the time when Google announced it would be providing a headless option in its web browser. I immediately decided to try it out, and I found it really useful.

For those of you who don’t know what headless Chrome is: it’s a way of running the Chrome browser without a graphical interface.

I want to show you how to use it, and how quickly you can build really nice apps with it.

Usage

As an example of using headless Chrome, we are going to create a simple Node.js app that checks the prices of some Apple products on Amazon.

We will use Puppeteer as our API for headless Chrome, TypeScript as our main language (you can just as well use plain ES6), ts-node for running the code without a separate compilation step, and yarn as the dependency manager.

Configuration

Before starting, you need to have Node.js and yarn installed. Then we need to initialize our application.

In the command line, type `yarn init`. This creates our `package.json` file.

After initialization, we need to install our dependencies. We can do it with one command: `yarn add typescript ts-node puppeteer @types/node`.

When all dependencies are installed, we should add a start script to our `package.json`. Just add:

"scripts": {
  "start": "node node_modules/ts-node/dist/bin.js ./src/index.ts"
}

Your `package.json` file should look very similar to this:

{
  "name": "headless-chrome-example",
  "version": "1.0.0",
  "main": "index.js",
  "license": "MIT",
  "scripts": {
    "start": "node node_modules/ts-node/dist/bin.js ./src/index.ts"
  },
  "dependencies": {
    "puppeteer": "^0.11.0",
    "ts-node": "^3.3.0",
    "typescript": "^2.5.3"
  },
  "devDependencies": {
    "@types/node": "^8.0.31"
  }
}

After configuring our project, we should also configure TypeScript. To do that, create a `tsconfig.json` file with the following content:

{
  "compilerOptions": {
    "baseUrl": ".",
    "outDir": "build",
    "moduleResolution": "node",
    "module": "commonjs",
    "target": "es2017",
    "allowJs": true
  }
}

Writing the application

First we need to decide which products we want to check. I chose two Apple products sold on Amazon UK.

In our app we declare a simple array containing the product URLs:

const productUrls = [
  'https://www.amazon.co.uk/dp/B00UY2U93W/ref=cm_sw_r_tw_dp_x_HO5Yzb6HQ03ZN',
  'https://www.amazon.co.uk/dp/B00OTHF8VG/ref=cm_sw_r_tw_dp_x_tU5YzbVAYJNY2',
];

Then we declare our main function and call it:

async function main() {
}
main();

In our main function, we want to get access to the browser. To get it, we should first import our launcher. At the top of the file add:

import { launch } from 'puppeteer';

And then, we can start the browser in headless configuration:

const browser = await launch({ headless: true });

In our application, we want to repeat the same process for every product. That means we should iterate over our product URLs and open a tab for each of them:

for (const url of productUrls) {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'load' });
}

The second and third lines are the interesting ones. The second line opens a new tab in Chrome; the third loads the URL we want and waits until the page’s `load` event has fired.

Then, for each opened tab, we want to run some code to extract the information we need. To do that, we can use the `evaluate` method of the page object:

const data = await page.evaluate(() => ({
  title: document.querySelector('#productTitle').textContent.trim(),
  price: document.querySelector('.offer-price').textContent.trim(),
}));

For each product tab we extract two small pieces of the web page: the text of the HTML element with the `productTitle` id, and the text of the element with the `offer-price` class.

And that’s almost it. We just ran two pieces of JavaScript in the browser, using the headless option. How cool is that?
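One caveat with the extraction step: `querySelector` returns `null` when nothing matches, so a product page without an `.offer-price` element would make the callback throw. Here is a minimal sketch of a more defensive read. The `readText` helper and the `ElementLike` and `QueryRoot` interfaces are hypothetical names introduced for illustration; they are not part of Puppeteer or of the app in this post.

```typescript
// Hypothetical minimal shapes, so the helper does not depend on DOM typings.
interface ElementLike {
  textContent: string | null;
}

interface QueryRoot {
  querySelector(selector: string): ElementLike | null;
}

// Returns the trimmed text of the first matching element,
// or null when the element (or its text) is missing.
function readText(root: QueryRoot, selector: string): string | null {
  const element = root.querySelector(selector);
  if (element === null || element.textContent === null) {
    return null;
  }
  return element.textContent.trim();
}
```

Puppeteer serializes the callback passed to `page.evaluate` and runs it inside the page, so a helper like this has to be defined inside that callback (or have its body inlined there) before it can be called with `document`.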

After we do all of our computation on this webpage, we should close the tab:

await page.close();

And we can print our result:

console.log(`${data.title}: ${data.price}`);
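The price arrives as display text, which is fine for printing but awkward for comparing offers. As a rough sketch, the text could be turned into a number like this. `parsePrice` and `ParsedPrice` are hypothetical names, and the accepted format (an optional currency symbol, comma thousands separators, a dot as decimal separator, for example `£1,249.00`) is an assumption about how the shop renders prices.

```typescript
// Hypothetical result type for illustration.
interface ParsedPrice {
  currency: string; // whatever non-digit prefix was found, possibly ''
  amount: number;
}

// Assumed input format: optional currency symbol, digits with optional
// comma thousands separators, optional decimal part of one or two digits.
// Returns null for anything else (e.g. "Currently unavailable").
function parsePrice(text: string): ParsedPrice | null {
  const match = text
    .trim()
    .match(/^([^\d\s]*)\s*(\d{1,3}(?:,\d{3})*|\d+)(\.\d{1,2})?$/);
  if (match === null) {
    return null;
  }
  const [, currency, whole, fraction = ''] = match;
  return { currency, amount: Number(whole.replace(/,/g, '') + fraction) };
}
```

Returning `null` instead of throwing lets the caller decide how to report products whose price could not be read.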

Once we’ve done this for all of our URLs, we should close the browser, as we don’t need it anymore:

await browser.close();
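If navigation or evaluation throws along the way, the call that closes the browser is never reached, and the Chrome process may be left running. One possible guard is a `try`/`finally` wrapper, sketched below. `Closable` and `withClosable` are hypothetical names introduced for illustration; the only thing assumed about the browser object is that it has the `close()` method used in this post.

```typescript
// Anything with an async close() method, such as the browser object above.
interface Closable {
  close(): Promise<unknown>;
}

// Opens a resource, runs `use` with it, and always closes it afterwards,
// whether `use` succeeds or throws.
async function withClosable<T extends Closable, R>(
  open: () => Promise<T>,
  use: (resource: T) => Promise<R>,
): Promise<R> {
  const resource = await open();
  try {
    return await use(resource);
  } finally {
    await resource.close();
  }
}
```

In the app this could be called as `withClosable(() => launch({ headless: true }), async browser => { ... })`, with the loop over the product URLs as the body of the second callback.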

The whole app should look something like:

import { launch } from 'puppeteer';

const productUrls = [
  'https://www.amazon.co.uk/dp/B00UY2U93W/ref=cm_sw_r_tw_dp_x_HO5Yzb6HQ03ZN',
  'https://www.amazon.co.uk/dp/B00OTHF8VG/ref=cm_sw_r_tw_dp_x_tU5YzbVAYJNY2',
];

async function main() {
  const browser = await launch({ headless: true });
  for (const url of productUrls) {
    // Open a new tab and wait for the page's load event.
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'load' });
    // This callback runs inside the page, so it can use the DOM.
    const data = await page.evaluate(() => ({
      title: document.querySelector('#productTitle').textContent.trim(),
      price: document.querySelector('.offer-price').textContent.trim(),
    }));
    await page.close();
    console.log(`${data.title}: ${data.price}`);
  }
  await browser.close();
}

main();

Summary

As you can see, it’s only about 25 lines of relatively simple code: code that runs your browser invisibly and gives you access to the JavaScript environment while the web page is running. You can do a lot with it, without running into problems with poorly formed HTML and JavaScript or an imperfect web engine implementation.

I had real fun with headless Chrome and the Puppeteer API. It was really simple, but very powerful. I would encourage you to experiment with it, because you can do much more than we’ve covered today. You can create a PDF from a web page or evaluate whole JavaScript files with some computation. It really gives you a new, interesting way of using web browser capabilities in your app.

Author: Maciej Romański
