How to do web scraping with Cheerio

This past weekend (13 August 2017) I started on a quest to get some data from a cinema website here in Accra, Ghana. I thought this would be easy, since the data is publicly available. Then I opened the Chrome web inspector and saw markup like I had not seen in years.

The Problem

There was no structure to this data; the listings were just a bunch of <p> tags with some nested <span> and <br> tags inside. To me this was a sign of a no-go. I even stated in the DevCongress Slack group (you might be wondering what DevCongress is; more on that soon) that there was no way of getting this data, along with a solution I wasn't too sure would work.

The Old Solution

After a few minutes of thinking it through, I realised there was a pattern even in the <p> tags. When I did a count, I noticed that each movie had around 12 <p> nodes containing the data I would need. So I could loop over the <p> tags, count down from 12, then reset the counter once it hit 0.
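A minimal sketch of that counting idea, assuming the text of each <p> tag has already been collected into an array (the array here is fabricated for illustration):

```javascript
// Hypothetical flat list of <p> texts, 12 entries per movie
let paragraphs = [];
for (let i = 0; i < 24; i++) {
  paragraphs.push('movie data ' + i);
}

let movies = [];   // each entry will hold one movie's 12 nodes
let current = [];
let counter = 12;

paragraphs.forEach(text => {
  current.push(text);
  counter--;
  if (counter === 0) { // reset once we hit 0, as described above
    movies.push(current);
    current = [];
    counter = 12;
  }
});

console.log(movies.length); // 2 movies from 24 paragraphs
```

This works only as long as every movie really does span exactly 12 nodes, which is exactly why it broke.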

The Actual Solution

Just as I finished writing this post, the data I was scraping changed and broke my solution, so I had to go back to the drawing board and come up with a new one, which in turn has worked out to be better and more robust.

Instead of counting the <p> tags, I decided to use the <hr> tags on the page as the breaking point between each movie. I also decided to abandon the earlier method of counting down from 12 to get the movie information. Instead, I opted to check each string I loop over for a certain keyword where possible; in other places I used some creative thinking to get the information I need.

It was now a bit clearer to me how to approach the problem, so I decided it was time to start writing some code. I was thinking of doing this in Python, as I had used Beautiful Soup in the past for this sort of thing, but lately I have been doing more work in JavaScript and Node. So I did a quick search, found an article using Cheerio and the Request library, quickly started writing some code and couldn't believe how easy the API was to use.

Getting started with the necessary tooling

Let's start by installing the libraries we will need. Note that I am using Node 8, so I will be using new features of JavaScript where I see fit.

Requirements

For this tutorial you will need the following libraries. At the time of writing these are the versions I used.

  • Cheerio (1.0.0-rc.2)
  • Request (2.81.0)

npm install cheerio@^1.0.0-rc.2 request@^2.81.0

Now let's require the libraries we need in order to get some data from the webpage.

let fs = require('fs'); 
let request = require('request');
let cheerio = require('cheerio');

You will notice I am also requiring the fs library. We are doing this so that later on we don't hit the site more times than necessary; we can cache the data, easily read it back from the cache and do our scraping on that.

Now let's define a few variables to store the URL of the website we want to scrape and the names of the cache files.

const cinema = 'accra'; 
const apiUrl = `http://silverbirdcinemas.com/${cinema}/`;
const cacheFile = `cache/${cinema}-silverbird.html`;
const outputCacheFile = `cache/${cinema}-silverbird.json`;

Lets Go!

We can now start defining our data structure that we want to deliver to our end user.

// main movie listing
let movieListings = {
  address: '',
  movies: []
};

// each individual movie
let newMovieObj = {
  title: '',
  showtimes: [],
  synopsis: '',
  cast: [],
  runningTime: '',
  genre: [],
  rating: 'Unknown'
};

Here we have defined the properties our output data will conform to. In the movieListings structure we store the address of the cinema and a list of movies, while in newMovieObj we store all the attributes of a movie that we need.

Let's write the code to request the apiUrl and cache the response to the filesystem using the fs library. We will wrap it in a function so we can reuse it later on.

let requestPage = (url, cachePath) => {
  request(url, (err, response, html) => {
    if (err) {
      return console.log(err);
    }
    fs.writeFile(cachePath, html, err => {
      if (err) {
        return console.log(err);
      }
      console.log('The file was saved!');
    });
  });
};

Let's look at this code. We start off by defining our function requestPage, which takes two parameters: the url we are making the request to, and the cachePath we wish to save the response data to. We know what we are requesting is HTML, so we will save it as HTML, as defined in the cacheFile variable we set earlier. We call the request library with the url and a callback function taking err, response and html; with these we can determine the state of the request we've made. If there is an error, we just log it to the console for now; otherwise we write the data to the filesystem with fs.writeFile, where we also check for errors and log them to the console.

Now that we have our function to request and write data to the filesystem, lets move on to reading the cache file we saved.

fs.stat(cacheFile, (err, stat) => {
  if (!err) {
    fs.readFile(cacheFile, (err, data) => {});
  } else if (err.code == 'ENOENT') {
    requestPage(apiUrl, cacheFile);
  }
});

We start by checking if the cacheFile exists. If it doesn't, we send a request and create it; otherwise we just read it using the fs.readFile function.

Inside our fs.readFile callback, let's load the data (which we know is an HTML page) into Cheerio so we can crawl the DOM (Document Object Model) and select the data we need.

fs.readFile(cacheFile, (err, data) => {
  let $ = cheerio.load(data);
  let numLines = 10;
  let movie = Object.assign({}, newMovieObj);
  let synopsisNext = false;
});

Let's take a look at this line by line.

let $ = cheerio.load(data);

You might be wondering why we assign the loaded DOM to a variable called $. There is no special reason, except that it's what jQuery uses, and it has become the conventional name among developers for the DOM.

let numLines = 10;

We assign the numLines variable to 10 because this is what we will use to figure out where each movie title is: whenever numLines is back at 10, we know the current node is a movie title.

let movie = Object.assign({}, newMovieObj);

The movie variable is assigned a fresh copy of newMovieObj so it gets all the properties of that object.
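As a quick illustration of why the copy matters (the object here is a stub): Object.assign gives us a new top-level object, so reassigning properties on the copy leaves the template untouched. Note that it is a shallow copy, so mutating a copied array in place (rather than reassigning it, as the scraper does) would also change the template.

```javascript
let template = { title: '', showtimes: [] };

let movie = Object.assign({}, template); // shallow copy
movie.title = 'Example Movie';           // reassignment: template unaffected
movie.showtimes = ['12:30pm'];           // new array, also safe

console.log(template.title);     // ''
console.log(template.showtimes); // []
```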

let synopsisNext = false;

The synopsisNext variable lets us know when the synopsis text is coming up, since the actual synopsis and the heading word SYNOPSIS are stored in different <p> tags.
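The flag works like a one-step lookahead; with some made-up sample lines it behaves like this:

```javascript
// Fabricated sample of consecutive <p> texts
let lines = ['SYNOPSIS', 'A thrilling story set in Accra.', 'CASTS: Jane Doe'];

let synopsisNext = false;
let synopsis = '';

lines.forEach(text => {
  if (synopsisNext) {           // the previous line was the SYNOPSIS heading
    synopsis = text;
    synopsisNext = false;
  }
  if (text.indexOf('SYNOPSIS') === 0) {
    synopsisNext = true;        // the next line holds the actual synopsis
  }
});

console.log(synopsis); // 'A thrilling story set in Accra.'
```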

$('#content .page').children().each((i, elem) => {
  let text = $(elem).text();
  let html = $(elem).html();
  if (i >= 1 && i < 3) {
    movieListings.address += text.replace("\'", '');
  }

  // Movies start
  if (i > 12) {
    if (html == '') {
      if (movie.title !== '') {
        movieListings.movies.push(movie);
      }
      numLines = 11; // offset by 1 in order to get movie title
      movie = Object.assign({}, newMovieObj);
    }

    // The movie title should be the first item in the loop
    if (numLines == 10) {
      movie.title = text.replace('\n ', '').trim();
    }
    if (checkDaysOfWeek(text, ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'])) {
      movie.showtimes = text.split('\n').map(item => item.trim());
    }
    // Search for SYNOPSIS keyword and know that the next loop
    // will be the actual synopsis
    if (synopsisNext) {
      movie.synopsis = text;
      synopsisNext = false;
    }
    if (text.indexOf('SYNOPSIS') === 0) {
      synopsisNext = true;
    }
    // Search for the CASTS keyword
    if (text.indexOf('CASTS:') === 0) {
      movie.cast = (text.replace('CASTS:', '').trim()).split(',').map(item => item.trim());
    }
    // Search for RUNNING TIME keyword
    if (text.indexOf('RUNNING TIME:') === 0) {
      movie.runningTime = text.replace('RUNNING TIME:', '').trim();
    }
    // Search for GENRE keyword
    if (text.indexOf('GENRE:') === 0) {
      movie.genre = (text.replace('GENRE:', '').trim()).split(',').map(item => item.trim());
    }
    // Search for RATING keyword
    if (text.indexOf('RATING:') === 0) {
      movie.rating = text.replace('RATING:', '').trim();
    }
    numLines--;
  }
});

There is plenty of code above, so let's break down what each part is doing.

We start with the outer loop, which iterates over every child of the .page element inside the div with the id of content.

$('#content .page').children().each((i, elem) => {

Next we assign the text content and inner HTML of each tag to the variables text and html.

let text = $(elem).text(); 
let html = $(elem).html();

We then check whether the current tag sits among the first few children, as we figured out this is where the address for the cinema is located. We append that text to the address property of the movieListings object, doing a little cleanup with the replace string method along the way.

if (i >= 1 && i < 3) { 
movieListings.address += text.replace("\'", '');
}

The actual movie listings come next: we know they start after the first 12 children, hence the i > 12 check.

if (i > 12) {

Then we check if html is empty, which marks the boundary between movies. When it is, we push the current movie onto movieListings.movies (provided it has a title) and reset numLines to 11. You might be wondering why 11 instead of 10; this is because we have to offset by 1 in order to pick up each subsequent title after the first one. Finally we create a new movie object, to make sure the next iteration is not updating an existing movie reference.

if (html == '') {
  if (movie.title !== '') {
    movieListings.movies.push(movie);
  }
  numLines = 11; // offset by 1 in order to get movie title
  movie = Object.assign({}, newMovieObj);
}

The remaining if statements decide which piece of movie information we are currently looking at. You will notice we use different checks for the different fields, and when we find the information we need we do some manipulation and cleanup to produce a format we are happy with. For the showtimes we created a helper function, checkDaysOfWeek, which checks a string to see if it contains any day of the week. It looks like this:

let checkDaysOfWeek = (text, days) => {
  for (var i = 0; i < days.length; i++) {
    if (text.toLowerCase().indexOf(days[i]) !== -1) {
      return true;
    }
  }
  return false;
};
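For example (the sample strings here are made up), any line mentioning a weekday is treated as a showtimes line, while keyword lines fall through to the other checks:

```javascript
let checkDaysOfWeek = (text, days) => {
  for (var i = 0; i < days.length; i++) {
    if (text.toLowerCase().indexOf(days[i]) !== -1) {
      return true;
    }
  }
  return false;
};

let days = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'];

console.log(checkDaysOfWeek('FRIDAY: 12:30pm | 3:00pm', days)); // true
console.log(checkDaysOfWeek('RATING: PG-13', days));            // false
```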

The rest of the code below is just working out how best to find a particular piece of movie information.

// The movie title should be the first item in the loop
if (numLines == 10) {
  movie.title = text.replace('\n ', '').trim();
}
if (checkDaysOfWeek(text, ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'])) {
  movie.showtimes = text.split('\n').map(item => item.trim());
}
// Search for SYNOPSIS keyword and know that the next loop
// will be the actual synopsis
if (synopsisNext) {
  movie.synopsis = text;
  synopsisNext = false;
}
if (text.indexOf('SYNOPSIS') === 0) {
  synopsisNext = true;
}
// Search for the CASTS keyword
if (text.indexOf('CASTS:') === 0) {
  movie.cast = (text.replace('CASTS:', '').trim()).split(',').map(item => item.trim());
}
// Search for RUNNING TIME keyword
if (text.indexOf('RUNNING TIME:') === 0) {
  movie.runningTime = text.replace('RUNNING TIME:', '').trim();
}
// Search for GENRE keyword
if (text.indexOf('GENRE:') === 0) {
  movie.genre = (text.replace('GENRE:', '').trim()).split(',').map(item => item.trim());
}
// Search for RATING keyword
if (text.indexOf('RATING:') === 0) {
  movie.rating = text.replace('RATING:', '').trim();
}

Finally, at the end of each pass through the movie section, we decrement numLines.

numLines--;

You can view the full source code and working copy on Glitch.

And this is how I went about scraping the movie data I needed from the cinema website. There are a lot of places in the code that can be refactored and simplified; I might write another post on refactoring the current codebase.

Thanks to Wendy Smith, Edmond Mensah, Emmanuel Lartey and David Oddoye for reviewing this post and giving feedback to improve it. If you need Front-end/NodeJS/PHP development done, please visit https://www.donielsmith.com and check out some of my work. Feel free to get in touch with me on Twitter @silentworks with questions.


Originally published at www.donielsmith.com on August 29, 2017.
