Web Scraping 101: Cheerio with Node to the Rescue

Tarique Ejaz
The Startup
Published in
7 min read · Nov 6, 2019

Web Scraping (n.)
In simple terms, it is the process of going through a website's HTML (or rendered markup) and extracting the data of interest being shown to the user.

It mainly comes in handy for data-oriented folks, as it helps with analysis- and prediction-based chores. Even if that is not your line of work, it is still a super useful thing to know. You might just need to know the latest price of your favorite product on a beloved e-commerce site as you cringe and wait for that price to take a tumble. Just saying.

Scraping is kind of a tricky business. Read the terms of the site before you go gung ho. (Source: http://www.commitstrip.com/en/2015/05/19/data-wars/)

I have been doodling with an idea for some time and scraping is an essential part of it. That is how I stumbled (properly) upon it. In this article, I will show you how to create a simple web scraping function that visits a product page on Amazon India and fetches details related to the product. Nothing complicated. Super simple. However, you can always build on top of it. That I leave up to you.

Tech Stack At Play

There are so many tools at your disposal as far as web scraping is concerned. I have heard that Beautiful Soup is extremely popular as far as Python-based libraries are concerned. However, given my affinity for JavaScript, I decided to settle on a combination of the following frameworks/libraries:

  • Node: It is a JavaScript runtime built on Chrome’s V8 engine. It is used mainly for server-side applications and use cases that require system-intensive operations.
  • Cheerio: It is a library that uses core jQuery tenets to parse HTML and help manipulate DOM data. It is lean, fast, and super easy to load and use.

The tech stack is decided. Let’s jump into the code now.

Let’s Code The Scrape Out Of This

We start off by creating an index.js file and installing cheerio and request using the Node Package Manager (npm): npm install cheerio request.

We use request to make HTTP(S) calls to the website in question, and it returns the raw HTML of the page to us.

First of all, we create a function in which we perform our scraping operation. Let us call this function ScrapeWebsiteData, and we pass in the URL of the website. request takes the URL and a callback that receives an error, a response, and the body returned by the website. The body is where we get our main sauce or, in technical terms, the raw HTML code. After receiving the HTML, we initialize a variable (let's call it $ for convenience) using cheerio.load(). This gives us a tree-like object that lets cheerio easily traverse the parsed markup.

Here is the webpage we would be scraping.

Now, normally, if you send continuous requests to any Amazon page from a server, it is very likely that your IP will be blocked pretty soon. Fair warning, people. (Psst. There are many workarounds, but we will keep that conversation for later. Okay, Bubbaloo?)

Having fetched the HTML code, it is time to start scraping the data we need. Before we write any cheerio, let us find the HTML elements we would be targeting for extraction. Most tags have an identifier attached to them. It gets trickier if the element you are trying to scrape has no distinguishing identifier. As a ground rule, the less vague a tag, the easier it is to retrieve the data.

We would require the following data for the product listed — Name of the Product, Product Price, and the Essential Features.

As far as the title is concerned, we see that there is an id named productTitle attached to the concerned span tag.

Using cheerio’s id selector tag, we go ahead and extract the text related to the title.

We use the regex /\s\s+/g soon after in order to remove any spaces or newline escape sequences trailing the text.

In a similar manner, we now look at the price data on the webpage and we see that it also contains an id named priceblock_ourprice attached to the wrapping tag.

The ground rule still holds, and we easily extract the price of the product using cheerio's id selector.

However, we only receive the string value of the price, which is not in a state to be consumed for any kind of computation. So we go ahead and write some code to extract its numeric equivalent for any future use.
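One way to sketch that conversion; the price string here is a made-up value in Amazon India's display format, not a real scraped price.

```javascript
// Hypothetical scraped price string in Amazon India's display format.
const priceText = '₹1,23,456.00';

// Strip everything except digits and the decimal point, then parse.
const price = parseFloat(priceText.replace(/[^0-9.]/g, ''));
console.log(price); // 123456
```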

We have scraped the title of the product and its price (both in display and usable form). What remains are the essential features, and here we notice that the scraping situation is slightly different from the previous two cases: the essential features are a set of items in a list with identical identifiers.

In such a case, we stick to the ground rule. We try to discover a unique identifier tied to this list. On doing so, we find that there is such an id, named feature-bullets; however, it is not an identifier of the list tag itself but of a parent tag enclosing the list. It is time to use our basic knowledge of CSS, combine it with cheerio's .each, which works similarly to a forEach loop, and extract the concerned data.

In order to clean up any escape sequences in the text extracted and also the trailing spaces, we apply a bit of transformation on the extracted text.

So let’s see here. We have the title, the price (both in a readable and a usable form) and the essential features of the product. Are we missing anything else? Oh yes. The product image! It is time to investigate the likely identifier for the product image.

You would notice there is no text to be extracted here. For an image, we generally get easily referenceable CDN links embedded in the HTML tags, which is exactly what we need to scrape. We clearly see that there is a unique id available for the img tag, named landingImage. Let's make use of that. Cheerio exposes the .attr() API, and we utilize it to extract the URL provided in the data-old-hires attribute of the img tag.

If you are wondering why we are not utilizing the src attribute: in parsed HTML, we normally find a base64 string embedded as the img source in the src attribute. This data is relatively more cumbersome to handle. You are free to use it, though. It depends purely on your inclination and the use case you are pursuing.

The final extracted output would appear as follows.
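Since the screenshot of the output is not reproduced here, this is a hypothetical shape of the final result; every value below is a placeholder, not data from a real product page.

```javascript
// Hypothetical final result assembled from the scraped fields.
const product = {
  title: 'Example Product Name',
  displayPrice: '₹1,23,456.00', // as shown on the page
  price: 123456,                // numeric form for computation
  features: ['First feature', 'Second feature'],
  imageUrl: 'https://m.media-amazon.com/images/I/example.jpg',
};
console.log(JSON.stringify(product, null, 2));
```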

There we have it: a simple web scraping function built using Cheerio and Node.

Easy-peasy.

Wrap Up

You can find the entire code in my repository on GitHub here.

Feel free to go through it. If you have any comments or improvements, I am all ears. Drop your feedback in the comments section, or reach out to me through the contact details listed on GitHub.

You can take this simple scenario and build far more advanced scraping systems on top of it. There are so many tools available on the internet. Let go of the chains if required and start exploring. The world is your playground, my friend.

And lastly, use your code for you.

Keep Coding, Peeps.
