Published in Weekly Webtips

How to Create a Web Page Scraper in Node.js

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.

Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping etc.¹

In this article, we’ll explore how to extract some data from an Amazon product page using Node.js. At the bottom of the article, you can find the source code of the application that we are going to build.

Let’s start by initializing a Node.js project with the following command:

npm init

After project initialization, we’ll add the library that will do the job of loading web page content:

npm i puppeteer

The ‘puppeteer’ library controls a browser programmatically, and we’ll use it here to fetch website content. By default, installing it also downloads a compatible build of the open-source Chromium browser, which is used for loading web pages. The library can also be used for taking screenshots, generating PDFs, automating form submissions, etc.

The library that we’ll use for parsing the fetched web page and extracting data from it is called ‘cheerio’. We can add it to our project with the following command:

npm i cheerio
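
To give a feel for what ‘cheerio’ does, here is a minimal sketch that reads the text of one element from an HTML string. The helper name `extractTextById` and the sample markup are made up for illustration. The library is accepted as an optional third parameter only so that it can be swapped for a stub in tests.

```javascript
// Minimal sketch: read the trimmed text of the element with the given id.
// `extractTextById` is a hypothetical helper name, not part of cheerio.
function extractTextById(html, id, cheerio = require('cheerio')) {
  const $ = cheerio.load(html);   // parse the HTML string
  return $(`#${id}`).text().trim(); // select by id and strip surrounding whitespace
}

// Example call (needs cheerio installed):
// extractTextById('<h1 id="headline">  Hello  </h1>', 'headline');
// returns 'Hello'
```

If no element matches the id, `text()` yields an empty string, so the helper returns `''` rather than throwing.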

The last dependency that we’ll use in the project is ‘express’, a minimal and flexible Node.js web framework. It can be added with the following command:

npm install express --save

Now we are ready to start the coding part. We are going to create a file that will contain a method for parsing website content. The attributes that we’ll try to extract from web page content are title, image, price, name of the seller, and features list.

Let’s start with the product title. Open the product page in a browser, right-click on the title, and click the ‘Inspect Element’ button (or a similarly named one). There we can see that the title element has an id of ‘productTitle’, which we will need in order to extract its value.

In a similar manner, we can try to find ‘id’ values of other elements that we want to extract from web page content (like price, name of the seller, etc.).

The method that does all the hard work of loading the web page and parsing its content is displayed below:
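
A minimal sketch of such a method, under stated assumptions, could look like the following. Only the ‘productTitle’ id comes from the inspection step above. The other selectors are assumptions that have to be checked against the live page, since Amazon’s markup changes over time. The optional `deps` parameter is a hypothetical addition that lets the libraries be replaced with stubs in tests.

```javascript
// Sketch of a scraper module (for example scraper_amazon.js).
// Only '#productTitle' is taken from the article; every other selector
// below is an assumption and may not match the current page markup.
const SELECTORS = {
  title: '#productTitle',
  image: '#landingImage',          // assumption
  price: '#priceblock_ourprice',   // assumption
  seller: '#bylineInfo',           // assumption
  features: '#feature-bullets li', // assumption
};

async function scrap(url, deps = {}) {
  const puppeteer = deps.puppeteer || require('puppeteer');
  const cheerio = deps.cheerio || require('cheerio');

  // Load the page in a headless browser and grab the rendered HTML.
  const browser = await puppeteer.launch();
  let html;
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    html = await page.content();
  } finally {
    await browser.close(); // always release the browser process
  }

  // Parse the HTML and pull out each field by its selector.
  const $ = cheerio.load(html);
  const text = (selector) => $(selector).first().text().trim();
  return {
    title: text(SELECTORS.title),
    image: $(SELECTORS.image).attr('src') || null,
    price: text(SELECTORS.price),
    seller: text(SELECTORS.seller),
    features: $(SELECTORS.features)
      .map((i, el) => $(el).text().trim())
      .get(),
  };
}

module.exports = { scrap };
```

Fields whose selector matches nothing come back as empty strings (or `null` for the image), which makes a stale selector easy to spot in the output.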

The ‘scrap’ method starts with the calls that launch the Chromium browser and load the page (using the ‘puppeteer’ library); the rest of the method is focused on extracting the data.

The file that starts the Node.js application contains a single endpoint for invoking the scraping job. In it we include our ‘scraper_amazon’ module, which contains the method for scraping web page content:
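
As a rough sketch of what such a file could contain: the ‘/scrape’ route, the ‘url’ query parameter, port 3000, and the helper names `makeScrapeHandler` and `startServer` are all assumptions, and ‘scraper_amazon’ is assumed to export a `scrap(url)` function that returns a promise.

```javascript
// Sketch of index.js. Route name, query parameter and port are assumptions;
// the article only states that there is a single endpoint.

// Builds an Express-compatible handler around a scrap(url) function.
function makeScrapeHandler(scrap) {
  return async (req, res) => {
    const { url } = req.query;
    if (!url) {
      return res.status(400).json({ error: 'Missing "url" query parameter' });
    }
    try {
      return res.json(await scrap(url));
    } catch (err) {
      // Report scraping failures (navigation errors, timeouts) to the caller.
      return res.status(500).json({ error: err.message });
    }
  };
}

// Wires the handler into an Express app and starts listening.
function startServer(port = 3000) {
  const express = require('express');
  const { scrap } = require('./scraper_amazon'); // assumed to export scrap(url)
  const app = express();
  app.get('/scrape', makeScrapeHandler(scrap));
  return app.listen(port, () => console.log(`Listening on port ${port}`));
}

module.exports = { makeScrapeHandler, startServer };
```

With a final `startServer();` line added to the file, a request to `http://localhost:3000/scrape?url=` followed by the product page address would return the extracted fields as JSON. Keeping the handler separate from the server setup makes it possible to exercise it with plain request and response stubs.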

After we’ve made both files, we can run the sample by invoking the command:

node index.js

and that’s it. 😃

This was a brief tutorial on how web scraping can be done using Node.js, and I wish you many projects that make use of this nice technique!

Source code of sample application:

[1]: “Web scraping”, Wikipedia, https://en.wikipedia.org/wiki/Web_scraping.
Accessed 5 April 2021.




Zoran Šaško

Web & mobile application developer
