How To Create a Web Page Scraper in NodeJs
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.
Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping etc.¹
In this article, we’ll explore how we can extract data from an Amazon product page using NodeJs. At the bottom of the article, you can find the source code of the application that we are going to build.
So let’s start by initializing a NodeJs project by executing the following command:
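The command itself is not shown above; a typical way to initialize the project would be (the -y flag accepts the default answers):

```shell
npm init -y
```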
After project initialization, we’ll add the library that will do the job of loading web page content:
npm i puppeteer
The main purpose of the ‘puppeteer’ library here is to fetch website content. It downloads an open-source Chromium browser that it uses for loading web pages. The library itself can also be used for generating screenshots and PDFs of pages, automating form submissions, etc.
The library that we’ll use for extracting/parsing data from the fetched web page is called ‘cheerio‘ and we can add it to our project by invoking the following command:
npm i cheerio
The last dependency that we’ll use in the project is ‘express‘, a minimal and flexible NodeJs web framework. It can be added by invoking the following command:
npm install express --save
Now we are ready to start coding. We are going to create a file that will contain a method for parsing website content. The attributes that we’ll try to extract from the page are the title, image, price, seller name, and feature list.
So, let’s start by finding the title of the product. Open the product page, right-click on the title and choose ‘Inspect Element’ (or a similarly named option). There we can see that the title element has an id of ‘productTitle’, which will be important in order to extract the value out of it.
In a similar manner, we can find the ‘id’ values of the other elements that we want to extract from the page (like the price, seller name, etc.).
The method that does all the hard work of loading the web page and parsing its content is displayed below:
As you can see, at the top of the ‘scrap’ method are the calls that launch the Chromium browser (using the ‘puppeteer’ library), and afterward the focus is on extracting the data.
The file that will be used for starting the NodeJs application will contain a single endpoint for invoking the scraping job, and there we’ll include our ‘scraper_amazon’ module, which contains the method for scraping web page content:
After we’ve made both files, we can run the sample by invoking ‘node’ on the entry file (e.g. node index.js, assuming that is its name)
and that’s it. 😃
This was a brief tutorial on how web scraping is done using NodeJs, and I wish you many projects that make use of this nice technique!
Source code of the sample application:
¹ “Web scraping”, Wikipedia, https://en.wikipedia.org/wiki/Web_scraping. Accessed 5 April 2021.