Handle Web Scraping Using Node.js

In the world of APIs, most websites provide an API to extract data from their site. But we cannot get all the data we want from those APIs, and we often need that data for many purposes, such as analysing it to improve other products. Whether you want to start a new project or add new avenues of growth to an existing company, you need to access and analyse a vast amount of data. This is where web scraping comes in. Google, for example, crawls other sites and uses that data in its search engine.

Web scraping is the process of automating the extraction of data in an efficient and fast way. With the help of web scraping, you can extract data from any website, no matter how large, straight onto your computer.

So, let's dive into the ways to handle scraping using Node.js.

Puppeteer

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.

WHAT IS HEADLESS?

A headless browser is a browser without a GUI (graphical user interface).

Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but are driven via a CLI. They are particularly useful for testing web pages, as they are able to render and understand HTML the same way a browser would, including styling elements and executing JavaScript.

Puppeteer can be used for different purposes -

  1. Generate PDFs from HTML pages.
  2. Scrape website data into JSON.
  3. Capture screenshots of a website in image format.
  4. Performance-test a website.

Let's start with a basic Puppeteer example. We'll write a script that drives our headless browser to scrape news headlines from a website.

First, install Puppeteer with the command:

npm i puppeteer

This command installs both Puppeteer and a compatible version of Chromium.

We will be using https://news.ycombinator.com/, a news site, and extracting the headings of all the news items.

First, if you inspect the source code of the homepage using the Dev Tools in your browser, you will notice that the page lists each heading inside an a tag with the class storylink.

We will be using that selector to get the headings, as below.

  • headless: false means the browser will run with an interface so you can watch your script execute, while headless: true means the browser will run in headless mode. Note well, however, that if you want to deploy your scraper to the cloud, set headless back to true. Most virtual machines are headless and do not include a user interface, and hence can only run the browser in headless mode.

The goto method opens the provided link in the Chromium browser.

The page.waitForSelector() method waits for the a tags that contain the news headings to be rendered in the DOM. Then we call the page.evaluate method, which runs in the page context and gets the heading elements with the selector a.

Inside it, the querySelectorAll method gets all the matching anchor elements; we trim the text of each result and store it in an array. Finally, we log the scraped data to the console.

Run the code using node index.js, and the scraped headlines are printed to the console.

JSDOM

JSDOM is a pure JavaScript implementation of the DOM for use in Node.js. As mentioned previously, the DOM is not available in Node, so JSDOM is the closest you can get.

Since a DOM is created, it is possible to interact programmatically with the web application or website you want to crawl.

First, install jsdom with the command:

npm i jsdom

First, jsdom creates a DOM instance from the HTML you provide; you could fetch a page's HTML from a link, but for simplicity we have created the HTML inside the file itself.

Then, using querySelectorAll, we select all the p elements and print the first one.

This is very simple and easy to implement. We have other options for scraping, like Cheerio. We can take a lot of advantage of these scraping tools.

If you enjoyed this post, don’t forget to give it a 👏🏼, share it with a friend you think might benefit from it, and leave a comment! Stay tuned for more exciting blogs on Flutter, Elixir, React, Angular, Ruby, etc.

Thanks !!
