Johnson-Awah Alfred
Published in devWithFred
Dec 30, 2018
Web Scraping with Node.js

In this post, we’ll learn how to use Node.js (not entirely on its own) and friends (axios and cheerio) to perform quick and effective web scraping of single-page applications. This can help us gather and use valuable data which isn’t always available via APIs. Let’s dive in.

What is web scraping?

Web scraping is a technique for extracting data from websites using a script. It automates the laborious work of copying data from various websites by hand.

Web scraping is generally performed when a website doesn’t expose an API for fetching its data. So what do we do? We show them how it is done (Hohohoho!).

What do we need?

Getting started with web scraping is easy, and the process can be divided into two simple parts:

  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM

We will be using Node.js for web-scraping. If you’re not familiar with Node, check out this article “The only NodeJs introduction you’ll ever need”.

We will also use two open-source npm modules:

  • Axios — Promise based HTTP client for the browser and node.js.
  • Cheerio — jQuery for Node.js. Cheerio makes it easy to select, edit, and view DOM elements.

You can learn more about comparing popular HTTP request libraries here.

Setup

Our setup is pretty simple. We create a new folder and run npm init inside it to create a package.json file. Who is hungry? Let’s cook a recipe to make our food delicious.

- Find a cool directory for this mini project

- Open up your terminal

- Initialize your project with npm (npm init -y), as shown below
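Concretely, those steps might look like this in a terminal (the folder name hn-scraper is just a placeholder):

mkdir hn-scraper
cd hn-scraper
npm init -y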

Before we start cooking, let’s collect the ingredients for our recipe. Add axios and cheerio from npm as our dependencies:

npm install axios cheerio

Quickly, create an index.js file and require our modules.

const axios = require('axios');
const cheerio = require('cheerio');

Make the Request

We are done collecting the ingredients for our food, so let’s start cooking. We are scraping data from the Hacker News website, for which we need to make an HTTP request to get the website’s content. That’s where axios comes into action.

axios request
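The original post embeds this request as a screenshot; a minimal sketch of it might look like the following, assuming Hacker News as the target URL:

const axios = require('axios');

axios
  .get('https://news.ycombinator.com')
  .then(response => {
    // response.data holds the raw HTML of the page
    console.log(response.data);
  })
  .catch(error => console.error(error));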

You should get a response that looks like an HTML body with tags. (Don’t get scared by your response!) It’s basically the HTML structure of Hacker News at the time of writing this post. Remember to use node index.js to run your Node application, else you’ll keep waiting for a result.

We are getting the same HTML content we would get when making a request from Chrome or any other browser. Now we need some help from the Chrome Developer Tools to search through the HTML of the web page and select the required data. You can learn more about the Chrome DevTools here.

We want to scrape the news headings and their associated links. You can view the HTML of the webpage by right-clicking anywhere on it and selecting “Inspect”.

Hacker News in the Chrome DevTools

Parsing HTML with Cheerio.js

Cheerio is the jQuery of Node.js; we use selectors to select tags of an HTML document, and the selector syntax is borrowed from jQuery. Using Chrome DevTools, we need to find the selector for the news headlines and their links. Let’s add some spices to our food.

Selecting the DOM tags to scrape

Code to select our target DOM node
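The post shows this code as a screenshot; a sketch of the selection logic might look like the following. The tr.athing and td:nth-child(3) selectors are assumptions based on Hacker News’ markup at the time of writing:

const axios = require('axios');
const cheerio = require('cheerio');

axios.get('https://news.ycombinator.com').then(response => {
  // Load the fetched HTML into cheerio
  const $ = cheerio.load(response.data);
  const stories = [];

  // Each story sits in a table row; the title link lives in the 3rd cell
  $('tr.athing').each((i, row) => {
    const anchor = $(row).find('td:nth-child(3) a').first();
    stories.push({
      title: anchor.text(),
      link: anchor.attr('href'),
    });
  });

  console.log(stories);
});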

First, we need to load in the HTML by passing the document to Cheerio. After loading the HTML, we iterate over every occurrence of the table row to scrape each and every news item on the page. From the Chrome dev tools, our target is the 3rd child of each row, which holds the title of the news item. We loop through the results and form an array of objects storing the title and link of each story.

The output will look like the nice array of objects below:

[
  {
    title: 'Japan whale hunting: Commercial whaling to restart in July',
    link: 'https://www.bbc.co.uk/news/world-asia-4668297'
  },
  {
    title: 'Halide: a language for fast, portable computation on images and tensors',
    link: 'http://halide-lang.org'
  }
]

Now you can actually do anything with this data: send it to a database, save it to a file, and lots more.

Screenshot of final code
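The final code appears as a screenshot in the original post; a self-contained sketch of the whole script, under the same selector assumptions as above, might look like this:

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeHackerNews() {
  // Fetch the page and load the HTML into cheerio
  const { data } = await axios.get('https://news.ycombinator.com');
  const $ = cheerio.load(data);

  const stories = [];
  $('tr.athing').each((i, row) => {
    const anchor = $(row).find('td:nth-child(3) a').first();
    stories.push({ title: anchor.text(), link: anchor.attr('href') });
  });
  return stories;
}

scrapeHackerNews()
  .then(stories => console.log(stories))
  .catch(err => console.error(err));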
