JavaScript Scraper with Node

Songyishu Yang
3 min read · Mar 4, 2019


A no-BS guide to opening a link, scraping its data, and writing the result to a new file with JavaScript and Node.js.

Step 1. Find the right NPM package

“Node.js is an open-source, cross-platform JavaScript run-time environment that executes JavaScript code outside of a browser.” Put simply, Node lets JavaScript run as a standalone application (much like Python or Ruby) instead of only inside web browsers.
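As a quick sanity check (assuming Node is already installed on your machine), you can run JavaScript straight from the terminal, no browser involved:

```shell
node -e "console.log('JavaScript outside the browser')"
```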

As such, it ships with NPM, a package registry hosting a wide variety of open-source Node.js packages.

Instead of writing our own HTML parser, we will simply leverage an existing open-source Node package published on NPM: node-html-parser.

Step 2. Configure the environment in the console

Under your scraper’s project directory, let’s first install the NPM package ‘node-html-parser’. To do so, type the following in your console:

$ npm install --save node-html-parser

As we will need fetch to open the link of the website we want to scrape, we will need to install the node-fetch package as well:

$ npm install --save node-fetch

Then go ahead and create a new js file:

$ touch scraper.js

Step 3. Require all necessary parts in the JavaScript file

Now open the scraper.js with your text editor.

Since we are now in a Node environment, we need to require a few modules in order for things to run:

// scraper.js
const HTMLParser = require('node-html-parser')
const fetch = require('node-fetch')
const fs = require('fs')

HTMLParser requires the node-html-parser package. fetch and fs are familiar names from browser and Node code, but they are not globals here: fetch must come from the node-fetch package (newer Node versions ship a built-in fetch, but at the time of writing they did not), and fs is Node’s built-in file-system module, which still has to be required before use.

Step 4. Finally, the scraping!

Using a simple JavaScript fetch, you can get the web content of a particular URL as text:

// scraper.js
const url = "https://medium.com/@songyishu.yang"
fetch(url)
  .then(res => res.text())

We will then pass the text content to the HTMLParser we defined (via the node-html-parser package) to get a ‘readable’ object that responds to all sorts of search/find methods provided by the package. I like to define a root variable just to keep track of the overall content we receive from the URL.

// scraper.js
let root
const url = "https://medium.com/@songyishu.yang"
fetch(url)
  .then(res => res.text())
  .then(body => root = HTMLParser.parse(body))

Then you may pass the HTML-parsed object into a new function that will manipulate the data and extract the exact bits you want.

// scraper.js
let root
const url = "https://medium.com/@songyishu.yang"
fetch(url)
  .then(res => res.text())
  .then(body => root = HTMLParser.parse(body))
  .then(() => extractData(root))

Luckily, node-html-parser is quite user-friendly and provides methods that look A LOT LIKE the query methods you would use in a Chrome browser console:

// scraper.js
function extractData(root) {
  const description = root.querySelector('p')
  console.log(description)
}

With some small tweaks, such as rawText instead of innerText:

// scraper.js
function extractData(root) {
  const description = root.querySelector('p').rawText
  console.log(description)
}

You can read the full documentation on all query methods applicable to the package on the npm page.

Step 5. Saving it somewhere

It can be quite annoying if the website you scrape suddenly becomes unavailable. You can keep the data you scrape safe by writing it directly into a JSON file after fetching.

Even though this might defeat the purpose of having a JavaScript scraper (for small amounts of scraping off the back of browser rendering), it can be useful at times.

To do so, simply add a call to fs.writeFile at the end of your extractData function:

// scraper.js
function extractData(root) {
  const description = root.querySelector('p').rawText
  console.log(description)
  fs.writeFile(
    'scrappeddata.json',
    JSON.stringify(description),
    function(err) {
      if (err) throw err
      console.log('successfully saved file')
    }
  )
}

fs.writeFile will create scrappeddata.json in the folder if it does not already exist (and overwrite it if it does).

See the full code here: https://github.com/YSongYS/JavscriptScrapperExample

The End.
