Scraping websites with NodeJS

Beware: Scraping may be banned on certain websites and thus illegal (Facebook, for instance), and so is downloading movies or crossing the street on a red light.. Beware..
Disclaimer: In a normal situation, you would keep your methods in a separate file from the entry point of your Node app.
For a ‘real life’ demonstration with Heroku integration, please refer to my repository.
Scraping is really useful and can come in pretty handy at times.
About a week ago, I was tasked with displaying a news feed about cybersecurity on a dashboard. Here’s the deal: cybersecurity news APIs are not that common, and if you happen to find one, you have to pay.
So I went ahead and built my scraping script.
If you happen (we never know) to be looking for a cybersecurity news feed in JSON format, please head to the live Heroku app here: https://cyber-news-scrapr.herokuapp.com/news
So, we are going to need :
- node (of course)
- npm
- cheerio (npm install --save cheerio)
- request (npm install --save request)
- a tiny brain (a tiny one will do)
The process is fairly simple:
We’ll request a web page via the ‘request’ package. Once we have the page, we’ll parse it with ‘cheerio’, a jQuery-like DOM parser for Node (super useful when it comes to parsing), iterate through every DOM element (in this case: each article), and build our custom article object.
After each iteration, the article object is pushed into an array.
This array is eventually returned to us and served over a port with the help of Express.
So here is the full code:
PS: If you want to know how to use Heroku with NodeJS, please watch this video.
