Web scraping with JavaScript

Akash Milton
Feb 9, 2018
Photo by michael podger on Unsplash

Most web scraping tutorials you come across teach Python (many developers are comfortable with it). Is Python more efficient for scraping? Maybe (I'm not sure). So why JavaScript? I don't have a strong reason either; some people simply prefer JavaScript to Python.

Why scraping?

Scraping is one of the most practical ways to collect a lot of data. As a learner or a startup, you often cannot afford to buy datasets online, so scraping is the way to go.

Is scraping illegal?

Kind of. Check the site's robots.txt file before scraping; it tells you which parts of the site crawlers are asked to stay away from. Sending a lot of requests to a page in a very short time can also get you into trouble, so be careful, or you may end up with a legal notice.
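To make the robots.txt check and the "not too many requests" advice concrete, here is a minimal sketch in plain JavaScript. It is not part of the spiders package: parseDisallow, isAllowed, sleep and crawlPolitely are hypothetical helper names, and the parser is deliberately simplified (it reads only User-agent and Disallow lines and ignores Allow rules, wildcards and Crawl-delay).

```javascript
// Simplified robots.txt handling: only "User-agent" and "Disallow" are read.
// All function names here are illustrative, not from any library.
function parseDisallow(robotsTxt, userAgent = '*') {
  const rules = [];
  let applies = false;   // does the current group apply to our user agent?
  let inHeader = false;  // are we inside a run of consecutive User-agent lines?
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      const match = value === userAgent;
      applies = inHeader ? applies || match : match;
      inHeader = true;
    } else {
      inHeader = false;
      if (field === 'disallow' && applies && value !== '') {
        rules.push(value);
      }
    }
  }
  return rules;
}

// A path is allowed if it does not start with any disallowed prefix.
function isAllowed(path, rules) {
  return !rules.some((prefix) => path.startsWith(prefix));
}

// Pause between requests so the target site is not flooded.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// visit is a caller-supplied async function that fetches one path.
async function crawlPolitely(paths, rules, visit, delayMs = 1000) {
  for (const path of paths) {
    if (!isAllowed(path, rules)) continue; // respect robots.txt
    await visit(path);
    await sleep(delayMs);
  }
}

const sampleRobots = 'User-agent: *\nDisallow: /private/\n';
const sampleRules = parseDisallow(sampleRobots);
console.log(isAllowed('/private/data.html', sampleRules)); // false
console.log(isAllowed('/blog/post.html', sampleRules));    // true
```

In a real crawler you would first download robots.txt from the site root and pass its text to parseDisallow; the one-second default delay is just an assumed starting point.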

How to scrape? Basic Level

I have created an npm package named spiders (completely free to use). Install it with yarn or npm (I prefer yarn):

yarn add spiders

or

npm install spiders

Import the package into your project using require or import, then create a Spiders object, optionally passing options.

const Spiders = require('spiders');
let spidy = new Spiders();
spidy.crawl('www.google.com').then($ => {
  // $ works like jQuery; see the jQuery or cheerio docs for selectors
  let title = $("title").text();
  console.log(title); // e.g. Google
});

It can also download files:

spidy.download('http://url/to/downld.png').then(() => {
  console.log("Downloaded");
});

Crawling — Advanced Level

Spiders() accepts an optional options object that can make crawling more efficient. By default, an object {url: url} is stored in the visited array for each crawled URL, e.g. [{url: 'http://google.com'}]. Before each crawl, the URL is checked against that array using the isVisit() function; if it has not been visited yet, it is recorded using setVisit(). You can change the shape of the stored object and also supply a list of previously visited nodes. This is very helpful when you scrape in several separate runs rather than in one continuous session.

let options = {
  showStats: true,
  visited: [], // previously visited nodes
  isVisit: function(obj, url) { return obj.url == url; },
  proxy: null,
  setVisit: function(url) { return { url: url }; }
};
let spidy = new Spiders(options);
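To see how the visited list, isVisit and setVisit fit together, here is a small standalone sketch of the bookkeeping described above. It is my own re-creation for illustration, not the actual source of spiders: shouldCrawl is a hypothetical helper, and the extra at field is an assumed example of altering the stored object.

```javascript
// Hypothetical re-creation of the visited-list bookkeeping (not the spiders source).
const defaults = {
  visited: [],
  isVisit: (obj, url) => obj.url === url,
  setVisit: (url) => ({ url }),
};

// Returns true and records the URL if it has not been seen yet; otherwise false.
function shouldCrawl(options, url) {
  const alreadySeen = options.visited.some((obj) => options.isVisit(obj, url));
  if (alreadySeen) return false;
  options.visited.push(options.setVisit(url));
  return true;
}

// Resume an earlier run: one URL is already known, and each new entry
// also stores a timestamp next to the URL.
const resumeOptions = {
  ...defaults,
  visited: [{ url: 'http://google.com', at: 0 }],
  setVisit: (url) => ({ url, at: Date.now() }),
};

console.log(shouldCrawl(resumeOptions, 'http://google.com'));  // false (pre-visited)
console.log(shouldCrawl(resumeOptions, 'http://example.com')); // true  (new URL)
console.log(shouldCrawl(resumeOptions, 'http://example.com')); // false (now recorded)
```

Saving the visited array (for example as JSON) at the end of a run and passing it back as visited next time is what lets a crawl pick up where it left off.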

Leave a comment if anything needs clarification.
