A guide to scraping the Web

Hack A BIT · Published Oct 29, 2019 · 5 min read

What is web scraping?

In layman’s terms, it is obtaining the data present on a web page. Sounds simple, right? Open the page, find the data, save it, and we are done. Pretty straightforward. But now suppose you need data from hundreds of different web pages: you can’t visit every page and retrieve the data manually, so you need an automated program to get the work done for you. In this article, we will build a web scraper and discuss the challenges and hardships faced in the process.

Web Scraping, why?

Data, as you know, is the root of everything. Without data there can be no analysis and no predictions. These days, organizations and firms have become strict about the availability of their data. Data has become expensive, and there are situations where one has to pay to collect data that is already visible on web pages.

Not all websites provide an API for access to their data, so it is important to learn how to scrape the web.

Getting Started

We will be building a web scraper in Node.js using the modules axios and cheerio. Initialize an npm project and install the dependencies using the command line.

npm init

After initializing the project, download the dependencies.

npm install axios cheerio

We have installed the required modules and are now ready to build our web scraper. We will scrape data from Cricbuzz and print all the series that were played in the year 2018.

The target is Cricbuzz’s series archive page for 2018; in the end, we will print each series name and the time of its occurrence.

How to scrape this page?

We will request this page and, once we receive it, find every ‘div’ tag that has the class name ‘cb-srs-lst-itm’ and retrieve the data from it.

Let’s write the code for the process.
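Below is a minimal sketch of that process. The archive URL and the exact markup inside each cb-srs-lst-itm block are assumptions based on the description above, so verify both against the live page’s HTML and adjust the selectors if needed.

// scraper.js - a minimal sketch; the URL and inner selectors are assumptions
const axios = require('axios');
const cheerio = require('cheerio');

// Assumed location of Cricbuzz's 2018 series archive page
const URL = 'https://www.cricbuzz.com/cricket-scorecard-archives/2018';

async function scrapeSeries() {
  // Request the page and load the returned HTML into cheerio
  const { data: html } = await axios.get(URL);
  const $ = cheerio.load(html);

  // Every series entry sits in a div with the class 'cb-srs-lst-itm'
  $('div.cb-srs-lst-itm').each((_, el) => {
    const name = $(el).find('a').first().text().trim();    // series name (assumed markup)
    const when = $(el).find('span').first().text().trim(); // date range (assumed markup)
    console.log(`${name} | ${when}`);
  });
}

scrapeSeries().catch(console.error);

Run it with node scraper.js; if nothing prints, inspect the page and update the selectors, as class names on sites like Cricbuzz change over time.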

Now you have learnt how to build a simple web scraper. But things are not always this simple, as most websites deploy anti-scraping mechanisms to protect their data.

Let’s discuss some of the mechanisms companies apply to stop programmers from scraping the data available on their web pages.

What is the most basic way to collect data from different webpages of a particular company?

Collect the URLs, run a loop, scrape data for every page, and the job is done.
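In code, that naive approach looks something like the sketch below; the URLs are placeholders.

// A minimal sketch of the naive approach: loop over known URLs
const axios = require('axios');

const urls = [
  'https://example.com/page/1', // hypothetical pages of one site
  'https://example.com/page/2',
];

(async () => {
  for (const url of urls) {
    const { data: html } = await axios.get(url);
    console.log(url, html.length); // placeholder for real parsing and saving
  }
})();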

But there is a problem here. With this approach, you send a request to the website and, once you receive the response, you parse the entire HTML and extract your data.

Problem?

Many websites don’t load all their data at once; rather, the data is loaded only when you scroll down to that section of the page. In this case, the basic approach fails, but there is a different approach for such websites.

Solution?

We can use browser automation tools, where we request that the page be loaded in a real browser. We start scraping only after the data has loaded, which we can confirm by checking for the presence of the relevant selectors.
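Here is a minimal sketch of this idea using Puppeteer, a popular browser-automation library for Node.js; the URL and the selector are hypothetical placeholders.

// A sketch of waiting for AJAX-loaded content before scraping it
const puppeteer = require('puppeteer');

async function scrapeDynamicPage() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/announcements'); // hypothetical page
  // Wait until the AJAX-loaded content actually appears in the DOM
  await page.waitForSelector('.announcement');          // hypothetical selector
  const items = await page.$$eval('.announcement', els =>
    els.map(el => el.textContent.trim())
  );
  console.log(items);
  await browser.close();
}

scrapeDynamicPage().catch(console.error);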

We used this approach in a script that collects announcements from the Training and Placement Portal of our college, where the announcements are loaded via AJAX calls.

Similarly, you can scrape any other website using browser automation and save it for further use.

More hardships ahead?

So now you have the resources to scrape any web page. But there is still a problem. You are sending requests to a website to get a particular page, and you are repeating the process over and over. Many websites have bot-detection mechanisms: they will detect this scraping activity and send you responses in a way that halts your process. At times, they might block your IP permanently.

How to proceed?

The solution comes at the cost of either money or time. There are two ways you can go about this problem:

  1. Subscribe to an IP rotator. An IP rotator is a service that changes your IP, assigning you one from a pool of available IPs whenever you request one. There are many IP-rotation providers, but as mentioned earlier, the solution costs either time or money, and in this case it’s money. You can then proceed by changing your IP every time you send a request to the web page; a sketch of this rotation follows the list.
  2. Switch to a Virtual Private Network (VPN) every time you send a request, or after you have made a certain number of requests. How to do this? You write automation code that connects to the VPN, waits for the connection to be established, and then sends requests to the websites. Switching VPNs comes at the cost of time, as it slows down the process and you have to wait longer for your work to get done.
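Here is a minimal sketch of request-level IP rotation, routing each axios request through a different proxy; the proxy addresses are placeholders for whatever pool your provider gives you.

// Round-robin proxy rotation for outgoing requests
const axios = require('axios');

// Hypothetical proxy pool from an IP-rotation provider
const proxies = [
  { host: '203.0.113.10', port: 8080 },
  { host: '203.0.113.11', port: 8080 },
];

let next = 0;
async function fetchWithRotation(url) {
  // Cycle through the pool so consecutive requests use different IPs
  const proxy = proxies[next++ % proxies.length];
  const { data } = await axios.get(url, { proxy });
  return data;
}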

Start scraping bigger websites!

You write a program to scrape images from a Facebook page using the above-mentioned mechanisms, you implement automation to switch VPNs, and you are ready to run the program. You run it and go to sleep thinking that your work will be done by the time you wake up. Surprise, surprise! Facebook has detected that you are switching VPNs, and now you are left fuming despite having implemented everything you could. Sending requests with request modules, automating a browser, switching VPNs between runs: none of it works, because Facebook has one of the strongest anti-scraping mechanisms around.

Is there still a way? Yes, there always exists a solution to a problem. Open your browser, go to the page whose posts you want to scrape, select the data, copy it and paste it.

Or you can automate this process using desktop automation tools. Desktop automation is hard to detect, CAPTCHAs aside. The only problem here is that if you need to read text, you can’t do it directly; you have to read everything as pixels, which can be done using digital image processing. Node.js has a module called RobotJS that can be used for the automation, and you can use digital image processing libraries from Python, connecting the two through a child process.
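As a rough illustration of that pipeline, here is a sketch that captures a region of the screen with RobotJS and hands the raw pixels to a Python script for processing; the coordinates and the script name read_text.py are hypothetical.

// Capture screen pixels with RobotJS and pass them to Python
const robot = require('robotjs');
const { execFileSync } = require('child_process');
const fs = require('fs');

// Capture a region of the screen (x, y, width, height are placeholders)
const shot = robot.screen.capture(100, 200, 400, 300);
fs.writeFileSync('region.raw', shot.image); // raw pixel buffer

// Hand the pixels to a Python script for image processing / OCR.
// 'read_text.py' is a hypothetical helper, not part of RobotJS.
const text = execFileSync('python', [
  'read_text.py', 'region.raw', `${shot.width}x${shot.height}`,
]).toString();
console.log(text);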

Conclusion

All in all, web scraping is not merely retrieving data from a web page; it is much more than that. A good web scraper draws on many different technologies, and in an era where companies are very protective of their data, it is important to know how to build one well.

Written by-

Nishant Kumar

Birla Institute of Technology, Mesra
