Writing your first Web Scraper

Shreyas Gupta
Published in Nybles
Jul 25, 2020

But first, what is a Web Scraper?

The Internet has a lot of data, and if you can get that data programmatically, you can analyze and visualize it for personal interest or even for research purposes. To ‘get’ this data, you need to build a web scraper, which is a program that can automatically extract data from a website.

What’s the hype? What can I do with it?

Let’s not get into why companies or businesses use it, but rather what we can do with it. For us, the best use case is automation: we can convert a manual task into an automated one to save time, learn something new, and also look cool :).
Example 1: Want to buy a product on Amazon, but the price keeps fluctuating? You can build a web scraper that continuously scrapes the product page for the price and notifies you when there’s a dip, so you can buy at a good price.
Example 2: Want to know instantly when a new issue is posted in OpenCode? You can build a scraper that scrapes the issues page of a repository, checks whether a new issue has been posted, and notifies you if there is one. BUT, this is against OpenCode rules, so don’t do it if you don’t want to get disqualified xD.

Along with automation, you can also use it to get data for a project of yours. For example, here’s an instance where I used web scraping. In March, I decided to make a Covid-19 tracker that showed a heat-map of positive cases in Indian states. For that, I needed the number of cases in each state, so I built a data resource by scraping the official Ministry of Health and Family Welfare website. Here’s roughly how my code returned the data after scraping.
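Something like this, with hypothetical numbers and field names for illustration:

```
[
  {"state": "Maharashtra", "confirmed": 14592, "recovered": 2465, "deaths": 548},
  {"state": "Delhi", "confirmed": 8470, "recovered": 3045, "deaths": 115}
]
```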

So my web scraper was able to convert a table on a website into a JSON object that some other code can understand and utilize.

So how do I build a Web Scraper?

Prerequisites

Building a simple web scraper is actually easy; the only prerequisite for this particular tutorial is basic knowledge of Python, since that will be our language of choice. So let’s dive straight in!

Setting Up

You need two modules to begin web scraping: the first is requests, which will be used to fetch the HTML source of the website, and the second is beautifulsoup4, which will be used to parse the HTML and extract data from it. You can install these modules using pip:
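```
pip install requests beautifulsoup4
```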

Now we’re all setup to write our first scraper!

Starter Code

For this tutorial, we will be scraping the faculty list of the IT branch of IIITA, available at this URL: https://it.iiita.ac.in/?pg=faculty
Here is what our starter code looks like.
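A minimal sketch of that code (the html.parser argument is just Python’s built-in parser; any supported parser works):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the HTML source of the page we want to scrape
html = requests.get("https://it.iiita.ac.in/?pg=faculty").text

# Parse the HTML for later use
soup = BeautifulSoup(html, "html.parser")
```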

In the first two lines, we just import the needed modules.
After that, the requests.get() function takes in a URL and fetches the response, and .text on that response gives us its HTML. So if we printed the variable html here, it would print the whole HTML source of the website at that URL.
The BeautifulSoup() function takes in the html and parses it for later use.

Observing the HTML Structure

Now, to scrape data, we need to understand how the data is structured in the HTML, so we know which part of it to extract data from.

Luckily, we have a very simple structure here: the table we are looking for is inside a div with the id tooplate_content. This is going to be easy to search for, since we have a direct selector.
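In simplified form, the relevant part of the page looks something like this (an illustrative skeleton, not the site’s exact markup):

```html
<div id="tooplate_content">
  <table>
    <tr>
      <td>...</td>
      <td>...</td>
    </tr>
    <!-- one tr per faculty member -->
  </table>
</div>
```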

Implementing the Observed Structure

Now we will use functions to find and iterate over different HTML tags and selectors. We will explain the common functions by implementing them in our code. To query the HTML structure we observed, our code will be:
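Something along these lines:

```python
# Find the div with id tooplate_content (there is only one such div)
div = soup.find("div", {"id": "tooplate_content"})

# Find the table inside that div (again, only one)
table = div.find("table")

# Loop over every table row, then every cell inside that row
for tr in table.find_all("tr"):
    row = []
    for td in tr.find_all("td"):
        row.append(td.text)
    print(row)
```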

Two functions are used here, soup.find() and soup.find_all(), with certain arguments. soup.find() returns the first HTML element that matches the selector conditions passed in the arguments, while soup.find_all() returns a list of all matched elements. For both functions, the first argument is the HTML tag to query, and the second (optional) argument is a dictionary of attributes, for example id or class.

Here, we first find the div with id tooplate_content; we can use find() since we know there is only one such div. Then we find the table inside that div; again, we know there is only one such table. Inside that table, all the data is present in table rows, i.e. tr, so we loop over all the tr inside the table. Inside a table row, the data is present in td, so we loop over all the td in a tr, extract the text in each using .text, and append it to a list, which we print for every row. Here’s what the output of this code looks like:
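Something like this, with placeholder values standing in for the real page content, and all the raw whitespace still attached:

```
[]
['\xa0', '\n Dr. Firstname Lastname \n Professor \n Ph.D. (XYZ University) \n Interests: Area1, Area2 \n']
```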

Clearly this needs some post processing.

Tidying Up

The output we got is correct but unreadable, and some parts are useless, so we need to do some post-processing to make it actually useful.
No extra web-scraping knowledge is needed now; we are just reshaping the given data into a more useful form.

First of all, there are some extra rows, which can be filtered out by checking the list size of each row. Next, we will split the text on the \n character, strip each line, and organize the result into a dictionary object. Here’s the final code; I have used the pprint module to make the output readable.
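Something like this (the exact cell count and line positions are assumptions about this particular page’s markup):

```python
import requests
from bs4 import BeautifulSoup
from pprint import pprint

html = requests.get("https://it.iiita.ac.in/?pg=faculty").text
soup = BeautifulSoup(html, "html.parser")

table = soup.find("div", {"id": "tooplate_content"}).find("table")

faculty = []
for tr in table.find_all("tr"):
    cells = [td.text for td in tr.find_all("td")]
    # Filter out the extra rows by checking the list size
    # (the threshold here is an assumption about the page's markup)
    if len(cells) < 2:
        continue
    # Split the text cell on newlines and strip each piece, dropping empties
    lines = [line.strip() for line in cells[-1].split("\n") if line.strip()]
    if len(lines) < 4:
        continue
    faculty.append({
        "name": lines[0],
        "position": lines[1],
        "qualifications": lines[2],
        "interests": lines[3],
    })

pprint(faculty)
```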

Now the output looks like this:
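Roughly like this, with placeholder values (pprint sorts the keys alphabetically):

```
[{'interests': 'Area1, Area2',
  'name': 'Dr. Firstname Lastname',
  'position': 'Professor',
  'qualifications': 'Ph.D. (XYZ University)'},
 ...]
```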

Each faculty member now has their own object with the attributes name, position, qualifications, and interests, all extracted from the website. This data can now be used further somewhere else.

Some last words

Do you realize what we have achieved here? We managed to convert a table on a website into a usable JSON object. Just imagine the possibilities, the things you can scrape and use.
Of course, this was an easy example. The structure can be more complex, there may be authentication involved, and there may be CAPTCHAs ruining your day :).

But this was just an introduction; for more in-depth stuff you can always Google and find helpful results, since web scraping is definitely a popular enough topic. I would recommend the Web Scraping chapter from Automate the Boring Stuff: https://automatetheboringstuff.com/2e/chapter12/. It goes more in-depth and also shows the use of the selenium module, which is something you’ll have to use if you want to interact with the website dynamically as well.

And that’s a wrap! Good luck!

About Me

I am an undergrad at IIITA who is enthusiastic about technology and programming. I am a frequent open source contributor and love to do hobby projects that help me in my everyday life.
