Photo by Vitaly Vlasov from Pexels

Data Scraping Tutorial: an easy project for beginners.

In this tutorial, I will walk you through the fundamentals of data crawling using BeautifulSoup in Python as you write the code from scratch.

Taylan Kabbani
Oct 9, 2020 · 4 min read

If you are a data scientist, engineer, analyst, or just someone who collects data as a hobby, you will often need to build your own dataset, despite the huge number of datasets available online, by scraping the messy, vast, and wild web. To do so, you need to become familiar with what we call web scraping, crawling, or harvesting.

Objective: Using the BeautifulSoup library in Python, create a bot that crawls the names of private universities, along with the URLs of their home websites, in a user-specified country and saves them as an xlsx file.

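We will be using the following libraries: Requests to download pages, BeautifulSoup (bs4) to parse the HTML, and pandas to export the results to an xlsx file. The complete bot also uses a progress bar.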

How does web scraping work?

When you open your browser and click on a link to a page, the browser sends a request to the web server that hosts the page's files. We call this a GET request because we are getting the page's files from the server. The server processes the incoming request over HTTP and several other protocols and sends back the files required to display the page. The browser then renders the HTML source of the page in a clean, readable form.

In web scraping, we create a GET request that mimics the one sent by the browser so that we can get the raw HTML source of the page. We then wrangle that source to extract the desired data by filtering HTML tags.

GET Request

Now that we have a general idea of how to extract data from the web, the first step is to download the page from the server. We can do this with the Requests library in Python.

Let's download the main page, which we will use to navigate to the other pages and extract data. In this tutorial, we will use the uniRank directory to search for universities in a given country.
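Here is a minimal sketch of the download step, assuming the Requests library is installed. The address `https://www.4icu.org/` is my assumption for the uniRank home page, and `download_page` is a hypothetical helper name, so treat both as placeholders rather than the original code. The last lines only build the GET request and print it, without sending anything over the network.

```python
import requests

# Assumed address of the uniRank home page; adjust it if the site has moved.
UNIRANK_URL = "https://www.4icu.org/"


def download_page(url: str, timeout: float = 10.0) -> requests.Response:
    """Send a GET request for `url` and return the response object."""
    # A timeout keeps the script from hanging if the server never answers.
    return requests.get(url, timeout=timeout)


# Build (but do not send) the same GET request to see what goes on the wire.
preview = requests.Request("GET", UNIRANK_URL).prepare()
print(preview.method, preview.url)
```

Calling `download_page(UNIRANK_URL)` performs the actual download and hands back the response object.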

The request returns a response object whose status code indicates whether the download succeeded: a code starting with 2 usually indicates a successful request, while a code starting with 4 or 5 mostly means an error. We can view the HTML source of the page through the response's .text property.
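The status code rule above can be sketched as a small helper. `describe_status` is a hypothetical function written for illustration; with a real response you would pass it `response.status_code`.

```python
def describe_status(status_code: int) -> str:
    """Classify an HTTP status code by its first digit."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


print(describe_status(200))  # success
print(describe_status(404))  # client error
print(describe_status(503))  # server error
```

Requests also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes, if you prefer the script to stop on a failed download.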

BeautifulSoup

Now that we have successfully downloaded the page's HTML source, we can start parsing it with the BeautifulSoup library, which makes pulling data out of HTML quite easy. Here are the steps that we will follow to find private universities along with their URLs in a given country:

1. Extract continent names and URLs

The first step is to create a dictionary from the uniRank home page, with continent names as keys and their uniRank URLs as values. To do so, we need to create a BeautifulSoup object so that we can parse the content and extract the desired data.

We use the BeautifulSoup() constructor to create a bs4 object from the text of the downloaded page, and we specify that the page should be parsed as HTML. You can now view the HTML of the page in a more structured way by calling the prettify() method on the BeautifulSoup object.
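A minimal sketch of creating and inspecting the BeautifulSoup object. To keep it runnable offline, `html_text` is a made-up stand-in for the `.text` of the downloaded page, and I assume Python's built-in `html.parser` as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, so the example runs offline.
html_text = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" is Python's built-in HTML parser; no extra install is needed.
soup = BeautifulSoup(html_text, "html.parser")

print(soup.title.string)  # the text inside the <title> tag
print(soup.prettify())    # the same HTML, indented with one tag per line
```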

Using Chrome DevTools, we can easily navigate to the HTML tags we need to extract. Here we can see that continent names and URLs are stored as a list, so we need to navigate to the div tag with class col-sm-4.

We can use the .select() method, which uses the SoupSieve package to run a CSS selector against the parsed document and return all the matching elements.

We can further filter the result by selecting <li> tags, which define list items in HTML, and <a> tags, which define hyperlinks. You can check the list of HTML tags in any HTML reference.
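As a small illustration of `.select()`, the sketch below runs the CSS selector `div.col-sm-4` against a made-up HTML fragment. The fragment only imitates the structure described above; it is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout described in the text.
sample_html = """
<div class="col-sm-4">
  <ul>
    <li><a href="/Africa/">Africa</a></li>
    <li><a href="/Asia/">Asia</a></li>
  </ul>
</div>
<div class="col-sm-8"><p>Other content</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "div.col-sm-4" means: every <div> whose class list contains col-sm-4.
matches = soup.select("div.col-sm-4")

print(len(matches))      # number of matching <div> elements
print(matches[0].name)   # tag name of the first match
```

`.select()` always returns a list, even when only one element matches, so the result is indexed or looped over.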
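Putting the filters together, the continent dictionary could be built roughly like this. The HTML fragment, the base URL `https://www.4icu.org/`, and the helper name `extract_links` are all assumptions for illustration. The selector `div.col-sm-4 li a` picks every `<a>` inside an `<li>` inside the target `<div>`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.4icu.org/"  # assumed uniRank home page address

# Hypothetical fragment imitating the layout described in the text.
sample_html = """
<div class="col-sm-4">
  <ul>
    <li><a href="/Africa/">Africa</a></li>
    <li><a href="/Asia/">Asia</a></li>
  </ul>
</div>
<div class="footer"><a href="/about/">About</a></div>
"""


def extract_links(html: str, base_url: str, selector: str = "div.col-sm-4 li a") -> dict:
    """Return {link text: absolute URL} for every anchor matching `selector`."""
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    for anchor in soup.select(selector):
        name = anchor.get_text(strip=True)
        href = anchor.get("href")
        if name and href:
            # urljoin turns a relative href such as "/Asia/" into a full URL.
            links[name] = urljoin(base_url, href)
    return links


continents = extract_links(sample_html, BASE_URL)
print(continents)
```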

2. Extract country names and their URLs

In this step, we will also create a dictionary as we did in the previous step but this time keys will be country names and values will be their URLs.

After filtering the desired items from the HTML source, we can wrap each item in its own BeautifulSoup object before calling bs4 methods and functions on it. (The elements returned by .select() are already Tag objects, so they also support methods such as find_all() directly.)

As we are only interested in private universities, we should append to each country's URL the path segment that leads to that country's private universities page.
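The per-item step can be sketched as follows. The table layout and the `td` selector are hypothetical, since the real structure of the continent page is not shown here; the point is only the pattern of re-parsing each filtered item and reading its link.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a continent page listing countries.
sample_html = """
<table>
  <tr><td><a href="/al/">Albania</a></td></tr>
  <tr><td><a href="/at/">Austria</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
items = soup.select("td")

countries = {}
for item in items:
    # Re-parse the item's HTML into its own BeautifulSoup object.
    # Calling item.find("a") directly would work as well.
    item_soup = BeautifulSoup(str(item), "html.parser")
    link = item_soup.find("a")
    if link is not None:
        countries[link.get_text(strip=True)] = link["href"]

print(countries)
```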
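A sketch of adding that path segment. Both the country URLs and the `private/` suffix are assumptions made for this example, so check the live site for the real values before relying on them. `private_universities_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Hypothetical country dictionary, shaped like the output of this step.
countries = {
    "Albania": "https://www.4icu.org/al/",
    "Austria": "https://www.4icu.org/at",
}

# Assumed path segment for the private universities listing.
PRIVATE_SUFFIX = "private/"


def private_universities_url(country_url: str, suffix: str = PRIVATE_SUFFIX) -> str:
    """Append `suffix` to a country URL."""
    # Without a trailing slash, urljoin would replace the last path segment
    # instead of appending to it.
    if not country_url.endswith("/"):
        country_url += "/"
    return urljoin(country_url, suffix)


private_urls = {name: private_universities_url(url) for name, url in countries.items()}
print(private_urls)
```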

3. Extract university names and their URLs

Let us find all private universities in Albania. As in the previous steps, we first get the page's HTML source with a GET request and then create a BeautifulSoup object. University names and URLs are stored in a table, so we will use the find_all() method to find all tbody tags (the body content of a table).

Then, we create a list of the universities' uniRank URLs. We iterate over this list, visit each university's uniRank page, and extract its name and website to add to a new dictionary.
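As a rough illustration of the `find_all()` step, the sketch below parses a made-up table. The university names and the `/reviews/...` links are placeholders, not data from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table imitating a country's private universities page.
sample_html = """
<table>
  <thead><tr><th>Rank</th><th>University</th></tr></thead>
  <tbody>
    <tr><td>1</td><td><a href="/reviews/1.htm">Example University A</a></td></tr>
    <tr><td>2</td><td><a href="/reviews/2.htm">Example University B</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list with every <tbody> element in the document.
table_bodies = soup.find_all("tbody")

# Collect the href of every link found inside those table bodies.
university_links = [
    anchor["href"]
    for body in table_bodies
    for anchor in body.find_all("a", href=True)
]

print(university_links)
```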
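The loop over university pages might look like the sketch below. The selectors (`h1` for the name, `a.website` for the home page link) are hypothetical, because the layout of a university page is not described here. The `fetch` argument is any function that takes a URL and returns HTML, which lets the example run against fake pages instead of the network.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

from bs4 import BeautifulSoup


def parse_university(html: str) -> Optional[Tuple[str, str]]:
    """Return (name, website) from one university page, or None if missing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")            # hypothetical: name is in <h1>
    link = soup.select_one("a.website")  # hypothetical: link has class "website"
    if heading is None or link is None or not link.get("href"):
        return None
    return heading.get_text(strip=True), link["href"]


def collect_universities(urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Visit each URL with `fetch` and build a {name: website} dictionary."""
    universities = {}
    for url in urls:
        parsed = parse_university(fetch(url))
        if parsed is not None:
            name, website = parsed
            universities[name] = website
    return universities


# Fake pages keyed by URL, standing in for real downloads.
fake_pages = {
    "/reviews/1.htm": '<h1>Example University A</h1><a class="website" href="https://a.example.edu">Site</a>',
    "/reviews/2.htm": "<h1>Example University B</h1>",  # no website link
}

result = collect_universities(fake_pages, fake_pages.get)
print(result)
```

For a live run, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with Requests imported.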

The complete code of the bot is provided below. I did not explain some parts of the code, such as pandas and the progress bar, because they are outside this article's scope.

Please contact me on LinkedIn if you have any questions.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

