Data Scraping Tutorial: An Easy Project for Beginners
In this tutorial, I will walk you through the fundamentals of web scraping with BeautifulSoup in Python as you write the code from scratch.
Whether you are a data scientist, engineer, analyst, or simply someone who collects data as a hobby, you will often need to build your own dataset, despite the huge number of datasets already available online, by harvesting the messy, sprawling, wild web. To do so, you need to get familiar with what we call web scraping, crawling, or harvesting.
Objective: Using the BeautifulSoup library in Python, create a bot that crawls the names of private universities in a user-specified country, along with the URLs of their home pages, and downloads them as an xlsx file.
We will be using the following libraries:
# Required libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup
from progressbar import ProgressBar
How does web scraping work?
When you open your browser and click on a page’s link, the browser sends a request to the web server that holds the web page files. We call this a GET request, since we are getting the page files from the server. The server processes the incoming request over HTTP (and several other protocols) and sends back the files required to display the page. The browser then renders the page’s HTML source in an elegant, readable form.
In web scraping, we create a GET request mimicking the one sent by the browser so we can get the raw HTML source of the page, and then we start wrangling it, filtering HTML tags to extract the desired data.
Now that we have a general idea of how we will extract data from the web, the first step is to download the page from the server. We can achieve this with the Requests library in Python.
Let’s download the main page, which we will use to navigate to the other pages we extract data from. In this tutorial, we will use the uniRank directory to search for universities in a given country.
This creates a response object whose status code indicates whether our request to download the page was executed successfully: a code starting with 2 usually indicates a successful request, while a code starting with 4 or 5 usually means an error. We can view the HTML source of the page through the .text property.
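As a minimal sketch of this step (assuming the uniRank directory at www.4icu.org and a standard Requests call; some servers reject requests that lack a browser-like User-Agent header):

```python
import requests

# Download the uniRank home page
response = requests.get(
    "https://www.4icu.org/",
    headers={"User-Agent": "Mozilla/5.0"},  # mimic a browser request
)

# A status code starting with 2 means success; 4 or 5 signals an error
print(response.status_code)

# The raw HTML source of the page
html_source = response.text
```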
Now that we have successfully downloaded the page’s HTML source, we can start parsing it with the BeautifulSoup library, which makes pulling data out of HTML quite easy. Here are the steps we will follow to find private universities, along with their URLs, in a given country:
1. Extract continent names and URLs
The first step is to create a dictionary with continent names as keys and their uniRank URLs as values, built from the uniRank home page. To do so, we need to create a BeautifulSoup object so we can parse the content and extract our desired data.
As you can see from the code above, we used the BeautifulSoup() constructor to create a bs4 object from the text of the page we downloaded, and we specified that the page should be parsed as HTML. You can now view the page’s HTML in a more structured way by calling the prettify() method on the soup object. Using Chrome DevTools, we can easily navigate to the HTML tags we need to extract. Here we can see that continent names and URLs are stored as a list, so we need to navigate to the div tag with class="col-sm-4".
We can use the .select() method, which uses the SoupSieve package to run a CSS selector against the parsed document and return all matching elements.
We can further filter the results by selecting items inside <li> tags, which define list items in HTML, and then selecting <a> tags, which define hyperlinks; you can check the list of HTML tags here.
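To make the selector concrete, here is a sketch run against a small HTML fragment shaped like the continent list described above. The fragment is an illustrative stand-in, not the live uniRank page:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the continent list on the home page
html = """
<div class="col-sm-4">
  <ul>
    <li><a href="/Europe/">Europe</a></li>
    <li><a href="/Africa/">Africa</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <a> inside an <li> under div.col-sm-4
continents = {
    a.get_text(strip=True): a["href"]
    for a in soup.select("div.col-sm-4 li a")
}
print(continents)  # {'Europe': '/Europe/', 'Africa': '/Africa/'}
```

The same selector, applied to the real home page's soup object, yields the continent-name-to-URL dictionary used in the rest of the tutorial.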
2. Extract country names and their URLs
In this step, we will again create a dictionary as we did in the previous step, but this time the keys will be country names and the values will be their URLs.
Note: For the sake of keeping this tutorial simple, we will find universities for only one country. The complete code of the bot is at the end of this article.
After filtering our desired items from the HTML source, we need to create a BeautifulSoup object for each item in order to use bs4 methods and functions.
As we are only interested in private universities, we should append the URL extension that redirects to each country’s private-universities page.
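A sketch of this step, again against an illustrative fragment rather than the live page. Note that the "private/" suffix is an assumption about uniRank's URL scheme; check the actual private-universities URL in your browser and adjust:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a continent page's country list
html = """
<div class="col-sm-4">
  <ul>
    <li><a href="/al/">Albania</a></li>
    <li><a href="/fr/">France</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

base = "https://www.4icu.org"
countries = {}
for a in soup.select("div.col-sm-4 li a"):
    # Append the extension leading to the private-universities page
    # ("private/" is an assumed suffix, not confirmed from the source)
    countries[a.get_text(strip=True)] = base + a["href"] + "private/"

print(countries["Albania"])  # https://www.4icu.org/al/private/
```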
3. Extract university names and their URLs
Let us find all private universities in Albania. As in the previous steps, we first get the page’s HTML source with a GET request, then create a BeautifulSoup object. University names and URLs are stored in a table, so we will use the find_all() method to find all tbody tags (the body content of a table).
Then, we create a list of the universities’ uniRank URLs. We iterate over this list, visiting each university’s page to extract its name and website, which we add to a new dictionary.
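The table-extraction step can be sketched like this, using an illustrative stand-in for a uniRank results table (the real table's markup may differ in details):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a uniRank results table
html = """
<table>
  <tbody>
    <tr><td><a href="/reviews/1.htm">University A</a></td></tr>
    <tr><td><a href="/reviews/2.htm">University B</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

universities = {}
for tbody in soup.find_all("tbody"):  # body content of each table
    for a in tbody.find_all("a"):     # one link per university
        universities[a.get_text(strip=True)] = a["href"]

print(universities)
```

In the full bot, each collected uniRank URL is then fetched in turn to pull the university's official website from its detail page.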
The complete code of the bot is provided down below. Some parts of the code, such as the pandas and progress-bar usage, are not explained because they are outside this article’s scope.
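For the final export mentioned in the objective, a minimal sketch of turning the collected dictionary into an xlsx file (the university names and websites below are placeholders; writing xlsx files requires an Excel engine such as openpyxl to be installed):

```python
import pandas as pd

# Placeholder results standing in for what the bot collects
universities = {
    "University A": "http://www.university-a.example",
    "University B": "http://www.university-b.example",
}

# Build a two-column table from the name -> website dictionary
df = pd.DataFrame(list(universities.items()), columns=["University", "Website"])

# Export to Excel; requires openpyxl (or another Excel engine)
df.to_excel("private_universities.xlsx", index=False)
```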
Please contact me on LinkedIn if you have any questions.