Hands-On Web Scraping With Python
Web scraping is useful for many people, in particular for those who do not know how to do it themselves. I have written many web scraping scripts for friends, none of whom had any programming or computer science background. This tutorial is for all the sociologists, business analysts, literature researchers and other people who sometimes need to automatically collect data from the web.
At the end of this tutorial we will have a little script which, when you run it, automatically collects an article from medium.com. You could, for instance, store the result in a .csv file and then process it with other software of your choice. The goal is also to teach the fundamentals you need in order to do simple website scraping. This guide will not make you a web developer or anything close to it, but I provide some references in the last section if you are interested in deeper knowledge. To make this guide understandable for readers without a technical background, I decided to oversimplify in some parts and to not always use exact terminology.
The Things That I Will Not Explain
Listed here are the things you need to know in order to follow this guide. Within the scope of this tutorial there is no space to explain them, but I will add some resources so you can learn them beforehand.
- What is Python and how do I use it?
- How do I install Python packages?
Basics 1: See Websites From A Different Perspective
The first thing we have to understand in order to perform web scraping is what the browser actually does. To see this, press F12 on the keyboard. If you are using Firefox or Chrome, you will see the developer tools pop up. One of the things you can see in the developer tools is the plain HTML text which your browser interprets into a website (see Fig. 1). We will need the developer tools later in order to locate the data we want to extract in the HTML text of the website. For now, close the developer tools again and let's discuss where the HTML text comes from in the first place.
The browser requests the data it needs to display a website from a web server via an HTTP request. Such a request is then answered by the web server via an HTTP response (see Fig. 2). All this happens in the background while you are clicking through the web. For this guide, we are mostly interested in the body of the response, which is where the HTML text is located.
Basics 2: HTML
HTML is a language which allows you to structure the content of a website. It is part of what we get as a response to the HTTP request. The HTML code consists of tags and content in between the tags. The tags are like markers for content on the website. The browser then knows which style sheet (CSS) rules it has to look up in order to display the content correctly.
A tag such as <h1>Something</h1>, for instance, is interpreted by the browser as a heading of type 1. In the corresponding style sheet it could, for instance, be noted that h1 headings have to be displayed in blue or in a certain font.
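Such a rule in a style sheet might look like this (the blue color and the font are made-up examples, not what medium.com actually uses):

```css
/* display every h1 heading in blue, in a specific font */
h1 {
  color: blue;
  font-family: Georgia, serif;
}
```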
We will use such tags later in order to find the information we want to extract. It is important to note that the <h1> tag, for instance, sits in between the <body> tags. The tags have a hierarchical order: the inner tags are called children and the outer ones are called parents. In our case the <body> tag is the parent of the <h1> and the <p> tags. Later, when extracting a link from a heading, we will come back to this concept.
We learned that the interpretation of the HTML into a beautiful website and the request of the HTML text from the web server are two different things the browser does for us. Therefore, we can conclude that we can write our own file containing only HTML text and have the browser interpret it. And this is what we are going to do now.
Create an empty text file on your desktop. Open the file, copy the text from Code 1 into it and save it. Right-click on the file and open it with any browser of your choice. If everything worked out, you should now see only the text in between the tags, formatted in two different ways.
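A minimal version of such a file (Code 1) could look like this; any heading and paragraph text of your own will do:

```html
<html>
  <body>
    <h1>My first own heading</h1>
    <p>And my first own paragraph.</p>
  </body>
</html>
```

The browser renders the content of the <h1> tag as a large heading and the content of the <p> tag as normal paragraph text.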
Now that we have covered the basics, we can start to use Python to replace the browser. First we are going to make the HTTP request with the help of Python, and then we are also going to use Python to actually extract the information from the web server's response.
Get the request
For this tutorial, our job will be to extract the headings from medium.com and put them into a .csv file. Then we will select one article and also save its text.
We start by using Python to make an HTTP request without the browser. For this, we will use the Python package called requests. Requests allows us to formulate an HTTP request in a very simple manner and to store the response so we can use it later for the extraction of our data.
In Code 2 we first make a request with the get() function and store the response in the variable r. Then we print the HTML text of the response, which is located in the text attribute of the response object.
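A minimal sketch of Code 2, assuming we request the medium.com start page used throughout this tutorial:

```python
import requests

# make the HTTP GET request and store the response in r
r = requests.get("https://medium.com")

# the HTML text is located in the text attribute of the response object
print(r.text)
```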
Find the data on the website
We will work with the non-personalized version of the medium.com start page. The easiest way to see it is to open a browser window in incognito mode and go to medium.com. Then open the developer tools again with F12 and use the inspection tool to find the HTML part corresponding to the headings you want to extract. In our case, click on the main heading with the inspection tool activated. The developer tools will then jump to the part of the HTML text which is responsible for the main heading.
We can identify which HTML tag medium.com has used for the main heading by looking at the blue highlighted part and checking for the tag written between the angle brackets <>. As we can see in Fig. 3, medium.com has used an <h1> tag for the main heading. In the next step we will find all <h1> tags used on the website and extract the headings themselves.
Extract the data from the website
To extract the heading, we first have to find it in the response we got from the HTTP request, and secondly we have to remove the surrounding HTML structures. The package we are going to use for this job is called Beautiful Soup.
Beautiful Soup makes it possible to search through the different HTML tags, but before we extract certain HTML tags we have to decide which tags we are interested in. The biggest helper here will be the developer tools of your browser, which we have seen before.
In order to extract the first heading we will make a request (1), as we learned before, and store the server's response in a variable called r. The HTML text of the response is then parsed (2) into a Beautiful Soup object, which we call soup here. Parsing basically means that the whole website is stored in a structure which makes it easy to search through. The BeautifulSoup object additionally gives us some new functions which make our life easier, such as removing the tags from the content. The search for all <h1> tags in the response (3) is performed by calling the find_all() function on the BeautifulSoup object soup. You can find out how find_all() works in detail in the Beautiful Soup documentation; for now it is enough to know that it returns a list of all tags which match the search condition. The first element of the list is our heading. For better understanding, we will display the whole list (4) first and then print only the first element without the HTML tags by accessing its .text attribute.
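Put together, these steps look roughly like this (the html.parser argument is our assumption; any parser supported by Beautiful Soup works):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://medium.com")       # (1) make the request
soup = BeautifulSoup(r.text, "html.parser")  # (2) parse the response into a Beautiful Soup object
headings = soup.find_all("h1")               # (3) search for all <h1> tags
print(headings)                              # (4) the whole list, HTML tags included
print(headings[0].text)                      # the first heading without the tags
```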
The next step is to open the link behind the text and to extract the content. In Code 4 (a sketch of which follows below) we can see how this is done. The new line here is line 8: we formulate a new HTTP GET request as we have learned before, but this time we use the link behind the heading. In Fig. 3 we can see that the link is the parent of the heading node, which means we can access the link by accessing the parent of the heading. In line 8 we access the first element of the list containing all headings by using headings[0], then we access its parent tag with headings[0].parent, and lastly we read its href attribute. href stands for hypertext reference and is basically a link as you know it.
As before, we make a Beautiful Soup object from the response (line 9), and then we are going to find the text of the article, which is placed in a couple of <p> tags. But how do we know that the article's main text is located in <p> tags? We again checked this with the help of the developer tools of the browser: we first click on the link and then mark the text body with the inspection tool.
Since the text is written in different paragraphs, .find_all() returns a list of all <p> tags. In line 12 we remove the tags from the elements so that only the plain text remains, which we put back into a list. In line 13 we use the reduce function, which makes it easy for us to combine the different paragraphs into one single string.
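A sketch of Code 4, arranged so that the line numbers referenced above match up. The html.parser argument and the assumption that the href attribute contains a full URL are ours:

```python
from functools import reduce
import requests
from bs4 import BeautifulSoup

r = requests.get("https://medium.com")
soup = BeautifulSoup(r.text, "html.parser")
headings = soup.find_all("h1")
article = requests.get(headings[0].parent["href"])         # line 8: request the link behind the heading
article_soup = BeautifulSoup(article.text, "html.parser")  # line 9: parse the article page

paragraphs = article_soup.find_all("p")                    # all <p> tags of the article
texts = [p.text for p in paragraphs]                       # line 12: plain text only, back into a list
article_text = reduce(lambda a, b: a + "\n" + b, texts)    # line 13: combine paragraphs into one string
print(article_text)
```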
And what can we do now?
Probably you already have some application for the scraped data in mind. Either you just want to save it, because the website you rely on for your work tends to go offline from time to time, or you want to do analytics.
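If saving is all you want, the headings can go straight into a .csv file with Python's built-in csv module. A rough sketch, where headings is the list we built earlier and the file name is arbitrary:

```python
import csv

# `headings` is the list of <h1> tags found earlier with soup.find_all("h1")
with open("headings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading"])          # header row
    for heading in headings:
        writer.writerow([heading.text])   # one heading per row, tags removed
```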
One great use case is to observe changes in the content of a newspaper over time (David Kriesel, CCC, 2017). If we have scraped a couple of articles from, e.g., a news site, we could search for certain keywords or count their appearances. We could also use different machine learning tools to classify text, which I will cover in a different article.
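Counting keyword appearances, for example, needs nothing beyond the standard library. A rough sketch, where article_text is the string we assembled in Code 4:

```python
from collections import Counter

# split the article into lowercase words and count them
words = article_text.lower().split()
counts = Counter(words)

print(counts["python"])        # how often the word "python" appears
print(counts.most_common(10))  # the ten most frequent words
```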