Web Scraping For Beginners with Python

Python is one of the most widely used programming languages, and web scraping is a very common reason to use it. Python has a rich ecosystem of easy-to-use libraries for making HTTP requests, fetching web pages, and parsing them, which makes it the de facto choice for web scraping.

In this article, I will show how you can scrape a website using Python with the help of BeautifulSoup and urllib.

This article assumes a basic understanding of Python and its syntax. You will also need some understanding of HTML and CSS. If you are not familiar with Python or HTML and CSS, you should try to get some experience with them before you start with this article.

Let’s talk Dependencies

First things first. Python!
You can’t run a Python program without Python, so I hope you already have it installed. I’m using 3.6, which is the latest at the time of writing this article. Other Python 3.x versions should be able to run this code fine. If you’re using 2.x, some changes will be needed, but I am assuming that you can make those modifications on your own.

After that, we need to make sure we have the required libraries installed on our system. We are using BeautifulSoup and urllib for this article. urllib is part of the Python standard library, so if you have Python installed, you already have it. That leaves us with only BeautifulSoup.

You can check if you already have it like this:

$ pip list | grep beautifulsoup
beautifulsoup4 (4.6.0)

If you see output like the above, you already have it installed. Make sure it is a 4.x version. If it is not installed, you can install it with:

pip install beautifulsoup4

Once that is done, BeautifulSoup is ready to be used.
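If you prefer to check from Python itself rather than from pip, a minimal sketch like the following should work. The variable name major_version is only an illustration.

```python
import bs4  # the package installed by `pip install beautifulsoup4`

# bs4.__version__ is a string such as "4.6.0"; keep only the major part.
major_version = int(bs4.__version__.split(".")[0])

print("BeautifulSoup version:", bs4.__version__)
if major_version != 4:
    print("This article assumes a 4.x release.")
```

If the import fails with ModuleNotFoundError, the library is not installed for the Python interpreter you are running.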

What exactly is BeautifulSoup?

You have been seeing it all over this article so far, so it’s a good time to explain why we need it.

BeautifulSoup is a parsing library that makes working with HTML very easy. In an ideal world, you could fetch any webpage with just urllib and parse the HTML yourself. Sadly, in the real world, parsing HTML is a daunting task. There are hundreds, even thousands, of corner cases to handle, and doing that for every site is a nightmare. Since BeautifulSoup does this for us, we don’t have to. We will just use it directly.

With that brief introduction, we are set to scrape a website and extract data.

Let’s get started

Disclaimer: Scraping websites that you’re not supposed to can be illegal. This article is purely intended to be educational and nothing else. Check the scraping policies of individual websites before you scrape them. Scrape at your own volition!

As is probably evident from the above disclaimer, scraping can have some serious consequences if you’re not careful. So, to keep things simple, I am scraping my own website, Freblogg. You should try this on your own websites.
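One place where many sites state their crawling rules is a robots.txt file. The sketch below uses urllib.robotparser from the standard library to test a URL against such rules. The helper name is_allowed and the sample rules are made up for illustration, and robots.txt is only one signal: a site's terms of service may say more.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for a hypothetical site, not taken from any real website.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "http://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS, "http://example.com/private/page.html"))
```

Against a live site you would instead call set_url() with the address of its robots.txt and then read() on the parser, which downloads the file before you call can_fetch().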

Now, open a new Python file and add the lines:

from bs4 import BeautifulSoup as bs
import urllib.request as ureq

With this, we can access BeautifulSoup as bs and urllib’s request module as ureq.

The URL for my website is http://freblogg.com, and since I’m scraping this URL, I add a variable for it:

freblogg_url = 'http://freblogg.com'

To get this website, we make use of our ureq as follows:

website = ureq.urlopen(freblogg_url).read()

Here we are calling the urlopen() function and passing our website URL to it. The return value is an object of class http.client.HTTPResponse. Calling read() on that object gives us all of the HTML for that page as bytes.

You can test this out by printing out website with

print(website)

This prints the raw HTML of the homepage of the site you’re scraping.
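The one-line fetch assumes the request always succeeds. A slightly more defensive sketch is shown below. The helper name fetch_html, the 10 second timeout, and the fallback to UTF-8 are my own assumptions, not part of the original code.

```python
import urllib.request as ureq
from urllib.error import URLError


def fetch_html(url, timeout=10):
    """Return the page at `url` as text, or None if the request fails."""
    try:
        # The with-block closes the connection once the body has been read.
        with ureq.urlopen(url, timeout=timeout) as response:
            raw = response.read()  # bytes
            # Use the charset from the Content-Type header when present.
            charset = response.headers.get_content_charset() or "utf-8"
    except URLError as error:
        # HTTPError (404, 500, ...) is a subclass of URLError.
        print("Could not fetch", url, "->", error.reason)
        return None
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# html = fetch_html("http://freblogg.com")
```

BeautifulSoup accepts either the raw bytes or the decoded text, so decoding here is optional. It mainly makes the printed output easier to read.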

Technically, at this point you are done as you have the html of the website. But until you use this html to get some information, it is useless. So, let’s do that.

Say, I want to extract titles of all the posts on my website’s homepage. At the time of writing this post, the first three articles are:

  • How to recover from 'git reset --hard" | Git
  • Functions in C Programming | Part 1
  • Matrix Multiplication | C Programming

You should be able to see these as well if you visit Freblogg now, below this article. So, my goal is to extract this information from the html we just got. We will use Beautifulsoup for this very task.

Using Beautifulsoup

To tell BeautifulSoup to read our HTML, we do:

soup = bs(website, "html.parser")

website is the HTML we fetched from the site. We pass it to BeautifulSoup along with the argument html.parser, which selects the HTML parser built into Python’s standard library. If you have another parser installed, such as lxml or html5lib, you can name it here instead. Leaving the argument out, i.e., soup = bs(website), also works, but BeautifulSoup then picks the best parser it finds installed and prints a warning, so the same code can behave differently on another machine. It never hurts to be specific.
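You do not need a live website to experiment with the parser. As a small sketch, the snippet below builds a soup object from a made-up HTML snippet (sample_html is invented for illustration) and reads two values from it.

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML, passed as bytes just like the result of read().
sample_html = b"<html><head><title>Demo</title></head><body><h2>First</h2></body></html>"

soup = bs(sample_html, "html.parser")

page_title = soup.title.text  # text inside the <title> tag
first_header = soup.h2.text   # text inside the first <h2> tag

print(page_title, first_header)
```

Attribute access such as soup.title or soup.h2 returns the first tag with that name, which is handy for quick checks in the REPL.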

Now that we have our soup object, we can use that to get what we want. In my case, I need to get the titles of all the articles on my homepage.

To do this, we have to take a look at the website we are scraping and see what identifies the things we want to extract.

In my case, the titles of the articles are all h2 headers, as is evident from the following HTML source:

<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html">How to recover from 'git reset --hard" | Git</a>
</h2>

I can use this information to extract all the h2 headers from the soup object. For that we do this:

headers = soup.find_all('h2')

This will give us a list of all the <h2> tags in the page.

Here we’re searching for all of the <h2> tags. Similarly if we want to get all <div> tags, we can do

divs = soup.find_all('div')

Similarly for any other tag you want to get.
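As a small aside, here is a sketch of find_all on a made-up snippet (sample_html and the variable names are invented for illustration). It also shows the limit argument, which stops the search after a given number of matches.

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML with three <h2> tags, one of them nested inside a <div>.
sample_html = """
<h2 class="date-header"><span>September 11, 2017</span></h2>
<h2 class="post-title entry-title"><a href="/first.html">First post</a></h2>
<div><h2>Labels</h2></div>
"""
sample_soup = bs(sample_html, "html.parser")

h2_tags = sample_soup.find_all("h2")            # every <h2>, at any depth
first_two = sample_soup.find_all("h2", limit=2)  # stop after two matches

for tag in h2_tags:
    # tag.get("class") is a list of class names, or None if there is no class.
    print(tag.name, tag.get("class"))
```

The result behaves like a regular Python list, so len(), indexing, and for loops all work on it.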

headers has the list of all the <h2> headers in the page. When I print it, the output is:

headers = [
<h2 class="descriptionheader"> </h2>,
<h2 class="date-header"><span>September 11, 2017</span></h2>,
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html">How to recover from 'git reset --hard" | Git</a>
</h2>,
<h2 class="date-header"> <span>August 27, 2017</span></h2>,
<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/08/functions-in-c-programming-part-1.html">Functions in C Programming | Part 1</a>
</h2>,
.
.
.
<h2 class="date-header"> <span>December 21, 2016</span></h2>,
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2016/12/git-octopus-merge.html">Understanding Git Octopus Merge</a>
</h2>,
<h2 class="title">✉ Subscribe</h2>,
<h2>★ Labels</h2>,
<h2>★ Trending</h2>
]

I have taken out some of the output to keep it short, but as you can see, headers contains more than just the article titles. The ones I actually need are the <h2> tags with class="post-title entry-title". The other tags in the output, like <h2>★ Labels</h2> or <h2>★ Trending</h2>, are things that we don’t want.

From this output we can see that it is class="post-title entry-title", together with the <h2> tag, that identifies an article title. So, we will use that by adding an attribute dictionary as follows:

headers = soup.find_all('h2', {'class':'post-title entry-title'})

Now, if we print headers, we have this:

headers = [
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html">How to recover from 'git reset --hard" | Git</a>
</h2>,
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/08/functions-in-c-programming-part-1.html">Functions in C Programming | Part 1</a>
</h2>,
.
.
.
<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2016/12/git-octopus-merge.html">Understanding Git Octopus Merge</a>
</h2>
]

Now, it is just the article headers. Better than what we had before. From here we just have one more thing to do before we actually get the title.
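One detail worth knowing: passing the full string "post-title entry-title" matches the class attribute exactly as written, so a tag whose classes appear in a different order would be missed. The sketch below, on made-up HTML, compares that with two alternatives BeautifulSoup offers, the class_ keyword and CSS selectors via select(). The variable names are only illustrative.

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML: the second post lists the same classes in reverse order.
sample_html = """
<h2 class="date-header"><span>September 11, 2017</span></h2>
<h2 class="post-title entry-title"><a href="/first.html">First post</a></h2>
<h2 class="entry-title post-title"><a href="/second.html">Second post</a></h2>
<h2>Labels</h2>
"""
sample_soup = bs(sample_html, "html.parser")

# Exact string match on the whole class attribute.
exact = sample_soup.find_all("h2", {"class": "post-title entry-title"})
# Match any <h2> that has "post-title" among its classes.
by_class = sample_soup.find_all("h2", class_="post-title")
# CSS selector: <h2> tags carrying both classes, in any order.
by_selector = sample_soup.select("h2.post-title.entry-title")

print(len(exact), len(by_class), len(by_selector))
```

On my site the classes always appear in the same order, so the attribute dictionary is enough. On other sites the selector form may be the safer choice.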

Before we get the article titles, let me show you a couple of cases of parsing with BeautifulSoup. So far we have used find_all to get all the tags we want. Let’s instead use find, which returns just the first matching element rather than a list of all the headers.

Using the python REPL, we get

>>> soup.find('h2', {'class':'post-title entry-title'})
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html">How to recover from 'git reset --hard" | Git</a>
</h2>

Just one header object, as expected. Under this <h2>, we have a <span> tag and an <a> tag as well. To get the <span> tag, we can call find() again on the result:

>>> soup.find('h2', {'class':'post-title entry-title'}).find('span')
<span class="post_title_icon"></span>

Similarly the <a> tag,

>>> soup.find('h2', {'class':'post-title entry-title'}).find('a')
<a href="http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html">How to recover from 'git reset --hard" | Git</a>

So, by chaining find() calls like this, we can drill down into nested tags and fetch the innermost values as needed.

To get the link of an <a> tag:

>>> anchor = soup.find('h2', {'class':'post-title entry-title'}).find('a')
>>> anchor['href']
'http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html'

And for the text inside the <a> tag,

>>> soup.find('h2', {'class':'post-title entry-title'}).find('a').text.strip()
'How to recover from \'git reset --hard" | Git'

And we have the text inside the header tag. The strip() is to make sure that we trim any unnecessary whitespace characters at the ends.
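A caveat when chaining: find() returns None when nothing matches, so calling .find() or .text on that result raises an AttributeError. A minimal sketch of a guarded version follows. The function name first_post_title is hypothetical.

```python
from bs4 import BeautifulSoup as bs


def first_post_title(html):
    """Return the text of the first post title link, or None if absent."""
    soup = bs(html, "html.parser")
    header = soup.find("h2", {"class": "post-title entry-title"})
    if header is None:  # no matching <h2> on the page
        return None
    anchor = header.find("a")
    if anchor is None:  # the <h2> exists but contains no link
        return None
    return anchor.text.strip()


# Made-up HTML snippets for a quick check.
print(first_post_title('<h2 class="post-title entry-title"><a href="/x"> First post </a></h2>'))
print(first_post_title("<p>No headers here</p>"))
```

The first call prints the stripped title and the second prints None instead of raising an error.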

Now, let’s go back to where we were before.

We had a list of <h2> header tags:

headers = [
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html">How to recover from 'git reset --hard" | Git</a>
</h2>,
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2017/08/functions-in-c-programming-part-1.html">Functions in C Programming | Part 1</a>
</h2>,
.
.
.
<h2 class="post-title entry-title"><span class="post_title_icon"></span>
<a href="http://www.freblogg.com/2016/12/git-octopus-merge.html">Understanding Git Octopus Merge</a>
</h2>
]

And we want to get all the titles of the articles. Each element in the headers list is a tag that, in simplified form, looks like this:

<h2><a>How to recover from 'git reset --hard" | Git</a></h2>

The title, even though it is under <a>, is also under <h2>, because .text returns the text of a tag and of everything nested inside it. This means we can just use .text on the <h2> and get the title.

Finally we add this

titles = list(map(lambda h: h.text.strip(), headers))

which returns the output

[
'How to recover from \'git reset --hard" | Git',
'Functions in C Programming | Part 1',
'Matrix Multiplication | C Programming',
"Today I Learned | Petrov's Defense | Russian Opening",
'Git Cherrypick',
'Git Merge Vs. Git Rebase',
'Understanding Git Octopus Merge'
]

These are the titles of the articles, which is what we wanted to extract.
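If map with a lambda feels heavy, a list comprehension does the same job. The sketch below runs both on made-up HTML shaped like my site's markup (sample_html and the variable names are invented) to show they give the same result.

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML shaped like the headers above.
sample_html = """
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="/first.html">First post</a>
</h2>
<h2 class="post-title entry-title">
<span class="post_title_icon"></span>
<a href="/second.html"> Second post </a>
</h2>
"""
sample_headers = bs(sample_html, "html.parser").find_all(
    "h2", {"class": "post-title entry-title"}
)

# The version used in this article.
titles_with_map = list(map(lambda h: h.text.strip(), sample_headers))
# The same thing written as a list comprehension.
titles_with_comprehension = [h.text.strip() for h in sample_headers]

print(titles_with_comprehension)
```

Which one to use is a matter of taste. Many Python programmers find the comprehension easier to read.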

Before we stop, let’s do one more thing. Let’s get both the URL of the post (which is in the href of the <a> tag) and the title of the article and store them in a dictionary.

titles_and_links = dict(map(lambda h: (h.text.strip(), h.find('a')['href']), headers))

We’ve mapped each header to a tuple of the title and the URL, and dict() turns those pairs into a dictionary:

{
'How to recover from \'git reset --hard" | Git': 'http://www.freblogg.com/2017/09/how-to-recover-from-git-reset-hard-git.html',
'Functions in C Programming | Part 1': 'http://www.freblogg.com/2017/08/functions-in-c-programming-part-1.html',
'Matrix Multiplication | C Programming': 'http://www.freblogg.com/2017/08/matrix-multiplication-in-c.html',
"Today I Learned | Petrov's Defense | Russian Opening": 'http://www.freblogg.com/2017/03/petrovs-defense.html',
'Git Cherrypick': 'http://www.freblogg.com/2017/02/git-cherrypick.html',
'Git Merge Vs. Git Rebase': 'http://www.freblogg.com/2017/01/git-merge-vs-git-rebase.html',
'Understanding Git Octopus Merge': 'http://www.freblogg.com/2016/12/git-octopus-merge.html'
}
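Putting the steps together, the whole extraction could be wrapped in one function. This is a sketch under the same assumptions as above (post titles live in <h2 class="post-title entry-title"> tags that contain a link). The function name extract_titles_and_links is hypothetical.

```python
from bs4 import BeautifulSoup as bs


def extract_titles_and_links(html):
    """Map each post title to its URL for HTML shaped like the page above."""
    soup = bs(html, "html.parser")
    headers = soup.find_all("h2", {"class": "post-title entry-title"})
    result = {}
    for header in headers:
        anchor = header.find("a")
        # Skip headers without a link instead of raising an error.
        if anchor is None or not anchor.has_attr("href"):
            continue
        result[header.text.strip()] = anchor["href"]
    return result


# Made-up HTML for a quick check.
sample_html = """
<h2 class="post-title entry-title"><a href="/first.html">First post</a></h2>
<h2 class="post-title entry-title">No link here</h2>
<h2>Labels</h2>
"""
print(extract_titles_and_links(sample_html))
```

Because titles are used as dictionary keys, two posts with the same title would collapse into one entry. If that can happen on your site, a list of (title, url) tuples is the safer structure.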

Full source code for this is available at:

python code for webscraping

And that is how you can scrape a website and get useful information. This article only scrapes the surface (pun intended), and there is still much to learn to become a master scraper!

I hope that you have found this article useful. Stay tuned for more articles.


You can reach me on twitter @durgaswaroop
My website: Freblogg