Web Scraping With Python for Beginners

A guide to using Requests and BeautifulSoup on HTML pages

Jonathan Joyner
The Dev Project
5 min read · Mar 29, 2022

Photo by Nicolas Picard on Unsplash

Scraping the internet for data can be one of the most challenging steps in a development project. It doesn’t have to be.

Here we will focus on scraping basic HTML pages using two popular Python libraries:

  • Requests
  • BeautifulSoup

Both of these libraries have ample documentation and plenty of resolved questions on Stack Overflow, which makes them easy to use and to troubleshoot if problems arise.

In this example, we’ll be using IMDb’s Top 250 Movies. You can follow along in this Kaggle notebook!

Installing the Libraries

We’ll install both libraries with pip, the package installer for Python. BeautifulSoup is published on PyPI as beautifulsoup4 and imported as bs4. Open your terminal or command prompt and run these two commands:

pip install requests

pip install beautifulsoup4

Once you’ve run both of those, you should have everything you need to begin web scraping.
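To confirm that both installs worked, a quick check along these lines can help. It is only a sketch: it imports each package and prints the version string it reports, so the exact values will differ from machine to machine.

```python
# Sanity check: both imports should succeed once the pip commands have run.
import requests
import bs4

# Each package exposes its version as a string; the values depend on your install.
print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
```

If either import raises ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.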

Importing the Libraries

Once you have the libraries installed, open a Python file in whichever editor you like (or open the Kaggle notebook for this article) and run the following imports:

import requests
from bs4 import BeautifulSoup

Running Requests and Making a Soup

Let’s send a request to get the page’s HTML:

html = requests.get('https://www.imdb.com/chart/top/?ref_=nv_mv_250').content

If we print html, we see the raw, unformatted HTML as one long bytes object. That is because the “content” attribute returns the response body as bytes rather than as a decoded string.
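The one-line request works for a quick experiment, but it will happily hand back an error page or wait indefinitely on a slow server. Here is a slightly more defensive sketch. The helper name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original walkthrough.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its body as bytes (hypothetical helper)."""
    # timeout keeps the call from waiting forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # .content is the raw body as bytes; .text would be the decoded str.
    return response.content
```

Calling fetch_html('https://www.imdb.com/chart/top/?ref_=nv_mv_250') would then stand in for the one-liner, with failures surfacing as exceptions rather than as confusing parse results later on.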

Unparsed HTML

So we now need to parse the HTML, which is where BeautifulSoup comes in:

soup = BeautifulSoup(html, 'html.parser')

Now if we print out our soup object, we get parsed HTML:

Parsed HTML
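To see the difference between raw bytes and a parsed tree without hitting the network, here is a small sketch that runs the same parsing step on a hand-written snippet. The sample_html value is made up for illustration and is not the real IMDb page.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the bytes returned by requests (not the real IMDb page).
sample_html = b"<html><head><title>Top Movies</title></head><body><p>Hello</p></body></html>"

# 'html.parser' is the parser bundled with Python, so nothing extra is needed.
soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_html).__name__)  # the input is a bytes object
print(soup.title.text)             # text inside the <title> tag
print(soup.prettify())             # the parsed tree, re-indented for reading
```

Once parsed, tags can be reached as attributes (soup.title, soup.p), which is the style used in the rest of this article.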

Getting the Data

Now that we’ve collected the HTML page and parsed it, we can grab whatever data we like by calling methods and reading attributes on the object we created (the one we named “soup”).

The easiest way to find elements in the HTML is with the DevTools in your browser. You can open them by loading the web page and pressing F12.

Open DevTools

You can then hover over elements of the page to quickly find the ones that may be most useful. In this case, we can see the movie list is made up of one large table:

Movie Table

Getting Movie Titles

First, let’s get the title of each movie on the list. We can narrow down the HTML we have to search by looking only at the <tbody></tbody> portion, since we only care about the movies inside the table.

Table Body

Now that we’ve narrowed down what we have to look at, it’s easier to tell where the data we want is. Each movie has its own row in that table:

Table Row

We can simplify what we are looking at further by having our soup return the rows of the table as a list:

rows = soup.tbody.find_all('tr')
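Here is a self-contained sketch of that call on a hypothetical two-row table. The markup is an assumption shaped like the structure described in this article, with placeholder movie names, so it can be run without requesting the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical table, shaped like the one described in this article.
sample_html = """
<table>
  <thead><tr><th>Rank &amp; Title</th></tr></thead>
  <tbody>
    <tr><td class="titleColumn">1. <a href="#">Movie A</a> <span>(1994)</span></td></tr>
    <tr><td class="titleColumn">2. <a href="#">Movie B</a> <span>(1972)</span></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.tbody is the first <tbody> tag; find_all('tr') returns every row inside it.
rows = soup.tbody.find_all("tr")

print(len(rows))  # the header row lives in <thead>, so it is not included
print(rows[0])    # one row, as a Tag we can search further
```

The result behaves like a regular Python list, so it can be indexed, sliced, and looped over.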

Now that we have a list of rows, we can print one row and find the data we want. The title can be found in a column with the class “titleColumn”:

Title Column

So we will need to find a column with that class for each row. Now that we’ve created a list of rows, we’ll have to iterate over that list:

for row in rows:
    # The second argument is shorthand for matching the CSS class
    column = row.find('td', 'titleColumn')

If we print column.text inside the loop, we see extra information besides the movie title:

Title Column Text

This is because BeautifulSoup grabs all of the text within the title column. So we’ll have to narrow it down further.
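A small sketch of this behavior on a single hypothetical cell follows. The markup and movie name are assumptions for illustration. It also shows that passing the class name as the second argument to find is equivalent to the class_ keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical title cell: a rank, a link, and a year, as described in this article.
cell_html = """
<td class="titleColumn">
  1.
  <a href="#">Movie A</a>
  <span>(1994)</span>
</td>
"""

soup = BeautifulSoup(cell_html, "html.parser")
column = soup.find("td", "titleColumn")  # same as soup.find("td", class_="titleColumn")

# .text joins every string inside the tag, including newlines and indentation.
print(repr(column.text))

# get_text with a separator and strip=True gives a tidier one-line version.
print(column.get_text(" ", strip=True))
```

Either way the rank and year come along with the title, which is why the next step targets a specific child tag instead.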

The title is within the only <a> tag in the title column.

Movie Title

So we can easily pull the title by telling BeautifulSoup to get the text from the <a> tag:

for row in rows:
    column = row.find('td', 'titleColumn')
    title = column.a.text
    print(title)

If we print the titles, we get a neat list of all the titles on IMDb’s Top 250:

Movie Titles Printed

Getting Movie Release Years

We can get the release year the same way, by reading the text of the <span> tag.

Movie Year

Since the year is in the same column as the title, the only difference is using column.span instead of column.a:

for row in rows:
    column = row.find('td', 'titleColumn')
    title = column.a.text
    year = column.span.text
    print(title, year)

Now we can print out the title of each movie along with the release year:

Movie Titles and Years
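In practice it is often handier to collect the results than to print them as you go. The sketch below gathers (title, year) pairs into a list and converts the year to an integer. It assumes the span text is wrapped in parentheses, like "(1994)", and uses a hypothetical two-row table in place of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the real page.
sample_html = """
<table><tbody>
  <tr><td class="titleColumn">1. <a href="#">Movie A</a> <span>(1994)</span></td></tr>
  <tr><td class="titleColumn">2. <a href="#">Movie B</a> <span>(1972)</span></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.tbody.find_all("tr")

movies = []
for row in rows:
    column = row.find("td", "titleColumn")
    title = column.a.text
    # Assumption: the span text looks like "(1994)"; drop the parentheses, then convert.
    year = int(column.span.text.strip("()"))
    movies.append((title, year))

for title, year in movies:
    print(f"{title} ({year})")
```

With the years stored as integers, the list can be sorted or filtered directly, for example sorted(movies, key=lambda m: m[1]).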

Of course, we can do a great deal more with these two tools. The user ratings and rank are easy to grab from here as well.
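As a rough sketch of what that could look like, the example below pulls the rank and rating from a hypothetical table. The "ratingColumn imdbRating" class, the <strong> tag holding the rating, and the "1." rank text before the link are all assumptions about the page layout, so check them in DevTools before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical rows; class names and layout are assumptions, not verified markup.
sample_html = """
<table><tbody>
  <tr>
    <td class="titleColumn">1. <a href="#">Movie A</a> <span>(1994)</span></td>
    <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn">2. <a href="#">Movie B</a> <span>(1972)</span></td>
    <td class="ratingColumn imdbRating"><strong>9.1</strong></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

results = []
for row in soup.tbody.find_all("tr"):
    column = row.find("td", "titleColumn")
    # The first non-empty string in the cell is the rank, e.g. "1."
    rank = int(next(column.stripped_strings).rstrip("."))
    # Searching by one class name matches tags that carry several classes.
    rating = float(row.find("td", "imdbRating").strong.text)
    results.append((rank, column.a.text, rating))

for rank, title, rating in results:
    print(rank, title, rating)
```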

Conclusion

Just to reiterate, the steps we took to get this information were:

  • Request a web page (using the requests library)
  • Parse the HTML (using BeautifulSoup)
  • Search through the parsed HTML for the piece of information we want
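Steps two and three can be wrapped into one small function. The sketch below is one possible shape, not a definitive implementation: parse_movies is a hypothetical helper name, and it skips rows that lack the expected tags instead of raising AttributeError, which is a common failure when a page contains rows you did not anticipate.

```python
from bs4 import BeautifulSoup


def parse_movies(html):
    """Return a list of {'title', 'year'} dicts from table HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.tbody is None when the page has no table body at all.
    if soup.tbody is None:
        return []
    movies = []
    for row in soup.tbody.find_all("tr"):
        column = row.find("td", "titleColumn")
        # Skip rows missing the expected cell or child tags.
        if column is None or column.a is None or column.span is None:
            continue
        movies.append({"title": column.a.text, "year": column.span.text})
    return movies


# Hypothetical input: one well-formed row and one row without a title cell.
sample = b"""<table><tbody>
<tr><td class="titleColumn">1. <a href="#">Movie A</a> <span>(1994)</span></td></tr>
<tr><td class="other">not a movie row</td></tr>
</tbody></table>"""

print(parse_movies(sample))
print(parse_movies(b"<p>no table here</p>"))
```

For step one, the bytes returned by requests.get(url).content can be passed straight to parse_movies.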

To dive a little deeper, check out how you can go into each movie page and get the genres for each movie using the same tools we used here!

If this helped you out, consider following me on Twitter for more daily programming tips and articles.
