Web Scraping With Python for Beginners
A guide to using Requests and BeautifulSoup on HTML pages
Scraping the internet for data can be one of the most challenging steps in a development project. It doesn’t have to be.
Here we will focus on scraping basic HTML pages using two popular Python libraries:
- Requests
- BeautifulSoup
Both of these libraries have ample documentation and plenty of resolved questions on Stack Overflow, which makes them easy to use and to troubleshoot if problems arise.
In this example, we’ll be using IMDb’s Top 250 Movies. You can follow along in this Kaggle notebook!
Installing the Libraries
These libraries will be installed using pip, the package installer for Python. Open your terminal or command prompt and run this:
pip install requests
pip install beautifulsoup4
(The package on PyPI is named beautifulsoup4; you import it as bs4.) Once you’ve run both commands, you should have everything you need to begin web scraping.
Importing the Libraries
Once you have the libraries installed, open up a Python file in whichever editor you like (or open the Kaggle notebook for this article) and run the following imports:
import requests
from bs4 import BeautifulSoup
Running Requests and Making a Soup
Let’s run a request to get the HTML from the page:
html = requests.get('https://www.imdb.com/chart/top/?ref_=nv_mv_250').content
If we were to print our html, we would see the raw HTML without any formatting. This is because we grabbed the response body as bytes using the “content” attribute (the “text” attribute would give us a decoded string instead).
We now need to parse the HTML, which is where BeautifulSoup comes in:
soup = BeautifulSoup(html, 'html.parser')
Now if we print out our soup object, we get parsed HTML:
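To see the parsed result more clearly, BeautifulSoup’s prettify() method prints the document with one tag per line, indented. Here is a minimal sketch using a short stand-in snippet in place of the IMDb response, so it runs without a network request; the real soup object works the same way:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML we fetched above; the live soup behaves identically
html = "<html><body><h1>IMDb Top 250</h1></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns the parsed document, indented with one tag per line
print(soup.prettify())
```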
Getting the Data
Now that we’ve collected the HTML page and parsed it, we can grab whatever data we would like. This is done by calling different methods and attributes on the BeautifulSoup object we created (we called ours “soup”).
The easiest way to find elements in the html is through the use of DevTools in your browser. You can activate this by opening up the web page and pressing F12.
You can then hover over elements of the page to quickly find the ones that may be most useful. In this case, we can see the movie list is made up of one large table:
Getting Movie Titles
First, let’s get the title of each movie on the list. In our code we can narrow down the HTML we are looking at by only looking at the <tbody> portion, since we only care about the movies inside the table.
Now that we’ve narrowed down what we have to look at, it’s easier to tell where the data we want is. Each movie has its own row in that table:
We can simplify what we are looking at further by having our soup return the rows of the table as a list:
rows = soup.tbody.find_all('tr')
Now that we have a list of rows, we can print one row and find the data we want. The title can be found in a column with the class “titleColumn”:
So we will need to find a column with that class for each row. Now that we’ve created a list of rows, we’ll have to iterate over that list:
for row in rows:
    column = row.find('td', 'titleColumn')
If we were to print out the text of the “column” we just created, we would see extra information besides just the movie title:
This is because BeautifulSoup grabs all of the text within the title column, so we’ll have to narrow it down further.
The title is within the only <a> tag in the title column, so we can easily pull it by telling BeautifulSoup to get the text from that <a> tag:
for row in rows:
    column = row.find('td', 'titleColumn')
    title = column.a.text
If we print the titles, we get a neat list of all the titles on IMDb’s Top 250:
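Put together, the loop looks like the sketch below. It uses a tiny two-row stand-in for the IMDb table (assumed to match the structure described above) so it runs without a request; in the notebook you would use the soup built from the live page:

```python
from bs4 import BeautifulSoup

# Two-row stand-in mirroring the table structure described above
html = """
<table><tbody>
  <tr><td class="titleColumn">1. <a>The Shawshank Redemption</a> <span>(1994)</span></td></tr>
  <tr><td class="titleColumn">2. <a>The Godfather</a> <span>(1972)</span></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = soup.tbody.find_all('tr')
titles = []
for row in rows:
    # The second argument to find() is a shortcut for matching by class
    column = row.find('td', 'titleColumn')
    titles.append(column.a.text)

print(titles)  # ['The Shawshank Redemption', 'The Godfather']
```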
Getting Movie Release Years
We can also do this with the year by getting the text from the <span> tag. Since this is in the same column as the title, the only difference is using column.span instead of column.a:
for row in rows:
    column = row.find('td', 'titleColumn')
    title = column.a.text
    year = column.span.text
Now we can print out the title of each movie along with the release year:
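As a runnable sketch of that print, again using a stand-in row assumed to match the live table structure:

```python
from bs4 import BeautifulSoup

# One stand-in row with the same structure as the live table
html = ('<table><tbody><tr><td class="titleColumn">'
        '1. <a>The Shawshank Redemption</a> <span>(1994)</span>'
        '</td></tr></tbody></table>')
soup = BeautifulSoup(html, 'html.parser')

for row in soup.tbody.find_all('tr'):
    column = row.find('td', 'titleColumn')
    title = column.a.text
    year = column.span.text
    print(f'{title} {year}')  # The Shawshank Redemption (1994)
```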
Of course, we can do a great deal more with these two tools. The user ratings and rank are easy to grab from here as well.
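For the rating, for instance, the page (as described at the time of writing) keeps each score in a column with the classes “ratingColumn imdbRating”, inside a <strong> tag; those class names are an assumption about IMDb’s markup and may change as the site is updated. A sketch against a stand-in row:

```python
from bs4 import BeautifulSoup

# Stand-in row; the "imdbRating" class name reflects the page as it
# appeared when this article was written and may change over time
html = ('<table><tbody><tr>'
        '<td class="titleColumn">1. <a>The Shawshank Redemption</a> <span>(1994)</span></td>'
        '<td class="ratingColumn imdbRating"><strong>9.2</strong></td>'
        '</tr></tbody></table>')
soup = BeautifulSoup(html, 'html.parser')

for row in soup.tbody.find_all('tr'):
    # Matching on a single class finds the element even if it has others
    rating = row.find('td', 'imdbRating').strong.text
    print(rating)  # 9.2
```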
Conclusion
Just to reiterate, the steps we took to get this information were:
- Request a web page (using the requests library)
- Parse the HTML (using BeautifulSoup)
- Search through the parsed HTML for the piece of information we want
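The three steps above fit in one short script. Here is a sketch that separates fetching from parsing; the URL and class names are the ones used earlier in this article and may change as IMDb updates its markup:

```python
import requests
from bs4 import BeautifulSoup

URL = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

def parse_movies(html):
    """Step 2 and 3: parse the HTML and search it for (title, year) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    movies = []
    for row in soup.tbody.find_all('tr'):
        column = row.find('td', 'titleColumn')
        movies.append((column.a.text, column.span.text))
    return movies

def scrape_top_250(url=URL):
    """Step 1: request the page, then hand the bytes to the parser."""
    html = requests.get(url).content
    return parse_movies(html)

# Usage (performs a live request):
# for title, year in scrape_top_250():
#     print(title, year)
```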
To dive a little deeper, check out how you can go into each movie page and get the genres for each movie using the same tools we used here!
If this helped you out, consider following me on Twitter for more daily programming tips and articles.