Hierarchical Web Scraping With Python
One of the more difficult tasks when web scraping is dealing with hierarchical data. That is, data that lives on different pages.
If you’re looking for a basic way to get data from each page without manually going through them, look no further. We’ll be using two popular Python libraries to do just that:
- Requests
- BeautifulSoup
In this example, we’ll be using IMDb’s Top 250 Movies. You can follow along in this Kaggle notebook!
Getting Each Movie Title
We’ll go over getting the basic information quickly since that is relatively simple. If you’d like more detail, I’ve covered this in a previous Medium article.
We can see that all of the movie information is in the main table body:
Each movie has a row in this table. Inside those rows there are columns. The column that holds the name of the movie has a class of “titleColumn”:
Getting the title of the movie is easy from here. We can use the requests library to fetch the page, parse it with BeautifulSoup, and iterate through the rows. For each row, we navigate to the title column and grab the movie title inside the <a> tag.
Likewise, we can get the release year from the same column, where it sits inside a <span> tag.
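The steps above can be sketched against a small HTML fragment shaped like the table described (in the real scraper, the markup would come from requests.get on the Top 250 page, and the class names may have changed since this was written):

```python
from bs4 import BeautifulSoup

# A tiny fragment shaped like the Top 250 table described above; the
# real scraper would pass requests.get(...).content instead.
html = """
<table><tbody>
  <tr><td class="titleColumn">
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
  </td></tr>
  <tr><td class="titleColumn">
    <a href="/title/tt0068646/">The Godfather</a>
    <span class="secondaryInfo">(1972)</span>
  </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.tbody.find_all('tr')

movies = []
for row in rows:
    column = row.find('td', class_='titleColumn')
    title = column.a.text      # text inside the <a> tag
    year = column.span.text    # text inside the <span> tag
    movies.append((title, year))
```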
Diving Deeper
Let’s say we want some information about each movie that isn’t on the Top 250 list page. We want to add the genre of movie it is, which can be found on the individual movie page:
In the bottom left corner of the screenshot, we have the genre categories the movie belongs to. So we’ll need to go to each individual movie page to pull that information.
Luckily, the page is linked right on the main Top 250 page. Clicking the movie title would send us here. If we open DevTools (press F12 in the browser), we can see that this link is inside the <a> tag we used to get the movie title:
The link is in the “href” attribute of the <a> tag, and BeautifulSoup gives us just the tool to grab it:
link = item.find('a')['href']
The notation is like indexing into a dictionary: subscripting a tag with an attribute name returns that attribute’s value.
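A quick sketch of that attribute access (the tag here is just an inline example):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/title/tt0068646/">The Godfather</a>',
                    'html.parser').a

# Subscripting a tag with an attribute name returns that attribute's value
link = tag['href']

# tag.get() does the same, but returns None for a missing attribute
# instead of raising a KeyError
missing = tag.get('class')
```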
Requesting Each Movie’s Page
In our case, we’ll have to get this link for each movie in the list, which means iterating over the rows we created earlier:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
And we can print out the links to each of the movies:
The links don’t have the long query string we saw in DevTools, but they work the same and take us to the correct page.
Now we just have to request each page. Keep in mind that the href isn’t a full URL; we have to prepend the domain:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    # the href already starts with a slash, so no extra '/' is needed
    movie_page = requests.get(f'https://www.imdb.com{link}').content
Once we have the contents of the page, we need to parse it to get at the genre section:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://www.imdb.com{link}').content
    movie_soup = BeautifulSoup(movie_page, 'html.parser')
Finding the Movie Genres
Now we just have to figure out where the genre name is located on the page:
So each genre is inside a tag with a long generated class name. This can be easily added to the row iteration:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://www.imdb.com{link}').content
    movie_soup = BeautifulSoup(movie_page, 'html.parser')
    genre = movie_soup.find('span', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')
If we were to print out the genre at this point, we get something unexpected:
We only print out one genre per movie — specifically, the first genre in the HTML.
Of course, we know a movie can have multiple genres, as shown in the screenshots of The Godfather above. So we need to iterate through all of them, which we can do with the “find_all” method and a list comprehension:
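Here is a sketch of that, run against an inline fragment shaped like the genre markup (the long generated class name is the one from DevTools above and may well have changed since):

```python
from bs4 import BeautifulSoup

# Fragment shaped like the genre chips; the long generated class name
# comes from DevTools and is likely to change over time.
html = """
<span class="sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt">Crime</span>
<span class="sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt">Drama</span>
"""
movie_soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; the comprehension pulls the text
genres = [g.text for g in movie_soup.find_all(
    'span', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')]
```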
And finally, we can attach it to the main movie names and release dates:
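Putting it all together, here’s a sketch of the whole flow. A stub stands in for the network call — in the real scraper, fetch would be requests.get(f'https://www.imdb.com{link}').content — and the markup fragments just mirror the structure described above:

```python
from bs4 import BeautifulSoup

# Canned pages keyed by link; in the real scraper each page would come
# from requests.get(f'https://www.imdb.com{link}').content instead.
GENRE_CHIP = ('<span class="sc-16ede01-3 bYNgQ ipc-chip '
              'ipc-chip--on-baseAlt">{}</span>')
MOVIE_PAGES = {
    '/title/tt0068646/': GENRE_CHIP.format('Crime') + GENRE_CHIP.format('Drama'),
}

def fetch(link):
    """Stub for the per-movie HTTP request."""
    return MOVIE_PAGES[link]

list_html = """
<table><tbody>
  <tr><td class="titleColumn">
    <a href="/title/tt0068646/">The Godfather</a>
    <span class="secondaryInfo">(1972)</span>
  </td></tr>
</tbody></table>
"""
rows = BeautifulSoup(list_html, 'html.parser').find_all('tr')

movies = []
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_soup = BeautifulSoup(fetch(link), 'html.parser')
    genres = [g.text for g in movie_soup.find_all(
        'span', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')]
    movies.append({
        'title': column.a.text,
        'year': column.span.text,
        'genres': genres,
    })
```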
This is as deep as we’ll go for this example, but you can go much more than two layers into the hierarchy using this method.
Conclusion
We’ve covered quite a lot. You should now be able to:
- Request a web page
- Pull information from that page
- Find other web pages that contain information you want
- Gather information from those pages
And you can do all of this on pretty much any website. Let me know if you have any questions in the comments!
If this helped you out, consider following me on Twitter for more daily programming tips and articles.