Hierarchical Web Scraping With Python

Jonathan Joyner
Published in The Dev Project
5 min read · Mar 29, 2022
Photo by Kelly Sikkema on Unsplash

One of the more difficult tasks in web scraping is dealing with hierarchical data: data that is spread across linked pages, where a list page links to a separate detail page for each item.

If you’re looking for a basic way to get data from each page without manually going through them, look no further. We’ll be using two popular Python libraries to do just that:

  • Requests
  • BeautifulSoup

In this example, we’ll be using IMDb’s Top 250 Movies. You can follow along in this Kaggle notebook!

Getting Each Movie Title

We’ll go over getting the basic information quickly since that is relatively simple. If you’d like more detail, I’ve covered this in a previous Medium article.

We can see that all of the movie information is in the main table body:

Table Body

Each movie has a row in this table. Inside those rows there are columns. The column that holds the name of the movie has a class of “titleColumn”:

Title Column

Now, getting the title of the movie is easy from here. We can simply use the requests library to get the page and iterate through the rows. For each row we can navigate to the title column, and grab the movie title inside the <a> tag:

Movie Titles Printed
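A minimal sketch of that step follows. It is not the exact code from the notebook: `SAMPLE_HTML` is a hand-written stand-in for the Top 250 page (an assumption, not real IMDb markup) so the example runs without a network call, and the function name `get_titles` is hypothetical. On the live page you would pass `requests.get(url).content` to the function instead.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the table structure described above
# (an assumption for illustration, not real IMDb markup).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td class="titleColumn">1. <a href="/title/tt0111161/">The Shawshank Redemption</a> <span>(1994)</span></td></tr>
    <tr><td class="titleColumn">2. <a href="/title/tt0068646/">The Godfather</a> <span>(1972)</span></td></tr>
  </tbody>
</table>
"""


def get_titles(html):
    """Return the movie title from every row of the first table body."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find('tbody').find_all('tr')
    titles = []
    for row in rows:
        # Navigate to the title column, then read the text of its <a> tag.
        column = row.find('td', class_='titleColumn')
        titles.append(column.a.text)
    return titles


titles = get_titles(SAMPLE_HTML)
print(titles)
```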

Likewise, we can get the release year, which sits in the same column inside the <span> tag:

Movie Title and Year
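As a rough sketch, the title and year can be read from the same column like this. The sample row is hand-written, and it assumes the year is wrapped in parentheses, as in "(1972)"; the helper name `title_and_year` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample row (assumed structure, not copied from IMDb).
ROW_HTML = """
<tr><td class="titleColumn">
  2. <a href="/title/tt0068646/">The Godfather</a>
  <span>(1972)</span>
</td></tr>
"""


def title_and_year(row):
    """Return (title, year) from one table row."""
    column = row.find('td', class_='titleColumn')
    title = column.a.text.strip()
    # Strip whitespace first, then the surrounding parentheses.
    year = int(column.span.text.strip().strip('()'))
    return title, year


row = BeautifulSoup(ROW_HTML, 'html.parser').find('tr')
print(title_and_year(row))
```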

Diving Deeper

Let’s say we want some information about each movie that isn’t on the Top 250 list page. For example, we might want each movie’s genre, which can be found on the individual movie page:

Movie Genre

In the bottom left corner of the screenshot, we have the genre categories the movie belongs to. So we’ll need to go to each individual movie page to pull that information.

Luckily, each movie’s page is linked right from the main Top 250 page: clicking the movie title takes us there. If we open DevTools (press F12 while in the browser), we can see that this link is inside the same <a> tag we used to get the movie title:

Movie Title Link

The link is in the “href” attribute of the <a> tag. Luckily, BeautifulSoup gives us just the tool to grab that attribute:

link = item.find('a')['href']

The notation is the same one used to look up a key in a dictionary. BeautifulSoup treats a tag’s attributes like dictionary entries, so we can read any attribute of a tag by name.
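To make the attribute lookup concrete, here is a small sketch using a hand-written <a> tag (the markup is an assumption for illustration). It also shows `.get()`, which returns `None` rather than raising when an attribute is missing.

```python
from bs4 import BeautifulSoup

html = '<a href="/title/tt0068646/" title="The Godfather">The Godfather</a>'
tag = BeautifulSoup(html, 'html.parser').a

link = tag['href']        # dictionary-style lookup; raises KeyError if the attribute is missing
missing = tag.get('id')   # .get() returns None instead of raising
all_attrs = tag.attrs     # plain dict of every attribute on the tag

print(link, missing, all_attrs)
```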

Requesting Each Movie’s Page

In our case, we’ll have to get the link for every movie in the list, which means iterating over the rows we created earlier:

for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']

And we can print out the links to each of the movies:

Movie Links

The links don’t have the long query string that we saw in DevTools. However, they work the same and will take us to the correct page.

Now we just have to request that page. Keep in mind that this isn’t the full link, so we have to add on the domain portion:

for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://imdb.com/{link}').content
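One alternative to building the URL with an f-string is `urljoin` from the standard library, which produces the same result whether or not the relative link starts with a slash. This is a swapped-in technique rather than what the article’s code does; `BASE_URL` and `full_url` are hypothetical names.

```python
from urllib.parse import urljoin

BASE_URL = 'https://imdb.com/'


def full_url(link):
    """Combine the site's base URL with a relative link taken from an href."""
    return urljoin(BASE_URL, link)


# Both forms resolve to the same address, with no doubled slash.
print(full_url('/title/tt0068646/'))
print(full_url('title/tt0068646/'))
```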

Once we have the contents of the page, we need to parse it to get at the genre section:

for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://imdb.com/{link}').content
    movie_soup = BeautifulSoup(movie_page, 'html.parser')
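Since this loop sends one request per movie, it can help to wrap the request-and-parse step in a small helper that checks the response status and pauses between calls. The sketch below is one possible shape, not part of the original notebook: `fetch_soup` is a hypothetical name, and the timeout and delay values are arbitrary choices. It takes the `get` callable as a parameter, so anything that behaves like `requests.get` will work.

```python
import time

from bs4 import BeautifulSoup


def fetch_soup(get, url, delay=1.0):
    """Fetch url with the given callable and return the parsed page.

    `get` is expected to behave like requests.get (or a Session's .get).
    """
    response = get(url, timeout=10)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    time.sleep(delay)            # pause so the requests are not fired back to back
    return BeautifulSoup(response.content, 'html.parser')


# Typical use (needs network access):
#   import requests
#   session = requests.Session()
#   movie_soup = fetch_soup(session.get, 'https://imdb.com/title/tt0068646/')
```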

Finding the Movie Genres

Now we just have to figure out where the genre name is located on the page:

So each genre is inside an <a> tag with a long class name. This can be easily added to the row iteration:

for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://imdb.com/{link}').content
    movie_soup = BeautifulSoup(movie_page, 'html.parser')
    genre = movie_soup.find('a', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')

If we were to print out the genre at this point, we would get something unexpected:

We only print out one genre per movie: the first one in the HTML. That is because “find” returns only the first matching tag.

Of course, we know a movie can have multiple genres, as shown in the screenshots of The Godfather above. So we need to iterate through the genres. We can do this with the “find_all” method and a list comprehension:

Multiple Genres
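The difference between the two methods can be sketched on a hand-written snippet. The markup below is an assumption that reuses the class name from the code above, with each genre name inside an <a> tag; generated class names like this one tend to change over time, so check the live page before relying on it.

```python
from bs4 import BeautifulSoup

GENRE_CLASS = 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt'

# Hand-written stand-in for the genre section of a movie page.
MOVIE_HTML = f"""
<div>
  <a class="{GENRE_CLASS}" href="#"><span>Crime</span></a>
  <a class="{GENRE_CLASS}" href="#"><span>Drama</span></a>
</div>
"""

movie_soup = BeautifulSoup(MOVIE_HTML, 'html.parser')

# find() stops at the first match...
first_genre = movie_soup.find('a', GENRE_CLASS).text

# ...while find_all() returns every match, which a list comprehension turns into text.
genres = [tag.text for tag in movie_soup.find_all('a', GENRE_CLASS)]

print(first_genre)
print(genres)
```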

And finally, we can attach it to the main movie names and release dates:

Movies with year and genres
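Putting the pieces together, the whole two-level scrape might look like the sketch below. It is an illustration under stated assumptions rather than the notebook’s code: `scrape_movies`, `LIST_HTML`, and `FAKE_PAGES` are hypothetical, and the page fetcher is passed in as a function so the example can run against canned HTML instead of the live site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://imdb.com/'
GENRE_CLASS = 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt'


def scrape_movies(list_html, fetch_page):
    """Return one dict per movie with its title, year, and genres.

    fetch_page(url) must return the HTML of a movie page, for example
    lambda url: requests.get(url).content
    """
    soup = BeautifulSoup(list_html, 'html.parser')
    movies = []
    for row in soup.find('tbody').find_all('tr'):
        column = row.find('td', class_='titleColumn')
        # Second level of the hierarchy: follow the link to the movie's own page.
        movie_url = urljoin(BASE_URL, column.a['href'])
        movie_soup = BeautifulSoup(fetch_page(movie_url), 'html.parser')
        movies.append({
            'title': column.a.text,
            'year': column.span.text.strip('()'),
            'genres': [tag.text for tag in movie_soup.find_all('a', GENRE_CLASS)],
        })
    return movies


# Canned pages standing in for the live site (assumed markup).
LIST_HTML = """<table><tbody>
<tr><td class="titleColumn"><a href="/title/tt0068646/">The Godfather</a> <span>(1972)</span></td></tr>
</tbody></table>"""

FAKE_PAGES = {
    'https://imdb.com/title/tt0068646/':
        f'<a class="{GENRE_CLASS}"><span>Crime</span></a>'
        f'<a class="{GENRE_CLASS}"><span>Drama</span></a>',
}

movies = scrape_movies(LIST_HTML, FAKE_PAGES.__getitem__)
print(movies)
```

Passing the fetcher in as an argument keeps the parsing logic separate from the network call, which makes the function easy to try on saved pages before pointing it at the real site.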

This is as deep as we’ll go for this example but you can go much more than two layers deep into the hierarchy using this method.

Conclusion

We’ve covered quite a lot. On pretty much any website, you should now be able to:

  • Request a web page
  • Pull information from that page
  • Find other web pages that contain information you want
  • Gather information from those pages

Let me know if you have any questions in the comments!

If this helped you out, consider following me on Twitter for more daily programming tips and articles.
