Hierarchical Web Scraping With Python
One of the more difficult tasks when web scraping is dealing with hierarchical data. That is, data that lives on different pages.
If you’re looking for a basic way to get data from each page without manually going through them, look no further. We’ll be using two popular Python libraries to do just that:
- Requests
- BeautifulSoup
In this example, we’ll be using IMDb’s Top 250 Movies. You can follow along in this Kaggle notebook!
Getting Each Movie Title
We’ll go over getting the basic information quickly since that is relatively simple. If you’d like more detail, I’ve covered this in a previous Medium article.
We can see that all of the movie information is in the main table body:
Each movie has a row in this table. Inside those rows there are columns. The column that holds the name of the movie has a class of “titleColumn”:
Getting the title of the movie is easy from here. We can use the requests library to fetch the page, parse it with BeautifulSoup, and iterate through the rows. For each row, we navigate to the title column and grab the movie title inside the <a> tag.
Likewise, we can get the release year from the same column, where it sits inside a <span> tag.
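The steps above can be sketched against a small HTML fragment shaped like the table described (in the real scraper, the markup would come from requests.get on the Top 250 page, and the class names may have changed since this was written):

```python
from bs4 import BeautifulSoup

# A tiny fragment shaped like the Top 250 table described above; the
# real scraper would pass requests.get(...).content instead.
html = """
<table><tbody>
  <tr><td class="titleColumn">
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span>
  </td></tr>
  <tr><td class="titleColumn">
    <a href="/title/tt0068646/">The Godfather</a>
    <span class="secondaryInfo">(1972)</span>
  </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = soup.tbody.find_all('tr')

movies = []
for row in rows:
    column = row.find('td', class_='titleColumn')
    title = column.a.text      # text inside the <a> tag
    year = column.span.text    # text inside the <span> tag
    movies.append((title, year))
```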
Diving Deeper
Let’s say we want some information about each movie that isn’t on the Top 250 list page. We want to add the genre of movie it is, which can be found on the individual movie page:
In the bottom left corner of the screenshot, we have the genre categories the movie belongs to. So we’ll need to go to each individual movie page to pull that information.
Luckily, the page is linked right on the main Top 250 page. Clicking the movie title would send us here. If we open DevTools (press F12 in the browser), we can see that this link is inside the <a> tag we used to get the movie title:
The link is in the “href” attribute of the <a> tag, and BeautifulSoup gives us just the tool to grab it:
link = item.find('a')['href']
The notation is like indexing into a dictionary: subscripting a tag with an attribute name returns that attribute’s value.
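A quick sketch of that attribute access (the tag here is just an inline example):

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/title/tt0068646/">The Godfather</a>',
                    'html.parser').a

# Subscripting a tag with an attribute name returns that attribute's value
link = tag['href']

# tag.get() does the same, but returns None for a missing attribute
# instead of raising a KeyError
missing = tag.get('class')
```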
Requesting Each Movie’s Page
In our case, we’ll have to get this link for each movie in the list, which means iterating over the rows we created earlier:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
And we can print out the links to each of the movies:
The links don’t have the long query string we saw in DevTools, but they work the same and take us to the correct page.
Now we just have to request each page. Keep in mind that the href isn’t a full URL; we have to prepend the domain:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    # the href already starts with a slash, so no extra '/' is needed
    movie_page = requests.get(f'https://www.imdb.com{link}').content
Once we have the contents of the page, we need to parse it to get at the genre section:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://www.imdb.com{link}').content
    movie_soup = BeautifulSoup(movie_page, 'html.parser')
Finding the Movie Genres
Now we just have to figure out where the genre name is located on the page:
So each genre is inside a tag with a long generated class name. This can be easily added to the row iteration:
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_page = requests.get(f'https://www.imdb.com{link}').content
    movie_soup = BeautifulSoup(movie_page, 'html.parser')
    genre = movie_soup.find('span', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')
If we were to print out the genre at this point, we get something unexpected:
We only print out one genre per movie — specifically, the first genre in the HTML.
Of course, we know a movie can have multiple genres, as shown in the screenshots of The Godfather above. So we need to iterate through all of them, which we can do with the “find_all” method and a list comprehension:
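Here is a sketch of that, run against an inline fragment shaped like the genre markup (the long generated class name is the one from DevTools above and may well have changed since):

```python
from bs4 import BeautifulSoup

# Fragment shaped like the genre chips; the long generated class name
# comes from DevTools and is likely to change over time.
html = """
<span class="sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt">Crime</span>
<span class="sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt">Drama</span>
"""
movie_soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; the comprehension pulls the text
genres = [g.text for g in movie_soup.find_all(
    'span', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')]
```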
And finally, we can attach it to the main movie names and release dates:
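Putting it all together, here’s a sketch of the whole flow. A stub stands in for the network call — in the real scraper, fetch would be requests.get(f'https://www.imdb.com{link}').content — and the markup fragments just mirror the structure described above:

```python
from bs4 import BeautifulSoup

# Canned pages keyed by link; in the real scraper each page would come
# from requests.get(f'https://www.imdb.com{link}').content instead.
GENRE_CHIP = ('<span class="sc-16ede01-3 bYNgQ ipc-chip '
              'ipc-chip--on-baseAlt">{}</span>')
MOVIE_PAGES = {
    '/title/tt0068646/': GENRE_CHIP.format('Crime') + GENRE_CHIP.format('Drama'),
}

def fetch(link):
    """Stub for the per-movie HTTP request."""
    return MOVIE_PAGES[link]

list_html = """
<table><tbody>
  <tr><td class="titleColumn">
    <a href="/title/tt0068646/">The Godfather</a>
    <span class="secondaryInfo">(1972)</span>
  </td></tr>
</tbody></table>
"""
rows = BeautifulSoup(list_html, 'html.parser').find_all('tr')

movies = []
for row in rows:
    column = row.find('td', class_='titleColumn')
    link = column.a['href']
    movie_soup = BeautifulSoup(fetch(link), 'html.parser')
    genres = [g.text for g in movie_soup.find_all(
        'span', 'sc-16ede01-3 bYNgQ ipc-chip ipc-chip--on-baseAlt')]
    movies.append({
        'title': column.a.text,
        'year': column.span.text,
        'genres': genres,
    })
```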
This is as deep as we’ll go for this example, but you can go much more than two layers into the hierarchy using this method.
Conclusion
We’ve covered quite a lot. You should now be able to:
- Request a web page
- Pull information from that page
- Find other web pages that contain information you want
- Gather information from those pages
And you can do all of this on pretty much any website. Let me know if you have any questions in the comments!
If this helped you out, consider following me on Twitter for more daily programming tips and articles.