Scraping Hollywood — Beautiful Soup and Movie Data

Ned H
3 min read · Apr 21, 2019


In my last post about Hollywood movies, I didn’t specifically address the engineering challenge of scraping all the data for that analysis. Below, I will walk through the process of identifying and scraping data using Beautiful Soup.

The first step is to identify the data you want to scrape. It helps if the data already sits in some kind of HTML structure that Beautiful Soup can use to locate it on the page.

The target page I went after already has a rough table structure, which means there are HTML tags we can find with Beautiful Soup.

Ok, let’s get started!
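The original code was shared as an image, so here is a minimal sketch of the same approach. The variable names (url, raw_list, r, soup) follow the description below; the URL itself is a placeholder, not necessarily the page used in the original analysis.

```python
import requests
from bs4 import BeautifulSoup

# Pages to scrape -- only one in this sample, but you can add as many as you like.
# This URL is a placeholder standing in for the page used in the original post.
url = ["https://www.the-numbers.com/movie/budgets/all"]

raw_list = []

for page in url:
    # Fetch the page and hand its text to Beautiful Soup
    r = requests.get(page)
    soup = BeautifulSoup(r.text, "html.parser")

    # The <tr> tag brackets the rows of data we need
    for row in soup.find_all("tr"):
        # get_text("\n") joins the cell text with newlines between fields
        raw_list.append(row.get_text("\n", strip=True))
```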

So what happened above?

We store our string outputs from Beautiful Soup in raw_list.

The for loop iterates through every URL in our url variable (just one in the sample, but you can add as many pages as you like) and uses requests.get() to generate a response object r. That response object’s text output is passed to Beautiful Soup, which creates our soup object.

The power of the soup object is that we can search by HTML tag. In this case, after a lot of trial and error, I found that the <tr> tag demarcated the data I needed.

Finally, appending the text bracketed by each <tr> tag into a list leaves the fields separated by ‘\n’ (line break) characters. This is actually ideal for casting into a dataframe, because I can simply save the list as a pandas Series and then call .str.split(‘\n’, expand=True).

By using data.str, I can treat the Series elements as strings.

By using .split(‘\n’, expand=True), I split each element on a delimiter, in this case a newline (‘\n’), and expand each resulting item into its own column (expand=True).
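Put together, the dataframe step looks roughly like this (a sketch, assuming raw_list holds one newline-separated string per table row):

```python
import pandas as pd

# Treat each scraped row as one string in a pandas Series
data = pd.Series(raw_list)

# Split every row on the newline delimiter and spread the pieces into columns
df = data.str.split("\n", expand=True)
```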

The result is a dataframe whose columns are labeled with integers (0, 1, 2, and so on) rather than meaningful names.

Referring back to our source page, we can see what the column headings ought to be and rename them.
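Renaming is just a matter of assigning the headings from the source page to df.columns. The names below are illustrative placeholders, not necessarily the exact headings on the page I scraped:

```python
# Assign column names read off the source page (placeholder names shown here)
df.columns = ["rank", "release_date", "title", "production_budget",
              "domestic_gross", "worldwide_gross"]
```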

Now we end up with a good-looking dataframe. We still need to check the dtypes, remove some nulls, etc., but you can check my post here on how to remove NaNs.
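A rough sketch of that cleanup might look like this, assuming the money columns arrived as strings with ‘$’ and ‘,’ characters (a common pattern, but an assumption here):

```python
# Drop rows that are entirely empty (e.g. header rows that split into nothing)
df = df.dropna(how="all")

# Strip currency formatting and convert to numeric dtypes;
# the '$'/',' cleanup is an assumption about how the page formats its figures
for col in ["production_budget", "domestic_gross", "worldwide_gross"]:
    df[col] = pd.to_numeric(df[col].str.replace("[$,]", "", regex=True),
                            errors="coerce")
```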

Thanks for reading!
