Analysing weekend box office data from Box Office Mojo by using Python (Part 1)

Nukky
Oct 10, 2018


I am a huge movie fan: I pay a visit to my local movie theatre on a weekly basis, and I often spend time watching movie reviews before going to sleep. One evening I came across a website called Box Office Mojo, and the idea of analysing some movie stats emerged instantly.

Let’s skip the trailers & ads and crack on with the main event here!

First and foremost, the prerequisites for this project are a basic understanding of how web scraping works and some knowledge of Pandas and matplotlib. It is also important that you have Jupyter Notebook installed on your computer.
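If any of these packages are missing, they can be installed straight from a notebook cell (a minimal sketch, assuming pip is available in your environment; lxml is included because we will use it as BeautifulSoup's parser):

# Run in a Jupyter cell; the leading '!' shells out to the system pip.
!pip install pandas requests beautifulsoup4 lxml matplotlib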

So let’s import some packages first:

import pandas as pd # data analysis
import requests # get url
from bs4 import BeautifulSoup as bs # data scraping
import matplotlib.pyplot as plt # Data visualisation
import datetime # Check week number

To connect to Box Office Mojo, we can simply use requests. Here, I wrote a small function for this task, as we will need to reuse this step later in the project.

def get_site(week, year):
    # Build the weekend chart URL for a given year and week number.
    url = ('http://www.boxofficemojo.com/weekend/chart/'
           '?yr={}&wknd={}&p=.htm'.format(year, week))
    r = requests.get(url)
    return r.content

By inputting a week number and year, get_site will return the raw content of the page.
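One thing this function does not do is check whether the request actually succeeded. A minimal defensive variant (my own sketch, not part of the original code) would fail loudly on a bad HTTP response:

def get_site_checked(week, year):
    # Same as get_site, but raise an error on a 4xx/5xx response
    # instead of silently returning an error page.
    url = ('http://www.boxofficemojo.com/weekend/chart/'
           '?yr={}&wknd={}&p=.htm'.format(year, week))
    r = requests.get(url)
    r.raise_for_status()  # raises requests.exceptions.HTTPError on failure
    return r.content

With the page content in hand, we next need to parse it.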

def parse_source(page_source):
    # Use bs4 to find the box office table in the requested page.
    page_soup = bs(page_source, "lxml")
    table = page_soup.find_all('table', attrs={'cellspacing': '1'})[0]
    output_columns = ['TW', 'LW', 'Title', 'Studio',
                      'Weekend_Gross/$', '%_Change',
                      'Theater_Count', 'Theater_Change', 'Average/$',
                      'Total_Gross/$', 'Budget', 'num_week']
    output = dict((x, []) for x in output_columns)
    # Get all data rows from the table, skipping the header row.
    all_rows = table.find_all('tr')[1:106]
    # Keep only the top 50 ranked films.
    for row in all_rows[:50]:
        row_cols = row.find_all('td')
        for dict_key, col in zip(output_columns, row_cols):
            output[dict_key].append(col.text)
    # Output is a dataframe with the columns in their original order.
    output_pd = pd.DataFrame(output)[output_columns]
    return output_pd

In the parse_source function, bs4 is used to extract the box office table, so all the irrelevant parts of the page are filtered out. A list defines the column names for the data we want to keep. Next, the data from each row is stored as values in a dictionary, with output_columns as the keys. Finally, the dictionary (output) is converted to a dataframe.
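To see the dict-of-lists pattern in isolation, here is a toy version with two hypothetical rows; pd.DataFrame turns each key into a column:

cols = ['TW', 'Title']
data = dict((c, []) for c in cols)
for rank, title in [('1', 'Film A'), ('2', 'Film B')]:
    data['TW'].append(rank)
    data['Title'].append(title)
# Each key becomes a column; indexing with cols fixes the column order.
toy_df = pd.DataFrame(data)[cols]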

Before we start the whole process, there is still some cleaning to be done.

def df_format(main_df):
    change_columns1 = ['Weekend_Gross/$', 'Average/$',
                       'Total_Gross/$', '%_Change']
    change_columns2 = ['Theater_Change', 'Theater_Count']
    change_columns3 = ['num_week', 'TW']
    main_df.drop('Budget', axis=1, inplace=True)
    # Dollar columns: strip '$' and thousands separators, then cast to int.
    # %_Change: strip '%' and ',', map '-' (no change) to 0, cast to float.
    for i in change_columns1:
        if i != '%_Change':
            main_df.loc[:, i] = main_df.loc[:, i].str.replace(',', '')
            # regex=False so '$' is removed literally rather than being
            # read as a regex end-of-string anchor (a silent no-op).
            main_df.loc[:, i] = main_df.loc[:, i].str.replace('$', '',
                                                              regex=False)
            main_df.loc[:, i] = main_df.loc[:, i].astype(int)
        else:
            main_df.loc[:, i] = main_df.loc[:, i].str.replace('%', '')
            main_df.loc[:, i] = main_df.loc[:, i].str.replace(',', '')
            main_df.loc[:, i] = main_df.loc[:, i].replace('-', 0)
            main_df.loc[:, i] = main_df.loc[:, i].astype(float)
    # Theater columns: strip ',', and map '-' in Theater_Change to 0.
    for b in change_columns2:
        main_df.loc[:, b] = main_df.loc[:, b].str.replace(',', '')
        if b == 'Theater_Change':
            main_df.loc[:, b] = main_df.loc[:, b].replace('-', 0)
        main_df.loc[:, b] = main_df.loc[:, b].astype(int)
    # Rank and week-count columns are plain integers.
    for c in change_columns3:
        main_df.loc[:, c] = main_df.loc[:, c].astype(int)
    main_df.set_index('TW', inplace=True)
    return main_df

This function looks a bit convoluted, and I know there are more elegant solutions for cleaning and formatting the data, but it works.
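For the curious, here is a sketch of one such more compact alternative (my own version, equivalent in intent to df_format but not the author's code); it strips the formatting characters with a single regex pass per column:

def df_format_alt(df):
    # Sketch only: same cleaning steps as df_format, written more compactly.
    df = df.drop('Budget', axis=1)
    for col in ['Weekend_Gross/$', 'Average/$', 'Total_Gross/$']:
        # Inside a character class, '$' is a literal dollar sign.
        df[col] = df[col].str.replace(r'[$,]', '', regex=True).astype(int)
    df['%_Change'] = (df['%_Change'].str.replace(r'[%,]', '', regex=True)
                      .replace('-', 0).astype(float))
    for col in ['Theater_Count', 'Theater_Change']:
        df[col] = (df[col].str.replace(',', '')
                   .replace('-', 0).astype(int))
    df[['num_week', 'TW']] = df[['num_week', 'TW']].astype(int)
    return df.set_index('TW')

Either way, we are ready to go.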

# Find out the week number for the latest weekend, which was 40.
datetime.date(2018, 10, 6).isocalendar()[1]
# Get the data using the previously written functions.
new = get_site(40, 2018)
main_df = parse_source(new)
main_df = df_format(main_df)
main_df

This will return a cleaned dataframe, and it should look like this.
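A quick sanity check that the formatting worked is to inspect the column dtypes; after df_format, every numeric column should be int or float rather than object:

# The dollar and count columns should all be numeric after df_format.
main_df.dtypes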

Most of these columns are fairly self-explanatory, but a few things to note: TW is this week's box office ranking, LW is last week's ranking, and an N in the LW column means the film is new this weekend.
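For instance, the new entrants are easy to isolate (assuming, as in my run, that the scraped LW column holds the literal string 'N' for new films):

# Films that entered the chart this weekend (LW == 'N').
new_entries = main_df[main_df['LW'] == 'N']
new_entries[['Title', 'Studio', 'Weekend_Gross/$']]

Now let's investigate how each studio performed in week 40.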

# Group by Studio and calculate the sum of weekend box office gross,
# then sort the values in descending order.
week40 = (main_df.groupby('Studio')[['Weekend_Gross/$']].sum()
          .sort_values('Weekend_Gross/$', ascending=False))
week40.head(10)
Top 10 studios by total weekend box office gross in week 40

From the table above, we can see that Sony did really well, but that is mainly thanks to its new Venom movie.
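We can verify that directly by listing Sony's individual titles for the weekend (assuming the studio label in the scraped Studio column is the literal string 'Sony'):

# Sony's releases in the week-40 chart, largest weekend gross first.
sony = main_df[main_df['Studio'] == 'Sony']
sony.sort_values('Weekend_Gross/$', ascending=False)[['Title', 'Weekend_Gross/$']]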

In the next part, we are going to get box office data from more weekends and dig deeper into how the major film studios have performed in 2018 so far.
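As a small preview, the three functions above can already be chained over several weekends and the results stacked (a hypothetical helper of my own, not from the article):

def get_weeks(year, weeks):
    # Fetch, parse, and clean each weekend chart, then stack the results.
    frames = []
    for wk in weeks:
        df = df_format(parse_source(get_site(wk, year)))
        df['week'] = wk  # tag each row with its week number
        frames.append(df)
    return pd.concat(frames)

# For example: season_df = get_weeks(2018, range(38, 41))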

The link for the second part of the Box Office Mojo project:

https://medium.com/@kunsitu/analysing-weekend-box-office-data-from-box-office-mojo-by-using-python-part-2-a1687e8553f1

In the meantime, please share this article with your friends if you enjoyed it, and check out some of my other articles below.

https://medium.com/@kunsitu/freetrade-api-wrapper-unofficial-6dd36e65dd37

https://medium.com/@kunsitu/retrieving-data-from-amazon-relational-database-service-rds-postgresql-d0611cb13d2a
