Analysing weekend box office data from Box Office Mojo by using Python (Part 2)

Nukky
5 min read · Oct 15, 2018

This is the second part of “Analysing weekend box office data from Box Office Mojo by using Python”. It is recommended to go through Part 1 before proceeding with this post.

For reference, here is the link to Part 1: https://medium.com/@kunsitu/analysing-weekend-box-office-data-from-box-office-mojo-by-using-python-part-1-86dcabac9164

Now let’s recap what we did in Part 1:

  • We used the requests module to scrape raw data from the Box Office Mojo website.
  • We used the Beautiful Soup (BS4) module to extract the content we wanted from the raw data.
  • We used Pandas to clean and format the data.
  • We showed the top 10 studios according to their box office performance, which is also shown in the picture below.
Top 10 studios by total weekend box office sales in week 40

In Part 2, we will focus on how to get more data and combine it to analyse the performance of film studios in 2018 so far. In addition, the matplotlib module will be used for visualisation. So without further ado, let’s start by importing some modules.

import pandas as pd
import matplotlib.pyplot as plt

In Part 1, we defined three functions: get_site, parse_source and df_format. Here, we combine these three functions into a new function called get_top_box_office, which returns two lists: the top 10 studios of a given week and their box office sales.

def get_top_box_office(week, year):
    # get_site, parse_source and df_format are the functions from Part 1
    data = get_site(week, year)
    main_df = df_format(parse_source(data))
    # Sum the weekend gross per studio and sort from highest to lowest
    df = (main_df.groupby('Studio')['Weekend_Gross/$']
          .sum()
          .sort_values(ascending=False)
          .reset_index())
    # Return the top 10 studios of the week
    # and the box office sales of those 10 studios
    return (df['Studio'].tolist()[:10],
            df['Weekend_Gross/$'].tolist()[:10])
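
To see what the group, sum and sort step does without scraping anything, here is a minimal sketch on a small made-up table. The sample values and the helper name top_studios are assumptions for illustration only; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical weekend rows (made-up numbers, not real box office data)
sample = pd.DataFrame({
    'Studio': ['BV', 'WB', 'BV', 'Uni.', 'WB', 'Fox'],
    'Weekend_Gross/$': [100, 40, 50, 70, 20, 10],
})

def top_studios(df, n=10):
    """Return the n studios with the largest summed weekend gross."""
    totals = (df.groupby('Studio')['Weekend_Gross/$']
                .sum()
                .sort_values(ascending=False)
                .head(n))
    return totals.index.tolist(), totals.tolist()

studios, gross = top_studios(sample, n=3)
print(studios)  # ['BV', 'Uni.', 'WB']
print(gross)    # [150, 70, 60]
```

The two lists are aligned by position, which is what lets us pair each studio with its sales later on.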

We can use a for loop to collect, as tuples, the top 10 studios and their box office sales for each of the first 40 weeks of 2018.

data = []
# range() excludes its end value, so range(1, 41) covers weeks 1 to 40
for i in range(1, 41):
    studio, gross = get_top_box_office(i, 2018)
    data.append((i, studio, gross))
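
Since every iteration makes a web request, one failing week would stop the whole loop. The sketch below shows one possible way to skip such weeks and keep a note of them. The names collect_weeks and fake_fetch are hypothetical; fake_fetch only stands in for get_top_box_office so the example runs offline.

```python
def collect_weeks(fetch, first_week, last_week, year):
    """Call fetch(week, year) for every week in the inclusive range."""
    results, failed = [], []
    # +1 because range() excludes its end value
    for week in range(first_week, last_week + 1):
        try:
            studios, gross = fetch(week, year)
        except Exception:  # deliberately broad in this sketch
            failed.append(week)
            continue
        results.append((week, studios, gross))
    return results, failed

# Hypothetical stand-in for get_top_box_office (no network access needed)
def fake_fetch(week, year):
    if week == 3:
        raise ValueError('no data for this week')
    return ['BV', 'WB'], [100 * week, 50 * week]

data, failed = collect_weeks(fake_fetch, 1, 5, 2018)
print(len(data), failed)  # 4 [3]
```

With the real function, the call would look like collect_weeks(get_top_box_office, 1, 40, 2018).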

Now we can count how many times each studio appeared in the weekly top 10 by using value_counts.

# Store all the top 10 studio names in one list
top_10 = []
for i in data:
    top_10.extend(i[1])
# Count the names and store the result in a dataframe
top_10_count = pd.Series(top_10).value_counts().reset_index()
# We only need the first 10 studios
top_10_count = top_10_count[:10]
# Rename the columns
top_10_count.columns = ['studio', 'count']

top_10_count should look like the picture below:

top_10_count
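
As a small illustration of the counting step, here is a sketch on made-up weekly lists. It also names the columns directly with rename_axis and reset_index(name=...), which is just one alternative to renaming the columns afterwards; the sample lists are assumptions, not real data.

```python
import pandas as pd

# Hypothetical top-3 lists for three weeks (made-up data)
weekly_top = [
    ['BV', 'WB', 'Uni.'],
    ['BV', 'Fox', 'WB'],
    ['BV', 'Uni.', 'Sony'],
]

# Flatten the weekly lists into one list of studio names
names = [studio for week in weekly_top for studio in week]

# Count the appearances and name both columns in one go
counts = (pd.Series(names)
            .value_counts()
            .rename_axis('studio')
            .reset_index(name='count'))

print(counts.columns.tolist())       # ['studio', 'count']
print(counts.loc[0, 'studio'])       # BV
print(int(counts.loc[0, 'count']))   # 3
```

value_counts sorts from the most to the least frequent value, so the first rows are the studios that appear most often.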

At first glance, BV (Buena Vista, Disney’s distribution arm) is the winner here, followed by Warner Bros. and Universal. However, it would be interesting to also see the box office sales of these 10 studios. We can extract that data with another for loop.

box_office = []
for a in data:
    # Pair each studio with its box office sale for that week
    for b, c in zip(a[1], a[2]):
        box_office.append((b, c))
# Convert box_office into a dataframe
box_office_df = pd.DataFrame(box_office, columns=['studio',
                                                  'box_office'])
# Calculate the total box office for each studio
box_office_sum = (box_office_df
                  .groupby('studio', as_index=False)['box_office']
                  .sum()
                  .sort_values(by='box_office', ascending=False))
# Again, get the first 10 studios and their box_office
box_office_sum_10 = box_office_sum[:10].reset_index(drop=True)

box_office_sum_10 should look like the following picture:

box_office_sum_10
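
The same flatten, sum and keep-the-largest steps can be written more compactly. The following is only a sketch on made-up tuples shaped like the entries of data; it uses a list comprehension and Series.nlargest in place of the nested loop and the sort-then-slice.

```python
import pandas as pd

# Hypothetical (week, studios, grosses) tuples, shaped like the entries of data
sample_data = [
    (1, ['BV', 'WB'], [100, 40]),
    (2, ['Uni.', 'BV'], [70, 50]),
]

# Flatten into (studio, gross) pairs
pairs = [(studio, gross)
         for _, studios, grosses in sample_data
         for studio, gross in zip(studios, grosses)]

df = pd.DataFrame(pairs, columns=['studio', 'box_office'])

# Sum per studio and keep the 2 largest totals
totals = (df.groupby('studio')['box_office']
            .sum()
            .nlargest(2)
            .reset_index())

print(totals['studio'].tolist())      # ['BV', 'Uni.']
print(totals['box_office'].tolist())  # [150, 70]
```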

As the picture above shows, BV (Disney)’s box office sales in 2018 so far are well ahead of the other studios in the top 10. Although Warner Bros. appears in the weekly top 10 more often, Universal’s total box office sales are higher. Now we can plot some graphs to analyse these studios more visually. To do this, it is easier to merge top_10_count and box_office_sum_10 first.

# Merge on 'studio' 
top10 = box_office_sum_10.merge(top_10_count, on='studio')

top10 should look something like this:

top10
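
One thing to keep in mind is that merge performs an inner join by default, so a studio that appears in only one of the two tables is dropped from the result. The sketch below, on made-up tables, shows this and uses how='outer' with indicator=True to list the studios that did not match.

```python
import pandas as pd

# Hypothetical tables (made-up numbers) with partly different studios
sums = pd.DataFrame({'studio': ['BV', 'Uni.', 'WB', 'Par.'],
                     'box_office': [150, 70, 60, 30]})
counts = pd.DataFrame({'studio': ['BV', 'WB', 'Uni.', 'Fox'],
                       'count': [3, 3, 2, 2]})

# Default inner join: only studios present in both tables survive
inner = sums.merge(counts, on='studio')
print(len(inner))  # 3

# Outer join with an indicator column to see which studios did not match
check = sums.merge(counts, on='studio', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'studio'].tolist()
print(sorted(unmatched))  # ['Fox', 'Par.']
```

If the merged table has fewer than 10 rows, this kind of check shows which studios are in one top 10 but not the other.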

Now we are ready to plot the graph!

# Set the plot grid to 2 x 1 and the overall figure size
fig, ax = plt.subplots(2, 1, figsize=(10, 8), dpi=100)
# The two hex colours used for the charts
mycolors = ['#A6192E', '#85714D']
box_sale = top10[['studio', 'box_office']].set_index('studio')
top_10counts = top10[['studio', 'count']].set_index('studio')
# Plot the bar charts
f1 = box_sale.plot(kind='bar', alpha=0.9, rot=0, color=mycolors[0],
                   ax=ax[0], legend=False, sharex=True)
f2 = top_10counts.plot(kind='bar', alpha=0.9, rot=0, color=mycolors[1],
                       ax=ax[1], legend=False, sharex=True)
# Set the titles and axis labels
f1.set(title="Box office analysis", xlabel='Studio',
       ylabel='Box office sale/$')
f2.set(title="Top 10 counts in 2018", xlabel='Studio',
       ylabel='Counts')
# Add the bar values as annotations
# Box office sale is converted into million $
for i in f1.patches:
    f1.annotate(str(round(i.get_height() / 1000000, 1)) + 'M',
                (i.get_x() + i.get_width() / 2., i.get_height()),
                ha='center', va='center',
                xytext=(0, 10),
                textcoords='offset points',
                color=mycolors[0])
# Set the ylim to 1.2 times the max box office sale value
f1.set_ylim(0, max(top10['box_office'] * 1.2))
for i in f2.patches:
    f2.annotate(str(i.get_height()),
                (i.get_x() + i.get_width() / 2., i.get_height()),
                ha='center', va='center',
                xytext=(0, 10),
                textcoords='offset points',
                color=mycolors[1])
# Set the ylim to 1.2 times the max count value
f2.set_ylim(0, max(top10['count'] * 1.2))
fig.tight_layout()

plt.show()

In the code snippet above, we first initialise the overall layout of the plot, which is 2 by 1, and then define two new dataframes with the data needed for plotting (the box office sales and the top 10 counts of the studios). We then use df.plot to draw the bar charts. Finally, we add the chart titles, axis labels and annotations. One note on the annotations: the box office values are converted into millions of dollars, which keeps the labels short and readable.
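
The label text for the annotations can also be built by two small helper functions, which keeps the plotting loops shorter. This is only a sketch; the helper names millions_label and count_label are hypothetical and not part of the original notebook.

```python
def millions_label(value):
    """Format a dollar amount as millions with one decimal, e.g. '2.5M'."""
    return f'{value / 1_000_000:.1f}M'

def count_label(value):
    """Format a bar height such as 39.0 as a whole number, e.g. '39'."""
    return str(int(round(value)))

print(millions_label(2_500_000))      # 2.5M
print(millions_label(1_234_567_890))  # 1234.6M
print(count_label(39.0))              # 39
```

Bar heights come back from matplotlib as floats, so count_label avoids labels such as 39.0 on the counts chart.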

Running the code above should produce a graph like the following:

From the graph above, we can clearly see that BV (Disney) crushed the other film studios in terms of box office sales. This is probably due to the massive success of Disney’s Marvel franchise, with films such as Avengers: Infinity War and Black Panther. It will be quite interesting to dig deeper and see which films made Disney so successful in 2018 so far. In the next part, we will focus on a more detailed visual exploration of BV (Disney)’s 2018 success, week by week.

Both Part 1 and Part 2 of this post can be found in Jupyter notebook format on my GitHub: https://github.com/situkun123/Moive_mojo_project
