The NeverEnding Story of Data Science

Patricio Contreras
Published in The Startup
8 min read · Dec 6, 2020

Have you ever been in a situation where you thought you knew something pretty well and then realised you were barely scratching the surface? Or perhaps you had already formed an opinion and taken a stance on an issue, only to be proven wrong later on? While it’s completely understandable to feel frustrated, full of doubt, and perhaps downright hopeless, these events should remind us to keep diving deeper, keep learning, and keep an open mind. It’s a common experience for many data scientists in today’s world.

Background

For my Phase 1 Data Science project at the Flatiron School, I was tasked with analysing movie data provided by IMDb, Box Office Mojo, and The Movie Database (TMDb) to guide the brand new “Microsoft Movie Studio” into the film industry. Yes, yes, yes! Movies! Finally, an area that I’m really passionate about. Rather than just guesstimating profits and other interesting features, the analysis I perform will let me provide comments backed up by real data! I also feel pretty confident in the analysis and the results I’ll find, since it’s a topic I know a lot about. Once I finish the Zoom call with my instructor, I immediately open my Jupyter Notebook and start reading in the data.

The Data

After reviewing all the datasets provided to us for this project, I end up choosing one IMDb dataset, one Box Office Mojo dataset, and two TMDb datasets:

import pandas as pd
# reading in IMDb, Box Office Mojo, and TMDb csvs
df_imdb_title_basics = pd.read_csv("zippedData/imdb.title.basics.csv.gz")
df_bom_movie_gross = pd.read_csv("zippedData/bom.movie_gross.csv.gz")
df_tn_movie_budgets = pd.read_csv("zippedData/tn.movie_budgets.csv.gz")
df_tmdb_movies = pd.read_csv("zippedData/tmdb.movies.csv.gz")
df_imdb_title_basics.head()
146,144 rows, 6 columns
df_bom_movie_gross.head()
3,387 rows, 5 columns
df_tn_movie_budgets.head()
5,782 rows, 6 columns
df_tmdb_movies.head()
26,517 rows, 10 columns

The data frames I read in all have the information I’d expect (film title, genre, release date, runtime, budget, etc.). Other than popularity, all the columns are pretty self-explanatory and intuitive.

Research Questions

In order to help “Microsoft Movie Studio” get into the film industry, here are some of the questions I decided to tackle:

  1. What are the most successful movie genres in terms of profit made?
  2. What are the top 5 movie studios Microsoft could partner up with?
  3. Which non-English language yields the highest film ratings?

I immediately start thinking:

  1. “Action, action, adventure, sci-fi, action (Avengers Endgame anyone?)”
  2. “disney, Disney, Disney, DISNEY!”
  3. “Spanish? Although Parasite won last year sooo… ¯\_(ツ)_/¯”

What are the Most Successful Movie Genres in Terms of Profit Made?

As soon as I get to this part of the project, I realise:

The genres column is not so clear cut. Of course, many movies have more than 1 genre whereas others only have 1. I start wondering, “should I only take the first (primary) genre into account and ignore the rest? Should I include all 3 genres in my calculation? What movie genre does a film like The Matrix fall under? Only action? Action and sci-fi? How will this grouping affect my results?” Ugh, so many unknowns.

A case could be made that if we only focus on the primary genre, we “lose out” on a lot of accuracy and detail. However, another case could be made that if we go deep into the specifics and group by all 3 (or more) genres, the data gets split into many small groups and the results become skewed and hard to interpret.
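To make the trade-off concrete, here is a minimal sketch on made-up data (the titles, genres, and profit figures are invented, and the comma-separated genres column is an assumption about how the raw data is laid out). It compares keeping only the first listed genre against counting a film under every genre it lists:

```python
import pandas as pd

# Toy data, invented for illustration only.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Action,Sci-Fi", "Action,Adventure,Sci-Fi", "Drama", "Adventure,Drama"],
    "profit": [400, 900, 50, 150],
})

# Option 1: keep only the first listed genre of each film.
films["primary_genre"] = films["genres"].str.split(",").str[0]
profit_primary = films.groupby("primary_genre")["profit"].sum()

# Option 2: count each film once under every genre it lists.
exploded = films.assign(genre=films["genres"].str.split(",")).explode("genre")
profit_all = exploded.groupby("genre")["profit"].sum()

print(profit_primary.sort_values(ascending=False))
print(profit_all.sort_values(ascending=False))
```

In this toy example Sci-Fi vanishes entirely under option 1 but ties with Action under option 2. Option 2 also counts the same profit several times, so its totals add up to more than the films actually made. Neither grouping is “right”; they simply answer slightly different questions.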

Ultimately, I opt for 2 genres and calculate the total profit per pairing:

import matplotlib.pyplot as plt
import seaborn as sns
# df_imdb_tmdb is a merged data frame whose construction isn't shown in this post
# grouping by primary AND secondary genre and calculating total profit
df_genres = df_imdb_tmdb.groupby(["primary_genre", "secondary_genre"])[["profit"]].sum()
# sort by total profit and store top 5
df_genres = df_genres.sort_values("profit", ascending = False).head()
# plot horizontal bar graph of total profit per genres
plt.figure(figsize = (15,8))
sns.barplot(x = df_genres["profit"]/1e9, y = df_genres.index,
            color = "b")
plt.title("Top 5 Movie Genres By Total Profit Made", size = 15);
plt.xlabel("Total Profit Made (Billions)", size = 15);
plt.tick_params(labelsize = 14);
plt.ylabel("Movie Genres", size = 15);

I finish this section by stating that “action” and “adventure” are the most profitable genres. However, as I write that, a bunch of other questions start popping into my head. “What if I only looked at the primary genre? Is the sum of profits really the best statistic to use here? What if action and adventure are being skewed by films with crazy high profits?” A little disconcerting, but I carry on to my next research question.
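One way to probe the “skewed by crazy high profits” worry is to put the sum next to the median and the film count for each genre. The sketch below uses invented numbers (not the project data) with a single runaway hit in one genre:

```python
import pandas as pd

# Toy data, invented for illustration: "Action" has one runaway hit.
films = pd.DataFrame({
    "primary_genre": ["Action", "Action", "Action", "Drama", "Drama", "Drama"],
    "profit": [1000, 10, 20, 200, 250, 300],
})

# Total, typical (median), and number of films per genre, side by side.
summary = films.groupby("primary_genre")["profit"].agg(["sum", "median", "count"])
print(summary)
```

Here Action wins on total profit while Drama wins on median profit, so the choice of statistic alone can flip the ranking. Whether that happens in the real data is exactly the kind of question left open above.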

What are the top 5 movie studios Microsoft could partner up with?

“Ok, this one’s going to be pretty straightforward, right? With the massive success they’ve had in recent years, I think we all know Disney will take the throne on this one.”

# grouping by studio and computing median domestic_gross
df_studios = df_imdb_tmdb_bom.groupby("studio")[["domestic_gross"]].median()
# renaming column for easier interpretation
df_studios.rename(columns = {"domestic_gross": "Median Domestic Gross"}, inplace = True)
# sorting values by median domestic gross
df_studios = df_studios.sort_values("Median Domestic Gross",
                                    ascending = False)
df_top_studios = df_studios.head()
# create a temp data frame from imdb_tmdb_bom that only has the top movie studios per median domestic box office
df_top = df_imdb_tmdb_bom[df_imdb_tmdb_bom["studio"].isin(df_top_studios.index)]
plt.figure(figsize = (15,8))
# order boxplots by median
g = sns.boxplot(x = df_top["domestic_gross"]/1e6, y = "studio",
                data = df_top, order = df_top_studios.index)
plt.title("Top 5 Movie Studios by Domestic Box Office", size = 15);
plt.xlabel("Domestic Gross Box Office Revenue (in Millions)", size = 15);
plt.tick_params(labelsize = 14);
plt.ylabel("Movie Studio", size = 15)
g.set_yticklabels(["Dreamworks Pictures", "Walt Disney Studios",
                   "MGM", "Sony", "Paramount"]);

Whoa! Huh? Dreamworks? I did not expect that. Bewildered by the results, I ultimately state what’s shown in the plot: Dreamworks, then Disney, are the movie studios to partner up with (in terms of domestic gross box office revenue). However, unlike the previous question, this one’s not very clear-cut. How are Disney’s outliers affecting the distribution? Even though Disney comes in second, its distribution is the most widespread. Does this mean that Disney performs better than MGM, Sony, or Paramount? What if Dreamworks and Disney are the top 2 just because of how much data there is on them? So many questions start rolling into my head and I feel the list of unknowns keeps getting bigger and bigger.
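A quick way to check whether the amount of data per studio is driving the ranking is to report the number of films next to each median. This is only a sketch on invented numbers: the studio names are placeholders and MIN_FILMS is a hypothetical cutoff, not a value used in the project.

```python
import pandas as pd

# Toy data, invented for illustration; studio names are placeholders.
films = pd.DataFrame({
    "studio": ["Studio A", "Studio B", "Studio B", "Studio B", "Studio B",
               "Studio C", "Studio C", "Studio C"],
    "domestic_gross": [300, 100, 150, 200, 250, 50, 60, 70],
})

# Film count and median gross per studio, side by side.
per_studio = films.groupby("studio")["domestic_gross"].agg(
    n_films="count", median_gross="median"
)

MIN_FILMS = 3  # hypothetical cutoff, chosen only for this example

ranked_all = per_studio.sort_values("median_gross", ascending=False)
ranked_filtered = per_studio[per_studio["n_films"] >= MIN_FILMS].sort_values(
    "median_gross", ascending=False
)

print(ranked_all)
print(ranked_filtered)
```

In this toy example a studio with a single film tops the unfiltered ranking and drops out once a minimum number of films is required. With medians, a surprising leader is often a studio with very few films rather than one with lots of data, so the counts are worth a look before drawing conclusions.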

Which non-English language yields the highest film ratings?

# output top 10 films with lowest vote_count
df_tmdb_movies.sort_values("vote_count")[:10]
Top 10 films with the lowest vote_count

As soon as I get to this question, I realise that many films have a really high vote_average but only 1 person voted! If we’re planning to group by language and calculate anything on vote_average, this can seriously skew our results. I take a simple, rather “brute-force” approach and set the cutoff for vote_count at 122.

# only films with vote_count >= 122
df_temp = df_tmdb_movies[df_tmdb_movies["vote_count"] >= 122]
# output top 12 languages with lowest freq
df_temp["original_language"].value_counts().tail(12)
Top 12 languages with the lowest frequencies

Once again, we face another problem in this section. As seen in the output above, many languages have only 1 film in the data frame. If we’re making any calculation on vote_average by original_language, languages with several films will have more reliable statistics than languages with only 1 film! Therefore, I take yet another “brute-force” approach and only include languages that show up at least 20 times in the data frame. This way all languages have enough films to even each other out when calculating key statistics:

# calculating how many times each language shows up in data frame
counts = df_temp["original_language"].value_counts()
# filtering df_temp so it only has the languages that show up >= 20
df_temp = df_temp[~df_temp["original_language"].isin(counts[counts < 20].index)]
# grouping by original_language and taking the median vote_average per lang
df_lang = df_temp.groupby("original_language")[["vote_average"]].median()
# rename column for easier interpretability
df_lang.rename(columns = {"vote_average": "Median Rating"}, inplace = True)
# sort the languages by Median Rating in descending order (output top 5)
df_lang = df_lang.sort_values("Median Rating", ascending = False)[:5]
plt.figure(figsize = (15,8))
# order in descending order by the median movie rating
l = sns.boxplot(x = "vote_average", y = "original_language",
                data = df_temp, order = df_lang.index)
plt.title("Distribution of Film Rating by Non-English Language",
          size = 15);
plt.xlabel("Film Rating (1-10)", size = 15);
plt.tick_params(labelsize = 14);
plt.ylabel("Language", size = 15);
# set these yticklabels instead of 'sv', 'ja', etc.
l.set_yticklabels(["Swedish", "Japanese",
                   "Spanish", "Italian", "German"]);
Distributions of Film Rating by Language

As opposed to the previous sections, this was the only question where I didn’t really have a clear hypothesis. I see “Swedish” and “Japanese” in the top 2, but are they really that different from the rest? Their distributions overlap and the Japanese distribution is extremely widespread. The data filtering to produce the plot above involved some serious “hard-coding”. How do I know for sure those values are acceptable? I begrudgingly report that Swedish and Japanese films rate higher than films in other languages, but I’m left thinking.
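One hedged way to test the hard-coded cutoffs is a small sensitivity check: wrap the filtering in a function and see whether the ranking survives different values. The helper name top_languages and the toy numbers below are made up for illustration; only the column names come from the TMDb data frame used above.

```python
import pandas as pd

def top_languages(df, min_votes, min_films, n=5):
    """Rank languages by median vote_average after applying both cutoffs."""
    kept = df[df["vote_count"] >= min_votes]
    counts = kept["original_language"].value_counts()
    kept = kept[kept["original_language"].isin(counts[counts >= min_films].index)]
    medians = kept.groupby("original_language")["vote_average"].median()
    return medians.sort_values(ascending=False).head(n).index.tolist()

# Toy stand-in for df_tmdb_movies, invented for illustration.
toy = pd.DataFrame({
    "original_language": ["sv", "sv", "sv", "ja", "ja", "ja", "es", "es"],
    "vote_count": [500, 400, 5, 300, 200, 150, 1, 250],
    "vote_average": [7.0, 7.2, 9.9, 6.8, 7.0, 6.9, 10.0, 6.5],
})

# Try a few cutoff combinations and compare the resulting rankings.
for min_votes in (1, 100):
    for min_films in (1, 2):
        print(min_votes, min_films, top_languages(toy, min_votes, min_films))
```

On the toy data the top language changes as soon as the single-vote film is filtered out. If the real ranking stays roughly the same across a range of reasonable cutoffs, the choice of 122 and 20 matters less; if it flips, the result deserves a caveat.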

Tying it All Back Together

What was the point in going through my experience with this project? How does it all tie back to the beginning? Even though I left most of these sections with a sweet and sour feeling, I love that there’s more room for exploration and analysis! Data science will often leave you with a never-ending list of unknowns and questions, but that means there’s room for learning more about a subject that you’re passionate about and for growth. Rarely will someone finish a question, project, or assignment and think, “I have nothing else to do. There are no more questions to ponder or things I could’ve done differently.” There’ll always be room for improvement in the world of data science!

Conclusion

I believe people studying or working in data science have to be comfortable with knowing that there are plenty of ways to address a problem. It may seem daunting and at times stressful to think that there’s a never-ending list of possibilities and unknowns, and that perhaps we’re barely scratching the surface. But there’s also a certain beauty in knowing that there’s no clear end or stop when it comes to analysing data. This becomes even more satisfying when you’re researching a topic you’re passionate about. It’s about getting our hands dirty, diving deeper, keeping an open mind, and accepting that data science is a never-ending story of questions, doubts, and most importantly, answers.
