Do you have to make a good movie to make money?

5 min read · Jun 1, 2020

Image: Jim Carrey as the Mask holding lots of money and looking excited

“We don’t make movies to make money, we make money to make more movies” (Walt Disney)

Motivation: My current portfolio has no media-related projects, and I’m interested in data science applied to media (movies, music, and so on). I wanted to get familiar with pulling data from a public API, and I wanted to start a basic examination of movie data, ideally ending up with a dataset that is useful for further work.

Getting the Data:

I found a few resources about analyzing movie data that mostly discuss Netflix and IMDb. Netflix didn’t answer my requests for data and IMDb has walled off much of its data.

Hello TMDB! The Movie Database (TMDB) has an excellent API for movie info.

Timeliness is always of interest, but the COVID-19 pandemic has probably changed viewership levels, and I imagine that has affected ratings and revenue.

Here’s how I got the data:

Got an API key from TMDB (themoviedb.org)
Adapted code to download the data
Ran the module and combined the data files (.csv)
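The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code I adapted: it assumes the TMDB v3 movie details endpoint (`/3/movie/{movie_id}` with an `api_key` query parameter), and the names `movie_url`, `fetch_movie` and `combine_csv` are hypothetical helpers made up for this example.

```python
import csv
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.themoviedb.org/3"  # TMDB v3 API root (assumed version)


def movie_url(movie_id, api_key):
    """Build the details URL for one movie."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{API_ROOT}/movie/{movie_id}?{query}"


def fetch_movie(movie_id, api_key, opener=urllib.request.urlopen):
    """Download one movie record as a dict.

    `opener` is injectable so the function can be exercised offline.
    """
    with opener(movie_url(movie_id, api_key), timeout=10) as response:
        return json.load(response)


def combine_csv(paths, out_path):
    """Concatenate CSV files that share a header; return the number of rows written."""
    writer = None
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if writer is None:
                    # The first file decides the column order of the output.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows += 1
    return rows
```

A call such as `fetch_movie(some_id, "YOUR_API_KEY")` would hit the network, so nothing runs on import; `combine_csv(sorted(glob.glob("data/*.csv")), "movies.csv")` is one way to merge the per-run files, with the paths here being placeholders.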

This took a while to build, and there is some room for improvement here. Kaggle boasted 4,800 data points; my data is… larger. I have 134,744 data points, which I think is valuable, but those extra data points were bound to add cleaning work (spoiler: they did).

I need to revisit the retrieval method so that I can evaluate more recent data: my current data starts in 1888 but stops at 2018
My program died because my internet connection dropped, but I figured having about 28x the data Kaggle provided was a fine way to start
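One possible improvement for the crash described above is to retry failed requests and skip movies that are already saved. The sketch below is an assumption about how that could look, not what my module does; `fetch_with_retry` and `download_all` are hypothetical names, and `fetch` stands for any function that takes a movie id and returns its record.

```python
import time


def fetch_with_retry(fetch, movie_id, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(movie_id), retrying on network errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(movie_id)
        except OSError:  # urllib's URLError and socket timeouts are OSError subclasses
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


def download_all(ids, fetch, done=None, **retry_kwargs):
    """Fetch every id not already in `done`, so an interrupted run can resume."""
    done = {} if done is None else done
    for movie_id in ids:
        if movie_id in done:
            continue  # saved by an earlier run
        done[movie_id] = fetch_with_retry(fetch, movie_id, **retry_kwargs)
    return done
```

Persisting `done` to disk between runs (for example as one CSV per batch) is what would make the resume part useful in practice.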

Viewing and Cleaning the Data

In this section, I clean the data. You can skip to the next section, Posing the Questions, if you aren’t interested in the data engineering aspects of this. This took most of my effort.

After importing the data, I saw the data was pretty dirty; I was surprised the data was stored at TMDB like that, but I got down to business.

Initial shape of data: 134,744 observations x 20 factors

1. Drop Certain Observations With Nulls:

vote count, vote average, revenue: 106,267 observations dropped (28,477 left)
release date: 111 observations dropped (28,366 left)
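In pandas, this first step amounts to two `dropna` calls with a `subset`. A small sketch on made-up rows; the column names (`vote_count`, `vote_average`, `revenue`, `release_date`) are assumptions about how the CSV columns are spelled.

```python
import numpy as np
import pandas as pd

# Four made-up movies; NaN/None mark missing values.
movies = pd.DataFrame({
    "vote_count": [10, np.nan, 5, 3],
    "vote_average": [6.5, 7.0, np.nan, 5.0],
    "revenue": [1e6, 2e6, 3e6, 4e6],
    "release_date": ["2001-05-01", "1999-01-01", "2010-07-04", None],
})

# Drop rows missing any of the three numeric fields, then rows missing a date.
required = ["vote_count", "vote_average", "revenue"]
step1 = movies.dropna(subset=required)
step2 = step1.dropna(subset=["release_date"])

print(len(movies), len(step1), len(step2))
```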

2. Drop Certain Columns:

id: no useful information
homepage: 86% null

3. Identify and Indicate Top-10 most commonly observed:

genres
keywords
production companies
production countries
spoken languages

4. Indicate:

production company in top-10
production country in top-10

5. Measure Proportion of:

Number of genres in top-10
Number of keywords in top-10
number of spoken languages in top-10
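Steps 3 to 5 can be sketched with `collections.Counter`. This assumes each movie’s genres (or keywords, companies, and so on) have already been parsed into a list of strings, and it reads “proportion” as the share of a movie’s values that fall in the top-10; `top_n`, `in_top` and `share_in_top` are hypothetical helper names.

```python
from collections import Counter


def top_n(values_per_movie, n=10):
    """Return the n most frequently observed values across all movies."""
    counts = Counter(v for values in values_per_movie for v in values)
    return {value for value, _ in counts.most_common(n)}


def in_top(values, top):
    """1 if any of the movie's values is in the top set, else 0."""
    return int(any(v in top for v in values))


def share_in_top(values, top):
    """Fraction of the movie's values that are in the top set (0.0 if it has none)."""
    return sum(v in top for v in values) / len(values) if values else 0.0


genres = [["Drama", "Comedy"], ["Drama"], ["Horror"], []]
top = top_n(genres, n=1)  # n=1 only to keep the toy example readable
print([in_top(g, top) for g in genres])
print([share_in_top(g, top) for g in genres])
```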

6. Address Overview:

Create column indicating the number of words in overview

7. Drop Columns:

genres
keywords
production companies
production countries
spoken languages
overview

8. Handle Categoricals:

original language: one-hot-encoding
original title: create column of length count and drop original
overview: create column of length count and drop original
status: one-hot-encoding
tagline: create column of length count and drop original
title: create column of length count and drop original
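The categorical handling above maps naturally onto pandas string methods and `pd.get_dummies`. A sketch on three made-up rows; the column names and the choice to count missing text as zero are assumptions.

```python
import pandas as pd

movies = pd.DataFrame({
    "original_language": ["en", "fr", "en"],
    "status": ["Released", "Released", "Rumored"],
    "title": ["The Mask", "Amelie", "Untitled"],
    "overview": ["A bank clerk finds a magic mask.", None, "TBD"],
})

# Text columns become counts; missing text counts as zero.
movies["title_length"] = movies["title"].fillna("").str.len()
movies["overview_words"] = movies["overview"].fillna("").str.split().str.len()
movies = movies.drop(columns=["title", "overview"])

# One indicator column per observed language and status.
encoded = pd.get_dummies(movies, columns=["original_language", "status"])
print(encoded.columns.tolist())
```

One-hot encoding is where most of the added factors come from: every distinct original language becomes its own column.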

9. Impute missing values with KNN

10. Scale Values

11. Drop Missing release date
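Steps 9 and 10 could look roughly like this with scikit-learn. The use of `KNNImputer` and `StandardScaler`, and the value `n_neighbors=2`, are assumptions for illustration; the toy matrix stands in for the numeric movie features.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value in the second column.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
])

# Fill each gap from the nearest rows, then standardize every column.
prep = make_pipeline(KNNImputer(n_neighbors=2), StandardScaler())
X_ready = prep.fit_transform(X)

print(np.isnan(X_ready).any())
```

Wrapping both steps in one pipeline makes it easy to fit them on the training split only and reuse the fitted transformers on the test split.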

FINAL Shape After Transforming: 28,218 observations by 92 factors

Started with: 134,744 observations by 20 factors
Lost: 106,526 observations
Added: 72 factors

Posing the Questions

I wanted to investigate:

1. What’s the relationship between:

A movie’s rating and its revenue
Age and ratings

2. Can I predict revenue:

output:

Linear Regression Train R2: 0.76354213745444
Ridge Train R2: 0.6544291213406286
Linear Regression Test R2: -3.154566238553024e+21
Ridge Test R2: 0.5573
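Numbers of this shape come from fitting on a training split and scoring on both splits. The sketch below shows that comparison on synthetic data (not the TMDB data), so its scores will not match the output above; the `alpha=1.0` and the 75/25 split are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 5 features, known coefficients, some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for name, model in [("Linear Regression", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    # .score() returns R2 for scikit-learn regressors.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name} Train R2: {scores[name][0]:.4f}  Test R2: {scores[name][1]:.4f}")
```

On clean data like this the two models score almost identically. A large gap between them, as in my output, usually points to unstable coefficients, for example from many nearly redundant one-hot columns, which Ridge’s penalty keeps in check.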

2.a. Can I predict if revenue is higher than the average for the year:

output:

LogisticRegression
==================
	Train Accuracy: 0.8547
	Test Accuracy: 0.849
	Train F1: 0.5841
	Test F1: 0.5684
LogisticRegressionCV
====================
	Train Accuracy: 0.8547
	Test Accuracy: 0.849
	Train F1: 0.5841
	Test F1: 0.5684
AdaBoostClassifier
==================
	Train Accuracy: 0.9414
	Test Accuracy: 0.9382
	Train F1: 0.7419
	Test F1: 0.7263
RandomForestClassifier
======================
	Train Accuracy: 1.0
	Test Accuracy: 0.9424
	Train F1: 1.0
	Test F1: 0.7381
GradientBoostingClassifier
==========================
	Train Accuracy: 0.9478
	Test Accuracy: 0.9389
	Train F1: 0.7674
	Test F1: 0.7298
KNeighborsClassifier
====================
	Train Accuracy: 0.9414
	Test Accuracy: 0.9215
	Train F1: 0.7383
	Test F1: 0.6531
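The target for question 2.a and the two metrics reported above can be sketched as follows. The column names `year` and `revenue` are assumptions, and the data is made up. The second half scores a baseline that always predicts “not above average”, which shows why accuracy and F1 can tell different stories when one class is rarer.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

movies = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "revenue": [10.0, 20.0, 60.0, 5.0, 15.0],
})

# Label each movie 1 if its revenue beats the mean revenue of its release year.
yearly_mean = movies.groupby("year")["revenue"].transform("mean")
movies["above_year_avg"] = (movies["revenue"] > yearly_mean).astype(int)

# Baseline: predict 0 for every movie.
y_true = movies["above_year_avg"]
y_pred = [0] * len(movies)
baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)
print(baseline_accuracy, baseline_f1)
```

Because a few very high earners pull the yearly mean up, fewer movies land above it than below it, so a lazy classifier can score well on accuracy while its F1 stays at zero.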

3. Can I predict rating:

output:

Linear Regression Train R2: 0.20249096812305722
Ridge Train R2: 0.15252773577526613
Linear Regression Test R2: -2.1949919056917983e+25
Ridge Test R2: 0.1507

Analyzing the Results

1. What’s the relationship between:

Movie’s rating and revenue: movies with above-average revenue for their year cluster around an average rating of 5 to 7
Age and ratings: the number of ratings decreases with the age of the film (i.e. the more recent a movie, the more ratings it’s likely to have)

2. Can I predict revenue:

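Not very well, judging by the linear models: plain linear regression fits the training data (R2 of about 0.76) but collapses on the test set (a hugely negative R2), while Ridge holds up better with a test R2 of about 0.56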

2a. Can I predict if revenue is higher than the average for the year:

Better than predicting revenue itself, certainly
Some algorithms perform better than others, but I haven’t tuned them or really accounted for overfitting
Overfitting seems likely for the Random Forest Classifier (its train accuracy and F1 are both 1.0), but the AdaBoost and Gradient Boosting classifiers score similarly on train and test, so they do alright on the bias-variance tradeoff

3. Can I predict rating:

Probably not!
These values are low. Training R2 for a linear regression can only stay the same or rise as factors are added, so a training R2 of about 0.20 with this many factors is surprisingly bad
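For reference, this is the standard definition of R2, where the y-hat terms are the model’s predictions and y-bar is the mean of the observed values in the data being scored:

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

On the training data, a least-squares fit with an intercept never does worse than predicting the mean, so R2 stays between 0 and 1. On the test data there is no such guarantee: if the predictions are far off, the numerator can be much larger than the denominator, and R2 becomes negative without any lower bound. That is how the test scores for plain linear regression end up so extreme.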

Written by Christopher Daigle