Do you have to make a good movie to make money?

Christopher Daigle
Jun 1, 2020

Jim Carrey as the Mask holding lots of money and looking excited
“We don’t make movies to make money, we make money to make more movies” — Walt Disney

Motivation: My portfolio currently has no media-related projects, and I’m interested in data science applied to media (movies, music, etc.). I wanted to get familiar with pulling data from a public API, and I wanted to start a basic examination of movie data that I can hopefully reuse in further work.

Getting the Data:

I found a few resources about analyzing movie data that mostly discuss Netflix and IMDB. Netflix didn’t answer my requests for data and IMDB has walled off much of their data.

Hello TMDB! The Movie Database has an excellent API for movie info.

Recent data would be especially interesting: the COVID-19 pandemic has probably changed viewership levels, and I imagine that has affected ratings and revenue.

Here’s how I got the data:

  1. Got an API key from TMDB.com
  2. Adapted code to download data
  3. Ran the module and combined the data files (.csv)
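
The three steps above can be sketched roughly as follows. This is a minimal sketch, not the code used for this post: it assumes TMDB's v3 movie-details endpoint with an `api_key` query parameter, and the helper names `build_movie_url` and `combine_csv_files` are made up for illustration. The network call itself is left out so the sketch runs offline.

```python
import csv
from urllib.parse import urlencode

# Assumption: TMDB's v3 movie-details endpoint; check the current API docs.
TMDB_BASE = "https://api.themoviedb.org/3/movie"


def build_movie_url(movie_id, api_key):
    """Return the request URL for one movie (hypothetical helper)."""
    return f"{TMDB_BASE}/{movie_id}?{urlencode({'api_key': api_key})}"


def combine_csv_files(paths, out_path):
    """Concatenate CSV files that share a header into one file.

    Returns the number of data rows written.
    """
    rows_written = 0
    writer = None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    # The first non-empty file decides the output header.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Writing each batch of downloads to its own file and combining afterwards means a crash only costs the batch in progress.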

This took a while to build, and there is some room for improvement, but while Kaggle boasted 4,800 data points, my data is… larger. I have 134,744 data points, and I think that’s valuable, but those extra data points will definitely add more cleaning work (spoiler: they did).

  • I need to revisit the retrieval method so that I can evaluate more recent data; my current data starts in 1888 but stops at 2018
  • My program died when my internet connection dropped, but I figured having about 28x the data Kaggle provided was a fine way to start
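
One way to make the retrieval survive a dropped connection is to retry failed requests and to remember which ids are already done. The sketch below assumes the download loops over numeric movie ids, which this post does not state. `fetch` stands in for whatever function performs the real request, and both function names are hypothetical.

```python
import time


def fetch_with_retries(fetch, movie_id, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(movie_id), retrying network errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(movie_id)
        except OSError:
            # Built-in ConnectionError and TimeoutError, and urllib's URLError,
            # are all OSError subclasses.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


def download_range(fetch, start_id, end_id, done_ids):
    """Fetch ids in [start_id, end_id], skipping ids already in done_ids."""
    results = {}
    for movie_id in range(start_id, end_id + 1):
        if movie_id in done_ids:
            continue  # already downloaded in an earlier run
        results[movie_id] = fetch_with_retries(fetch, movie_id)
        done_ids.add(movie_id)
    return results
```

Saving `done_ids` to disk between runs would let a restarted job pick up where it stopped instead of starting over.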

Viewing and Cleaning the Data

In this section, I clean the data. If you aren’t interested in the data engineering side of this, you can skip to the next section, Posing the Questions. Cleaning took most of my effort.

After importing the data, I saw that it was pretty dirty. I was surprised TMDB stored the data like that, but I got down to business.

Initial shape of data: 134,744 observations x 20 factors

1. Drop Certain Observations With Nulls:

  • vote count, vote average, revenue: 106,267 observations dropped (28,477 left)
  • release date: 111 observations dropped (28,366 left)

2. Drop Certain Columns:

  • id: no useful information
  • homepage: 86% null

3. Identify and Indicate Top-10 most commonly observed:

  • genres
  • keywords
  • production companies
  • production countries
  • spoken languages

4. Indicate:

  • production company in top-10
  • production country in top-10

5. Measure Proportion of:

  • Number of genres in top-10
  • Number of keywords in top-10
  • Number of spoken languages in top-10

6. Address Overview:

  • Create column indicating number of words in overview

7. Drop Columns:

  • genres
  • keywords
  • production companies
  • production countries
  • spoken languages
  • overview

8. Handle Categoricals:

  • original language: one-hot-encoding
  • original title: create column of length count and drop original
  • overview: create column of length count and drop original
  • status: one-hot-encoding
  • tagline: create column of length count and drop original
  • title: create column of length count and drop original

9. Impute missing values with KNN

10. Scale Values

11. Drop Missing release date
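
Steps 9 and 10 can be illustrated with a plain-Python sketch of the idea. This is not the code used for this post; in practice a library such as scikit-learn's `KNNImputer` and `StandardScaler` would be the usual choice. The function names, the distance rule, and the default `k` are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(rows):
                if other_idx == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    # Root-mean-square difference over the columns both rows have.
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            neighbours = [v for _, v in candidates[:k]]
            if neighbours:
                filled[i][j] = mean(neighbours)
    return filled


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    scaled_columns = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        # A constant column has sigma == 0; map it to zeros instead of dividing.
        scaled_columns.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    return [list(r) for r in zip(*scaled_columns)]
```

Imputing before scaling matches the order of the steps above, but it means distances are dominated by large-valued columns such as revenue. Scaling first is a common alternative.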

FINAL Shape After Transforming: 28,218 observations by 92 factors

  • Started with: 134,744 observations by 20 factors
  • Lost: 106,526 observations
  • Added: 72 factors
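
Many of the added factors come from steps 3 to 5. Here is a minimal sketch of one reading of those steps, assuming each movie's genres (or keywords, companies, and so on) have already been parsed into a list of strings. The function names are hypothetical, and the toy example uses a top-2 instead of a top-10 to stay small.

```python
from collections import Counter


def top_n_values(lists_of_values, n=10):
    """Return the set of the n most frequently observed values across all movies."""
    counts = Counter(v for values in lists_of_values for v in values)
    return {value for value, _ in counts.most_common(n)}


def has_top_value(values, top):
    """1 if any of a movie's values is in the top set, else 0."""
    return int(any(v in top for v in values))


def proportion_in_top(values, top):
    """Share of a movie's values that are in the top set (0.0 when it has none)."""
    if not values:
        return 0.0
    return sum(v in top for v in values) / len(values)


# Toy data: four movies and their genres.
genres = [["Drama", "Comedy"], ["Drama"], ["Horror", "Drama", "Comedy"], ["Western"]]
top = top_n_values(genres, n=2)  # {"Drama", "Comedy"}
indicators = [has_top_value(g, top) for g in genres]
proportions = [proportion_in_top(g, top) for g in genres]
```

Each list-valued column collapses into one or two numeric columns this way, which is why the original columns can be dropped in step 7.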

Posing the Questions

I wanted to investigate:

1. What’s the relationship between:

  • Movie’s rating and revenue
  • Age and ratings

2. Can I predict revenue:

output:

Linear Regression Train R2: 0.76354213745444
Ridge Train R2: 0.6544291213406286
Linear Regression Test R2: -3.154566238553024e+21
Ridge Test R2: 0.5573

2.a. Can I predict if revenue is higher than the average for the year:

output:

LogisticRegression
==================
Train Accuracy: 0.8547
Test Accuracy: 0.849
Train F1: 0.5841
Test F1: 0.5684
LogisticRegressionCV
====================
Train Accuracy: 0.8547
Test Accuracy: 0.849
Train F1: 0.5841
Test F1: 0.5684
AdaBoostClassifier
==================
Train Accuracy: 0.9414
Test Accuracy: 0.9382
Train F1: 0.7419
Test F1: 0.7263
RandomForestClassifier
======================
Train Accuracy: 1.0
Test Accuracy: 0.9424
Train F1: 1.0
Test F1: 0.7381
GradientBoostingClassifier
==========================
Train Accuracy: 0.9478
Test Accuracy: 0.9389
Train F1: 0.7674
Test F1: 0.7298
KNeighborsClassifier
====================
Train Accuracy: 0.9414
Test Accuracy: 0.9215
Train F1: 0.7383
Test F1: 0.6531

3. Can I predict rating:

output:

Linear Regression Train R2: 0.20249096812305722
Ridge Train R2: 0.15252773577526613
Linear Regression Test R2: -2.1949919056917983e+25
Ridge Test R2: 0.1507
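
The target in question 2.a, and the reason accuracy and F1 can tell different stories, can be sketched on toy data. This is not the code used for this post. The helper names are hypothetical, and the class balance of the real data is not reported here, so the numbers below only illustrate the mechanism.

```python
from collections import defaultdict
from statistics import mean


def above_year_average(movies):
    """Label each (year, revenue) pair 1 if revenue exceeds that year's mean."""
    by_year = defaultdict(list)
    for year, revenue in movies:
        by_year[year].append(revenue)
    year_mean = {year: mean(revs) for year, revs in by_year.items()}
    return [int(revenue > year_mean[year]) for year, revenue in movies]


def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# One blockbuster pulls the 2000 average far above the typical movie.
movies = [(2000, 1), (2000, 2), (2000, 3), (2000, 94), (2001, 5), (2001, 5)]
labels = above_year_average(movies)
always_below = [0] * len(labels)  # baseline: never predict "above average"
baseline_accuracy = accuracy(labels, always_below)
baseline_f1 = f1(labels, always_below)
```

When a few hits pull the yearly mean up, most movies fall below it, so a model that never predicts "above average" can still score high accuracy while its F1 is zero. That is why F1 is reported next to accuracy for every classifier above.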

Analyzing the Results

1. What’s the relationship between:

  • Movie’s rating and revenue: movies with revenue above the average for their year are most common in the 5 to 7 range of average rating
  • Age and ratings: the number of ratings decreases with a film’s age (i.e. the more recent a movie, the more ratings it’s likely to have)

2. Can I predict revenue:

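  • Not very well with plain linear regression: its train R2 is about 0.76, but its test R2 is hugely negative (about -3.2e+21), so it does not generalize at all. Ridge holds up better, with a test R2 of 0.5573 against a train R2 of about 0.65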

2a. Can I predict if revenue is higher than the average for the year:

  • Better than predicting revenue itself, certainly
  • Some algorithms perform better than others, but I haven’t tuned them or really accounted for overfitting
  • Overfitting seems likely for the Random Forest Classifier (train accuracy 1.0 vs. test accuracy 0.9424), but the AdaBoost Classifier and Gradient Boosting Classifier do alright in the bias-variance tradeoff

3. Can I predict rating:

  • Probably not!
  • Both train R2 values are low (about 0.20 for linear regression and 0.15 for ridge), and since train R2 never decreases as factors are added, results this low with this many factors are surprisingly bad
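
A test R2 far below zero, as in the plain linear regression outputs for questions 2 and 3, looks odd but follows from the definition. A minimal sketch with made-up numbers (the function name `r_squared` is chosen for illustration):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.0, 2.0, 2.0])  # always predict the mean
wild = r_squared(y_true, [1000.0, -1000.0, 1000.0])  # predictions far off scale
```

Here `perfect` is 1.0, `mean_only` is 0.0, and `wild` is far below zero. R2 has no lower bound on held-out data: a model whose test predictions blow up scores arbitrarily negative. One plausible cause in a setup like this one, though not verified here, is near-collinear one-hot columns producing enormous unregularized coefficients, which ridge's penalty keeps in check.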
