Movie Ratings for Fans of Small Internationally-Successful Studios

Liam Isaacs
Apr 7, 2021 · 7 min read

Impersonating a critic through linear regression

I was curious about how we can simulate someone’s opinion of movies using linear regression. Today, I will go over, in five steps, how we can scrape data from BoxOfficeMojo and IMDB and engineer the perspective of a fan of small, internationally-successful studios.

Pt. 1: Where does our critic go for information?

Media, media, media. If the critic is the chef, media is the ingredient. That’ll be the eyes of our critic, the data our totally-reasonable judgements will originate from. Take this movie I just made up, for instance: “The Great Cat Chase”. What information do you think is relevant for a fan of small international studios?

What should our critic care about?

I think we can consider what other people think, genre, and who made the movie. I was worried about using cast, since we do not want to overly bias towards Tobey Maguire, and we want to stand up for what’s right: Tommy Wiseau. It’ll be easy to bias towards smaller studios using distributor, by using the number of movies produced per year, but I couldn’t think of a way to do this with cast.

I couldn’t find a way to bias towards Tommy Wiseau. I’m sorry.
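If you’re curious what that distributor feature can look like in practice, here’s a minimal pandas sketch. The file name and column names (“distributor”, “year”) are assumptions about how the scraped BoxOfficeMojo data is laid out, not the exact layout in the repo.

```python
import pandas as pd

# A sketch of the "small studio" feature: approximate studio size as the
# average number of movies a distributor releases per year. File and
# column names are assumptions, not the repo's actual schema.
movies = pd.read_csv("boxofficemojo_movies.csv")

avg_per_year = (
    movies.groupby(["distributor", "year"])
          .size()                       # movies per distributor per year
          .groupby("distributor")
          .mean()                       # averaged over the years scraped
          .rename("avg_movies_per_year")
          .reset_index()
)

# Small distributors get small values, which the model can learn to reward.
movies = movies.merge(avg_per_year, on="distributor", how="left")
```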

The next question is: which reviews do we trust? IGN, IMDB, Bon Appétit? We’ll use audience scores for populist reasons. We can make a data-based decision from histograms of audience scores for Rotten Tomatoes, Metacritic and IMDB:

Looks like our reviews are all skewed (the “7 is average” mentality strikes again). Since IMDB looks the most normal, we’ll go with IMDB.
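The skew check itself is just a few lines of plotting. Here’s a sketch assuming a ratings table with one column per site; the column names are made up, so swap in however your scraped ratings are stored.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A sketch of the skew check: one histogram per review site.
ratings = pd.read_csv("ratings.csv")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, col in zip(axes, ["rotten_tomatoes", "metacritic", "imdb"]):
    ratings[col].dropna().hist(bins=20, ax=ax)
    ax.set_title(col)
    ax.set_xlabel("audience score")
plt.tight_layout()
plt.show()
```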

If you’re looking to web-scrape, I’d recommend Scrapy since it’s pretty fast. You can see my crawlers/csvs in the GitHub repo for this project, and I wrote a tutorial on scraping BoxOfficeMojo data last week.
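For a taste of what that looks like, here’s a bare-bones spider for a BoxOfficeMojo-style yearly table. The start URL and CSS selectors below are placeholders, not the ones from my crawlers; see the repo for the real thing.

```python
import scrapy

# A bare-bones spider sketch; URL and selectors are placeholders.
class MojoSpider(scrapy.Spider):
    name = "mojo"
    start_urls = ["https://www.boxofficemojo.com/year/2020/"]

    def parse(self, response):
        for row in response.css("table tr")[1:]:   # skip the header row
            yield {
                "title": row.css("td a::text").get(),
                "gross": row.css("td::text").getall(),
            }

# Run with: scrapy runspider mojo_spider.py -o movies.csv
```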

Pt. 2: Revenue, MPAA Rating (PG-13, R, etc.) and Budget

Almost by accident, in my frenzy of web crawling I also collected revenue, MPAA rating and budget. Revenue and budget ultimately end up ranking movies from wealthy studios first, and I couldn’t rationalize why our critic would care whether a movie is R, PG-13, that sort of thing.

Pt. 3: Cooking with fire: designing an algorithm

Herein lies the start of our impersonation. This is our master plan:

This is how we will structure the critic’s opinion

Pt. 4: How do we tell what’s going on? Model evaluation

We can “benchmark” each step of the process using evaluation metrics for linear models. The reviews of people who watch these internationally-successful movies (defined by the list of terms above) are what we want to imitate. So, when we run our models, if we train our critic to notice the same sorts of things about movies as those people do, we want the critic’s score to match theirs. In other words, we want our model to explain a good % of the variance in the data. This can be observed by using R².

R² scores along the way

A technical way of describing the degree-2 polynomial regression’s R² score is: 36.1% of the variance in our training (Tr) data and 33.1% of the variance in our testing (Te) data is explained by a degree-2 polynomial regression on data that has distributor, genre-genre, and genre “feature engineered” into it. Since both the polynomial iteration and the previous iteration of our model have similar scores on Te data, we can compare these two approaches.
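For the curious, the Tr/Te comparison boils down to something like this sketch, assuming X holds the engineered features (distributor, genre, genre-genre) and y the IMDB scores:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# X and y are assumed to be the engineered feature matrix and IMDB scores.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Degree-2 polynomial expansion of the features.
poly = PolynomialFeatures(degree=2)
X_tr_poly = poly.fit_transform(X_tr)
X_te_poly = poly.transform(X_te)

model = LinearRegression().fit(X_tr_poly, y_tr)
print("Tr R²:", r2_score(y_tr, model.predict(X_tr_poly)))
print("Te R²:", r2_score(y_te, model.predict(X_te_poly)))
```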

Regularization and residual plots for underfit/overfit data

The reason why we do a train-test split isn’t just to see how well we perform on unseen data — it helps us see if our model is underfitting or overfitting to the dataset. This basically means: is our critic not getting the gist of what’s going on at all (underfitting) or just imitating people 100% of the time (overfitting)? Underfitting is expressed by Tr < Te scores, and overfitting is expressed by Tr > Te scores.

Except for the polynomial regression, our models consistently underfit. While it’s hard to pinpoint why this happens, it could be that our feature space is too small (our list of stuff we’re looking at is too short), or that we have too little data (# of IMDB reviews), or both. Since I couldn’t include more features or data without more and more time-consuming webscraping, I decided to skip over that and go straight to regularization.

Regularization (seen here as “Ridge” and “LASSO”) works by penalizing our model for mimicking people 100% of the time. You use it when Tr > Te, when you’re overfitting, but it’s worth a try for underfit models too. We can visualize what’s going on by looking at how our predictions change when we start to slap our model on the wrist for mimicry. This is seen in “residual plots”, which plot predicted values for Te data against the actual values.
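In code, the regularized fits and the plot look roughly like this sketch, reusing the split from above; the alpha grid and cv settings are illustrative defaults, not the values from the repo.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import RidgeCV, LassoCV

# Regularized fits; alpha grid and cv are illustrative defaults.
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]).fit(X_tr, y_tr)
lasso = LassoCV(cv=5, random_state=42).fit(X_tr, y_tr)

# Plot predicted vs. actual scores on the Te data, with a y = x reference.
preds = ridge.predict(X_te)
plt.scatter(y_te, preds, alpha=0.5)
plt.plot([y_te.min(), y_te.max()], [y_te.min(), y_te.max()], linestyle="--")
plt.xlabel("actual IMDB score")
plt.ylabel("predicted score")
plt.title("RidgeCV predictions vs. actuals (Te)")
plt.show()
```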

Linear (2nd to last in the R² diagram) and Polynomial (last in the R² diagram) regression’s residual plots

Generally speaking, we go from residuals that look somewhat linear (especially in the case of that polynomial regression) to residuals that look a lot more random. This is called increasing the homoscedasticity (constant variance of the errors) of our model, which means our models are less misleading and/or biased (source: assumptions of linear regression). Practically speaking, this means our critic’s creativity in opinion is more trustworthy; it’s actually spontaneous.

Q-Q Plots for regularized data

The plots above are more for eye-balling variance in our errors. Quantile-Quantile (Q-Q) plots, pictured here with residual plots to the left (and their accompanying Tr, Te R² values), can help visualize the distribution of our residuals. They work by plotting the quantiles of two probability distributions against each other. In our case, we plot the “observed quantiles” of our residuals against those of a normal distribution. If our residuals are normal, i.e. if these two distributions are similar, the points will follow a y=x line.
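scipy will draw the Q-Q plot for you. Here’s a sketch, reusing the ridge fit from the snippet above:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Compare residual quantiles to a normal distribution; points hugging
# the line suggest normally distributed residuals.
residuals = y_te - ridge.predict(X_te)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of RidgeCV residuals (Te)")
plt.show()
```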

Residual plots with R² values and Q-Q Plots for Linear and Polynomial regressions

For everything except the Ridge CV polynomial regression, the points follow a fairly linear pattern, meaning our residuals are roughly normally distributed (which checks off another assumption of linear regression). The linearity tends to weaken for reviews in the “mid-section” of movies (the line bends); as we can see in the residual plots, that is because there are outliers: movies that “flop”, ones we predict will do OK but that end up failing. While we could identify and remove these movies, I think keeping them and having the character of a “flawed critic” is mildly interesting.

I ended up choosing the RidgeCV linear model, since its residuals are normally distributed, its Q-Q plot is the most linear, and, most importantly, scores can get up to 8-ish. In addition, the polynomial model creates a massive feature space (378 to ~7000), which takes a long time to compute, let alone run a LassoCV on, so it is impractical for our use case of a web app.

Pt. 5: Visualizing the movies

We’ve created our critic; now we can start reviewing some movies. I just ran the same train-test split and LassoCV linear model to give predictions, and displayed them in a grid format. I decided to keep the IMDB rating, just so you know if the critic is totally off the rails for a given movie.
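The grid itself is just a sorted DataFrame. Here’s a sketch, where final_model stands for whichever fitted estimator you settle on, and titles_te is a hypothetical list of titles kept alongside the Te split:

```python
import pandas as pd

# Attach predictions to the held-out movies and sort by the critic's score.
# "final_model" and "titles_te" are placeholders, not names from the repo.
results = pd.DataFrame({
    "title": titles_te,
    "imdb_rating": y_te,                      # kept for sanity-checking
    "critic_score": final_model.predict(X_te),
}).sort_values("critic_score", ascending=False)

print(results.head(10))                       # the critic's top 10
```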

You can see the ratings for yourself here.

The top 10 movies

Pt. 6: Reflections

I think the model does a really good job of exposing you to a lot of international films. I think this method might just be on the cusp of suggesting under-rated films that you haven’t seen. I was happy to see a recent favorite, Tokyo Godfathers, in the top 50. Studio Ghibli films also tend to do very well, which is nice to see. Strangely, panda movies do well.

The view of the critic is a double-edged sword. Since it’s so far from the status quo, it pretty much feels random. I didn’t want to take into account the volume of reviews, but in retrospect that piece of information might help weed out the documentary films that have 1–10 reviews on IMDB.

Due to the visualization method of showing all the testing data, the ranking loses some of its weight. It’s nice to just show everything for exposure purposes, but this deflates the worth of a 1–10 critique and calls into question whether a score needs to be given at all. I ended up asking myself, “What’s different from a random order?” From our analysis of R², residual and Q-Q plots, though, it’s clear that the order of the movies is non-random. But it still comes across as random, so it’s hard to say. Ultimately, I would say “it kind of worked.”

Thank You

Thank you for reading! You can find the github repository for this project here, which has outlined each step of the way so that you can follow along. You can find me on my website, LinkedIn, or github. You can read more of my stuff here.

Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit https://www.nerdfortech.org/.

Liam Isaacs

Written by

叶秋 pianist & data scientist, liamisaacs.com
