Predicting IMDb Movie Ratings using Supervised Machine Learning
My name is Joe Cowell and I recently enrolled in the Metis Data Science Bootcamp. The 12-week immersive program will turn me from ‘data novice’ into a full-fledged data scientist. I mean, the title of this post includes ‘Supervised Machine Learning’ and I’ve only been in the program for three weeks, so it seems like Metis is holding up their end of the bargain. Anyway, I’ll try to make a post about who I am for those interested, but for now, let’s take a look at how I used supervised machine learning to predict IMDb movie ratings.
Background:
During my musical career, the question was always, “how good is this song?” and never, “how much money will this song make?” Maybe that’s why we were your typical starving artists… Regardless, I took that concept and applied it to movies for this model. The idea is that artists in the movie industry can use this model to predict how well a movie will be received by viewers, which is why I chose IMDb rating as the target rather than Metacritic’s score or Rotten Tomatoes’ Tomatometer.
In its entirety, this project explored a few critical skills required of a data scientist:
- Web scraping (requests, HTML, Beautiful Soup)
- EDA (pandas, numpy)
- Linear regression (scikit-learn)
- Data visualization (seaborn, matplotlib)
Step 1: Data acquisition & cleaning 🔍
As a quick note, IMDb makes bulk data available for download, but a primary requirement for this project was to obtain data through web scraping; so, I went along and got the information from IMDb using requests and Beautiful Soup. requests fetches a webpage and turns it into an object in Python. Beautiful Soup takes that object, which holds the HTML behind the webpage, and makes searching for and accessing specific information within that HTML easy. You really need both to fully complete the process of web scraping.
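For anyone new to these tools, here’s a minimal sketch of the pattern (the movie URL is just an example):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML; IMDb may reject requests without a browser-like User-Agent.
response = requests.get(
    "https://www.imdb.com/title/tt0111161/",  # example movie page
    headers={"User-Agent": "Mozilla/5.0"},
)

# Parse the HTML so elements can be searched by tag, class, or attribute.
soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("h1")
print(title.text.strip() if title else "No <h1> found")
```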
On the IMDb page, I used the advanced search feature to access titles released between 2000 and 2020. The results spanned thousands of pages, and each page held the titles of, and links to, 100 movies. Upon further inspection, I noticed the URL contained the query parameter start=1; increasing that start value by 100 flips through the pages. With a helper function, I used requests and Beautiful Soup to pull the links from each page and return them as one list.
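A sketch of what that helper might look like; the exact query string and the lister-item-header class name are assumptions to verify against the live page source:

```python
import requests
from bs4 import BeautifulSoup

def get_movie_links(num_pages):
    """Collect movie page links from paginated IMDb advanced-search results."""
    links = []
    for page in range(num_pages):
        start = 1 + page * 100  # 'start' advances 100 per page: 1, 101, 201, ...
        url = (
            "https://www.imdb.com/search/title/"
            f"?title_type=feature&year=2000,2020&count=100&start={start}"
        )
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "html.parser")
        # Each search result's link sits in an <h3> header (assumed class name).
        for header in soup.find_all("h3", class_="lister-item-header"):
            anchor = header.find("a")
            if anchor and anchor.get("href"):
                links.append("https://www.imdb.com" + anchor["href"])
    return links
```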
To utilize that list of movie hyperlinks, I created another function to extract as much data as I could from each page. This function took in a link and returned a dictionary containing the following information: title, IMDb rating, the number of IMDb raters, MPAA rating, genres, directors, writers, top three stars, initial country of the release, original language of the release, release date, budget, opening weekend USA, gross USA, cumulative worldwide gross, production companies, and runtime.
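The real function pulled all of the fields listed above; here’s a heavily trimmed sketch of the pattern, with hypothetical selectors, since IMDb’s markup changes over time:

```python
import requests
from bs4 import BeautifulSoup

def get_movie_data(link):
    """Scrape one IMDb movie page into a dictionary of attributes."""
    soup = BeautifulSoup(
        requests.get(link, headers={"User-Agent": "Mozilla/5.0"}).text,
        "html.parser",
    )
    data = {"link": link}
    title_tag = soup.find("h1")
    data["title"] = title_tag.text.strip() if title_tag else None
    # The rating selector is an assumption; inspect the live page for the real one.
    rating_tag = soup.find("span", itemprop="ratingValue")
    data["imdb_rating"] = float(rating_tag.text) if rating_tag else None
    # ...the remaining fields (budget, runtime, genres, etc.) follow the same
    # find-the-element, clean-the-text pattern.
    return data
```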
As part of the EDA, some data had to be cleaned. This consisted of turning any numerical value from a string into an integer. Runtime had to be converted into minutes, all of the monetary values needed commas and dollar signs removed, and the release date had to be converted into datetime. Additionally, categories that contained lists needed to be converted from strings into actual python lists (genres, directors, stars, production companies). The retrieval function did most of this cleaning, but after putting the data into a DataFrame, some other cleaning was necessary.
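Sketched below are a few of those conversions; the column names and string formats (budgets like ‘$25,000,000’, runtimes like ‘142 min’, list columns stored as strings) are assumptions about how the scraped fields landed in the DataFrame:

```python
import ast
import pandas as pd

def money_to_int(value):
    """Convert a string like '$25,000,000' to the integer 25000000."""
    return int(value.replace("$", "").replace(",", ""))

for col in ["budget", "opening_weekend_usa", "gross_usa", "worldwide_gross"]:
    movies_df[col] = movies_df[col].apply(money_to_int)

# '142 min' -> 142 (minutes as an integer)
movies_df["runtime"] = movies_df["runtime"].str.extract(r"(\d+)", expand=False).astype(int)

# Release dates become proper datetime objects.
movies_df["release_date"] = pd.to_datetime(movies_df["release_date"])

# List columns scraped as strings ("['Drama', 'Comedy']") become real Python lists.
for col in ["genres", "directors", "stars", "production_companies"]:
    movies_df[col] = movies_df[col].apply(ast.literal_eval)
```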
With over 2,000 movies in a DataFrame, I needed to do some more processing to get a functional DataFrame for modeling. This meant dropping movies without budget information, movies with a budget below $1,000, and movies with fewer than 1,500 raters. Regarding that last filter, movies with a low number of raters tended to report the most extreme ratings (leaning towards a perfect 10 or a big goose egg). All in all, I ended up with a DataFrame of over 1,100 movies. Now it’s time to start modeling.
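First, though, here’s roughly what that filtering looks like in pandas, assuming a num_raters column and an already-numeric budget:

```python
# Keep rows with a known budget of at least $1,000 and at least 1,500 raters.
movies_df_drop = movies_df.dropna(subset=["budget"])
movies_df_drop = movies_df_drop[
    (movies_df_drop["budget"] >= 1_000) & (movies_df_drop["num_raters"] >= 1_500)
]
```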
Pairplots: Before moving on to the next section, I’d like to mention pairplots. A pairplot is a great visualization tool for exploring relationships within the data and informing where to start for an MVP. It can seem like a lot of information, but if you format your DataFrame so that the first or last column is the target, it becomes much easier to interpret. For this pairplot, the plots in the first column show the relationships between the independent variables and the target. Although I did not use most of the numerical data, there are clearly linear and exponential-looking relationships, which can inform where to start modeling.
import seaborn as sns
sns.pairplot(movies_df_drop, height=1.2, aspect=1.25)
Step 2: Models and features 📈
It is important to note that another requirement for this project was the use of linear regression, so the models I experimented with were plain linear regressions and ridge regressions. With such a large number of features available, and with this being my first experience with regression in Python, it took me a while to sort out each feature.
First, I decided to take the easy route by conducting a simple linear regression with runtime as my sole feature and IMDb rating as the target. This resulted in an R² value of 0.2687. Honestly, I was fairly excited to get any number above zero, so I was ready to dive into the rest of the data.
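That baseline takes only a few lines of scikit-learn; the column names here are assumptions:

```python
from sklearn.linear_model import LinearRegression

X = movies_df_drop[["runtime"]]  # single feature: runtime in minutes
y = movies_df_drop["imdb_rating"]

baseline = LinearRegression().fit(X, y)
print(f"R^2: {baseline.score(X, y):.4f}")  # ~0.27 on the author's data
```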
For MPAA rating and genre, I created dummy variables to add to the DataFrame and got an R² of 0.3997. As for directors, writers, stars, and production company, I built a list of the most frequently occurring players in each of those categories and created dummy variables for the top contenders. If a director only appeared once in my data, that director’s weight (or coefficient) would be a direct result of that specific film’s rating; players with multiple rows of data give the model more information to produce a better-informed coefficient.
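Roughly how that might look: pd.get_dummies handles the single-valued MPAA column, MultiLabelBinarizer handles the list-valued genres, and a top-N loop (the cutoff of 20 is arbitrary here) handles directors:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Single-valued category -> one column per MPAA rating.
mpaa_dummies = pd.get_dummies(movies_df_drop["mpaa_rating"], prefix="mpaa")

# List-valued category -> one column per genre.
mlb = MultiLabelBinarizer()
genre_dummies = pd.DataFrame(
    mlb.fit_transform(movies_df_drop["genres"]),
    columns=mlb.classes_,
    index=movies_df_drop.index,
)

# Keep only the most frequent directors, so each dummy has multiple rows behind it.
top_directors = movies_df_drop["directors"].explode().value_counts().head(20).index
director_dummies = pd.DataFrame({
    f"dir_{d}": movies_df_drop["directors"].apply(lambda ds, d=d: int(d in ds))
    for d in top_directors
})

# (List and string columns would still need to be dropped before modeling.)
features = pd.concat(
    [movies_df_drop, mpaa_dummies, genre_dummies, director_dummies], axis=1
)
```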
To get a little more creative, I took the release date and made a ‘release month’ feature. In the same vein, I took the release date and created another feature that determined the years since the movie was released. It may not have been the most relevant feature, but I was excited to experiment with datetime information.
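Both features fall out of pandas’ datetime accessor; the reference year of 2020, when the data was collected, is an assumption:

```python
features["release_month"] = features["release_date"].dt.month
# Years between release and when the data was collected.
features["years_since_release"] = 2020 - features["release_date"].dt.year
```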
With all of those features loaded into a model, the resulting R² of 0.4751 seemed promising, but the next step was to rigorously test the model with cross validation.
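A typical way to do that with scikit-learn, assuming feature_cols names the assembled feature columns:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X_all = features[feature_cols]  # feature_cols: the final list of feature names
y = features["imdb_rating"]

# 5-fold cross validation: fit on 4/5 of the data, score on the held-out fifth.
scores = cross_val_score(LinearRegression(), X_all, y, cv=5, scoring="r2")
print(f"Mean CV R^2: {scores.mean():.4f} (+/- {scores.std():.4f})")
```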
Step 3: Testing and training / the results 🎥
Although linear regression was getting the job done, I knew I wanted to compare the coefficients of the model, and using a ridge regression was a great way to force myself to scale the inputs and try a different approach to creating a model.
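A minimal sketch of that setup: a pipeline fits the scaler on the training data only, so no test information leaks into the scaling. The alpha value is a placeholder to tune:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hold out a test set, reusing X_all and y from the cross-validation step.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.2, random_state=42
)

# Scale first, then fit the penalized regression; alpha=1.0 is a placeholder.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X_train, y_train)

print(f"Test MAE: {mean_absolute_error(y_test, ridge_model.predict(X_test)):.2f}")
```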
For this section, I would recommend taking a look at the project repository to see the process behind training and testing models, but I’ll just jump to the final model and the results.
The final model resulted in an R² of 0.432 and a mean absolute error of 0.64. That is a fairly low R², but this article describes why an R² below 0.5 is expected when predicting human behavior. The plot of predicted vs. actual ratings (left) provided more confidence in the model, as there is a roughly linear relationship between the two. The movies with the highest residuals either had a low number of ratings or were movies like Cats, Fifty Shades of Grey, and The Emoji Movie: films with good stats behind them that the public simply did not receive well, a signal that is hard to incorporate into this model.
It’s also important to look at the coefficients associated with each feature. As seen in the plot on the left, runtime, years since release, and budget were all big players in the model, with some genres and writers up there as well. That’s the beauty of ridge regression on scaled inputs: the coefficients sit on a common scale, so they can be compared directly to judge the weight of each feature.
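With the pipeline above, ranking those scaled coefficients is a one-liner (make_pipeline names the Ridge step "ridge"):

```python
import pandas as pd

# Pull the fitted Ridge estimator out of the pipeline and pair each
# coefficient with its feature name.
coefs = pd.Series(
    ridge_model.named_steps["ridge"].coef_, index=X_all.columns
).sort_values()
print(coefs.tail(10))  # features that push the predicted rating up the most
```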
In the end, I had a model that predicted IMDb rating with an R² of 0.432, significantly better than just predicting with the mean, and a mean absolute error of 0.64, which means a prediction was typically off by about 0.64 points in either direction.
Summary: 📘
Not only was this my first time scraping the web for data, but it was also my first time creating a model, let alone a linear regression model. All things considered, I’m fairly proud of it. The experience of traversing the data science workflow on my own was also very rewarding; I:
- Created my own dataset through scraping the web for information
- Explored the dataset and cleaned up anything that was off
- Developed an MVP to have a working model at any given moment
- Iteratively improved that model to get a better product with each feature
- Visualized the validity of my model and what contributed to the rating of a movie
Within three weeks of the bootcamp, I became comfortable with web scraping, EDA, linear regression modeling, and data visualization. Once again, for a more code-heavy explanation of my process, check out my GitHub repository, and feel free to reach out if you have any questions or comments.