Predicting IMDb Ratings of New Movies

Kevin Chen
Web Mining [IS688, Spring 2021]
24 min read · May 12, 2021
Source: IMDb

Introduction

IMDb is the world’s most popular and authoritative source for movie, TV, and celebrity content. IMDb users often look at ratings to get an idea of how good movies are, so that they can decide which movies to watch or which ones to prioritize. However, movies that are not yet released don’t have ratings, and the ratings of movies with few votes often change as more users vote. Therefore, I wrote code to predict IMDb ratings of new movies based on various features, such as budget, actors, directors, writers, release year, genres, and plot. While others have used linear regressions to predict ratings of movies in general, those predictions rely on features like movie earnings or number of votes, which would not be available for new movies. I instead combined cosine similarities and normalized Euclidean distances with a modified kNN algorithm, which still produced mostly very accurate predictions. This provides a way to obtain an estimated rating before IMDb supplies one.

1. Data Collection

I coded all of my work, including the data collection, in Python in Jupyter Notebooks. It probably would’ve been simpler to scrape data from IMDb’s website, but because I wasn’t sure if that was allowed by IMDb, I collected the data from 4 sources:

  1. IMDb’s datasets: https://www.imdb.com/interfaces/
  2. RapidAPI’s Movie Database IMDb Alternative: https://rapidapi.com/rapidapi/api/movie-database-imdb-alternative
  3. TMDb’s (The Movie Database) API: https://developers.themoviedb.org/3/getting-started/introduction
  4. Macrotrends’ data table: https://www.macrotrends.net/countries/USA/united-states/inflation-rate-cpi

IMDb’s Datasets

IMDb provides subsets of IMDb data that are available for personal and non-commercial use, so I downloaded 7 of the TSV files from its website and read them as DataFrames using pandas’ read_csv() function.

Download links on IMDb’s website
Code to read TSV files
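
Since the code above appears only as a screenshot, here is a minimal sketch of reading one of the TSV files (the filename and variable name are mine):

import pandas as pd

# IMDb's TSV files are tab-separated and use '\N' to mark missing values.
ratings_df = pd.read_csv('title.ratings.tsv', sep='\t', na_values='\\N')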

After exploring the data, I ultimately used the data from 4 of these TSV files:

  1. Regions shown in (called region) came from title.akas.tsv
  2. IMDb ratings (called averageRating) and IMDb # of votes (called numVotes) came from title.ratings.tsv
  3. Directors and writers came from title.crew.tsv
  4. Runtimes (called runtimeMinutes) and media types (called titleType) came from title.basics.tsv

IMDb ratings, directors, writers, and runtimes are the features that I ended up using from the IMDb datasets. IMDb # of votes, regions shown in, and media types are features that I used to filter my data. IMDb # of votes refers to the number of votes that make up the IMDb rating, and media types distinguish movies, TV shows, video games, and so on, because IMDb contains more than just movies.

RapidAPI’s Movie Database IMDb Alternative

RapidAPI’s Movie Database IMDb Alternative is an API that I used to extract actors, titles, genres, plots, release years, and countries filmed in. It queries the data for each movie individually, which was very slow for my computer because there were over 500,000 movies to query before I decided on what data to filter out.

Therefore, I coded multiple scripts to perform different queries in parallel, which sped up the process a lot. I created 4 copies of my Jupyter Notebook and slightly adjusted them to simultaneously call and store data for different movies. For example, in my original Jupyter Notebook, I sliced the 1st 1/5 of the title IDs and saved the results as movies1.csv, and in the next notebook, I sliced the 2nd 1/5 of the title IDs and saved them as movies2.csv.

Slicing 1/5 of title IDs
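
Here is a minimal sketch of that slicing, where all_title_IDs is my stand-in name for the full list of title IDs from the IMDb datasets:

# Take the 1st 1/5 of the title IDs for this notebook copy.
n = len(all_title_IDs)
title_IDs = all_title_IDs[:n // 5]

The code below then queries the API for each title ID in the slice:
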
import pandas as pd
import requests
import json
import traceback
import unicodedata

def remove_control_characters(s):
    # Strip control characters (and backslashes) that break JSON decoding
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C" and ch != '\\')

url = "https://movie-database-imdb-alternative.p.rapidapi.com/"
headers = {
    'x-rapidapi-key': key, # key holds my RapidAPI key
    'x-rapidapi-host': "movie-database-imdb-alternative.p.rapidapi.com"
}

json_list = []
error_title_IDs = []
for title_ID in title_IDs:
    querystring = {"i": title_ID, "r": "json"}

    try:
        response = requests.request("GET", url, headers=headers, params=querystring)
        json_list.append(response.json())
    except:
        try:
            # Retry parsing after removing control characters
            json_list.append(json.loads(remove_control_characters(response.text)))
        except:
            # Track title IDs with other errors (e.g. network errors) to re-query later
            error_title_IDs.append(title_ID)
            print(title_ID, traceback.format_exc())

df = pd.DataFrame(json_list)
df.to_csv('movies1.csv', index=False)

In the code above, I used the requests and json libraries to call the API and extract the data as JSON. I removed control characters in an exception statement because the code otherwise encountered JSON decode errors. The 2nd exception statement, using the traceback.format_exc() function, kept track of the title IDs that had other errors (which in my experience were API network errors), so that I could call the API again with those queries.

There were cases where the title ID from IMDb’s dataset displayed a different imdbID when I called the API. For example, the title ID for one of these movies in IMDb’s dataset was ‘tt0090111’.

When I manually looked up these title IDs on IMDb’s website, the URL redirected to the same imdbID value from the API calls. Some of them had information, while others didn’t. I added code to remove their JSON from json_list, and I saved their IDs in a list to check afterward. Using numpy’s .setdiff1d() function, I confirmed that there were no title IDs from RapidAPI that weren’t in the IMDb datasets, because the check returned an empty set, so I didn’t have to do anything with this removed data.
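
A minimal sketch of that check, where redirected_IDs and imdb_title_IDs are my stand-in names for the redirected imdbIDs and the title IDs from IMDb’s datasets:

import numpy as np

# Elements of redirected_IDs that are not in imdb_title_IDs; empty means none.
print(np.setdiff1d(redirected_IDs, imdb_title_IDs))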

In my original Jupyter Notebook, I read all the movies CSV files (that I created using the above code) as DataFrames and combined them, which provided me with the data for all the movies.

Some features existed in both IMDb’s datasets and RapidAPI’s Movie Database IMDb Alternative but had different values, so I had to use my best judgment to decide which to use. For example, the genres from the IMDb datasets had only the first 3 genres that show up on IMDb, so I used the genres from RapidAPI, which had up to 8 genres for a given movie; the IMDb ratings from RapidAPI were often missing or far from the current IMDb ratings, so I used the IMDb ratings from the IMDb datasets.

TMDb’s API

After I performed extensive analysis, I thought budgets would be helpful for improving the accuracy of my predicted ratings, so I extracted them from TMDb’s API. TMDb’s API queries use TMDb’s movie ID, which is different from IMDb’s title ID, so I couldn’t query only the movies that I filtered previously (covered later in the Data Processing section), and like RapidAPI, it queries the data for each movie individually.

To work around this, I performed the two queries shown below, which extracted TMDb movie IDs filtered the same way as my other IMDb data. TMDb’s “discover” API query provides up to 500 pages of results, for a total of up to 10,000 results. Fortunately, after I filtered out movies (covered later in the Data Processing section), there were just over 10,000 movies, so I chose parameters for the TMDb discover API calls such that the combination of the two queries would include those same movies. Filtering the API query by release year wasn’t working, so I filtered by language (en-US) and vote count (≥ 55). Note the vote count is the number of TMDb votes, not the number of IMDb votes that I filtered on. 55 was the smallest threshold, at the time of calling the API, at which the results of the two queries overlapped; I wanted to keep the TMDb vote threshold as low as possible to avoid excluding movies. 1 of the queries was ascending (parameter sort_by=vote_count.asc) to capture the later pages, and the other query was descending to capture the earlier pages. I combined all these TMDb IDs in a list called IDs and removed the duplicates from the overlap by converting the list of IDs into a set.

IDs = []
for i in range(1, 501): # Page numbers
    url = 'https://api.themoviedb.org/3/discover/movie?api_key={}'.format(key) # key holds my TMDb API key
    url += '&language=en-US&vote_count.gte=55&sort_by=vote_count.asc&page={}'.format(i)
    response = requests.get(url)
    for result in response.json()['results']:
        IDs.append(result['id'])

    url = 'https://api.themoviedb.org/3/discover/movie?api_key={}'.format(key)
    url += '&language=en-US&vote_count.gte=55&sort_by=vote_count.desc&page={}'.format(i)
    response = requests.get(url)
    for result in response.json()['results']:
        IDs.append(result['id'])
tmdb_IDs = set(IDs)

I then queried the IMDb data for each TMDb ID by using similar code to that of what I used for RapidAPI’s Movie Database IMDb Alternative. The differences were that I didn’t need to remove control characters, and the query URL and output CSV filename were changed.

json_list = []
error_list = []
for tmdb_ID in tmdb_IDs:
    url = 'https://api.themoviedb.org/3/movie/{}?api_key={}&language=en-US'.format(tmdb_ID, key)
    response = requests.get(url)
    try:
        json_list.append(response.json())
    except:
        # Track TMDb IDs with errors so I can re-query them
        error_list.append((tmdb_ID, traceback.format_exc()))
        print(tmdb_ID, traceback.format_exc())
tmdb_df = pd.DataFrame(json_list)
tmdb_df.to_csv('tmdb.csv', index=False)

Macrotrends’ Data Table

I wanted to adjust the budgets based on inflation because the U.S. dollar of the 1970s is worth almost 6 times more than today’s U.S. dollar, and this is a large reason for much lower budgets back then compared to now. Macrotrends provides U.S. inflation rates from 1960–2019 in a data table. I copied the data from the table and pasted it into a CSV file, and I appended the data for 2020 and 2021 from Google.

Then, I read the CSV file as a DataFrame named inflation_df, again using pandas’ read_csv() function.

2. Data Processing

Merging, Cleaning, & Filtering

After collecting the data, I first merged the data from IMDb’s datasets and RapidAPI. I renamed the column names to match, so that I could then merge most of the DataFrames based on title IDs.

After merging, the ‘akas’ and ‘principals’ DataFrames had multiple rows for each title ID, which resulted in an enormous CSV file of 12.81 GB. Therefore, I removed them, leaving a merged DataFrame of size 237,359 × 44 and a CSV file size of 153.6 MB.

I continued processing the data in a DataFrame called df, which was the same as the merged data.

The missing data was provided as ‘\N’ markers, so I converted those into null values, and the numeric IMDb data was provided as strings, so I converted it to integers.
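
A minimal sketch of these conversions (the specific columns shown are illustrative):

import numpy as np

# Replace IMDb's '\N' missing-value markers with NaN, then convert
# numeric columns that were read in as strings.
df = df.replace('\\N', np.nan)
df['numVotes'] = pd.to_numeric(df['numVotes'])
df['runtimeMinutes'] = pd.to_numeric(df['runtimeMinutes'])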

I dropped columns, like box office earnings and awards, that aren’t relevant to or updated for new movies using pandas’ .drop() function.

I filtered out movies with missing data for the columns of interest (after dropping unnecessary columns) by using pandas’ .dropna() function.

df = df.dropna()

I further filtered to movies released after 1970, shown in the U.S. as a region, and with > 5000 IMDb votes. I needed to reduce the number of movies due to my computer’s limited memory and processing power, and these filters make sense for producing accurate predicted ratings: older movies won’t share the people who make new movies; movies that aren’t shown in the U.S. have more data errors and less similar data; and movies with few IMDb votes have more extreme ratings that are more likely to change significantly if they get more votes.
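
A minimal sketch of these filters (the column names are assumptions based on the features described earlier):

# Keep movies released after 1970, shown in the U.S., with > 5000 IMDb votes.
df = df[(df['Year'] > 1970) & (df['region'] == 'US') & (df['numVotes'] > 5000)]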

After this filtering, the number of movies that I had data for was 11,175.

Note that these filters are for analyzing movies that have reliable IMDb ratings and for generating predicted ratings whose accuracy can be validated. New movies with ≤ 5000 votes will of course all be after 1970 and can still be given the predicted ratings that IMDb users would be able to use.

Adjusting Budgets Based on Inflation

The data I collected from Macrotrends provided U.S. inflation rates per year, so I converted them to cumulative inflation rates by multiplying together all previous years’ rates for each year. The code sorts the years so that earlier years come before later years, converts each string inflation rate into a float added to 1, and then multiplies each year’s rate by the accumulated product of the previous rates.
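
That code appears only as a screenshot; here is a minimal sketch of the idea, assuming inflation_df has ‘year’ and ‘inflation rate’ columns (the names and the percent-string format are assumptions):

# Sort by year, convert 'x.xx%'-style strings to 1 + rate,
# then take a running product so each year maps to cumulative inflation.
inflation_df = inflation_df.sort_values('year')
inflation_df['factor'] = 1 + inflation_df['inflation rate'].str.rstrip('%').astype(float) / 100
inflation_df['cumulative'] = inflation_df['factor'].cumprod()
cum_inflation = dict(zip(inflation_df['year'], inflation_df['cumulative']))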

I then removed from tmdb_df the records whose budgets were missing or the string “0”, as well as the records that weren’t in df. I did a left merge of df and tmdb_df, so that all of the previously filtered movies would be included, rather than just the ones with a budget.

For missing budgets, I ended up setting them to the mean of the budgets that account for U.S. inflation rates. This implementation, as opposed to excluding the missing budgets, resulted in better accuracy for my predicted ratings. I set the missing values to 0 using pandas’ .fillna() function so that I could later set such values to the mean budget; I needed a placeholder number because the operations with NaN wouldn’t work. I also kept only budgets > 1000 and < 9×10⁸ (excluding the 0 placeholders from this filter) because values outside that range were outliers and mostly incorrect budgets anyway.

Next, I divided the budgets by the cumulative U.S. inflation rates (with 1971 as the starting point) that I previously calculated. I calculated the mean budget after the adjustments for inflation, using pandas’ .mean() function, and applied that to the 0’s that I used to represent the missing values.
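
A combined sketch of these budget steps, reusing the cum_inflation mapping from the earlier snippet (the column names are illustrative):

# Placeholder 0's for missing budgets, then drop outlier budgets.
df['budget'] = df['budget'].fillna(0).astype(float)
df = df[(df['budget'] == 0) | ((df['budget'] > 1000) & (df['budget'] < 9e8))]

# Adjust for inflation, then substitute the mean adjusted budget for the 0's.
df['budget'] = df['budget'] / df['Year'].map(cum_inflation)
mean_budget = df.loc[df['budget'] > 0, 'budget'].mean()
df.loc[df['budget'] == 0, 'budget'] = mean_budget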

After this, the total number of movies was still 11,175, but the number with actual budgets was 6,452.

3. Initial Data Visualizations

To help understand the data, in addition to looking through it and using a lot of .groupby() calls that produce too much data to show, I created scatter plots and a heatmap to compare each of my 3 numeric features to IMDb ratings, and I created data tables to explore the categorical features. They ultimately showed that I needed to analyze combinations of features, because no single feature can accurately predict ratings of new movies, and that there were numerous such combinations.

Scatter Plots & Heatmap

In the below scatter plots, movies with higher budgets and runtimes tend to have higher IMDb ratings, but for the vast majority of budgets and runtimes, the IMDb ratings are spread from about 1 to 9, so ratings can’t be predicted using either of these features by itself.

In the below Year vs IMDb Rating scatter plot, older years tend to have higher IMDb ratings, but the spread of IMDb ratings increases for more recent years, which applies to new movies.

Below is the code I used to create the budget scatter plot. The same code, with the appropriate variable and column names, was used for the other 2 scatter plots.

import matplotlib.pyplot as plt

budget = df['budget'].tolist()
rating = df['imdbRating'].tolist()
plt.scatter(budget,rating)
plt.title('Budget vs IMDb Rating')
plt.xlabel('Budget')
plt.ylabel('IMDb Rating')
plt.show()

The heatmap below shows the correlations between each of my numeric features. None of the correlations are particularly high. Even the correlation between runtime (“runtimeMinutes”) and IMDb rating (“imdbRating”), which was the highest at 0.33, was inflated by the high-runtime outliers that had high IMDb ratings.

Below is the code I used to create the heatmap using the seaborn library.
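
Since that code isn’t reproduced here, this is a minimal sketch (the column selection is an assumption):

import seaborn as sns

# Correlation matrix of the numeric features, drawn as an annotated heatmap.
corr = df[['budget', 'Year', 'runtimeMinutes', 'imdbRating']].corr()
sns.heatmap(corr, annot=True)
plt.show()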

Data Tables

Below are the first 10 rows of the data tables for my categorical features. The values for the writers and directors are their name IDs that are in IMDb.

Each of these features can have multiple comma-separated values, resulting in a high number of unique combinations. The high number of combinations indicates that many combinations have only a few movies associated with them, which makes it difficult to compare the individual values to IMDb ratings. Correlations based on so few movies are unreliable, and combinations of values can affect IMDb ratings differently than the individual values would indicate.

For example, there’s only 1 movie with the genres Action, Adventure, Biography, and Drama, so I wouldn’t expect a new movie with those genres to necessarily have a similar rating. Conversely, 2 genres may individually have high IMDb ratings but have low IMDb ratings when combined. These combinations get exponentially more complicated as more features are introduced, like actors that have varying levels of chemistry with other actors and varying performances in certain genres of movies.

I used pandas’ .groupby() function to generate these tables (they could instead be created using the .drop_duplicates() function) because I also compared columns like IMDb rating and the statistics for each grouping. For example, below is the idea for grouping by genre (shown above in the data tables screenshot) and comparing the mean IMDb rating (excluded from the data tables screenshot).
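
A minimal sketch of that grouping (the column names follow the earlier DataFrames):

# Mean IMDb rating for each unique genre combination.
genre_ratings = df.groupby('Genre')['imdbRating'].mean().sort_values(ascending=False)
print(genre_ratings.head(10))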

4. Additional Data Processing

I performed additional data processing in order to generate my predicted ratings. I used TF-IDF and one-hot encoding on the categorical features to calculate cosine similarities. Then, I calculated normalized Euclidean distances for the numeric features.

TF-IDF

I used TF-IDF (term frequency-inverse document frequency) to reflect how important each word of each movie plot is. TF is the number of times the word appears in the plot. IDF is the logarithm of the total number of plots divided by the number of plots containing the word. TF-IDF is TF multiplied by IDF. I used sklearn’s TfidfVectorizer() function to create the TF-IDF vectors for the movie plots.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
tfidf_plot = tfidf.fit_transform(df['Plot'])

One-Hot Encoding

I used one-hot encoding to be able to perform numeric calculations on the categorical variables.

I split genre strings into a list for the movies with multiple genres.

I appended each genre into a list to see their counts.

To apply the one-hot encoding, I iterated through the list of genre names, and put a 1 if the movie has the genre and otherwise a 0.

I applied the same code, with the appropriate variable and column names, to the actors, directors, writers, and countries data. A sketch of the genre version is below.
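
This is a minimal sketch of the genre one-hot encoding described above (the variable names are mine; each movie ends up with a 0/1 vector in the ‘One-hot Genres’ column, matching how it’s used in the next subsection):

# Split the comma-separated genre strings into lists.
df['Genre list'] = df['Genre'].str.split(', ')

# Collect the unique genre names across all movies.
genre_names = sorted({g for genres in df['Genre list'] for g in genres})

# 1 if the movie has the genre, otherwise 0, in a fixed genre order.
df['One-hot Genres'] = df['Genre list'].apply(
    lambda genres: [1 if g in genres else 0 for g in genre_names])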

Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.

I made a dictionary with each of the cosine similarities for the TF-IDF vectors and one-hot encodings that I previously created. I used the cosine_similarity() function from the sklearn.metrics.pairwise library to calculate the cosine similarities for every pair of movies.

from sklearn.metrics.pairwise import cosine_similarity

co_sim_dict = {}
attribute_list = ['One-hot Actors','One-hot Directors','One-hot Genres','One-hot Countries','One-hot Writers','Plot']
for atr in attribute_list:
    if atr == 'Plot':
        co_sim_dict[atr] = cosine_similarity(tfidf_plot,tfidf_plot)
    else:
        co_sim_dict[atr] = cosine_similarity(df[atr].tolist(),df[atr].tolist())

Euclidean Distance

Euclidean distance is the straight-line distance between 2 points, so I used it as the opposite of similarity. Because the numeric data was 1-dimensional, I transformed each value into a list with an appended 0; in this case, the Euclidean distance is equivalent to the absolute value of the difference. The reason I converted the values into lists and appended 0’s is that the sklearn pairwise Euclidean distance function requires at least 2 dimensions, and it performs the operation for every pair of movies efficiently. Before actually calculating the Euclidean distances, I first normalized the numeric values, which I cover in the next (Normalization) subsection.

Normalization

I normalized the numeric values so that their distances would be rescaled between 0 and 1. Rescaling to this range allows subtracting from 1 to calculate similarity scores, and it puts these similarity scores on the same scale as the cosine similarity scores.

Below is the function I used to normalize each numeric feature.

import numpy as np

def normalize(array):
    # Divide by the vector norm so the values are rescaled between 0 and 1.
    norm = np.linalg.norm(array)
    normal_array = np.array(array) / norm
    return normal_array

Like for the cosine similarities, I used the sklearn.metrics.pairwise library to calculate the Euclidean distances for every pair of movies, using the euclidean_distances() function. Below is the code for normalizing the budgets and getting the Euclidean distances of those normalized budgets.

from sklearn.metrics.pairwise import euclidean_distances

df['normalized budget'] = normalize(df['budget'].tolist())*10
df['budget distances'] = df['normalized budget'].apply(lambda x: [x] + [0])
budget_distances = euclidean_distances(df['budget distances'].tolist(),df['budget distances'].tolist())
budget_sims = 1 - budget_distances

Note that I multiplied the normalized budgets by 10 to scale the Euclidean distances up, because they were only between about 0 and 0.1. Also note that I calculated the budget similarity score (which I named budget_sims) as 1 minus the Euclidean distance of the budgets, because distance is the opposite of similarity. I applied the same code, with the appropriate variable and column names, to the other numeric variables, but I multiplied the normalized values by different amounts to scale them similarly: the normalized release year was multiplied by 1000, and the normalized runtime was multiplied by 20.

5. Predicting IMDb Ratings

Using the cosine similarities and Euclidean distances from the previous (Additional Data Processing) section, I calculated the total similarity score as the cosine similarities + (1 - Euclidean distances), with some weights applied. Then, I predicted ratings using the mean IMDb rating of the k nearest neighbors (kNN), which are the movies with the highest similarity scores. Calculating similarities between movies incorporates what I learned in the IS688 course, and because the data contained a sufficiently diverse and high number of movies to compare, it successfully circumvented the need to estimate the effects that numerous combinations of features have on IMDb ratings.

Code for Predicting IMDb Ratings

I first mapped the title IDs (‘tconst’ column) to their indexes in df, and mapped the movie titles (‘Title’ column) to the title IDs, so that my function could index movies based on the title ID or movie name.

df = df.reset_index(drop=True)
ID_indices_dict = dict(zip(df['tconst'],df.index))
title_ID_dict = dict(zip(df['Title'],df['tconst']))

Below is my function that I created to predict IMDb ratings. The “title” parameter takes as an argument either the title ID or movie title, and the “k” parameter is the number of movies with the highest similarity scores to use for the predicted rating.

def predict_rating(title, k):
    # Convert title into IMDb title ID if needed, and index the title ID
    if title in title_ID_dict:
        title = title_ID_dict[title]

    idx = ID_indices_dict[title]

    # Total (weighted) similarity scores against every other movie
    sims = (np.array(budget_sims[idx])*2 + np.array(year_sims[idx])
            + np.array(co_sim_dict['One-hot Actors'][idx])
            + np.array(co_sim_dict['One-hot Directors'][idx])*2
            + np.array(co_sim_dict['One-hot Genres'][idx])
            + np.array(co_sim_dict['Plot'][idx])
            + np.array(co_sim_dict['One-hot Writers'][idx])
            + np.array(co_sim_dict['One-hot Countries'][idx]))

    sims = list(enumerate(sims))
    sims = sorted(sims, key=lambda x: x[1], reverse=True)

    # Create list of k most similar movies that have > 5000 IMDb votes
    k_sims = [sims[0]] # Index 0 is the selected movie, only included to print its data
    for sim in sims[1:]:
        if df['numVotes'].iloc[sim[0]] > 5000:
            k_sims.append(sim)
        if len(k_sims) > k:
            break

    # Make predictions using mean of IMDb ratings of k most similar movies
    movie_indices = [i[0] for i in k_sims]
    print(df[['Title','imdbRating','budget','Year','Actors','directors','writers','Genre','Country']].iloc[movie_indices])
    if (df['imdbRating'].iloc[movie_indices[1:11]].mean() > 7) and (df['imdbRating'].iloc[movie_indices[1:6]].mean() > 7.5):
        return df['imdbRating'].iloc[movie_indices[1:6]].mean()
    elif df['imdbRating'].iloc[movie_indices[1:4]].mean() < 4:
        return df['imdbRating'].iloc[movie_indices[1:4]].mean()
    else:
        return df['imdbRating'].iloc[movie_indices[1:]].mean()

The 1st part of the function determines the index (variable called idx) of the title ID or movie title that is passed into the function. Next, the function uses the determined index to retrieve similarity scores between the selected movie and every other movie, which are used to calculate the total similarity scores. Then, it creates a list of the indices of the “k” most similar movies that have > 5000 IMDb votes, which is used for the predicted ratings.

Note the weights of 2 for the similarity scores of budgets and directors, which increases their impact on the total similarity score. The total similarity score is calibrated to find movies that are likely to average an IMDb rating close to that of the queried movie. Thus, the retrieved similar movies aren’t expected to have similar IMDb ratings by themselves, nor are they necessarily going to be liked the same amount by an individual like the goal would be for a recommendation system.

I tested numerous combinations of similarity scores, weights, and k values. I had to test combinations, rather than determining the importance of individual choices one at a time, because the importance of each choice changes based on the combinations it is a part of. For example, with fewer features used for similarity scores, certain higher k values increased the accuracy of my predicted ratings; and without budget, including similarity scores for the countries the movies were filmed in significantly lowered the accuracy of my predicted ratings.

I validated the accuracy of my predicted ratings by comparing them to the IMDb ratings of movies that have > 5000 IMDb votes, rather than to new movies that have ≤ 5000 IMDb votes. I otherwise wouldn’t have a stable IMDb rating to compare to, and as a bonus, the diversity of these movies is much greater, which means less chance of overfitting. Note that I will provide more details about calculating accuracy in the next (Accuracy) section, and the function does still produce predicted ratings for new movies.

The last part of the function contains the rules for the predicted ratings, which are returned by the function. The 1st 2 rules improve the accuracy of my predicted ratings for high and low IMDb ratings, without making the accuracy for other IMDb ratings significantly worse, which I will provide examples of in the next (Example Recommendations) subsection.

Example Recommendations

Below is a prediction example for a high IMDb rating.

The queried movie is the first title, Avengers: Endgame, and the most similar movies are, in order, Avengers: Infinity War, Captain America: Civil War, Captain America: The Winter Soldier, and so on. The actual rating is 8.4, and my predicted rating is 7.84, which is the mean of the top 5 most similar movies rather than the top 10. This is based on my rule that the mean IMDb rating of the top 5 most similar movies is > 7.5 (it’s 7.85) and the mean IMDb rating of the top 10 most similar movies is > 7 (it’s 7.34). As you can see, this rule helps predict highly rated movies, because it’s less likely for the top 10 most similar movies to average a high rating.

Below is a prediction example for a low IMDb rating.

The queried movie House of the Dead has an IMDb rating of 2.0. My predicted rating is 2.83, the mean IMDb rating of the top 3 most similar movies, because that mean is less than 4. This is closer to the IMDb rating than 4.11, which is what the prediction would be using the mean of the top 10 most similar movies. I restricted the rule to the top 3 instead of the top 5 because there are far fewer low-rated movies in my filtered data, making it less likely for the mean IMDb rating at higher k values to average a low rating.

Cats is a movie that had high expectations based on the past success of the director, a high budget, popular celebrities, and frequent advertisements. However, it was simply received poorly by the public, which isn’t something that can be accounted for with the data. Because the mean IMDb rating of the top 5 most similar movies is less than 7.5, the predicted rating is based on the mean of the top 10 most similar movies, which is 7.34.

6. Accuracy

I used my predict_rating() function to create predicted ratings (with the column name “Predicted Rating”) for every movie in df. I calculated “rating difference” (with the column name “Difference”) as the absolute value of (the IMDb rating minus the predicted rating).
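
Below is a minimal sketch of generating these columns (the k value and the plain loop are assumptions; predict_rating() also prints the similar-movie table for each query, which I’ve omitted from this discussion):

# Predict a rating for every movie and measure the absolute error.
df['Predicted Rating'] = [predict_rating(title_ID, 10) for title_ID in df['tconst']]
df['Difference'] = (df['imdbRating'] - df['Predicted Rating']).abs()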

The rating differences were overall very low, which indicates good accuracy. The mean rating difference was 0.60, and the standard deviation of rating differences was 0.55.

The vast majority of the rating differences were <= 1. Most of the other rating differences were between 1 and 2. Very few rating differences were higher than 2.

More specifically:

82.85% of rating differences are <= 1.

14.69% of rating differences are >1 and <= 2.

1.74% of rating differences are >2 and <=3.

0.47% of rating differences are >3 and <=4.

0.02% of rating differences are >4 and <=5.

I also checked the accuracy of the predicted ratings of movies that had actual budgets, which I expected to be better because the ones with missing budgets used the mean budget as a substitute, and that is not necessarily close to their actual budget. I saved a copy of df that kept the missing-budget movies, and then filtered out the movies that used the mean budget as a substitute for missing budgets.

df_with_missing_budgets = df.copy()
df = df[df['budget']!=mean_budget]

This resulted in a slightly higher mean rating difference, but also a lower standard deviation, and thus fewer rating differences above 2.

I consider this to be better accuracy because having predicted ratings be exactly or almost exactly the same as the IMDb rating matters less than having them be consistently close; having fewer predictions that are far from the IMDb rating is more important. As shown below, the proportion of rating differences above 3 is almost halved when only checking the accuracy for the movies with budgets as opposed to all of the movies.

For the sake of the length of this blog, I will continue to show the accuracy for only the predicted ratings of movies that have actual budgets. Note that these predicted ratings may still be calculated using movies that use the mean budget as a substitute for missing budgets; I’m just not including the predicted ratings of the movies without budgets, which have more extreme rating differences.

Scatter Plots

The below scatter plots help visualize the accuracies of my predicted ratings.

The IMDb Rating vs Rating Difference scatter plot uses orange dots to represent the rating differences > 3, and blue dots to represent the others. The rating differences are higher for IMDb ratings below 4 because there are far fewer low-rated movies in my filtered data.

Code for IMDb Rating vs Rating Difference
Code for IMDb Rating vs Predicted Rating
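
The code for these plots appears in the screenshots above; here is a minimal sketch of the first one (highlighting differences > 3 with a colors list is my own approach):

# Color rating differences > 3 orange and the rest blue.
colors = ['orange' if d > 3 else 'tab:blue' for d in df['Difference']]
plt.scatter(df['imdbRating'], df['Difference'], c=colors)
plt.title('IMDb Rating vs Rating Difference')
plt.xlabel('IMDb Rating')
plt.ylabel('Rating Difference')
plt.show()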

Bar Charts

The below bar charts help visualize the counts of rating differences for different intervals.

This 1st bar chart represents the percentages that I previously mentioned. Again, the vast majority of the rating differences were <= 1. Most of the other rating differences were between 1 and 2. Very few rating differences were higher than 2.

Code for Counts for Rating Difference Intervals of 1
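
A minimal sketch of counting the rating differences in intervals of 1 and plotting the bar chart (the bin edges are assumptions; the 0.25-interval chart that follows uses the same approach with np.arange(0, 1.25, 0.25) as the bins):

# Count rating differences in unit-width intervals and plot the counts.
counts, edges = np.histogram(df['Difference'], bins=[0, 1, 2, 3, 4, 5])
labels = ['0-1', '1-2', '2-3', '3-4', '4-5']
plt.bar(labels, counts)
plt.title('Counts for Rating Difference Intervals of 1')
plt.xlabel('Rating Difference')
plt.ylabel('Count')
plt.show()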

This 2nd bar chart covers 0 to 1, where most of the rating differences fall. There are more rating differences close to 0 than close to 1, which indicates higher accuracy.

Code for Counts for Rating Difference Intervals of 0.25

Additional Scatter Plot

This Year vs Rating Difference scatter plot shows that the rating differences are about the same regardless of year, including the recent years that I mentioned earlier have a greater spread of IMDb ratings and are most similar to the new movies that I would predict ratings for. This indicates that my predicted ratings are accurate for a wide variety of movies.

Below is the code for this scatter plot.

year = df['Year'].tolist()
diff = df['Difference'].tolist()
plt.scatter(year,diff)
plt.title('Year vs Rating Difference')
plt.xlabel('Year')
plt.ylabel('Rating Difference')
plt.show()

Conclusion

Accurately predicting IMDb ratings of new movies is challenging. I needed to analyze combinations of features because no single feature can accurately predict ratings of new movies. Yet, there were numerous combinations of features that can affect IMDb ratings differently than the individual values would indicate. I performed a lot of data processing and ultimately combined weighted cosine similarities and normalized Euclidean distances with a modified kNN algorithm. With this technique, my predicted ratings of most movies are within a difference of 1 from their actual IMDb rating, with very few exceptions.

The main limitation is that the predicted ratings depend on having data for movies with similar combinations of features. This is usually but not always the case; for example, because there are far fewer low-rated movies in my filtered data, the predicted ratings are less accurate for IMDb ratings below 4. In the future, I could potentially include similarities based on movie production companies (e.g. Marvel Studios, Paramount Pictures, Walt Disney Pictures), which may improve the accuracy of the predicted ratings. This would likely require scraping IMDb’s website, because only the 1st (not necessarily most important) production company was provided by the IMDb datasets, and many from TMDb’s API weren’t correct.
