Rotten Tomatoes ratings for 30,000+ movies explained with Machine Learning

SHAP values for director, genre, rating, and more

Dmytro Iakubovskyi
Published in Data And Beyond
5 min read · Apr 20, 2023

Photo by lan deng on Unsplash

In this article, I use a dataset containing extensive information on about 140,000 unique movies from the Rotten Tomatoes website, collected as of April 2023. The dataset is publicly available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.

Step 1 — data preprocessing

Here, data preprocessing consists of the following steps:

  • selecting movies with identified labels — rating scores from both users (audience score) and professional critics (tomato-meter score) — scaled from 0 to 100;
  • converting movie release dates to decades;
  • grouping movie runtime lengths into larger (20-minute) bins;
  • extracting the movie genre, director, and sound mix columns and encoding only those values that appear in at least 25 records in the dataset;
  • removing unused columns;
  • finally, applying rare-label encoding to categorical variables such as movie ratings, distributors, original languages, runtime lengths, and release decades, so that each column has no more than 60 different categories and each category has at least 100 records.
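
The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code from the Kaggle notebook: the function names, the ", " separator for multi-valued columns, and the "Other" bucket for rare categories are all assumptions made for the example.

```python
import pandas as pd


def to_decade(dates: pd.Series) -> pd.Series:
    """Map release dates to the first year of their decade (1994 -> 1990)."""
    years = pd.to_datetime(dates, errors="coerce").dt.year
    return (years // 10 * 10).astype("Int64")


def bin_runtime(minutes: pd.Series, width: int = 20) -> pd.Series:
    """Round runtimes down to the start of their bin (95 -> 80 for width 20)."""
    return (minutes // width) * width


def multi_hot(values: pd.Series, min_count: int = 25, sep: str = ", ") -> pd.DataFrame:
    """One indicator column per value of a multi-valued column
    (e.g. genres), keeping only values seen in at least `min_count` rows.
    The separator is an assumption about how the raw column is stored."""
    dummies = values.fillna("").str.get_dummies(sep=sep)
    return dummies.loc[:, dummies.sum() >= min_count]


def group_rare(
    values: pd.Series,
    min_count: int = 100,
    max_categories: int = 60,
    other: str = "Other",
) -> pd.Series:
    """Keep the most frequent categories (at least `min_count` rows each,
    at most `max_categories` of them) and replace everything else,
    including missing values, with a hypothetical `other` label."""
    counts = values.value_counts()  # sorted from most to least frequent
    frequent = counts[counts >= min_count].index[:max_categories]
    return values.where(values.isin(frequent), other)
```

On a real data frame, each helper would be applied to the matching column, for example `group_rare(df["distributor"])`, where `distributor` is a hypothetical column name. Selecting movies with both labels present could be done with `df.dropna(subset=[...])` on the two score columns.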

Dmytro Iakubovskyi

Top writer in AI, Movies | Senior data scientist | Editor in Data And Beyond | https://www.linkedin.com/in/dima806/