Christmas movie IMDb ratings explained with Machine Learning

SHAP values of director, genre, actors, and more

Dmytro Iakubovskyi
Published in Data And Beyond
Dec 24, 2022 (3 min read)


In this article, I use a dataset of about 750 Christmas movies taken from the IMDb website. The dataset is publicly available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.

Photo by Samira Rahi on Unsplash

Step 1 — data preprocessing

Here, data preprocessing consists of the following steps:

  • selecting movies with known IMDb ratings;
  • selecting movies with a long enough runtime (65 minutes or greater);
  • grouping movie runtime into larger bins;
  • extracting information about individual movie stars and genres with the help of CountVectorizer, keeping only the items that appear at least 6 times across the dataset;
  • rare-label encoding of the categorical variables (director, description, and film rating), keeping no more than 60 different categories in each column and at least 6 records in each category;
  • finally, removing unused columns, records with null values, and single categories.

The result is a cleaned dataset of 562 movies, each with an IMDb rating on the 0 to 10 scale.

Step 2 — setting a Machine…


Dmytro Iakubovskyi

Top writer in AI, Movies | Senior data scientist | Editor in Data And Beyond | https://www.linkedin.com/in/dima806/