How can we tell if a movie will be great?

Combining k-means clustering with a decision tree on IMDb data

James Chen
4 min read · Nov 26, 2016

Background

Ever wonder how great a movie will be before its release? Recently, IMDb data for 5,000+ movies was scraped with Python and published on Kaggle.com by Chuan Sun. The dataset is available here.

Screenshot of the data scraped from the IMDb website

Objective

Our objective is to find out what factors might contribute to great movies.

Approach

1. Apply k-means clustering to assign movies into 5 classes

First we need to define what a great movie is. One simple way is to use the IMDb score alone; the approach here, however, combines the score with box-office gross to get a more balanced indicator of how great a movie is. Movies are labeled into 5 categories, as shown below. The majority of movies fall into cluster 3, which has low gross and high variation in IMDb scores, whereas movies in cluster 1 have relatively higher gross and scores than the other clusters.

Screenshot of the k-means clustering result (5 clusters, 20 random starts)
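
For readers following along in R, the cluster summary can be checked directly from the kmeans object before trusting the labels. This is a minimal sketch; it assumes the data frame and column names (mydata, gross, imdb_score) used in the full code at the end of this post.

# Cluster movies on gross and IMDb score, then inspect the result
# (a sketch; assumes mydata has numeric columns gross and imdb_score)
set.seed(123)
km <- kmeans(mydata[, c("gross", "imdb_score")], centers = 5, nstart = 20)
km$size     # how many movies fall into each cluster
km$centers  # average gross and IMDb score per cluster, e.g. the "great" cluster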

2. Construct a decision tree to classify movies based on different variables

Next, we build a decision tree to see which factors are most helpful in classifying movies into the 5 clusters created above. According to the result, the number of voted users, the movie budget, R-rated content, and the number of critic reviews are observed to influence the greatness of a movie.

Result of the decision tree (a left branch means the condition is satisfied; a right branch means it is not)
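
The same conclusion can be sanity-checked from the fitted rpart object itself, which stores a variable-importance ranking. A small sketch, assuming `fit` is the tree trained in the full code below:

# Rank the predictors the tree found most useful
# (assumes `fit` is the rpart model from the full code at the end of the post)
imp <- fit$variable.importance
head(sort(imp, decreasing = TRUE), 5)  # e.g. number of voted users, budget, critic reviews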

Say we want to know whether a movie will be great (cluster 1, in green). We follow the path from the top of the decision tree to the bottom:

  1. The number of voted users has to be greater than or equal to 43,000
  2. The movie budget has to be greater than $36M USD
  3. The movie cannot be R-rated
  4. Further down the tree, the number of voted users has to be greater than or equal to 519,000

If a movie fulfills the conditions above, we have close to a 70% chance that it will turn out to be a great one, before its release!
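
To make the path concrete, here is a hedged sketch that runs one hypothetical movie through the fitted tree. The column names (num_voted_users, budget, dummy_R) are assumptions based on the IMDb 5000 dataset and the dummy-variable naming used in the full code below; `fit` and `testing` also come from that code.

# Trace a hypothetical movie through the fitted tree (column names are assumptions)
new_movie <- testing[1, ]               # start from a real row so every predictor exists
new_movie$num_voted_users <- 520000     # conditions 1 and 4: at least 43,000, then 519,000 votes
new_movie$budget <- 40e6                # condition 2: budget above $36M
new_movie$dummy_R <- 0                  # condition 3: not R-rated
predict(fit, new_movie, type = "class") # predicted cluster label
predict(fit, new_movie, type = "prob")  # class probabilities at the leaf that is reached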

Data Preparation

1. Rows with missing critical values are removed from the file (see the sketch after this list)

2. Dummy variables are created for the categorical variables (language, country, content rating, title year, aspect ratio)

3. Training and testing sets are created with a 4:1 ratio
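
A minimal sketch of step 1, assuming the raw Kaggle export is read in first and that gross, budget, IMDb score, and vote count are the critical columns (the file name and column choice are assumptions, not part of the original code):

# Drop rows that are missing any critical value (file name and columns are assumptions)
raw <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE)
critical <- c("gross", "budget", "imdb_score", "num_voted_users")
raw <- raw[complete.cases(raw[, critical]), ]
nrow(raw)  # how many movies survive the filter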

Results

Based on the confusion matrix, the overall accuracy is 0.6572 across the 5 clusters (with accuracy for cluster 1 close to 0.70).

Screenshot of the decision tree performance on 5 clusters
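
The headline numbers come straight out of caret's confusionMatrix object; a short sketch, assuming `Prediction` and `testing` from the full code below:

# Pull overall and per-class accuracy from caret's confusionMatrix output
cm <- confusionMatrix(Prediction, testing$cluster)
cm$overall["Accuracy"]        # overall accuracy (0.6572 reported above)
cm$byClass[, "Sensitivity"]   # per-class recall, e.g. how often cluster 1 is caught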

To improve the overall accuracy, we may want to include more variables from the file, as we have not yet included genres (action, adventure, thriller, etc.), movie titles (which may require natural language processing), or plot keywords (which may require text mining).

In addition, we can explore whether a different number of clusters works better. Let us first look at the performance with 6 clusters: the accuracy drops by about 5%.

Screenshot of the decision tree performance on 6 clusters

Next we will examine performance with only 4 clusters.

Screenshot of the decision tree performance on 4 clusters

The result is 100% accuracy! This makes sense: as we reduce the number of outcome clusters, it becomes easier for the decision tree to predict them. In the extreme case of a single outcome cluster, we would always get 100% accuracy.
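
The comparison above can be reproduced with a small loop that re-clusters, refits the tree, and measures hold-out accuracy for each k. This is a sketch under assumptions: `features` holds the prepared predictor columns and `scores` holds gross and imdb_score for the same rows (both names are placeholders, not from the original code).

# Compare decision-tree accuracy for different numbers of clusters (k = 4, 5, 6)
set.seed(123)
for (k in c(4, 5, 6)) {
  cl <- kmeans(scores, centers = k, nstart = 20)        # re-label movies into k clusters
  df <- cbind(features, cluster = as.factor(cl$cluster))
  idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
  tree <- rpart(cluster ~ ., method = "class", data = df[idx, ])
  pred <- predict(tree, df[-idx, ], type = "class")
  cat("k =", k, "accuracy =", round(mean(pred == df[-idx, "cluster"]), 4), "\n")
}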

The full code in R is below.

library(ggplot2)
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(caret)
setwd("~/Desktop/imdb")
mydata <- read.csv("movie_label.csv")
set.seed(123)
# Generate movie clusters and plot them
mydataCluster <- kmeans(mydata[, 9:10], 5, nstart = 20)
mydata$cluster <- as.factor(mydataCluster$cluster)
ggplot(mydata, aes(gross, imdb_score, colour = cluster)) +
  geom_point() +
  scale_colour_manual(values = c("green", "blue", "orange", "purple", "red")) +
  xlab("Gross") + ylab("IMDB Score")
# Create dummy variables for categorical variables
for (level in unique(mydata$language)) {
  mydata[paste("dummy", level, sep = "_")] <- ifelse(mydata$language == level, 1, 0)
}
for (level in unique(mydata$country)) {
  mydata[paste("dummy", level, sep = "_")] <- ifelse(mydata$country == level, 1, 0)
}
for (level in unique(mydata$content_rating)) {
  mydata[paste("dummy", level, sep = "_")] <- ifelse(mydata$content_rating == level, 1, 0)
}
for (level in unique(mydata$title_year)) {
  mydata[paste("dummy", level, sep = "_")] <- ifelse(mydata$title_year == level, 1, 0)
}
for (level in unique(mydata$aspect_ratio)) {
  mydata[paste("dummy", level, sep = "_")] <- ifelse(mydata$aspect_ratio == level, 1, 0)
}
# Remove unwanted variables for the decision tree
mydata$id <- NULL
mydata$gross <- NULL
mydata$imdb_score <- NULL
mydata$genres <- NULL
mydata$language <- NULL
mydata$country <- NULL
mydata$content_rating <- NULL
mydata$title_year <- NULL
mydata$aspect_ratio <- NULL
# Create training and testing datasets (4:1 split)
training_size <- floor(0.80 * nrow(mydata))
train_ind <- sample(seq_len(nrow(mydata)), size = training_size)
training <- mydata[train_ind, ]
testing <- mydata[-train_ind, ]
# Construct the decision tree
fit <- rpart(cluster ~ ., method = "class", data = training)
fancyRpartPlot(fit)
# Evaluate decision tree performance
Prediction <- predict(fit, testing, type = "class")
confusionMatrix(Prediction, testing$cluster)

Questions, comments, or concerns?
jchen6912@gmail.com

James Chen

Engineer by training. Analytics by passion. R and Python addict who hacks and decodes data for marketers.