Decision Trees Demystified for Simbu movie Directors

Published in DataComics · 9 min read · Dec 23, 2019

Oh, don’t worry about the gun in his hand, he won’t shoot.

Imagine you are directing a movie with STR as the lead. I know it feels stressful, but try. You wake up early in the morning, brush your teeth, take a bath, and pray to God to make your life less miserable. Then you reach the shooting spot, make all the necessary arrangements for the shoot, and wait for Simbu. He doesn’t turn up, in spite of living just beneath the shooting spot. You call the producer to report the incident, which has happened for the 3rd time this week, along with the financial losses. He scolds your parents and slams you and the actor badly. A couple of minutes into the scolding session, you realise that your movie is produced by the same lead actor who did not turn up for the shoot. You feel like shooting yourself, but you don’t want to give the actor another reason to opt out of this project. So you decide not to.

But you have an urge to help fellow directors who may fall into the same trap in the near future, even though very little is in your control. Some 102 days into shooting this movie, 73 of which got cancelled, you start to notice a pattern in whether Simbu turns up for a shoot. You can’t process so much information in your head, so you start noting down some features of each shoot and, at the end of the day, whether STR turned up or not. By the time the movie is ready for release, you have some 800-plus shooting days’ worth of information, each with a clear label of whether the actor turned up. This is the precious dataset we will work on and build a decision tree from. This blog is a sincere attempt to teach and learn how a decision tree is built, which forms the basis of a lot of machine learning algorithms. By the end of this blog, people who know about Simbu will have a fair idea of how decision trees are built, and people who know how decision trees are built will have a fair idea about Simbu. If you are a person who knows both, may God bless you.

The Dataset

Here is a snapshot of the data collected by the dejected director.

Data Dictionary

A decision tree will be built on these variables to predict the y variable, cameForShooting. Why he came for the shooting is altogether a different problem statement but for now, it is the Y variable, which we will be predicting by building a decision tree.

Establishing a Baseline Accuracy

Here is a snapshot of the number of times each class appears in the data given to us.

There are 891 observations in total. Of those, only 342 are observations where Simbu came for shooting (Class 1). So if we simply always predict that Simbu won’t turn up for the shooting on any given day, we will be right 549*100/(549+342) ≈ 61% of the time. The tree we are going to build should at least beat this accuracy score. Otherwise there is no point in building a tree at all, just like announcing the release date well in advance for a movie that won’t release at all.
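As a quick check of that arithmetic, here is a minimal sketch of the majority-class baseline. The class counts are the ones quoted above; the variable names are only illustrative.

```python
# Class counts quoted in the article: 549 no-shows (class 0), 342 shows (class 1).
class_counts = {0: 549, 1: 342}

total = sum(class_counts.values())
# The baseline always predicts whichever class is more common.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"Always predict class {majority_class}: {baseline_accuracy:.1%} accuracy")
```

Any model worth handing to the next director has to score above this number on data it has not seen.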

Measure of Impurity

Consider two top stars in a film industry, A and B. Both do movies that appeal to the mass audience. A does a watchable film once in every 10 films. Every alternate movie of B is watchable, i.e. B does a watchable film once in every two films. Both actors have around 50 films under their belt. Your friend, an NRI, is visiting you from the US. He has no clue about either actor A or actor B, or the films they do, but he is an ardent fan of actor C, whom no one cares about. You take him for a walk down the market street and find a vendor selling CDs of all the movies of actor A and actor B. You make small talk with the vendor. He says he started selling CDs here a couple of days ago, and, without much surprise, you learn that none of the CDs has sold yet. So he has a couple of bags of CDs, split by the lead actor in the movie.

Gini is one of the ways to measure the impurity in a system. Gini impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labelled according to the class distribution in the dataset. Say the vendor tells you that bag A holds 5 watchable movies and the remaining 45 are not worth watching. You pick a movie at random and predict whether it is watchable, with the prediction itself drawn at random from the same distribution (5 watchable, 45 not worth watching). What is the probability of making a wrong prediction if you repeat this a large number of times? If you repeat the experiment 100 times, the distribution of your error rate will look something like this.

The mean of this distribution comes to around 0.18, which is the probability of making a wrong prediction if you pick each movie and make random predictions. Unlike the other 45 movies, you can try this at home. You will get a similar probability score. If you are too lazy to repeat this experiment 100 times, Gini impurity has an elegant formula that gives this probability without any random trials: one minus the sum of the squared class proportions. Let’s calculate the Gini impurity of both the bags the seller has.
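Here is a small sketch of both routes to that 0.18, assuming bag A holds 5 watchable and 45 unwatchable movies and bag B holds 25 of each (every alternate movie out of 50). The function names `gini` and `simulated_error_rate` are made up for this illustration. The simulation draws a movie and a guess independently from the same bag and counts how often they disagree.

```python
import random

def gini(watchable: int, not_watchable: int) -> float:
    """Gini impurity of a two-class bag: 1 minus the sum of squared class shares."""
    total = watchable + not_watchable
    p = watchable / total
    return 1.0 - p**2 - (1.0 - p) ** 2

def simulated_error_rate(watchable, not_watchable, trials=100_000, seed=0):
    """Estimate the same probability by random trials instead of the formula."""
    rng = random.Random(seed)
    bag = [1] * watchable + [0] * not_watchable
    # One draw is the actual movie, the other is the random guess.
    wrong = sum(rng.choice(bag) != rng.choice(bag) for _ in range(trials))
    return wrong / trials

print(round(gini(5, 45), 3))    # bag A
print(round(gini(25, 25), 3))   # bag B
print(round(simulated_error_rate(5, 45), 3))
```

The formula gives 0.18 for bag A and 0.5 for bag B, and the simulated error rate for bag A should land close to 0.18.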

Even though actor B has given watchable movies more often than actor A, the Gini index of the bag with all his movies is higher, which implies higher impurity. The Gini index curve looks like the one below. For two classes, the impurity peaks at 0.5, when the bag is an even 50/50 mix.

The total Gini impurity of the split is the weighted average of the impurities of the split buckets, which is

(movies in bucket A / Total movies) * Gini impurity of bag A +
(movies in bucket B / Total movies) * Gini impurity of bag B
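As a sketch, the weighted average can be written as a tiny helper. The bag counts are the ones from the CD stall (5 of 50 watchable for actor A, 25 of 50 for actor B); the function names are hypothetical.

```python
def gini(counts):
    """Gini impurity of one bucket, given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(buckets):
    """Average of the bucket impurities, weighted by bucket size."""
    total = sum(sum(counts) for counts in buckets)
    return sum(sum(counts) / total * gini(counts) for counts in buckets)

# Split by lead actor: (watchable, not watchable) per bag.
by_actor = [(5, 45), (25, 25)]
print(round(split_gini(by_actor), 3))  # 0.5 * 0.18 + 0.5 * 0.50 = 0.34
```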

Gini index is just one way to compute impurity in the system. There are other measures, like entropy, but the idea is the same.

Now suppose, hypothetically, the seller decides to split all 100 movies by the person who directed them instead of by lead actor. For the sake of simplicity, let’s consider that all these movies were directed by only two directors, D1 and D2. D1 has 60 movies under his belt, 25 of them watchable, and D2 has 40 movies to his name, of which 5 are watchable. Let’s recalculate the Gini impurity of the individual buckets and the total impurity of the split.
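One way to redo the sums, as a rough sketch: each bucket is a hypothetical (watchable, total) pair built from the numbers above, and `split_impurity` is an illustrative helper, not a library function. Computed without rounding the intermediate values, the director split lands near 0.379; truncating each bucket’s impurity to two decimals before averaging gives a figure a few thousandths lower. Either way it is above the 0.34 of the actor split.

```python
def gini(watchable, total):
    """Two-class Gini impurity; 2p(1-p) equals 1 - p**2 - (1-p)**2."""
    p = watchable / total
    return 2 * p * (1 - p)

def split_impurity(buckets):
    """buckets: list of (watchable, total) pairs; returns the size-weighted Gini."""
    n = sum(total for _, total in buckets)
    return sum(total / n * gini(w, total) for w, total in buckets)

candidate_splits = {
    "lead actor": [(5, 50), (25, 50)],   # actors A and B
    "director": [(25, 60), (5, 40)],     # directors D1 and D2
}
scores = {name: split_impurity(b) for name, b in candidate_splits.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("lower impurity:", best)
```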

The Gini impurity of the split is about 0.379, which is higher than 0.34 (split by actors). So it is better to split the movies by actor than by director to reduce the impurity in the system. All characters and other entities which appeared in this example are fictitious. Any resemblance to real persons or other real-life entities is purely coincidental.

Oh wait, we forgot your friend who came from the US! Never mind, he went back while we were busy discussing.

Back to STR

Let’s split the data we have in hand into train and test sets and train a decision tree classifier on the training set. Now replace the vendor with an algorithm that tries splitting the training data by each of the available features and chooses the split that gives the minimum impurity. That’s the zen behind decision trees. The decision tree starts at the root node and keeps splitting recursively on the data points in each node, until a further split won’t improve the Gini impurity of the existing nodes. Those nodes become the leaf nodes of the decision tree.
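To make the recursion concrete, here is a toy sketch of that greedy procedure in plain Python. It is not what any particular library does internally, and the nine rows of data and the feature names `different_area` and `song_shoot` are invented purely for illustration. Every feature is a 0/1 flag, a node is split on the flag with the lowest weighted Gini, and splitting stops when no flag lowers the impurity.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, features):
    """Pick the 0/1 feature whose split has the lowest weighted Gini.

    Returns None when no feature beats the impurity of the unsplit node.
    """
    best_feature, best_score = None, gini(labels)
    for feature in features:
        off = [y for row, y in zip(rows, labels) if row[feature] == 0]
        on = [y for row, y in zip(rows, labels) if row[feature] == 1]
        if not off or not on:
            continue  # this flag does not separate anything here
        score = (len(off) * gini(off) + len(on) * gini(on)) / len(labels)
        if score < best_score:
            best_feature, best_score = feature, score
    return best_feature

def build_tree(rows, labels, features):
    """Split recursively until no feature helps; leaves hold the majority class."""
    feature = best_split(rows, labels, features)
    if feature is None:
        return Counter(labels).most_common(1)[0][0]
    branches = {}
    for value in (0, 1):
        keep = [i for i, row in enumerate(rows) if row[feature] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep], features)
    return (feature, branches)

def predict(tree, row):
    """Walk from the root to a leaf following the row's flag values."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[row[feature]]
    return tree

# Invented toy data: (different_area, song_shoot, came_for_shooting)
data = [(0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0),
        (0, 0, 1), (1, 1, 0), (1, 0, 0), (1, 0, 0)]
features = ["different_area", "song_shoot"]
rows = [dict(zip(features, record[:2])) for record in data]
labels = [record[2] for record in data]

tree = build_tree(rows, labels, features)
print(tree)
print(predict(tree, {"different_area": 0, "song_shoot": 1}))
```

On this toy data the root splits on `different_area`, and only the "same area" branch is split again on `song_shoot`, because the other branch is already pure.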

With all that said and after the tree is built, let’s take a look at how the decision tree of whether STR will come for shooting or not looks.

Shit! This looks even more complicated to understand than STR himself. What have we done? In the name of simplifying the decision, we have made things even more complicated. The accuracy of this tree is 75% on the test set, which is much better than the 61% baseline, but who cares? If we pass this tree on to the next director who is going to direct Simbu, that director will hang himself on the same tree. We need something that makes the decision intuitive, something the director can easily explain to other actors and junior artists when they ask why he thinks Simbu won’t come.

Enter tree pruning. Pruning is a technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and so can improve predictive accuracy by reducing overfitting. Let’s do pre-pruning, which stops the tree from growing too large in the first place. One of the ways to do this is to specify the maximum depth of the tree (the number of levels excluding the root node). Let’s set the maximum depth to 3. Here is how the tree looks.
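A minimal sketch of pre-pruning by depth, assuming scikit-learn is available. The data here is randomly generated stand-in data with made-up flag names and a made-up labelling rule, not the director’s dataset, so the printed accuracies say nothing about Simbu. The point is only to compare an unrestricted tree with one capped at `max_depth=3`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 891 rows of 8 binary flags (hypothetical names).
rng = np.random.default_rng(0)
feature_names = [f"flag_{i}" for i in range(8)]
X = rng.integers(0, 2, size=(891, len(feature_names)))

# Made-up labelling rule plus 20% label noise, so there is something to overfit.
y = ((X[:, 0] == 0) & (X[:, 2] == 1)).astype(int)
flip = rng.random(len(y)) < 0.2
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Same algorithm twice: once unrestricted, once stopped at depth 3.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
full_tree.fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
pruned_tree.fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=3", pruned_tree)]:
    print(name, "| depth:", model.get_depth(), "| leaves:", model.get_n_leaves(),
          "| test accuracy:", round(model.score(X_test, y_test), 3))
```

A depth-3 tree has at most 8 leaves, which is what makes it small enough to read aloud on set.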

Simpler and much more readable. It has also improved the accuracy to 77% on the test set, as the smaller tree generalises better. Don’t get baffled by the <= 0.5 condition. Since all the variables in our data are binary flags, either 1 or 0, the condition <= 0.5 represents a flag that is turned off. ShootingInADifferentArea <= 0.5 means shooting is not happening in a different area, and the false branch of this condition means shooting is happening in a different area. A classic double negative example. Let’s now use this tree to make some predictions of our own.

If you are arranging a shoot on the terrace of the house where Simbu lives (ShootingInADifferentArea ≤ 0.5 is true), it is the third shoot of the week (2orMoreShootInTheSameWeek ≤ 0.5 is false), but you are shooting for a song (Song Shoot ≤ 0.5 is false), you can assume that STR will come. He may be late, but chances are good that he will come. You can continue your arrangements for the shoot.

If you are arranging a shoot in Maangadu (ShootingInADifferentArea ≤ 0.5 is false), it is a night shoot (Morning shoot ≤ 0.5 is true: a 0 flag for morning, i.e. night), and this is the second shoot in the same workweek (2orMoreShootInTheSameWeek ≤ 0.5 is false), you can pack up, wind up, go home and smoke up. Or else you can find someone like Mahat, who can be Simbu’s backup.

Because given these features of the shoot my Thalaivan Simbu will never show up.
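The two walkthroughs above can be written down as plain conditionals. This sketch covers only those two paths, because the rest of the tree is not spelled out in the text, and it returns `None` for everything else. The dictionary keys are assumed spellings of the flag names.

```python
def predict_from_described_paths(shoot):
    """Return 1 (comes), 0 (will not come), or None for paths not described in the text."""
    # A flag value above 0.5 means the flag is switched on.
    different_area = shoot["ShootingInADifferentArea"] > 0.5
    two_or_more = shoot["2orMoreShootInTheSameWeek"] > 0.5

    if not different_area:
        # Same area, busy week, song shoot: he is expected to come.
        if two_or_more and shoot["SongShoot"] > 0.5:
            return 1
        return None
    # Different area, night shoot, busy week: pack up.
    if shoot["MorningShoot"] <= 0.5 and two_or_more:
        return 0
    return None

terrace_song_shoot = {"ShootingInADifferentArea": 0, "MorningShoot": 1,
                      "SongShoot": 1, "2orMoreShootInTheSameWeek": 1}
maangadu_night_shoot = {"ShootingInADifferentArea": 1, "MorningShoot": 0,
                        "SongShoot": 0, "2orMoreShootInTheSameWeek": 1}

print(predict_from_described_paths(terrace_song_shoot))    # 1: continue the arrangements
print(predict_from_described_paths(maangadu_night_shoot))  # 0: pack up
```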

Written by Raghunandh GS