The Indecisive Decision Tree — Story of an emotional algorithm (1/2)

Jay Trivedi
5 min read · Jun 6, 2017

--

He survives… He survives not… Oh wait, he survives… Does he?

It was 1968. A PhD candidate at the University of Sydney was pondering over a profound idea proposed by a 14th century philosopher — William of Occam — called Occam’s Razor, which goes something like this: “Entities should not be multiplied unnecessarily.” His name was J. Ross Quinlan.

Quinlan published a paper in 1986 in the very first issue of the journal Machine Learning (vol. 1, no. 1), in which he presented an algorithm whose foundation rests on Occam’s Razor. It was called Iterative Dichotomiser 3 (ID3). This algorithm would go on to become one of the most famous of all, and nowadays it goes by the name decision tree.

This is part one of a two-part series. In part 1, we will briefly explore how decision trees work and where they don’t perform very well. In part 2, we will look at how random forests overcome these issues.

How Decision Trees Work:

The working of a decision tree is very intuitive. It is a supervised learning algorithm with a layered splitting process. At each layer, we try to split the population or sample into two or more groups so that observations within the same group are as similar to each other as possible (homogeneity) and the groups are as different from each other as possible (heterogeneity). A decision tree achieves this using splitting criteria such as the Gini index, chi-square and information gain.
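To make the idea of a good split concrete, here is a small, self-contained toy example in plain Python (not taken from the original scripts) that computes the information gain of a candidate split: the purer the resulting groups, the higher the gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy sample of 10 passengers: 1 = survived, 0 = did not survive.
parent = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# A candidate split (say, on sex) that yields two fairly homogeneous groups.
left, right = [1, 1, 1, 0], [1, 0, 0, 0, 0, 0]
print(information_gain(parent, [left, right]))  # ~0.26 bits; higher gain = better split
```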

Decision trees became quite popular as they are great at both classification and regression problems and can be trained very fast. However, there are some critical shortcomings of decision trees that any data scientist should be aware of:

One, they very easily overfit, and

Two, they can be very unstable.

I will explain both problems in detail with relevant examples. We will use the following two data sets — one for classification and one for regression — to evaluate the performance of decision trees:

[Classification] Titanic Survivors: Data containing info about the passengers on the Titanic ship

  • Objective: Predict if a passenger survived
  • Shape: 891 rows x 9 features

[Regression] House Prices: Data containing features of houses in Ames, Iowa and their final sale prices

  • Objective: Predict the final price of each home
  • Shape: 1460 rows x 79 features

I have performed data cleaning and manipulation, such as missing-value treatment and minimal encoding of categorical variables. In case you want to reproduce the results, here are the links to the scripts for cleaning the Titanic and House Prices datasets.

Though scikit-learn is the more popular Python library for machine learning, I used H2O (with its Python 3.6 client) for the following reasons:

  1. scikit-learn takes only numerical values, so categorical variables need encoding, which comes with its own set of problems; H2O handles categorical columns natively, as sketched after this list. (I am working on a story on encoding of categorical variables as well.)
  2. As per this benchmarking exercise, H2O has been found to outperform scikit-learn when it comes to decision trees and random forests.
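On the first point: H2O parses string columns as enum (categorical) types when a file is imported, so they can be fed straight into its tree models without one-hot or label encoding. A minimal sketch, assuming a hypothetical path to the cleaned Titanic file:

```python
import h2o

h2o.init()  # start (or attach to) a local H2O cluster

# Hypothetical path to the cleaned Titanic data set described above.
titanic = h2o.import_file("titanic_clean.csv")

# String columns such as Sex or Title come back as 'enum' (categorical) types,
# so no manual encoding is needed before training a tree model.
print(titanic.types)               # e.g. {'Sex': 'enum', 'Age': 'real', ...}
print(titanic["Sex"].isfactor())   # [True]
```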

So let’s get started.

Decision Trees Overfit:

Decision trees are known to overfit. Overfitting occurs when there is a significant difference between a model’s performance on the training set and on the validation set. In the case of decision trees, overfitting increases as the depth of the tree increases.

Why Overfitting Happens:

A data set contains a mix of signal and noise. An algorithm is expected to learn the signal. However, as an algorithm’s complexity increases, it learns not only the signal but the noise as well. It therefore produces exceedingly good results on the data it was trained on but fails to generalise, performing poorly on previously unseen (validation) data. The complexity of a decision tree increases with its depth, and so does overfitting.

I trained decision trees of increasing depth on both data sets and charted the performance on the training and validation sets.

Titanic:

Titanic Data Set: Accuracy for the validation set drops significantly as tree depth increases, suggesting overfitting

As is apparent from the figure, there is a severe gap between the model’s performance on the training set and on the validation set. As the depth of the tree increases, the model accuracy (the evaluation metric) on the training set approaches 95%, whereas the accuracy on the validation set hovers somewhere near 70%.

Here is the code snippet I used to get the results.
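(A minimal sketch of that snippet rather than the exact gist: a single decision tree is approximated here with a one-tree H2O random forest that considers all columns at every split, and the file path, response column and 80/20 split are my assumptions.)

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
titanic = h2o.import_file("titanic_clean.csv")            # hypothetical path
titanic["Survived"] = titanic["Survived"].asfactor()      # treat survival as a class label
train, valid = titanic.split_frame(ratios=[0.8], seed=42)
predictors = [c for c in titanic.columns if c != "Survived"]

for depth in range(1, 21):
    # ntrees=1 + mtries=-2 (all columns) + sample_rate=1.0 behaves like a single decision tree
    tree = H2ORandomForestEstimator(ntrees=1, mtries=-2, sample_rate=1.0,
                                    max_depth=depth, seed=42)
    tree.train(x=predictors, y="Survived", training_frame=train, validation_frame=valid)
    train_acc = tree.model_performance(train=True).accuracy()[0][1]   # accuracy at best threshold
    valid_acc = tree.model_performance(valid=True).accuracy()[0][1]
    print(depth, round(train_acc, 3), round(valid_acc, 3))
```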

House Prices:

House Prices Data Set: RMSE for the validation set is significantly higher at large tree depths, suggesting overfitting

Similarly, for the House Prices data set, where the metric is root mean squared error (RMSE), the RMSE for the training set approaches zero as the depth of the tree increases. However, the RMSE for the validation set remains slightly below 2.

Here is the code snippet for House Prices.
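(Again a sketch rather than the exact gist; it mirrors the Titanic version with RMSE in place of accuracy, and SalePrice is an assumed response column name.)

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
houses = h2o.import_file("house_prices_clean.csv")        # hypothetical path
train, valid = houses.split_frame(ratios=[0.8], seed=42)
predictors = [c for c in houses.columns if c != "SalePrice"]

for depth in range(1, 21):
    tree = H2ORandomForestEstimator(ntrees=1, mtries=-2, sample_rate=1.0,
                                    max_depth=depth, seed=42)
    tree.train(x=predictors, y="SalePrice", training_frame=train, validation_frame=valid)
    print(depth,
          round(tree.model_performance(train=True).rmse(), 3),
          round(tree.model_performance(valid=True).rmse(), 3))
```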

Hence, it is clear from both experiments that decision trees overfit. Now, let’s move on to the next issue.

Decision Trees Are Unstable:

A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly. — Wikipedia

Decision trees are inherently unstable: with a slight change in the input, not only does the output change, but a completely different tree structure can emerge. To verify this hypothesis, I performed the following experiment.

I trained 10 decision trees with the exact same hyper-parameters but 10 different seed values, which ensures that each tree trains on a slightly different sample. Let’s look at the top three most important variables, which reflect the structure of the tree, obtained for both data sets.

Titanic:

As can be clearly inferred, the tree is very unstable, as the most important factors vary significantly across seeds. For example, at seed 0 (zero), the three most important features are Title, Fare and Age, in that order; for seed 2304, they become Sex, Pclass and Fare.

Code snippet:
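(A sketch of the idea rather than the original gist: one single-tree model per seed, printing the top three variables by importance. The seed values and file path are illustrative.)

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
titanic = h2o.import_file("titanic_clean.csv")            # hypothetical path
titanic["Survived"] = titanic["Survived"].asfactor()
predictors = [c for c in titanic.columns if c != "Survived"]

for seed in [0, 42, 137, 2304]:                           # illustrative seed values
    # A different split seed gives each tree a slightly different training sample.
    train, valid = titanic.split_frame(ratios=[0.8], seed=seed)
    tree = H2ORandomForestEstimator(ntrees=1, mtries=-2, sample_rate=1.0,
                                    max_depth=10, seed=seed)
    tree.train(x=predictors, y="Survived", training_frame=train, validation_frame=valid)
    top3 = [name for name, *_ in tree.varimp()[:3]]       # top three variables by importance
    print(seed, top3)
```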

House Prices:

Similarly, for the House Prices data set, the tree is quite unstable, as the most important variables change considerably across seeds.

Code snippet:
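(Same pattern as the Titanic snippet, with the House Prices frame and an assumed SalePrice response column.)

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()
houses = h2o.import_file("house_prices_clean.csv")        # hypothetical path
predictors = [c for c in houses.columns if c != "SalePrice"]

for seed in [0, 42, 137, 2304]:
    train, valid = houses.split_frame(ratios=[0.8], seed=seed)
    tree = H2ORandomForestEstimator(ntrees=1, mtries=-2, sample_rate=1.0,
                                    max_depth=10, seed=seed)
    tree.train(x=predictors, y="SalePrice", training_frame=train, validation_frame=valid)
    print(seed, [name for name, *_ in tree.varimp()[:3]])
```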

Food For Thought:

The fact that decision trees are unstable is a manifestation of overfitting.

A decision tree is unstable because training it on a slightly different sub-sample causes the structure of the tree to change drastically. The tree overfits by learning from the noise as well and optimises for that particular sample, which causes the order of its variable importances to change significantly.

Key Takeaways:

  1. Decision trees overfit
  2. Decision trees are unstable
  3. Random Forests are here to save the day (Stay tuned for part 2)

In the next part we will look at how Random Forest tackles these issues to become one of the most widely used supervised learning algorithms in modern times.

Until next time.
