Decision Trees: Which feature to split on?

Khyati Mahendru
Published in Analytics Vidhya · Jun 19, 2019 · 5 min read

Decision Tree classification is perhaps one of the most intuitive and easy-to-interpret classification algorithms we have today. It can easily model non-linear relationships and results in predictive models that are quite accurate and stable.

Learning in Decision Tree Classification has the following key features:

  • We recursively split our population into two or more sub-populations based on a feature. This can be visualized as a tree, with its root node representing the entire population and its subnodes representing the subpopulations
  • With each subnode that is added to the tree, the homogeneity or purity increases with respect to the target variable (or class)
  • Ideally, the leaf nodes are completely pure, i.e., they contain objects of a single class

In this article, I will discuss the following metrics used for deciding which feature to split on:

  1. Gini Index
  2. Entropy

Before proceeding, I strongly recommend you read this complete tutorial on Tree-based modeling from scratch (in R and Python).

We will use the two metrics on the following example: a population of 50 people, split once by Gender (20 males and 30 females) and once by Age (25 adults and 25 minors), where the target is whether the person uses a computer daily.

Note: We are using only categorical variables for a better illustration. However, continuous variables are also possible. In the case of a continuous variable, we need to find a threshold value for splitting. For example, age ≤ 18 and age > 18 is a possible split if age were a continuous variable.

Gini Index

Gini Index is used with binary splits, where one class can be considered a success and the other a failure.

A higher value of the Gini Index indicates more homogeneity in the sub-nodes.

The Gini Index for a split is calculated in two steps:

  1. For each subnode, calculate Gini as p² + q², where p is the probability of success and q of failure
  2. Then for the split, Gini is equal to the weighted average of the Gini scores of the subnodes (see the sketch below)
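Here is a rough Python sketch of these two steps. The helper names gini_node and gini_split are mine, not from any library.

    def gini_node(p, q):
        """Gini score of a single subnode: p^2 + q^2 (higher means more homogeneous)."""
        return p ** 2 + q ** 2

    def gini_split(subnodes):
        """Weighted average of the subnode Gini scores.

        `subnodes` is a list of (size, p, q) tuples, one per subnode.
        """
        total = sum(size for size, _, _ in subnodes)
        return sum(size * gini_node(p, q) for size, p, q in subnodes) / total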

Let us calculate the Gini Scores for our example.

Here, our success class is defined as using the computer daily. The corresponding probabilities of success and failure are calculated for both subnodes as shown.

Then Gini for Male subnode = 0.5² + 0.5² = 0.5

Gini for Female subnode = 0.33² + 0.67² = 0.56

Finally, Gini for the split on gender = (20 x 0.5 + 30 x 0.56) / 50 = 0.536

Similarly, we calculate the Gini for splitting on our categorical age variable.

Gini for Adult subnode = 0.48² + 0.42² = 0.4068

Gini for Minor subnode = 0.32² + 0.68² = 0.5648

Thus, Gini for split on age = (25 x 0.4068 + 25 x 0.5648) / 50 = 0.4858
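Plugging the example's (rounded) probabilities into the sketch above reproduces these numbers; the tiny differences come from rounding the node scores before averaging.

    # Gender split: 20 males (p = 0.5), 30 females (p = 0.33)
    gini_gender = gini_split([(20, 0.5, 0.5), (30, 0.33, 0.67)])    # ≈ 0.535

    # Age split: 25 adults and 25 minors, probabilities as in the example
    gini_age = gini_split([(25, 0.48, 0.42), (25, 0.32, 0.68)])     # ≈ 0.486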

Since the Gini for the split on gender is greater than for the split on age, gender is chosen as the feature to split on. The resulting subnodes, Males and Females, divide the population into more homogeneous subpopulations.

You might come across the term Gini Impurity. It is defined as:

Gini Impurity = 1 - Gini

Thus equivalently, we need to find the feature that minimizes the Gini Impurity of the split.

We can easily implement Decision Trees with the Gini Index using the sklearn library in Python. This is done by setting the criterion parameter to ‘gini’ in DecisionTreeClassifier().
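A minimal sketch of what that can look like, using the Iris dataset shown in the figure below:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    X, y = load_iris(return_X_y=True)

    # criterion="gini" is the default; it is set explicitly here for clarity
    clf = DecisionTreeClassifier(criterion="gini", random_state=0)
    clf.fit(X, y)

    # plot_tree displays the gini value at each node
    plot_tree(clf, filled=True)
    plt.show()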

Decision Tree for the Iris Dataset with gini value at each node

Entropy

Entropy is a metric frequently used to measure the uncertainty in a distribution. For a binary classification with probability of success p and of failure q, entropy is calculated as follows:

Entropy(p, q) = -p log2(p) - q log2(q), where log2(.) is the logarithm with base 2.

Since we want our subnodes to be homogeneous, we want to split on the feature that gives the minimum value for entropy.

Here’s how you calculate the value for the entropy of a split:

  1. For each subnode, calculate entropy using the formula mentioned above.
  2. Then the entropy for the split is the weighted average of the entropies of all the subnodes (sketched below).
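A matching Python sketch for entropy (again, the helper names are mine, not from any library):

    from math import log2

    def entropy_node(p, q):
        """Entropy of a single subnode, using the convention 0 * log2(0) = 0."""
        return sum(-x * log2(x) for x in (p, q) if x > 0)

    def entropy_split(subnodes):
        """Weighted average of subnode entropies; `subnodes` is a list of (size, p, q)."""
        total = sum(size for size, _, _ in subnodes)
        return sum(size * entropy_node(p, q) for size, p, q in subnodes) / total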

Let’s calculate the entropy values for our example.

Split on Gender:

Entropy for the Male subnode = -0.5 x log2(0.5)-0.5 x log2(0.5) = 1

Entropy for the Female subnode = -0.33 x log2(0.33)-0.67 x log2(0.67) = 0.91

Then, entropy for split on gender = (20 x 1 + 30 x 0.91) / 50 = 0.946

Split on Age:

Entropy for the Adult subnode = -0.48 x log2(0.48)-0.42 x log2(0.42) = 1.033

Entropy for the Minor subnode = -0.32 x log2(0.32)-0.68 x log2(0.68) = 0.904

Then, entropy for split on age = (25 x 1.033 + 25 x 0.904) / 50 = 0.968
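With the entropy helpers sketched earlier, the same comparison falls out; the small differences come from rounding the node entropies before averaging.

    entropy_gender = entropy_split([(20, 0.5, 0.5), (30, 0.33, 0.67)])    # ≈ 0.949
    entropy_age = entropy_split([(25, 0.48, 0.42), (25, 0.32, 0.68)])     # ≈ 0.969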

Again, we find that Gender is a better feature for the split than Age.

Just like with Gini, DecisionTreeClassifier() can use the ‘entropy’ criterion to decide the splits.
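Reusing X, y and the imports from the Gini snippet above, only the criterion changes (a sketch, not verbatim from the article):

    clf_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0)
    clf_entropy.fit(X, y)

    # plot_tree now shows the entropy value at each node
    plot_tree(clf_entropy, filled=True)
    plt.show()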

Decision Tree for the Iris Dataset with entropy value at each node

End Notes

It is not coincidental that the Gini Index and Entropy suggest the same feature to split on. The two objectives are:

  1. Minimize Gini Impurity = 1- (p² + q²)
  2. Minimize Entropy = -plog2(p)-qlog2(q)

These are more or less the same. However, since entropy requires computing logarithms, it can be slower in some cases. Generally, though, the two criteria will give similar results.
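To see how similar the two objectives are, here is a small illustrative sketch (not from the original article) evaluating both over the range of p:

    from math import log2

    def gini_impurity(p):
        q = 1 - p
        return 1 - (p ** 2 + q ** 2)

    def entropy(p):
        q = 1 - p
        return sum(-x * log2(x) for x in (p, q) if x > 0)

    # Both are 0 for pure nodes (p = 0 or 1) and peak at p = 0.5
    for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
        print(f"p={p:.2f}  gini_impurity={gini_impurity(p):.3f}  entropy={entropy(p):.3f}")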
