A-Z of Decision Trees

Part 2: Information Gain, Gini Index, Overfitting and Underfitting

Published in Analytics Vidhya · 5 min read · Sep 21, 2019

Now that we have learned a bit about Decision Trees, let’s delve deeper into some related concepts.

Information Gain: the name is misleading

Information Gain is actually the expected reduction in entropy caused by partitioning the dataset on an attribute while building the decision tree. We call it a ‘gain’, however, because we measure it relative to a collection of samples (S).

IG of an attribute A for a sample S is defined as below.

Fig 1 : Information Gain G(S,A)
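Written out, the expression in Fig 1 is the standard definition (the second term is the weighted entropy discussed below):

$$ IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v) $$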

Values(A) is the set of all possible values for attribute A. ‘Sv’ is the subset of S for which attribute A has value v. Let’s understand this with an example.

Fig 2 : Calculating IG

Let us divide our sample dataset (S) into 3 subsets based on the ‘Outlook’ attribute. If we calculate the individual entropies of these subsets, we get E = {0.97, 0, 0.97}.

The second component in Fig 1 is called the ‘Weighted entropy’ and we will calculate it as follows.

(5/14 * 0.97) + (4/14 * 0) + (5/14 * 0.97) = 0.6928, where there are 5 points in the 1st and 3rd subsets and 4 points in the 2nd subset, out of 14 points in total.

Now we can easily find the Information Gain as 0.94 - 0.69 = 0.25. Refer to Fig 6 of Part 1 of this article to see how we got 0.94.
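A minimal Python sketch of this calculation, assuming the play-tennis class counts from Part 1 (9 ‘Yes’ and 5 ‘No’ overall, with Outlook subsets Sunny = 2+/3-, Overcast = 4+/0-, Rain = 3+/2-):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a subset with `pos` positive and `neg` negative labels."""
    if pos == 0 or neg == 0:
        return 0.0
    total = pos + neg
    p, q = pos / total, neg / total
    return -(p * math.log2(p) + q * math.log2(q))

parent = entropy(9, 5)                       # entropy of the full sample S, ~0.94

# Outlook subsets: Sunny (2+, 3-), Overcast (4+, 0-), Rain (3+, 2-)
subsets = [(2, 3), (4, 0), (3, 2)]
weighted = sum((p + n) / 14 * entropy(p, n) for p, n in subsets)   # ~0.69

info_gain = parent - weighted                # ~0.25
print(round(parent, 2), round(weighted, 2), round(info_gain, 2))
```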

Significance of Information Gain

Now, if we repeat this exercise with attributes other than ‘Outlook’, we get different values of IG. Below is a sample showing the same.

Fig 3 : IG with other attributes

Clearly, there is significant IG if we use ‘Outlook’ as our root node and grow the tree from there.

Thus, IG tells us how important an attribute is, i.e., which attribute in a set of features is most useful for discriminating between the classes.

Gini Impurity

Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset. It’s calculated as

Fig 4 : Gini Impurity

where C is the number of classes and p(i) is the probability of randomly picking an element of class i.
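Written out, the formula in Fig 4 is the standard Gini Impurity:

$$ G = \sum_{i=1}^{C} p(i)\,(1 - p(i)) = 1 - \sum_{i=1}^{C} p(i)^2 $$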

When training a decision tree, the best split is chosen by maximizing the Gini Gain, which is calculated by subtracting the weighted impurities of the branches from the original impurity.

For a binary dataset split evenly between the two classes, the Gini Impurity takes its maximum value of 0.5; a perfectly pure node, on the other hand, has a Gini Impurity of 0.
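As a rough sketch, here is the same Outlook split scored with Gini Gain instead of IG (using the same assumed play-tennis counts as before):

```python
def gini(counts):
    """Gini Impurity of a node, given the list of class counts at that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = gini([9, 5])                                    # ~0.459
subsets = [[2, 3], [4, 0], [3, 2]]                       # Outlook = Sunny / Overcast / Rain
weighted = sum(sum(s) / 14 * gini(s) for s in subsets)   # ~0.343
gini_gain = parent - weighted                            # ~0.116
print(round(gini_gain, 3))
```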

Constructing a Decision Tree

Fig 5 : DT for playing tennis dataset

As explained in the previous sections, Information Gain plays a very important part in building decision trees.

✒︎ Since Fig 3 suggests that the ‘Outlook’ attribute has the highest IG, we will split the dataset with this as the ‘root node’.

✒︎ In the next step, we recalculate the IG for all the remaining features, excluding the one already used as the root node.

✒︎ We keep splitting the data like this until we reach an explicitly set depth limit (max_depth) or until only pure leaf nodes remain, as in the sketch below.
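Libraries such as scikit-learn implement this greedy, recursive splitting for us. Here is a minimal sketch on a play-tennis-style dataset (the rows, column names and encoding below are illustrative, not taken from the article):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Illustrative subset of a play-tennis-style dataset; the real one has 14 rows.
df = pd.DataFrame({
    "Outlook":  ["Sunny", "Overcast", "Rain", "Sunny"],
    "Humidity": ["High", "Normal", "High", "Normal"],
    "Wind":     ["Weak", "Strong", "Weak", "Strong"],
    "Play":     ["No", "Yes", "Yes", "Yes"],
})

X = OrdinalEncoder().fit_transform(df[["Outlook", "Humidity", "Wind"]])
y = df["Play"]

# criterion="entropy" makes the splits IG-based; max_depth limits the growth.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.predict(X))
```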

Overfitting and max_depth

To avoid overfitting our data with Decision Trees, it is often necessary to choose a maximum level beyond which we stop splitting the data. This limit is called max_depth.

When a DT reaches max_depth, we take a majority vote of the class labels at that node to reach a conclusion.

If the depth of a model is very high, the interpretability of the model takes a severe hit, as there are a lot of if-then-else conditions to traverse before reaching the leaf nodes.
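To see these if-then-else conditions concretely, scikit-learn can print the learned rules; a small self-contained sketch on the Iris dataset (chosen here only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each printed rule is one if-then-else path from the root to a leaf;
# deeper trees produce many more such paths and become harder to read.
print(export_text(clf, feature_names=list(iris.feature_names)))
```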

Underfitting and Decision Stump

It might also happen that we set max_depth very low, so that at a very shallow level we have to take a majority vote at the leaf nodes.

Such leaf nodes won’t be ‘pure nodes’ (they contain both positive and negative class labels), and hence we have to take a majority vote to predict the class label. This can lead to underfitting in the model.

A decision tree with max_depth = 1 is called a Decision Stump.

The default value of max_depth depends on the implementation (some libraries default to 30, while scikit-learn places no limit by default), but in practice it should be chosen using cross validation techniques.
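One common way to pick max_depth by cross validation, sketched with scikit-learn’s GridSearchCV (the dataset and candidate depths are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try a range of depths and keep the one with the best cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```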

Use cases of Decision Trees

The basic intuition behind Decision Trees is used in practical algorithms like Random Forest; however, there are certain things we should know about data preparation before applying a Decision Tree.

Imbalanced Dataset: It is always advisable to convert an imbalanced dataset to a balanced one before applying a DT algorithm, for example by using upsampling techniques; otherwise the imbalance skews the entropy calculations.
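One simple way to upsample the minority class before training, sketched with scikit-learn’s resample utility (the toy labels below are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced data: many 'No' rows, few 'Yes' rows.
df = pd.DataFrame({"feature": range(10),
                   "label": ["No"] * 8 + ["Yes"] * 2})

majority = df[df["label"] == "No"]
minority = df[df["label"] == "Yes"]

# Sample the minority class with replacement until both classes are the same size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```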

Curse of dimensionality: When there is a large number of dimensions, calculating the IG for each feature in each iteration increases the complexity of the algorithm. For categorical features, it is advisable to avoid one-hot encoding, as it expands the feature set; categorical features should instead be converted to numerical features.
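To illustrate the difference in feature-set size (the encoders are scikit-learn’s; the column is a toy example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"Outlook": ["Sunny", "Overcast", "Rain", "Sunny"]})

# One-hot encoding expands one categorical column into one column per category ...
print(OneHotEncoder().fit_transform(df).shape)    # (4, 3)

# ... while ordinal encoding keeps it as a single numerical column.
print(OrdinalEncoder().fit_transform(df).shape)   # (4, 1)
```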

Multi-class classification: Multi-class classification doesn’t change the nature of decision trees, unlike some other classification algorithms where we have to use a ‘one vs rest’ technique. In decision trees, we can still arrive at pure nodes for more than 2 class labels. In case we reach max_depth before a pure node, we simply take a majority vote on the class labels at that node to determine the predicted class.
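A minimal sketch of this behaviour (toy data and labels assumed for illustration): a single tree handles three classes directly, and a depth-limited tree falls back to a majority vote at impure leaves.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array(["cat", "cat", "cat", "dog", "dog", "bird"])   # three classes

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(stump.classes_)        # all three labels are known to the single model

# The right-hand leaf contains {dog, dog, bird}, so the majority vote is 'dog'.
print(stump.predict([[5]]))
```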

