Impurity and Information Gain part II

Adarsh Kumar Pandey
7 min read · Jan 27, 2022


Part 1 covered what a split is and what one needs to consider when splitting data to build a tree. In this blog, we will look at the splitting procedure in more detail.

Suppose we have two classes, C0 and C1, in a binary setting, and each has 10 records. For the sake of discussion, let’s say “Own Car”, “Car Type”, and “Student ID” are some of the feature columns. Now we want to build a tree model for this dataset. A few questions come to mind before we start building the tree:

  1. Where do we start? That is, which feature should we consider first for splitting?
  2. How do we know whether the selected feature is good for splitting?
  3. Should we use a binary split or a multi-way split? Which one is best?

To answer these questions, we need a mathematical quantity that tells us whether we are heading in the right direction.

Now it is time to introduce the terms “information gain” and “GINI gain” in their full spirit. But first, let’s look at the ways of measuring node impurity. There are three common measures of impurity:

  1. GINI Index
  2. Entropy
  3. Misclassification Error

Let’s go one by one and define some formulas.

GINI Index:

It is a measure of the total variance across the classes, and the formula is:

GINI(t) = 1 - Σⱼ [p(j | t)]²

where p(j | t) is the relative frequency (probability) of class j at node t.

The maximum value of the GINI index is (1 - 1/number of classes). So if we have 2 classes (a binary setting), then

GINI max value = (1 - 1/2) = 0.5; this occurs when records are equally distributed among the classes.

When records are equally distributed

Calculation of GINI index:

P(C1) = 3/6 = 0.5, P(C2) = 3/6 = 0.5

GINI = 1 - P(C1)² - P(C2)² = 1 - (0.5)² - (0.5)² = 0.500

GINI min value = 0.0; this means all records belong to a single class.

When records belong to a single class

Calculation:

P(C1) = 0/6 = 0, P(C2) = 6/6 = 1

Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0.000

This is the case when the records are homogeneous, i.e., all belong to a single class. The GINI index for any other distribution of records can be calculated in the same way.
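The two worked examples above can be checked with a small helper. This is a minimal sketch; the function name `gini` and the per-class count-list representation are my own choices, not something fixed by the formula:

```python
def gini(counts):
    """GINI index of a node, given the record count of each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([3, 3]))  # records equally distributed -> 0.5
print(gini([0, 6]))  # records all in one class    -> 0.0
```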

While we are talking about the GINI index, let me introduce yet another term, the GINI split; the formula for it is:

GINI split = Σᵢ (nᵢ/n) × GINI(i), summed over the k children

When node N is split into k partitions (children), the quality of that split is computed using this formula. Let’s not get bogged down in terminology here; in simple terms, it is the weighted sum of the GINI indices of the children.

Here:

nᵢ = number of records at child i

n = number of records at node N
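The formula translates directly into code. In this sketch each child is represented by its per-class record counts, as before:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_split(children):
    """Weighted sum of the children's GINI indices.
    children: one per-class count list per partition."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# A 3-way split where one child is pure lowers the weighted impurity:
print(round(gini_split([[3, 2], [4, 0], [2, 3]]), 2))  # 0.34
```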

The flow diagram below explains the steps to calculate the GINI index, GINI split, and GINI gain.

Let’s walk through the flow diagram.

In step 1, we calculate the GINI index of the parent node before the split; just plug the numbers into the formula.

In step 2, after the split, we count how many records of each class fall into each child node, then calculate the GINI index of each child.

In step 3, we calculate the GINI split (the weighted average).

Finally, in step 4, we calculate the GINI gain; the higher the value, the better the split. We repeat the same steps for every feature in the dataset and choose the feature with the highest GINI gain for the split, i.e., for the construction of the tree.
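The four steps map onto a short function; this is a sketch using the same count-list representation as before:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_gain(parent, children):
    # Step 1: GINI index of the parent node before the split.
    parent_gini = gini(parent)
    # Steps 2-3: GINI index of each child, combined into the weighted average.
    n = sum(parent)
    split = sum(sum(child) / n * gini(child) for child in children)
    # Step 4: the gain; the higher it is, the better the split.
    return parent_gini - split
```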

Let’s construct a tree for the below dataset:

Step 1: Calculate the GINI index for the entire dataset:

Total number of records = 14

The distribution of the target variable, i.e., Play Golf, is:

Yes = 9

No = 5

GINI index of the dataset = 1 - (9/14)² - (5/14)² ≈ 0.46
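A quick sanity check of that number:

```python
# 9 "Yes" and 5 "No" records out of 14 in total
p_yes, p_no = 9 / 14, 5 / 14
gini_dataset = 1 - p_yes**2 - p_no**2
print(round(gini_dataset, 2))  # 0.46
```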

Step 2: Calculate the GINI index and GINI gain for each feature:

Gini(Sunny) = 1 - (3/5)² - (2/5)² = 0.48

Gini(Overcast) = 1 - (4/4)² - (0/4)² = 0

Gini(Rainy) = 1 - (2/5)² - (3/5)² = 0.48

Calculate the GINI split:

(5/14)*0.48 + (4/14)*0 + (5/14)*0.48 = 0.34

Now let’s calculate the Gini gain for the Outlook feature:

Gini gain = Gini of parent - weighted Gini of children = (0.46 - 0.34) = 0.12

Similarly, we can calculate the GINI gain for the rest of the features.

Gini for Temperature

Calculate GINI split for Temperature:

GINI split = (4/14)*0.5 + (6/14)*0.44 + (4/14)*0.375 = 0.143 + 0.189 + 0.107 ≈ 0.44

Gini gain = 0.46 - 0.44 = 0.02

Gini for Humidity

GINI split = (7/14)*0.489 + (7/14)*0.245 = 0.244 + 0.122 ≈ 0.37

Gini gain = 0.46 - 0.37 = 0.09

Gini for Windy

GINI split = (8/14)*0.375 + (6/14)*0.5 = 0.214 + 0.214 ≈ 0.43

GINI gain = 0.46 - 0.43 = 0.03

As we can see, the feature Outlook has the highest GINI gain, i.e., 0.12, so let’s start building the tree from Outlook.
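Putting the four feature evaluations together: the per-value class counts below are read off the calculations above (the dictionary layout and helper names are my own), and picking the feature with the maximum gain reproduces the choice of Outlook:

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_gain(parent, children):
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)

parent = [9, 5]  # Yes, No
features = {
    "Outlook":     [[3, 2], [4, 0], [2, 3]],  # Sunny, Overcast, Rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Windy":       [[6, 2], [3, 3]],          # False, True
}
gains = {name: round(gini_gain(parent, ch), 2) for name, ch in features.items()}
print(gains)  # Outlook: 0.12, Temperature: 0.02, Humidity: 0.09, Windy: 0.03
print(max(gains, key=gains.get))  # Outlook
```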

Outlook as the root

Since Overcast has a GINI index of 0, it becomes a leaf node and does not require further splitting, but Sunny and Rainy do.

Now our dataset looks like this :

Dataset after splitting on Outlook

As we can see, the parent node has changed: Rainy and Sunny are the new parent nodes. Let’s calculate the GINI index of the parent in each branch.

GINI index of Sunny

Consider the rows which have Outlook = Sunny (3 Yes, 2 No), so the parent GINI for this branch is 1 - (3/5)² - (2/5)² = 0.48.

Gini Index for Humidity given Outlook = Sunny is

Gini(High) = 1- (1/2)²-(1/2)² = 0.5

Gini(Normal) = 1-(2/3)²-(1/3)² = 0.44

GINI split = (2/5)*0.5 + (3/5)*0.44 ≈ 0.47

GINI gain = 0.48 - 0.47 = 0.01

Gini Index for Windy given Outlook = Sunny is

Gini(True) = 1- (2/2)²-(0/2)² = 0.0

Gini(False) = 1-(3/3)²-(0/3)² = 0.0

GINI split = (2/5)*0.0 + (3/5)*0.0 = 0.0

GINI gain = 0.48 - 0.0 = 0.48

Gini Index for Temperature given Outlook = Sunny is

Gini(Mild) = 1- (2/3)²-(1/3)² = 0.44

Gini(Cool) = 1-(1/2)²-(1/2)² = 0.5

GINI split = (3/5)*0.44 + (2/5)*0.5 ≈ 0.47

GINI gain = 0.48 - 0.47 = 0.01

For the Sunny branch, Windy has the highest GINI gain, so we use it as the split condition.
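The same count-based helper confirms the choice for the Sunny branch. The per-value counts are read off the Gini values above (for Windy, both children are pure, which is what makes the gain equal to the parent GINI):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_gain(parent, children):
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)

sunny = [3, 2]  # class counts within the Sunny branch
gains = {
    "Humidity":    gini_gain(sunny, [[1, 1], [2, 1]]),  # High, Normal
    "Windy":       gini_gain(sunny, [[0, 2], [3, 0]]),  # True, False (both pure)
    "Temperature": gini_gain(sunny, [[2, 1], [1, 1]]),  # Mild, Cool
}
print(max(gains, key=gains.get))  # Windy
```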

Similarly, we have to calculate the GINI index for Rainy:

GINI index of Rainy

Consider the rows which have Outlook = Rainy (2 Yes, 3 No), so the parent GINI for this branch is 1 - (2/5)² - (3/5)² = 0.48.

Gini Index for Humidity given Outlook = Rainy is

Gini(High) = 1- (3/3)²-(0/3)² = 0.0

Gini(Normal) = 1-(2/2)²-(0/2)² = 0.0

GINI split = (3/5)*0.0 + (2/5)*0.0 = 0.0

GINI gain = 0.48 - 0.0 = 0.48

Gini Index for Windy given Outlook = Rainy is

Gini(True) = 1- (1/2)²-(1/2)² = 0.5

Gini(False) = 1-(1/3)²-(2/3)² = 0.44

GINI split = (2/5)*0.5 + (3/5)*0.44 ≈ 0.47

GINI gain = 0.48 - 0.47 = 0.01

Gini Index for Temperature given Outlook = Rainy is

Gini(Mild) = 1- (1/2)²-(1/2)² = 0.5

Gini(Cool) = 1-(1/1)²-(0/1)² = 0.0

Gini(Hot) = 1- (0/2)²- (2/2)² = 0.0

GINI split = (2/5)*0.5 + (1/5)*0.0 + (2/5)*0.0 = 0.20

GINI gain = 0.48 - 0.20 = 0.28

For the Rainy branch, Humidity has the highest GINI gain, so we use it as the split condition.
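And the same check for the Rainy branch, with counts read off the calculations above (here Humidity is the feature with both children pure):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def gini_gain(parent, children):
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)

rainy = [2, 3]  # class counts within the Rainy branch
gains = {
    "Humidity":    gini_gain(rainy, [[0, 3], [2, 0]]),          # High, Normal (both pure)
    "Windy":       gini_gain(rainy, [[1, 1], [1, 2]]),          # True, False
    "Temperature": gini_gain(rainy, [[1, 1], [1, 0], [0, 2]]),  # Mild, Cool, Hot
}
print(max(gains, key=gains.get))  # Humidity
```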

Once we split the Sunny and Rainy branches, the Tree looks like this:

Final Tree

Why is this the final tree?

Well, the Windy and Humidity splits both produce children with a GINI index of 0, so no further splitting is needed.

This concludes the GINI index; in the next post, we will discuss an alternative to the GINI index called entropy.
