Decision Trees

om pramod
5 min read · Jan 29, 2023


Part 4: Gini Index

The other way of splitting a decision tree is via the Gini Index, also known as Gini impurity. It is a measure of how mixed, or impure, a dataset is. Gini impurity ranges between 0 and 1 (more precisely, its maximum is 1 - 1/k for k classes, so 0.5 for a two-class problem), where 0 represents a pure dataset and larger values represent increasingly impure datasets. In a pure dataset, all the samples belong to the same class or category; an impure dataset contains a mixture of different classes. In decision tree algorithms, a pure dataset is ideal because it needs no further splitting, while an impure dataset is harder to separate. It follows that an attribute with a lower Gini index should be preferred: the lower the Gini impurity, the better the feature is for splitting the dataset.

The Gini impurity can be written in two equivalent forms:

Gini impurity = 1 - Σ (p(i))²

Gini impurity = Σ p(i) * (1 - p(i))

where p(i) is the probability of class i and the summation runs over all classes present in the dataset.

Both formulae are equivalent and give the same result, since Σ p(i) * (1 - p(i)) = Σ p(i) - Σ (p(i))² = 1 - Σ (p(i))². The first form is slightly simpler to compute, so it is the one most commonly used.

For example, let’s consider a toy dataset with two classes “Yes” and “No” and the following class probabilities:

p(Yes) = 0.3 and p(No) = 0.7

Using the first formula for Gini impurity:

Gini impurity = 1 - (0.3)² - (0.7)² = 1 - 0.09 - 0.49 = 0.42

Using the second formula for Gini impurity:

Gini impurity = (0.3 * (1 - 0.3)) + (0.7 * (1 - 0.7)) = 0.21 + 0.21 = 0.42

As you can see, both formulae give the same result of 0.42.
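As a quick sanity check, both forms can be computed directly in Python (a minimal standalone snippet, with the toy probabilities above hard-coded):

# Quick check that both Gini formulations agree for p(Yes) = 0.3, p(No) = 0.7
probs = [0.3, 0.7]

# Form 1: 1 minus the sum of squared class probabilities
gini_1 = 1 - sum(p ** 2 for p in probs)

# Form 2: sum of p * (1 - p) over all classes
gini_2 = sum(p * (1 - p) for p in probs)

print(gini_1)  # ≈ 0.42
print(gini_2)  # ≈ 0.42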

In the example, the Gini impurity value of 0.42 means that there is a 42% chance of misclassifying a sample if we were to assign it a label picked at random according to the dataset’s class distribution. This means that the dataset is not completely pure, and there is some degree of disorder in it.

Explanation:

For example, if we have a dataset with two classes “Yes” and “No”, and the class probabilities are:

p(Yes) = 0.3 and p(No) = 0.7

If we were to randomly pick a sample from the dataset, there would be a 30% chance that the sample belongs to the “Yes” class and a 70% chance that it belongs to the “No” class. If we then assigned it a label picked at random according to that same distribution, there would be a 42% chance of assigning the wrong label.
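To make this interpretation concrete, here is a small, purely illustrative simulation; the estimate will fluctuate around 0.42:

import random

random.seed(0)

labels = ["Yes", "No"]
weights = [0.3, 0.7]  # class distribution of the toy dataset

trials = 100_000
errors = 0
for _ in range(trials):
    # Draw a "true" label and an independently guessed label,
    # both according to the same class distribution
    true_label = random.choices(labels, weights=weights)[0]
    guessed_label = random.choices(labels, weights=weights)[0]
    if guessed_label != true_label:
        errors += 1

print(errors / trials)  # ≈ 0.42, matching the Gini impurity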

Example:

Let’s consider a toy dataset of six records with four features (Age, Gender, Income, and Credit Score) and a class label.

The target class label is “Buys_insurance”, and it can take two values, “Yes” or “No”.

We want to determine the best feature to use as the root node for the decision tree. To do this, we will calculate the Gini impurity for each feature and select the feature with the lowest Gini impurity.

First, we will calculate the Gini impurity of the target class label “Buys_insurance” using the formula:

Gini impurity = 1 - (p(Yes))² - (p(No))²

Where p(Yes) and p(No) are the class probabilities.

In this dataset, the class probabilities are:

p(Yes) = 4/6 = 0.67 and p(No) = 2/6 = 0.33

so,

Gini impurity = 1 - (0.67)² - (0.33)² ≈ 0.44

Now, we will calculate the Gini impurity for each feature using the formula:

Gini impurity = 1 - (p(Feature=Value1))² - (p(Feature=Value2))² - … - (p(Feature=ValueN))²

  1. Gini impurity for feature “Age”

p(Age=20) = 1/6 = 0.17

p(Age=25) = 1/6 = 0.17

p(Age=30) = 1/6 = 0.17

p(Age=35) = 1/6 = 0.17

p(Age=40) = 1/6 = 0.17

p(Age=45) = 1/6 = 0.17

Gini impurity for feature “Age” = 1 - (0.17)² - (0.17)² - (0.17)² - (0.17)² - (0.17)² - (0.17)² ≈ 0.83

2. Gini impurity for feature “Gender”

p(Gender=Male) = 3/6 = 0.5

p(Gender=Female) = 3/6 = 0.5

Gini impurity for feature “Gender” = 1 - (0.5)² - (0.5)² = 0.5

3. Gini impurity for feature “Income”

p(Income=High) = 3/6 = 0.5

p(Income=Medium) = 1/6 = 0.17

p(Income=Low) = 2/6 = 0.33

Gini impurity for feature “Income” = 1 - (0.5)² - (0.17)² - (0.33)² ≈ 0.61

4. Gini impurity for feature “Credit Score”

p(Credit Score=Excellent) = 3/6 = 0.5

p(Credit Score=Fair) = 2/6 = 0.33

p(Credit Score=Poor) = 1/6 = 0.17

Gini impurity for feature “Credit Score” = 1 - (0.5)² - (0.33)² - (0.17)² ≈ 0.61

From the above calculations, the feature “Gender” has the lowest Gini impurity (0.5, compared with roughly 0.61 for “Income” and “Credit Score” and 0.83 for “Age”), so by this criterion it would be selected as the root node of the decision tree. Keep in mind that this per-feature calculation, which looks only at how a feature’s own values are distributed, is a simplified illustration: a full decision-tree algorithm such as CART evaluates each candidate split by the weighted average Gini impurity of the target label within the resulting subsets and chooses the split that reduces impurity the most.
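As a cross-check on the arithmetic above, here is a small standalone Python sketch (not part of the original walkthrough) that recomputes the same per-feature Gini impurities from the value counts listed above:

def gini_from_counts(counts):
    """Gini impurity of a discrete distribution given raw value counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Value counts taken from the worked example above (6 records in total)
value_counts = {
    "Buys_insurance": [4, 2],               # Yes, No
    "Age":            [1, 1, 1, 1, 1, 1],   # six distinct ages
    "Gender":         [3, 3],               # Male, Female
    "Income":         [3, 1, 2],            # High, Medium, Low
    "Credit Score":   [3, 2, 1],            # Excellent, Fair, Poor
}

for name, counts in value_counts.items():
    print(f"{name}: {gini_from_counts(counts):.2f}")

# Expected output (rounded):
# Buys_insurance: 0.44
# Age: 0.83
# Gender: 0.50
# Income: 0.61
# Credit Score: 0.61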

Python Implementation:

Here is an example of how you can use Gini impurity to determine the best feature for splitting in a decision tree, using the scikit-learn library in Python and the Iris dataset as an example:

import numpy as np
from sklearn import datasets
from sklearn import tree

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Initialize a dictionary to hold the Gini impurity for each feature
gini_impurities = {}

# Loop through each feature
for i in range(X.shape[1]):
    # Fit a decision tree classifier using only the current feature
    clf = tree.DecisionTreeClassifier(criterion='gini')
    clf = clf.fit(X[:, i].reshape(-1, 1), y)

    # Predicted class probabilities for each sample (shape: n_samples x n_classes)
    prob = clf.predict_proba(X[:, i].reshape(-1, 1))

    # Average Gini impurity of the predicted class distributions
    gini_impurities[i] = np.mean(1 - np.sum(prob ** 2, axis=1))

# Find the feature with the lowest Gini impurity
best_feature = min(gini_impurities, key=gini_impurities.get)
print(f"Best Feature: {best_feature}")

In this code, we loop through each feature, fit the decision tree classifier on that feature alone, and compute the Gini impurity from the probabilities returned by the predict_proba method: for each sample we take 1 minus the sum of its squared class probabilities, then average over all samples. We store the Gini impurity of each feature in a dictionary where the key is the feature index and the value is the Gini impurity, and then use the min function to find the feature with the lowest Gini impurity.

Best Feature: 2

Note: we use the predict_proba method to get the probability of each class for each sample.

It returns an array of shape (n_samples, n_classes), where n_samples is the number of samples and n_classes is the number of classes in the dataset. Each element in the array is the probability that the corresponding sample belongs to the corresponding class. For example, with 5 samples and 3 classes, the output would be a 5×3 probability array, where each row contains the probabilities of one sample belonging to each class.

For example, the first row of the output, prob[0], represents the probability of the first sample belonging to each class (class 0, class 1, class 2). A row like:

[0. 0. 1.]

means that the sample is predicted to belong to class 2 (virginica) with probability 1 and to class 0 or class 1 with probability 0.

This information is used to calculate the Gini impurity for the feature: for each sample we sum the squares of its class probabilities, subtract the result from 1, and then average over all samples.

Note that this example is only for demonstration purposes. In practice, scikit-learn’s decision tree implementation already takes care of finding the best feature for each split, and the feature_importances_ attribute gives feature importances that you can use to judge which features matter most.
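For completeness, here is a minimal sketch of that standard workflow; the exact importance values depend on the data and the tree’s hyperparameters:

from sklearn import datasets, tree

# Fit a single tree on all four Iris features and inspect its importances
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(criterion='gini', random_state=0)
clf.fit(iris.data, iris.target)

# feature_importances_ reflects the (normalized) total Gini impurity
# reduction contributed by each feature across the tree's splits
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")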

Final Note: Thanks for reading! I hope you find this article informative.

Don’t forget to read: Decision Trees — Part 5
