Gini Impurity vs Gini Importance vs Mean Decrease Impurity

Aneesha B Soman
6 min read · Oct 18, 2023


Gini Impurity

Imagine you have a big jar of marbles, and you want to know how mixed up the marbles are in terms of their colors. Gini impurity is a way to measure this. Here’s how it works:

  1. You start by looking at one marble at a time. For each marble, you ask, “What is the chance that I randomly pick this marble and it’s a different color?” If all the marbles are the same color, the chance is low, so the Gini impurity is low. But if the marbles are of various colors and evenly mixed, the chance is high, so the Gini impurity is high.
  2. Gini impurity is like a measure of how messy or mixed up the marbles are in the jar. If they are all the same color, it’s not messy (low Gini impurity), and if they are a mix of colors, it’s messy (high Gini impurity).

So, Gini impurity is a measure of how mixed or impure things are. In the context of machine learning, it’s often used to decide how good a decision tree is at splitting data into different categories. The goal is to find the best way to separate data into groups that are as pure as possible (low Gini impurity), which makes it easier to make accurate predictions.
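The marble-jar intuition fits in a few lines of Python. This is a minimal sketch (the function name is my own) that measures Gini impurity from a list of class labels:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["red"] * 10))                # all one color -> 0.0
print(gini_impurity(["red"] * 5 + ["blue"] * 5))  # evenly mixed  -> 0.5
```

A jar of identical marbles scores 0 (perfectly pure), and an even two-color mix scores 0.5, the maximum for two classes.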

Gini Importance

Now, think about a different situation. You have a group of friends, and you want to know which friend has the most influence in making decisions. Gini importance helps you figure this out. Here’s how it works:

You have a list of things each friend is good at:

Friend A: good at singing

Friend B: good at playing drums

Friend C: good at dancing

Friend D: good at football

Gini importance helps you decide which friend’s skill is the most important in making decisions. Say the group is deciding to perform a song at an event, which would involve singing and instruments. If one friend’s skills strongly influence the group’s decisions, that friend’s Gini importance is high. If all friends have equal influence, the Gini importance is low.

So here we can see that Friends A and B would have high Gini importance.

The main difference between Gini impurity and Gini importance is what they measure:

  • Gini Impurity measures how mixed up or messy things are, like the variety of colors in a marble jar.
  • Gini Importance measures how important or influential something is in making decisions, like which friend’s skills matter the most in your group.
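In practice, Gini importance is what tree ensembles report as their feature importances. As a sketch, assuming scikit-learn is available, a random forest exposes it through `feature_importances_` (the synthetic dataset and the friend labels below are purely illustrative):

```python
# Sketch of Gini importance via scikit-learn's `feature_importances_`.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 4 "skills", only 2 of which actually drive the label.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The importances sum to 1; a larger value means more influence on the splits.
for name, imp in zip(["Friend A", "Friend B", "Friend C", "Friend D"],
                     clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The two informative features end up with much higher scores, just as Friends A and B dominate the group’s decision about the song.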

Mean Decrease Impurity

Imagine you have the same big jar of colorful marbles as in the previous example, and you want to organize them into groups.

You can do this by asking questions about the marbles’ characteristics, like color, size, or whether they have stripes. These questions help you decide how to group the marbles.

Now, let’s say you want to find out which question is the most helpful in making the best groups. For example, if you asked, “Are the marbles red?” and that question helped you make very distinct groups, it’s a good question. If you asked another question like, “Do the marbles have yellow stripes?” and that also helped make good groups, it’s another good question.

In the world of data and statistics, there’s something similar to these questions when we’re trying to decide which features (like color, size, or stripes) are the most important for making decisions. This is where “Mean Decrease Impurity” comes in.

“Mean Decrease Impurity” is a way to measure how much each question (or feature) helps in organizing our data. The word “impurity” means how mixed up or messy things are. If a question (or feature) helps separate the data into cleaner and more organized groups, it gets a higher “Mean Decrease Impurity” score.

Maths behind the Mean Decrease Impurity

Problem

Let’s look at another example to understand Mean Decrease Impurity. Imagine you are a teacher, and you want to understand what factors affect your students’ performance in a math test. You have data on their study hours, sleep hours, and whether they had breakfast on the test day. You want to figure out which of these factors is the most important in determining how well your students do on the test.

Here’s how “Mean Decrease Impurity” can help:

  1. Study Hours (feature 1): You start by asking the question, “Do more study hours lead to better test scores?”
  2. Sleep Hours (feature 2): Next, you ask, “Do more sleep hours lead to better test scores?”
  3. Breakfast (feature 3): Lastly, you ask, “Does having breakfast on the test day make a difference?”

Here is a dataset for it:

Here, we have information about 8 students. We want to find out which feature (Study Hours, Sleep Hours, or Breakfast) is the most important for predicting Test Scores.
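The table itself appears as an image in the original post, so here is a hypothetical stand-in whose class counts match the arithmetic in the solution below (3 students in one class, 5 in the other; the sleep and breakfast columns are purely illustrative), together with a small Gini function:

```python
# Hypothetical stand-in for the article's table: 8 students, where
# `passed` marks whether the test score was high (1) or low (0).
students = [
    # (study_hours, sleep_hours, breakfast, passed)
    (1.0, 6, "no", 0), (2.0, 7, "yes", 0), (2.5, 8, "yes", 1),
    (3.0, 5, "no", 0), (4.0, 7, "yes", 1), (5.0, 8, "yes", 1),
    (6.0, 6, "no", 1), (7.0, 8, "yes", 1),
]
labels = [row[-1] for row in students]

def gini(labels):
    """Gini impurity for binary labels."""
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2

print(gini(labels))  # 1 - (3/8)^2 - (5/8)^2 = 0.46875
```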

Solution:

In a decision tree, you’re making a series of decisions based on these different features to reach a final decision or prediction. Each decision is made to reduce the uncertainty or “impurity” in the data. Impurity measures how mixed up or uncertain the data is.

When building a decision tree, you choose which feature (or question) to split the data on at each step. The feature that, when you split the data, reduces the impurity the most is considered the most important.

Now, “Mean Decrease Impurity” works like this:

1. You start with the entire dataset at the top of the tree:

Calculate the Gini Impurity (I) for the initial dataset based on Test Scores.

Gini_initial = 1 - [(3/8)² + (5/8)²] = 1 - [9/64 + 25/64] = 1 - 34/64 = 30/64 = 0.46875

2. Calculate the reduction in impurity for each feature:

The reduction in impurity for F1, F2, and F3 is calculated using an impurity measure such as Gini impurity or entropy.

For Feature 1 (Study Hours):

  • Split the data into two subsets: One where Study Hours are less than or equal to 2.5, and the other where Study Hours are greater than 2.5.
  • Calculate the Gini Impurity for each subset.
  • Calculate the weighted Gini Impurity after the split.
  • Reduction in Impurity for F1 (Study Hours):

Step 1: Gini_initial = 0.46875 (we found it already)

Step 2: Calculate Weighted_Gini_After_Split

Weighted_Gini_After_Split = (3/8) * Gini_left + (5/8) * Gini_right
Gini_left (Study Hours <= 2.5) = 1 - [(2/3)² + (1/3)²] = 0.4444 
Gini_right (Study Hours > 2.5) = 1 - [(1/5)² + (4/5)²] = 0.32

Weighted_Gini_After_Split = (3/8) * 0.4444 + (5/8) * 0.32 = 0.3667

Step 3: Calculate Reduction in Impurity for Study Hours

Reduction in Impurity for Study Hours = Gini_initial - Weighted_Gini_After_Split

Reduction in Impurity for F1 = Gini_initial - Weighted_Gini_After_Split = 0.46875 - 0.3667 = 0.1021
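These three steps can be verified in a few lines of Python. The subset labels below are hypothetical but match the class proportions used above (2/3 vs 1/3 on the left, 1/5 vs 4/5 on the right); note the weighted sum works out to about 0.3667, giving a reduction of about 0.1021:

```python
def gini(labels):
    """Gini impurity for binary labels."""
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2

left = [0, 0, 1]            # Study Hours <= 2.5 (2 of one class, 1 of the other)
right = [0, 1, 1, 1, 1]     # Study Hours > 2.5  (1 of one class, 4 of the other)
full = left + right

n = len(full)
weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
reduction = gini(full) - weighted

print(round(gini(left), 4), round(gini(right), 4))  # 0.4444 0.32
print(round(weighted, 4), round(reduction, 4))      # 0.3667 0.1021
```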

For Feature 2 (Sleep Hours):

Reduction in Impurity for F2 (Sleep Hours) = 0.12 (calculated using the same steps)

For Feature 3 (Breakfast):

Reduction in Impurity for F3 (Breakfast) = 0.09 (calculated using the same steps)

3. Calculate Mean Decrease Impurity

The “Mean Decrease Impurity” is the average of these reductions in impurity over all the features.

Mean Decrease Impurity = (Reduction in Impurity for F1 + Reduction in Impurity for F2 + Reduction in Impurity for F3 ) / Number of Features

After calculation:

Mean Decrease Impurity = (0.1021 + 0.12 + 0.09) / 3 = 0.3121 / 3 = 0.104

So, the Mean Decrease Impurity for this dataset is approximately 0.104. This value represents the overall importance of the features in reducing impurity and making decisions in a decision tree.
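The final averaging step is a one-liner. The study-hours value below is 0.1021, which is what the weighted-Gini arithmetic above works out to; the other two reductions are taken from the worked example:

```python
# Per-feature reductions in impurity from the worked example.
reductions = {"study_hours": 0.1021, "sleep_hours": 0.12, "breakfast": 0.09}

mean_decrease_impurity = sum(reductions.values()) / len(reductions)
print(round(mean_decrease_impurity, 4))  # 0.104
```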

The feature with the highest reduction in impurity contributes the most to the Mean Decrease Impurity and is considered the most important feature for decision-making. Mean Decrease Impurity helped us figure out that both study hours (0.1021) and sleep hours (0.12) are important factors in making the test scores more predictable and organized, while having breakfast (0.09) is not as important for this purpose.

Do subscribe for more such content!

Happy coding :)
