What Is Entropy and How Is It Used in Decision Tree Models?

Hanane Dupouy
8 min read · Jan 26, 2023


What will you learn in this article?

In this article, you will learn:

  • What is Entropy?
  • The intuition behind it, and how it can be represented
  • What is the relationship between probability and entropy?
  • How is entropy used in the decision tree model?
  • How do we compute it, step by step on a practical example, to build a tree?

What is Entropy?

Entropy is a measure of the average uncertainty of the possible outcomes of a random variable.

Entropy is used for classification in decision tree models. This measure helps to decide which feature to use to split a node in the tree. It is a very powerful and useful metric. It is also used in the Information Gain measure, which can likewise be used in decision tree models.

For example, the possible outcomes of a fair coin flip (two equally likely outcomes) carry less uncertainty than the outcomes of a roll of a die (six equally likely outcomes). The first has a lower entropy, the second a higher entropy.

The entropy measure used in machine learning comes from information theory and was developed by Claude Shannon in 1948.

Visualization and Intuition

We can talk about Entropy as a measure of uncertainty or impurity or surprise or information value in a group of observations. All these concepts are related.

Here is an explanation. Let’s assume we have these 3 groups:

Figure: three groups of yellow and blue balls, illustrating entropy

  • In the first square (the leftmost one), we have 6 yellow balls and 6 blue balls, so the probabilities of picking a yellow ball or a blue ball are equal (6/12). In this case, the uncertainty is the same for both colors: both outcomes are equally likely, and the entropy is equal for both. There is higher disorder and lower information value.
  • In the second square, we have 9 blue balls and 3 yellow balls. The probability of picking a blue ball (9/12) is higher than the probability of picking a yellow ball (3/12), so I am more certain to pick a blue ball than a yellow one. For the blue balls, the certainty is higher, thus the entropy is lower. Conversely, for the yellow balls, the certainty is lower, thus the entropy is higher.
  • In the last square, there are only blue balls. I am 100% certain to get a blue ball. There is no uncertainty, thus the entropy is 0.

What we need to understand from this figure is the relationship between the probability of an outcome, its uncertainty and its entropy:

The greater the likelihood of an outcome occurring, i.e. the more certain we are of getting that outcome, the lower the entropy. So the entropy is inversely related to the probability.

In other terms, the entropy is related to the uncertainty: the more uncertain we are about getting an outcome, the higher the entropy.

Mathematical formula

Intuition

As entropy and uncertainty are inversely related to probability, a first idea is to write: Uncertainty = 1/p.

Let’s go back to the last square in our figure:

  • The probability of picking a blue ball is 1.
  • We therefore want the entropy (uncertainty) to be 0.

But if we simply express the uncertainty as 1/p, we get Uncertainty = 1 instead of 0. To solve this issue, a log function is introduced:
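Using a base-2 logarithm (the base that reproduces the numbers used throughout this article), the uncertainty, or surprise, of an outcome with probability p becomes:

Uncertainty(x) = log₂(1 / p(x)) = − log₂ p(x)

With p = 1, as in the last square, this gives log₂(1) = 0, which is exactly the behavior we wanted.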

In the second square of the illustration, we have:

For blue balls, the probability to pick one ball is 9/12, and for yellow ones is 3/12.

So what is the expected uncertainty of picking a ball, blue or yellow, from this whole square?

The global expected uncertainty of the whole square is the weighted average of the uncertainty of each color, weighted by its probability.

In other terms, this expected uncertainty is what we call the Entropy. It is also the sum of 2 entropies: the blue one and the yellow one.

On one hand, the uncertainty of getting a yellow ball is high, and so is its entropy: 0.5. On the other hand, the uncertainty of getting a blue ball is low, and so is its entropy: 0.311.
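Putting the numbers of the second square together (probabilities 9/12 for blue and 3/12 for yellow, with a base-2 logarithm):

Entropy = (9/12) · log₂(12/9) + (3/12) · log₂(12/3) ≈ 0.311 + 0.5 ≈ 0.811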

Final Formula

The final formula of the entropy is:
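For a random variable X with n possible outcomes of probabilities p₁, …, pₙ (using a base-2 logarithm):

H(X) = Σᵢ pᵢ · log₂(1/pᵢ) = − Σᵢ pᵢ · log₂(pᵢ)

The entropy is 0 when one outcome has probability 1, and it is maximal when all outcomes are equally likely.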

How is the Entropy used in the Decision Trees model?

In the introduction, I explained that entropy is used as a metric in the decision tree classification model to decide which feature and value to choose to split each node of the tree.

For each node, we study every feature in the dataset by computing the entropy of the split it produces. The lower the entropy, the better the feature.

Here are the steps to follow to compute entropy and compare among the features:

1- For each node:

  • Take a feature
  • Use it to split the node, on the left and right side, by creating two subsets
  • Compute the entropy of each subset/side
  • Compute the weighted average entropy for this subtree (left + right)
  • Repeat again the above steps by selecting another feature from the dataset
  • Compare all entropies computed for the different features
  • Take the lowest entropy, and thus select the corresponding feature to split the current node

2- Go to the next node, and repeat the logic again, until you build the tree.

3- Stop splitting once the stopping criterion of the tree is met.
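As an illustration, here is a minimal Python sketch of this split-selection logic for a single categorical feature. The function names and the toy data are mine, introduced only for illustration; the toy data mirrors the “Savings” example worked out later in this article:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def weighted_split_entropy(labels, feature_values):
    """Weighted average entropy of the subsets obtained by splitting
    on a categorical feature."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    total = len(labels)
    weighted = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        weighted += len(subset) / total * entropy(subset)
    return weighted

# Toy data mirroring the "Savings" example below:
# 3 clients with High savings (all "Good"), 4 with Low savings (3 "Bad", 1 "Good").
savings = ["High", "High", "High", "Low", "Low", "Low", "Low"]
note = ["Good", "Good", "Good", "Bad", "Bad", "Bad", "Good"]

print(round(weighted_split_entropy(note, savings), 3))  # ~0.464
```

The lower this weighted entropy, the better the feature as a split candidate.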

Practical example

Here is an example of a dataset for which we need to build a tree, to understand how we use entropy.

The data are fake, built for illustration: seven clients described by three features (Savings, Assets, Salary) and a credit note label, with 4 “Good” notes and 3 “Bad” notes.

Let’s imagine that this dataset is gathered by a bank that wants to score clients.

A client with a good credit note will have access to a loan with the bank, while a client with a bad note will see his request refused.

The goal is to predict the credit note of a client, based on his/her information: savings, assets, and salary.

We want to build a tree for this dataset. The first question is: what would be the best feature (Savings, Assets, Salary) to start with at the root node?

General entropy

To do that, let’s first compute the general entropy of the dataset, before any split.

  • We have 4 “Good” notes and 3 “Bad” notes.
  • The probability of having a “Good” note is 4/7 ≈ 0.571.
  • Its entropy is (4/7) · log₂(7/4) ≈ 0.461.
  • The probability of having a “Bad” note is 3/7 ≈ 0.429.
  • Its entropy is (3/7) · log₂(7/3) ≈ 0.524.

On one hand, there is more uncertainty about getting a “Bad” note, so its entropy is higher. On the other hand, there is less uncertainty about getting a “Good” note, so its entropy is lower.

The general Entropy of this dataset is:
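With a base-2 logarithm, summing the two entropies computed above:

H = (4/7) · log₂(7/4) + (3/7) · log₂(7/3) ≈ 0.461 + 0.524 ≈ 0.985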

The entropy is high, close to the maximum value of 1 for a binary outcome.

Root Node and entropy

Now, for each feature in the dataset, you need to build a subtree and compute its entropy. In the following, we will do it for “Savings” to illustrate the calculation; I will let you practice with the other features.

Feature 1 — Categorical: Savings

For Savings for example:

  • If “Savings” is High, we are on the left side of the subtree. There are 3 observations, and all of them have “Good” as a credit note. So the entropy on this side of the tree is 0.
  • If “Savings” is Low, we are on the right side of the subtree. There are 4 observations: 3 of them have “Bad” as a credit note, and 1 has “Good”. The entropy on this side is approximately 0.811.
  • The weighted entropy of this subtree is the weighted average of the entropies of the 2 branches. It is equal to 0.464 (see the detailed calculation below).
  • By using “Savings” as a candidate for the root node, we get an entropy of 0.464.
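In detail, with a base-2 logarithm, the left side (Savings = High) is pure, so its entropy is 0. For the right side and the whole split:

Entropy(Savings = Low) = (3/4) · log₂(4/3) + (1/4) · log₂(4) ≈ 0.311 + 0.5 ≈ 0.811

Weighted entropy of the split = (3/7) · 0 + (4/7) · 0.811 ≈ 0.464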

Now you understand how the entropy computation is handled for each subtree. You can continue with the other features as an exercise.

Results

Comparing the weighted entropies of the different candidate subtrees for the root node, “Savings” has the lowest entropy. It is therefore the best candidate for the split at the root node.

For the internal nodes, the same method is applied, taking only the observations that fall into each subtree.

Log loss

Important: in scikit-learn, the “log_loss” split criterion is another name for entropy. Both criteria trigger the same entropy computation.
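As a minimal illustration (assuming a recent scikit-learn version, where the “log_loss” criterion is available), the two classifiers below select their splits using the same Shannon entropy:

```python
from sklearn.tree import DecisionTreeClassifier

# "entropy" and "log_loss" are equivalent split criteria in scikit-learn:
# both choose splits using the Shannon entropy described in this article.
clf_entropy = DecisionTreeClassifier(criterion="entropy")
clf_log_loss = DecisionTreeClassifier(criterion="log_loss")
```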

Summary

The most important thing to keep in mind is that the lower the entropy of a subtree, the better the feature (and the threshold value) used to split the node of this tree.

Now that you master the computation of the entropy, you can have a look at the Information Gain metric.

In this article, you have learned about Entropy, as used in decision tree classification:

  • Its definition
  • Its mathematical formula and the intuition behind it
  • How to use it to build a tree: Its computation, step by step, feature by feature using an example of a dataset

I hope you enjoyed reading this article.

Was the explanation clear and useful? Tell me in the comments, I appreciate your feedback 👇.

Visit my blog: http://www.machinelearning-basics.com/
