Matthew Dicicco
3 min read · Feb 2, 2022

Entropy Explained

What is Entropy and why use it?

Entropy, originally a concept from thermodynamics, tells physicists how much disorder there is in a system. Machine learning practitioners adopted the idea because it can characterize data in unsupervised settings (and it is used in supervised situations too, like decision trees). So, what does entropy do? It quantifies the amount of uncertainty in an entire probability distribution. Some say it is a measure of chaos, but I don’t like to describe it that way.

So then what would you say?

That entropy is simply a metric for information gained. The more information we gain, the more scenarios we can rule out, and the better we can tell how something is going to happen.

Implications:

Low entropy -> less chaos -> more information gained

High entropy -> more chaos -> less information gained

The lower the entropy, the less chaos there is, or the purer the set is, meaning the next state may be easier to predict. The higher the entropy, the more chaos there is, or the more heterogeneous the set is, meaning the next state will be harder to predict.
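As a quick illustration, here is a minimal Python sketch (not from the original post) contrasting a pure set with an evenly mixed one. For two classes, a pure set has entropy 0 and a 50/50 split has the maximum entropy of 1 bit:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of class probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A pure set (all one class) has zero entropy: the next draw is easy to predict.
print(entropy([1.0, 0.0]))  # 0.0

# An even 50/50 split has maximal entropy for two classes: hardest to predict.
print(entropy([0.5, 0.5]))  # 1.0
```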

Common Application: One application of entropy is in decision trees. In a decision tree you must decide which feature to split your dataset on first. To figure out which feature is most suitable, you can use entropy: loop over your features and take the one whose split yields the lowest entropy as the first split in the tree. The lowest entropy identifies the feature that describes the data best, meaning it sorts most of the data into a specific class right off the bat.
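Here is a rough sketch of that loop in Python. The toy dataset, the feature names, and the helper functions are all hypothetical, just to show how the feature whose split leaves the lowest weighted entropy wins the first split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, feature):
    """Weighted average entropy of the label groups created by splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Hypothetical toy data: which feature should the tree split on first?
rows = [
    {"height": "tall", "day": "mon"},
    {"height": "tall", "day": "tue"},
    {"height": "short", "day": "mon"},
    {"height": "short", "day": "tue"},
]
labels = ["heavy", "heavy", "light", "light"]

best = min(["height", "day"], key=lambda f: split_entropy(rows, labels, f))
print(best)  # "height" -- it separates the classes perfectly (entropy 0)
```

Splitting on height leaves two pure groups (entropy 0), while splitting on day of the week leaves two 50/50 groups (entropy 1), so the tree splits on height first.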

Day-to-day example of entropy:

Imagine you have the features height, skin tone, and day of the week, and the goal is to predict an individual’s weight. You loop over the features and find that splitting on height gives the lowest entropy. That is because height is the best predictor of a person’s weight, and hence offers the most information gained.

Equation:

Entropy = -Σ_{i=1}^{C} p_i log_2(p_i)

C is the number of clusters you would like to go up to in an unsupervised case. In a supervised situation, C is instead the number of classes that something could be classified as.

The intuitive reader will realize that entropy could also be used to figure out the proper number of clusters for a dataset.

p_i is the probability of class i in the total dataset, i.e. the fraction of samples belonging to that class.

As an example, say 3/10 people are male and 7/10 are female, and we are trying to find the entropy. We have 2 classes (male and female) with probabilities of 3/10 and 7/10, so the equation looks like -3/10 log_2(3/10) - 7/10 log_2(7/10). This comes out to about 0.88, which is considered high, since entropy with two classes is bounded between 0 and 1 (in general, by log_2(C)). What does this tell us? It tells us we don’t have much information gain.
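A quick sanity check of that arithmetic in Python (the variable names are mine, not from the post):

```python
import math

p_male, p_female = 3 / 10, 7 / 10
H = -p_male * math.log2(p_male) - p_female * math.log2(p_female)
print(round(H, 2))  # 0.88
```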

I’ll brush this up in the future with more code and a better way of visualizing the individual features. Hope you enjoyed!
