Entropy In The World Of Computer Science

Let’s explore Information Entropy!

Shubham Panchal
The Startup
Feb 2, 2020


Entropy has always been a confusing yet interesting term, be it in Physics, Chemistry, or Mathematics. In Physics, we relate it to why a cup of coffee cools down. In Chemistry, we relate it to Gibbs free energy and spontaneity. But in Mathematics, it gets even more interesting. Don’t worry, the meaning of entropy in this context is the same,

Entropy is a measure of disorder.

Only here, we have information ( bits ) instead of atoms and molecules!

Enter the Math Zone!

To understand information entropy, let’s consider two machines ( Machine A and Machine B ) that produce sequences of the letters A, B, C, and D. These machines are our sources of information, and our goal is to figure out which source produces more information.

Machines A and B producing letters.

Each machine produces letters with a fixed probability, meaning the chance of occurrence of each letter never changes. Machine A produces the letters according to the following probabilities,

Probabilities for Machine A

And for machine B,

Probabilities for Machine B
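In this example, machine A assigns an equal probability of 0.25 to each of the letters A, B, C, and D, while machine B assigns a probability of 0.5 to ‘A’, 0.25 to ‘D’, and 0.125 each to ‘B’ and ‘C’.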

Based on this data, how can we tell which machine is producing more information? Is it A or B? This is where information entropy enters the picture. Shannon rephrased the question.

The more questions we have to ask, the more information we get!

The information provided by either machine depends on how many questions we need to ask in order to predict the next letter it will produce. Let’s consider machine A first.

Machine A has an equal chance of producing each of the four letters. A diagram of Yes/No questions for machine A will look like this,

Questions asked to know which letter is produced by Machine A.
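Since all four letters are equally likely, the first question splits them into two pairs ( “Is it A or B?” ), and one more question on either branch ( “Is it A?” or “Is it C?” ) pins down the exact letter. Every letter therefore takes exactly two questions.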

For machine B,

Questions asked to know which letter is produced by Machine B.
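Here the first question is “Is it A?”. If the answer is no, we ask “Is it D?”, and if that is also no, a final “Is it B?” separates ‘B’ from ‘C’. So ‘A’ takes one question, ‘D’ takes two, and ‘B’ and ‘C’ take three each.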

In the above diagram, why have we kept the question “Is it A?” at the top?

Look at the probabilities for machine B. The letter ‘A’ has the highest chance of occurrence, so we separate the most likely outcome first. Letters ‘B’ and ‘C’ are the least probable and hence sit at the bottommost level.

Now, what if we want to know how many questions we have to ask to reach a particular letter at the end of the tree? For machine A, we had equal chances of getting the four letters. For instance, to reach the letter ‘D’ in machine A, we had to ask two questions, namely “Is it A or B?” and “Is it C?”. The number of questions for all four outcomes in machine A is the same, i.e. 2 questions. We can multiply the number of questions for each letter with its corresponding probability, which we saw earlier.

The number of questions asked for machine A ( N subscript A ).
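Plugging in the numbers for machine A,

N_A = 0.25 \times 2 + 0.25 \times 2 + 0.25 \times 2 + 0.25 \times 2 = 2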

Now, similarly, we can calculate the number of questions for machine B. Observe the diagram. You will notice that we have to ask a single question for the letter ‘A’. For the letter ‘D’, we have to ask 2 questions, and for ‘B’ and ‘C’, we need to ask 3 questions each.

The number of questions asked for machine B ( N subscript B ).
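Plugging in the numbers for machine B,

N_B = 0.5 \times 1 + 0.25 \times 2 + 0.125 \times 3 + 0.125 \times 3 = 1.75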

Wait, we need to ask 1.75 questions?

Let’s set aside the objection that the number of questions asked has to be a whole number; 1.75 is an average over many produced letters.

The more questions we need to ask, the more information we get from a machine. Hence, we conclude that machine A is producing more information, since it makes us ask more questions to predict the next letter. Here 2 and 1.75 are the information entropies for machines A and B respectively. Instead of counting the number of questions by drawing a diagram each time, we can generalize this calculation by using logarithms.

Note that in both diagrams, for machines A and B, the number of questions depends on how deep we go into the tree. For instance, in machine B we go 3 levels deep for the letters ‘B’ and ‘C’. The number of equally likely outcomes associated with a letter is the reciprocal of its probability. Let’s consider the letter ‘C’. For the letter ‘C’, the number of outcomes is 1/0.125 = 8. Now we take the binary logarithm of 8,

The logarithm with base 2 of 8.
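In other words,

\log_2 8 = \log_2 \frac{1}{0.125} = 3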

This gives us the number of questions we asked. Yes, we did ask 3 questions to get the letter ‘C’ in machine B. Let’s put all our knowledge together, and we get the equation for information entropy, also known as Shannon’s Entropy ( represented by the letter H ),

The expression for Shannon’s Entropy.
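In symbols,

H = \sum_i p_i \log_2 \frac{1}{p_i} = - \sum_i p_i \log_2 p_i

where p_i is the probability of the i-th letter and \log_2 ( 1 / p_i ) is the number of questions needed to identify it.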

This brings us down to the final equation. The logarithm term gives us the number of questions asked for each letter, and multiplying it with the respective probability and summing over all letters yields the information entropy. The information entropy of machine A is higher, meaning it provides more information as compared to machine B.
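As a quick sanity check, here is a minimal Python sketch of this calculation, using the probabilities reconstructed above ( the function name entropy is just for illustration ),

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits: H = -sum( p * log2(p) ), skipping zero-probability letters
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Probabilities of the letters A, B, C, D for each machine
machine_a = [0.25, 0.25, 0.25, 0.25]
machine_b = [0.5, 0.125, 0.125, 0.25]

print(entropy(machine_a))  # 2.0  questions ( bits ) on average
print(entropy(machine_b))  # 1.75 questions ( bits ) on average
```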

The End!

I would like to thank Khan Academy for creating such a wonderful video on information entropy. Make sure you watch the video too.

Information entropy has wide use in machine learning. Remember that while creating decision trees, we calculate the information gain, which we obtain by subtracting entropies: the entropy before a split minus the weighted entropy after it, as in the rough sketch below.
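Here the split fractions and class proportions are made up purely for illustration,

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(parent_probs, splits):
    # splits: list of ( fraction of samples, class probabilities ) pairs, one per child node
    children = sum(fraction * entropy(probs) for fraction, probs in splits)
    return entropy(parent_probs) - children

# A 50/50 class mix separated into two purer halves
print(information_gain([0.5, 0.5], [(0.5, [0.9, 0.1]), (0.5, [0.1, 0.9])]))  # ~0.53 bits
```

Thanks for reading!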

