Is it a disorder, uncertainty or surprise?
The idea of entropy is confusing at first because so many words are used to describe it: disorder, uncertainty, surprise, unpredictability, amount of information and so on. If you’ve got confused with the word “entropy”, you are in the right place. I am going to demystify it for you.
Who Invented Entropy and Why?
In 1948, Claude Shannon introduced the concept of information entropy in his paper “A Mathematical Theory of Communication”.
Shannon was looking for a way to efficiently send messages without losing any information.
Shannon measured the efficiency in terms of average message length. Therefore, he was thinking about how to encode the original message into the smallest possible data structure. At the same time, he required that there should be no information loss. As such, the decoder at the destination must be able to restore the original message losslessly.
Shannon defined the entropy as the smallest possible average size of lossless encoding of the messages sent from the source to the destination. He showed how to calculate the entropy which is a useful thing to know as to make efficient use of the communication channel.
The above definition of the entropy might not be obvious to you at this moment. We shall see a lot of examples to learn what it means to efficiently and losslessly send messages.
How to Make Efficient and Lossless Encoding?
Suppose you want to send a message from Tokyo to New York regarding Tokyo’s weather today.
Is this efficient?
In this scenario, the sender in Tokyo and the receiver in New York both know that they are only communicating today’s Tokyo’s weather. So, we don’t need to send the words like “Today”, “Tokyo’s” and “weather”. We need to tell “Fine” or “Not fine”, and that’s it.
It is much shorter. Can we do better?
You bet. Because we only need to tell “yes” or “no” which is a binary. We need precisely 1 bit to encode the above messages. 0 for “Fine” and 1 for “Not fine”.
Is this lossless?
Yes, we can encode and decode the message without losing any information since we only have two possible message types: “Fine” and “Not fine”.
But if we also want to talk about “Cloudy” or “Rainy”, the 1-bit encoding won’t be able to cover all cases.
How about using 2 bits instead?
Now we can express all cases. It requires only 2 bits to send the messages.
But there is something wrong with this approach. Semantically, “Cloudy” and “Rainy” are both “Not fine” as well. It seems “Not fine” is redundant. Let’s remove it.
In the above encoding, “Fine” is “00” but we don’t need the second zero. When the first bit is zero, we already know it is “Fine” because both “Cloudy” and “Rainy” start with “1”.
In the above encoding, the first bit indicates “Fine or not”. The second bit indicates “Cloudy or Rainy”.
But there is a problem. It can snow in Tokyo, and we have no way to communicate such weather. What do we do?
We could add a third bit and use it for “Rainy” and “Snow” cases.
We changed “Rainy” from 11 to 110 and added “Snow” as 111 because 110 is distinctly different from 111, but 11 is ambiguous when multiple encodings are sent one after another.
For example, if we use 11 for “Rainy” and 111 for “Snow”, a 5-bit value “11111” could mean either “Rainy, Snow” or “Snow, Rainy”. Because of this, we use 110 for “Rainy” so that we can use “110111” for “Rainy, Snow” and “111110” for “Snow, Rainy”. There is no ambiguity.
So, the encoding now uses the first bit for “Fine or not”, the second bit for “Cloudy or others” and the third bit for “Rain or snow”.
For example, “010110111” means “Fine, Cloudy, Rainy, Snow”. “110010111” means “Rainy, Fine, Cloudy, Snow”. Once again, there is no ambiguity. It is lossless, and it seems very efficient.
There are more weather conditions in Tokyo than just those four cases but let’s keep the problem simple for now.
So, are we done?
Not quite — we want to know whether we’ve achieved the smallest possible average encoding size using the above lossless encoding or not. How do we calculate the average encoding size anyway?
How to Calculate Average Encoding Size
Let’s suppose Tokyo is sending the weather messages to New York many times (say, every hour) using the lossless encoding we’ve discussed so far.
We can accumulate weather report data so that we know the probability distribution of message types sent from Tokyo to New York.
Then, we can calculate the average number of bits used to send messages from Tokyo to New York.
(0.6 x 1 bit)+(0.38 x 2 bits)+(0.01 x 3 bits)+(0.01 x 3 bits)=1.42 bits
So, with the above encoding, we use 1.42 bits on average per message. As you can see, the average number of bits depends on the probability distribution and the message encoding (how many bits we use for each message type).
For example, if we have much more rain and snow in Tokyo, the average encoding size will be very different.
(0.1 x 1 bit)+(0.1 x 2 bits)+(0.4 x 3 bits)+(0.4 x 3 bits)=2.7 bits
We would use 2.7 bits to encode messages on average. We could reduce the average encoding size by changing the encoding to use fewer bits for “Rainy” and “Snow”.
The above encoding requires 1.8 bits on average which is a much smaller size than what the previous encoding yields.
Although this encoding yields smaller average encoding size, we lose the nice semantic meaning we had in the previous encoding scheme (i.e., the first bit for “Fine or not”, the second bit for, etc.). It’s ok since we are only concerned with the average encoding size, and we worry neither the semantics of encoding nor ease of decoding.
We now know how to calculate an average encoding size. But how do we know if our encoding is the best one that achieves the smallest average encoding size?
Finding the Smallest Average Encoding Size
Let’s suppose we have the following probability distribution. How do we calculate the minimum lossless encoding size for each message type?
What is the minimum required encoding size for “Rainy”? How many bits should we use?
Let’s suppose we use 1 bit to encode “Rainy” so that 0 means “Rainy”. In this case, we would have to use at least 2 bits for other message types to make the whole encoding unambiguous. It is not optimal as message types like “Fine” and “Cloudy” happens more frequently. We should reserve 1-bit encoding for them to make the average encoding size smaller. Hence, we should use more than 1 bit for “Rainy” as it occurs much less frequently.
As an example shown below, the encoding is unambiguous but the average size is not a theoretical minimum because we are using more bits for more frequent message types like “Fine” and “Cloudy”.
How about using 2 bits for “Rainy”? Let’s swap the encoding between “Rainy” and “Fine”.
This encoding has the same problem as the previous one because we are using more bits for more frequent messages type (“Cloudy”).
In fact, if you use 3 bits for “Rainy”, we can achieve the minimum encoding size on average.
If we use 4 bits for “Rainy”, our encoding becomes redundant. So, 3 bits is the right encoding size for “Rainy”. It’s all good, but this process involves too many trials and errors to find the right encoding size.
Given a probability distribution, is there any easy way to calculate the smallest possible average size of lossless encoding of the messages sent from the source to the destination. In other words, how do we calculate entropy?
How to Calculate Entropy
Suppose we have 8 message types, each of which happens with equal probability (⅛ = 12.5%). What is the minimum number of bits we need to encode them without any ambiguity?
How many bits do we need to encode 8 different values?
1 bit can encode 0 or 1 (2 values). Adding another bit doubles the capacity since 2 bits = 2x2 = 4 values (00, 01, 10, and 11). So, 8 different values require 3 bits = 2x2x2 = 8 (000, 001, 010, 011, 100, 101, 110 and 111).
We can not reduce any bit from any of the message types. If we do, it will cause ambiguity. Also, we don’t need more than 3 bits to encode 8 different values.
In general, when we need N different values expressed in bits, we need
bits and we don’t need more than this.
In other words, if a message type happens 1 out of N times, the above formula gives the minimum size required. As P=1/N is the probability of the message, the same thing can be expressed as:
Let’s combine this knowledge with how we calculate the minimum average encoding size (in bits) which is entropy:
P(i) is the probability of i-th message type. You may need to sit back and reflect on the simplicity and the beauty of this formula. There is no magic. We are merely calculating the average encoding size using the minimum encoding size for each message type.
Let’s go through an example using the weather reporting scenario to solidify our understanding.
So, the entropy is:
(0.5 x 1 bit)+(0.25 x 2 bits)+(0.125 x 3 bits)+(0.125 x 3 bits)=1.75 bits
What About Those Analogies?
There are a lot of analogies used for entropy: disorder, uncertainty, surprise, unpredictability, amount of information and so on.
If entropy is high (encoding size is big on average), it means we have many message types with small probabilities. Hence, every time a new message arrives, you’d expect a different type than previous messages. You may see it as a disorder or uncertainty or unpredictability.
When a message types with much smaller probability than other message types happens, it appears as a surprise because on average you’d expect other more frequently sent message types.
Also, a rare message type has more information than more frequent message types because it eliminates a lot of other probabilities and tells us more specific information.
In the weather scenario, by sending “Rainy” which happens 12.5% of the times, we are reducing the uncertainty by 87.5% of the probability distribution (“Fine, Cloudy, Snow”) provided we had no information before. If we were sending “Fine” (50%) instead, we would be reducing the uncertainty by 50% only.
If the entropy is high, the average encoding size is significant which means each message tends to have more (specific) information. Again, this is why high entropy is associated with disorder, uncertainty, surprise, unpredictability, amount of information.
Low entropy means that most of the times we are receiving the more predictable information which means less disorder, less uncertainty, less surprise, more predictability and less (specific) information.
I hope these analogies are no longer confusing for you.
What is it? Is there any relation to the entropy concept? Why is it used for classification loss? What about the binary…
Demystifying KL divergence
Demystifying KL Divergence
What does KL stand for? Is it a distance measure? What does it mean to measure the similarity of two probability…
Shannon, C. E. 1997. “The Mathematical Theory of Communication. 1963.” M.D. Computing: Computers in Medical Practice 14 (4): 306–17.
Wikipedia, Entropy (Information theory)
A Short Introduction to Entropy, Cross-Entropy and KL-Divergence