Shannon Information Content, Entropy with Examples

Nishant Nikhil
May 22, 2017 · 5 min read


This is in continuation of my previous post: Introduction to Information Theory - Hamming(7,4) Code.
This discusses lecture 2 from Information Theory, Pattern Recognition, and Neural Networks.

Shannon Information Content h(x=a_i)
This measures the amount of information you gain when an event with some associated probability occurs.
Suppose an event had probability 1 of happening, and it happens. Surely you get no information from it.

Given probability P(x=a_i):
h(x=a_i)
= -log2(P(x=a_i)) bits
Note the unit ‘bits’ here.

For an intuition, suppose there was an event with probability one half of occurring and one half of not occurring. After the outcome you can report whether it occurred or not in one bit, i.e. ‘1’ or ‘0’. As per the Shannon information content, h = -log2(1/2) = 1 bit, which agrees with our count of one bit.
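As a quick sanity check, here is a minimal Python sketch of the formula above (the function name is just for this post, not from any library):

```python
import math

def information_content(p: float) -> float:
    """Shannon information content h(x) = log2(1/p) = -log2(p), in bits."""
    return math.log2(1 / p)

print(information_content(0.5))   # 1.0 bit: the coin-flip case above
print(information_content(1.0))   # 0.0 bits: a certain event carries no information
print(information_content(1/8))   # 3.0 bits: rarer events carry more information
```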

Entropy of an ensemble of events H(X)

Entropy is a measure of unpredictability of the state, or equivalently, of its average information content. To get an intuitive understanding of these terms, consider the example of a political poll. Usually, such polls happen because the outcome of the poll is not already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the a priori entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the a priori entropy of the second poll result is small relative to that of the first.

H(X) = -ΣP(x)log2(P(x)) bits
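In code, this formula can be sketched as a small helper (written for this post, not taken from any library):

```python
import math

def entropy(probs):
    """H(X) = -sum(p * log2(p)), in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(entropy([0.25] * 4))    # 2.0 bits: four equally likely outcomes
print(entropy([0.9, 0.1]))    # ~0.47 bits: a heavily biased coin is more predictable
```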

Formulas alone are a bit dry, so let us take an example to see how this is actually used.

Puzzle

Your friend has a number between 1 and 100 in his mind. What is the minimum number of questions you need to ask to guess it, given that each question can be answered only with a yes or a no?
A CS major would simply answer that this is basically binary search. We have to ask ceil(log2(100)) = ceil(6.64) = 7 questions (we take the ceiling).
The questions would go like this:
1. Is the number greater than 50?
2. Based on the previous answer, is the number greater than 75 (or 25)?
3. …
4. …

But what if he could answer with one of three options? How many questions would be required now? It would become ceil(log3(100)) = ceil(4.19) = 5 questions.
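Both counts can be checked in a couple of lines of Python:

```python
import math

n = 100

# Yes/no answers: each question can at best halve the set of possibilities,
# so we need at least ceil(log2(100)) questions.
print(math.ceil(math.log2(n)))    # 7

# Three-way answers: each question can cut the possibilities to a third,
# so ceil(log3(100)) questions suffice.
print(math.ceil(math.log(n, 3)))  # 5
```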

Now let’s get to a more interesting puzzle: you have 12 balls identical in size and appearance, but one has an odd weight (it could be either lighter or heavier).

You have a set of balance scales which will give 3 possible readings:

[Image of a balance scale, taken from Mathisfun.com]
  • Left = Right,
  • Left > Right, or
  • Left < Right

We have to design a strategy to determine which ball is the odd one and whether it is heavier or lighter. (Here is the solution; it is recommended to read it before continuing further.)

  1. Suppose we put 6 balls vs 6 balls on the balance. Then we have the following probabilities:
    Left = Right → 0 (all 12 balls are on the scale, so the odd ball always tips it to one side)
    Left > Right → 0.5
    Left < Right → 0.5
    So we have entropy = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit
  2. Suppose we put 5 balls vs 5 balls on the balance. Then we have the following probabilities:
    Left = Right → 2/12 (the odd ball is one of the two balls left on the ground)
    Left > Right → 5/12 (symmetry)
    Left < Right → 5/12 (symmetry)
    So we have entropy = -((2/12)*log2(2/12) + (5/12)*log2(5/12) + (5/12)*log2(5/12)) = 1.48 bits
  3. Suppose we put 4 balls vs 4 balls on the balance. Then we have the following probabilities:
    Left = Right → 4/12 (the odd ball is one of the four balls left on the ground)
    Left > Right → 4/12 (symmetry)
    Left < Right → 4/12 (symmetry)
    So we have entropy = -((4/12)*log2(4/12) + (4/12)*log2(4/12) + (4/12)*log2(4/12)) = 1.58 bits
  4. Suppose we put 3 balls vs 3 balls on the balance. Then we have the following probabilities:
    Left = Right → 6/12 (the odd ball is one of the six balls left on the ground)
    Left > Right → 3/12 (symmetry)
    Left < Right → 3/12 (symmetry)
    So we have entropy = -((3/12)*log2(3/12) + (3/12)*log2(3/12) + (6/12)*log2(6/12)) = 1.5 bits
  5. Suppose we put 2 balls vs 2 balls on the balance. Then we have the following probabilities:
    Left = Right → 8/12 (the odd ball is one of the eight balls left on the ground)
    Left > Right → 2/12 (symmetry)
    Left < Right → 2/12 (symmetry)
    So we have entropy = -((2/12)*log2(2/12) + (2/12)*log2(2/12) + (8/12)*log2(8/12)) = 1.25 bits
  6. Suppose we put 1 ball vs 1 ball on the balance. Then we have the following probabilities:
    Left = Right → 10/12 (the odd ball is one of the ten balls left on the ground)
    Left > Right → 1/12 (symmetry)
    Left < Right → 1/12 (symmetry)
    So we have entropy = -((1/12)*log2(1/12) + (1/12)*log2(1/12) + (10/12)*log2(10/12)) = 0.82 bits
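All six cases can be reproduced with a short script, reusing the small entropy helper sketched earlier (an illustrative function for this post, not a library call):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# First weighing with k balls on each pan (out of 12):
# Left = Right happens when the odd ball is among the 12 - 2k balls on the ground;
# Left > Right and Left < Right each happen with probability k/12 by symmetry.
for k in range(1, 7):
    probs = [(12 - 2 * k) / 12, k / 12, k / 12]
    print(f"{k} vs {k}: {entropy(probs):.2f} bits")
```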

So 4 vs 4 maximizes the expected information content, so we go ahead with it.
Now we have three possible outcomes (L stands for a possibly light ball, H for a possibly heavy ball, and G for a surely good ball):

  1. LLLL < HHHH (on the balance), GGGG on the ground
    For the second weighing, by a similar line of reasoning we now put
    HHL vs HHL on the balance. Then we have the following probabilities:
    Left = Right → 2/8 (the odd ball is one of the two L balls left on the ground)
    Left > Right → 3/8 (one of the two H balls on the left is heavy, or the L ball on the right is light)
    Left < Right → 3/8 (the mirror case, by symmetry)
    So we have entropy = -((2/8)*log2(2/8) + (3/8)*log2(3/8) + (3/8)*log2(3/8)) = 1.56 bits
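Plugging these numbers into the same illustrative entropy helper gives the value quoted above:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Second weighing after LLLL < HHHH: put H H L on each pan, leave the other two L's off.
print(entropy([2/8, 3/8, 3/8]))   # about 1.56 bits
```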

And so on; you can read the full solution here (though it does not contain this explanation).

So one thing to notice is:
To maximize
H(X) = -ΣP(x_i)log2(P(x_i)) (i from 1 to n)

we need P(x_i) = 1/n for every i, i.e. a uniform distribution, which gives the maximum value H(X) = log2(n).

For the first weighing, we had n = 3: the odd ball is either on the left pan, on the right pan, or on the ground.
We got 4/12, 4/12, 4/12 as the probabilities (each equal to 1/3).

In the second weighing we again had n = 3.
We got 3/8, 3/8, 2/8 (since we can only place a whole number of balls on each pan, this is as close to the uniform 1/3 each as we can get).
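A quick numerical check of this claim, again with the illustrative entropy helper from above:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# With n = 3 outcomes, the uniform distribution attains the maximum log2(3) bits.
print(entropy([1/3, 1/3, 1/3]))   # ~1.58 bits: the 4 vs 4 weighing
print(entropy([3/8, 3/8, 2/8]))   # ~1.56 bits: the HHL vs HHL weighing, close but slightly less
print(math.log2(3))               # ~1.58, the theoretical maximum for three outcomes
```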

Conclusion

So we saw how to extract the maximum information when the probabilities of the possible outcomes are known. Next time, do not use trial and error for this kind of puzzle.

More Resources

  1. https://stats.stackexchange.com/questions/101351/entropy-and-information-content
  2. https://en.wikipedia.org/wiki/Entropy_(information_theory)
