Information Theory Explained for Machine Learning

Tejpal Kumawat
12 min read · Mar 20, 2023


What is Information

Let’s begin with information, the “heart” of information theory. Information can be encoded in anything that has a particular sequence of one or more encoding formats. Suppose that we set ourselves the task of defining the notion of information. What would be a good place to start?

Consider the following thought experiment. A friend of ours has a deck of cards. They will shuffle the deck, turn over some cards, and tell us statements about the cards. We will try to assess the information content of each statement.

They first turn over a card and say, “I see a card,” to us. This gives us absolutely no information. As we already knew this to be the case, we anticipate that the information will be zero.

Next, they turn over a card and announce, “Here is the 3 of spades.” This provides more information. There were 52 equally likely outcomes, and our friend revealed which one occurred. This should carry a fair amount of information.

Let’s take this logic to its conclusion. Imagine that they eventually turn over every card in the deck and read out the order of the shuffled deck. Since there are 52! possible orderings of the deck, each equally likely, we need a lot of information to determine which one it is.

Whatever notion of information we construct must conform to this intuition. Indeed, in the following sections we will learn how to calculate that these events contain 0 bits, about 5.7 bits, and about 225.6 bits of information, respectively.
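As a quick sanity check on those numbers, here is a small sketch using only Python’s standard library, assuming each case is a uniform choice among equally likely outcomes:

import math

# "I see a card": only one possible outcome, so zero surprise.
print(-math.log2(1.0))                  # 0.0 bits

# "Here is the 3 of spades": one of 52 equally likely cards.
print(math.log2(52))                    # about 5.70 bits

# Reading out a full shuffled deck: one of 52! equally likely orderings.
print(math.log2(math.factorial(52)))    # about 225.58 bits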

Reading through these thought experiments, a natural idea emerges: rather than caring about the knowledge itself, information represents the degree of surprise, or the abstract possibility, of the event. For instance, describing an unusual event requires a lot of information, while a common event may not require much information at all.

A Mathematical Theory of Communication, published in 1948 by Claude E. Shannon, introduced the concept of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.

Self Information

Since information embodies the abstract possibility of an event, how do we map that possibility to a number of bits? Shannon introduced the term bit as the unit of information, a word originally coined by John Tukey. So what is a “bit”, and why do we use it to measure information? Historically, an antique transmitter could only send or receive two types of code: 0 and 1. Indeed, binary encoding is still in common use on all modern digital computers. Hence, any information is encoded by a series of 0s and 1s, and a series of binary digits of length n contains n bits of information.

Now, suppose that for any series of codes, each 0 or 1 occurs with a probability of 1/2. Then an event X encoded by a series of codes of length n occurs with probability (1/2)^n. At the same time, as we mentioned before, this series contains n bits of information. So, can we generalize this to a mathematical function that maps the probability p to the number of bits? Shannon gave the answer by defining the self-information

I(X) = -log2(p),

as the bits of information we have received for this event X. Note that we will always use base-2 logarithms in this section. For the sake of simplicity, the rest of this section will omit the subscript 2 in the logarithm notation, i.e., log(.) always refers to log2(.). For example, the code “0010” has a self-information of I(“0010”) = -log2((1/2)^4) = 4 bits.

Self-information can be calculated as illustrated below.

from mxnet import np
from mxnet.ndarray import nansum


def self_information(p):
    return -np.log2(p)

self_information(1 / 64)

# OUTPUT

6.0

Entropy

We need a more universal measure for any random variable with a discrete or continuous distribution since self-information only captures the information of a single discrete event.

Motivating Entropy
To be clearer about what we want, let’s try to motivate the definition. The axioms of Shannon entropy are stated formally below. It turns out that the following collection of reasonable statements forces us to define information in a particular way. A formal version of these axioms, along with several others, may be found in Csiszar (2008).

  1. The information we gain by observing a random variable does not depend on what we call the elements or the presence of additional elements which have a probability of zero.
  2. The information we gain by observing two random variables is no more than the sum of the information we gain by observing them separately. If they are independent, then it is exactly the sum.
  3. The information gained when observing (nearly) certain events is (nearly) zero.

It is crucial to understand that these axioms uniquely determine the form that entropy must take, even though proving this fact is beyond the scope of our text. They leave only one point of ambiguity, the choice of fundamental units, which is typically resolved by the normalization we saw before: the information delivered by a single fair coin flip is one bit.

Definition

For any random variable X that follows a probability distribution P with a probability density function (p.d.f.) or a probability mass function (p.m.f.) p(x), we measure the expected amount of information through the entropy (or Shannon entropy)

H(X) = -E[log p(x)],

where the expectation is taken over x drawn from P. Concretely, if X is discrete, H(X) = -∑_x p(x) log p(x); if X is continuous, the quantity is instead called differential entropy, H(X) = -∫ p(x) log p(x) dx.

How do we arrive at the above formula?

You may be wondering why entropy uses an expectation of a negative logarithm. Here are a few intuitions.

First, why do we employ the logarithm function? Suppose that p(x) = f1(x) f2(x) … fn(x), where each component function fi(x) is independent of the others. This means that each fi(x) contributes independently to the total information obtained from p(x). As discussed above, we want the entropy formula to be additive over independent random variables, and fortunately the logarithm naturally turns a product of probabilities into a sum of individual terms.

Second, why do we employ a negative log? Intuitively, more frequent events should contain less information than rarer ones, since we often learn more from an unusual case than from an ordinary one. However, the log is negative for all values in (0, 1) and increases monotonically with the probability. We need the relationship between the probability of events and their entropy to be monotonically decreasing and, ideally, always non-negative (nothing we observe should force us to forget what we already know). Hence, we add a negative sign in front of the log function.

And finally, where does the expectation come from? Consider a random variable X. The self-information (-log(p)) can be interpreted as the amount of surprise we experience when observing a particular outcome; indeed, the surprise grows without bound as the probability approaches zero. Similarly, entropy can be interpreted as the average level of surprise from observing X. For example, imagine a slot machine that emits the symbols s1, …, sk with probabilities p1, …, pk. The entropy of this system equals the average self-information obtained by observing each output, i.e.,

H(S) = ∑_i p_i · (-log p_i) = -∑_i p_i log p_i.
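To make this concrete, here is one way such an entropy function could look for a discrete distribution, written in the same mxnet style as the snippets in this article (the function name entropy and the example probabilities below are illustrative choices, not fixed by anything above; nansum is reused so that zero-probability symbols, whose 0 * log 0 terms evaluate to nan, are simply skipped):

def entropy(p):
    ent = -p * np.log2(p)
    # `nansum` drops the nan terms produced by 0 * log2(0)
    out = nansum(ent.as_nd_ndarray())
    return out

# A slot machine with four symbols appearing with probabilities p1, ..., p4
entropy(np.array([0.1, 0.5, 0.1, 0.3]))
# roughly 1.69 bits: the average self-information per observed symbol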

Types of Entropy

We previously defined entropy for a single random variable X; what about the entropy of a pair of random variables (X, Y)? The quantities below can be thought of as answering the following questions: “What information is contained in X and Y together, compared with X and Y individually? Is there redundant information, or is it all unique?”

For the following discussion, we will always refer to (X, Y) as a pair of random variables that follow a joint probability distribution P with a p.d.f. or a p.m.f. pX,Y(x,y), whereas X and Y follow probability distributions pX(x) and pY(y), respectively.

Joint Entropy

We define the joint entropy H(X, Y) of a pair of random variables (X, Y) as

H(X, Y) = -E[log pX,Y(x, y)],

where the expectation is taken over pairs (x, y) drawn from the joint distribution P.

Precisely, on the one hand, if (X, Y) is a pair of discrete random variables, then

H(X, Y) = -∑_x ∑_y pX,Y(x, y) log pX,Y(x, y).

On the other hand, if (X, Y) is a pair of continuous random variables, then we define the differential joint entropy as

H(X, Y) = -∫∫ pX,Y(x, y) log pX,Y(x, y) dx dy.

As a pair of extremes: if the two random variables X and Y are identical, then the information in the pair is exactly the information in one of them, and H(X, Y) = H(X) = H(Y); at the other extreme, if X and Y are independent, then H(X, Y) = H(X) + H(Y). In general, the information contained in a pair of random variables is never smaller than the entropy of either variable alone and never larger than the sum of the two, i.e., max(H(X), H(Y)) ≤ H(X, Y) ≤ H(X) + H(Y).

Implementation

def joint_entropy(p_xy):
    joint_ent = -p_xy * np.log2(p_xy)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(joint_ent.as_nd_ndarray())
    return out

joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))

# OUTPUT

[1.6854753]
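As a quick check of the bound just stated, we can compare this joint entropy with the entropies of the marginals of the same table. This sketch reuses the entropy helper from above; the marginals are obtained by summing the rows and columns of p_xy:

p_xy = np.array([[0.1, 0.5], [0.1, 0.3]])
p_x = np.sum(p_xy, axis=1)   # row marginal: [0.6, 0.4]
p_y = np.sum(p_xy, axis=0)   # column marginal: [0.2, 0.8]

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), joint_entropy(p_xy)
print(h_x, h_y, h_xy, h_x + h_y)
# roughly 0.971, 0.722, 1.685 and 1.693, so
# max(H(X), H(Y)) <= H(X, Y) <= H(X) + H(Y) holds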

Conditional Entropy

The joint entropy defined above measures the amount of information contained in a pair of random variables. Although this is helpful, it is frequently not what we are interested in. Think about the machine learning setting. Take X to be the random variable (or vector of random variables) describing the pixel values of an image and Y to be the random variable describing the class label. X should contain substantial information, since a natural image is a complicated object. However, once the image has been shown, the additional information needed to convey Y should be small. Indeed, unless the digit is illegible, the information about which digit it is should already be present in the image of the digit.

Hence, to continue to broaden our vocabulary of information theory, we need to be able to reason about the information content of a random variable conditional on another.

We learned how to define conditional probability in the context of probability theory, which is used to measure the relationship between variables. We now want to define the conditional entropy H(Y | X) in an analogous manner. This can be written as

H(Y | X) = -E[log p(y | x)],

where the expectation is taken over pairs (x, y) drawn from the joint distribution P and

p(y | x) = pX,Y(x, y) / pX(x)

is the conditional probability. Specifically, if (X, Y) is a pair of discrete random variables, then

H(Y | X) = -∑_x ∑_y pX,Y(x, y) log p(y | x).

It is now natural to ask how the conditional entropy H(Y | X) relates to the entropy H(X) and the joint entropy H(X, Y). Using the definitions above, we can state this succinctly:

H(Y | X) = H(X, Y) - H(X).

Intuitively, the information in Y given X (H(Y | X)) is the information in X and Y together (H(X, Y)) minus the information already present in X. In other words, it is the information in Y that is not also represented in X.

Let’s implement it in Python:

def conditional_entropy(p_xy, p_x):
    p_y_given_x = p_xy / p_x
    cond_ent = -p_xy * np.log2(p_y_given_x)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(cond_ent.as_nd_ndarray())
    return out

conditional_entropy(np.array([[0.1, 0.5], [0.2, 0.3]]), np.array([0.2, 0.8]))

# OUTPUT

[0.8635472]
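We can also check the identity H(Y | X) = H(X, Y) - H(X) numerically. The sketch below reuses the entropy helper from earlier and uses a joint table whose column marginal really is [0.2, 0.8], so that both sides refer to the same distribution:

p_xy = np.array([[0.1, 0.5], [0.1, 0.3]])
p_x = np.array([0.2, 0.8])    # the column marginal of p_xy in this layout

lhs = conditional_entropy(p_xy, p_x)        # H(Y | X)
rhs = joint_entropy(p_xy) - entropy(p_x)    # H(X, Y) - H(X)
print(lhs, rhs)                             # both roughly 0.96 bits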

Mutual Information

Given the previous setup of random variables (X, Y), you might be wondering: “Now that we know how much information is contained in Y but not in X, can we similarly ask how much information is shared between X and Y?” The answer will be the mutual information of (X, Y), which we will write as I(X, Y).

Let’s develop our intuition by trying to derive an expression for the mutual information purely based on concepts we have defined earlier rather than jumping right into the formal definition. We are looking for information that two random variables share. Starting with all the data present in both X and Y together and then removing the portions that are not shared is one approach we might try to follow. H(X, Y) represents the information that both X and Y together hold. The information in X that isn’t in Y, as well as the information in Y that isn’t in X, should be subtracted from this.

These are given by H(X | Y) and H(Y | X), respectively, as we saw in the preceding section. Hence, the mutual information should be

I(X, Y) = H(X, Y) - H(Y | X) - H(X | Y).

Indeed, this is a valid definition of mutual information. If we expand and combine the definitions of these terms, a little algebra shows that this is the same as

I(X, Y) = ∑_x ∑_y pX,Y(x, y) log ( pX,Y(x, y) / (pX(x) pY(y)) ).

We can summarize all of these relationships in the familiar Venn-diagram picture of entropies, where the mutual information I(X, Y) is the overlap between H(X) and H(Y):

I(X, Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X).

Let’s implement it in Python:

def mutual_information(p_xy, p_x, p_y):
    p = p_xy / (p_x * p_y)
    mutual = p_xy * np.log2(p)
    # Operator `nansum` will sum up the non-nan number
    out = nansum(mutual.as_nd_ndarray())
    return out

mutual_information(np.array([[0.1, 0.5], [0.1, 0.3]]),
                   np.array([0.2, 0.8]), np.array([[0.75, 0.25]]))

# OUTPUT

0.7194
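We can also verify the identity I(X, Y) = H(X) + H(Y) - H(X, Y) with this implementation. In the sketch below the marginals are shaped so that p_x * p_y broadcasts to the full outer-product table pX(x) pY(y); the entropy helper is the one sketched earlier:

p_xy = np.array([[0.1, 0.5], [0.1, 0.3]])
p_x = np.array([[0.6], [0.4]])   # row marginal, as a column vector
p_y = np.array([0.2, 0.8])       # column marginal

lhs = mutual_information(p_xy, p_x, p_y)
rhs = entropy(p_x) + entropy(p_y) - joint_entropy(p_xy)
print(lhs, rhs)   # both roughly 0.0074 bits: X and Y are nearly independent here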

Properties of Mutual Information

You need to be aware of the following salient characteristics of mutual information:

  • Mutual information is symmetric, i.e., I(X, Y) = I(Y, X).
  • Mutual information is non-negative, i.e., I(X, Y) ≥ 0.
  • I(X, Y) = 0 if and only if X and Y are independent. When X and Y are independent, knowing Y reveals nothing about X and vice versa, so their mutual information is zero (a numerical sketch of this case follows below).
  • In contrast, if X is an invertible function of Y, then Y and X share all their information, and I(X, Y) = H(Y) = H(X).
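The independence case is easy to see numerically: if the joint table is exactly the outer product of its marginals, the log term is log 1 = 0 everywhere. Here is a minimal sketch with made-up marginals, using the implementation above:

p_x = np.array([[0.3], [0.7]])   # marginal of X, as a column vector
p_y = np.array([0.4, 0.6])       # marginal of Y
p_xy = p_x * p_y                 # independence: p_xy(x, y) = p_x(x) * p_y(y)

mutual_information(p_xy, p_x, p_y)   # 0.0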

Application of Mutual Information

In its purest sense, mutual information may seem a little abstract, so how does it relate to machine learning? The difficulty of a word’s meaning being obscured by context, or ambiguity resolution, is one of the trickiest issues in natural language processing. For instance, a recent news headline said that “Amazon is on fire”. You can be confused about whether the Amazon rainforest is on fire or whether an Amazon building is on fire.

We start by identifying a group of words, such as e-commerce, technology, and the internet, that have a lot of mutual information with the company Amazon. The words “rain,” “forest,” and “tropical” form a second group that has a lot of mutual information with the Amazon rainforest. When “Amazon” needs to be disambiguated, we can compare which group appears more frequently in the context in which the word is used. In this instance, the article would go on to describe the forest, making the situation clear.
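As a toy illustration of this idea, the sketch below scores word associations using pointwise mutual information computed from co-occurrence counts. All of the counts and the corpus size are made up purely for illustration; in practice they would be gathered from a large text corpus:

import math

total = 1_000_000      # hypothetical number of word-pair observations in a corpus
count_amazon = 5_000   # occurrences of "amazon"
count_word = {'e-commerce': 3_000, 'rainforest': 2_000, 'tropical': 1_500}
cooccur = {'e-commerce': 900, 'rainforest': 400, 'tropical': 150}   # co-occurrences with "amazon"

for word, c in cooccur.items():
    # PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ):
    # how much more often the pair co-occurs than it would under independence.
    pmi = math.log2((c / total) / ((count_amazon / total) * (count_word[word] / total)))
    print(word, round(pmi, 2))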

Kullback–Leibler Divergence

Norms can be used to measure the distance between any two points in space. It would be helpful to be able to do something similar for probability distributions. There are several ways to do this, but information theory provides one of the nicest. We now explore the Kullback-Leibler (KL) divergence, which gives a way of measuring whether two distributions are close to one another or not.

Suppose we have a random variable X that follows the probability distribution P with a p.d.f. or a p.m.f. p(x), and we estimate P by another probability distribution Q with a p.d.f. or a p.m.f. q(x). Then the Kullback-Leibler (KL) divergence (or relative entropy) between P and Q is

D_KL(P || Q) = E[log ( p(x) / q(x) )],

where the expectation is taken over x drawn from P.

As with the pointwise mutual information, we can again give an interpretation of the logarithmic term: log(p(x)/q(x)) is large and positive if we observe x under P far more frequently than we would expect under Q, and large and negative if we observe the outcome far less frequently than expected. In this way, we can interpret it as our relative surprise at observing the outcome compared with how surprised we would be observing it under our reference distribution.

Let’s implement the KL divergence from scratch.

def kl_divergence(p, q):
    kl = p * np.log2(p / q)
    out = nansum(kl.as_nd_ndarray())
    return out.abs().asscalar()
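As a quick usage example with two made-up discrete distributions, we can compare a distribution P against a uniform reference Q and note that the KL divergence is not symmetric:

p = np.array([0.1, 0.5, 0.1, 0.3])
q = np.array([0.25, 0.25, 0.25, 0.25])   # uniform reference distribution

kl_divergence(p, q)   # roughly 0.31 bits
kl_divergence(q, p)   # roughly 0.47 bits; the KL divergence is not symmetric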

Conclusion:

Starting from Shannon’s article, we explored information theory and entropy, and then built up a number of related ideas: self-information, joint and conditional entropy, mutual information, and the KL divergence, together with how they show up in machine learning. A further intriguing concept in this area is the Jensen-Shannon distance.

