Kullback-Leibler (KL) Divergence with Examples (Part I): Entropy

Hossam Hamdy
8 min read · Jul 1, 2023


In the field of information theory, one of the fundamental concepts is the measurement of information and uncertainty. Information theory provides a framework for quantifying and understanding the amount of information contained in a message or a set of data. It has applications in various fields, including statistics, machine learning, and data compression.

At the heart of information theory lies the concept of entropy, which measures the uncertainty or randomness in a set of data. Entropy allows us to quantify the expected value of the information contained in a message. It provides insights into the average amount of “surprise” or “information” that we receive when we learn the outcome of a random variable.

However, entropy alone may not be sufficient to capture the differences between probability distributions or the amount of information gained when moving from one distribution to another. This is where Kullback-Leibler (KL) Divergence comes into play.

KL Divergence, named after Solomon Kullback and Richard Leibler, is a measure of the difference between two probability distributions. It provides a way to quantify how one distribution diverges from another. KL Divergence is often used in various fields, including information theory, statistics, and machine learning, to compare and analyze probability distributions.

In this article, we will explore the concept of KL Divergence in detail. We will start by understanding the basics of information theory, including entropy and the concept of information content. Then, we will delve into the definition and properties of KL Divergence, and discuss its applications in various domains. By the end of this article, you will have a solid understanding of KL Divergence and its significance in measuring the difference between probability distributions.

So, let’s embark on this journey into the world of information theory and KL Divergence, and discover how it can enhance our understanding of data and uncertainty.

Entropy

Entropy is a measure of uncertainty or randomness in a set of data. It’s a fundamental concept in information theory, and it’s used to quantify the expected value of the information contained in a message.

Defining Information Content

The first step in deriving the entropy formula is to define the concept of information content. The information content of an event is inversely proportional to the probability of the event. We can define the information content I(x) of an event x with probability P(x) as follows:

I(x) = -log2(P(x))

Here, the logarithm can be in any base, but it’s common to use base 2 for information theory (which results in a measure of information in bits). The negative sign ensures that the information content is positive since the logarithm of a probability (a number between 0 and 1) is negative.
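
To make this concrete, here is a minimal Python sketch (the helper name information_content is my own, not from any library) that evaluates the formula for a few probabilities and shows that rarer events carry more bits:

```python
import math

def information_content(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

# Rare events carry more information (more "surprise") than common ones.
for p in (0.5, 0.25, 0.1, 0.01):
    print(f"P(x) = {p:<5} -> I(x) = {information_content(p):.2f} bits")
```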

Defining Entropy

Entropy is defined as the expected value of the information content. In other words, it’s the average amount of information that we expect to receive when the outcome of the random variable is revealed. The entropy H(X) of a discrete random variable X with probability mass function P(x) is defined as the expected value of the information content:

H(X) = E[I(X)] = Σ P(x) · I(x)

Here, E[.] denotes the expected value, and the sum is over all possible outcomes x of the random variable X.

Substituting Information Content into the Entropy Formula

Now we substitute the formula for information content from step 1 into the entropy formula from step 2:

H(X) = Σ P(x) · (-log2(P(x)))

This simplifies to:

H(X) = -Σ P(x) · log2(P(x))

This is the formula for entropy. It tells us that the entropy of a random variable is the negative of the sum, over all possible outcomes, of each outcome’s probability multiplied by the logarithm of that probability.

The entropy of a random variable is a measure of the average amount of “surprise” or “information” that we receive when we learn the outcome of the random variable. It’s calculated as the expected value of the information content of each possible outcome.
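
As a minimal sketch of this formula in code (the function name entropy is my own choice, and the distribution is passed as a plain list of probabilities), assuming base-2 logarithms so the result is in bits:

```python
import math

def entropy(probs):
    """Entropy, in bits, of a discrete distribution given its probabilities."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely outcomes carry 2 bits of uncertainty.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```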

The use of the logarithm in the formula for information content has a few important reasons and implications:

  1. Inversely Proportional to Probability: The logarithm ensures that the information content is inversely proportional to the probability of an event: rare events (with low probability) have high information content, while common events (with high probability) have low information content. Since the logarithm of a number between 0 and 1 is negative, taking its negative yields a positive value that is large for rare events and small for common events.
  2. Additivity of Independent Events: The logarithm makes the information content of independent events additive. If events A and B are independent, the information content of both events occurring is I(A and B) = I(A) + I(B). This property is useful in information theory.
  3. Quantifying Surprise: The logarithm provides a nice interpretation in terms of “surprise”. Less likely events are more “surprising”, and the logarithm increases dramatically as the probability decreases, reflecting this increased surprise.

The logarithm in the formula for information content helps create a measure that aligns with our intuitive understanding of information: it’s high for rare, surprising events and low for common, unsurprising events, and it’s additive for independent events.
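
A quick Python check (reusing the same -log2 definition as above; the event probabilities 0.25 and 0.1 are arbitrary choices for illustration) makes the additivity property concrete:

```python
import math

def information_content(p):
    return -math.log2(p)

# Two independent events A and B.
p_a, p_b = 0.25, 0.1
p_both = p_a * p_b  # independence: P(A and B) = P(A) * P(B)

# I(A and B) equals I(A) + I(B), because log(xy) = log(x) + log(y).
print(information_content(p_both))                          # ~5.32 bits
print(information_content(p_a) + information_content(p_b))  # ~5.32 bits
```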

Example

Consider a fair coin toss. The coin has two outcomes: heads (H) and tails (T), each with a probability of 0.5. We can calculate the entropy of this system as follows:

First, we calculate the information content of each outcome:

I(H) = -log2(0.5) = 1 bit
I(T) = -log2(0.5) = 1 bit

Then, we calculate the entropy of the system:

H(X) = 0.5 · 1 + 0.5 · 1 = 1 bit

So, the entropy of a fair coin toss is 1 bit, which means that each toss of the coin provides 1 bit of information.

Now, consider a biased coin where the probability of getting heads is 0.9 and tails is 0.1. The entropy of this system would be:

H(X) = -(0.9 · log2(0.9) + 0.1 · log2(0.1)) ≈ 0.469 bits

The entropy is lower for the biased coin toss because the outcome is less uncertain.
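
Both figures are easy to verify with a couple of lines of Python (reusing the small entropy helper sketched earlier; the name is mine, not part of any library):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin:   1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits
```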

In the given example, we have two scenarios: a fair coin toss and a biased coin toss.

  1. In the case of a fair coin, the probability of getting heads (H) or tails (T) is the same, 0.5. This means that each outcome is equally likely, and thus, the uncertainty or randomness is at its maximum. This is why the entropy, denoted as H(X), is 1 bit. Each toss of the coin provides 1 bit of information because you are completely uncertain about the outcome before the toss.
  2. In the case of a biased coin, the probability of getting heads is 0.9 and tails is 0.1. This means that getting heads is much more likely than getting tails. Because one outcome (heads) is much more likely, there is less uncertainty or randomness in the toss. You can often predict the outcome (it’s likely to be heads). This is why the entropy, H(X), is lower for the biased coin (approximately 0.469 bits). The outcome is less uncertain, so each toss of the coin provides less than 1 bit of information.

So, when the statement says “The entropy is lower for the biased coin toss because the outcome is less uncertain”, it means that because we can more accurately predict the outcome of the toss (due to the bias), there is less new information provided by each toss, and thus, the entropy (a measure of this new information or uncertainty) is lower.

The entropy values are indeed bounded. In the context of a binary system like a coin toss, where there are only two possible outcomes (heads or tails), the entropy is bounded between 0 and 1 bit.

The entropy is 0 when the outcome is certain. For example, if you have a coin that always lands on heads, there’s no uncertainty, so the entropy is 0.

The entropy is 1 bit when the outcomes are equally likely, as in the case of a fair coin. This is the maximum entropy for a binary system because there’s maximum uncertainty — you have no way of predicting whether the next toss will result in heads or tails.

So, when we say the entropy of a fair coin toss is 1 bit, it means that the uncertainty is at its maximum possible value for this system. Similarly, when we say the entropy of the biased coin toss is approximately 0.469 bits, it means the uncertainty is less than half of what it could be in the most uncertain (or random) case.
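
Sketching this in Python makes the bound visible (the sweep values are arbitrary, and the helper is the same entropy formula as before, just written with log2(1/p)): the binary entropy climbs from 0 for a certain outcome to its 1-bit maximum at P(heads) = 0.5 and drops back down again.

```python
import math

def entropy(probs):
    # Same formula as before, written as a sum of P(x) * log2(1 / P(x));
    # outcomes with zero probability contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

for p_heads in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    h = entropy([p_heads, 1.0 - p_heads])
    print(f"P(heads) = {p_heads:.1f} -> H = {h:.3f} bits")
```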

In general, for a system with more possible outcomes, the maximum entropy would be higher. For example, for a fair six-sided die, the maximum entropy would be log2(6) ≈ 2.585 bits.
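
A short check of that claim, assuming a uniform distribution over n outcomes (whose entropy works out to exactly log2(n)):

```python
import math

# A fair n-outcome system has n equal terms of (1/n) * log2(n), i.e. log2(n) bits in total.
for n in (2, 6):
    uniform_entropy = -sum((1.0 / n) * math.log2(1.0 / n) for _ in range(n))
    print(f"n = {n}: H = {uniform_entropy:.3f} bits (log2({n}) = {math.log2(n):.3f})")
```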

In the next part, we will discuss why entropy alone may not be sufficient to capture the differences between probability distributions or the amount of information gained when moving from one distribution to another.

Entropy is a measure of uncertainty or randomness within a single probability distribution. It doesn’t take into account the specific values that the random variable can take, only their probabilities. Therefore, two different distributions with the same entropy can be very different in terms of their actual values.

Moreover, entropy doesn’t measure the difference between two probability distributions. For example, if we have two different distributions over the same set of events, entropy doesn’t tell us how much information we gain or lose when moving from one distribution to the other.

To measure the difference between two probability distributions, we need a different concept, such as the Kullback-Leibler (KL) divergence. The KL divergence measures how one probability distribution diverges from a second, expected probability distribution. It’s often used in machine learning to measure the loss of information when approximating one distribution with another.
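
As a rough preview of the next part rather than a full treatment: the discrete KL divergence is defined as D(P || Q) = Σ P(x) · log2(P(x) / Q(x)), and the sketch below (the helper name kl_divergence is mine) applies it to the fair and biased coins from earlier, which also shows that the measure is not symmetric:

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q), in bits, between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [0.5, 0.5]
biased = [0.9, 0.1]

# Extra bits needed when the fair coin is used to model the biased one, and vice versa.
print(kl_divergence(biased, fair))  # ~0.531 bits
print(kl_divergence(fair, biased))  # ~0.737 bits: KL divergence is not symmetric
```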

So, while entropy is a powerful concept for quantifying uncertainty within a single distribution, it doesn’t capture the differences between distributions or the information gained when moving from one distribution to another. For that, we need other tools like the KL divergence.

Please feel free to show your support by clapping as much as you can (you can clap up to 50 times per post). Your applause means a lot and helps to spread the message further. Thank you for your generous claps!
