Kullback-Leibler (KL) Divergence with Examples (Part II): KL Mathematics

Hossam Hamdy
12 min read · Jul 4, 2023


The Kullback-Leibler (KL) Divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It’s a way of comparing two probability distributions — often a “true” distribution and an approximation of that distribution. It quantifies the “distance” between these two distributions, or how much information is lost when one distribution is used to approximate the other.

In many fields such as machine learning, statistics, and information theory, we often need to approximate complex distributions with simpler ones for computational efficiency or because we don’t know the complex distributions. KL Divergence helps us quantify the trade-off we’re having with this approximation.

Let’s review some concepts before discussing KL Divergence.

  1. Probability Distribution: A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. For example, in a simple coin toss, the probability distribution of getting heads (H) or tails (T) is {P(H) = 0.5, P(T) = 0.5}.
  2. Entropy: In the context of information theory, entropy is a measure of the uncertainty in a random variable. In other words, it measures the “surprise” you expect to have on average when you learn the outcome of the variable. The entropy of a discrete random variable X with probability mass function p(x) is defined as:

H(X) = -\sum_{x} p(x) \log p(x)

where the sum is over all possible outcomes of X, and log is the natural logarithm (see the short numeric sketch below).
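As a quick, minimal sketch of this definition (the `entropy` helper below is just an illustrative name, not a library function), here is how the formula evaluates for a fair and a biased coin:

import numpy as np

def entropy(p):
    # Shannon entropy (in nats) of a discrete distribution p
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))  # fair coin: log(2) ≈ 0.693 nats
print(entropy([0.9, 0.1]))  # biased coin: less uncertainty, ≈ 0.325 nats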

For a full review of entropy, please refer to Part I.

KL Divergence

The formula for the KL Divergence between discrete probability distributions P and Q is defined as:

D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

where the sum is over all possible outcomes.

In the above formula, P is the true distribution and Q is the estimated distribution.
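Since the examples later in this post use numpy, here is a minimal from-scratch sketch of this formula (the `kl_divergence` name is just illustrative), assuming P and Q are arrays of probabilities that each sum to 1 and that Q is positive wherever P is:

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) for discrete distributions given as arrays of probabilities
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # outcomes with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]   # true distribution P
q = [0.4, 0.4, 0.2]   # estimated distribution Q
print(kl_divergence(p, q))  # ≈ 0.025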

Let’s break down the formula:

1. P(x)

This is the probability of event x according to the first distribution. This term is used as a weighting factor, meaning events that are more probable in the first distribution have a larger impact on the divergence.

We use this term because we’re interested in the difference between P and Q where P has assigned more probability. If an event is highly probable in P but not in Q, we want that to contribute more to our divergence measure. Conversely, if an event is highly improbable in P, we don’t want it to contribute much to our divergence measure, even if Q assigns it a high probability. This is because we’re measuring the divergence from P to Q, not the other way around.

As you can see, P(x) gives more weight to the events that are more probable in P. That is because we are essentially asking “how different is Q from P?” with a bias towards the events that P thinks are more likely. This is useful in many scenarios where we want to penalize models (Q) that fail to assign high probabilities to events that are likely under the true data-generating process (P).

2. P(x) / Q(x)

This term is the ratio of the probabilities assigned to event x by P and Q. If P and Q assign the same probability to x, then this ratio is 1, and the logarithm of 1 is 0, so events that P and Q agree on don’t contribute to the divergence. If P assigns more probability to x than Q does, then this ratio is greater than 1, and the logarithm is positive, so this event contributes to the divergence. If P assigns less probability to x than Q does, then this ratio is less than 1, and the logarithm is negative; but remember that we’re multiplying this by P(x), so events that P assigns low probability to don’t contribute much to the divergence.

The logarithm function is used for a few reasons. One is that it turns ratios into differences, which are often easier to work with. Another is that it ensures that the KL divergence is always non-negative, by Jensen’s inequality. The logarithm also has a nice interpretation in terms of information theory:

-\log p

is the amount of information you gain when you learn that an event with probability p has occurred.

3. For each outcome, we calculate how much probability P assigns to it, and then multiply it by the log of the ratio of the probabilities P and Q assign to it. This ratio tells us how much P and Q differ on this particular outcome.

4. We then sum over all possible outcomes. This gives us a single number that represents the total difference between P and Q.

This means that the KL Divergence is a weighted sum of the log difference in probabilities, where the weights are the probabilities according to the first distribution.
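To make this “weighted sum” reading concrete, here is a small sketch (the numbers are made up purely for illustration) that prints each outcome’s contribution P(x) * log(P(x) / Q(x)) before summing:

import numpy as np

p = np.array([0.5, 0.3, 0.2])    # true distribution P
q = np.array([0.25, 0.25, 0.5])  # estimated distribution Q

contributions = p * np.log(p / q)
for x, c in enumerate(contributions):
    print(f"outcome {x}: contribution {c:+.4f}")
print("D_KL(P || Q) =", contributions.sum())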

The KL Divergence has the following properties:

1. Non-negativity: D_KL(P || Q) is always greater than or equal to 0. This is known as Gibbs’ inequality. The KL Divergence is 0 if and only if P and Q are the same distribution, in which case there is no information loss.

2. Not Symmetric: As mentioned earlier, D_KL(P || Q) is not the same as D_KL(Q || P). This is because the KL Divergence measures the information loss when Q is used to approximate P, not the other way around.
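A minimal numeric check of this asymmetry, reusing the illustrative from-scratch helper sketched earlier (an assumption for this post’s examples, not a library function):

import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.9, 0.1]
q = [0.6, 0.4]
print(kl_divergence(p, q))  # ≈ 0.226
print(kl_divergence(q, p))  # ≈ 0.311 -> a different value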

The true distribution P is often a theoretical or actual ground truth that we know or assume to be correct. However, in many practical applications, we don’t have direct access to this true distribution. Instead, we have a sample of data drawn from this distribution, and we try to estimate the distribution based on this sample. This estimated distribution is Q.

So, while we use the true distribution in our calculation, it’s often a theoretical construct or a known ground truth that we’re trying to approximate, rather than something we can directly use for predictions or other tasks.

Let’s consider an example using email spam filtering.

In this case, the true distribution `P` could be the actual distribution of words in non-spam (legitimate) emails. For example, in legitimate emails, the word “offer” might appear in 5% of emails, “free” in 10%, and so on.

Now, let’s say you’ve developed a spam filter and it’s trying to estimate this distribution based on a sample of emails it has seen. This estimated distribution is `Q`. Maybe your spam filter estimates that “offer” appears in 7% of non-spam emails and “free” in 12%.

The KL divergence can then be used to measure how well your spam filter’s estimated distribution `Q` approximates the true distribution `P`. If the KL divergence is high, it means your spam filter’s estimates are quite different from the true distribution, and it might not be very good at distinguishing spam from non-spam emails. If the KL divergence is low, it means your spam filter’s estimates are close to the true distribution, and it’s likely doing a good job.

In this case, even though you know the true distribution `P`, you can’t directly use it to filter emails, because your spam filter needs to make its estimates based on the emails it sees. The KL divergence is a way to measure how well it’s doing that.

Of course, you could use metrics like false positives, true positives, etc., to evaluate the performance of your spam filter. These metrics are often used in practice and provide valuable information about the performance of a classifier.

However, KL divergence provides a different perspective. It doesn’t just tell you how often your spam filter is right or wrong (like false positives or true positives do), but it tells you how well your spam filter’s estimated distribution of words in non-spam emails matches the actual distribution.

This can be particularly useful in cases where the distribution of words is important. For example, if certain words are very common in non-spam emails, but your spam filter thinks they’re rare, it might be more likely to incorrectly classify non-spam emails as spam. By using KL divergence, you can identify and correct these discrepancies.

Moreover, KL divergence can be beneficial in cases where you have a large number of potential features (like all the words that could appear in an email). Metrics like false positives and true positives give you a single number that summarizes the performance of your classifier, but they don’t tell you anything about which features (words) your classifier is handling well and which ones it’s not. KL divergence, on the other hand, can provide this information.
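As a hedged sketch of this idea (the word probabilities below are hypothetical values invented for illustration, not real email statistics), we can inspect the per-word contributions to the divergence to see which words the filter’s estimate handles poorly:

import numpy as np

words = ["offer", "free", "meeting", "invoice"]
p = np.array([0.05, 0.10, 0.50, 0.35])  # "true" word distribution in legitimate emails (hypothetical)
q = np.array([0.07, 0.12, 0.60, 0.21])  # spam filter's estimated distribution (hypothetical)

contributions = p * np.log(p / q)
for word, c in zip(words, contributions):
    print(f"{word:>8}: {c:+.4f}")
print("D_KL(P || Q) =", contributions.sum())

Here the largest positive contribution comes from the word the filter under-estimates most, which is exactly the kind of discrepancy described above.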

So basically, your classifier learns complex patterns involving thousands of features and interactions to detect spam, and you can then compare the distribution it has learned to the distribution of the test data.

Example (1)

Let’s start with a simple example. Suppose we have a binary outcome, say a coin flip, where the outcomes are either heads (H) or tails (T).

Let’s say the real distribution P is a fair coin, so P(H) = 0.5 and P(T) = 0.5. Now, suppose we have an estimated distribution Q where Q(H) = 0.7 and Q(T) = 0.3. This means our model is predicting heads 70% of the time and tails 30% of the time.

The Kullback-Leibler (KL) divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It’s defined as:

D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

For our binary case, this becomes:

D_{KL}(P \| Q) = P(H) \log \frac{P(H)}{Q(H)} + P(T) \log \frac{P(T)}{Q(T)}

Substituting our values in:

D_{KL}(P \| Q) = 0.5 \log \frac{0.5}{0.7} + 0.5 \log \frac{0.5}{0.3} \approx 0.087

This will give us a numerical value that represents the divergence of Q from P. The higher the KL divergence, the worse our estimated distribution Q is doing at approximating the real distribution P.
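As a quick sanity check of this calculation, here is a minimal sketch using scipy’s element-wise rel_entr, which computes P(x) log(P(x) / Q(x)); summing it over the outcomes gives the KL divergence:

import numpy as np
from scipy.special import rel_entr

p = np.array([0.5, 0.5])  # fair coin: P(H), P(T)
q = np.array([0.7, 0.3])  # estimated distribution: Q(H), Q(T)

print(rel_entr(p, q).sum())  # ≈ 0.087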

The KL divergence is not bounded above. It can take any value from 0 to infinity. A KL divergence of 0 indicates that the two distributions are identical. As the KL divergence increases, the difference between the two distributions also increases.

In terms of bounds, while the KL divergence itself is not bounded, the difference between the probabilities of specific outcomes under the two distributions is bounded: for any specific outcome x, the absolute difference between P(x) and Q(x) is between 0 and 1. This can give us some sense of how far apart the two distributions are on individual outcomes, but it doesn’t give us a bound on the KL divergence itself.
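To see this unboundedness concretely, here is a small sketch (the numbers are just illustrative): as Q assigns ever smaller probability to an outcome that P considers likely, the divergence grows without limit.

import numpy as np
from scipy.special import rel_entr

p = np.array([0.5, 0.5])
for eps in [0.1, 0.01, 0.001, 1e-6]:
    q = np.array([eps, 1 - eps])         # Q puts almost no mass on the first outcome
    print(eps, rel_entr(p, q).sum())     # the divergence keeps growing as eps shrinks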

Example (2)

Let’s consider a simplified example of a movie recommendation system. We’ll use a very basic form of user preference prediction for the sake of simplicity.

Let’s say we have a user who has rated 5 movies in the past. The ratings are on a scale of 1 to 5, with 5 being the highest. Here are the ratings:

  1. Movie A: 5
  2. Movie B: 4
  3. Movie C: 5
  4. Movie D: 1
  5. Movie E: 2

We can consider these ratings as the “true” distribution of the user’s preferences. We’ll normalize these ratings to turn them into probabilities:

import numpy as np

true_distribution = np.array([5, 4, 5, 1, 2])
true_distribution = true_distribution / true_distribution.sum()

Now, let’s say our recommendation system has some features for each movie (like genre, director, actors, etc.) and it uses these features to predict the user’s rating for each movie. Based on these features, it predicts the following ratings:

  1. Movie A: 4
  2. Movie B: 3
  3. Movie C: 5
  4. Movie D: 2
  5. Movie E: 3

Again, we’ll normalize these ratings to turn them into probabilities:

predicted_distribution = np.array([4, 3, 5, 2, 3])
predicted_distribution = predicted_distribution / predicted_distribution.sum()

Now we have two probability distributions: the true distribution based on the user’s actual ratings, and the predicted distribution based on the recommendation system’s predictions. We can use the KL Divergence to measure how close these two distributions are:

from scipy.special import kl_div

# kl_div is element-wise; summing over the movies gives the total divergence
kl_divergence = kl_div(true_distribution, predicted_distribution).sum()
print(f"KL Divergence is {kl_divergence}")

The KL Divergence will give us a single number that quantifies the difference between the true and predicted distributions. A lower KL Divergence means the predictions are closer to the true preferences, and our recommendation system is doing a good job.

This is a very simplified example. Real recommendation systems use much more complex models and have to deal with many additional challenges, like sparse data (users only rate a small fraction of all possible items), temporal dynamics (users’ preferences change over time), and scalability (there can be millions of users and items). But the basic idea of comparing the predicted preferences with the true preferences using something like KL Divergence remains the same.

The Kullback-Leibler (KL) Divergence is primarily a measure of how one probability distribution diverges from a second, expected probability distribution. It’s often used as a way to quantify the “goodness” of a model, similar to how Mean Squared Error (MSE) is used in regression problems.

In the context of a recommendation system, the KL Divergence could be used as part of the evaluation process to compare different models or different versions of a model. For example, you might use it to compare a new version of your recommendation algorithm with the current version to see if it’s better.

However, KL Divergence is not typically used directly in the optimization process of a recommendation system. The optimization process usually involves minimizing a loss function that is specific to the type of recommendation algorithm being used. For example, matrix factorization algorithms, which are commonly used in recommendation systems, often use a form of squared error loss function.

That being said, there are some machine learning models, such as Variational Autoencoders (VAEs), where the KL Divergence is directly included in the loss function and minimized during the training process. These models are trying to learn a probability distribution that approximates the true distribution, and the KL Divergence is used to measure how well they’re doing.
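As an illustration of what that loss term looks like (a sketch under the usual VAE assumptions, not code from any specific library): when the encoder outputs a diagonal Gaussian N(mu, sigma^2) and the prior is a standard normal, the KL term has a well-known closed form.

import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    # Closed-form D_KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# Made-up 3-dimensional latent code produced by an encoder, purely for illustration
mu = np.array([0.2, -0.1, 0.5])
log_var = np.array([-0.3, 0.1, -0.2])
print(gaussian_kl_to_standard_normal(mu, log_var))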

Also, the Kullback-Leibler (KL) Divergence is often used in situations where we want to approximate a complex, unknown distribution with a simpler, known distribution.

A classic example of this is in the field of Bayesian statistics, where we often want to estimate the posterior distribution of a parameter given some data. The true posterior distribution is usually complex and difficult to work with directly, especially for high-dimensional parameters. So instead, we approximate it with a simpler distribution, like a Gaussian, and then use techniques like Variational Inference to find the parameters of the Gaussian that minimize the KL Divergence between the true and approximate distributions.

Another example is in machine learning, where we often want to learn the underlying distribution of the data. For example, in a Generative Adversarial Network (GAN), the generator network tries to generate data that follows the same distribution as the training data. The KL Divergence can be used as part of the loss function to measure how well the generator is doing, with lower KL Divergence indicating that the generated data is closer to the true data distribution.

In these examples, the KL Divergence provides a way to quantify how well our simple, known distribution is approximating the complex, unknown distribution. By minimizing the KL Divergence, we can find the best approximation within our chosen family of simple distributions. We are going to explore all of these and more in the following parts, so stay tuned.

Please feel free to show your support by clapping as much as you can (you can clap up to 50 times per post). Your applause means a lot and helps to spread the message further. Thank you for your generous claps!
