Making sense of the Kullback–Leibler (KL) Divergence

Anyone who has spent some time working with neural networks will undoubtedly have come across the Kullback-Leibler (KL) divergence. Often written as D(p, q), it describes the divergence between the probability distributions p and q. If you try to find an explanation of what the KL divergence stands for, you usually end up with one of two explanations.

The most common one is to think of the KL divergence as the “distance” between two distributions. However, this explanation breaks down pretty quickly, since the KL divergence isn’t symmetric, i.e. in general D(p, q) ≠ D(q, p). In other words, the KL divergence from p to q isn’t necessarily the same as from q to p.

The other type of explanation you might come across usually relies on information theory to explain the metric. Now, if you’re familiar with information theory then this might be enough. However, I would guess that most people come from a probabilistic background, which makes the information theory approach hard to understand. As such, in this post I’ll try to give an intuitive explanation of the KL divergence from a probabilistic perspective.

I’ll mainly look at the case where p and q are continuous distributions, but the general idea will still apply for the discrete case as well.


Imagine being tasked with generating a model for p(x) and you end up creating a candidate model, q(x). Now, how do you quantitatively determine how good your model is compared to p(x)? This is where the Likelihood Ratio (LR) comes in. To answer how well your model describes the data compared to p(x), you can calculate the ratio between the likelihoods.

The Likelihood-Ratio: LR(x) = p(x) / q(x)

For any sample x, the ratio between the likelihoods indicates how much more likely that data point is to occur under p(x) as opposed to q(x). A value larger than 1 indicates that p(x) is the more likely model, whereas a value smaller than 1 indicates the opposite: q(x) is the more likely model.

For a set of independent samples you can compute the likelihood ratio for the entire set by taking the product of the likelihood ratios of the individual samples.

The Likelihood-Ratio for a set of i.i.d. samples: LR = ∏_{i=1}^{N} p(x_i) / q(x_i)

To make this expression easier to compute you can take the logarithm, since that decomposes the product into a sum.

The Log Likelihood-Ratio for a set of i.i.d. samples: log LR = ∑_{i=1}^{N} log( p(x_i) / q(x_i) )

In this case, log LR values larger than zero indicate that p(x) better fits the data, whereas a value less than zero tells us that q(x) better fits the data. A value of zero indicates that both models fit the data equally well.
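To make this concrete, here is a minimal sketch in Python of the LR and log LR for a set of i.i.d. samples. It assumes numpy and scipy are available and borrows the two beta distributions used as an example later in the post; the original post doesn’t include any code.

```python
# Minimal sketch: (log) likelihood ratio of a set of i.i.d. samples.
# Assumes numpy/scipy; the two beta distributions mirror the example below.
import numpy as np
from scipy import stats

p = stats.beta(a=17, b=6)   # "true" model p(x)
q = stats.beta(a=3, b=3)    # candidate model q(x)

x = p.rvs(size=100, random_state=0)            # i.i.d. samples drawn from p(x)

lr = np.prod(p.pdf(x) / q.pdf(x))              # likelihood ratio (can overflow for large N)
log_lr = np.sum(np.log(p.pdf(x) / q.pdf(x)))   # log likelihood ratio (numerically safer)

print(f"LR = {lr:.3g}, log LR = {log_lr:.3f}")
# log LR > 0 means the data favour p(x); log LR < 0 means they favour q(x).
```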

So, to summarize, the likelihood ratio tells us how many times more likely the observed data is under one model than under the other. To make this ratio easier to compute, especially when dealing with an entire set of data, you use the log LR instead. As an example, when setting p(x) and q(x) to be beta distributions with parameters (a=17, b=6) for p(x) and (a=3, b=3) for q(x) respectively, their LR and log LR curves look as follows.

The gray dotted lines separate the regions where one model is more likely than the other, i.e. points above the gray lines indicate the region where p(x) is the more likely model.
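As a rough sketch of how such curves could be produced (assuming numpy, scipy, and matplotlib; this is not the post’s original plotting code):

```python
# Pointwise LR and log LR curves for p = Beta(17, 6) vs. q = Beta(3, 3).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

p = stats.beta(17, 6)
q = stats.beta(3, 3)

xs = np.linspace(0.01, 0.99, 500)
lr = p.pdf(xs) / q.pdf(xs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(xs, lr)
ax1.axhline(1, color="gray", linestyle=":")   # LR = 1: both models equally likely
ax1.set(title="LR", xlabel="x")
ax2.plot(xs, np.log(lr))
ax2.axhline(0, color="gray", linestyle=":")   # log LR = 0: both models equally likely
ax2.set(title="log LR", xlabel="x")
plt.show()
```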

Expectation

With the log LR you can quantify how much better one model is than the other, given the data you have. This number will, however, depend on the amount of data available.

The question is then: if you have a large set of data sampled from p(x), how much will each sample on average indicate that p(x) describes the data better than q(x)? You can calculate this average “predictive power” by sampling N points from p(x) and normalizing the sum of the log LR over those samples:

(1/N) ∑_{i=1}^{N} log( p(x_i) / q(x_i) )
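A sketch of this estimate for the beta example (again assuming numpy and scipy; the sample size and seed are arbitrary):

```python
# Average "predictive power" per sample: mean log LR over N draws from p(x).
import numpy as np
from scipy import stats

p = stats.beta(17, 6)   # "true" model
q = stats.beta(3, 3)    # candidate model

N = 10_000
x = p.rvs(size=N, random_state=0)
avg_log_lr = np.mean(np.log(p.pdf(x) / q.pdf(x)))
print(f"average log LR per sample ≈ {avg_log_lr:.4f}")
```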

What happens if you do this for an infinite number of samples from p(x)? If you let N → ∞, then you get the following:

lim_{N→∞} (1/N) ∑_{i=1}^{N} log( p(x_i) / q(x_i) )

In other words, you get the expected value of the log LR under p(x), which analytically is defined as

E_{x∼p(x)}[ log( p(x) / q(x) ) ] = ∫ p(x) log( p(x) / q(x) ) dx

This expression should look familiar, since it is the exact definition of the KL divergence! In other words,

D(p, q) = E_{x∼p(x)}[ log( p(x) / q(x) ) ] = ∫ p(x) log( p(x) / q(x) ) dx

For the previous example where both p(x) and q(x) were beta distributions, you can see how the average log LR, with samples drawn from p(x), approaches the analytically calculated value for D(p(x), q(x)) as you increase the number of samples N.

Dotted black line is the analytically calculated value for D(p(x), q(x)). Gray line is the average log LR. Note that the x-axis is in log10 scale, i.e. for a sample size of N = 10,000 the corresponding value on the x-axis will be log10(10,000) = 4
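A convergence experiment along these lines can be sketched as follows. This isn’t the post’s original code; the “analytic” value of D(p, q) is obtained here by numerically integrating p(x) log( p(x) / q(x) ) over (0, 1) with scipy.integrate.quad.

```python
# Monte Carlo estimate of the average log LR vs. the numerically integrated D(p, q).
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.beta(17, 6)
q = stats.beta(3, 3)

# D(p, q) = integral of p(x) * log( p(x) / q(x) ) dx over the support (0, 1)
kl_value, _ = quad(lambda x: p.pdf(x) * np.log(p.pdf(x) / q.pdf(x)), 0, 1)

rng = np.random.default_rng(0)
for N in [10, 100, 1_000, 10_000, 100_000]:
    x = p.rvs(size=N, random_state=rng)
    estimate = np.mean(np.log(p.pdf(x) / q.pdf(x)))
    print(f"N = {N:>6}: average log LR = {estimate:.4f}  (D(p, q) = {kl_value:.4f})")
```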

We have now arrived at what the KL divergence actually stands for. It’s a measure of how much “predictive power” or “evidence” each sample will on average bring when you’re trying to distinguish p(x) from q(x), if you’re sampling from p(x). If p(x) and q(x) are very similar then each individual sample will bring little “evidence” to the table. On the other hand, if p(x) and q(x) are very different then each sample will bring a lot of evidence showcasing that q(x) is not a good representation of p(x).


Hopefully you now have a pretty good understanding of what the KL divergence is and why it makes sense to use it as a metric for the difference between two distributions. In practice you’re often in situations where you want to build a model that’s as close as possible to the “true” model. In that case, you would like it to be as difficult as possible to distinguish the model you built from the real one, especially for samples drawn from the real model. Minimizing D(p, q), with p as the real model and q as yours, does exactly that: it minimizes the average evidence each sample provides against your model.

A final note regarding why D(p(x), q(x)) isn’t necessarily the same as D(q(x), p(x)). Looking at the example with the two beta distributions defined earlier in the post, we can highlight a couple of things. Values within one standard deviation of p(x) can be reasonably well explained by q(x), although not equally well. However, values within one standard deviation of q(x) cannot be explained equally well by p(x).

In other words, if q(x) is the “real” model, then a large fraction of the samples drawn from q(x) will strongly indicate that p(x) isn’t a good model for generating those points. However, when p(x) is the real model, samples drawn from p(x) can’t indicate as strongly that q(x) is a poor model. This is why the KL divergence isn’t symmetric.
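The asymmetry is easy to check numerically. A sketch, with the same caveats as before (scipy is assumed, and D is computed by numerical integration):

```python
# D(p, q) vs. D(q, p) for the two beta distributions from the example.
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.beta(17, 6)
q = stats.beta(3, 3)

def kl(f, g):
    """Numerically evaluate D(f, g) = integral of f(x) * log( f(x) / g(x) ) dx on (0, 1)."""
    return quad(lambda x: f.pdf(x) * np.log(f.pdf(x) / g.pdf(x)), 0, 1)[0]

print(f"D(p, q) = {kl(p, q):.3f}")
print(f"D(q, p) = {kl(q, p):.3f}")   # in general not equal to D(p, q)
```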


Resources for further information on the KL divergence and the Likelihood-Ratio:

http://www.ism.ac.jp/~eguchi/pdf/KLNP.pdf

http://nowak.ece.wisc.edu/ece830/ece830_fall11_lecture7.pdf