Anyone who has ever spent some time working with neural networks will have undoubtedly come across the Kullback-Liebler (KL) divergence. Often written as D(p, q), it describes the divergence between the probability distributions p and q. If you’re trying to find an explanation of what the KL divergence stands for you usually end up with two different explanations.
The most common one is to think of the KL divergence as the “distance” between two distributions. However, this explanation breaks down pretty quickly since the metric isn’t commutative, i.e. in general you have that D(p, q) ≠D(q, p). In other words, the KL divergence from p to q isn’t necessarily the same as from q to p.
The other type of explanation you might come across usually relies on information theory to explain the metric. Now, if you’re familiar with information theory then this might be enough. However, I would guess that most people come from a probabilistic background, which makes the information theory approach hard to understand. As such, in this post I’ll try to give an intuitive explanation of the KL divergence from a probabilistic perspective.
I’ll mainly look at the case where p and q are continuous distributions, but the general idea will still apply for the discrete case as well.
Imagine being tasked with generating a model for p(x) and you end up creating a candidate model, q(x). Now, how do you quantitatively determine how good your model compared to p(x)? This is where the Likelihood Ratio (LR) comes in. In order to answer how well your model describes data compared to p(x), you can calculate the ratio between the likelihoods.
For any sample x the ratio between the likelihoods indicates how much more likely the data-point is to occur in p(x) as opposed to q(x). So, a value larger than 1 indicates that p(x) is the more likely model, whereas a value smaller than 1 indicates the opposite, q(x) is more likely.
For a set of data with independent samples you can compute the likelihood ratio for the entire set by taking the product of the likelihood ratio for each sample.
In order to make this calculation easier to compute you can take the log10 of the entire expression, since that decomposes the product into a sum.
In this case, log LR values larger than zero indicate that p(x) better fits the data, whereas a value less than zero tells us that q(x) better fits the data. A value of zero indicates that both models fit the data equally well.
So, to summarize, the likelihood ratio will tell us how many times more probable one model is to another when given data. In order to make it easier to compute this ratio, especially when dealing with an entire set of data, you use the log LR instead. As an example, when setting p(x) and q(x) to be beta distributions with parameters (a=17, b= 6) for p(x) and (a=3, b=3) for q(x) respectively, then their LR and log LR curves look the following way
With log LR you can quantify how much better one model is over the other, given the data you have available. This number will depend on the amount of data you have available.
The question is then, if you have a large set of data sampled from p(x), how much will each sample on average indicate that p(x) better describes the data than q(x)? You can calculate this average “predictive power” by sampling N points from p(x) and normalizing the sum of log LR for each sample.
What happens if you do this for an infinite amount of samples from p(x)? If you let N→infinity, then you get the following
In other words you get the expected value under p(x) of log LR, which analytically is defined as
This expression should be familiar to you, since this is the exact definition of the KL divergence! In other words,
For the previous example where both p(x) and q(x) were beta distributions, you can see how the average log LR, with samples drawn from p(x), approaches the analytically calculated value for D(p(x), q(x)) as you increase the number of samples N.
We have now arrived at what the KL divergence actually stands for. It’s a measure of how much “predictive power” or “evidence” each sample will on average bring when you’re trying to distinguish p(x) from q(x), if you’re sampling from p(x). If p(x) and q(x) are very similar then each individual sample will bring little “evidence” to the table. On the other hand, if p(x) and q(x) are very different then each sample will bring a lot of evidence showcasing that q(x) is not a good representation of p(x).
Hopefully you now have a pretty good understanding of what the KL divergence is and why it makes sense to use it as a metric for the difference between two distributions. In practice you’re often in situations where you want to build a model that’s as close as possible to the “true” model. In that case, you would like it to be as difficult as possible to distinguish the model you built from the real one, especially for samples that have been sampled from the real model.
A final note regarding why D(p(x), q(x)) isn’t necessarily the same as D(q(x), p(x)). Looking at the example of the two beta distributions defined earlier in the post we can highlight a couple of things. Values within 1 std.dev of p(x) can be reasonably explained by q(x), although not equally as well. However, values within 1 std.dev of q(x) cannot be explained equally well by p(x).
In other words, if q(x) is the “real” model, then a large fraction of samples drawn from q(x) will be able to strongly indicate that p(x) isn’t a good model for generating those points. However, when you have p(x) as the real model, then the fraction of samples drawn from p(x) can’t indicate as strongly that q(x) is a poor model. This is why the KL divergence isn’t commutative.
Resources for further information on the KL divergence and the Likelihood-Ratio: