Making Sense of the Kullback–Leibler (KL) Divergence

Marko Cotra
Feb 12, 2017

Anyone who has spent some time working with neural networks will undoubtedly have come across the Kullback–Leibler (KL) divergence. Often written as D(p, q), it describes the divergence between the probability distributions p and q. If you’re trying to find an explanation of what the KL divergence means, you usually end up with one of two explanations.

The most common one is to think of the KL divergence as the “distance” between two distributions. However, this explanation breaks down pretty quickly since the divergence isn’t symmetric, i.e. in general you have that D(p, q) ≠ D(q, p). In other words, the KL divergence from p to q isn’t necessarily the same as from q to p.
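To make the asymmetry concrete, here is a minimal sketch for the discrete case. It uses the standard discrete definition D(p, q) = Σ p_i log(p_i / q_i) with the natural logarithm; the two distributions and the helper name `kl_divergence` are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(p, q) = sum_i p_i * log(p_i / q_i), in nats.

    Assumes p and q are lists of probabilities over the same outcomes
    and that q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over the same three outcomes.
p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)

print(f"D(p, q) = {d_pq:.4f}")
print(f"D(q, p) = {d_qp:.4f}")
```

For these particular numbers the two values come out at roughly 0.33 and 0.38, so swapping the arguments changes the result.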

The other type of explanation you might come across usually relies on information theory to explain the divergence. Now, if you’re familiar with information theory then this might be enough. However, I would guess that most people come from a probabilistic background, which makes the information theory approach hard to understand. As such, in this post I’ll try to give an intuitive explanation of the KL divergence from a probabilistic perspective.

I’ll mainly look at the case where p and q are continuous distributions, but the general idea applies to the discrete case as well.

Imagine being tasked with generating a model for p(x) and you end up creating a candidate model, q(x). Now, how do you quantitatively determine how good your model is compared to p(x)? This is where the Likelihood Ratio (LR) comes in. In order to answer how well your model describes the data compared to p(x), you can calculate the ratio between the likelihoods.
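Written out for a single sample x, the ratio presumably takes the following form. Placing p(x) in the numerator is inferred from the interpretation that values above 1 favour p(x); the symbol LR(x) is just shorthand chosen here.

```latex
\mathrm{LR}(x) = \frac{p(x)}{q(x)}
```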

For any sample x the ratio between the likelihoods indicates how much more likely the data point is to occur under p(x) than under q(x). So, a value larger than 1 indicates that p(x) is the more likely model, whereas a value smaller than 1 indicates the opposite: q(x) is the more likely model.
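As a small illustration, the sketch below evaluates the ratio for a few sample values. The choice of p as a standard normal N(0, 1) and q as N(1, 1) is purely an assumption for the example, as are the function names.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(x):
    # Assumed for illustration: p is N(0, 1) and q is N(1, 1).
    return gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 1.0, 1.0)

lr_near_p = likelihood_ratio(-0.5)   # sample closer to the mean of p
lr_near_q = likelihood_ratio(1.5)    # sample closer to the mean of q
lr_midpoint = likelihood_ratio(0.5)  # sample halfway between the two means

print(lr_near_p, lr_near_q, lr_midpoint)
```

A sample close to the mean of p gives a ratio above 1, a sample close to the mean of q gives a ratio below 1, and the midpoint between the two means gives a ratio of 1, where neither model is favoured.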

For a set of data with independent samples you can compute the likelihood ratio for the entire set by taking the product of the likelihood ratios of the individual samples.
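In symbols, assuming the data set consists of n independent samples x_1, …, x_n (this indexing is notation introduced here), the product would read:

```latex
\mathrm{LR} = \prod_{i=1}^{n} \frac{p(x_i)}{q(x_i)}
```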

In order to make this calculation easier to compute you can take the logarithm of the entire expression, since that decomposes the product into a sum.
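The following sketch checks this decomposition numerically on a handful of made-up samples. It again assumes p is N(0, 1) and q is N(1, 1) and uses the natural logarithm; any other base works the same way, differing only by a constant factor.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical independent samples, chosen by hand for illustration.
samples = [-0.3, 0.1, 0.8, -1.2, 0.4]

# Per-sample likelihood ratios p(x) / q(x), with p = N(0, 1) and q = N(1, 1).
ratios = [gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 1.0, 1.0) for x in samples]

# Route 1: multiply the ratios, then take the log of the product.
log_of_product = math.log(math.prod(ratios))

# Route 2: take the log of each ratio, then sum.
sum_of_logs = sum(math.log(r) for r in ratios)

print(log_of_product, sum_of_logs)
```

Both routes give the same number up to floating-point error. For long data sets the sum is usually the safer choice, since a product of many ratios can overflow or underflow.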

In this case, log LR values larger than zero indicate that p(x) better fits the data, whereas a value less than zero tells us that…
