Making Sense of the Kullback–Leibler (KL) Divergence

Marko Cotra
6 min readFeb 12, 2017

Anyone who has ever spent some time working with neural networks will have undoubtedly come across the Kullback-Liebler (KL) divergence. Often written as D(p, q), it describes the divergence between the probability distributions p and q. If you’re trying to find an explanation of what the KL divergence stands for you usually end up with two different explanations.

The most common one is to think of the KL divergence as the “distance” between two distributions. However, this explanation breaks down pretty quickly since the metric isn’t commutative, i.e. in general you have that D(p, q)D(q, p). In other words, the KL divergence from p to q isn’t necessarily the same as from q to p.

The other type of explanation you might come across usually relies on information theory to explain the metric. Now, if you’re familiar with information theory then this might be enough. However, I would guess that most people come from a probabilistic background, which makes the information theory approach hard to understand. As such, in this post I’ll try to give an intuitive explanation of the KL divergence from a probabilistic perspective.

I’ll mainly look at the case where p and q are continuous distributions, but the general idea will still apply for the discrete case as well.

Imagine being tasked with generating a model for p(x) and you end up creating a candidate model, q(x). Now, how do you quantitatively determine how good your model compared to p(x)? This…

--

--

Marko Cotra

Engineer with a passion for probability theory, software development and leadership.