Making sense of the Kullback–Leibler (KL) Divergence

Anyone who has spent some time working with neural networks will have undoubtedly come across the Kullback–Leibler (KL) divergence. Often written as D(p, q), it describes the divergence between the probability distributions p and q. If you go looking for an explanation of what the KL divergence actually means, you usually end up with one of two answers.
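
For concreteness, here is a minimal sketch of the discrete form of the divergence in plain NumPy. The function name and the convention of summing only over outcomes where p is nonzero are my own choices for illustration, not anything prescribed by a particular library:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D(p, q) = sum_i p_i * log(p_i / q_i)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only sum where p > 0; by convention, 0 * log(0) contributes nothing.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
```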

The most common one is to think of the KL divergence as the “distance” between two distributions. However, this explanation breaks down pretty quickly, since the KL divergence isn’t actually a metric: it is not symmetric (D(p, q) and D(q, p) are generally different), and it does not satisfy the triangle inequality.
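
You can see the asymmetry directly by plugging two made-up distributions into the sketch above (the numbers here are arbitrary examples, chosen only to make the point):

```python
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])

print(kl_divergence(p, q))  # ~0.335
print(kl_divergence(q, p))  # ~0.382 -- different, so D(p, q) != D(q, p)
```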