Necessary Probability Concepts for Deep Learning: Part 2

Sunil Yadav
5 min read · Mar 9, 2022

KL divergence, JS divergence, and the Wasserstein metric for computing the difference between probability distributions.

In this blog, I will continue the discussion of essential probability/statistics concepts with three more measures that are widely used in deep learning to quantify the distance between probability distributions.

KL (Kullback–Leibler) divergence

KL divergence measures the divergence between two probability distributions. Using the same notation as our last article, let the two distributions be g and h; the KL divergence between them (in the discrete case; the continuous case replaces the sum with an integral) is given as:

D_KL(g ‖ h) = Σ_x g(x) · log( g(x) / h(x) )

I am repeating the KL divergence definition here so that all three measures can be discussed in a uniform way and the pros and cons of each can be examined independently.
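To make the definition concrete, here is a minimal sketch in Python (NumPy only; the function name kl_divergence and the handling of zero-probability entries are my own choices, not from the article) implementing the discrete form above:

```python
import numpy as np

def kl_divergence(g, h):
    """Discrete KL divergence: D_KL(g || h) = sum_x g(x) * log(g(x) / h(x)).

    Assumes g and h are valid probability vectors over the same support,
    and that h(x) > 0 wherever g(x) > 0 (otherwise the divergence is infinite).
    """
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    # Sum only over points where g(x) > 0; by convention 0 * log(0 / h) = 0.
    mask = g > 0
    return np.sum(g[mask] * np.log(g[mask] / h[mask]))
```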

From the above equation, we can say that:

  1. When g and h are identical, the KL divergence is zero; in other words, a lower KL divergence indicates a higher similarity between the two distributions.
  2. KL divergence is not symmetric, i.e. D_KL(g ‖ h) ≠ D_KL(h ‖ g), as the sketch after this list demonstrates numerically.
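Both properties are easy to check numerically with the kl_divergence sketch above (the distributions g and h below are made-up illustrative values, not from the article):

```python
g = np.array([0.1, 0.4, 0.5])
h = np.array([0.8, 0.1, 0.1])

# Property 1: the KL divergence of a distribution with itself is zero.
print(kl_divergence(g, g))  # 0.0

# Property 2: KL divergence is not symmetric.
print(kl_divergence(g, h))  # ~1.151
print(kl_divergence(h, g))  # ~1.364, a different value
```

The same quantity is also available off the shelf: scipy.stats.entropy(g, h) returns exactly this KL divergence when called with two distributions.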


Sunil Yadav

An experienced researcher and co-founder @nocturneGmbH with a keen focus on applying academic research to clinical practice.