KL Divergence to Find the Best

Güldeniz Bektaş · Published in Analytics Vidhya · Sep 23, 2021

Well, let me tell you, I had NO idea about KL divergence until I took a course. Since it's a pretty complicated concept for me, I would like to write about it!

I read some articles about KL divergence. In my experience, the best way to put something into words is with an example. I'll continue with one I quote from this amazing article (I encourage you to read it). Sometimes we may want to approximate our data distribution with a much simpler distribution.

Imagine NASA has a new mission that sends space scientists to a new planet to observe biting worms. The scientists detected that the worms start with the same number of teeth, but over time they lose teeth due to chomping. Well, anyway. So they put their data together, and this is what they have at the end of the mission:

(Figure: the empirical probability distribution of teeth counts, taken from [1].)

One problem: they have to send this data to Earth for further examination. But they are far away from Earth, so sending this data will be expensive. What can they do? They can reduce the number of parameters! But to do that, they have to find the distribution that fits this data with the least loss of information.

They can try the uniform distribution. The only parameter this distribution needs is the number of possible values. The data has 11 possible values (0 to 10 teeth).

P_uniform = 1 / possible values = 1 / 11 = 0.0909

Under the uniform distribution, every teeth count gets the same probability.

(Figure: the uniform approximation of the teeth data, taken from [1].)

Now all the diversity in the data is gone. Clearly, the uniform distribution is not the best choice for their data.

They can try the binomial distribution. A binomial distribution is defined by the number of trials n and the probability p, and its mean and variance are:

mean = np
variance = np(1-p)

They can use the mean to find p for the binomial distribution. Here, n is the maximum number of teeth observed in the population of worms, which is 10, and the mean is the expected number of teeth for the worms. We can calculate it like this:

mean = Σ x · p(x) = 5.44

5.44 = 10 x p

p = 0.544
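For concreteness, here is a minimal sketch in Python (my own, not code from the article), assuming NumPy and SciPy are available, of how these two parameters define the binomial probabilities the scientists could send instead of the full data:

import numpy as np
from scipy.stats import binom

n = 10          # maximum number of teeth observed
mean = 5.44     # expected number of teeth from the mission data
p = mean / n    # 0.544, the estimated success probability

# Probability the binomial approximation assigns to each teeth count 0..10.
# These 2 parameters (n, p) replace the 11 empirical probabilities.
binomial_probs = binom.pmf(np.arange(n + 1), n, p)
print(binomial_probs.round(4))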

(Figure: the binomial approximation with n = 10 and p = 0.544, plotted against the original data; taken from [1].)

Seems like it doesn’t fit our original distribution either. What should they do then? How can they find the best distribution to fit their original distribution?

Let’s break it down a little.

What they want is to send the data with fewer parameters to reduce the cost, and to find the distribution that loses the least information.

Now it's clearer to me. To check that we managed to lose as little information as possible, we first need to know how much information the data contained originally.

Entropy

Entropy is an important metric in information theory. The primary goal of information theory is to quantify how much information is in data.

I want to quote a really good definition of entropy that I found online:

Entropy allows us to make precise statements and perform computations with regard to one of life’s most pressing issues: not knowing how things will turn out. Entropy, in other words, is a measure of uncertainty.

And of course, the one and only equation:

H = -Σ p(x) · log(p(x)) (the Shannon entropy, you may say)

If we use log2 for our calculation we can interpret entropy as “the minimum number of bits it would take us to encode our information”. In this case, the information would be each observation of teeth counts given our empirical distribution. Given the data that we have observed, our probability distribution has an entropy of 3.12 bits. The number of bits tells us the lower bound for how many bits we would need, on average, to encode the number of teeth we would observe in a single case.[1]
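As a quick sketch (my own code, not from the article, assuming NumPy), this is how entropy in bits can be computed for a discrete distribution. For comparison, the uniform approximation over 11 outcomes comes out at log2(11) ≈ 3.46 bits, higher than the 3.12 bits of the empirical distribution quoted above:

import numpy as np

# Shannon entropy in bits: H = -sum(p * log2(p))
def shannon_entropy_bits(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(probs * np.log2(probs)))

# Entropy of the uniform approximation over the 11 possible teeth counts
uniform = np.full(11, 1 / 11)
print(shannon_entropy_bits(uniform))  # log2(11) ≈ 3.46 bits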

If our goal is to minimize uncertainty (entropy), we should stay away from the uniform distribution, since under a uniform distribution all outcomes have the same probability (as you can see above), which gives the highest possible entropy. In our case, it wouldn't represent the worms' teeth count probabilities correctly.

The best test of which is better is to ask which distribution preserves the most information from our original data source. This is where Kullback-Leibler Divergence comes in.

KL-Divergence

Kullback-Leibler Divergence is just a slight modification of our formula for entropy. Rather than just having our probability distribution p we add in our approximating distribution q.

Let p(x) and q(x) be two probability distributions of a discrete random variable x. That is, both p(x) and q(x) sum to 1, and p(x) > 0 and q(x) > 0 for any x in X. D_KL(p || q) is defined as:[4]

D_KL(p || q) = Σ p(x) · log( p(x) / q(x) )

KL Divergence is a measure of how one probability distribution is different from a second, reference probability distribution.

The two vertical bars in “p || q” indicate that we are measuring how the distribution q diverges from the distribution p.

You might conclude that since KL divergence measures how different two distributions are, it is a distance measure. But it is not, because it is not a metric.

KL Divergence is not symmetric: the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x).

D_KL(p || q) is non-negative: D_KL(p || q) ≥ 0, and D_KL(p || q) = 0 if and only if p = q.
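Here is a small sketch of the discrete formula above (my own code, assuming NumPy; the teeth-count probabilities below are made up for illustration, not the actual mission data), which also shows the asymmetry:

import numpy as np

# Discrete KL divergence: D_KL(p || q) = sum(p(x) * log(p(x) / q(x)))
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical empirical teeth-count probabilities (0 to 10 teeth)
p_data = np.array([0.02, 0.03, 0.05, 0.14, 0.16, 0.15,
                   0.12, 0.08, 0.10, 0.08, 0.07])
q_uniform = np.full(11, 1 / 11)

print(kl_divergence(p_data, q_uniform))  # information lost by the uniform fit
print(kl_divergence(q_uniform, p_data))  # a different value: KL is not symmetric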

Implementation

The code below is an implementation of the KL divergence between two Gaussian distributions. For this implementation we need two samples drawn from Gaussian distributions.
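A minimal sketch of this (my own code, not necessarily the original, assuming NumPy and SciPy): fit a Gaussian to each sample, use the closed-form KL divergence between two univariate Gaussians, and cross-check numerically with scipy.stats.entropy, which computes the KL divergence when given two distributions.

import numpy as np
from scipy.stats import norm, entropy

rng = np.random.default_rng(0)

# Two samples drawn from Gaussian distributions (parameters chosen for illustration)
sample_p = rng.normal(loc=0.0, scale=1.0, size=10_000)
sample_q = rng.normal(loc=1.0, scale=1.5, size=10_000)

# Fit a Gaussian to each sample
mu_p, sigma_p = sample_p.mean(), sample_p.std()
mu_q, sigma_q = sample_q.mean(), sample_q.std()

# Closed-form KL divergence between two univariate Gaussians:
# D_KL(p || q) = log(sigma_q / sigma_p) + (sigma_p^2 + (mu_p - mu_q)^2) / (2 * sigma_q^2) - 1/2
kl_closed_form = (np.log(sigma_q / sigma_p)
                  + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
                  - 0.5)

# Numerical cross-check: evaluate both densities on a grid and use scipy.stats.entropy,
# which computes sum(p * log(p / q)) after normalizing p and q
grid = np.linspace(-10, 10, 2_000)
p = norm.pdf(grid, mu_p, sigma_p)
q = norm.pdf(grid, mu_q, sigma_q)
kl_numerical = entropy(p, q)

print(f"closed form: {kl_closed_form:.4f}, numerical: {kl_numerical:.4f}")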

Well, after some time, that was it. I will write more about advanced statistics concepts. Wait for it!

REFERENCES

[1] Kullback-Leibler Divergence Explained

[2] Light on Math Machine Learning Intuitive Guide to Understanding KL Divergence

[3] Entropy is a Measure of Uncertainty

[4] KL-divergence

[5] AI Labs Joint Program
