# Demystified: Kullback–Leibler Divergence

*A quick primer on Kullback-Leibler Divergence, an important concept to understand in machine learning and information theory*

So, first things first — we need to understand what entropy is, in terms of information theory rather than thermodynamics. Both are important and *curiously* related, but for ML, and possibly card counting, we’re going with information theory. This entropy (Shannon entropy) is named after its creator. We miss you, Claude.

# Shannon Entropy

Let’s start off with a bit of prerequisite understanding.

**Discrete (countable) random variables (DRVs)**

This one is pretty simple: a variable with a **countable** sample space — for our examples, finite. A single die roll is a good example — it has a sample space of six outcomes, `{1, 2, 3, 4, 5, 6}`, each with equal probability (1/6). A coin flip is another good example — a sample space of two outcomes, `{heads, tails}`, again with equal probability (1/2). The associated probabilities can be equal, like in a fair coin flip, but don’t have to be, like an unfair coin used by a rogue carny trying to trick you out of your hard-earned cash: `{heads: 1/3, tails: 2/3}`.

Surprisingly, if you were to dive into a textbook definition of DRVs, it would get quite complicated for no reason.

That covers DRVs. Let’s move into entropy.

**Entropy**

Shannon entropy was devised as a method to measure the amount of information in transmitted messages. The result of the formula is the average number of bits required to represent the information.

You will find a lot of strange intuitions about what this means. I posit this is related to a fundamental misunderstanding of entropy in general. By no means am I an expert on physical and informational entropy. However, this line of conversation could lead to a 6-month mathematical rabbit hole in which you lock yourself into your home and lose your mind.

Let’s save you from that and keep it simple.

The clearest and most succinct definition came from an *unpredicted* place.

OK. Let’s jump straight into the formula:

`H(X) = -Σ_i (p(X_i) * log_b(p(X_i)))`

`X` is our sample space, `X_i` is an element of our sample space, and `b` represents our base; typically in information theory, we’ll use base 2.

So, using our example above, let’s calculate a bit of entropy.

For a single fair coin flip:

`-((1/2 * log2(1/2)) + (1/2 * log2(1/2))) = -(-0.5 + -0.5) = 1`

One ‘bit’ of entropy. A fair coin is in a state of ‘maximal entropy’.

So, if we have an unfair coin with the probabilities listed above, we have an increased degree of predictability right? Yes — we can predict tails more often (2/3 vs. 1/2). Given that, we’d expect our entropy measure to be less than one for a flip of the unfair coin.

`-((1/3 * log2(1/3)) + (2/3 * log2(2/3))) = -((-0.52832) + (-0.38998)) ≈ 0.92`

Yep. The universe is still in consistent order. For now…

By the way, if you’re a data scientist and are working with information entropy, don’t waste time writing your own implementation unless you’re looking to learn/challenge yourself. There’s a nice implementation in the SciPy package. If that’s not fast enough, and you’re working with strings of information, there’s a separate package implemented in C.
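If you are in the learn/challenge-yourself camp, the discrete formula is only a few lines of plain Python. This is a minimal sketch, not a replacement for the SciPy version (`scipy.stats.entropy(pk, base=2)` computes the same thing):

```python
from math import log2

def shannon_entropy(probs):
    """Average number of bits per outcome: H = -sum(p * log2(p))."""
    # Terms with p == 0 contribute nothing (lim p*log p -> 0), so skip them.
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit, maximal entropy
print(shannon_entropy([1/3, 2/3]))    # the carny's coin: ~0.92 bits
```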

If you still want to dive into information entropy, I’d start with this video:

If that hasn’t satiated your voracious appetite for information entropy, then MIT has open sourced an entire course on information and entropy — taught by Seth Lloyd. What a world we’re blessed with.

# Kullback–Leibler (KL) Divergence

If you’re working in the field, you’re probably already aware of KL divergence, sometimes called *information gain* in the context of decision trees and also called *relative entropy*. But unless you’ve written implementations of Bayesian inference algorithms or have done graduate coursework in information theory or machine learning, you might not have gotten down into the nuts and bolts. Don’t worry, it’s actually pretty straightforward.

The simplest, highest-level, nonthreatening way of describing it is: if you have two probability distributions, P and Q, the KL divergence measures how much P and Q differ. If `KL_Divergence(P||Q) = 0`, the distributions are equal.

Alright, that should fix the concept in your mind. Now, onto the details.

**Formulae**

For discrete probability distributions:

`KL_Divergence(P||Q) = Σ_x (P(x) * log(P(x) / Q(x)))`

For continuous probability distributions:

`KL_Divergence(P||Q) = ∫ p(x) * log(p(x) / q(x)) dx`

**Properties**

- It is non-symmetric: `KL_Divergence(P||Q) != KL_Divergence(Q||P)`. Sometimes you’ll hear KL divergence called a distance metric. That’s a simpler way of understanding it, but if we want to keep it technically correct, KL divergence doesn’t satisfy the triangle inequality, so it’s not a true distance. If we want a symmetric form, say for clustering or some other interesting use of distances in a topological space, use Jensen–Shannon divergence.
- It’s a summation for discrete probability distributions and an integral for continuous probability distributions (formulas above).
- It’s always ≥ 0.

As with the entropy calculation above, don’t worry about writing your own implementation unless you’re working on something custom or doing it for the sheer joy of learning. In fact, the same SciPy function gives you KL divergence when you pass a second distribution as an additional parameter: `scipy.stats.entropy(pk, qk)`.
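For the joy-of-learning route, here is a minimal sketch of discrete KL divergence in plain Python, plus the Jensen–Shannon symmetrization mentioned above (base 2, so the results are in bits):

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P||Q): sum over outcomes of P(x) * log2(P(x) / Q(x))."""
    # Skip outcomes where P(x) == 0; they contribute nothing to the sum.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, built from KL via the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

fair = [0.5, 0.5]
unfair = [1/3, 2/3]    # the carny's coin

print(kl_divergence(fair, fair))     # identical distributions -> 0.0
print(kl_divergence(fair, unfair))   # ~0.085 bits
print(kl_divergence(unfair, fair))   # ~0.082 bits: not symmetric!
print(js_divergence(fair, unfair) == js_divergence(unfair, fair))  # symmetric
```

Note how the two KL directions disagree while the Jensen–Shannon form comes out the same either way — exactly the non-symmetry property from the list above.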

Now, let’s try to mix all of these ingredients into a mind cake for you to enjoy for the rest of your nerdy days.

Why is it called *information gain*?

I decided to snapshot straight out of my notebook. If we have some sort of Bayesian update process, as is typical in Bayesian learning, we can follow this train of logic.

This demonstrates literal information, measured in additional bits, gained by moving from a prior distribution `p(x)` to a posterior `p(x|y)`.

Sorry in advance if you can’t read my chicken scratch

The above process can be repeated indefinitely with every new observation of `Y`.
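To make that loop concrete, here is a small sketch. The two-hypothesis setup is mine, purely for illustration: a Bayesian update on which coin the carny handed you, with the information gained measured as the KL divergence from prior to posterior.

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P||Q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Two hypotheses about the coin in hand: it's fair, or it's the carny's
# unfair coin. P(tails | hypothesis) for each:
likelihood_tails = [1/2, 2/3]

prior = [0.5, 0.5]  # no idea which coin it is yet

# Observe one tails flip; Bayes' rule: posterior is proportional to
# likelihood * prior, renormalized.
posterior = normalize([l * p for l, p in zip(likelihood_tails, prior)])

# Information gained from this single observation, in bits:
gain = kl_divergence(posterior, prior)
print(posterior, gain)  # the posterior now leans towards "unfair"
```

Feed the posterior back in as the next prior, repeat for every new flip, and the gains accumulate — that’s the update loop described above.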

With enough observation, you eventually have all the bits. What you do with them, is your choice….

Thanks for reading!

I’m on Twitter too!