How to find the similarity between two probability distributions using Python
Using Jensen Shannon Divergence to build a tool to find the distance between probability distributions using Python.
I was on a mission to find a good measure of difference between two probability distributions. After doing a lot of research online, taking feedback from my colleagues, and validating various methods, I found one that does a really good job.
My problem statement could be solved by calculating the statistical distance between the two probability distributions. To do this, I found out that Jensen Shannon Distance can be used.
Jensen-Shannon Divergence (JSD)is a metric derived from another measure of statistical distance called the Kullback-Leiber Divergence(KLD). The reason why I couldn’t use the KLD is that it’s an asymmetrical function. Since there might have been a lot of distance calculations required, it posed a risk.
JSD, on the other hand, is a symmetrical function and the square root of JSD gives the Jensen-Shannon Distance. A measure that we can use to find the similarity between the two probability distributions. 0 indicates that the two distributions are the same, and 1 would indicate that they are nowhere similar.
Here is the formula to calculate the Jensen-Shannon Divergence :
Where P & Q are the two probability distribution, M = (P+Q)/2, and D(P ||M) is the KLD between P and M. Similarly D(Q||M) is the KLD between Q and M.
Implementation in Python
Now that we know the formula, it’s time to implement it. First of all, we need to calculate M and also, the KLD between P&M and Q&M.
Scipy is a phenomenal Python Library for scientific computing and it has lots of statistical measures in-built. It turns out that the entropy measure in scipy is implemented using the KLD. Just what we want.
I found it to be quite simple to implement it with python and I got really good results when I tested it with a few distributions.