Measuring the statistical similarity between two samples using Jensen-Shannon and Kullback-Leibler divergences

Tiago Rosa dos Reis
Published in DataLab Log · Feb 28, 2020

Can we quantify the quality of the information we receive every day? How can we measure the distance between two statistical samples?

Torres del Paine National Park (Patagonia) — Photo by Thomas Fields on Unsplash

Introduction

In recent years, the amount of information generated and processed daily has easily surpassed that of decades ago. The current scale of collection and storage of very large amounts of data is unprecedented. In this scenario, new challenges arise that require more intelligent solutions. One of the biggest challenges is to correctly identify changes in the underlying processes and structures that generate data updated every day. Whether the changes in the data are due to bugs, glitches, or genuinely abrupt or gradual shifts makes it difficult to design a single efficient detection mechanism. A possible solution consists of measuring the divergence between two distributions, building on concepts derived from information theory. Here we introduce two divergence measures, but only one actual distance metric.

Similarity measures for probability distributions

The entropy of a discrete random variable X is a measure of the amount of information required, on average, to describe that variable. It is the most important quantity in information theory, as it measures the uncertainty of a given variable. Shannon defined the entropy H of a discrete random variable X with probability mass function P(x) as:

H(X) = -\sum_{x \in \mathcal{X}} P(x) \log_b P(x)

Entropy H of a discrete random variable X

In the previous equation, if we set b = 2 in the logarithm, H(X) gives the minimum number of bits needed on average to encode all the information contained in X. The relative entropy measures how far apart two distributions are. It is also referred to as the Kullback-Leibler divergence (KL divergence) between two distributions. For discrete probability distributions P(x) and Q(x), defined on the same sample space 𝒳, it is given by:

D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}

KL divergence of distribution Q from P for two discrete random variables

For two probability density functions p(x) and q(x), the previous equation becomes:

D_{KL}(p \| q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx

KL divergence for two continuous random variables

Let P(x) and Q(x), x ∈ 𝒳, be two probability mass functions (i.e. discrete distributions). Then D(P||Q) ≥ 0, with equality if and only if P(x) = Q(x) for all x. In general, D(P||Q) ≠ D(Q||P). The KL divergence is therefore not a true distance metric: it is not symmetric and does not satisfy the triangle inequality. It is also important to notice that the KL divergence is defined only if, for all x, Q(x) = 0 implies P(x) = 0.

An alternative approach is the Jensen-Shannon divergence (JS divergence), another method of measuring the similarity between two probability distributions. It is a symmetric and smoothed version of the KL divergence, and its square root can be used as a distance metric. Defining the mixture distribution M = (P + Q)/2, we can write the JS divergence as:

JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M), \qquad M = \frac{1}{2}(P + Q)

JS divergence between distributions P and Q

From the above equations, we see that the JS divergence is equivalent to the entropy of the mixture minus the mixture of the entropies, i.e. JSD(P||Q) = H(M) - (H(P) + H(Q))/2. It is common to take the square root of the JSD to obtain a true distance metric. All of these quantities are already implemented in Python's SciPy, as the imports below suggest.

from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon
from scipy.special import kl_div
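As a quick sanity check of these functions, here is a minimal example on two made-up discrete distributions (the toy vectors p and q below are our own, chosen purely for illustration):

import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon
from scipy.special import kl_div

# two made-up discrete distributions over the same support (each sums to 1)
p = np.array([0.10, 0.40, 0.50])
q = np.array([0.80, 0.15, 0.05])

# entropy(p, q) returns the KL divergence D(P||Q) = sum(p * log(p / q))
print(entropy(p, q), entropy(q, p))          # asymmetric: the two values differ

# kl_div is elementwise (x*log(x/y) - x + y); its sum over two normalized
# distributions also equals the KL divergence
print(kl_div(p, q).sum())

# jensenshannon returns the square root of the JS divergence (the JS distance)
# and is symmetric in its arguments
print(jensenshannon(p, q), jensenshannon(q, p))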

Computing divergence for discrete variables

The Python script below illustrates this for discrete data: it estimates the probability mass functions with NumPy's histogram and then computes the KL and JS divergences between the two samples.

JS divergence and KL divergence Python code for discrete variables
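The script itself is embedded as an image in the original post; a minimal sketch of the same idea, assuming a shared histogram binning (the helper names kl_divergence, js_divergence and the bins parameter are our own), could look like this:

import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon

def probability_mass(data, bins, data_range):
    """Estimate a probability mass function from raw samples via a histogram."""
    counts, _ = np.histogram(data, bins=bins, range=data_range)
    return counts / counts.sum()

def kl_divergence(x, y, bins=20):
    """Empirical KL divergence D(P||Q) between two samples on a shared binning.
    Note: the result is infinite if some bin has Q(x) = 0 but P(x) > 0."""
    data_range = (min(x.min(), y.min()), max(x.max(), y.max()))
    p = probability_mass(x, bins, data_range)
    q = probability_mass(y, bins, data_range)
    return entropy(p, q)

def js_divergence(x, y, bins=20):
    """Empirical JS divergence (scipy's jensenshannon returns its square root)."""
    data_range = (min(x.min(), y.min()), max(x.max(), y.max()))
    p = probability_mass(x, bins, data_range)
    q = probability_mass(y, bins, data_range)
    return jensenshannon(p, q) ** 2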

To see how this works in practice, let's consider the distribution of some real data with added normal random noise.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# reference_data is the original sample (e.g. a pandas Series or NumPy array)
mu, sigma = reference_data.mean(), reference_data.std()
noise = np.random.normal(0, sigma, size=reference_data.shape)
pure = reference_data.copy()
signal = pure + noise

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
sns.distplot(noise, ax=ax, hist=False,
             kde_kws={'lw': 3, 'label': 'Noise'})
sns.distplot(pure, ax=ax, hist=False, rug=True,
             kde_kws={'lw': 3, 'label': 'Original data'})
sns.distplot(signal, ax=ax, hist=False,
             kde_kws={'lw': 3, 'label': 'Transformed data'})
ax.set_xlabel('Reference data', fontsize=12, weight='bold')

Adding normally distributed noise to the original reference data

We have transformed the original data by adding noise to it (simulating, for instance, how the same column might look in another dataset). Let's observe how the KL divergence and the JS divergence behave under this comparison.

Divergences (KL and JS) for normal random noise with zero mean
Divergences (KL and JS) for normal random noise with different means
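The figures above were produced from the author's reference data; a sketch of how such a comparison could be computed, reusing pure and sigma from the plotting snippet and the hypothetical kl_divergence/js_divergence helpers sketched earlier, might look like:

# compare the original sample against progressively noisier copies of itself
for scale in [0.1, 0.5, 1.0, 2.0]:
    noisy = pure + np.random.normal(0, scale * sigma, size=pure.shape)
    print(f"noise scale {scale}: "
          f"KL = {kl_divergence(pure, noisy):.4f}, "
          f"JS = {js_divergence(pure, noisy):.4f}")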

The KL divergence appears to be more sensitive to small changes in the original data. Similarly, we can generate some discrete random data and compare the divergence between two such samples using both metrics.

JS and KL divergence for discrete random data

Here, we can observe the symmetric behavior of the JS divergence. Its value is the same whether we use x_0 or x_1 as our reference data.
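As a quick illustration of that symmetry (a toy sketch, not the author's exact experiment), one can draw two discrete samples over a small support and evaluate both divergences in each direction, reusing the helpers defined above:

rng = np.random.default_rng(42)
support = np.arange(6)

# two discrete samples drawn from slightly different categorical distributions
x_0 = rng.choice(support, p=[0.10, 0.20, 0.30, 0.20, 0.10, 0.10], size=10_000)
x_1 = rng.choice(support, p=[0.30, 0.30, 0.20, 0.10, 0.05, 0.05], size=10_000)

print(kl_divergence(x_0, x_1, bins=6), kl_divergence(x_1, x_0, bins=6))  # differ
print(js_divergence(x_0, x_1, bins=6), js_divergence(x_1, x_0, bins=6))  # identical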

Computing divergence for continuous variables

What if we need to compute the divergence for continuous variables as well? Instead of probability mass functions and discrete data (i.e. bins), now we have to deal with probability density functions and continuous data (as we have seen before).

A commonly used method is to approximate the observed statistical distribution with some probability density function (pdf). Well-known candidates include the normal, Poisson, beta, and gamma distributions. Suppose the two continuous distributions p(x) and q(x) are both normal, i.e. p(x) = \mathcal{N}(x; \mu_1, \sigma_1^2) and q(x) = \mathcal{N}(x; \mu_2, \sigma_2^2).

We can rewrite the KL divergence for continuous data as

D_{KL}(p \| q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx

The probability density function of the normal distribution is given by

\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Which leads to

D_{KL}(p \| q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

(Analytical) KL divergence between two normal pdfs

Therefore, we have an analytical expression for the KL divergence instead of an empirical one. One of the advantages of expressing it analytically is that we no longer need to "break" the variable into discrete parts (bins), and we avoid the support-intersection problem (i.e. bins where Q(x) = 0 or P(x) = 0).
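Translated directly into Python (the function name kl_normal is our own), the closed-form expression reads:

import numpy as np

def kl_normal(mu_1, sigma_1, mu_2, sigma_2):
    """Closed-form KL divergence D(p||q) between the normal pdfs
    p = N(mu_1, sigma_1^2) and q = N(mu_2, sigma_2^2)."""
    return (np.log(sigma_2 / sigma_1)
            + (sigma_1 ** 2 + (mu_1 - mu_2) ** 2) / (2 * sigma_2 ** 2)
            - 0.5)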

This approach is particularly useful when there is a shift in the transformed data. If, instead of the transformations above, we had shifted the original reference data (e.g. added 100 to all of its values), the divergence would still grow monotonically, as shown below.
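Using the kl_normal helper above (and reusing reference_data from the earlier snippet), this can be checked by fitting a normal to the original and to the shifted sample. Since a pure shift only moves the mean, the divergence reduces to shift² / (2σ²), which grows monotonically with the size of the shift:

mu_ref, sigma_ref = reference_data.mean(), reference_data.std()

for shift in [0, 10, 50, 100]:
    shifted = reference_data + shift
    mu_s, sigma_s = shifted.mean(), shifted.std()
    print(f"shift = {shift}: KL = {kl_normal(mu_ref, sigma_ref, mu_s, sigma_s):.4f}")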

Most of the demonstrations shown so far can also be reproduced using the following GitHub package by Michael Nowotny:

https://github.com/michaelnowotny/divergence

Approximating the data with a normal distribution can be heavily biased, because nothing guarantees that the data actually follow a normal density curve. A more general approach is to approximate the data with a generalized gamma distribution and compute the KL divergence from it. Depending on the data, it fits the observed points better, reducing the approximation error. The probability density function of a generalized gamma distribution is given by:

f(x; a, d, p) = \frac{p / a^{d}}{\Gamma(d/p)} \, x^{d-1} \exp\left(-\left(\frac{x}{a}\right)^{p}\right), \qquad x > 0

The probability density function of a generalized gamma distribution

Where Γ is the gamma function. It follows from the definition of the Kullback-Leibler divergence that the analytical expression for the KL divergence between two generalized gamma density functions is given by:

D_{KL}(f_1 \| f_2) = \log \frac{p_1 \, a_2^{d_2} \, \Gamma(d_2/p_2)}{p_2 \, a_1^{d_1} \, \Gamma(d_1/p_1)} + \left[ \frac{\psi(d_1/p_1)}{p_1} + \log a_1 \right](d_1 - d_2) + \frac{\Gamma\left(\frac{d_1 + p_2}{p_1}\right)}{\Gamma(d_1/p_1)} \left(\frac{a_1}{a_2}\right)^{p_2} - \frac{d_1}{p_1}

KL divergence between two generalized gamma densities

Where ψ(x) is the digamma function.

Below we show a Python implementation that compares variations of the original data obtained by adding different scalars. Notice that we set p = 1 in the general expression, which reduces the generalized gamma to a standard gamma distribution.

Python code for the generalized gamma pdf and KL divergence
Approximating the previous distribution using a generalized gamma approach
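The implementation referenced above is embedded as an image; a sketch of what it might look like, following the parameterization of the pdf given earlier (the helper names gen_gamma_pdf and kl_gen_gamma are our own), is:

import numpy as np
from scipy.special import gamma as gamma_fn, digamma

def gen_gamma_pdf(x, a, d, p):
    """Generalized gamma density with scale a and shape parameters d and p."""
    return (p / a ** d) * x ** (d - 1) * np.exp(-(x / a) ** p) / gamma_fn(d / p)

def kl_gen_gamma(a1, d1, p1, a2, d2, p2):
    """Analytical KL divergence between two generalized gamma densities
    (following Bauckhage, 2014). With p1 = p2 = 1 this reduces to the KL
    divergence between two standard gamma distributions."""
    term_log = np.log((p1 * a2 ** d2 * gamma_fn(d2 / p2))
                      / (p2 * a1 ** d1 * gamma_fn(d1 / p1)))
    term_psi = (digamma(d1 / p1) / p1 + np.log(a1)) * (d1 - d2)
    term_exp = gamma_fn((d1 + p2) / p1) / gamma_fn(d1 / p1) * (a1 / a2) ** p2
    return term_log + term_psi + term_exp - d1 / p1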

As we shift the data, the divergence increases monotonically as well, as expected. Python's scipy package provides implementations of the relevant distributions and special functions, as shown below.

from scipy.stats import gengamma, gamma, beta           # distributions
from scipy.special import gamma as gamma_fn, digamma    # special functions
# scipy.special.gamma is aliased to gamma_fn so it does not clash with the
# scipy.stats.gamma distribution imported above
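For instance, scipy.stats.gengamma could be used to fit the pdf parameters to the data before plugging them into the analytical formula. This is our own reading of SciPy's parameterization, in which the shape parameters (a, c) relate to the pdf above via d = a·c and p = c, while SciPy's scale plays the role of a:

# fit a generalized gamma to the (strictly positive) reference sample with
# loc fixed at 0, then convert SciPy's (a, c, loc, scale) parameters into the
# (a, d, p) notation used in the analytical KL expression above
a_shape, c_shape, loc, scale = gengamma.fit(reference_data, floc=0)
a_fit, d_fit, p_fit = scale, a_shape * c_shape, c_shape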

Computing divergence using a cumulative distribution function

More recent publications have shown approaches to estimate the KL divergence without solving for the densities first (as is typically done). That intermediate step is unnecessary as long as the empirical cumulative distribution function (cdf) is estimated. The empirical cdf is defined as:

P_e(x) = \frac{1}{n} \sum_{i=1}^{n} U(x - x_i)

Here, U(x) is the Heaviside step function (or unit step function), with the convention U(0) = 0.5. This function can be interpolated linearly, leading to a continuous distribution P_c (see reference [3]). The final proposed divergence estimator is then:

\hat{D}(P \| Q) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\delta P_c(x_i)}{\delta Q_c(x_i)} - 1

where

\delta P_c(x_i) = P_c(x_i) - P_c(x_i - \epsilon), \qquad \delta Q_c(x_i) = Q_c(x_i) - Q_c(x_i - \epsilon)

The number 𝜖 (epsilon) is chosen to be smaller than the smallest distance between any two samples. We have implemented Python code to compute the empirical cumulative distribution function, its linear interpolation, and the final divergence estimator. The results can be seen below.

Python code for the cumulative distribution function and its corresponding divergence
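Again, the original code is embedded as an image; a possible sketch of the estimator, with our own helper names and a pragmatic guard for sample points that fall outside the range of Q, is:

import numpy as np

def interpolated_ecdf(sample):
    """Linearly interpolated empirical cdf of a 1-D sample (U(0) = 0.5 convention)."""
    xs = np.sort(np.asarray(sample, dtype=float))
    ys = (np.arange(1, xs.size + 1) - 0.5) / xs.size
    return lambda t: np.interp(t, xs, ys, left=0.0, right=1.0)

def kl_divergence_cdf(p_sample, q_sample):
    """Estimate D(P||Q) from two samples using interpolated empirical cdfs,
    in the spirit of the estimator in reference [3]."""
    p_sample = np.sort(np.asarray(p_sample, dtype=float))
    p_c = interpolated_ecdf(p_sample)
    q_c = interpolated_ecdf(q_sample)

    # epsilon must be smaller than the smallest spacing between any two samples
    spacings = np.diff(np.sort(np.concatenate([p_sample, q_sample])))
    eps = 0.5 * spacings[spacings > 0].min()

    delta_p = p_c(p_sample) - p_c(p_sample - eps)
    delta_q = q_c(p_sample) - q_c(p_sample - eps)
    delta_q = np.maximum(delta_q, 1e-12)  # guard against zero steps outside Q's range
    return np.mean(np.log(delta_p / delta_q)) - 1.0

# example: compare the reference data against noisier and noisier copies of itself
for scale in [0.1, 0.5, 1.0, 2.0]:
    noisy = reference_data + np.random.normal(0, scale * sigma, size=reference_data.shape)
    print(scale, kl_divergence_cdf(reference_data, noisy))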

As we apply the same transformations used for the discrete data (adding normal noise and changing its standard deviation), we observe that the divergence increases as expected. This is a relatively new approach, and more investigation is needed to better understand its behavior.

Conclusions

We presented three different methods to estimate the similarity between two samples. The first discretizes the data and computes the divergence empirically. The second approximates the data with a continuous probability density function; its results appear more stable and it can detect divergence under shifts of the data, but its implementation may be more involved, depending on the tools and resources available. The third and newest method works directly with the empirical cumulative distribution function.

Each of these methods has its advantages and disadvantages, and choosing the right one depends on the data and the available resources. In Python, many of the required building blocks (e.g. the digamma function) are already implemented in libraries such as SciPy, which makes the more advanced approaches easier to apply. There is a lot of room for improvement and exploration in data quality, and designing the right tools can make your daily job much easier!

References

[1] T. Dasu, S. Krishnan, S. Venkatasubramanian, K. Yi — An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams — Interface (2006).

[2] F. Nielsen — On the Jensen-Shannon Symmetrization of Distances Relying on Abstract Means — Entropy (2019), vol. 21, issue 5, p. 485.

[3] F. Pérez-Cruz — Kullback-Leibler Divergence Estimation of Continuous Distributions — 2007 — Department of Electrical Engineering, Princeton University.

[4] C. Bauckhage — Computing the Kullback-Leibler Divergence between two Generalized Gamma Distributions — 2014 — University of Bonn.

[5] T. Cover, J. Thomas — Elements of Information Theory — 2nd edition — 2006 — Wiley-Interscience.
