Kernel Density Estimation — A Gentle Introduction to Non-Parametric Statistics

--

Normality is a Myth!

Back in the 20th century, when Statistics was still in its infancy and computers were not yet widespread, it was the norm to assume that data were generated from a normal distribution. This was mostly because it made the calculations less tedious in an age when all results were computed by hand.

But with the rise of computational power these assumptions can safely be set aside and more insights can be drawn directly from the data. The sheer availability of data in this Big Data era has also pushed statisticians towards more modern techniques: Non-Parametric Statistics. Here we will discuss one such method for estimating a probability distribution, Kernel Density Estimation.

Introduction

As it happens in all of Statistics, we begin with a statement like this,

X_1, X_2, …, X_n are n random variables following a distribution function F

The more we assume about the data, the further we may be from reality, so let’s make the minimal possible assumption about the distribution F: it is absolutely continuous, i.e. a probability density function (pdf) exists. We want to reconstruct the pdf of this unknown distribution function F.

In a parametric estimation problem we assume (guess) a parametric form of F and estimate its parameters through statistical methods like Maximum Likelihood Estimation, the Method of Moments, etc. But we are not going to do that here. We will instead estimate this density non-parametrically.

Before we delve into Kernel Density Estimation (KDE), which estimates the density non-parametrically, let’s look at an example where a seemingly non-parametric problem can be cast as a parametric inference problem, and then at an instance where non-parametric statistics and KDE play an important role. Let X_1, X_2, …, X_n be observations from an unknown continuous distribution F,

where we want to test whether the median of F is 0.

We want to do this non-parametrically. The following test can intuitively serve the above purpose,

Null Hypothesis: Median of distribution F is 0

A common-sense approach for testing this null hypothesis would be to count the positive and negative observations and see how many lie in each category, i.e.

Number of positive observations should follow Binomial(n, 1/2) under the null hypothesis

We have thus reduced the non-parametric testing problem to a parametric testing problem.
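
This reduction is exactly the classical sign test. As a quick illustration, here is a minimal sketch of it in Python (the function name sign_test and the simulated data are only for illustration; the test statistic and the Binomial(n, 1/2) null distribution are as described above):

```python
import numpy as np
from scipy.stats import binom

def sign_test(x, alpha=0.05):
    """Two-sided sign test of H0: median = 0 via the Binomial(n, 1/2) reduction."""
    x = np.asarray(x, dtype=float)
    x = x[x != 0]                      # drop exact zeros (ties carry no sign information)
    n = x.size
    k = int(np.sum(x > 0))             # number of positive observations
    # two-sided p-value: how extreme is k under Binomial(n, 1/2)?
    p_value = min(1.0, 2 * min(binom.cdf(k, n, 0.5), binom.sf(k - 1, n, 0.5)))
    return k, n, p_value, p_value <= alpha

# Example usage on heavy-tailed data whose true median is 0
rng = np.random.default_rng(0)
print(sign_test(rng.standard_t(df=3, size=100)))
```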

Now, let’s turn towards another example. Suppose the data come from an unknown true density g, which we try to approximate with a parametric family of densities f_theta.

Parametric estimation means finding the f_theta in the model family that is closest to the true density g

If g were in the chosen family of models, then the distance between the estimated f_theta and g would be 0 for some choice of the parameter, i.e.

rho is the distance metric between two density functions

The above situation occurs when the modelling is perfect, which is often not the case in real life. So, even the best choice in the set of parametric functions of the form f_theta will be close to g but not exactly equal to it. Therefore, we do the following,

We find the parameter which minimizes the distance between our assumed parametric model and the actual density

Even in the best case, this parameter will often still leave a positive value of the distance.

One particular choice of distance between two density functions can be the Kullback–Leibler divergence given by,

Kullback-Leibler Divergence between g and f
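
For reference, the Kullback–Leibler divergence between g and f_theta in its standard form is

\[
\mathrm{KL}(g, f_\theta) \;=\; \int g(x)\,\ln\frac{g(x)}{f_\theta(x)}\,dx
\;=\; \int g(x)\ln g(x)\,dx \;-\; \int g(x)\ln f_\theta(x)\,dx ,
\]

where the first term does not involve theta.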

In the above expression, maximizing the second term amounts to minimizing the divergence, since the first term is independent of theta. So, minimizing KL(g, f) boils down to

Maximizing the second term of the KL divergence formula minimizes the distance

Here, G is unknown. Observe that the above expression, which minimizes the KL divergence, is of the form: the expectation of ln f(X) w.r.t. the distribution function G.

Always remember: whatever your model, methodology, etc., the data are always discrete. Hence, we estimate the above expectation using the sample mean,

Maximizing this for theta gives us minimum KL Divergence.

Another cool observation: the expression above, which we maximize to minimize the KL divergence, is exactly the objective of Maximum Likelihood Estimation, since it is the log-likelihood of the sample (ignoring the constant factor of 1/n).
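
In symbols, the chain of reasoning sketched above is

\[
\arg\min_\theta \mathrm{KL}(g, f_\theta)
\;=\; \arg\max_\theta \mathbb{E}_G\big[\ln f_\theta(X)\big]
\;\approx\; \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n} \ln f_\theta(X_i),
\]

and the right-hand side is exactly the maximum likelihood estimator.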

Above, we somehow bypassed the fact that we are minimizing a distance between discrete data and a continuous density. But usually that is not possible. For example, suppose we choose the squared Hellinger distance,

The last expression comes from the fact that the integral of density functions over R is 1
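
The squared Hellinger distance referred to here is usually written as follows (up to a conventional factor of 1/2, which does not affect the argument):

\[
H^{2}(g, f_\theta) \;=\; \int \Big(\sqrt{g(x)} - \sqrt{f_\theta(x)}\Big)^{2}\,dx
\;=\; 2 \;-\; 2\int \sqrt{g(x)\,f_\theta(x)}\,dx ,
\]

where the second equality expands the square and uses the unit-integral property mentioned above.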

The first question that might come to mind is why we would even bother with the Hellinger distance. A practical benefit of using it is its robustness to outliers in the data, and a theoretical benefit is its symmetric expression.

Minimizing the Hellinger distance is equivalent to

Maximizing this term in the Squared Hellinger Distance results in minimum distance between f and g

The special feature of the KL divergence was that the final objective function turned out to be an expectation. Here we cannot do that: this term cannot be reduced to a summation over the sample. So, to compute it we first need to estimate g(x) reliably from the data. Our model may be continuous, but the data are always discrete, and from them we need a continuous estimate of the density g(x). This is where Density Estimation comes into play.

We can do this estimation parametrically, but here we will focus on non-parametric estimation of g. One simple idea for estimating a density non-parametrically is to use the histogram as the estimate. If

The number of observations goes to infinity and the bin width goes to 0, the histogram converges to the density.

The above result is essentially a consequence of the Fundamental Theorem of Statistics (the Glivenko–Cantelli theorem).
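
As a quick sanity check (a minimal sketch; the simulated sample and bin count are arbitrary), NumPy’s histogram with density=True already gives a crude, piecewise-constant density estimate that integrates to 1:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)     # pretend the true density is unknown

# density=True rescales bar heights so the histogram integrates to 1,
# i.e. it becomes a valid (piecewise-constant) density estimate
heights, edges = np.histogram(data, bins=50, density=True)

print(np.sum(heights * np.diff(edges)))                # 1.0: total area under the histogram
```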

Kernel Density Estimation

So, let’s look at how Kernel Density Estimation works:

  • Take some density K(x) symmetric around 0. This is usually known as the Kernel or Window function.
  • Choose a bandwidth (Smoothing Parameter).
  • Centre the density K(x) at each observed point and take the average over all these copies.

We can write f(x) as,

The average of the values of all the kernels (one per observation), evaluated at each point
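
In symbols, this estimate (written here without a scale parameter, which is introduced below) is

\[
\hat f(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} K\big(x - X_i\big).
\]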

Visually, we can think of the above function like this

Kernel functions (Yellow) around each observation (Green), averaged at each point to give an estimate of the density f(x) (Blue)

We can improve on the above density estimate by introducing a scale parameter,

Here h is a scale parameter
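
Written out, the scaled estimator takes the standard form (the extra 1/h keeps the estimate integrating to 1):

\[
\hat f_h(x) \;=\; \frac{1}{n h}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right).
\]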

With a larger h the density estimate spreads out more, with lower peaks; a very small h makes it spikier.

The LHS shows large values of h, the RHS shows smaller values of h

The kernel function can be chosen to be the Normal (Gaussian) kernel. In that case we get

KDE with Normal Kernel
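
Explicitly, with the standard Normal density as the kernel this becomes

\[
\hat f_h(x) \;=\; \frac{1}{n h}\sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\,
\exp\!\left(-\frac{(x - X_i)^{2}}{2h^{2}}\right).
\]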

This bandwidth h plays a critical role in getting the right shape. It must be chosen as a function of the sample size.
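
Here is a minimal from-scratch sketch of such a Gaussian-kernel KDE. The default bandwidth uses Silverman’s rule of thumb, h = 1.06·σ̂·n^(−1/5), as one common way of tying h to the sample size; the post does not prescribe a particular rule, so treat this choice (and the simulated bimodal data) as illustrative:

```python
import numpy as np

def gaussian_kde(data, grid, h=None):
    """Gaussian-kernel KDE evaluated on `grid`; h defaults to Silverman's rule of thumb."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if h is None:
        h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)    # bandwidth shrinks as n grows
    u = (grid[:, None] - data[None, :]) / h            # standardized distances to each observation
    kernel_vals = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernel_vals.mean(axis=1) / h                # average of the kernels, scaled by 1/h

# Example: bimodal data, comparing a spiky, a rule-of-thumb and an over-smoothed bandwidth
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 700)])
grid = np.linspace(-5.0, 6.0, 400)

for h in (0.05, None, 2.0):
    density = gaussian_kde(data, grid, h)
    print(h, float(density.sum() * (grid[1] - grid[0])))   # each estimate integrates to roughly 1
```

Plotting these three curves reproduces the behaviour described above: the small h is spiky, while the large h is spread out with lower peaks.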

Let’s calculate the Expectation and Variance of the r.v. X following f(x)

The expectation of X following the KDE f(x) is the sample mean, which is what we want

Similarly,

This will be used further in the calculation of variance
Variance of KDE X ~ f(x)
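
Written out (a sketch assuming a symmetric, mean-zero kernel K with variance σ_K², and writing X̄ for the sample mean), drawing X from the estimate amounts to picking one of the observations uniformly at random and adding h times an independent draw from K, so

\[
\mathbb{E}[X] \;=\; \bar X ,
\qquad
\mathrm{Var}(X) \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(X_i - \bar X\big)^{2} \;+\; h^{2}\sigma_K^{2}.
\]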

So, ideally we will want h to be a function of n such that h tends to 0 as n tends to infinity, resulting in a consistent estimator of the variance.

One of the most widely used kernels in KDE is the Epanechnikov kernel, given by,

Epanechnikov Kernel
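
Its standard form is

\[
K(u) \;=\; \frac{3}{4}\,\big(1 - u^{2}\big)\,\mathbf{1}\{|u| \le 1\}.
\]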

Application

Kernel Density Estimation has several interesting applications. One that I worked on is Background Subtraction from videos. Below is surveillance camera footage of a man sneaking into someone’s backyard to steal something from the car parked there.

Surveillance Camera Video

Using a suitable threshold and a KDE on each pixel of the video gives us the following segmentation,

KDE + Thresholding based segmentation
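
The per-pixel idea can be sketched as follows; this is a simplified, hypothetical illustration in the spirit of KDE-based background subtraction, not the exact code from the repository linked below. For each pixel, keep a buffer of its recent grayscale values, evaluate a Gaussian-kernel KDE of the current value over that buffer, and flag the pixel as foreground when the estimated density falls below a threshold:

```python
import numpy as np

def foreground_mask(history, frame, h=5.0, threshold=1e-3):
    """history: (T, H, W) stack of past grayscale frames; frame: (H, W) current frame.
    Returns a boolean mask that is True where the current pixel looks like foreground."""
    diff = (history - frame[None, :, :]) / h                        # (T, H, W) standardized deviations
    kernel_vals = np.exp(-0.5 * diff**2) / (np.sqrt(2 * np.pi) * h)
    density = kernel_vals.mean(axis=0)                              # per-pixel KDE at the current value
    return density < threshold                                      # low density => unlikely background

# Toy usage: static background around intensity 100 with a bright "moving object" patch
rng = np.random.default_rng(0)
history = 100 + rng.normal(0, 2, size=(30, 48, 64))     # 30 past frames of a 48x64 video
frame = 100 + rng.normal(0, 2, size=(48, 64))
frame[10:20, 10:20] = 200                               # object pixels
mask = foreground_mask(history, frame)
print(mask[15, 15], mask[30, 40])                       # True (object), False (background)
```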

A similar approach can be used for locating fast moving vehicles on a road.

Moving Vehicles on a Road captured by traffic cameras

A similar KDE + Thresholding based approach gives us the following output. Efficient thresholding can help identify speeding vehicles.

Multi-dimensional KDE can be used to handle colored videos too. You can read more about this work in the Background Subtraction Report, and for the code, check out My Repository.

--

Rishi Dey Chowdhury (RishiDarkDevil)

Aspiring ML & Quant Researcher. I have keen interest in solving complex real life data-driven problems. I enjoy traveling, drawing, cooking and music.