Kernel Density Estimation — A Gentle Introduction to Non-Parametric Statistics
Normality is a Myth!
Back in the 20th century, when Statistics was still in its infancy and computers were not yet widespread, it was the norm to assume that data was generated from a normal distribution. This was mostly because it made the calculations less tedious in an age when all results were computed by hand.
But with the rise of computational power, these assumptions can safely be put aside and more insights can be drawn directly from the data. The abundance of data in this Big Data era has also pushed statisticians toward more modern techniques, collectively known as Non-Parametric Statistics. Here we will discuss one such method for estimating a probability distribution: Kernel Density Estimation.
Introduction
As is common in Statistics, we begin with a statement like this:
The more we assume about the data, the further we may stray from reality, so let's make the minimal possible assumption regarding the distribution F: it is absolutely continuous, i.e. a probability density function (pdf) exists. We want to reconstruct the pdf of this unknown distribution function F.
What happens in a parametric estimation problem is that we assume (guess) a parametric form of F and estimate its parameters through statistical methods such as Maximum Likelihood Estimation, the Method of Moments, etc. We are not going to do that here. We will instead estimate the density non-parametrically.
Before we delve into Kernel Density Estimation (KDE), which estimates a density non-parametrically, let's look at an example where a seemingly non-parametric problem can be cast into a parametric inference problem, and then at an instance where non-parametric statistics and KDE play an important role. Let X1, X2, …, Xn be i.i.d. observations from the unknown distribution F,
where we want to test whether the median of F is 0 against the alternative that it is not.
We want to do this non-parametrically. The following test can intuitively serve the above purpose,
A common-sense approach to testing this null hypothesis would be to count the positive and negative observations and see how many lie in each category, i.e.
We have thus reduced the non-parametric testing problem to a parametric testing problem.
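The counting idea above is the classical sign test. Here is a minimal sketch of it (the helper name and the sample data are illustrative): under the null hypothesis each observation falls above the hypothesized median with probability 1/2, so the count of positive signs follows a Binomial(n, 1/2) distribution, which is exactly the parametric reduction described.

```python
from math import comb

def sign_test(data, m0=0.0):
    """Two-sided sign test of H0: median == m0 (illustrative helper).

    Under H0 each observation is above m0 with probability 1/2, so the
    count of positive signs is Binomial(n, 1/2): a parametric problem."""
    diffs = [x - m0 for x in data if x != m0]       # discard ties with m0
    n = len(diffs)
    s = sum(d > 0 for d in diffs)                   # number of positive signs
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    lower, upper = sum(pmf[: s + 1]), sum(pmf[s:])  # P(S <= s), P(S >= s)
    return s, min(1.0, 2 * min(lower, upper))

s, p = sign_test([-1.2, 0.4, 2.5, 3.1, -0.3, 1.8, 2.2, 0.9])
# 6 of the 8 observations are positive; p is not small, so we do not reject H0
```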
Now, let’s turn towards another example
If g were among the candidate models, then the distance between the estimated f and g would be 0 for some choice of parameter, i.e.
The above situation occurs when the modelling is perfect, which is rarely the case in real life. So even the best choice in the set of parametric functions of the form f will be close to g but not exactly equal to it. Therefore, we do the following,
This best-case parameter will often still leave the distance strictly positive.
One particular choice of distance between two density functions is the Kullback–Leibler divergence, given by

KL(g, f) = ∫ g(x) ln g(x) dx − ∫ g(x) ln f(x; θ) dx
In the above expression, maximizing the 2nd term amounts to minimizing the distance, since the 1st term does not depend on theta. So minimizing KL(g, f) boils down to maximizing ∫ ln f(x; θ) dG(x) over θ.
Here, G is unknown. Observe that the expression whose maximization minimizes the KL divergence is of the form: the expectation of ln f(X; θ) with respect to the distribution function G.
Always remember: whatever your model or methodology, data is always discrete. Hence, we need to estimate the above expectation using the sample mean (1/n) Σ ln f(Xi; θ).
Another neat observation: the above expression needs to be maximized to minimize the KL divergence, but that is exactly Maximum Likelihood Estimation, since the expression is the log-likelihood of the sample (ignoring the constant factor of 1/n).
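As a quick numerical check of this equivalence, the sketch below (the model, grid, and seed are illustrative assumptions) maximizes the sample-mean estimate of E_G[ln f(X; θ)] for a N(μ, 1) model. The maximizer coincides, up to grid resolution, with the sample mean, which is the familiar MLE for μ in this model.

```python
import math, random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(500)]   # draws from the unknown G

def avg_log_lik(mu, xs):
    """Sample-mean estimate of E_G[ln f(X; mu)] for a N(mu, 1) model f."""
    n = len(xs)
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in xs) / n

# Maximizing the estimated expectation over a grid of candidate means...
grid = [i / 100 for i in range(0, 401)]
mle_like = max(grid, key=lambda mu: avg_log_lik(mu, data))
# ...recovers (up to the 0.01 grid spacing) the sample mean, the usual MLE.
```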
Above, we somehow bypassed the fact that we are minimizing the distance between discrete data and a continuous density. Usually this is not possible. For example, suppose we choose the Squared-Hellinger distance

H²(g, f) = (1/2) ∫ (√g(x) − √f(x; θ))² dx
The first question that might come to mind is why we should bother with the Hellinger distance at all. A practical benefit is its robustness to outliers in the data; a theoretical benefit is that it is symmetric in its two arguments, unlike the KL divergence.
Expanding the square, minimizing the Hellinger distance is equivalent to maximizing ∫ √(g(x) f(x; θ)) dx.
The special feature of the KL divergence was that the final objective function turned out to be an expectation. Here we cannot do that: the objective does not reduce to a summation over the sample. So to compute it, we first need a reliable estimate of g(x) from the data. Our model may be continuous, but the data is always discrete, and from it we need a continuous estimate of the density g(x). This is where Density Estimation comes into play.
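To make the contrast concrete, here is a minimal numerical sketch (the two densities and the integration grid are illustrative): the squared Hellinger integrand needs both densities at every point of a grid, not just at the observed sample points.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def squared_hellinger(f, g, lo=-20.0, hi=20.0, steps=20000):
    """(1/2) * integral of (sqrt(f(x)) - sqrt(g(x)))^2 via the trapezoid rule.

    The integrand needs both densities on a whole grid of x values, so,
    unlike the KL objective, it cannot be rewritten as a sample average."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0   # trapezoid endpoint weights
        total += w * (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2
    return 0.5 * total * h

d = squared_hellinger(lambda x: normal_pdf(x, 0, 1), lambda x: normal_pdf(x, 1, 1))
# for equal-variance normals the closed form is 1 - exp(-(mu1 - mu2)**2 / 8)
```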
We can do this estimation parametrically, but here we will focus on non-parametric estimation of g. One idea for estimating a density non-parametrically is to use the histogram as the estimate. If
The above result is mainly due to the Fundamental Theorem of Statistics (the Glivenko–Cantelli theorem).
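A minimal sketch of the histogram-as-density idea, assuming equal-width bins (the helper name and bin count are illustrative): the estimate at a point is the fraction of observations falling in that point's bin, divided by the bin width so that the estimate integrates to 1.

```python
import random

def hist_density(data, x, bins=20):
    """Histogram density estimate at x: the fraction of observations in
    x's bin divided by the bin width, so the estimate integrates to 1.
    (Helper name and bin count are illustrative.)"""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    k = min(int((x - lo) / width), bins - 1)        # index of x's bin
    left, right = lo + k * width, lo + (k + 1) * width
    count = sum(1 for d in data if left <= d < right or (k == bins - 1 and d == hi))
    return count / (len(data) * width)

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(2000)]
est = hist_density(sample, 0.0)   # the true N(0, 1) density at 0 is about 0.3989
```

The obvious drawback, which motivates KDE, is that the estimate is a step function: it is discontinuous at every bin boundary even though the true density is smooth.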
Kernel Density Estimation
So, let’s look at how Kernel Density Estimation works:
- Take some density K(x) symmetric around 0. This is usually known as the Kernel or window function.
- Choose a bandwidth (Smoothing Parameter)
- Overlay the density K(x) at each observation and take the average of all n overlaid copies of K(x).
We can write the estimate of f(x) as the average of the kernels centred at the observations,

f̂(x) = (1/n) Σ K(x − Xi)
Visually, we can think of the above function like this
We can improve the above estimate by introducing a scale parameter, the bandwidth h, replacing K(x − Xi) with (1/h) K((x − Xi)/h):

f̂(x) = (1/nh) Σ K((x − Xi)/h)
With larger h, the density estimate spreads out more and has lower peaks; a very small h makes it spiky.
The kernel function can be chosen to be the Normal (Gaussian) kernel. In that case we get

f̂(x) = (1/nh) Σ φ((x − Xi)/h), where φ is the standard normal density.
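The recipe above fits in a few lines of code. Here is a from-scratch sketch with a Gaussian kernel (the sample, seed, and bandwidth are illustrative choices):

```python
import math, random

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """f_hat(x) = (1 / (n * h)) * sum_i K((x - X_i) / h)."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

random.seed(2)
sample = [random.gauss(0, 1) for _ in range(1000)]
est = kde(0.0, sample, h=0.3)   # the true N(0, 1) density at 0 is about 0.3989
```

Unlike the histogram, this estimate is smooth everywhere, because it is an average of smooth bumps rather than a step function.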
This bandwidth h plays a critical role in getting the shape right. It must be chosen as a function of the sample size.
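One standard way to tie h to the sample size is the normal-reference rule of thumb for a Gaussian kernel, h = 1.06 σ̂ n^(−1/5) (a well-known default, sketched below; the function name and samples are illustrative):

```python
import random, statistics

def rule_of_thumb_bandwidth(data):
    """Normal-reference rule of thumb: h = 1.06 * sigma_hat * n ** (-1/5).

    h shrinks like n ** (-1/5), so h -> 0 while n * h -> infinity as n
    grows, which is what pointwise consistency of the KDE requires."""
    return 1.06 * statistics.stdev(data) * len(data) ** (-0.2)

random.seed(3)
h_small_sample = rule_of_thumb_bandwidth([random.gauss(0, 1) for _ in range(100)])
h_large_sample = rule_of_thumb_bandwidth([random.gauss(0, 1) for _ in range(10000)])
# the rule automatically shrinks the bandwidth as the sample size grows
```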
Let's calculate the expectation and variance of a r.v. X following the estimated density f̂(x)
Similarly,
So, ideally we want h to be a function of n such that h tends to 0 as n tends to infinity, which makes this a consistent estimator of the variance.
The most commonly used kernel in KDE is the Epanechnikov kernel, given by

K(u) = (3/4)(1 − u²) for |u| ≤ 1, and 0 otherwise.
Application
Kernel Density Estimation has several interesting applications. One that I worked on is Background Subtraction from videos. Below is surveillance-camera footage of a man sneaking into someone's backyard to steal something from the car parked there.
Using a suitable threshold and a KDE on each pixel of the video gives us the following segmentation,
A similar approach can be used for locating fast moving vehicles on a road.
Multi-dimensional KDE can handle colour videos too. You can read more about this work in the Background Subtraction Report, and for the code check out My Repository.
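The per-pixel idea can be sketched in a few lines. This is an illustration of the general approach only, not the linked repository's actual code; the bandwidth, threshold, and intensity values are assumptions. Each pixel keeps a history of recent intensities, a KDE is built from that history, and the pixel is flagged as foreground when the current intensity has low estimated density under the background model:

```python
import math

def pixel_is_foreground(value, history, h=10.0, threshold=0.0005):
    """Flag a pixel as foreground when the KDE of its recent intensity
    history assigns low density to the current value.

    Sketch of the general idea only (bandwidth and threshold are
    illustrative), not the linked repository's code."""
    density = sum(
        math.exp(-0.5 * ((value - v) / h) ** 2) / (h * math.sqrt(2 * math.pi))
        for v in history
    ) / len(history)
    return density < threshold

history = [100, 102, 99, 101, 100, 103, 98, 100]    # a stable background pixel
moving_object = pixel_is_foreground(200, history)    # far from the background model
still_background = pixel_is_foreground(101, history) # consistent with it
```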