Kernel Density Estimation with Python using Sklearn
Kernel Density Estimation, often referred to as KDE, is a technique that lets you create a smooth curve given a set of data. So first, let’s figure out what density estimation is.
In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. In layman’s terms, density estimation refers to mapping data points to a curve or function that best represents the data.
In the above animation, the red lines show the data points and the dashed line shows the corresponding density function estimated using some technique. The technique in our case is KDE, which we will discuss now.
Kernel Density Estimation
The simplest non-parametric density estimator is a histogram: divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into that bin.
In a more general mathematical sense, a histogram is a function m_i that counts the number of observations falling into each of a set of disjoint categories (known as bins); the graph of a histogram is merely one way to represent it.
Problems with histograms
- Not smooth
- Depends on the endpoints of the bins
- Depends on the width of the bins
Kernel density estimates are closely related to histograms but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators, using these 6 data points:
Values: -2.1, -1.3, -0.4, 1.9, 5.1, 6.2
For the histogram, first the horizontal axis is divided into sub-intervals, or bins, that cover the range of the data. In this case we have 6 bins, each of width 2. Whenever a data point falls inside a bin, we place a box of height 1/12 there (each box then has area 2 × 1/12 = 1/6, so the six boxes together integrate to 1). If more than one data point falls inside the same bin, we stack the boxes on top of each other.
For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points xi. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.[6]
Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis.
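The histogram half of this construction can be reproduced in a few lines with NumPy. The bin edges (-4 to 8) are an assumption chosen to cover the data; the text does not state them.

```python
import numpy as np

# The six data points from the text.
x_i = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])

# Six bins of width 2; with density=True each point contributes a box of
# height 1/12 = 1 / (6 points * bin width 2), so the bars integrate to 1.
counts, edges = np.histogram(x_i, bins=6, range=(-4, 8), density=True)
print(counts * 12)  # boxes stacked per bin: [1. 2. 1. 0. 1. 1.]
```

Multiplying the normalised bar heights by 12 recovers the raw counts, confirming that two points stack in the second bin and one bin stays empty.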
To remove the dependence on the endpoints of the bins, kernel estimators center a kernel function at each data point. And if we use a smooth kernel function for our building block, then we will have a smooth density estimate. This way we have eliminated two of the problems associated with histograms. The problem of bin-width still remains which is tackled using a technique discussed later on.
More formally, kernel estimators smooth out the contribution of each observed data point over a local neighborhood of that data point. The contribution of data point x(i) to the estimate at some point x* depends on how far apart x(i) and x* are. The extent of this contribution depends on the shape of the kernel function adopted and the width (bandwidth) accorded to it. If we denote the kernel function by K and its bandwidth by h, the estimated density at any point x is

f̂(x) = (1 / nh) · Σᵢ K((x − x(i)) / h)

where the sum runs over all n observed data points x(i).
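This estimator can be implemented directly on the six points from earlier. The sketch below (not the article’s code) uses a Gaussian kernel and a bandwidth h = 1.5, so the kernel variance is h² = 2.25, matching the figure above:

```python
import numpy as np

def kde(x_star, data, h):
    """Estimated density at x*: the average of K((x* - x(i)) / h) / h."""
    u = (x_star - data) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel K
    return k.sum() / (len(data) * h)

x_i = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
h = 1.5  # bandwidth; kernel variance h**2 = 2.25

# The estimate is higher near the cluster of points around -1 than in
# the gap between 1.9 and 5.1.
print(kde(-1.0, x_i, h), kde(3.5, x_i, h))
```

Evaluating `kde` over a grid of x values traces out the solid blue curve from the comparison figure.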
In layman’s terms, the KDE is calculated by weighting the distances of all the data points from each query point. If we’ve seen more points nearby, the estimate is higher, indicating that the probability of seeing a point at that location is higher.
Changing the bandwidth changes the shape of the kernel: a lower bandwidth means only points very close to the current position are given any weight, which makes the estimate look squiggly; a higher bandwidth means a shallower kernel, where even distant points can contribute.
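This effect is easy to see with scikit-learn’s KernelDensity on the six points from earlier; the two bandwidth values here are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

x_i = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])[:, None]
grid = np.linspace(-7, 11, 500)[:, None]

peak_counts = {}
for h in (0.3, 3.0):
    kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(x_i)
    density = np.exp(kde.score_samples(grid))  # score_samples returns log-density
    # Count local maxima: a squiggly (low-bandwidth) estimate has more of them.
    interior = density[1:-1]
    peak_counts[h] = int(np.sum((interior > density[:-2]) & (interior > density[2:])))
print(peak_counts)
```

At h = 0.3 each point gets its own bump, while at h = 3.0 the bumps merge into one broad curve.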
There are various types of kernels you can use for density estimation; scikit-learn’s KernelDensity supports ‘gaussian’, ‘tophat’, ‘epanechnikov’, ‘exponential’, ‘linear’ and ‘cosine’ kernels.
Python code using Sklearn
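The listing below is a sketch assembled from the explanation that follows; the exact dataset (scikit-learn’s bundled handwritten digits via `load_digits`), the number of PCA components (15), and the bandwidth grid are assumptions not stated in the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Load the 8x8 handwritten digits dataset (1797 samples, 64 features).
digits = load_digits()

# Reduce the 64 features to 15 principal components; whitening rescales
# each component to unit variance so a single bandwidth suits them all.
pca = PCA(n_components=15, whiten=True)
data = pca.fit_transform(digits.data)

# Select the bandwidth by grid search cross-validation; the score being
# maximized is the log-likelihood of the held-out data under the KDE.
params = {"bandwidth": np.logspace(-1, 1, 10)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
print("best bandwidth:", grid.best_estimator_.bandwidth)

# Draw 48 new points in PCA space and project them back to 64 pixels.
kde = grid.best_estimator_
new_data = kde.sample(48, random_state=0)
new_data = pca.inverse_transform(new_data)

# Plot the generated digits on a 4x12 grid.
fig, ax = plt.subplots(4, 12, subplot_kw=dict(xticks=[], yticks=[]))
for i, axi in enumerate(ax.flat):
    axi.imshow(new_data[i].reshape(8, 8), cmap="binary")
plt.show()
```

Note that `sample` draws from the fitted density itself, so the generated images are new points from the estimated distribution rather than copies of training digits.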
Explanation of the code
Here we use the handwritten digits dataset and sample new points from the estimated distribution. We use PCA (principal component analysis) to reduce the number of features. To select the bandwidth parameter, we use a technique called grid search cross-validation. Finally, we generate 48 new samples from the estimated density and plot them.