Kernel Density Estimation with Python using Sklearn

Vishal Bidawatka
Intel Student Ambassadors
5 min read · Aug 14, 2019

Kernel Density Estimation, often referred to as KDE, is a technique that lets you create a smooth curve from a set of data. So first, let’s figure out what density estimation is.

In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function. In layman’s terms, density estimation refers to mapping data points to a curve or function that best represents the data.

Example of density estimation

In the above animation, the red lines show the observed data points and the dashed line shows the corresponding estimated density function. The technique used to construct it is KDE, which we will discuss now.

Kernel Density Estimation

The simplest non-parametric density estimator is the histogram: divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into that bin.

Histogram

In a more general mathematical sense, a histogram is a function m_i that counts the number of observations that fall into each of the disjoint categories (known as bins); the familiar graph is merely one way of representing that function.
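As a quick sketch of this counting function, NumPy’s histogram routine computes exactly the m_i values (the sample and bin edges below are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical sample, purely for illustration
data = np.array([1.2, 1.9, 2.3, 3.7, 4.1, 4.4, 4.8, 6.0])

# Disjoint bins of width 1 covering the range of the data
bin_edges = np.arange(1, 8)

# counts[i] plays the role of m_i: the number of observations in bin i
counts, edges = np.histogram(data, bins=bin_edges)
print(counts)  # [2 1 1 3 0 1]
```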

Problems with histogram

  1. It is not smooth
  2. It depends on the endpoints of the bins
  3. It depends on the width of the bins

Kernel density estimates are closely related to histograms but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators, using these 6 data points:

Values: -2.1, -1.3, -0.4, 1.9, 5.1, 6.2

For the histogram, first, the horizontal axis is divided into sub-intervals or bins which cover the range of the data. In this case, we have 6 bins, each of width 2. Whenever a data point falls inside a bin, we place a box of height 1/12 (1 divided by the 6 data points times the bin width of 2, so that the total area is 1). If more than one data point falls inside the same bin, we stack the boxes on top of each other.

For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points x_i. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.

Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, and the kernel density estimate is the blue curve. The data points are the rug plot on the horizontal axis.

Comparison of histogram and kernel function.
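This comparison can be reproduced with a short sketch using NumPy, SciPy, and Matplotlib (the bin edges below are an assumption; any edges covering the data would do):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# The six data points from the example above
x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])

# Histogram: bins of width 2 covering the range of the data
bins = np.arange(-4, 9, 2)
grid = np.linspace(-7, 11, 500)

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
left.hist(x, bins=bins, density=True, edgecolor="black")
left.set_title("Histogram")

# KDE: a normal kernel with variance 2.25 (std 1.5) on each point,
# each scaled by 1/n; their sum is the kernel density estimate
h = 1.5
kernels = np.array([norm.pdf(grid, loc=xi, scale=h) for xi in x]) / len(x)
right.plot(grid, kernels.T, "r--", linewidth=1)
right.plot(grid, kernels.sum(axis=0), "b-", linewidth=2)
right.set_title("Kernel density estimate")
plt.show()
```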

To remove the dependence on the endpoints of the bins, kernel estimators center a kernel function at each data point. If we use a smooth kernel function as our building block, we get a smooth density estimate. This eliminates two of the problems associated with histograms. The problem of bin width still remains, and it is tackled by the bandwidth-selection technique discussed later on.

More formally, kernel estimators smooth out the contribution of each observed data point over a local neighborhood of that data point. The contribution of data point x(i) to the estimate at some point x* depends on how far apart x(i) and x* are. The extent of this contribution depends on the shape of the kernel function adopted and the width (bandwidth) accorded to it. If we denote the kernel function as K and its bandwidth by h, the estimated density at any point x is
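Written out, for n observed points x(1), …, x(n), this is the standard kernel density estimator:

$$\hat{f}_h(x^*) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x^* - x^{(i)}}{h}\right)$$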

In layman’s terms, the KDE is calculated by weighting the distances to all the data points we’ve seen. If more points lie nearby, the estimate is higher, indicating a higher probability of seeing a point at that location.

Changing the bandwidth changes the shape of the kernel: a lower bandwidth means only points very close to the current position are given any weight, which makes the estimate look squiggly; a higher bandwidth means a wider, shallower kernel in which distant points also contribute, smoothing the estimate out.

Very low bandwidth
Very high bandwidth
Optimal bandwidth
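This effect can be seen with scikit-learn’s KernelDensity by fitting the same points with a very low, a moderate, and a very high bandwidth (the sample and the bandwidth values below are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# Illustrative one-dimensional sample (the same six points as above)
x = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])[:, np.newaxis]
grid = np.linspace(-7, 11, 500)[:, np.newaxis]

# Very low, moderate, and very high bandwidths for comparison
for bandwidth in [0.1, 1.5, 5.0]:
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(x)
    density = np.exp(kde.score_samples(grid))  # score_samples returns log-density
    plt.plot(grid[:, 0], density, label="bandwidth = %.1f" % bandwidth)

plt.legend()
plt.show()
```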

There are various types of kernels you can use for density estimation.

Types of kernel
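Several of these are available in scikit-learn’s KernelDensity. As a quick sketch, fitting a single point at the origin and evaluating the density traces out each kernel’s shape:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# Kernels supported by sklearn's KernelDensity
kernels = ["gaussian", "tophat", "epanechnikov",
           "exponential", "linear", "cosine"]

grid = np.linspace(-3, 3, 500)[:, np.newaxis]
origin = np.zeros((1, 1))  # a single data point at 0

for kernel in kernels:
    # With one data point, the estimated density is just the kernel itself
    kde = KernelDensity(kernel=kernel, bandwidth=1.0).fit(origin)
    plt.plot(grid[:, 0], np.exp(kde.score_samples(grid)), label=kernel)

plt.legend()
plt.show()
```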

Python code using Sklearn
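Below is a minimal sketch of the workflow described in the explanation that follows, assuming scikit-learn’s load_digits dataset, PCA, GridSearchCV, and KernelDensity; the number of PCA components, the bandwidth grid, and the plot layout are illustrative choices, not necessarily those used in the linked repository.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Load the 8x8 handwritten digits dataset
digits = load_digits()

# Reduce the 64 pixel features with PCA (15 components is an assumption)
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)

# Select the bandwidth with grid search cross-validation
params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
print("best bandwidth:", grid.best_estimator_.bandwidth)

# Use the best estimator as the density model
kde = grid.best_estimator_

# Sample 48 new points from the estimated density and map them
# back to the original 64-dimensional pixel space
new_data = kde.sample(48, random_state=0)
new_data = pca.inverse_transform(new_data)

# Plot the 48 generated "digits" on a 4 x 12 grid
fig, ax = plt.subplots(4, 12, subplot_kw=dict(xticks=[], yticks=[]))
for i, axi in enumerate(ax.flat):
    axi.imshow(new_data[i].reshape(8, 8), cmap="binary")
plt.show()
```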

Explanation of the code

Here we use the handwritten digits dataset and fit a kernel density estimate so that we can sample new points from the estimated distribution. We first apply PCA (principal component analysis) to reduce the number of features. The bandwidth parameter is selected using grid search cross-validation. In the end, we generate 48 new samples from the estimate and plot them.

48 sampled points

Link to Github repository.
