
Kernel Density Estimation

Kernel Construction and Bandwidth Optimization using Maximum Likelihood Cross Validation

Niranjan Pramanik, Ph.D.
Published in Analytics Vidhya
7 min read · Sep 24, 2019

In this article, the fundamentals of kernel functions and their use in kernel density estimation are explained in detail with an example. The Gaussian kernel is used for density estimation and bandwidth optimization. The maximum likelihood cross-validation method for bandwidth optimization is explained step by step. All computations are coded in R from scratch, and the code is provided in the last section of the article.

What Is a Kernel?

A kernel is simply a function that satisfies the three properties listed below. Kernel functions are used to estimate the density of random variables and as weighting functions in non-parametric regression. They are also used in machine learning, where kernel methods perform classification and clustering.

1. The first property of a kernel function is that it must be symmetrical. This means the values of the kernel function are the same for both +u and −u, as shown in the plot below. Mathematically, this is expressed as K(−u) = K(+u). The symmetry of the kernel function places the maximum value of the function, max(K(u)), at the middle of the curve.

2. The area under the curve of the function must be equal to one. Mathematically, this property is expressed as ∫ K(u) du = 1, with the integral taken over −∞ < u < ∞.

Gaussian density function is used as a kernel function because the area under Gaussian density curve is one and it is symmetrical too.

3. The value of the kernel function, which is a density, cannot be negative: K(u) ≥ 0 for all −∞ < u < ∞.
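These three properties are easy to verify numerically for the Gaussian kernel. A minimal sketch in R (the function name K is my own):

```r
# Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2 * pi)
K <- function(u) exp(-0.5 * u^2) / sqrt(2 * pi)

u <- seq(-10, 10, by = 0.01)

all.equal(K(-u), K(u))           # symmetry: K(-u) = K(+u) -> TRUE
integrate(K, -Inf, Inf)$value    # area under the curve -> 1
all(K(u) >= 0)                   # non-negativity -> TRUE
```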

Construct Kernels

In this note, I am going to use the Gaussian kernel function to estimate kernel density and to optimize the bandwidth using example data sets. The equation for the Gaussian kernel is:

K((x − xi)/h) = (1/√(2π)) · exp(−(x − xi)² / (2h²))

where xi is the observed data point, x is the value at which the kernel function is computed, and h is called the bandwidth.

Example

Let’s say we have the marks obtained by six students in a subject. I am going to construct a kernel at each data point using the Gaussian kernel function.

xi = {65, 75, 67, 79, 81, 91}

x1 = 65, x2 = 75 … x6 = 91.

Three inputs are required to develop a kernel curve around a data point. They are:

i. The observation data point which is xi

ii. The value of h

iii. A linearly spaced series of points that covers the observed data points and at which the K values are estimated: Xj = {50, 51, 52, …, 99}

The calculation of K values over all Xj for given values of xi and h is shown in the table below, where xi = 65 and h = 5.5.

Xj and K are plotted below to visualize the kernel.

Similarly, kernel values are estimated at all six observed data points, as shown in the table and plotted below.

It can be observed from the table that the value of the kernel function is nearly 0 for Xj values that are far from xi. For instance, the kernel value at Xj = 99 is practically zero when xi = 65.
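This behaviour is easy to reproduce. A short sketch in R for the kernel around xi = 65 with h = 5.5 (variable names are my own):

```r
xi <- 65; h <- 5.5
Xj <- 50:99                             # linearly spaced evaluation points
u  <- (Xj - xi) / h
K  <- exp(-0.5 * u^2) / sqrt(2 * pi)    # Gaussian kernel values

K[Xj == 65]   # maximum of the kernel, 1/sqrt(2*pi) ~ 0.3989, at xi itself
K[Xj == 99]   # ~2e-9, effectively zero far away from xi
```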

Kernel Density Estimation (KDE)

So far, we have discussed computing individual kernels over the data points. Now, composite density values are calculated for the whole data set. The density is estimated simply by adding the kernel values (K) from all kernels at each Xj. With reference to the table above, the KDE for the whole data set is obtained by adding all the row values. The sum is then normalized by dividing by n·h, where n is the number of data points (six in this example); this normalization brings the area under the KDE curve to one. Therefore, the equation to calculate the KDE at every Xj is:

f̂(Xj) = (1/(n·h)) Σ K((Xj − xi)/h), with the sum running over i = 1, …, n

where n is the number of data points and h is the bandwidth. The KDE obtained after adding all six normalized kernels is shown in the plot below.
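The sum can be written compactly; a sketch of the equation above applied to the six marks (names are my own):

```r
x_obs <- c(65, 75, 67, 79, 81, 91)   # the six marks
h     <- 5.5
n     <- length(x_obs)
Xj    <- 50:99

# f_hat(Xj) = 1/(n*h) * sum_i K((Xj - x_i)/h)
kde <- sapply(Xj, function(x)
  sum(exp(-0.5 * ((x - x_obs) / h)^2) / sqrt(2 * pi)) / (n * h))

sum(kde)   # area under the KDE (grid spacing = 1): close to 1
```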

Bandwidth Optimization

The bandwidth (h) of a kernel function plays an important role in fitting the data appropriately. A low value of the bandwidth estimates the density with a lot of variance, whereas a high value of h produces a large bias. Therefore, estimating an optimal value of h is very important to build the most meaningful and accurate density. As shown in the plot below, three different bandwidth values produce three different curves. The black one shows a lot of variation in the density values, which does not look realistic, whereas the purple one hides information and fails to capture the actual density.

Researchers have proposed several methods to optimize the value of the bandwidth in kernel density estimation. One of them is the maximum likelihood cross-validation method. In this article, this method is explained step by step and applied to an example data set.

Maximum Likelihood Cross-Validation (MLCV)

This method was proposed by Habbema, Hermans, and Van den Broek (1974) and by Duin (1976). In this method, the kernel function is estimated on a subset of Xj following a leave-one-out cross-validation approach. The objective function to be maximized is:

MLCV(h) = (1/n) Σi [ log( Σj K((xi − Xj)/h) ) − log((n − 1)·h) ], where the inner sum runs over all Xj excluding xi

The equation looks complex, but it becomes simple once each term is explained in detail with an example.

Term 1:

The values of the kernel function, as discussed in the earlier section, are computed. Here, for the single observation x1, K values are calculated for a certain h over the range of Xj, excluding x1 from Xj. The K values are summed, and finally the logarithm of the sum is calculated.

Example

Consider h = 3 (h has to be optimized; h = 3 is taken just to explain how the optimization function is calculated).

xi = {65, 75, 67, 79, 75, 63, 71, 83, 91, 95}

The K values for Xj = xi are set to zero to make sure they are excluded when the sum is taken.

Term 2:

In this example, n = 10 and h = 3, so we can easily estimate the term as:

log((n − 1)·h) = log(9 × 3) = log(27) ≈ 3.2958

The final value of the objective function (MLCV) is calculated by taking the mean of the differences obtained by subtracting Term 2 from Term 1 as shown in the table below.

The value of the objective function, i.e. MLCV, is −2.3438539 for bandwidth h = 3. The same process is repeated for different values of h so that the MLCV value approaches a finite maximum, which yields the optimal h. The golden-section search algorithm of the optimize function in R is used to maximize the MLCV function.
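The reported value can be checked with a few lines of R that mirror the table construction above, i.e. the grid 50–100 with the grid point equal to each xi zeroed out (a sketch; variable names are my own):

```r
x  <- sort(c(65, 75, 67, 79, 75, 63, 71, 83, 91, 95))
Xj <- 50:100
n  <- length(x)
h  <- 3

u <- outer(Xj, x, "-")                      # grid minus each data point
K <- exp(-0.5 * (u / h)^2) / sqrt(2 * pi)   # kernel values, one column per xi
K[outer(Xj, x, "==")] <- 0                  # exclude the grid point equal to xi

term1 <- log(colSums(K))                    # Term 1, one value per data point
term2 <- log((n - 1) * h)                   # Term 2
mean(term1 - term2)                         # MLCV(h = 3), about -2.3439
```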

The optimized value of h in this example is 6.16 and the value of the objective function is -2.295783.

The KDE is estimated and plotted using the optimized bandwidth (h = 6.16) and compared with the KDE obtained using the density function in R. As shown in the plot below, the KDE with the optimized h is pretty close to the KDE plotted using R’s density function.

R-Code

KDE

#KDE with Gaussian kernels (h = 5.5)
data <- c(65, 75, 67, 79, 81, 91)
h <- 5.5
kernelpoints <- seq(50, 150, 1)   # evaluation grid Xj
plot(NA, NA, xlim = c(50, 120), ylim = c(0, 0.04), xlab = 'X', ylab = 'K (= density)')
kde <- NULL
for (i in 1:length(data)) {
  z <- (kernelpoints - data[i]) / h                                # u = (Xj - xi)/h
  kerneld <- exp(-0.5 * z^2) / (sqrt(2 * pi) * length(data) * h)   # K(u)/(n*h)
  lines(kernelpoints, kerneld, lwd = 3)      # one scaled kernel per data point
  kde <- cbind(kde, kerneld)
}
kde_sum <- rowSums(kde)                      # KDE = sum of the scaled kernels
lines(kernelpoints, kde_sum, lwd = 3, col = 'red')
grid(20, 20)

MLCV — Bandwidth optimization

#MLCV bandwidth optimization
x <- c(65, 75, 67, 79, 75, 63, 71, 83, 91, 95)
y <- seq(50, 100, 1)               # evaluation grid Xj
x <- sort(x)
n <- length(x)
u <- outer(y, x, "-")              # (Xj - xi) for every grid point and data point
fmlcv <- function(h) {
  D <- exp(-0.5 * (u / h)^2) / sqrt(2 * pi)   # Gaussian kernel values
  colnames(D) <- x
  rownames(D) <- y
  for (i in 1:n) {                 # zero the grid point equal to xi (leave-one-out)
    kk <- which(rownames(D) == colnames(D)[i])
    D[kk, i] <- 0
  }
  D1 <- log(colSums(D))            # Term 1, one value per data point
  D2 <- log((n - 1) * h)           # Term 2
  mean(D1 - D2)                    # MLCV(h)
}
lower <- 1
upper <- 10
tol <- 0.01
#Using the optimize function (golden-section search) to maximize MLCV
obj <- optimize(fmlcv, c(lower, upper), tol = tol, maximum = TRUE)
print(obj)
plot(density(x), lwd = 4, col = 'purple')    #KDE from the R density function
lines(density(x, bw = obj$maximum), lwd = 2, col = 'red', lty = 1)
legend("topright", legend = c("KDE from R", "KDE from MLCV"),
       col = c("purple", "red"), lty = 1, cex = 0.8, lwd = c(4, 2), text.font = 4)
print(obj$maximum)

