**Kernel Density Estimation**

## Kernel Construction and Bandwidth Optimization using Maximum Likelihood Cross Validation

In this article, the fundamentals of kernel functions and their use in kernel density estimation are explained in detail with an example. A Gaussian kernel is used for density estimation and bandwidth optimization. The maximum likelihood cross-validation method for bandwidth optimization is explained step by step. All computations are coded in R from scratch, and the code is provided in the last section of the article.

**What Is a Kernel?**

A kernel is simply a function that satisfies the three properties listed below. Kernel functions are used to estimate the density of random variables and as weighting functions in non-parametric regression. They are also used in machine learning, as kernel methods, to perform classification and clustering.

1. The first property of a kernel function is that **it must be symmetrical**: the function takes the same value at +u and −u, as shown in the plot below. This can be expressed mathematically as K(−u) = K(+u). Symmetry places the maximum of the function, max(K(u)), at the middle of the curve.

2. The **area under the curve of the function must be equal to one**. Mathematically, this property is expressed as

∫ K(u) du = 1, with the integral taken over −∞ < u < ∞.

The Gaussian density function is commonly used as a kernel function because the area under the Gaussian density curve is one, and it is symmetrical too.

3. The value of the kernel function, which is a density, **cannot be negative**: K(u) ≥ 0 for all −∞ < u < ∞.
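These three properties are easy to verify numerically. Below is a small standalone Python sketch (the article's code is in R; this check is purely illustrative) confirming symmetry, unit area, and non-negativity for the Gaussian kernel:

```python
import math

# Gaussian kernel: K(u) = (1/sqrt(2*pi)) * exp(-u^2/2)
def K(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

us = [i / 100 for i in range(-600, 601)]     # grid from -6 to +6, step 0.01

# 1. Symmetry: K(-u) = K(+u), so the maximum sits in the middle (u = 0)
assert all(math.isclose(K(-u), K(u)) for u in us)

# 2. Unit area: the integral of K over the real line is 1 (trapezoidal rule)
area = sum(0.01 * (K(a) + K(b)) / 2 for a, b in zip(us, us[1:]))
print(round(area, 4))                        # ~1.0

# 3. Non-negativity: K(u) >= 0 everywhere
assert min(K(u) for u in us) >= 0
```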

# Construct Kernels

In this note, I am going to use the Gaussian kernel function to estimate kernel density and to optimize the bandwidth using example data sets. The equation for the Gaussian kernel is:

K((x − xi)/h) = (1/√(2π)) · exp(−(x − xi)² / (2h²))

where **xi** is the observed data point, **x** is the value at which the kernel function is computed, and **h** is called the bandwidth.

## Example

Let’s say we have the marks obtained by six students in a subject. I am going to construct a kernel at each data point using the Gaussian kernel function.

**xi** = {65, 75, 67, 79, 81, 91}

**x1** = 65, **x2** = 75, …, **x6** = 91.

Three inputs are required to develop a kernel curve around a data point. They are:

i. The observed data point, *xi*

ii. The value of *h*

iii. A linearly spaced series of data points that spans the observed data points and at which the *K* values are estimated: **Xj** = {50, 51, 52, …, 99}

The calculation of **K** values for all values of **Xj**, for given values of **xi** and **h**, is shown in the table below, where **xi** = 65 and **h** = 5.5.

**Xj** and **K** are plotted below to visualize the kernel.

Similarly, kernel values are estimated at all six observed data points, as shown in the table and plotted below.
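As a cross-check of the table, the single kernel at **xi** = 65 with **h** = 5.5 can be sketched in a few lines of Python (an illustrative translation of the R code in the last section, using the 50–99 grid from the example):

```python
import math

xi, h = 65, 5.5
Xj = list(range(50, 100))            # linearly spaced grid, 50..99

# Kernel values K((Xj - xi)/h) at every grid point
K = [math.exp(-0.5 * ((x - xi) / h) ** 2) / math.sqrt(2 * math.pi) for x in Xj]

# The symmetric kernel peaks exactly at the observation xi
peak = Xj[K.index(max(K))]
print(peak)                          # 65

# Far from xi the kernel value is effectively zero, e.g. at Xj = 99
print(round(K[Xj.index(99)], 6))     # 0.0
```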

It is observed from the table that the value of the kernel function is nearly 0 for **Xj** values that are far from **xi**. For instance, the kernel density value at **Xj** = 99 is practically zero when **xi** = 65.

**Kernel Density Estimation (KDE)**

So far we have discussed computing individual kernels over the data points. Now, composite density values are calculated for the whole data set. The *KDE* at each **Xj** is estimated simply by adding the kernel values (*K*) from all six kernels; with reference to the table above, the *KDE* for the whole data set is obtained by adding all the values in each row. The sum is then normalized by dividing by the number of data points (six in this example) times the bandwidth. Normalization is done to bring the area under the *KDE* curve to one. Therefore, the equation to calculate the *KDE* for every **Xj** is expressed as:

KDE(Xj) = (1/(n·h)) · Σᵢ K((Xj − xi)/h)

where *n* is the number of data points. The **KDE** after adding all six normalized kernels is shown below in the plot.
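The row-sum-and-normalize step can be sketched as follows; this is an illustrative Python version of the R code in the last section, using the same six marks and **h** = 5.5:

```python
import math

data = [65, 75, 67, 79, 81, 91]      # the six marks
h = 5.5
n = len(data)
Xj = list(range(50, 151))            # evaluation grid, 50..150

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

# KDE(Xj) = (1/(n*h)) * sum_i K((Xj - xi)/h):
# add the six kernels at each grid point, then normalize by n*h
kde = [sum(gauss((x - xi) / h) for xi in data) / (n * h) for x in Xj]

# Normalization brings the area under the KDE curve close to one
# (trapezoidal rule; grid step is 1)
area = sum((a + b) / 2 for a, b in zip(kde, kde[1:]))
print(round(area, 2))                # ~1.0
```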

**Bandwidth Optimization**

The bandwidth (**h**) of a kernel function plays an important role in fitting the data appropriately. A low bandwidth estimates a density with a lot of variance, whereas a high value of **h** produces a large bias. Therefore, estimating an optimal value of **h** is very important for building the most meaningful and accurate density. As shown in the plot below, three different bandwidth values produce three different curves. The black one shows a lot of variation in the density values, which does not look realistic, whereas the purple one fails to explain the actual density by hiding information.

There are several methods proposed by researchers to optimize the bandwidth in kernel density estimation. One of those is the **maximum likelihood cross-validation** method. In this article, this method is explained step by step and applied to an example data set.

**Maximum likelihood cross-validation (MLCV)**

This method was proposed by Habbema, Hermans, and van den Broek (1974) and by Duin (1976). In this method, the kernel function is evaluated on a subset of **Xj** based on a **leave-one-out cross-validation approach**. The objective function to maximize is expressed as:

MLCV(h) = (1/n) · Σᵢ [ log Σ K((Xj − xi)/h) − log((n − 1)·h) ], with the inner sum taken over Xj ≠ xi

The equation looks complex, but it becomes easy once each term is explained in detail with an example.

**Term 1:**

The values of the kernel function are computed as discussed in the earlier section. Here, for the single observation *x1*, the **K** values are calculated for a given **h** over the range of **Xj**, excluding the point where **Xj** equals *x1*. The **K** values are added, and finally the logarithm of the sum is calculated:

Term 1 = log( Σ K((Xj − x1)/h) ), summed over all Xj ≠ x1

## Example

Consider **h** = 3 (**h** has to be optimized; I have taken **h** = 3 just to explain how the optimization function is calculated).

**xi** = {65, 75, 67, 79, 75, 63, 71, 83, 91, 95}

The **K** values for **Xj** = **xi** are set to zero to make sure that they are excluded from the sum.
*Xi***Term 2:**

In this example, **n** = 10 and **h** = 3, so the term is easily evaluated:

log((n − 1)·h) = log(9 × 3) = log(27) ≈ 3.2958

The final value of the objective function (MLCV) is calculated by taking the mean of the differences obtained by subtracting Term 2 from Term 1, as shown in the table below.
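The two terms can be combined into one small function. The Python sketch below mirrors the article's grid-based leave-one-out formulation (summing over **Xj** = 50…100 while excluding the grid point equal to **xi**); the function names are illustrative:

```python
import math

x = [65, 75, 67, 79, 75, 63, 71, 83, 91, 95]
grid = list(range(50, 101))          # Xj = 50..100, the evaluation grid
n = len(x)                           # n = 10

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def mlcv(h):
    total = 0.0
    for xi in x:
        # Term 1: log of the kernel sum over the grid, excluding Xj == xi
        term1 = math.log(sum(gauss((g - xi) / h) for g in grid if g != xi))
        # Term 2: log((n - 1) * h)
        term2 = math.log((n - 1) * h)
        total += term1 - term2
    return total / n                 # mean of the differences

print(round(mlcv(3), 4))             # ~ -2.3439
```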

The value of the objective function, i.e. MLCV, is −2.3438539 for bandwidth **h** = 3. The same process is repeated with different values of **h** so that the MLCV value approaches its maximum, which gives the optimal **h**. The golden-section search algorithm in R's *optimize* function is used to maximize the MLCV function.
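For readers outside R, the golden-section search that *optimize* performs can be sketched directly. The snippet below is an illustrative Python version (`golden_max` is a hypothetical helper written here, not a library function) applied to the same MLCV objective:

```python
import math

x = [65, 75, 67, 79, 75, 63, 71, 83, 91, 95]
grid = list(range(50, 101))          # Xj = 50..100, as in the R code
n = len(x)

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def mlcv(h):
    # Mean over data points of (Term 1 - Term 2), grid-based leave-one-out
    total = 0.0
    for xi in x:
        s = sum(gauss((g - xi) / h) for g in grid if g != xi)  # Term 1 sum
        total += math.log(s) - math.log((n - 1) * h)           # minus Term 2
    return total / n

# Golden-section search for the maximum of mlcv(h) on [1, 10],
# mirroring what R's optimize() does internally
def golden_max(f, lo, hi, tol=0.01):
    gr = (math.sqrt(5) - 1) / 2      # inverse golden ratio, ~0.618
    a, b = lo, hi
    while b - a > tol:
        c, d = b - gr * (b - a), a + gr * (b - a)
        if f(c) > f(d):
            b = d                    # maximum lies in [a, d]
        else:
            a = c                    # maximum lies in [c, b]
    return (a + b) / 2

h_opt = golden_max(mlcv, 1, 10)
print(round(h_opt, 2))               # ~6.16
```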

The optimized value of **h** in this example is **6.16**, and the corresponding value of the objective function is **−2.295783**.

The KDE is estimated and plotted using the optimized bandwidth (**h** = 6.16) and compared with the KDE obtained using the *density* function in R. As shown in the plot below, the KDE with optimized **h** is quite close to the KDE plotted using R's *density* function.

# R-Code

**KDE**

```r
# KDE
data <- c(65, 75, 67, 79, 81, 91)
h <- 5.5
kernelpoints <- seq(50, 150, 1)

plot(NA, NA, xlim = c(50, 120), ylim = c(0, 0.04),
     xlab = 'X', ylab = 'K (= density)')

kde <- NULL
for (i in 1:length(data)) {
  z <- (kernelpoints - data[i]) / h
  multi <- 1 / sqrt(2 * pi)
  # Normalized kernel at data point i: divide by n * h
  kerneld <- (multi * exp(-0.5 * z^2)) / (length(data) * h)
  lines(kernelpoints, kerneld, lwd = 3)
  kde <- cbind(kde, kerneld)
}

# KDE is the row-wise sum of the six normalized kernels
kde_sum <- rowSums(kde)
lines(kernelpoints, kde_sum, lwd = 3, col = 'red')
grid(20, 20)
```

**MLCV: Bandwidth optimization**

```r
# MLCV
x <- c(65, 75, 67, 79, 75, 63, 71, 83, 91, 95)
y <- seq(50, 100, 1)
x <- sort(x)
n <- length(x)
u <- outer(y, x, "-")

fmlcv <- function(h) {
  D <- (1 / sqrt(2 * pi)) * exp(-0.5 * (u / h)^2)
  colnames(D) <- x
  rownames(D) <- y
  # Leave-one-out: zero the kernel value where the grid point equals xi
  for (i in 1:n) {
    kk <- which(rownames(D) == colnames(D)[i])
    D[kk, i] <- 0
  }
  D1 <- log(colSums(D))    # Term 1
  D2 <- log((n - 1) * h)   # Term 2
  Fx <- D1 - D2
  mean(Fx)
}

lower <- 1
upper <- 10
tol <- 0.01

# Using the optimize function in R (golden-section search)
obj <- optimize(fmlcv, c(lower, upper), tol = tol, maximum = TRUE)
print(obj)

plot(density(x), lwd = 4, col = 'purple')  # From R library
lines(density(x, bw = obj$maximum), lwd = 2, col = 'red', lty = 1)
legend("topright", legend = c("KDE from R", "KDE from MLCV"),
       col = c("purple", "red"), lty = 1, cex = 0.8, lwd = c(4, 2),
       text.font = 4)
print(obj$maximum)
```

# References

- Arsalane Chouaib Guidoum, *Kernel Estimator and Bandwidth Selection for Density and Its Derivatives*, 2015.
- Yen-Chi Chen, *Density Estimation: Histogram and Kernel Density Estimator*, 2018.