Essential Math for Machine Learning: Kernel Density Estimation

The Smoothed Histogram

Dagang Wei
Feb 3, 2024
Source: https://en.wikipedia.org/wiki/Kernel_density_estimation

This article is part of the series Essential Math for Machine Learning.

Introduction

In the expansive domain of machine learning, understanding the foundational mathematical concepts is crucial for developing intuitive and effective models. Among these concepts, Kernel Density Estimation (KDE) stands out as a powerful non-parametric way to estimate the probability density function (PDF) of a random variable. This blog post delves into what KDE is, why it’s important, how it works, when to use it, and provides an illustrative example of using KDE for outlier detection in Python.

What is Kernel Density Estimation?

Kernel Density Estimation (KDE) is a technique used to estimate the probability density function (PDF) of a continuous random variable. It is a non-parametric method, meaning it does not assume any underlying distribution for the data. Instead, KDE smooths the observed data points using a kernel function, essentially providing a gentle curve that represents the density of data points at different values.

Why KDE?

KDE is particularly useful because it offers a flexible approach to understanding the distribution of data points in a dataset. Traditional histogram-based methods can be overly reliant on the choice of bin size and placement, which can significantly affect the representation of the data’s distribution. KDE overcomes these limitations by providing a smooth estimate that can adapt to the actual distribution of the data, making it a more reliable tool for data analysis, especially in cases where the underlying distribution is unknown.

How Does KDE Work?

The essence of KDE lies in how it uses the kernel function to create a smooth estimate of the density. For each point x where we want to estimate the density, KDE calculates the contribution of each data point x_i to the density at x, weighted by the kernel function. The sum of these contributions, normalized by the number of data points and the bandwidth, gives us the density estimate at x.

The KDE formula for a data point x is:

d(x) = (1 / (n * h)) * Σ_{i=1..n} K((x - x_i) / h)

where:

  • d(x): Estimated density at point x
  • n: Number of data points
  • K: Kernel function
  • h: Bandwidth
  • x_i: Individual data points

Mathematically, the bandwidth h is a scaling parameter that affects the spread of the kernel function over the data. It controls how wide or narrow the kernel is around each data point, thus influencing how much each point contributes to the density estimate at any given location. Choosing the right bandwidth is crucial; too small a bandwidth leads to a density estimate that is too jagged, reflecting noise rather than the underlying distribution, while too large a bandwidth oversmooths the data, potentially obscuring important features of the distribution.
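
To make the formula concrete, here is a minimal, illustrative 1D implementation of it in NumPy (the function name kde_1d and the sample values are made up for this sketch):

import numpy as np

def kde_1d(x, data, h):
    # Standardized distances between each evaluation point and each sample
    u = (x[:, None] - data[None, :]) / h  # shape: (len(x), len(data))
    # Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2 * pi)
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # d(x) = (1 / (n * h)) * sum_i K((x - x_i) / h)
    return K.sum(axis=1) / (len(data) * h)

samples = np.array([1.0, 2.0, 2.5, 3.0, 7.0])
print(kde_1d(np.array([2.0, 5.0]), samples, h=0.5))

The estimated density near the cluster of samples around 2–3 comes out much higher than at 5.0, which lies far from every sample; this contrast is exactly what the outlier-detection example below exploits.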

When to Use KDE?

KDE is incredibly versatile and can be used in various scenarios, including but not limited to:

  • Data visualization: KDE can provide a more intuitive understanding of the data distribution compared to histograms (see the short sketch after this list).
  • Outlier detection: By examining the regions of low probability density, one can identify outliers in the data.
  • Data preprocessing: KDE can be used to transform skewed data into a more uniform or normal distribution, which can be beneficial for certain types of models.
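
For quick visualization you usually don't need to write the estimator yourself. As a minimal sketch (assuming SciPy and Matplotlib are installed, with made-up sample data), scipy.stats.gaussian_kde smooths a 1D sample and chooses a bandwidth automatically via Scott's rule unless you override it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Some skewed 1D sample data (illustrative only)
rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=300)

# gaussian_kde picks a bandwidth automatically (Scott's rule) unless overridden
kde = gaussian_kde(samples)

xs = np.linspace(0, samples.max() + 1, 200)
plt.hist(samples, bins=30, density=True, alpha=0.4, label='Histogram')
plt.plot(xs, kde(xs), label='KDE')
plt.legend()
plt.show()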

Example: Outlier Detection with KDE

Let’s implement a simple example of using KDE for outlier detection from scratch in Python. We generate sample data with two clusters plus a few outliers, apply KDE to estimate the density, and then use a simple threshold on the estimated density to identify and highlight the outliers. This illustrates how KDE can serve as a basis for more complex statistical analyses and machine learning tasks.

The code is available in this colab notebook.

import numpy as np
import matplotlib.pyplot as plt

# Define the multivariate Gaussian kernel function
def gaussian_kernel(u):
    # u is the standardized distance between data points and evaluation points
    # d is the number of dimensions (features) of the data
    d = u.shape[1]
    # Normalization constant for the Gaussian kernel in d dimensions
    normalization = (2 * np.pi) ** (-d / 2)
    # Compute the Gaussian function
    return normalization * np.exp(-0.5 * np.sum(u**2, axis=1))

# Multidimensional KDE implementation
def kde_2d(x, data, bandwidth):
    # n: number of data points
    n = data.shape[0]
    # d: dimensionality of the data
    d = x.shape[1]
    # Initialize an array to store the density estimate at each point in x
    estimate = np.zeros(x.shape[0])
    # Loop over each data point to compute its contribution to the density estimate
    for i in range(n):
        # Calculate the standardized distance u using the bandwidth
        u = (x - data[i]) / bandwidth
        # Accumulate the Gaussian kernel contribution from this data point
        estimate += gaussian_kernel(u)
    # Normalize by the number of points and the bandwidth raised to the dimensionality
    estimate /= (n * (bandwidth ** d))
    return estimate

# Generate sample data with 2 features and 2 centers
np.random.seed(42) # For reproducibility
data_center1 = np.random.normal(0, 1, (500, 2))
data_center2 = np.random.normal(5, 1, (500, 2))

# Manually add a few outliers
outliers = np.array([[10, 10], [11, 11], [-3, 10], [7, -5]])
data = np.vstack((data_center1, data_center2, outliers))

# Create a grid for KDE evaluation and visualization
x = np.linspace(min(data[:, 0]) - 2, max(data[:, 0]) + 2, 100)
y = np.linspace(min(data[:, 1]) - 2, max(data[:, 1]) + 2, 100)
X, Y = np.meshgrid(x, y)
xy = np.vstack([X.ravel(), Y.ravel()]).T

# Compute KDE on the grid
bandwidth = 0.5 # Bandwidth affects the smoothness of the KDE
pdf = kde_2d(xy, data, bandwidth).reshape(X.shape)

# Detect outliers: points with very low probability density
threshold = 0.001 # Threshold for density to consider a point an outlier
outlier_mask = kde_2d(data, data, bandwidth) < threshold
detected_outliers = data[outlier_mask]

# Plotting the results
plt.figure(figsize=(10, 8))
plt.contourf(X, Y, pdf, levels=50, cmap='Blues') # Density plot
plt.colorbar(label='Density')
plt.scatter(data[:, 0], data[:, 1], s=5, color='k', alpha=0.5, label='Data Points') # Data points
plt.scatter(detected_outliers[:, 0], detected_outliers[:, 1], s=50, color='red', label='Detected Outliers') # Outliers
plt.title('2D KDE and Outlier Detection with Two Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
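
For comparison, the same outlier-detection step can be reproduced with scikit-learn's KernelDensity estimator. This is a rough sketch that assumes scikit-learn is installed and reuses the data, bandwidth, and threshold variables defined above; score_samples returns log-densities, so we exponentiate before thresholding:

from sklearn.neighbors import KernelDensity

# Fit a Gaussian-kernel density model with the same bandwidth used above
kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(data)

# score_samples returns log-densities; exponentiate to recover densities
densities = np.exp(kde.score_samples(data))

# Flag points whose estimated density falls below the same threshold
sklearn_outliers = data[densities < threshold]
print(len(sklearn_outliers), "points flagged as outliers")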

Conclusion

KDE is a potent tool in the machine learning arsenal, but it is not a one-size-fits-all solution. Experiment with different kernels, bandwidths, and scenarios to unlock its full potential in your machine learning journey!
