A Brief Introduction: Gaussian Process Regression

Patrick Sandoval Romero
LatinXinAI
11 min read · Apr 29, 2024


So far, my previous posts on regressors have covered techniques that require us to provide a family of functions we believe will fit the data appropriately. Whether it's a theoretically motivated model or an empirically motivated function, we must select a single class of functions to fit our observations. In all these cases the regressor has some loss function that is minimized by optimizing the parameters of that model function. These models are what we call parametric, and they are widely used across the sciences.

A Gaussian process is a non-parametric model that does not find the optimal parameters for a specific class of functions. Instead, it finds the function that best fits the observations while also providing confidence bands for regions that lack observations.

This article will provide a high-level overview of Gaussian processes and their implementation in Python using the sklearn library. Some statistics and Python background will help, as I will be covering multivariate Gaussian distributions, joint probability, marginal probability, and covariance matrices. You are welcome to check out my code on my GitHub repository!

Medium/GPR at main · 159Patrick159/Medium (github.com)

Motivation

Gaussian processes (GP) are an incredibly powerful tool for modeling complex data with few observations. The power of the method comes from its ability to explore an infinite set of functions with finite computational resources. As previously mentioned, this type of regressor is called non-parametric, meaning it makes no strong assumptions about the distribution of the data, which allows us to model more complex problems. The main drawback of Gaussian process regression (GPR) is that its results are less interpretable than those of common parametric models.

Introduction to Gaussian Distributions

This section will serve as a quick refresher on some basic statistical concepts; if you feel comfortable with your stats knowledge you may skip it.

A Gaussian is a type of continuous probability distribution that can be fully described by a mean and a standard deviation. Gaussians are symmetric about the mean and bell-shaped. The probability density of a 1D Gaussian is fully described by a scalar mean and a scalar standard deviation.
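In symbols, the density of such a random variable X is:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$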

This expression describes the distribution of the continuous random variable X with mean mu and standard deviation sigma; we call this a univariate normal distribution.

We can extend this definition for two dependent continuous random variables X and Y with a 2D mean vector “mu” and a 2x2 covariance matrix “sigma”.

Where mu and sigma can be explicitly written as:
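$$ \boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho\,\sigma_X \sigma_Y \\ \rho\,\sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} $$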

Each entry of the mean vector is the mean of the ‘ith’ variable, and sigma is the covariance matrix, also known as the kernel. The diagonal entries of the covariance matrix give the variation of the ‘ith’ variable with itself, also known as the variance, and the off-diagonal entries give the variation of the ‘ith’ variable with the ‘jth’ variable, also known as the covariance. In this particular construction of the kernel, the Greek letter “rho” represents the correlation between X and Y, and “sigma_x” and “sigma_y” represent the standard deviations of each variable. Note that the kernel of a Gaussian distribution is always symmetric.

Let us visualize a bivariate normal distribution with the following mean vector and covariance matrix.

Figure 1: Bivariate normal distribution (BVN) for random variables X and Y, with the color bar representing the joint probability distribution. Line plots indicate the marginal probability distributions. The contour plot of the BVN, located at Z=-0.2, indicates a negative correlation between X and Y, as seen in the previously mentioned kernel. Note the different shapes of the marginal distributions.

We can extend the definition of a bivariate normal distribution (2 dimensions, i.e. 2 variables) to a multivariate normal distribution (k dimensions, i.e. k variables). The probability density is then given by:
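$$ f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \,\lvert\boldsymbol{\Sigma}\rvert}}\, \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) $$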

Probability density of a multivariate normal distribution with mean vector mu of dimension k and covariance matrix sigma of dimensions k x k. Here k is the number of dimensions of the Gaussian distribution.

The Python implementations of an MVN can be found on my GitHub repository. I would suggest becoming acquainted with the Einstein summation convention and the documentation of “numpy.einsum()” before looking at my code.
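As a rough illustration of the kind of einsum usage involved (a minimal sketch, not the repository code; the helper name mvn_pdf is just for this example), the MVN density above can be evaluated as:

import numpy as np

def mvn_pdf(x, mu, Sigma):
    '''
    Evaluate the k-dimensional Gaussian density at each row of x.
    x: (n, k) array of points, mu: (k,) mean vector, Sigma: (k, k) covariance.
    '''
    k = mu.shape[0]
    diff = x - mu                              # (n, k) deviations from the mean
    Sigma_inv = np.linalg.inv(Sigma)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every row, via einsum
    quad = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)
    norm = np.sqrt((2*np.pi)**k * np.linalg.det(Sigma))
    return np.exp(-0.5*quad) / norm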

High-Level Overview of Gaussian Process Regressor

In order to explore all possible functions that can accurately describe our observations, we need to change our view of functions to being simply arrays of numbers that vary smoothly. When we think of functions as numbers in an array, we get a better grasp of what this technique really does.

Say we have a few observations of something we are interested in studying as seen in the figure below.

Figure 2: Observations of some phenomenon over variable X

Our objective is to create a model that uses the observations and can predict new data. One perfectly valid approach is to connect all the points with straight lines, and then use these lines to predict data points within the range of X=0.5 to X=4.5.

Figure 3: Linear interpolation of observed data.

However, to many of us this solution should be unsatisfactory, for three main reasons. First, we cannot predict data points outside of the region [0.5, 4.5]; second, we have no evidence to believe that the observations are related in this linear manner; and lastly, we have no uncertainty measurements on the predicted data points.

So now I will propose a different solution: we will randomly generate observations in the empty regions by sampling points from a Gaussian distribution, and then connect these generated points to our observations with straight lines.

Figure 4: Randomly generated data points for X=1.3, X=1.8, and X=2.8. Sampled points follow Gaussian distributions with the same standard deviation but different means.

A Gaussian process follows a similar narrative to the one proposed above, with the main difference being that it generates the new data by sampling from an MVN instead of creating independent Gaussians at each X-column. A Gaussian process is able to create smooth lines by using a kernel function, which states how data points are related to one another. This prevents discontinuities in our predictions, since it ensures the variation from data point to data point is continuous. This means that the key aspect of a Gaussian process is selecting the right kernel function. The following figure shows 30 draws from a 100-dimensional MVN distribution with a radial basis function kernel, each draw providing one point at every X-column.

Figure 5: Thirty samples from the posterior distribution of a Gaussian process regressor with a radial basis function kernel, a vertical scale of 9.08, and a horizontal scale of 0.71. Notice that I am sampling 100 data points, of which 5 are observations; therefore, the Gaussian I am sampling from has 100 dimensions, with a covariance matrix of dimensions (100 x 100).

Although this may seem like we are plotting analytical functions, we are really just connecting the points sampled from the MVN. Now it should become clearer how a Gaussian process is able to test all possible functions that can represent a dataset. Lastly, the prediction given by a Gaussian process is the mean of all these sampled functions.
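To make this "functions are just arrays sampled from an MVN" idea concrete, here is a minimal sketch that draws smooth curves from a zero-mean 100-dimensional Gaussian with an RBF kernel. The grid and the length scale of 0.7 are illustrative choices, not the exact settings behind Figure 5:

import numpy as np
import matplotlib.pyplot as plt

# 100 X-columns; each draw from the 100-D Gaussian gives one value per column
x = np.linspace(0, 5, 100)

# RBF kernel as the covariance between every pair of X values
l = 0.7  # horizontal length scale (illustrative)
K = np.exp(-0.5*(x[:, None] - x[None, :])**2 / l**2)

# Draw 30 samples from the zero-mean Gaussian; the jitter keeps K numerically positive definite
samples = np.random.multivariate_normal(np.zeros_like(x), K + 1e-8*np.eye(len(x)), size=30)

# Connecting the sampled points of each draw produces a smooth curve
for s in samples:
    plt.plot(x, s, lw=0.8, alpha=0.6)
plt.show()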

Understanding Kernels

Kernels are essential for Gaussian Processes, as they dictate how uncertain a data point is with respect to a given observation. A good way to understand how correlation helps us determine the uncertainty of a new data point is to realize that highly correlated points vary less than uncorrelated data points.

The radial basis function (RBF) and the rational quadratic function (RQF) are the most common kernels used in practice, as the correlation between data points is given by their distance from one another. In other words, the further two data points are from each other, the less correlated they are.

Here I’ll show three kernels widely used in practice: the RBF, the RQF, and the exponential sine-squared kernel, which is commonly used for periodic observations.

Figure 6: Different kernels used for Gaussian processes. The top two panels show a squared exponential kernel, also known as RBF, and the covariance between a random variable X and X=0. The middle two panels show an exponential sine-squared kernel and the covariance between X and X=0. The bottom two panels show a rational quadratic kernel and the covariance between X and X=0. The plots in the right column are obtained by taking a horizontal slice at X=0 of the plots in the left column.

The plots in the left column are analogous to the 2x2 covariance matrix presented in the statistics background section. Here the diagonal entries represent the variance of a variable with itself, while the off-diagonal entries represent the covariance between two different variables. The main difference between the RBF and RQF is that the covariance value drops faster for the RBF kernel than it does for the RQF. Lastly, the exponential sine-squared kernel has periodic covariance fluctuations, which allows distant points to be highly correlated depending on the periodicity of this kernel. Here is the Python code for each kernel; notice that they all depend on the Euclidean distance between the two points.

import numpy as np
import scipy.spatial

# Define kernels through functions
def SquaredExponential(x1, x2, a=1, l=1):
    '''
    Squared Exponential (RBF) kernel with hyperparameters
    a (vertical scale) and l (horizontal length scale).
    '''
    # Squared Euclidean distance between every pair of points
    val = -1/(2*l**2) * scipy.spatial.distance.cdist(x1, x2, 'sqeuclidean')
    return a**2 * np.exp(val)

def RationalQuadratic(x1, x2, alpha=1, l=1):
    '''
    Rational Quadratic kernel with hyperparameters
    alpha (scale-mixture parameter) and l (horizontal length scale).
    '''
    sqdist = scipy.spatial.distance.cdist(x1, x2, 'sqeuclidean')
    return (1 + sqdist/(2*alpha*l**2))**(-alpha)

def ExpSine(x1, x2, a=1, l=1, p=1):
    '''
    Exponential Sine-Squared kernel with hyperparameters
    a (vertical scale), l (length scale) and p (periodicity).
    '''
    # Note: this kernel uses the plain Euclidean distance, not its square
    dist = scipy.spatial.distance.cdist(x1, x2, 'euclidean')
    val = -2*np.sin(np.pi*dist/p)**2 / l**2
    return a**2 * np.exp(val)
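As a quick usage sketch (the grid limits and the periodicity p=2 are arbitrary choices for illustration), covariance surfaces and X=0 slices like those in Figure 6 can be computed from these functions like so:

# Evaluate each kernel on a 1-D grid to obtain covariance matrices like Figure 6
x = np.linspace(-5, 5, 201)[:, None]
K_rbf = SquaredExponential(x, x)
K_rq = RationalQuadratic(x, x)
K_sin = ExpSine(x, x, p=2)

# A horizontal slice through X = 0 gives the covariance of every point with X = 0
i0 = np.argmin(np.abs(x))
slice_rbf = K_rbf[i0, :]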

Implementing a Gaussian Process Regressor with Python

Now that we have a better understanding of GPRs as samplers from high-dimensional Gaussians, and of how this process makes new predictions and computes uncertainties, I will demonstrate how to implement it in Python. First, we start with the relevant imports.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used by the plotting helper below
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic as RQ

I’m only importing the RBF and RQF kernels because the data we are seeing does not appear to be periodic. Kernels, like most machine learning models, have hyperparameters that need to be fine-tuned. In the case of these two kernels, we have two hyperparameters: a vertical scale and a horizontal length scale. These two parameters tell the kernel how closely the data is grouped and how strongly the values should vary within those scales. From an initial inspection, I set the vertical scale to 1 and the horizontal length scale to 0.1 for both kernels; they will be optimized later on.

# Define squared exponential kernel with set length scale and bounds given data
RBFkernel = 1.0*RBF(length_scale=0.1,length_scale_bounds=(1e-5,1))
RQkernel = 1.0*RQ(length_scale=0.1,length_scale_bounds=(1e-5,1))

# Create a Gaussian Process Regressor for specific kernel
gp_RBF = GaussianProcessRegressor(kernel=RBFkernel, random_state=42)
gp_RQ = GaussianProcessRegressor(kernel=RQkernel, random_state=42)

We can take a look at prior samples of functions from each GPR by using the following helper function, originally provided by sklearn and slightly modified by myself.

def plot_gpr_samples(gpr_model, n_samples, ax):
    """Plot samples drawn from the Gaussian process model.

    If the Gaussian process model is not trained then the drawn samples are
    drawn from the prior distribution. Otherwise, the samples are drawn from
    the posterior distribution. Be aware that a sample here corresponds to a
    function.

    Parameters
    ----------
    gpr_model : `GaussianProcessRegressor`
        A :class:`~sklearn.gaussian_process.GaussianProcessRegressor` model.
    n_samples : int
        The number of samples to draw from the Gaussian process distribution.
    ax : matplotlib axis
        The matplotlib axis where to plot the samples.
    """
    x = np.linspace(0, 5, 100)
    X = x.reshape(-1, 1)

    y_mean, y_std = gpr_model.predict(X, return_std=True)
    y_samples = gpr_model.sample_y(X, n_samples)

    for idx, single_prior in enumerate(y_samples.T):
        sns.lineplot(
            x=x,
            y=single_prior,
            linestyle="--",
            alpha=0.7,
            label=f"Sampled function #{idx + 1}",
            ax=ax,
        )
    sns.lineplot(x=x, y=y_mean, color="black", label="Mean", ax=ax)
    ax.fill_between(
        x,
        y_mean - y_std,
        y_mean + y_std,
        alpha=0.1,
        color="black",
        label=r"$\pm$ 1 std. dev.",
    )
    ax.legend()
    ax.set_xlabel('X')
    ax.set_ylabel('y')
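
Calling the helper on the two regressors before they are fit draws from their priors; something like the following sketch (the figure size and five samples per panel are just illustrative choices) produces plots in the spirit of Figure 7:

# Sample from the priors: neither model has been fit yet
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
plot_gpr_samples(gp_RBF, n_samples=5, ax=ax1)
ax1.set_title("Prior samples, RBF kernel")
plot_gpr_samples(gp_RQ, n_samples=5, ax=ax2)
ax2.set_title("Prior samples, RQF kernel")
plt.tight_layout()
plt.show()
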
Figure 7: Prior samples from GPRs with a mean of zero and a radial basis function kernel (top plot) or a rational quadratic kernel (bottom plot). These functions have been sampled from an MVN distribution that has not been trained on any observations.

These plots draw samples from a 100-D Gaussian with a zero mean vector and an RBF or RQF kernel, respectively. The corresponding covariance matrices are shown in the top-left and bottom-left panels of Figure 6, where the square root of the ‘ith’ diagonal entry gives the standard deviation of the ‘ith’ variable.

We can fit the MVN distributions to our observations through the following Python code.

# Fit both models to the observations
# (X holds the observed inputs as a column vector and y the observed values; see the repository)
gp_RBF.fit(X, y)
gp_RQ.fit(X, y)

Using the fit() method, we are asking sklearn to adjust the covariance matrix to the observations and to find the optimal hyperparameters for the kernel by maximizing the log-likelihood, a topic I’ve covered in my previous posts. Here are some descriptive statistics for each trained Gaussian process regressor.

Figure 8: Descriptive statistics of the different GPR fits to the observed data. Notice that both models have an R-squared value of one, implying a perfect fit on the data. However, the log-likelihood of the GPR with a rational quadratic kernel is higher than that of the GPR with the radial basis function kernel. It’s important to remember that a log-likelihood value by itself is mostly meaningless; the best way to use this statistic is to compare it between models.
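The log-likelihood values compared in Figure 8 can be read directly off the fitted sklearn objects; a minimal sketch:

# Optimized kernels and log-marginal likelihoods stored by fit()
print("RBF kernel after fitting:", gp_RBF.kernel_)
print("RQF kernel after fitting:", gp_RQ.kernel_)
print("RBF log-marginal likelihood:", gp_RBF.log_marginal_likelihood_value_)
print("RQF log-marginal likelihood:", gp_RQ.log_marginal_likelihood_value_)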

Now we can ask the newly fit model to make predictions on a more finely sampled domain. Once the predictions have been made, we can plot the mean prediction and its corresponding uncertainty.

# Make predictions on new finely sampled data
X_new = np.linspace(0, 5, 100)[:, np.newaxis]
y_predRBF, covRBF = gp_RBF.predict(X_new, return_cov=True)
y_predRQ, covRQ = gp_RQ.predict(X_new,return_cov=True)

Figure 9: Posterior sample functions of the GPR with an RBF kernel (top plot) and an RQF kernel (bottom plot). Dashed lines are functions sampled from their respective posteriors. The solid black line is the mean prediction of all functions from the MVN distribution, and the shaded areas represent a one-sigma confidence band for the prediction in that region.
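For reference, here is a minimal sketch of how a mean prediction and one-sigma band like those in Figure 9 can be drawn from the quantities returned above (X and y are the observation arrays used in fit() earlier; only the RBF model is shown):

# One-sigma uncertainty is the square root of the diagonal of the posterior covariance
std_RBF = np.sqrt(np.diag(covRBF))

plt.plot(X_new.ravel(), y_predRBF, color="black", label="Mean prediction")
plt.fill_between(X_new.ravel(), y_predRBF - std_RBF, y_predRBF + std_RBF,
                 alpha=0.2, color="black", label=r"$\pm$ 1 std. dev.")
plt.scatter(X.ravel(), y, color="red", zorder=3, label="Observations")
plt.legend()
plt.xlabel("X")
plt.ylabel("y")
plt.show()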

We can also take a look at the covariance matrix of the posterior MVN distribution to verify the confidence bands of our predictions and see how our x-values are correlated with one another.
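A short sketch of how such heatmaps can be produced (the axis extent of 0 to 5 simply mirrors the X-range used above):

# Visualize the posterior covariance matrices as heatmaps
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
im1 = ax1.imshow(covRBF, extent=(0, 5, 5, 0))
ax1.set_title("Posterior covariance, RBF")
im2 = ax2.imshow(covRQ, extent=(0, 5, 5, 0))
ax2.set_title("Posterior covariance, RQF")
fig.colorbar(im1, ax=ax1)
fig.colorbar(im2, ax=ax2)
plt.show()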

Figure 10: Covariance matrices for the GPR with an RBF kernel (left plot) and an RQF kernel (right plot). If we look at the diagonal elements of these matrices, we notice four bulges that stand out; these bulges are the regions of high uncertainty in our plots, due to the lack of data there. Additionally, we see that the scale of the covariance varies slightly with each kernel, as initially seen in Figure 6.

We notice a strong resemblance between the covariance matrices of these two GPRs; the main difference lies in the scale of the covariance between each pair of variables. As previously mentioned in Figure 6, the covariance of the RBF kernel drops off more steeply with distance, meaning the variation is highly sensitive to distance. This causes nearby points to be highly correlated and thus leads to less variation between these points.

Discussion and Evaluation

From the descriptive statistics presented in Figure 8, we can conclude that the GPR with an RQF kernel is the superior model, due to its higher log-likelihood. This would imply that for this particular set of observations we want to avoid correlating our variables too strongly, in order to allow for more variation in areas of uncertainty. Since we have so few data points, it’s counterproductive to impose too many restrictions on the variability of our data.

Figure 11: Comparison of posterior distribution predictions to the true distribution. The top plot is the posterior of the MVN distribution with an RBF kernel, and the bottom plot is the posterior of the MVN distribution with an RQF kernel.

In Figure 11 we can observe how both models fail to predict the decreasing behavior of the true data for points past the last observation. What led the models to perform so poorly in this region is the lack of observations past that point: with no other points to correlate against, the predictions slowly converge back to zero. This happens because the MVN is initially defined with a zero-mean vector before it is trained on the observations.
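A quick way to see this reversion numerically (a hedged sketch; X_far is just an illustrative grid beyond the data) is to ask the fitted model for predictions well past the observed range:

# Far from the observations the posterior mean drifts back toward the prior mean of zero,
# and the predictive standard deviation grows back toward the prior standard deviation
X_far = np.linspace(5, 10, 50)[:, np.newaxis]
y_far, std_far = gp_RBF.predict(X_far, return_std=True)
print(y_far[-5:])
print(std_far[-5:])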

For this reason alone, we need to be cautious about using Gaussian processes for extrapolation. However, this tool is exceptional for interpolation tasks, since we make no strong assumptions about the nature of the data and we get uncertainty estimates on newly generated data.

