Bayesian Optimisation with Gaussian Processes Part 2

Gaussian Process Regression

Wim de Villiers
10 min read · Dec 29, 2021

This post is article 2 in a series on Bayesian optimisation using Gaussian process regression. The series contains:

1. Multivariate Gaussians

2. Gaussian Process Regression (this article)

3. Bayesian Optimisation using Gaussian Process Regression

If you have not read the first article, I highly recommend reading it before continuing.

All the code I used to generate plots in this blog series can be found in this repo.

Recap

In the first article we stated the steps that are undertaken in Bayesian optimisation:

1. Randomly evaluate some points across the optimisation domain.

2. Use these evaluations to regress a function across the domain. In this article series we will, at times, refer to this regressed function as the mean function.

3. Calculate the uncertainties / variances / standard deviations associated with your regressed function across the optimisation domain.

4. Use these uncertainties and the mean function to determine which point in the domain is most likely to move us towards the desired optimum.

5. Evaluate at this point and add the evaluation to the set of all evaluated points.

6. Return to step 2 and repeat.

These steps can be summarised in the following image:

Steps to Bayesian Optimisation

In steps 2 and 3 we see that a method of statistical regression is required. In this blog series we have decided to use Gaussian process regression for that purpose. This isn’t an unusual choice: Gaussian processes are extremely good at regressing non-linear functions using only a few data points as input. But what are they? We are going to take quite a detour to explain this, but keep in mind that all we need at the end of the detour is a way to regress non-linear functions which also gives us variances across the domain.

Stochastic Processes: Overview

As Gaussian processes are a type of stochastic process, it is important to understand stochastic processes more generally. The best definition I could find is:

A stochastic process refers to a family of random variables indexed against some other variable or set of variables.

As this is a little abstract, imagine you have n random variables, ordered 1 to n. You also have a domain which you segment into n points. If you sample one value from each random variable and plot it at its corresponding point in the domain, the result is a stochastic process. If the random variables you sampled come from a multivariate Gaussian, you have a Gaussian process. Now, this is a slightly naive definition of a Gaussian process. I will dig into the full definition later, but for now let’s get more comfortable with stochastic processes and look at a well-known example: the Wiener process.

Wiener Process

The Wiener process famously describes Brownian motion: the motion a small particle undergoes when suspended in a fluid. Such particles move haphazardly around the fluid, and it can be shown that the distance Δd a particle travels between time t and t+Δt is distributed according to Δd ~ N(0, Δt). This means that if we checked on a particle at fixed time intervals Δt and noted its position, that position could be described by:

W(t + Δt) = W(t) + Δd, with W(0) = 0 and Δd ~ N(0, Δt)

Hence W(t) is a stochastic process. It can also be formulated as a Gaussian process. However, a more obvious Gaussian process is obtained by plotting Δd itself against time.

We could view each Δd as being sampled individually from N(0, Δt), or we could view all the Δd as being sampled simultaneously from a single multivariate Gaussian. If we observe the process up to time T and set n = T/Δt, that multivariate Gaussian has the n-dimensional zero vector as its mean and the n × n covariance matrix Δt·I.
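To make this concrete, below is a minimal sketch (in Python with NumPy; the choice of language and the values of T and Δt are my own assumptions, not taken from the repo behind this post) that simulates a Wiener process path by drawing all n increments in one go from the multivariate Gaussian N(0, Δt·I) and summing them.

```python
import numpy as np

# A minimal sketch: simulate one Wiener process "path" on [0, T].
# All n increments Δd are drawn in a single sample from an n-dimensional
# Gaussian with zero mean and covariance Δt * I.
T, dt = 5.0, 0.01
n = int(T / dt)

rng = np.random.default_rng(seed=0)
increments = rng.multivariate_normal(mean=np.zeros(n), cov=dt * np.eye(n))

# The position W(t) is the running sum of the increments, with W(0) = 0.
W = np.concatenate([[0.0], np.cumsum(increments)])
t = np.linspace(0.0, T, n + 1)
```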

A depiction of Brownian motion and Δd against time can be seen below.

The Wiener Process

Processes are Functions

In the plots of the Wiener process above, each “path” can be seen as a function f(t) = d: it maps the finite set of domain points onto a set of distance values. However, each time we “run” the process we get a different “path”. In this way we can view the process as a distribution over functions, and this distribution has its own mean and covariance.

The fact that a Gaussian process “path” can be viewed as a function is what will allow us to regress functions later. What is more, the covariance of the process will give us the variances we need for Bayesian optimisation. It is now time to look at the full definition.

The Full Definition

I have looked everywhere for the best definition of a Gaussian process. None has been more intuitive than the one from this blog, which is the main source I used when writing this post. I will quote it here verbatim:

Gaussian processes are distributions over functions, f(x). The distribution is defined by a mean function m(x) and positive definite covariance function k(x, x′), with x as the domain values and (x, x′) all possible pairs in the input domain:

f(x) ~ GP(m(x), k(x, x′))

For any finite subset X = {x₁, x₂, …, xₙ} of the domain of x, the marginal distribution is a multivariate Gaussian distribution:

f(X) ~ N(m(X), K(X, X′))

with mean vector μ = m(X) and covariance matrix K(X, X′).

Note that the first part of the definition can be applied to any domain, say an interval on the real numbers. However, sampling from the multivariate Gaussian which represents a Gaussian process over such a domain would mean sampling from an infinite-dimensional multivariate Gaussian. That said, when working on a computer we never deal with domains made up of infinitely many points: we segment domains into a fixed number of points. Hence the second part of the definition becomes the most important part. If we want to sample from a Gaussian process over a domain with a finite number of points, we only need to sample from a multivariate Gaussian. All we need to define a multivariate Gaussian is a mean vector and a covariance matrix, so if we have those we can sample from a Gaussian process.
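In code, that observation amounts to a single draw from a multivariate normal. Below is a minimal sketch of how one might sample a Gaussian process “path” over a finite domain; the helper sample_gp_path and the toy zero-mean, identity-covariance example are my own illustration, not code from the repo behind this post.

```python
import numpy as np

def sample_gp_path(X, mean_fn, cov_fn, rng=None):
    """Draw one "path" of a Gaussian process evaluated at the finite set of points X."""
    rng = rng or np.random.default_rng()
    mu = np.array([mean_fn(x) for x in X])                    # mean vector m(X)
    K = np.array([[cov_fn(xi, xj) for xj in X] for xi in X])  # covariance matrix K(X, X)
    return rng.multivariate_normal(mu, K)

# Toy example: zero mean and a covariance that is 1 on the diagonal and 0
# elsewhere, i.e. every point in the domain is sampled independently.
X = np.linspace(-3, 3, 51)
path = sample_gp_path(X,
                      mean_fn=lambda x: 0.0,
                      cov_fn=lambda xi, xj: 1.0 if xi == xj else 0.0)
```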

At this point you must be thinking: that is all very good, but there is one problem. How will sampling from a multivariate Gaussian ever give you a result that in any way resembles a useful function? Surely all you will ever get is noisy nonsense. Well, the magic is in the covariance function / matrix.

Covariance Functions

In the following examples we will be using Gaussian processes with a mean function m(x) = 0. Over a finite domain this implies that our mean vector is the zero vector.

Now that we have that sorted, let’s consider two covariance functions:

k(x, x′) = σ² δ(x, x′), where δ(x, x′) is 1 when x = x′ and 0 otherwise,

and

k(x, x′) = σ² exp( −(x − x′)² / (2ℓ²) )

The first function results in a covariance matrix which is a scaled identity matrix. The second function is called the squared exponential. For the domain [-3, 3] segmented into 51 points, the squared exponential gives the following matrix:

Matrix plot of squared exponential covariance matrix (right) and line plot of center row of the matrix (left).

What can be seen from this plot is that adjacent points are extremely correlated, and as the distance between points increases the correlation drops off. What is more, the correlation is so strong when points are close together that one could easily imagine the “paths” resulting from this covariance matrix being extremely smooth due to the local correlations. From a global point of view, due to the lack of correlation between distant points, one can expect some variance. That is in fact exactly what we see when we sample from a Gaussian process defined by a mean function of zero and a squared exponential covariance matrix.

The result above should be understandable but surprising. It is quite counter-intuitive to achieve such a smooth function by sampling a multivariate Gaussian. In fact, this smoothness really is the magic of this particular covariance matrix. If we look at the second plot of the first figure, which demonstrates sampling from a Gaussian process defined by a scaled identity matrix, we see nothing nearly as special.
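To see this in code, here is a small self-contained sketch that builds the squared exponential covariance matrix over [-3, 3] with 51 points and draws a few paths from each of the two zero-mean processes. The length scale ℓ = 1 and σ = 1 are assumptions on my part; the post does not state the values behind its plots.

```python
import numpy as np

# Squared exponential covariance over the domain [-3, 3] with 51 points.
# The length scale l and scale sigma are assumed to be 1 for this sketch.
def squared_exponential(x1, x2, l=1.0, sigma=1.0):
    return sigma**2 * np.exp(-((x1 - x2) ** 2) / (2 * l**2))

X = np.linspace(-3, 3, 51)
K = squared_exponential(X[:, None], X[None, :])   # 51 x 51 covariance matrix
mean = np.zeros(len(X))

rng = np.random.default_rng(seed=42)

# Paths from the squared exponential covariance come out smooth; paths from
# the scaled identity covariance are just uncorrelated noise. A tiny "jitter"
# on the diagonal keeps the sampling numerically stable.
smooth_paths = rng.multivariate_normal(mean, K + 1e-8 * np.eye(len(X)), size=3)
noisy_paths = rng.multivariate_normal(mean, np.eye(len(X)), size=3)
```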

At this point you should be thinking: these smooth “paths” are fantastic. I wonder if we could use them to regress an unknown function. The answer is: yes, easily.

Gaussian Process Regression

We have just learnt that deriving “paths” from a Gaussian process is as easy as sampling from a multivariate Gaussian. We have also learnt that certain Gaussian processes give extremely smooth “paths” when sampled. What if we knew an unknown function’s value at four locations but nowhere else and wanted to regress that function? Well, four function values are not very many, so we would have to make some other assumptions. Let’s say that the function is non-linear and smooth. We still wouldn’t have enough to go on. So let’s make a leap: assume that the function can be well described by a Gaussian process with a squared exponential covariance function and a mean of zero. Now regression becomes easy. Let’s look at why.

What we want to achieve is a Gaussian process “path” with the correct characteristics (smoothness etc.) which passes through our known points. As a “path” is obtained by sampling from a multivariate Gaussian, we can simply condition the assumed Gaussian on the values of the known points. But which of the potential “paths” should we take as the regressed function? The answer is the new conditional mean. In my previous post I took a deep dive into conditional multivariate Gaussians. For the sake of simplicity I will simply state the formulas for the conditional mean and conditional covariance here.

Conditional mean: μ₂|₁ = μ₂ + Σ₂₁ Σ₁₁⁻¹ (y₁ − μ₁)

Conditional covariance: Σ₂|₁ = Σ₂₂ − Σ₂₁ Σ₁₁⁻¹ Σ₁₂

Note that in the above, y₁ is the vector of known function values: the subscript 1 indexes the points we have evaluated and the subscript 2 the remaining points in the domain. If you are confused at this point, it would be a good idea to go back and look at the previous post.
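As a tiny numerical illustration of these formulas (the numbers are invented for this sketch), here is how one might condition a two-dimensional Gaussian on an observed value of its first component:

```python
import numpy as np

# Invented example: a 2D Gaussian with zero mean and strong correlation
# between its two components.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

y1 = np.array([0.5])   # observed value of the first component

# Partition the covariance into the blocks used in the formulas above.
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

# Conditional mean and covariance of the second component given the first.
mean_cond = mu[1:] + S21 @ np.linalg.inv(S11) @ (y1 - mu[:1])   # -> [0.45]
cov_cond = S22 - S21 @ np.linalg.inv(S11) @ S12                 # -> [[0.19]]
```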

If we were to use these formulas, an example of a regressed function with a squared exponential covariance matrix would look something like the plot below. (The points in black are the points we condition on.)

In the plot above I have also included the standard deviations. Where did these come from? They are simply derived from the diagonal entries of the conditional covariance matrix. This makes sense: the value of the regressed function at each point in the domain is sampled from a random variable, so the variance of the j-th random variable is just the (j, j) entry of the covariance matrix. To obtain the standard deviation we merely take the square root of that value. This gives us a standard deviation for each point in the domain.
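Putting the pieces together, here is a hedged, self-contained sketch of Gaussian process regression by conditioning, assuming a zero mean function and a squared exponential covariance. The four known points, their values, and the length scale are made-up values for illustration; the post’s own plots may use different ones.

```python
import numpy as np

# Squared exponential covariance, with an assumed length scale of 1.
def squared_exponential(x1, x2, l=1.0, sigma=1.0):
    return sigma**2 * np.exp(-((x1 - x2) ** 2) / (2 * l**2))

X_known = np.array([-2.5, -1.0, 0.5, 2.0])   # points we have already evaluated
y_known = np.array([0.8, -0.4, 1.1, 0.2])    # observed function values (made up)
X_star = np.linspace(-3, 3, 101)             # points we want to regress over

# Covariance blocks: Sigma_11 between known points, Sigma_21 between the
# domain and the known points, Sigma_22 across the domain.
S11 = squared_exponential(X_known[:, None], X_known[None, :])
S21 = squared_exponential(X_star[:, None], X_known[None, :])
S22 = squared_exponential(X_star[:, None], X_star[None, :])

# Conditional mean and covariance (the mean function is zero, so the mu terms drop out).
S11_inv = np.linalg.inv(S11 + 1e-8 * np.eye(len(X_known)))  # jitter for stability
mean_post = S21 @ S11_inv @ y_known
cov_post = S22 - S21 @ S11_inv @ S21.T

# The standard deviations are the square roots of the diagonal entries.
std_post = np.sqrt(np.clip(np.diag(cov_post), 0.0, None))
```

Here mean_post plays the role of the regressed (mean) function across the domain and std_post gives the per-point uncertainty, which is exactly what Bayesian optimisation needs.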

We now have a regression technique which can approximate non-linear functions, which needs only a small number of input points to produce good results, and which supplies us with variances across the domain. We are ready to move on to the final step in Bayesian optimisation!

Sources

Almost all the content in this blog was taken directly from this blog post. All I have done is focus the topic more towards what one needs to understand for Bayesian optimisation.

In researching this topic I also found another blog post which was quite helpful.

Epilogue: Covariance Functions

There is still an open question here: is the squared exponential the only covariance function you should be using? It is generally the most popular covariance function when using Gaussian process regression for Bayesian optimisation, largely because it works rather well in a low-data environment. That doesn’t make it the best covariance function for all regression applications, though. In fact, if your only objective were to regress certain data points as well as you could, there are techniques for deriving your covariance function from your data.

Epilogue: Historical Notes

Gaussian process regression has its origins in the master’s thesis of a student named Danie G. Krige. The technique was initially used in his geostatistics thesis to estimate gold grades across the Witwatersrand reef complex in South Africa. This really speaks to how effective these methods can be in a low-data environment: drilling an ore core is an expensive and difficult process. The technique not only gave great accuracy, it also allowed Krige to estimate where the best place to drill next would be in order to gain the most information. Today Gaussian process regression is also called kriging.


Wim de Villiers

Data Scientist / ML Engineer working at DataProphet. Leveraging machine learning to optimise global manufacturing.