Bayesian Optimization with Gaussian Processes Part 1

Multivariate Gaussians

Wim de Villiers
7 min read · Dec 29, 2021

Bayesian optimization is a relatively easy concept to understand. However, whilst digging into it I struggled to find articles that struck a nice balance between too much detail and too little. This article series seeks to strike that balance.

The series is broken up into three parts:

1. Multivariate Gaussians (this article)

2. Gaussian Process Regression

3. Bayesian Optimization using Gaussian Process Regression

You might be tempted to “flip” to the end and read the section on Bayesian optimization. If you are extremely familiar with statistical regression techniques you could potentially do so. However, the first two sections introduce a method of regression which is simple, neat and will allow us to understand every step in the process of Bayesian optimization for a certain regression case. I would strongly encourage you to read all sections.

So grab a tea, set aside a portion of the day and let’s get started.

All code to generate the plots below can be found in this repo. This blog post heavily influenced the current one.

Bayesian Optimization: A Quick Overview

The popularity of Bayesian optimization can be attributed to the fact that, as an optimization framework, it can be applied to black-box functions, does not require the calculation of gradients and reaches an optimum with relatively few function evaluations. This makes it a great candidate for hyperparameter optimization in machine learning. The broad-strokes algorithmic steps for Bayesian optimization can be divided up as follows:

1. Randomly evaluate some points across the optimization domain.

2. Use these evaluations to regress a function across the domain. In this article series we will, at times, refer to this regressed function as the mean function.

3. Calculate the uncertainties associated with your regressed function across the optimization domain.

4. Use these uncertainties and the mean function to evaluate which point in the domain is most likely to move us towards the desired optimum.

5. Evaluate at this point and add the evaluation to the set of all evaluated points.

6. Return to step 2 and repeat.

These steps can be shown visually below:

Steps to Bayesian Optimization
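To make these steps concrete, here is a rough Python sketch of the loop. Be warned: the `regress`, `uncertainty` and `acquisition` helpers below are crude stand-ins invented purely for illustration (nearest-neighbour prediction, distance-based uncertainty and a lower confidence bound); in the rest of the series they get replaced by Gaussian process regression and a proper acquisition function.

```python
import numpy as np

def objective(x):
    # The black-box function we want to minimise (unknown in practice).
    return np.sin(3 * x) + 0.5 * x**2

def regress(X, y, domain):
    # Placeholder "mean function": predict the value of the nearest evaluated point.
    idx = np.abs(domain[:, None] - X[None, :]).argmin(axis=1)
    return y[idx]

def uncertainty(X, domain):
    # Placeholder uncertainty: grows with distance to the nearest evaluated point.
    return np.abs(domain[:, None] - X[None, :]).min(axis=1)

def acquisition(mean, sigma):
    # Placeholder acquisition: a lower confidence bound (we are minimising).
    return mean - 2.0 * sigma

rng = np.random.default_rng(0)
domain = np.linspace(-2, 2, 400)

# 1. Randomly evaluate some points across the optimization domain.
X = rng.uniform(-2, 2, size=5)
y = objective(X)

for _ in range(20):
    # 2. + 3. Regress a mean function and its uncertainties across the domain.
    mean = regress(X, y, domain)
    sigma = uncertainty(X, domain)
    # 4. Pick the most promising point according to the acquisition function.
    x_next = domain[np.argmin(acquisition(mean, sigma))]
    # 5. Evaluate it and add it to the set of evaluated points.
    X, y = np.append(X, x_next), np.append(y, objective(x_next))
    # 6. Repeat.

print("best x found:", X[np.argmin(y)], "with value", y.min())
```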

At this point you should have a few questions. Namely:

  1. How do we regress a mean function?
  2. How do we calculate uncertainties across the entire optimization domain?
  3. What is the voodoo that uses the uncertainties and the mean function in order to determine where to look next?

The first two questions are answered by the regression technique used, which in our case is Gaussian process regression. The final question is answered by the acquisition function chosen in your Bayesian optimization framework. So let’s get started with Gaussian process regression by revising multivariate Gaussians.

Multivariate Gaussians

Most of us are extremely familiar with the form of the univariate Gaussian, which is parameterized by its mean and standard deviation. It defines the probability density function for a single random variable. Multivariate Gaussians are just extensions of univariate Gaussians to higher dimensions. They represent the probability density functions of multivariate random variables. Think of a multivariate random variable as a vector of, possibly correlated, random variables.

The formula for a multivariate Gaussian can be seen below:

Formula for a Multivariate Gaussian.
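With x and μ being k-dimensional vectors and Σ a k × k covariance matrix, the density shown above is the standard multivariate normal density:

p(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)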

Let’s restrict ourselves to 2 dimensions and explore the formula a little.

μ: The mean value of the univariate Gaussian is now replaced by a mean vector, as we are in 2D.

Σ: The variance of the univariate Gaussian is now replaced by a covariance matrix. The diagonal components of the matrix depict how much each random variable varies within itself and the off-diagonal components depict the level of correlation between the random variables.

|Σ|: The determinant of the covariance matrix.

Looking at the exponentiated expression one can see that any vector in the domain can be plugged in and a scalar value obtained. This means that we can evaluate the probability density function at any point in the domain.

There is only one slightly new concept when moving from one dimension to two: the covariance matrix.

Below we can see three plots: one where all off-diagonal elements of the covariance matrix are negative, another where they are zero and finally one where they are positive. Looking at the plots, the definition of the off-diagonal elements as a measure of how correlated the random variables are makes immediate sense.

Bivariate Normal Distributions with Varying Covariance Matrices
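If you want to reproduce something like the figure above yourself, a minimal sketch using NumPy, SciPy and Matplotlib (not the exact code from the repo) could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
covariances = {
    "negative off-diagonal": np.array([[1.0, -0.8], [-0.8, 1.0]]),
    "zero off-diagonal":     np.array([[1.0,  0.0], [ 0.0, 1.0]]),
    "positive off-diagonal": np.array([[1.0,  0.8], [ 0.8, 1.0]]),
}

# Evaluate each density on a grid and draw its contours.
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack((x, y))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, cov) in zip(axes, covariances.items()):
    density = multivariate_normal(mean=mu, cov=cov).pdf(grid)
    ax.contourf(x, y, density)
    ax.set_title(title)
plt.show()
```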

There are a few important features to note about the covariance matrix:

  1. It is symmetric. The i-th random variable is as correlated to the j-th as the j-th is to the i-th. (Apologies, mathematical formatting on Medium is a nightmare.)
  2. It is positive semi-definite. This enforces the requirement that the variance of any weighted sum of the random variables is non-negative. That might sound confusing. Don’t worry too much about it, and if you want to know more there is a really easy explanation here.

Now that we know the basics of multivariate Gaussians we will explore two key concepts which will allow us to move from multivariate Gaussians to Gaussian process regression. They are really straightforward, so where necessary we will explore them both visually and symbolically.

Marginalization

This could not be simpler. If we have a 2D multivariate Gaussian over random variables x and y and we want to know the marginal distribution of the random variable x, we integrate out the y random variable as follows:

Formula for Marginalizing onto x.
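In symbols this is just the standard marginalization integral:

p(x) = \int_{-\infty}^{\infty} p(x, y) \, dy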

In the 2D case the marginal distribution p(x) is equivalent to a univariate Gaussian with mean μ[0] and variance Σ[0,0], which can be depicted graphically as follows:

Marginalized Distributions

Note that in higher dimensions we can marginalize out as many dimensions as we want. This is achieved by just dropping those dimensions from the mean vector and covariance matrix.
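As a quick numerical sanity check of this, we can sample from a made-up 2D Gaussian, throw away the y coordinate and confirm that what is left behaves like a univariate Gaussian with mean μ[0] and variance Σ[0,0]:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

# Sample from the joint 2D Gaussian, then simply ignore the y coordinate.
samples = rng.multivariate_normal(mu, cov, size=200_000)
x_samples = samples[:, 0]

# The marginal of x should have mean mu[0] and variance cov[0, 0].
print(x_samples.mean(), x_samples.var())  # approximately 1.0 and 1.0
```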

Conditioning

This concept is similar to marginalization, except that instead of integrating out some of the random variables we fix them to given values. The resulting distribution is then normalized. This is easy to depict graphically:

Conditional Distributions

To say that we cut through the distribution at a certain point and then normalise isn’t really how we go about things, though. It makes the graphical interpretation easier, but in reality we re-compute the mean and covariance matrix of the unrealised random variables based on the realised ones. We then use these to plot the new distribution.

Say we are in n dimensions. We know the realised values of the first l random variables, where l < n. We want to construct the distribution of the remaining m random variables, where l + m = n, given that the first l random variables already have realised values. How do we do that?

Let’s organise the vector of our random variables as follows:
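Using the standard block notation, with y₁ holding the l realised random variables and y₂ the m unrealised ones:

y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \qquad y_1 \in \mathbb{R}^{l}, \quad y_2 \in \mathbb{R}^{m}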

Now there is a theorem stating that the conditional distribution of a multivariate normal is again normal. Hence, if we are able to calculate the conditional mean and covariance matrix, we will have all the tools we need to describe the new conditional distribution!

At this point we could get heavily involved in some maths in order to derive conditional means and covariance matrices. I want to circumvent this for the sake of brevity and point an interested reader to the two derivations found here. I will simply state the final forms of the conditional means and covariance matrices.

We can decompose the mean vector as follows:
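In the same block notation:

\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}

where μ₁ is the mean of y₁ and μ₂ is the mean of y₂.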

And apply the same decomposition to the covariance matrix:
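That is:

\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}

where Σ₁₁ is the l × l covariance of y₁, Σ₂₂ is the m × m covariance of y₂, and Σ₁₂ = Σ₂₁ᵀ holds the cross-covariances.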

Then the conditional mean and covariance matrix can be expressed as follows:

Conditional Mean
Conditional Covariance
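These are the standard Gaussian conditioning formulas, written in the block notation above:

\mu_{2|1} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (y_1 - \mu_1)

\Sigma_{2|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}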

Where y₁ is the vector of realised values for the known random variables.
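To make this concrete, here is a small NumPy sketch that conditions a made-up 3D Gaussian on a realised value of its first random variable (so l = 1 and m = 2); the numbers are purely for illustration:

```python
import numpy as np

# A made-up 3D Gaussian: we condition on the first variable (l = 1)
# and recover the distribution of the remaining two (m = 2).
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])

l = 1
y1 = np.array([1.5])  # realised value of the first random variable

# Block decomposition of the mean vector and covariance matrix.
mu1, mu2 = mu[:l], mu[l:]
S11, S12 = Sigma[:l, :l], Sigma[:l, l:]
S21, S22 = Sigma[l:, :l], Sigma[l:, l:]

# Conditional mean and covariance of y2 given y1.
S21_S11inv = S21 @ np.linalg.inv(S11)
mu_cond = mu2 + S21_S11inv @ (y1 - mu1)
Sigma_cond = S22 - S21_S11inv @ S12

print(mu_cond)     # conditional mean of the two unrealised variables
print(Sigma_cond)  # conditional covariance matrix
```

Using np.linalg.inv here keeps the code close to the formulas; in practice you would usually prefer np.linalg.solve for numerical stability.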

Gaussian Processes

The above concepts are all we need to get started on the next post about Gaussian Process Regression. See you there!

Sources

The majority of the content for this post came from this blog post and the textbook Pattern Recognition and Machine Learning by Christopher Bishop.


Wim de Villiers

Data Scientist / ML Engineer working at DataProphet. Leveraging machine learning to optimise global manufacturing.