Bayesian Optimization with Gaussian Processes Part 3

Bayesian Optimization

Wim de Villiers
9 min read · Dec 29, 2021

This post is the third article in a series on Bayesian optimization using Gaussian process regression:

1. Multivariate Gaussians

2. Gaussian Process Regression

3. Bayesian Optimization (this article)

If you have not read the first two articles I highly recommend going back and reading them before continuing.

All the code I used to generate plots in this blog series can be found in this repo.

Recap

In the first article we stated the steps that are undertaken in Bayesian optimization:

1. Randomly evaluate some points across the optimization domain.

2. Use these evaluations to regress a function across the domain. In this article series we will, at times, refer to this regressed function as the mean function.

3. Calculate the uncertainties / variances / standard deviations associated with your regressed function across the optimization domain.

4. Use these uncertainties and the mean function to evaluate which point in the domain is most likely to move us towards the desired optimum.

5. Evaluate at this point and add the evaluation to the set of all evaluated points.

6. Return to step 2 and repeat.

These steps can be summarised in the following image:

Steps to Bayesian Optimization

In the previous article we managed to get as far as step 3. We now have a regressed mean function and variances / uncertainties across the domain. The next step is to input all the data we have available (mean function, variances and sample points) into a function and have it return the next evaluation point. This function is called the acquisition function. There are many popular acquisition functions which you can use. In this article we will look at three: expected improvement, probability of improvement and lower confidence bound. Before we get to that though we should revise one function in particular: the cumulative distribution function of the normal distribution.

Φ(x) and the Error Function

Two of the acquisition functions we will be looking at rely on the cumulative distribution function of the normal distribution (Φ). This function is closely related to the error function:

Φ(x) = ½ (1 + erf(x / √2))

Φ and the error function

This equivalence is handy as Python has built-in support for the error function.

from math import erf

This means that if we ever want to evaluate Φ we can just use the above relationship and Python’s built-in math function erf. A plot of the Φ function can be seen below.
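As a quick sketch of how this looks in code (a minimal helper written for this post, not from the repo above, with the mean and standard deviation as optional parameters):

from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF of a normal distribution, built from erf via the relationship above
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

print(normal_cdf(0.0))   # 0.5 by symmetry
print(normal_cdf(1.96))  # ~0.975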

Acquisition Functions

Once we have a mean function, some samples and variances the next question is how we can combine these things in order to determine where to look next. Well, the mean function and the variances give us a distribution of functions. It would be quite natural therefore to ask: is there an expectation value which, when maximised, will give us a point in the domain which is likely to have a lower value than our current lowest sample? Turns out: Yes! Consider the below function, where f_min is the minimum value we have achieved in our sampled points and f(x) is the unknown function value at x, distributed according to our regressed posterior. (Please note the difference between μ(x) and u(x): u(x) denotes the utility function, which is the technical name for the function below, while μ(x) denotes the mean function, which is our regressed function.)

u(x) = 1 if f(x) < f_min, and u(x) = 0 otherwise

This function will be 1 wherever the function value falls below our minimum sample point and 0 wherever it doesn’t. Now we know the mean function and the variances across the domain, so we know the distribution of f(x) at every point. This allows us to calculate the expectation value of the above function across the domain. The result of this calculation is the Probability of Improvement. The formula can be seen below:

PI(x) = Φ( (f_min − μ(x)) / σ(x) )

Probability of Improvement

The x value at which the above function obtains a maximum is the point in the domain where we have the highest probability of improving on our current sample’s minimum value.
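As an illustrative sketch (these helper names are my own, not from the repo above), PI at a single point can be computed from the regressed mean and standard deviation like this:

from math import erf, sqrt

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_min):
    # PI(x) = Φ((f_min − μ(x)) / σ(x)) for a minimization problem
    if sigma == 0.0:
        return 1.0 if mu < f_min else 0.0
    return normal_cdf((f_min - mu) / sigma)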

This is a good start but we can do better. Our mean function is our estimate of the unknown function. Knowing it doesn’t just tell us when our estimate is below the lowest sampled point; it also tells us how much lower our estimate is than our lowest sampled point. Rather than calculating the expectation value of the function u(x) above, why don’t we rather calculate the expectation value of the below:

u(x) = f_min − f(x) if f(x) < f_min, and u(x) = 0 otherwise

The expectation value of the above is known as the Expected Improvement. The formula for the Expected Improvement can be seen below:

EI(x) = (f_min − μ(x)) Φ(z) + σ(x) φ(z),  where z = (f_min − μ(x)) / σ(x)

Expected Improvement

Here φ (lowercase phi) is the probability density function of the standard normal distribution. The x value where the above function achieves its maximum is the point in the domain where we expect the greatest improvement relative to our sample points.
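Again as a sketch under the same caveats as before, EI at a single point:

from math import erf, exp, pi, sqrt

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_pdf(z):
    # standard normal probability density function, i.e. φ(z)
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def expected_improvement(mu, sigma, f_min):
    # EI(x) = (f_min − μ(x)) Φ(z) + σ(x) φ(z), with z = (f_min − μ(x)) / σ(x)
    if sigma == 0.0:
        return max(f_min - mu, 0.0)
    z = (f_min - mu) / sigma
    return (f_min - mu) * normal_cdf(z) + sigma * normal_pdf(z)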

Finally what if you are really lazy and don’t want to evaluate expectation values? Well then we can just use the Lower Confidence Bound:

LCB(x) = μ(x) − κ σ(x)

Lower Confidence Bound

In the above κ is a tunable parameter which, if increased, will force the acquisition function to weigh areas of high variance more than areas of low variance.
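In code LCB is almost trivially simple (again a sketch of my own, not from the repo):

def lower_confidence_bound(mu, sigma, kappa=2.0):
    # LCB(x) = μ(x) − κσ(x); we evaluate next wherever this is lowest
    return mu - kappa * sigma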

That is a lot of information. Let’s look at what these functions look like in practice:

Varying Acquisition Functions

Let’s start with the bottom: LCB. What you can easily see from the plot is that LCB is just a scaled version of the lower bound of the shaded region in the first plot. In fact, in the plots above the lower bound of the shaded region and the LCB are identical as we have set κ=2. LCB doesn’t do anything smart: the point it chooses to evaluate next is chosen solely because there is high variance there. It will not always be the case that a point of high variance is chosen, but you can see how a simple linear combination of the mean function and the variance has its drawbacks.

The probability of improvement: PI. Here we see that the function has a huge bias towards points in the neighbourhood of the minimum observed sample point. This makes sense when looking at the formula: the variances in that neighbourhood are vanishingly small and f_min − μ(x) will evaluate to a positive number there too. In technical parlance we say that the probability of improvement overvalues exploitation.

The expected improvement: EI. EI seems to perform the best. In general it is one of the most widely used acquisition functions in Bayesian optimization. This makes sense as it takes into account the value of the regressed mean function whilst not overvaluing the variance. Further, unlike LCB, there is a much stronger signal in the domain for points which are likely to improve on our current samples.

Exploration vs. Exploitation

In the previous plots we saw how some algorithms valued exploitation (looking at points close to our lowest valued sample) and others preferred exploration (valuing points of high variance). The plots showed how EI struck a nice balance between the two. Finally we saw that LCB has a tunable parameter which allows us to determine how much to value exploitation versus exploration.

The thing I didn’t state was that all the acquisition functions have a tunable parameter. I just left the parameter out for EI and PI as it made the maths less confusing. This tunable parameter behaves the same for all acquisition functions: increasing it increases the tendency towards exploration.

Finally, what if you wanted to initially prioritise exploration and thereafter prioritise exploitation? This is a perfectly valid strategy and is often employed. We would call this creating a schedule for κ. The schedule that would reflect this would use a high value for κ early on in the optimization and slowly decrease it towards zero as we progressively add more sample points to our sample set, as in the sketch below.
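A minimal sketch of such a schedule (the starting value and decay rate here are made up and would need tuning per problem):

def kappa_schedule(iteration, kappa_start=5.0, decay=0.9):
    # exponentially decay κ towards zero: explore early, exploit late
    return kappa_start * decay ** iteration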

Variations

There we have it. A full explanation of Bayesian optimization using Gaussian processes. But Gaussian processes are not the only kind of regression one can use for Bayesian optimization. Any form of statistical regression which can both regress a function and produce uncertainties will do. So are Gaussian processes the best choice? Not really. They are just really neat and I couldn’t resist using them in this explanation. They are also a great introduction to stochastic processes in general. Their major drawback is that they assume a continuous domain. When doing hyperparameter optimization for any machine learning model there will most likely be categorical variables you want to optimize over (for example the activation functions of a neural network layer or the splitting criterion of a tree). As these are not continuous variables we cannot use Gaussian processes to regress them.

Luckily there are many other methods which can generate distributions over categorical variables. One of these is the Tree-structured Parzen Estimator (TPE). It is used in Sequential Model-Based Optimization (SMBO), which is a form of Bayesian optimization, and powers the Hyperopt Python library.

Python Libraries

The Python library I would recommend for optimization above all others is Ray Tune. It is a meta-library: it provides a unified interface to all the most popular Python optimization libraries, including bandit-based methods and Bayesian optimization methods. Further, it allows a user to deploy the optimization routine to a CPU, a GPU or multiple nodes of either. It is fantastic and I will be making a repo public which uses it to optimise a deep learning model. Watch out for the link.

If you want to use a library which closely mimics the optimization procedure we have gone over today you could look into dragonfly-opt or scikit-optimize’s gp_minimize. In fact scikit-optimize has a fantastic array of optimization procedures you can choose from.
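As a taste of the interface, a minimal gp_minimize call on a toy one-dimensional objective might look like this (the objective here is just a stand-in for your expensive black-box function):

from skopt import gp_minimize

def objective(params):
    # toy objective with its minimum at x = 2
    x = params[0]
    return (x - 2.0) ** 2

result = gp_minimize(
    objective,
    dimensions=[(-10.0, 10.0)],  # the optimization domain
    acq_func="EI",               # the acquisition function discussed above
    n_calls=25,
    random_state=0,
)
print(result.x, result.fun)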

If you want something which can handle categorical variables and intelligently checkpoint your results, hyperopt is always a good library.

Hyperopt, scikit-optimize and dragonfly are all available through Ray Tune.

That all being said, if you really want to optimize a machine learning model for work and you are on a deadline, nothing beats AWS SageMaker. Throwing money at a problem always yields results. Give Jeff his due and get stuff done. Words to live by.

Caveats

There are two caveats when using Bayesian optimization.

1. Although regressing a mean function is very cheap it can become more and more expensive the more sample points you add to your evaluated set. This can easily be seen by looking at the following formulas again:

μ* = K*ᵀ K⁻¹ y (conditional mean)

Σ* = K** − K*ᵀ K⁻¹ K* (conditional covariance)

where K is the covariance matrix of the sample points, K* the covariance between the sample points and the query points, and K** the covariance between the query points.

What you can see is that we need to invert the covariance matrix associated with all our sample points each time we want to regress. This means that if we have evaluated 10,000 points we need to invert a 10,000 × 10,000 matrix, and matrix inversion scales cubically in the number of points.
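A quick back-of-the-envelope timing with NumPy illustrates the scaling (absolute numbers are machine dependent; the point is the growth rate):

import time
import numpy as np

rng = np.random.default_rng(0)
for n in (500, 1000, 2000):
    A = rng.standard_normal((n, n))
    K = A @ A.T + n * np.eye(n)  # symmetric positive definite, like a kernel matrix
    start = time.perf_counter()
    np.linalg.inv(K)
    print(n, time.perf_counter() - start)  # roughly 8x slower per doubling, i.e. O(n³)

In practice GP libraries use a Cholesky decomposition and triangular solves rather than an explicit inverse, but the cubic scaling remains.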

2. Check that your convergence is sensible. At the end of each evaluation in your hyperparameter search / Bayesian optimization routine make sure your program dumps the hyperparameters to file. Once your optimization routine is complete make sure that all the best performing results came from similar hyperparameters. This is especially important when optimising models. You can often “get lucky” in training and have a neural net which trains well but has bad hyperparameters. This model can end up being your best model but you will never be able to recreate its performance. By checking the distribution of the hyperparameters for your top n models you should be able to spot and avoid this. Some tools for helping with this are TensorBoard’s HParams dashboard or Facebook’s HiPlot.
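A sketch of that sanity check (assuming a hypothetical trials.csv with one row per trial, columns for each hyperparameter and the final validation loss):

import pandas as pd

results = pd.read_csv("trials.csv")  # hypothetical dump of all trials
top_n = results.nsmallest(10, "val_loss")

# the best trials should cluster around similar hyperparameter values;
# a lone outlier with a great loss but very different hyperparameters
# is likely a "lucky" run you will not be able to reproduce
print(top_n.describe())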


Wim de Villiers

Data Scientist / ML Engineer working at DataProphet. Leveraging machine learning to optimise global manufacturing.