Bayesian Optimization
Introduction
The earliest work on Bayesian Optimization dates back to 1964, in Kushner's paper¹. It is now a very popular technique in machine learning. When optimizing an objective function f(x) that has no closed-form expression, where one can only obtain (possibly noisy) observations of f(x) at sampled values, gradient descent methods fail. One can numerically estimate the gradient, or use grid search/random search, when computing f(x) is not that expensive. However, there are cases where evaluating f at given points is expensive. Some examples of expensive evaluations:
- x is a geographic coordinate and f(x) is the amount of oil at that location;
- x is the hyper-parameter setting of a deep neural network and f is some objective function, so computing f(x) requires a long training run;
- x is a kind of drug and f(x) is its effectiveness against a disease, so each evaluation is not only time-consuming and money-consuming, but also rat-consuming 🙃.
This is when Bayesian optimization is useful.
Bayesian optimization is an iterative process for finding an optimum, and it is very good at finding global extrema with a minimal number of trials, which is an advantage over grid search or random search. In this article, we will introduce Bayesian Optimization in the following order:
- What is Bayesian Optimization?
- Surrogate function: Gaussian Process (GP) to approximate the original objective function
- Acquisition function: a guidance for sampling positions
- Illustration of the Optimization iterative process
What is Bayesian Optimization?
Bayesian Optimization is an algorithm for finding the extrema of an objective function, especially when the function is expensive to evaluate. The main idea of Bayesian optimization is to first use a surrogate function to approximate the target function f based on the points sampled so far. The surrogate function should be able to model arbitrarily complex functions while remaining cheap to evaluate. An acquisition function then estimates, under the current surrogate model, how promising each candidate point is for the optimization, and the next sampled point is where the acquisition function is maximized.
It is called Bayesian because Bayesian Optimization uses the famous “Bayes’ theorem”, which states that the posterior probability of a model M given evidence (or data, or observations) E is proportional to the likelihood of E given M multiplied by the prior probability of M:

P(M | E) ∝ P(E | M) P(M)
In Bayesian optimization, the prior P(f) represents our belief about the space of possible objective functions. Together with the observed data

D_{1:t} = {(x_1, f(x_1)), …, (x_t, f(x_t))},

one obtains the posterior distribution:

P(f | D_{1:t}) ∝ P(D_{1:t} | f) P(f)
The posterior distribution reflects our updated belief about the unknown function f. It is also called the surrogate function, and it acts as an approximation of the true objective function f. As for the prior, a Gaussian Process (GP) is well-suited to a large family of common optimization tasks. The blog post² by Martin Krasser is a good resource on Gaussian Processes. A Gaussian Process is completely specified by its mean function and covariance function. One intuition is that a GP is analogous to a function, but instead of returning a scalar at x, it returns a normal random variable whose mean approximates f(x). Note that the surrogate is therefore not just a deterministic function: it also quantifies the uncertainty of its approximation of the true objective.
The second question is how to effectively choose the next point at which to evaluate the function f. We will introduce a kind of function called an acquisition function, which is used as guidance to determine the next location to sample. It is defined so that high acquisition corresponds to potentially high values of the objective function (if the goal is to find a maximum), either because the prediction is high, the uncertainty is great, or both. When the search is in a region with high uncertainty, it is exploration; when the search is in a region with a high estimated value (supposing we are looking for the maximum of the objective function), it is exploitation. One should pay attention to the trade-off between exploration and exploitation when determining the next sampling point. The next sampling point is where the acquisition is maximized:

x_t = argmax_x u(x | D_{1:t−1}),

where u is the generic symbol for an acquisition function and D_{1:t−1} is the data observed so far.
Let’s put everything together.
Bayesian Optimization
For t = 1, 2, …, do
- Find the next sampling point x_t by optimizing the acquisition function over the Gaussian Process.
- Sample the objective function at x_t.
- Augment the observations with (x_t, f(x_t)) and update the Gaussian Process.
end for
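The loop above can be sketched in a few lines of NumPy. Everything concrete here is an illustrative assumption rather than a prescribed implementation: the function names are my own, the RBF kernel length scale is fixed by hand, the acquisition (UCB, introduced later in this article) is maximized by brute force over a dense 1-D grid, and the GP is simply refit from scratch each iteration.

```python
import numpy as np

def bayes_opt(f, bounds, n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Minimal Bayesian optimization sketch: a zero-mean GP surrogate with an
    RBF kernel plus a UCB acquisition, maximized over a dense 1-D grid."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=n_init)            # initial random samples
    y = np.array([f(x) for x in X])
    grid = np.linspace(lo, hi, 500)                 # candidate points

    def kernel(a, b, length_scale=0.5):
        d2 = (a.reshape(-1, 1) - b.reshape(1, -1)) ** 2
        return np.exp(-0.5 * d2 / length_scale**2)

    for _ in range(n_iter):
        K = kernel(X, X) + 1e-6 * np.eye(len(X))    # jitter for stability
        k_s = kernel(X, grid)
        alpha = np.linalg.solve(K, y)
        mu = k_s.T @ alpha                          # GP posterior mean on the grid
        v = np.linalg.solve(K, k_s)
        var = 1.0 - np.sum(k_s * v, axis=0)         # posterior variance (prior var = 1)
        sigma = np.sqrt(np.clip(var, 0.0, None))
        ucb = mu + kappa * sigma                    # acquisition function
        x_next = grid[np.argmax(ucb)]               # 1. maximize the acquisition
        y_next = f(x_next)                          # 2. sample the objective
        X = np.append(X, x_next)                    # 3. augment the observations;
        y = np.append(y, y_next)                    #    the GP is refit next iteration
    return X[np.argmax(y)], y.max()
```

For example, maximizing f(x) = −(x − 2)² over [0, 4] with this sketch quickly locates a point close to x = 2.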
Surrogate function: Gaussian Process (GP) to approximate the original objective function
From Bayes' formula and the assumption of a Gaussian prior, the posterior distribution is still a Gaussian Process². It is an approximation to the original objective function together with an uncertainty estimate; refer to Figure 1.
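As a concrete sketch of that posterior computation, the snippet below implements the standard zero-mean GP regression equations with a squared-exponential kernel. The function names and the fixed hyper-parameters (length scale, signal variance, noise jitter) are illustrative assumptions, not part of the original article:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Squared-exponential covariance between two sets of 1-D points."""
    d2 = (X1.reshape(-1, 1) - X2.reshape(1, -1)) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_s, X_train, y_train, noise=1e-8):
    """Posterior mean and covariance of a zero-mean GP at test points X_s,
    conditioned on observations (X_train, y_train)."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_s)
    K_ss = rbf_kernel(X_s, X_s)
    alpha = np.linalg.solve(K, y_train)
    mu = K_s.T @ alpha                   # posterior mean
    v = np.linalg.solve(K, K_s)
    cov = K_ss - K_s.T @ v               # posterior covariance
    return mu, cov
```

At the training inputs the posterior mean reproduces the observations and the posterior variance collapses toward the noise level, which is exactly the pinched-uncertainty behaviour visible in GP plots like Figure 1.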
Acquisition function: a guidance for sampling positions
There are many options for the acquisition function. Popular options are maximum probability of improvement (MPI), expected improvement (EI) and upper confidence bound (UCB). We will introduce them one by one below.
Maximum Probability of Improvement (MPI)
Maximum probability of improvement selects the point that maximizes the probability that the surrogate value is no less than the current maximum value. Since we know the distribution of the surrogate at any point is a normal distribution, the acquisition function is

MPI(x) = P(f̂(x) ≥ f* + ε) = Φ((µ(x) − f* − ε) / σ(x))

In the formula above, f* is the current maximal value of the objective function, f̂ is the surrogate function with mean µ and standard deviation σ, and Φ is the CDF of the standard normal distribution.
The parameter ε is a trade-off parameter. Without it, the criterion is pure exploitation: points that have a high probability of being infinitesimally greater than f* will be chosen over points that offer larger gains but less certainty.
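A minimal implementation of this criterion, assuming we already have the GP posterior mean and standard deviation at the candidate points (the function name and the `xi` parameter name are my own; `xi` plays the role of ε):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """MPI acquisition: P(surrogate(x) >= f* + epsilon) under the GP posterior.
    mu, sigma: posterior mean and standard deviation at candidate points."""
    sigma = np.maximum(sigma, 1e-12)   # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return norm.cdf(z)
```

With ε = 0, a point whose posterior mean equals f* has probability of improvement exactly 0.5, and points whose mean exceeds f* by several standard deviations approach probability 1.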
Expected Improvement (EI)
The drawback of MPI is that it only considers the probability of improvement, without taking into account the magnitude of the improvement a point can potentially yield. EI takes both into account. The improvement function is defined as

I(x) = max(f̂(x) − f*, 0),

so I(x) is the positive part of how much the prediction is higher than the best value known so far. The next sampling point is

x_{t+1} = argmax_x E[I(x) | D_{1:t}]
Since the posterior distribution at x is normal with mean µ(x) and variance σ²(x), one can calculate the expectation of the improvement function in closed form, and it turns out to be:

EI(x) = (µ(x) − f*) Φ(Z) + σ(x) φ(Z)  if σ(x) > 0,  and EI(x) = 0 if σ(x) = 0,

where

Z = (µ(x) − f*) / σ(x)

and Φ and φ are the CDF and PDF of the standard normal distribution.
Similar to maximum probability of improvement, we can incorporate a parameter ε for the exploration-exploitation trade-off: when exploring, we should choose points where the surrogate variance is large; when exploiting, we should choose points where the surrogate mean is high. In the formula above, every occurrence of f* is replaced by f* + ε.
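The closed-form expression translates directly into code. As before, `mu` and `sigma` are the GP posterior mean and standard deviation at the candidate points, and `xi` stands in for the ε trade-off parameter (the names are my own):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI acquisition in closed form, with the epsilon (xi) trade-off term."""
    sigma = np.asarray(sigma, dtype=float)
    improvement = mu - f_best - xi                  # mean improvement over f* + epsilon
    z = improvement / np.maximum(sigma, 1e-12)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)             # EI = 0 where there is no uncertainty
```

Note that at a point whose posterior mean equals f* (with ε = 0), MPI gives 0.5 regardless of σ, while EI still grows with σ: the expected gain there is σ·φ(0) ≈ 0.399·σ, so EI rewards uncertainty even when the mean offers no improvement.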
Upper Confidence Bound (UCB)
The last popular acquisition function is the upper/lower confidence bound, which (for maximization) is defined as below:

UCB(x) = µ(x) + k σ(x)

Users have the freedom to tune the parameter k in the UCB acquisition function: a larger k puts more weight on the uncertainty term and encourages exploration, while a smaller k favours exploitation.
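UCB is the simplest of the three to implement. In the sketch below (function and parameter names are my own), `kappa` plays the role of the parameter k:

```python
import numpy as np

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB acquisition: posterior mean plus kappa posterior standard deviations.
    Larger kappa favours exploration; smaller kappa favours exploitation."""
    return np.asarray(mu) + kappa * np.asarray(sigma)
```

For instance, a point with posterior mean 1.0 and standard deviation 0.5 scores 2.0 under UCB with kappa = 2.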
Illustration of the Optimization process
Note: The code to generate Figure 4 is from Martin Krasser.
Reference
[1] Kushner, Harold J. “A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise.” Journal of Basic Engineering 86.1 (1964): 97–106.
[2] Martin Krasser. “Gaussian Process”. http://krasserm.github.io/2018/03/19/gaussian-processes/
[3] Overview of Bayesian Optimization. https://soubhikbarari.github.io/blog/2016/09/14/overview-of-bayesian-optimization
[4] Gaussian Process https://en.wikipedia.org/wiki/Gaussian_process
[5] Jones, D.R., Schonlau, M. & Welch, W.J. Journal of Global Optimization (1998) 13: 455. https://doi.org/10.1023/A:1008306431147
[6] Martin Krasser. https://github.com/krasserm/bayesian-machine-learning