When do I need to use Bayesian optimization? (BO series I)

Eduardo C. Garrido Merchán
4 min read · Aug 22, 2022


Bayesian optimization [1] is a class of methods that has been used in recent years, with state-of-the-art results, to optimize the estimated generalization error of machine learning algorithms with respect to their hyper-parameters. A clear example of its application is the optimization of AlphaGo by Google [2].

Driven by the amazing results of Bayesian optimization in practice, the community has developed a certain hype around it. As a consequence, Bayesian optimization is often applied to problems it is not suited for, and it ends up failing. The next step for practitioners is trying to understand how Bayesian optimization works, which is a source of true despair for those who do not have a mathematical background. Believe me, truly understanding all the features of Bayesian optimization is difficult. However, I will try to teach you all the details in short entries such as this one.

As I believe that Bayesian optimization is a really nice technique, I will publish a series of posts such as this one (the BO series) where I will explain how Bayesian optimization works and how we can easily apply it to a problem.

In particular, the first important thing is to answer a simple question: when do I need to use Bayesian optimization? The answer is surprising, because if we consider all possible global optimization scenarios, it is not very likely that Bayesian optimization is the right tool for your problem.

Let us begin with the basics. Bayesian optimization is a class of methods for the global optimization of black-box functions (x* = argmin f(x), with f: R^n → R) over a set of variables. An example is finding the optimal number of employees to hire given the total budget of the company, the sector, the cost per employee, fixed costs, the number of customers, and so on. I will cover in other posts some amazing scenarios where Bayesian optimization may be used. (It is obviously not only fit for machine learning!)
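
To make this concrete, here is a minimal sketch of what calling an off-the-shelf Bayesian optimization routine looks like. It uses the scikit-optimize library (one of several available implementations); the toy objective and its bounds are my own illustration and not part of any real problem. In a real use case, black_box would be an expensive experiment such as training a model or baking a cookie.

```python
# A minimal sketch (assuming scikit-optimize is installed) of calling an
# off-the-shelf Bayesian optimization routine on a toy black-box.
from skopt import gp_minimize

def black_box(x):
    # Stand-in for an expensive evaluation (training a model, baking a
    # cookie, ...). BO only sees the inputs it proposes and the values
    # returned; no gradients, no analytical expression.
    return (x[0] - 0.3) ** 2 + 0.1 * (x[1] + 1.0) ** 2

result = gp_minimize(
    black_box,                    # objective to minimize
    dimensions=[(-2.0, 2.0),      # bounds of the first variable
                (-2.0, 2.0)],     # bounds of the second variable
    n_calls=20,                   # total budget of (expensive) evaluations
    random_state=0,
)

print("best point found:", result.x)
print("best value found:", result.fun)
```

Under the hood, the routine fits a probabilistic surrogate model to the evaluations gathered so far and uses it to decide where to evaluate next; we will cover that machinery later in the series.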

But what exactly is a black-box in the BO literature? Basically, Bayesian optimization needs f(x) to satisfy these assumptions:

  1. f(x) does not have an analytical expression. If you had an analytical expression, you would be able to compute gradients and apply other optimization algorithms from the literature! For example, Bayesian optimization can be applied to optimize the flavour of a cookie, or of a particular recipe, with respect to its ingredients [3]. Obviously, we do not have an analytical expression for making the best cookies!
  2. f(x) is smooth and continuous. What does this mean exactly? Basically, we expect to obtain a more similar result if we vary the parameters of f(x) only a little than if we vary them a lot. For example, if we just add a little more sugar to the cookie, its taste will change slightly. However, if we add a new ingredient, it will change dramatically. As we will see in further posts, this feature of the black-box is the key that makes Bayesian optimization work.
  3. f(x) is expensive to evaluate. For example, suppose we want to optimize the estimated generalization error of a deep neural network with respect to its hyper-parameters, such as the number of layers or neurons, the activation functions, the regularization mechanisms, the type of layers, the learning rate, the momentum and others. Every time we fit the neural network to a huge dataset, we demand a lot of computational time. In other words, we cannot afford to try many different hyper-parameter configurations, as every configuration is very expensive to evaluate. This is the main difference between Bayesian optimization and meta-heuristics like genetic algorithms: in the case of genetic algorithms, f(x) is assumed to be cheap to evaluate.
  4. f(x) can be noisy. But what exactly is a noisy function? Easy: it is a non-deterministic, or stochastic, function, where for the same set of parameter values we obtain a different but similar value. That is, the observation y = f(x) is corrupted by noise. If this noise is Gaussian, e ~ N(0, sigma^2), making y = f(x) + e, then adding a Gaussian likelihood to the probabilistic surrogate model of Bayesian optimization makes the technique work well. But I will explain that in another post of the BO series. f(x) can also be deterministic; being able to handle this noise is simply a cool ability of Bayesian optimization. For example, the same cookie can get a different evaluation even from the same person. (I sketch such a noisy evaluation in code right after this list.)
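
To illustrate this last point, below is a small toy sketch (my own illustration, not taken from any real experiment) of a noisy black-box: evaluating it several times at the same input gives different but similar values, because a Gaussian noise term corrupts the underlying function.

```python
# Toy sketch of a noisy black-box: the observation y = f(x) + e is
# corrupted by Gaussian noise e ~ N(0, sigma^2), so repeated evaluations
# at the same x give different but similar values.
import numpy as np

rng = np.random.default_rng(0)

def noisy_black_box(x, sigma=0.05):
    f = np.sin(3.0 * x) + x ** 2     # the underlying (hidden) function
    e = rng.normal(0.0, sigma)       # Gaussian observation noise
    return f + e                     # we only ever observe y, never f

# Three evaluations at the same point differ only slightly:
print([round(noisy_black_box(0.5), 3) for _ in range(3)])
```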

Do you want to know more about Bayesian optimization? Stay tuned! And leave in the comments any particular doubts you have about this cool class of methods.

[1] Garrido Merchán, E. C. (2021). Advanced methods for Bayesian optimization in complex scenarios.

[2] Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrittwieser, J., Silver, D., & de Freitas, N. (2018). Bayesian Optimization in AlphaGo. arXiv preprint arXiv:1812.06855.

[3] Garrido-Merchán, E. C., & Albarca-Molina, A. (2018). Suggesting Cooking Recipes through Simulation and Bayesian Optimization. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 277–284). Springer, Cham.


Eduardo C. Garrido Merchán

PhD in Machine Learning. Assistant Professor of Statistics, Econometrics and Machine Learning at Universidad Pontificia Comillas.