Implementation of Gaussian Processes based Bayesian Optimization

Juan Redondo
3 min read · Jul 22, 2019

--

During these weeks I have been working mainly on the implementation of Gaussian Process based Bayesian Optimization using the scikit-optimize package (also known as skopt).

Regarding the hyperparameter space, this implementation only supports the uniform, quniform, loguniform and choice options, while the Tree Parzen Estimator based Bayesian Optimization also supports qloguniform, normal, qnormal, lognormal, qlognormal and, most importantly, conditional search spaces. Thus, this new optimization method doesn't allow the creation of nested conditional spaces, which come in handy when some combinations of hyperparameters are invalid; for example, when we are optimizing the number of layers in our neural network and the per-layer hyperparameters only make sense for the layers that actually exist.
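As a rough illustration, the options that are supported map quite naturally onto skopt's dimension classes. The hyperparameter names and ranges below are made up for the example, not taken from my actual configuration.

```python
from skopt.space import Real, Integer, Categorical

# Hypothetical neural network search space; names and ranges are illustrative only.
search_space = [
    Real(0.0, 0.5, name="dropout"),                               # uniform
    Integer(32, 512, name="hidden_units"),                        # integer-valued (quniform-like)
    Real(1e-5, 1e-1, prior="log-uniform", name="learning_rate"),  # loguniform
    Categorical(["relu", "tanh", "selu"], name="activation"),     # choice
]
```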

However, it is not all bad news with this new implementation, since it provides several optimization algorithms, acquisition functions and acquisition optimizers to try when optimizing our neural network.

Optimization algorithms:

  • Bayesian Optimization using Gaussian Processes: the expensive objective function is modelled by a Gaussian Process; in other words, the function values are assumed to follow a multivariate Gaussian distribution, and their covariance is given by a Gaussian Process kernel evaluated between the hyperparameters. Then the next hyperparameter combination to try is chosen by the acquisition function over the Gaussian prior.
  • A random forest regressor, an extra trees regressor and gradient boosted trees can also be used as the surrogate model of the objective function. They provide predictions on many hyperparameter sets, and the next point to try is then chosen based on the acquisition function. A sketch of both approaches follows this list.
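As a minimal sketch of both options, the snippet below minimizes a toy objective (standing in for an expensive training run) once with a Gaussian Process surrogate and once with a random forest surrogate; the objective and the search space are invented for the example.

```python
from skopt import gp_minimize, forest_minimize
from skopt.space import Real, Integer

# Toy stand-in for an expensive objective such as a validation loss.
def objective(params):
    learning_rate, hidden_units = params
    return (learning_rate - 0.01) ** 2 + 1e-6 * (hidden_units - 128) ** 2

space = [Real(1e-5, 1e-1, prior="log-uniform"), Integer(32, 512)]

# Gaussian Process surrogate.
gp_result = gp_minimize(objective, space, n_calls=30, random_state=0)

# Tree-based surrogate: "RF" for random forest, "ET" for extra trees;
# gradient boosted trees are available through gbrt_minimize.
rf_result = forest_minimize(objective, space, base_estimator="RF",
                            n_calls=30, random_state=0)

print(gp_result.x, gp_result.fun)
print(rf_result.x, rf_result.fun)
```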

Acquisition functions:

The acquisition function is used to decide which hyperparameter combination is the most promising to try next according to the surrogate model; a small example of choosing it is sketched after the list below.

  • LCB: Lower confidence bound. By tuning the kappa parameter we can balance the exploration-exploitation trade-off. The higher this value is, the more we are favouring exploration over exploitation, and vice versa.
  • EI: negative expected improvement, and PI: negative probability of improvement. By tweaking the xi parameter we can control how much improvement we want over the previous best values. The higher this value is, the larger the improvement we ask for.
  • gp_hedge: probabilistically choose one of the above three acquisition functions at every iteration.
  • EIps: negated expected improvement per second, and PIps: negated probability of improvement per second. By using these two acquisition functions we can take the objective function's compute time into account.
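For illustration, the acquisition function and its parameters can be passed directly to gp_minimize; the objective below is just a placeholder.

```python
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    (x,) = params
    return (x - 0.3) ** 2  # placeholder objective

space = [Real(0.0, 1.0)]

# LCB with a large kappa favours exploration over exploitation.
lcb_result = gp_minimize(objective, space, n_calls=25, acq_func="LCB", kappa=5.0)

# EI with a larger xi asks for a bigger improvement over the current best.
ei_result = gp_minimize(objective, space, n_calls=25, acq_func="EI", xi=0.05)
```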

Acquisition optimizer

We can set the method used to minimize the acquisition function in order to get the next hyperparameter set to try. The acquisition function is computed at a certain number of points sampled randomly from the search space; a sketch follows the list below.

  • sampling: the sampled point at which the acquisition function is minimum is selected.
  • lbfgs: this method takes some number of the best randomly sampled points and, using these as initial points, runs L-BFGS-B for 20 iterations to find local minima; finally, the best of these local minima is used to update the prior.
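As a small sketch, the acquisition optimizer is just another argument of the Optimizer class (the search space below is a placeholder):

```python
from skopt import Optimizer
from skopt.space import Real

space = [Real(0.0, 1.0)]

# Pick the best of the randomly sampled candidates directly ...
opt_sampling = Optimizer(space, base_estimator="GP", acq_optimizer="sampling")

# ... or refine the best candidates with L-BFGS-B before choosing.
opt_lbfgs = Optimizer(space, base_estimator="GP", acq_optimizer="lbfgs")
```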

In order to enable parallel optimization I have used the ask-and-tell interface provided by skopt, which allows us to ask the library for a suggestion of the point at which to evaluate the objective using the ask() method of the Optimizer class and to report the value back using the tell() method. This way we can obtain several promising hyperparameter sets at the same time, thus allowing parallel optimization when used in conjunction with Ray, a package for building and running distributed applications.
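The sketch below shows roughly how this fits together; the objective is a toy stand-in and the exact Ray wiring in my code differs, so take it as an illustration of the ask/tell pattern rather than the actual implementation.

```python
import ray
from skopt import Optimizer
from skopt.space import Real, Integer

ray.init()

@ray.remote
def evaluate(params):
    learning_rate, hidden_units = params
    # Toy stand-in for an expensive training run.
    return (learning_rate - 0.01) ** 2 + 1e-6 * (hidden_units - 128) ** 2

opt = Optimizer(
    [Real(1e-5, 1e-1, prior="log-uniform"), Integer(32, 512)],
    base_estimator="GP",
)

for _ in range(10):
    # Ask for several promising points at once ...
    points = opt.ask(n_points=4)
    # ... evaluate them in parallel as Ray tasks ...
    values = ray.get([evaluate.remote(p) for p in points])
    # ... and report the results back to update the surrogate model.
    opt.tell(points, values)

print(min(opt.yi))
```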

It seems to me that the Ray library has been a huge discovery, since it allows us to specify a custom scheduling algorithm that can stop trials early or perturb parameters, and it includes implementations of early stopping algorithms such as the Median Stopping Rule, HyperBand, and Population Based Training (PBT). I am sure this will be useful in the future.
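To give an idea of what that looks like, here is a hypothetical sketch of attaching one of those schedulers with Ray Tune; the trainable, its metric and the search space are all invented for the example, and I have not integrated this yet.

```python
from ray import tune
from ray.tune.schedulers import HyperBandScheduler

def trainable(config):
    # Hypothetical training loop reporting a metric at every iteration.
    for step in range(1, 101):
        score = 1.0 - 1.0 / (step * config["lr"] + 1.0)  # placeholder metric
        tune.report(score=score)

tune.run(
    trainable,
    config={"lr": tune.uniform(1e-4, 1e-1)},
    num_samples=20,
    scheduler=HyperBandScheduler(metric="score", mode="max"),
)
```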
