Comparing hyperparameter optimization frameworks in Python: a conceptual and pragmatic approach

Gerben
Apr 3, 2020


When training Machine Learning models, hyperparameter optimization (HPO for short) is a useful step in the modeling process in order to make sure the model is configured in such a way that it maximizes performance. However, it is more often than not a somewhat tedious process that can become increasingly complex and computationally expensive.


To start off, I would like to note that this article is by no means a comprehensive explanation of the many subjects that are touched upon. It leaves out a lot of complexities (math!). It is more of a representation of my learning journey as I make myself familiar with these topics. I hope this journey resonates with, and might be useful for, some readers, whom I would invite to make remarks or provide feedback for my benefit and that of other readers. Thanks.

When performing HPO, there are several methods to choose from. There are plenty of great articles and papers on the inner workings of the underlying (Bayesian) methods. While it is interesting to take a deep dive into the specifics, a critical understanding is not required to just ‘get going’ and perform HPO on your task at hand. However, I think it is useful to get — at the bare minimum — a conceptual understanding of what’s going on under the hood. The goal of this article is to simply compare some HPO frameworks that are implemented in Python in terms of ease-of-use, performance, and stability, while providing a quick glance at the conceptual mechanisms on which they are founded. As not everyone has access to AutoML procedures, hopefully this piece can be of value for anyone working in the Data Science trenches who just wants to optimize a model with a Python implementation that is either quicker or yields a better performance than some of the more conventional methods. Here are some keywords regarding the topics in order for you to decide if reading any further is time well spent.

Grid Search | Random Search | Sequential model-based optimization | Gaussian Processes | Tree-Parzen Estimators | Scikit-learn | Scikit-Optimize | Hyperopt | Optuna | Bayesian Optimization | LightGBM

About hyperparameters

Hyperparameters are those parameters of a model that are not updated during the learning procedure; they can be considered the 'configuration' of a model. Hyperparameters can be parameters that specifically concern the model itself, but may also concern its objective function, which is either to be minimized or maximized. For example: a tree-based estimator can be configured using a hyperparameter for the maximum depth of the tree, while a hyperparameter concerning the objective function could be the 'learning rate' or 'step size' with which the function is optimized. The domain from which configurations of hyperparameter values are sampled is called the search space, configuration space, sampling domain, or simply hyperparameter space. This search space can become increasingly complex, and can be comprised of continuous, categorical, and conditional hyperparameters. To add a layer of complexity: the optimization algorithm of the model itself can be a hyperparameter. This causes some hyperparameters (or just their values) to depend on the value that is selected for another hyperparameter. These are often called conditional hyperparameters, creating a conditional or tree-like hyperparameter space. Several HPO frameworks implemented in Python support methods that are able to handle such nested search spaces.
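As an illustration, here is a minimal sketch of such a tree-like space using Hyperopt's hp.choice, where the estimator type itself is a hyperparameter and the remaining hyperparameters are conditional on that choice (the estimator names and ranges below are just made-up examples):

```python
from hyperopt import hp

# A conditional (tree-like) search space: which hyperparameters are sampled
# depends on the estimator type that is chosen first.
space = hp.choice("estimator_type", [
    {
        "type": "random_forest",
        "max_depth": hp.quniform("rf_max_depth", 2, 10, 1),
        "n_estimators": hp.quniform("rf_n_estimators", 50, 500, 50),
    },
    {
        "type": "svm",
        "C": hp.loguniform("svm_C", -3, 3),
        "kernel": hp.choice("svm_kernel", ["linear", "rbf"]),
    },
])
```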

Why bother?

Excellent question. Generally, one can assume that HPO results in better model performance than the default hyperparameter configuration provided by many ML libraries (such as Scikit-Learn). However, this benefit comes at a cost: HPO can be quite computationally expensive and take a lot of time. Besides the time spent on estimation, one also has to think about which hyperparameters (and their values) might be useful to search over. The time needed to optimize increases exponentially as new hyperparameters or values are added to the search space (the curse of dimensionality). Furthermore, HPO is often performed in conjunction with cross-validation to ensure that the optimal set of hyperparameters generalizes robustly to new data, which requires even more estimation iterations. Especially with complex models that have plenty of hyperparameters, it can sometimes feel more like an art than a science. If you decide that HPO is worthwhile for your problem, you can choose a method that will, roughly speaking, fall into one of three categories:

  • Exhaustive Search
  • Sequential model-based optimization
  • Other / mixed approaches (such as evolutionary algorithms).

We will discuss and compare methods (and their implementations in various Python libraries) from the first two categories, based on multiclass classification tasks. First, we briefly describe the underlying methods to develop a healthy intuition and conceptual understanding without going too deep down the various rabbit holes. As the comparison is quite a prominent part of this piece, I kept the embedded code limited to short illustrative sketches; all the code related to this project can be found on my GitHub.

1. Exhaustive Search

• Grid Search

Grid Search is often the go-to method for HPO, and its idea is quite simple. You define a set of hyperparameters and their values, train a model for each possible combination, and subsequently select the combination with the best (cross-validated) score/performance. All hyperparameter combinations can be laid out in a Cartesian grid. However, this approach suffers from the curse of dimensionality: every value that is added to a hyperparameter multiplies the number of evaluations, so the total grows exponentially with the number of dimensions/hyperparameters in the search space. Therefore, this approach can become very computationally expensive, especially for models that have many relevant hyperparameters influencing their performance.
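As a minimal sketch of what this looks like in practice, here is Scikit-learn's GridSearchCV on a toy dataset, with a Random Forest as a stand-in estimator (the grid below is just an example, giving 3 x 3 = 9 combinations, each cross-validated):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Every combination in this Cartesian grid (3 x 3 = 9) is cross-validated.
param_grid = {
    "max_depth": [4, 8, 16],
    "n_estimators": [100, 300, 500],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="neg_log_loss",
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```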

• Random Search

A less expensive approach is Random Search, which applies random sampling to the search space so that only a limited number of models is estimated. What is different compared to Grid Search is that you have to specify a certain number of trials (i.e. a budget). Notice that the random sampling is performed w.r.t. the distributions or vectors that are provided. Random Search might be a good choice whenever the lengths of the value vectors that make up the search space differ strongly; in other words, when some hyperparameters have a lot of values to try out and others have only a few. Furthermore, it might also be suitable if it is known that some hyperparameters are of greater importance than others (1, 2). A benefit of Random Search compared to SMBO methods is that it is less prone to concentrating on a local minimum/maximum.
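A comparable sketch with Scikit-learn's RandomizedSearchCV: distributions instead of a fixed grid, and n_iter as the explicit budget (again with a Random Forest stand-in and example ranges):

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)

# Sampling distributions instead of a fixed grid; n_iter caps the budget.
param_distributions = {
    "max_depth": randint(2, 20),
    "n_estimators": randint(50, 500),
    "max_features": loguniform(0.1, 1.0),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=30,
    scoring="neg_log_loss",
    cv=3,
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```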

However, both Grid Search and Random Search have some limitations. The most prominent one is that both methods waste time evaluating hyperparameter configurations in areas of the search space that are, in reality, not that promising. This is a result of the fact that no information about the quality of each search iteration is carried over to the next. They simply try out combinations of hyperparameter values, neither tracking in which direction of the search space the loss might be gradually minimized, nor adjusting their behaviour based on this information. This is often called an uninformed search. In some cases, even manual tuning can outperform these methods, because when iterating over hyperparameter values a Data Scientist or ML Engineer is actually able to make an informed decision about the next value to try based on each iteration's output. Sequential model-based optimization (SMBO) methods account for the latter shortcoming to a certain extent: they keep track of promising areas in the search space because they build up a history of configurations and their scores, adjusting their behaviour each iteration (an informed search).

2. Sequential model-based optimization (SMBO)

SMBO is a group of methods that fall under the Bayesian Optimization paradigm. These methods use a surrogate model (probabilistic model) and an acquisition function that work together to find the best model by iteratively selecting the most promising hyperparameters in the search space to approximate the actual objective function.

This sentence contains several concepts that define the process. Each iteration, the hyperparameters selected from the search space are evaluated w.r.t. the objective function. This objective function takes in a hyperparameter configuration and outputs a scalar metric which is to be minimized or maximized. The probabilistic model is called a surrogate model/function, of which there are several types. Gaussian Processes and Tree-Parzen Estimators are used most often, but Random Forests and Gradient Boosting are not uncommon. The surrogate model outputs predictions at each iteration, which are used by an acquisition function to determine which configuration of hyperparameters to select for the next iteration. Expected Improvement (EI) is often used as a good default acquisition function. The central idea of this whole approach is that the extent to which the surrogate function is able to approximate the objective function depends on how many iterations the method is allowed to make. In simple words: it is presumed the approximation improves as the number of iterations increases.
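To make this concrete, here is a minimal sketch of the user-facing side of SMBO using Scikit-Optimize's gp_minimize, which fits a Gaussian Process surrogate and uses Expected Improvement to propose the next configuration at each iteration. The toy objective below is just a cheap stand-in for an expensive cross-validated model fit, and the parameter names are examples:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Toy objective: in practice this would train and cross-validate a model and
# return its validation loss for the given hyperparameter configuration.
def objective(params):
    learning_rate, num_leaves = params
    return (learning_rate - 0.1) ** 2 + (num_leaves - 31) ** 2 / 1e4

search_space = [
    Real(1e-3, 0.3, prior="log-uniform", name="learning_rate"),
    Integer(7, 255, name="num_leaves"),
]

# GP surrogate + Expected Improvement acquisition, with a budget of 30 calls.
result = gp_minimize(objective, search_space, acq_func="EI", n_calls=30, random_state=0)
print(result.x, result.fun)
```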

• Gaussian Processes

Gaussian Processes (GPs) are often the default when considering Bayesian Optimization. They are renowned for their ability to approximate complex functions, and are non-parametric in nature.

A GP determines (or is provided with) a prior over a set of functions (i.e. probability distributions w.r.t. the search space). At each iteration, values from the search space are sampled as training data, and a kernel function (of which there are several types) is applied to obtain a posterior. The resulting covariance matrix gives the multivariate Gaussian distribution its shape. The sampled/observed values serve as evidence to update the belief about the quality of the prior. A GP assumes that any finite set of function values is jointly Gaussian. Gaussian distributions have some attractive properties, one of which is the ability to determine the joint probability of several variables (hyperparameters) and the conditional probability of one hyperparameter given any other hyperparameter. This concept extends to functions, so that a GP results in a (posterior) predictive distribution that represents the joint probability of observing its output values y (i.e. loss) for all input values (i.e. hyperparameter configurations) in x. Cholesky decomposition can be applied to sample from this posterior. Each function that is sampled from the set of all possible functions will fall within the uncertainty estimates and will fit the points seen during training exactly; in other words, the uncertainty is higher in regions that are not well explored. The acquisition function (EI) supports the GP by trading off the reduction of uncertainty against sampling in areas that have proven to minimize the function. Each iteration, a value is sampled where the acquisition function is maximized. I came across this excellent visualization, for which I would like to credit the Criteo AI Lab. The figure shows how the mean and uncertainty (based on variance) change as training points are added.

Gaussian Process Regression as visualized by Criteo AI Lab

The blue line f represents the unknown objective function, which is iteratively approximated by the surrogate model (GP), represented by the orange line (plus confidence intervals). As the number of iterations increases, the model gets better at determining where f is minimized. The acquisition function (EI) indicates where in the search space the surrogate model should orient itself next (where the Expected Improvement is maximized).
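For those who want to see these moving parts in code, below is a minimal one-dimensional sketch of the loop the figure describes, using Scikit-learn's GaussianProcessRegressor as the surrogate and a hand-rolled Expected Improvement; the toy objective stands in for an expensive model evaluation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy 'unknown' objective function f; in a real HPO setting evaluating it
# would mean training and validating a model.
def objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3, 1))            # a few random initial evaluations
y = objective(X).ravel()

candidates = np.linspace(-3, 3, 500).reshape(-1, 1)

for _ in range(20):
    # Fit the GP surrogate on everything observed so far.
    gp = GaussianProcessRegressor(alpha=1e-6, normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected Improvement over the best loss observed so far (minimization).
    best = y.min()
    improvement = best - mu
    z = improvement / np.maximum(sigma, 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)

    # Evaluate the candidate where EI is maximized and add it to the history.
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("best x:", X[np.argmin(y)].item(), "best loss:", y.min())
```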

As far as a conceptual understanding goes, this suffices for now. However, we barely scratched the surface when it comes to GPs. GPs can be fairly hard to understand thoroughly, but it might be worthwhile to put in some effort to get a deeper understanding of this rich subject. There is plenty of material out there that explains GPs far better than I ever could. For starters, you might look here, here, here, or give this classic a chance. As GPs touch upon several key statistical concepts (stochastic processes, kernel functions, and Bayesian inference), I personally found it quite useful to do some more research.

• Tree-Parzen Estimators (TPE)

GPs model the loss y given hyperparameter configurations x, resulting in p(y|x). TPE works somewhat differently. It was introduced in this paper. The authors were using hyperparameters of a deep neural network that were continuous, categorical, and, furthermore, conditional on one another. If you were to visualize such a conditional hyperparameter configuration space, it would be tree-like. They were looking for an SMBO method that could take this into account, and so they introduced TPE.

TPE uses Bayes' rule to determine p(x, y) by first determining p(y) and p(x|y). TPE first uses random samples in x to get an initial set of loss scores. Subsequently, it singles out the best loss scores (the 'good' group) based on some threshold y* for the loss (often a percentile), and appoints the remaining ones to the 'remainder' group, for lack of a better term. By creating these two groups, two (non-parametric) densities can be constructed to model p(x|y). Let l(x) be the density of the 'good' group and g(x) be the density of the 'remainder' group. Then g(x) is modeled by p(x|y ≥ y*) and l(x) by p(x|y < y*). In other words, l(x) represents the density of hyperparameter configurations with 'good' losses, and g(x) the density of those with 'bad' losses. With each iteration of the TPE process, a sampled hyperparameter configuration is evaluated and the densities are adjusted to approximate the true loss function.

TPE models the densities l(x) and g(x) using Parzen estimators (kernel density estimators). TPE can, by nature, handle conditional hyperparameter spaces because the Parzen estimators are organized in a tree structure: for each hyperparameter in the configuration, it provides a fit for each of the densities g(x) and l(x). A Parzen estimator is a window method used to estimate probability density functions (PDFs) based on some sample, without being sure about the underlying distribution. This makes the use of Parzen estimators a non-parametric approach to building densities.
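To make the two groups and their densities tangible, here is a toy one-dimensional sketch, with SciPy's gaussian_kde standing in for the Parzen estimators (the numbers are arbitrary); the ratio computed at the end anticipates the role of EI discussed next:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Toy history of observations: hyperparameter values x and their losses.
x_observed = rng.uniform(0, 1, 100)
losses = (x_observed - 0.3) ** 2 + rng.normal(0, 0.01, 100)

threshold = np.percentile(losses, 25)                       # y*, the 'good' quantile
l_density = gaussian_kde(x_observed[losses < threshold])    # l(x): the 'good' group
g_density = gaussian_kde(x_observed[losses >= threshold])   # g(x): the 'remainder' group

# The next candidate is taken where l(x)/g(x) is largest (more on this below).
candidates = np.linspace(0, 1, 200)
x_next = candidates[np.argmax(l_density(candidates) / g_density(candidates))]
print(x_next)  # should land near 0.3, where the 'good' density dominates
```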

As with GPs, in each iteration the acquisition function (EI) determines which values are chosen next. Intuitively, one would like to choose a hyperparameter configuration from l(x), since this is the 'good' group that yields the lowest losses. EI does this automatically for us: it is maximized where the ratio l(x)/g(x) is maximized. This means that the hyperparameter configuration chosen for the next iteration is the one for which the probability of it belonging to the 'good' group, divided by the probability of it belonging to the 'remainder' group, is highest (see the bottom part of the plot). After each iteration, the values that were already used can be reassigned to either the 'good' or the 'remainder' group (see the top part of the plot).

TPE as visualized by Criteo AI Lab
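Finally, here is a minimal sketch of what running TPE looks like in practice with Hyperopt's fmin; the toy objective and ranges are again stand-ins for an expensive cross-validated model fit:

```python
from hyperopt import Trials, fmin, hp, tpe

# Toy objective: takes a sampled configuration, returns the loss to minimize.
def objective(params):
    return (params["learning_rate"] - 0.1) ** 2 + (params["num_leaves"] - 31) ** 2 / 1e4

space = {
    "learning_rate": hp.loguniform("learning_rate", -7, -1),  # roughly e^-7 to e^-1
    "num_leaves": hp.quniform("num_leaves", 7, 255, 1),
}

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=30, trials=trials)
print(best)
```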

3. Comparison

So, which method should be used when optimizing hyperparameters in Python? I tested several frameworks (Scikit-learn, Scikit-Optimize, Hyperopt, Optuna) that implement both Exhaustive Search methods and SMBO. I aimed to score this comparison loosely on the following metrics:

  • Ease-of-use
  • Performance (minimizing multinomial log loss)
  • Speed (Estimation time)
  • Stability & Robustness

It's almost impossible to make a fair comparison: some methods require grids, others distributions; some methods require an iteration budget, others a time or number-of-fits budget. However, I tried to make it as fair as possible. The comparison was performed on a machine with 32GB RAM and 12 cores. As mentioned, all code can be found on GitHub.

We will be using three datasets (the Otto Group dataset, the SatImage dataset, and the Shuttle dataset) and a single model: LightGBM (with early stopping). During initial estimation I found the stability of some frameworks to be somewhat suspect, so I decided to use 5 different random seeds for each method/dataset combination to determine the stability of the outcome. I used a budget of 30 iterations. Let's review the results.
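For reference, below is a minimal sketch of the kind of per-trial evaluation used here: LightGBM with early stopping, scored on multinomial log loss. It is a simplified stand-in on a toy dataset and a single validation split, not the exact setup used in the comparison (that lives in the GitHub repo):

```python
import lightgbm as lgb
from sklearn.datasets import load_digits
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def evaluate(params):
    # Fit LightGBM with early stopping on a held-out validation set and
    # return the multinomial log loss, i.e. the quantity the HPO minimizes.
    model = lgb.LGBMClassifier(n_estimators=1000, **params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric="multi_logloss",
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    return log_loss(y_val, model.predict_proba(X_val))

print(evaluate({"learning_rate": 0.1, "num_leaves": 31}))
```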

Dataset #1: SatImage

This dataset has 5144 observations and 36 features in the training set. First, let's take a look at how the best loss found by the various methods evolves over the iterations. I have taken the plot from one of the five random seeds for which these methods were scored. The Hyperopt methods seem to perform well, followed by implementations from the Scikit-Optimize (Skopt) library. Furthermore, these methods were already in the lead after 10 iterations or so. Notice that the top three at the final iteration use three different surrogate models: TPE, GB, and RF.

It is more interesting to examine the performance over all random seeds. Overall, you may notice that there is quite a gap between the train and test scores, which is interesting. It might be caused by the fact that the dataset is somewhat small, so there is too little data in each of the folds. Also, in case of underfitting, the number of iterations (30 for this dataset) might be increased. Besides this, it seems that the libraries implement the same methods in quite different manners. The best performing method is Random Search as implemented by Scikit-learn. However, there are also Random Search implementations that perform worse than the base classifier, which took less than 1 minute to estimate. Overall, it is rather difficult to conclude anything spectacular from this. In terms of stability and performance, Optuna does not perform as well as I would have hoped; some test scores seem to be way off. Alongside our winner, the Hyperopt TPE EI as well as the Skopt implementations could be considered the best performers.

One of the reasons I started this endeavour was to determine whether there were any methods that could shorten the estimation time of the 'old faithful' Grid Search while yielding better performance. It is rather difficult to make a fair comparison, as for Grid Search we are dependent on the size of the grid and not on the number of iterations, but we'll try to compare them anyhow. Performance and the time needed to achieve that performance is a trade-off. Given the performance in the previous plot, we have to declare Scikit-learn's implementation of Random Search the winner here. Estimation time for the Optuna methods is also relatively short. Interestingly, the deviation amongst seeds for the Skopt trials is larger than that of the other frameworks.

Dataset #2: OttoGroup

This dataset is larger than the first: it has 49502 observations and 94 features in the training set. Also notice that the estimation times have increased.

Again, we examine the plot of how the best loss changes over time, for one of the five random seeds. At the last iteration, there are three Random Search implementations in the top 5 of best scores. For this specific run, the Optuna CMA-ES algorithm finds the best loss. To be honest, CMA-ES does not fall under either paradigm described earlier (Exhaustive Search, SMBO), but is an evolutionary approach to HPO.

First of all, notice that HPO is effective in general for classification on this dataset: all HPO methods perform better than the base classifier. Examining the full set of train and test scores for all random seeds, we find that there is still some difference between the average test and train scores, with better scores on the test set, as with the previous dataset. However, the difference seems to be smaller, especially in the case of the Optuna and Hyperopt implementations. Methods from these two libraries perform similarly well, with the Hyperopt methods having somewhat more stable results. These two libraries can be considered the 'winners' here, with Optuna TPE EI specifically doing well. The 'winner' from the previous dataset, Sklearn RandomizedSearchCV, underperforms relative to other methods, as does Sklearn GridSearchCV. Notice that the scores for the Skopt implementations have a large spread, indicating unstable results over the seeds. Some questions remain. How is it that OptunaSearchCV TPE EI consistently underperforms relative to the other Optuna method that implements TPE EI? Also, in general the Random Search methods yield quite competitive scores, with Scikit-learn as an exception; the reverse was the case for dataset #1.

The estimation times for the Scikit-learn implementations are very stable and relatively short. Sadly, their scores are not as good. The others are really all over the place. Note that I did not run any other processes whilst estimating models, so this is quite surprising to me. Again, we see differences in the same methods across libraries. As an example: Random Search as implemented by Optuna takes almost twice as long as the one implemented by Scikit-learn. Methods in Optuna seem to be relatively slow compared to dataset #1. The Hyperopt estimation times are OK, especially taking the scores of the previous plot into account. The estimation times for the Gaussian Process as implemented by Scikit-Optimize are quite OK, but the other surrogate models in Scikit-Optimize are not winning any trophies here.

Dataset #3: Shuttle

This dataset has 46400 observations and 9 features in the training set.

Compared to the other datasets, some methods are still finding better losses in the last iterations. To make sure that all models converge, it would be appropriate to increase the number of iterations. For now, we’ll keep it at 30 as with the other datasets.

* Notice that I set a limit on the x-axis, so the scores for the base classifier and Sklearn GridSearchCV are not visible in the plot

HPO has an extreme impact on model performance w.r.t. the base classifier (average test log loss: 2700). Furthermore, the difference between average test and train scores is a lot smaller than in the previous datasets. In general, many models are able to minimize the loss well, some even having a log loss of 0 (or near 0) in their test scores. The two methods that perform best are Sklearn RandomizedSearchCV and Optuna OptunaSearchCV TPE EI, closely followed by Skopt BayesSearchCV GP gp_hedge. Again, Hyperopt TPE EI performs OK and is not far behind.

Given the score results, Sklearn RandomizedSearchCV is clearly the winner here. Also notice that Skopt BayesSearchCV GP gp_hedge is quite fast and stable. With any of these time-based dot plots, keep in mind that some differences within each method are to be expected, as a configuration can be sampled with, for instance, a large tree-depth parameter, causing estimation time to be longer.

4. Conclusions

Well... let's summarize. The overall results are not as decisive as I would have liked. Both the scores and the estimation times are somewhat dependent on the random seeds that I used.

First, let's review ease-of-use, which we have not considered up to this point. This refers to the APIs, the documentation, and the various features that are provided (plotting, distributed computing, support for nested search spaces, etc.). Personally, I found that I have a slight preference for Optuna. Furthermore, Optuna is able to use samplers from Skopt as surrogate models, which puts points for flexibility on the scoreboard. Most libraries have good documentation, but I found the documentation for Hyperopt not to be as good as the others. Check out this article for a lot of great details regarding this subject.
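As a small illustration of that ease of use, here is a minimal Optuna sketch with a define-by-run objective and a seeded TPE sampler; swapping in a different sampler (for instance the Skopt integration, where available in your Optuna version) is a one-line change. The toy objective once more stands in for an expensive model fit:

```python
import optuna

# Define-by-run objective: hyperparameters are suggested inside the function.
def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
    num_leaves = trial.suggest_int("num_leaves", 7, 255)
    # Stand-in for a cross-validated model fit returning a loss to minimize.
    return (learning_rate - 0.1) ** 2 + (num_leaves - 31) ** 2 / 1e4

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=0),
)
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```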

In terms of performance, there doesn't seem to be a clear winner, as you might expect. Personally, I will be relying more on Random Search than on Grid Search for future projects. I will also be experimenting with TPE EI as implemented by Hyperopt and Optuna. Hyperopt TPE EI showed decent-to-good performance across the datasets, and while it was not the absolute fastest method, it provided quite stable results. For the Optuna library in general, we found relatively bad performance on dataset #1 with relatively short estimation times, while performance on dataset #2 was good but estimation was relatively long.

Overall, it also seems to make a difference which library is used, even if the same choices are made for the surrogate model and acquisition function. I would like to invite the reader to participate in the discussion as to why we see the results we see. For example: some SMBO methods perform relatively poorly compared to Random Search when the search space consists of many dimensions (a lot of hyperparameters). In this case study, we tested almost all relevant hyperparameters of the LightGBM model. If we restricted the number of hyperparameters to a maximum of, let's say, five, the results might be somewhat different.

5. Towards AutoML

In the examples we used just one model (LightGBM). However, the model itself can also be part of the optimization process. Several models can be compared that are each individually optimized using HPO, while varying the features that serve as inputs. And to go a bit more Inception on you: HPO methods themselves have hyperparameters that can alter the solution. At this stage we move towards AutoML and speak more of an 'optimization pipeline' than strictly HPO. The AutoML procedures that are often provided by cloud platforms abstract away most of these complexities.

6. Concluding thoughts & further reading

There are some methods that were not included in this study, most notably: Spearmint, SMAC, and HpBandster.

It should be mentioned that even though HPO is quite useful, one should not underestimate the effect of proper feature engineering on model performance. Thinking critically about how to engineer the features that function as input for ML models w.r.t. the problem is a vital skill, even more so as AutoML gets democratized.

HPO might have found you an optimized configuration of hyperparameters, but in doing so it has stumbled upon many configurations with great performance. These can be combined in an ensemble to further increase performance. If you find this interesting, you might take a look at an approach called Automatic Frankensteining, or Frankenstein ensembles. There is also some work on how to optimize base learners themselves for an already existing ensemble.

Most often, one is not able to explore the search space indefinitely to find the best hyperparameter configuration. There is a trade-off between model performance and the budget to be spent (time or iterations). This trade-off can be handled by fixing one of the elements (often time), i.e. by constrained optimization. Likewise, you might want to optimize hyperparameters using multiple loss functions. For the latter case, a Pareto front might provide optimal trade-offs based on which a configuration can be chosen.

Another area which wasn't covered is the use of population-based methods. These are methods that maintain a population of hyperparameter configurations, tweaking and combining them to find better configurations. Examples of population-based methods are genetic algorithms, evolutionary algorithms, and particle swarm optimization.

Sources

(1) Hutter, F., Hoos, H., Leyton-Brown, K.: An efficient approach for assessing hyperparameter importance. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 754–762 (2014)

(2) Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, 281–305 (2012)
