AutoML with Successive Halving and Hyperband

Sabrina Herbst
4 min read · Jan 13, 2022

After giving a short introduction to Automated Machine Learning (AutoML) in my last two Medium posts (see Automated Machine Learning and Hyper-Parameter Optimisation and Blackbox and Bayesian Optimisation), I would like to continue with the Hyperband algorithm, which was developed by Li et al. [1] and published in 2017.

According to them, approaches to efficient Automated Machine Learning fall into two main categories:

  1. Configuration Selection. In configuration selection approaches, we try to speed up the AutoML algorithm by predicting the performance of a configuration at a very early stage and investing in testing more promising ones instead (as opposed to e.g. Grid Search, where all configurations are tested). A very common example of a configuration selection approach is Bayesian Optimisation.
  2. Configuration Evaluation. In configuration evaluation approaches, a certain amount of resources is allocated for model evaluation: more resources are given to promising solutions, while less promising ones receive fewer resources and are removed as soon as possible. Successive Halving [2] and the Hyperband algorithm [1] are examples of such approaches.

Successive Halving

Successive Halving was proposed by Jamieson et al. [2] in 2016. The algorithm represents a relatively simple and robust approach for tuning hyper-parameters. The information below is taken from the respective paper, and for further details on Successive Halving I would refer you to it.

In Successive Halving, a certain amount of resources is given to a set of model configurations. These are trained and evaluated, and the worst half (hence, Successive Halving) is thrown out. In the next round, more resources are allocated to train and evaluate the remaining models. This is repeated until only one configuration remains.
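To make this concrete, below is a minimal Python sketch of Successive Halving. The `train_and_score` callable and the simple “double the resources each round” scheme are assumptions for illustration; the paper phrases the allocation in terms of a total budget, but the elimination logic is the same.

```python
def successive_halving(configs, min_resource, train_and_score):
    """Minimal sketch of Successive Halving.

    configs         -- list of hyper-parameter configurations to compare
    min_resource    -- resources (e.g. training epochs) given to each config in round 1
    train_and_score -- assumed callable(config, resource) -> validation loss (lower is better)
    """
    remaining = list(configs)
    resource = min_resource
    while len(remaining) > 1:
        # Train and evaluate every surviving configuration with the current budget.
        losses = [train_and_score(cfg, resource) for cfg in remaining]
        # Keep the better half (lowest validation loss), discard the rest.
        order = sorted(range(len(remaining)), key=lambda i: losses[i])
        remaining = [remaining[i] for i in order[: max(1, len(remaining) // 2)]]
        # Survivors receive twice the resources in the next round, so each
        # round spends roughly the same total amount of resources.
        resource *= 2
    return remaining[0]
```

Because the number of survivors is halved while the per-model budget doubles, every round consumes roughly the same total amount of resources.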

The authors compared their approach to similar hyper-parameter optimisation methods, including uniform allocation strategies (e.g. Grid Search). Even though a uniform allocation strategy only trains about half as many models as Successive Halving does, they were able to show that Successive Halving returns comparable results in less time.

Potential for Improvement

While Successive Halving already leads to good results, there is still some room for improvement.

Depending on the given ML problem (and the algorithms used), one has to decide whether it is better to test either

  • more configurations and spend fewer resources on training them or
  • fewer configurations but spend more resources on training the models [1].

This is the problem that the Hyperband algorithm tries to solve.

Hyperband

The Hyperband algorithm is relatively easy to understand and straightforward; it resembles a more advanced version of Random Search. The following information can be found in Li et al. [1].

The screenshot below shows the pseudocode for the Hyperband algorithm.

Screenshot taken from Li et al. [1], p.8

The algorithm takes the following two inputs and returns the single best configuration seen during the search.

  • R is the maximum amount of resources that can be spent on a single configuration
  • η controls what fraction of the configurations is discarded in each round of Successive Halving: only the best 1/η survive (the authors suggest using 3 or 4, so strictly speaking “halving” no longer applies here)
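To get a feeling for what these two inputs control, the short snippet below computes the number of brackets and the budget per bracket from R and η, following the initialisation step of the pseudocode; the concrete values R = 81 and η = 3 are only an illustrative example.

```python
import math

# Illustrative example values: R = 81 resource units, eta = 3.
R, eta = 81, 3

# s_max + 1 is the number of Successive Halving brackets the outer loop runs.
s_max = math.floor(math.log(R) / math.log(eta) + 1e-9)  # epsilon guards against float rounding
# B is the total amount of resources spent within one bracket.
B = (s_max + 1) * R

print(f"Hyperband runs {s_max + 1} brackets, each with a budget of {B} resource units")
```

With these example values the algorithm runs five brackets, ranging from “many configurations with few resources each” to “a handful of configurations with the full budget R each”.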

It can be seen that the algorithm is made up of two nested for-loops. The inner one performs Successive Halving; in Hyperband, Successive Halving is basically called as a subroutine, with no major adjustments except for the number of configurations discarded in each round.

The outer one iterates over different values of n (the number of configurations evaluated in each round of Successive Halving) and r (the amount of resources used to train the models). This tackles exactly the problem described above: finding a balance between training many models with few resources and training fewer models with more resources.

Additionally, it can be seen that the algorithm relies on three helper functions.

  • get_hyperparameter_configurations: returns n model and hyper-parameter configurations (can be chosen randomly or following a certain distribution).
  • run_then_return_val_loss: trains a given model using some training data and evaluates the model’s performance on a validation set. The validation loss (some arbitrary performance measurement) is then returned.
  • top_k: returns the top k configurations of a given set of configurations based on the validation loss.
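Putting the pieces together, here is a minimal Python sketch of how the pseudocode translates into code, assuming the three functions above are provided by the caller; the `max(1, ...)` guard that keeps at least one configuration alive is my own addition, not part of the pseudocode.

```python
import math

def hyperband(R, eta, get_hyperparameter_configurations,
              run_then_return_val_loss, top_k):
    """Sketch of the Hyperband loop; the three callables are supplied by the caller."""
    s_max = math.floor(math.log(R) / math.log(eta) + 1e-9)  # epsilon guards float rounding
    B = (s_max + 1) * R                                      # budget per bracket
    best_config, best_loss = None, float("inf")

    # Outer loop: each bracket s trades off "many configs, few resources"
    # against "few configs, many resources".
    for s in range(s_max, -1, -1):
        n = math.ceil((B / R) * eta ** s / (s + 1))  # configurations in this bracket
        r = R * eta ** (-s)                          # initial resources per configuration
        T = get_hyperparameter_configurations(n)

        # Inner loop: Successive Halving with elimination factor eta.
        for i in range(s + 1):
            n_i = math.floor(n * eta ** (-i))
            r_i = r * eta ** i
            losses = [run_then_return_val_loss(t, r_i) for t in T]

            # Remember the best configuration seen anywhere in the search.
            for t, loss in zip(T, losses):
                if loss < best_loss:
                    best_config, best_loss = t, loss

            # Keep only the most promising 1/eta of the configurations.
            T = top_k(T, losses, max(1, math.floor(n_i / eta)))

    return best_config
```

A caller would, for instance, plug in random sampling for get_hyperparameter_configurations and a routine that trains a model for r_i epochs and reports its validation loss for run_then_return_val_loss.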

One of the advantages of Hyperband is that it stops investing resources in configurations that are not promising and instead uses them to explore more promising ones. Hyperband thus implements an early-stopping approach, which leads to performance gains.

Li et al. [1] ran experiments and observed that, compared to popular Bayesian Optimisation techniques, Hyperband is usually about 5 to 30 times faster. These observations were made on deep learning and kernel-based methods.

It can be seen quite easily that the algorithm could be run in parallel, as training and evaluating the different configurations can be done independently of each other.

Additionally, instead of random sampling, techniques that favour more promising solutions could be used, for example Meta Learning. Meta Learning tries to incorporate information about former datasets and their high-performing solutions (e.g. by comparing properties of the datasets) to “warm-start” the search for the optimal model and hyper-parameters.

[1] Li et al. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research 18(1): 6765–6816, 2017.

[2] Jamieson et al. Non-stochastic Best Arm Identification and Hyperparameter Optimization. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:240–248, 2016.

