Comparison of Hyperparameter Tuning algorithms: Grid search, Random search, Bayesian optimization
In the model training phase, a model learns its parameters. But there are also some secret knobs, called hyperparameters, that the model cannot learn on its own; these are left to us to tune. Tuning hyperparameters can significantly improve model performance. Unfortunately, there is no definite procedure to calculate these hyperparameter values. This is why hyperparameter tuning is often regarded as more of an art than a science.
In this article, I discuss the 3 most popular hyperparameter tuning algorithms — Grid search, Random search, and Bayesian optimization.
What is Hyperparameter Tuning?
Model training is a process through which a model learns its parameters. Besides these, every model also has some hyperparameters, which it cannot learn but which can be tuned. In contrast to model parameters, which are learned during training, model hyperparameters are set by the data scientist ahead of training. The process of trying out various hyperparameter values is called hyperparameter tuning. (Note the usage of the term hyperparameter tuning, and not hyperparameter training.)
Model parameters are learned from data automatically during training.
Model hyperparameters are set and tuned manually, and are used during training to help learn the model parameters.
Hyperparameter Tuning Algorithms
1. Grid Search
This is the most basic hyperparameter tuning method. You define a grid of hyperparameter values. The tuning algorithm exhaustively searches this space in a sequential manner and trains a model for every possible combination of hyperparameter values.
For example, to train an SVM model we can define our hyperparameter grid space (C, gamma, kernel) as follows. The grid search algorithm trains multiple models (one for each combination) and finally retains the best combination of hyperparameter values.
param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'linear']
}
Grid search is not used very often in practice because the number of models to train grows exponentially with the number of hyperparameters being tuned. The small grid above already requires 5 × 4 × 2 = 40 models. This can be very inefficient, both in computing power and in time.
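To make the growth concrete, here is a minimal sketch that expands the grid from the example above into its individual combinations using only the standard library. The helper name expand_grid is my own and is not part of any library.

```python
from itertools import product

# The grid from the SVM example above.
param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "linear"],
}


def expand_grid(grid):
    """Return every combination of values in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


combinations = expand_grid(param_grid)
print(len(combinations))  # 5 * 4 * 2 combinations

# Adding one more hyperparameter with 3 candidate values triples the work.
bigger_grid = dict(param_grid, degree=[2, 3, 4])
print(len(expand_grid(bigger_grid)))
```

Each dict in combinations corresponds to one model that grid search would have to train.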
2. Random Search
Random search differs from grid search in that we no longer provide an explicit set of possible values for each hyperparameter; rather, we provide a statistical distribution for each hyperparameter, from which values are sampled at random. (Categorical hyperparameters, such as the kernel, can still be given as a list of options to pick from.)
Using random search, we can also control or limit the number of hyperparameter combinations used. Unlike grid search, in which every possible combination is evaluated, random search lets us train only a fixed number of models and stop the tuning algorithm after that. The number of search iterations can also be set based on the time or resources available.
For example, to train an SVM model we can define our hyperparameters (C, gamma) as log-uniform distributions. (Other distributions, such as the uniform distribution, are also commonly used.) The random search algorithm samples a value for C and gamma from their respective distributions, and uses them to train a model. This process is repeated several times and multiple models are trained. The best combination of hyperparameter values is finally retained.
from scipy.stats import loguniform
param_distributions = {
    'C': loguniform(1e-1, 1e3),
    'gamma': loguniform(1e-4, 1e-1),
    'kernel': ['rbf', 'linear']
}
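To show what sampling from these distributions looks like, here is a small standard-library sketch. It draws a fixed number of candidate configurations from the same ranges as above. The helper names sample_loguniform and sample_configs are hypothetical, and a real random search would also train and score a model for each configuration.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_configs(n_iter, seed=0):
    """Sample n_iter hyperparameter combinations for the SVM example."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_iter):
        configs.append({
            "C": sample_loguniform(rng, 1e-1, 1e3),
            "gamma": sample_loguniform(rng, 1e-4, 1e-1),
            "kernel": rng.choice(["rbf", "linear"]),
        })
    return configs


# The budget (10 models) is fixed up front, regardless of the size of the space.
configs = sample_configs(n_iter=10)
for config in configs:
    print(config)
```

A log-uniform distribution spreads the samples evenly across orders of magnitude, which is why it suits hyperparameters such as C and gamma.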
3. Bayesian Optimization
In the previous two methods, we performed individual experiments by building multiple models with various hyperparameter values. All these experiments were independent of each other. Since each experiment was performed independently, we are not able to use the information from one experiment to improve the next experiment.
Bayesian optimization is a sequential model-based optimization (SMBO) algorithm that uses the results from the previous iteration to decide the next hyperparameter value candidates.
So instead of blindly searching the hyperparameter space (as in grid search and random search), this method uses what it has learned from earlier trials to pick the next set of hyperparameters that is likely to improve model performance. We repeat this process iteratively until we converge to an optimum or run out of iterations.
An interesting analogy is to compare this to bagging vs. boosting. If you think about it, the idea is very similar!
In bagging, we build a lot of trees in parallel and independently of each other. Boosting, on the other hand, is a sequential process in which each additional tree learns to correct the mistakes of its predecessor.
Bayesian optimization creates a probabilistic model (often called a surrogate model) that maps hyperparameters to a probability of a score on the objective function. For more mathematical details, refer to this.
Bayesian optimization methods are efficient because they select hyperparameters in an informed manner. By prioritizing hyperparameters that appear more promising from past results, Bayesian methods can often find the best hyperparameters in less time (in fewer iterations) than both grid search and random search.
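The loop described above can be sketched in a few lines of NumPy. This is only an illustration under several assumptions of mine: the surrogate is a Gaussian process with a fixed RBF kernel, the next candidate is chosen by expected improvement (one common choice among several), and toy_score is a hypothetical stand-in for "train a model with this value of log10(C) and return its validation score".

```python
import math

import numpy as np


def toy_score(log_c):
    """Hypothetical validation score as a function of log10(C); peaks at 1."""
    return -(log_c - 1.0) ** 2


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)  # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement over the best score seen so far (maximization)."""
    gain = mean - best
    z = gain / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return gain * cdf + std * pdf


def bayes_opt(n_iter=10):
    candidates = np.linspace(-1.0, 3.0, 201)  # search range for log10(C)
    x_obs = np.array([-0.5, 2.0])             # two initial experiments
    y_obs = toy_score(x_obs)
    for _ in range(n_iter):
        offset = y_obs.mean()                 # centre scores for the zero-mean GP
        mean, std = gp_posterior(x_obs, y_obs - offset, candidates)
        ei = expected_improvement(mean + offset, std, y_obs.max())
        x_next = candidates[np.argmax(ei)]    # most promising candidate
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, toy_score(x_next))
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best]), len(x_obs)


best_log_c, best_score, n_evals = bayes_opt()
print(best_log_c, best_score, n_evals)
```

Unlike the previous two methods, every new experiment here depends on all the earlier ones: the surrogate is refitted after each evaluation and the next point is chosen where it predicts the largest expected gain.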
Selecting the best hyperparameters
We have been talking a lot about the best hyperparameters. But how do we choose the best? Which metric is used to differentiate between good and bad hyperparameters?
Well, the simplest approach is as follows:
- Split your dataset into train (70%) and validation (30%) sets.
- Choose any hyperparameter tuning algorithm: grid search, random search or Bayesian optimization.
- Decide and create a list of the hyperparameters that you wish to tune.
- Train multiple models: one model is trained for every hyperparameter value combination that the algorithm tries.
- Calculate the validation set accuracy for each of these models.
- Choose the model which achieves the highest accuracy on the validation set.
This is your best model and the hyperparameters that were used to train this model are your best hyperparameters.
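The steps above can be sketched end to end with a toy example. Everything here is an assumption made for illustration: the data is synthetic, the "model" is a tiny 1-D k-nearest-neighbours classifier written by hand, and its only hyperparameter is k, tuned with a one-dimensional grid.

```python
import random


def make_data(n=100, seed=0):
    """Toy 1-D dataset: label is 1 when x > 0.5, with roughly 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, val, k):
    """Fraction of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in val)
    return hits / len(val)


# Step 1: 70/30 split (the data is already in random order).
data = make_data()
split = len(data) * 7 // 10
train_set, val_set = data[:split], data[split:]

# Steps 2-5: grid search over k, scoring each candidate on the validation set.
k_values = [1, 3, 5, 9, 15]
val_scores = {k: accuracy(train_set, val_set, k) for k in k_values}

# Step 6: keep the hyperparameter value with the highest validation accuracy.
best_k = max(val_scores, key=val_scores.get)
print(best_k, val_scores[best_k])
```

The validation set is never used to fit the classifier, only to compare the candidate values of k.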
Cross Validation (CV)
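While the above approach is logical, splitting your dataset directly into train and validation sets is not advisable, as you might lose a lot of important information in the data points that go into the validation set and are never used for training. To avoid this, we generally use cross-validation (CV) to tune our hyperparameters: the data is split into k folds, each fold serves as the validation set once while the remaining folds are used for training, and the k validation scores are averaged.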
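As a rough illustration of how cross-validation reuses the data, here is a standard-library sketch that generates the train and validation indices for each fold. The function name k_fold_splits is hypothetical, and in practice you would usually shuffle the data before splitting.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices); each sample is validated exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


splits = list(k_fold_splits(n_samples=10, n_folds=3))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

For each hyperparameter combination, a model is trained once per fold and the validation scores are averaged, so every data point contributes to both training and validation.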
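Scikit-learn is a popular library that provides ready-to-use implementations of GridSearchCV and RandomizedSearchCV. BayesSearchCV is not part of scikit-learn itself; it comes from the separate scikit-optimize (skopt) library, which offers a scikit-learn-compatible interface.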
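Here is a minimal sketch of how the first two of these classes can be used with the SVM search spaces from earlier. The iris dataset, the 3-fold setting and the budget of 10 random iterations are my own choices for the example, not recommendations. BayesSearchCV is left out because it requires the separate scikit-optimize package.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Small built-in dataset, used here only to keep the example quick to run.
X, y = load_iris(return_X_y=True)

# Grid search: every combination is evaluated with 3-fold cross-validation.
param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "linear"],
}
grid_search = GridSearchCV(SVC(), param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)

# Random search: only n_iter sampled combinations are evaluated.
param_distributions = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-4, 1e-1),
    "kernel": ["rbf", "linear"],
}
random_search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=3, random_state=0
)
random_search.fit(X, y)
print(random_search.best_params_, random_search.best_score_)
```

In both cases best_score_ is the mean cross-validated score of the best combination, and best_estimator_ holds a model refitted on the full dataset with those hyperparameters.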
As much as we would like to improve our model performance through hyperparameter tuning using cross-validation, it is important to remember that it significantly increases the overall model training time, because every hyperparameter combination has to be trained once per fold.