Hands-on Tutorials

Improve your model performance with Bayesian Optimization Hyperparameter Tuning

Michele Cavazza
Towards Data Science
8 min read · Mar 3, 2021


Introduction

“Why should I read this article?”

If you have started using ML for your projects, or simply for fun, you might have realized how challenging and, above all, how time-consuming tuning your model can be. If you are not an expert ML practitioner, it is hard to make good educated guesses about which hyperparameters to use. Moreover, the algorithm you are using might be quite sensitive to the hyperparameter selection, so it is worth testing several configurations. In this article, I will empirically show the power of Bayesian Optimization for hyperparameter tuning and compare it to more common techniques.

Background

“A quick recap on hyperparameter-tuning”

In the field of ML, the best-known techniques for evaluating several sets of hyperparameters are Grid search and Random search. A nice visualization of how they work can be seen below:

Image by Author, inspired by Random Search for Hyper-Parameter Optimization (James Bergstra, Yoshua Bengio)

The x-axis and the y-axis represent the value range of two hypothetical hyperparameters. In the left image the algorithm tries pre-defined combinations of parameters (GridSearch), while in the right image the hyperparameters are drawn randomly from a range (RandomSearch).

In the image, you can also see that for every hyperparameter there is a function describing the relationship between the hyperparameter itself and its contribution to the model performance. Your goal is to optimize that function to find its maximum. However, the more hyperparameters you have, the harder it gets to find the sweet spot.

Moreover, this example is not very realistic, because ML algorithms can have a long list of tunable hyperparameters and the number of possible combinations quickly becomes infeasible to evaluate exhaustively: for instance, six hyperparameters with just five candidate values each already give 5^6 = 15,625 combinations. Our goal is therefore to find the best-performing model while trying the smallest possible number of combinations.

Bayesian Optimization

“Make the hyperparameter search more data-driven”

Grid search and Random search are not the only techniques for hyperparameter tuning. I would like to show another one that can be used to achieve the same goal: Bayesian Optimization.

I will not dive into how the algorithm works; I will limit myself to giving a general overview of it. Here are the steps the algorithm takes (a minimal code sketch follows the list):

  • It starts by sampling random values for the hyperparameters.
  • It observes the output that is generated (model performance).
  • Based on those observations it fits a Gaussian process.
  • It uses the mean of this Gaussian process as an approximation of the function that is unknown.
  • To understand which hyperparameters should be sampled next, it uses an acquisition function (we could see it as a sampling policy). The acquisition function also defines how much the algorithm should explore the hyperparameter space or exploit the known areas.
  • It fits the model again, observes the output, and iterates over the same process until the maximum number of iterations is reached.
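
To make these steps concrete, here is a minimal sketch of the loop using scikit-optimize's ask/tell interface. The two-dimensional search space and the toy objective below are my own illustrative choices, not the setup used later in the article:

from skopt import Optimizer
from skopt.space import Real, Integer

# hypothetical search space with two hyperparameters, just for illustration
space = [Real(0.05, 0.55, name='learning_rate'),
         Integer(40, 160, name='n_estimators')]

# a Gaussian process surrogate with the Expected Improvement acquisition function
opt = Optimizer(space, base_estimator='GP', acq_func='EI', n_initial_points=5, random_state=0)

def objective(params):
    # stand-in for the expensive black box, e.g. the (negative) cross-validated score of a model
    learning_rate, n_estimators = params
    return (learning_rate - 0.2) ** 2 + ((n_estimators - 100) / 100) ** 2

for _ in range(20):
    x = opt.ask()     # hyperparameters suggested by the acquisition function
    y = objective(x)  # observe the output of the black box
    opt.tell(x, y)    # update the Gaussian process with the new observation

best_i = opt.yi.index(min(opt.yi))
print(opt.Xi[best_i], opt.yi[best_i])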

By reading the above points you should already see why this method is, in theory, better than the ones mentioned in the previous paragraph: it uses information from the previous runs to inform the next ones. This is very powerful because it allows us to progressively narrow down the search and focus on the most promising areas of the hyperparameter range.

More generally, Bayesian Optimization can be used whenever you need to optimize a black-box function. This means you can use this method to search for the global optimum of any function of which you can observe only the input and the output.
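
As a tiny illustration of this black-box view (the function below is a toy of my own, not something from the article), scikit-optimize's gp_minimize only needs to be able to evaluate the function:

from skopt import gp_minimize

def black_box(x):
    # we can only evaluate this function; we pretend not to know its form
    return (x[0] - 2.0) ** 2 + 1.0

res = gp_minimize(black_box,        # function to minimize
                  [(-5.0, 5.0)],    # bounds of the single input dimension
                  acq_func='EI',    # Expected Improvement acquisition
                  n_calls=25,       # total evaluation budget
                  random_state=0)
print(res.x, res.fun)               # best input found and its function value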

To do hyperparameter tuning with Bayesian Optimization, I am using the module ‘scikit-optimize’, which I found quite intuitive and user-friendly. There are not a lot of examples in its documentation, but it contains everything you need to implement it.

Experiment Setup

“Nice theory, but what does it mean in practice?”

In this experiment, we will compare the results of GridSearch, RandomSearch, and Bayesian optimization for hyperparameter tuning. The data that I am using is the California housing dataset where the goal is to predict the median house value. To achieve this result, I am using the GBM algorithm.

It is good to mention that this exercise aims to compare the hyperparameter-tuning algorithms and not to get the best possible model, therefore I am not doing any transformation on the data. I am doing a blindfolded plug-and-play, which of course is not advisable in a real-world scenario (always eat your carrots and look at the data).

To assess the model performance, I use the MAE metric together with K-fold cross-validation.

After every model, I plot the hyperparameter values per iteration as well as the average negative MAE across the K folds (what I am actually plotting is the difference between each iteration's negative MAE and the lowest negative MAE across the iterations). I do this to get a bit more clarity on the influence of the hyperparameters on the model performance.

The hyperparameters that I chose to tune are:

  • max_depth
  • min_samples_split
  • learning_rate
  • subsample
  • max_features
  • n_estimators

The Experiment

“Let’s look at some plots”

In the following block of code, I am importing the necessary modules and defining a function that I will use to visualize the results.

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from scipy.stats import randint, uniform
import seaborn as sns
import matplotlib.pyplot as plt
def parameter_over_iterations(model_result):
    '''
    Generate subplots with the hyperparameter values used at each iteration and the overall performance score.
    The performance score shown is the difference between each iteration and the worst-performing iteration.

    model_result: fitted CV search object (e.g. GridSearchCV, RandomizedSearchCV, BayesSearchCV)
    '''
    param_list = list(model_result.cv_results_['params'][0].keys())
    max_col_plot = 2
    row_plot = int(np.ceil((len(param_list) + 1) / max_col_plot))
    fig, axs = plt.subplots(nrows=row_plot, ncols=np.min((max_col_plot, len(param_list) + 1)), figsize=(30, 12))
    # one bar plot per hyperparameter, showing its value at every iteration
    for i, ax in enumerate(axs.flatten()):
        if i == len(param_list):
            break
        par = param_list[i]
        param_val = [par_dict[par] for par_dict in model_result.cv_results_['params']]
        sns.barplot(y=param_val, x=np.arange(len(param_val)), ax=ax)
        ax.set_title(par)
    # last subplot: mean score across the K folds, shifted so the worst iteration sits at zero
    dt = pd.DataFrame({key: val for key, val in model_result.cv_results_.items() if key.startswith('split')})
    mean_metric = dt.mean(axis=1)
    sns.barplot(y=(mean_metric.values + abs(np.min(mean_metric.values))), x=np.arange(len(mean_metric)), ax=axs.flatten()[i])
    axs.flatten()[i].set_title('overall metric')
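
The snippets that follow use X_train and y_train, which are not created anywhere in the article's code. As a minimal, purely illustrative sketch, the data could be loaded with scikit-learn's built-in copy of the California housing dataset (the original may well have used a CSV version of the same data, so the target scale, and therefore the MAE values, may differ):

from sklearn.datasets import fetch_california_housing

# the target is the median house value; in this built-in version it is expressed in units of $100,000
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)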

Grid Search

In this section, we will see the results of using GridSearch to select the best hyperparameters.

param_test = {'max_depth': range(5, 15, 5),
              'min_samples_split': range(200, 800, 300),
              'learning_rate': np.arange(0.05, 0.55, 0.25),
              'subsample': np.arange(0.4, 1, 0.4),
              'max_features': np.arange(0.4, 1, 0.3),
              'n_estimators': np.arange(40, 160, 60)}

gsearch = GridSearchCV(estimator=GradientBoostingRegressor(random_state=10),
                       param_grid=param_test,
                       scoring='neg_mean_absolute_error',
                       n_jobs=4, cv=5)
gsearch.fit(X_train, y_train)
parameter_over_iterations(gsearch)

In the plot above, we can see the model performance for every hyperparameter selection. In the first six subplots, I am plotting the hyperparameter values used at each iteration (the total number of iterations is 64).

The last plot (on the bottom left) is the performance plot; the way to interpret it is that the higher the bar, the better the model is performing.

From the plot above we can clearly see two things: a learning rate of 0.3 affects the model performance positively (the bars on the right side of that subplot are higher), and a higher number of estimators also leads to better performance (at the beginning of the plot you can see that the highest peaks occur when the number of estimators is high).

Anyway, out of this search, the best hyperparameter combination I find is:

  • learning_rate: 0.3
  • max_depth: 5
  • max_features: 0.4
  • min_samples_split: 500
  • n_estimators: 100
  • subsample: 0.8

This combination led to an MAE of 35212.30.
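
The article does not show how this number was computed; one plausible way (my assumption, not code from the original) is to read the best parameters and the cross-validated score directly off the fitted search object:

print(gsearch.best_params_)   # best hyperparameter combination found by the grid search
print(-gsearch.best_score_)   # mean cross-validated MAE (best_score_ stores the negative MAE)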

Random Search

RandomSearch should lead to better results than GridSearch for the same budget of iterations, even though it is typically not able to reach the global optimum of the unknown function.

param_distrib = {'max_depth': randint(5, 15),
                 'min_samples_split': randint(200, 800),
                 'learning_rate': uniform(loc=0.05, scale=0.50),
                 'subsample': uniform(loc=0.4, scale=0.6),
                 'max_features': uniform(loc=0.4, scale=0.6),
                 'n_estimators': randint(40, 160)}

rsearch = RandomizedSearchCV(estimator=GradientBoostingRegressor(random_state=10),
                             param_distributions=param_distrib,
                             scoring='neg_mean_absolute_error',
                             n_jobs=4, n_iter=64, cv=5)
rsearch.fit(X_train, y_train)
parameter_over_iterations(rsearch)

Indeed, the RandomSearch approach led to a better hyperparameter selection, with an MAE of 33942.

The selected hyperparameters are:

  • learning_rate: 0.23
  • max_depth: 5
  • max_features: 0.88
  • min_samples_split: 337
  • n_estimators: 125
  • subsample: 0.75

Bayesian Optimization

Now it is time to test the Bayesian Optimization algorithm to tune the model.

As you can see in the script below, in addition to the dictionary where I specify the value range for every hyperparameter, I also pass some values that influence the behavior of the Bayesian algorithm. In particular, I specified two arguments for the acquisition function, xi and kappa. These arguments control how much the acquisition function favors exploration over exploitation: higher values of xi and kappa mean more exploration, while lower values mean more exploitation.

More information on this subject can be found here: https://scikit-optimize.github.io/stable/auto_examples/exploration-vs-exploitation.html?highlight=kappa

from skopt import BayesSearchCV
from skopt.space import Real, Integer

optimizer_kwargs = {'acq_func_kwargs': {"xi": 10, "kappa": 10}}

space = {'max_depth': Integer(5, 15),
         'learning_rate': Real(0.05, 0.55, "uniform"),
         'min_samples_split': Integer(200, 800),
         'subsample': Real(0.4, 1, "uniform"),
         'max_features': Real(0.4, 1, "uniform"),
         'n_estimators': Integer(40, 160)}

bsearch = BayesSearchCV(estimator=GradientBoostingRegressor(random_state=10),
                        search_spaces=space,
                        scoring='neg_mean_absolute_error',
                        n_jobs=4, n_iter=64, cv=5,
                        optimizer_kwargs=optimizer_kwargs)
bsearch.fit(X_train, y_train)
parameter_over_iterations(bsearch)

If we look at the hyperparameters across the iterations, we indeed see that for some parameters the values tend to become more stable over time. This indicates that the algorithm is converging toward optimal values. However, it seems that the algorithm gets stuck in a local optimum for quite a few iterations (suggesting that more exploration would be beneficial). Indeed, if we look at the learning rate, we see that between iteration 10 and iteration 53 the value barely changes, while the best result is achieved with a learning rate of 0.19.
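
To inspect this kind of behavior, scikit-optimize also ships a convergence plot. Assuming a version of the library that exposes the optimizer_results_ attribute on BayesSearchCV, something like the following shows the best objective value found after each iteration:

from skopt.plots import plot_convergence

# one OptimizeResult per search space; here there is only one
plot_convergence(bsearch.optimizer_results_[0])
plt.show()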

It seems that this method would benefit from more iterations; nevertheless, it was able to find a set of hyperparameters leading to an MAE of 32902 (a side-by-side comparison sketch follows the list below):

  • learning_rate: 0.19
  • max_depth: 7
  • max_features: 0.73
  • min_samples_split: 548
  • n_estimators: 134
  • subsample: 0.86
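
For completeness, here is a short sketch that puts the three searches side by side on the same held-out data. It assumes the X_test and y_test split shown earlier, which the original article does not display explicitly:

for name, search in [('GridSearch', gsearch), ('RandomSearch', rsearch), ('Bayesian', bsearch)]:
    # best_estimator_ is refit on the whole training set with the best hyperparameters found
    test_mae = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))
    print(f'{name}: test MAE = {test_mae:.2f}')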

Conclusion

“Is Bayesian Optimization worth it?”

In this short (and definitely not exhaustive) example, we could clearly see that GridSearch is the worst method for exploring the hyperparameter space, while both RandomSearch and Bayesian Optimization performed well. However, especially when the number of hyperparameters is high, RandomSearch might not deliver the best result, whereas Bayesian Optimization potentially will, since it informs future hyperparameter selections with the performances observed so far. Bayesian Optimization should also take fewer iterations to get close to the global optimum, while Random search might need a large number of iterations to get there.
