Image Credit: https://github.com/fmfn/BayesianOptimization

Introduction to Automatic Hyperparameter Optimization with Hyperopt

What are the parameters & hyperparameters of a machine learning model?

Rakesh Sukumar

Mar 19, 2020 · 11 min read

Every machine learning model has a set of parameters and hyperparameters. Parameters are values that are learned directly from the training data, such as the coefficients in a linear/logistic regression model, the split variables and split values in a tree-based model, or the weights in a neural network. Parameter values are estimated by the machine learning algorithm during training, without any manual involvement.

Hyperparameters are values that determine the complexity of a machine learning model. An optimal choice of hyperparameters ensures that the model is neither too flexible, where it picks up the noise in the training data (over-fitting), nor too rigid, where it may lose important signals in the training data (under-fitting). Unlike parameters, hyperparameters cannot be learned directly from the training data and need to be set by the machine learning practitioner. The number of predictors used in random forest/gradient boosting models, the learning rate in neural networks, and the amount of regularization to apply are a few examples of hyperparameters. Hyperparameters are usually set by iteratively training the model for different sets of hyperparameter values and evaluating the model’s performance on a held-out validation set or using cross-validation.

Methods of Hyperparameter Tuning

The following are the common methods of hyperparameter tuning. In all of these methods, the hyperparameter values that give the best performance on a validation set are chosen:

  • Manual Tuning: The machine learning practitioner sets hyperparameter values based on domain knowledge, possibly trying several sets of values before choosing the best one.
  • Grid Search: Models are tuned over a predefined grid of hyperparameter values set by the practitioner. All possible combinations of hyperparameter values defined by the grid are evaluated to select the best set.
  • Random Search: The machine learning practitioner provides a probability distribution of values to be evaluated for each hyperparameter. A specified number of samples are drawn from these distributions and the performance of the model is evaluated for each sample.
  • Bayesian Optimization: Bayesian optimization keeps track of the results of previous evaluations to choose the next set of hyperparameter values to be evaluated. We will discuss Bayesian optimization in detail in the following sections.

Bayesian Optimization

Bayesian Optimization (also known as Sequential Model-Based Optimization, SMBO) uses the results of past evaluations to form a probabilistic model of the objective function and uses this model to choose the next set of hyperparameter values. The probabilistic model is called the “surrogate model” and is represented by p(y|x), where y is the performance metric for the model and x is the set of hyperparameter values. Here, the objective function is the function that maps hyperparameter values to the model’s chosen performance metric (such as RMSE, accuracy, or ROC AUC) on a validation set (or using cross-validation). The surrogate model is much easier to optimize than the actual objective function. The surrogate-model approach is useful in scenarios where the actual objective function is too expensive to evaluate in terms of time or money (Bayesian optimization techniques were originally developed for applications such as oil exploration). The basic steps in Bayesian optimization are:

  1. Build a surrogate model of the objective function using the results of past evaluations.
  2. Find the hyperparameters that perform best on the surrogate model.
  3. Evaluate the actual objective function (i.e. train the model & evaluate the performance metric) with the hyperparameter values selected in step 2.
  4. Update the surrogate model by adding the new results obtained in step 3.
  5. Repeat steps 2 to 4 until a stopping criterion, such as a maximum number of iterations or a time budget, is reached.
  6. Select the best performing hyperparameters from all the trials.

A number of variants of Bayesian optimization (hyperopt being one of them) are available for hyperparameter optimization of machine learning models. These methods differ in the type of surrogate model they build and in the criterion they optimize in step 2 of the algorithm above. Hyperopt uses the Tree-structured Parzen Estimator (TPE) as the surrogate model and Expected Improvement (EI) as the criterion for optimizing the surrogate model.

Tree-structured Parzen Estimators (TPE): Instead of directly modeling p(y|x), TPE models p(x|y) and p(y). TPE defines p(x|y) using two densities:

p(x|y) = l(x) if y < y*, and p(x|y) = g(x) if y ≥ y*

where l(x) is the density formed from the observations xᵢ whose corresponding loss (i.e. the performance metric for the model) is less than some threshold y*, and g(x) is the density formed from the remaining observations. Here we assume that we want to minimize the performance metric (such as an RMSE loss). If we instead wish to use a metric that should be maximized (such as accuracy, F1 score, or ROC AUC), we simply take the negative of the metric and minimize that. The TPE algorithm chooses y* as some quantile γ of the observed values, so that p(y < y*) = γ.

Expected Improvement (EI): Expected Improvement is the expectation, under the surrogate model, of the amount by which f(x) will improve on (i.e. fall below) the threshold y*:

EI(x) = ∫ (y* − y) p(y|x) dy, where the integral runs over y < y*.

Under the TPE algorithm, it can be shown that maximizing EI amounts to choosing x values that minimize g(x)/l(x), i.e. we prefer points with high probability under l(x) and low probability under g(x). At each iteration, the algorithm draws several candidate samples from l(x), evaluates them in terms of g(x)/l(x), and returns the candidate with the highest EI¹.

Hyperopt

As explained above, Hyperopt uses the Tree-structured Parzen Estimator to build the surrogate model and Expected Improvement as the optimization criterion. In addition, hyperopt requires the machine learning practitioner to do the following:

  1. Define the objective function that maps the hyperparameter values to model’s chosen performance metric.
  2. Define a configuration space. A configuration space describes the domain (i.e. the probability distributions for the hyperparameters) over which Hyperopt is allowed to search. This lets the machine learning practitioner encode domain expertise to help hyperopt identify the best hyperparameter values. A number of options are available in the hyperopt library for describing these probability distributions.
  3. Choose a search algorithm. Hyperopt currently supports the TPE algorithm and random search.
  4. Specify a trials object to store intermediate results. This is optional.

The hyperopt.hp module defines several hyperparameter distributions that can be used to specify the configuration space. Available options are:

  • hp.choice(label, options): Returns one of the options. Options can be a list or tuple, or even a nested expression. The nested expression format allows us to specify conditional hyperparameters, which comes in handy if we need to optimize across different types of machine learning models, for example between boosting and random forest models, or between different architectures of neural network models. We will see a code example that illustrates this option.
  • hp.pchoice(label, p_options): Similar to hp.choice() but with a probability specified for each option.
  • hp.uniform(label, low, high): A uniform distribution between low and high (both ends included).
  • hp.quniform(label, low, high, q): Distribution given by q*round(uniform(low, high)/q). Suitable for hyperparameters that take discrete values.
  • hp.loguniform(label, low, high): Distribution given by exp(uniform(low, high)). The hyperparameter is constrained to the interval [eˡᵒʷ, eʰⁱᵍʰ].
  • hp.normal(label, mu, sigma): A normally distributed variable. While optimizing, this hyperparameter is unconstrained.
  • hp.qnormal(label, mu, sigma, q): Distribution given by q*round(normal(mu, sigma)/q). This variable is also unconstrained.
  • hp.lognormal(label, mu, sigma): Distribution given by exp(normal(mu, sigma)). This variable is constrained to be positive.
  • hp.qlognormal(label, mu, sigma, q): Distribution given by q*round(exp(normal(mu, sigma))/q). This variable is constrained to be positive.
  • hp.randint(label, upper): Returns a random integer in the range [0, upper). This distribution assumes no correlation in loss function between nearby integer values (Eg. for random seed values).

Hyperopt in Action

Let’s use the Bank Marketing data from the UCI Machine Learning Repository to demonstrate how hyperopt works. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. The target variable is whether the client subscribed to a term deposit with the bank. A number of variables related to the client, the marketing campaign, previous contact with the client, and socio-economic indicators are available as explanatory variables. Four datasets are available on the UCI site; we will use the file “bank-additional-full.csv”. You can read more about the dataset here. We will use Google Colab to run our code. The code file is uploaded here.

We see that the dataset is unbalanced, with just 4640 of the 41188 clients in the dataset subscribing to the term deposit. As per the dataset information on the UCI website, the “duration” variable, which records the duration of the call with the client, is highly correlated with the target. Since this information is not available to the bank before making the call, we will discard this variable. Also note that the value “999” in “pdays” is actually a missing-value indicator, meaning that the client was not contacted before by the bank. Let’s drop the “duration” variable and copy our target to a new variable y. We will also create a binary variable for “previously not contacted by the bank” and replace “999” in “pdays” with nan.
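A minimal sketch of these preprocessing steps is shown below (column names follow the UCI file; the exact code in the notebook may differ):

import numpy as np
import pandas as pd

# Load the dataset (the UCI file is semicolon-separated)
df = pd.read_csv("bank-additional-full.csv", sep=";")

# Copy the target to y and drop the leaky "duration" variable
y = (df["y"] == "yes").astype(int)
df = df.drop(columns=["y", "duration"])

# "999" in pdays is a missing-value indicator meaning the client was
# never contacted before; flag it explicitly and replace 999 with NaN
df["not_contacted_before"] = (df["pdays"] == 999).astype(int)
df["pdays"] = df["pdays"].replace(999, np.nan)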

Next, we split the data into training and test sets and perform label encoding for categorical variables.
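Continuing from the snippet above, a sketch of the split and encoding steps (the split fraction and random seed here are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)

# Label-encode the categorical columns (fit on the training split only)
for col in X_train.select_dtypes(include="object").columns:
    le = LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])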

Let’s fit an XGBoost model as a baseline. We will use ROC AUC as the evaluation metric since we have an unbalanced classification problem. Note that we haven’t used early stopping here.
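A sketch of the baseline fit; the hyperparameter values shown here are illustrative, not necessarily the ones used in the notebook:

import xgboost as xgb
from sklearn.metrics import roc_auc_score

dtrain = xgb.DMatrix(X_train, label=y_train)   # XGBoost handles the NaN in pdays natively
dtest = xgb.DMatrix(X_test, label=y_test)

baseline_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "eta": 0.1,
    "max_depth": 6,
}
bst = xgb.train(baseline_params, dtrain, num_boost_round=100)
print("Baseline test ROC AUC:", roc_auc_score(y_test, bst.predict(dtest)))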

Next, we will use hyperopt to optimize the hyperparameters for this model. First we will define a function that computes ROC AUC on the test set for a given set of hyperparameter values. Note that we return a dictionary as output from this function. Hyperopt minimizes the ‘loss’ value in the output dictionary, hence we return -1 * ROC AUC as the loss.
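A sketch of such an objective function (the function and dictionary names follow the article; the exact hyperparameters handled in the notebook may differ):

from hyperopt import STATUS_OK

# Hyperparameters we do not wish to tune; anything passed in `params`
# with the same name takes precedence
default_params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "num_round": 100,
}

def hyperopt_xgb_train(params):
    # Train an XGBoost model for the given hyperparameters and return
    # -1 * ROC AUC on the test set as the loss to be minimized
    full_params = {**default_params, **params}
    num_round = int(full_params.pop("num_round"))
    bst = xgb.train(full_params, dtrain, num_boost_round=num_round)
    auc = roc_auc_score(y_test, bst.predict(dtest))
    return {"loss": -auc, "status": STATUS_OK, "auc": auc}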

default_params are hyperparameters that we do not wish to tune. A value in default_params will be ignored if we pass the same hyperparameter in the “params” argument to the hyperopt_xgb_train() function. Note that we are not tuning “num_round” here; instead we will use early stopping when we fit the final model. Let’s test this function with the hyperparameter values we used before.
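For example, calling it with the baseline values from the sketch above:

# Sanity check with the baseline hyperparameter values used earlier
print(hyperopt_xgb_train({"eta": 0.1, "max_depth": 6}))
# prints a dict like {'loss': ..., 'status': 'ok', 'auc': ...}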

Next, we will define two functions that will help us visualize the configuration space (i.e. the probability distributions) for the hyperparameters being tuned. We will visualize both of the following:

  • The probability distribution that we define and provide as input to hyperopt based on our domain knowledge (plot_params_space()).
  • The distribution of hyperparameter values that hyperopt actually tries based on the previous evaluation history (plot_params_tried()). This function also returns a dataframe with the results of all trials.

This will help us get a clear understanding of the workings of hyperopt. Note that the plotting functions defined below do not work with nested configuration spaces.
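A sketch of what such plotting helpers might look like (illustrative implementations, not the exact functions from the notebook; they assume a flat, dictionary-style configuration space):

import matplotlib.pyplot as plt
from hyperopt.pyll import stochastic

def plot_params_space(space, n_samples=1000):
    # Sample the configuration space and plot a histogram per hyperparameter
    samples = pd.DataFrame([stochastic.sample(space) for _ in range(n_samples)])
    samples.hist(bins=30, figsize=(10, 6))
    plt.suptitle("Configuration space (prior distributions)")
    plt.show()

def plot_params_tried(trials):
    # Plot the hyperparameter values hyperopt actually evaluated and
    # return a dataframe with one row per trial
    rows = []
    for t in trials.trials:
        row = {k: v[0] for k, v in t["misc"]["vals"].items() if v}
        row["loss"] = t["result"]["loss"]
        rows.append(row)
    results = pd.DataFrame(rows)
    results.drop(columns="loss").hist(bins=30, figsize=(10, 6))
    plt.suptitle("Hyperparameter values tried by hyperopt")
    plt.show()
    return results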

Let’s define the configuration space for the xgb model and visualize the same.
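A sketch of a possible configuration space (the hyperparameters chosen and their ranges here are illustrative assumptions):

from hyperopt import hp
from hyperopt.pyll import scope

space = {
    "eta": hp.loguniform("eta", np.log(0.01), np.log(0.3)),
    "max_depth": scope.int(hp.quniform("max_depth", 3, 10, 1)),
    "min_child_weight": hp.quniform("min_child_weight", 1, 10, 1),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
}

plot_params_space(space)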

Note the use of the pyll.scope.int() function to convert max_depth values to integer type. The hyperopt.pyll.scope module allows us to use custom functions when defining the probability space for hyperparameters. See the hyperopt paper² for more examples.

We will define a trials object to store the results of all evaluations that hyperopt makes. We will also use this object to plot the results of the evaluations. If a trials object is not provided, the fmin() function below will return only the best hyperparameter values.

The fmin() function is the workhorse that performs the hyperparameter optimization. The first argument to fmin() is the objective function and the second argument is the configuration space. The other arguments to fmin() are listed below, followed by a sketch of the call:

  • algo: specifies the optimization algorithm to use. Available options are tpe.suggest (Bayesian optimization with the TPE algorithm) and rand.suggest (random search).
  • max_evals: Number of evaluations to be carried out.
  • trials: Trials() object to save results of all evaluations carried out by hyperopt. The .trials attribute of the Trials object is a list with an element for every evaluation made by fmin.
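A sketch of the call (max_evals here is an assumed value):

from hyperopt import fmin, tpe, Trials, space_eval

trials = Trials()            # stores the full evaluation history

best = fmin(
    fn=hyperopt_xgb_train,   # objective function
    space=space,             # configuration space
    algo=tpe.suggest,        # TPE-based Bayesian optimization (rand.suggest for random search)
    max_evals=100,           # number of evaluations to run
    trials=trials,
)

# space_eval maps the raw values in `best` (e.g. hp.choice indices) back to actual values
best_params = space_eval(space, best)
print(best_params)
print("Best ROC AUC:", -min(trials.losses()))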

We get a best ROC AUC score of about 0.82 for the above hyperparameter values. Note the use of hyperopt’s space_eval() function to get the actual hyperparameter values from best. space_eval() is required because best will contain integer indices for any hyperparameter that is defined by hp.choice(). Let’s plot the results of the evaluations done by hyperopt.

We can see that the distribution of values evaluated by hyperopt is very different from the distributions we defined. Let’s also plot a scatter plot of hyperparameter values against the loss (the negative of ROC AUC).

The hyperparameter values that gave the best results are:

For the final model, we will increase num_round to 1000 and enable early stopping with 50 rounds.
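A sketch of the final fit, continuing from the snippets above:

final_params = {**default_params, **best_params}
final_params.pop("num_round", None)   # num_boost_round is passed separately below

bst_final = xgb.train(
    final_params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtest, "test")],
    early_stopping_rounds=50,
)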

Note that xgboost.train() will return the model from the last iteration (iteration 147 here), not the best one. We can get the best iteration as shown below:
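For example (when early stopping is enabled, the booster exposes the best round and score):

print("Best iteration:", bst_final.best_iteration)
print("Best test AUC:", bst_final.best_score)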

Nested Configuration Space

Next, we will use a nested configuration space to simultaneously tune an xgboost model and a random forest model & select the best performing model. We will also draw a few samples to see what we get from this nested configuration space.
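A sketch of such a nested space (the model names, hyperparameters, and ranges are illustrative assumptions; note that every hp label must be unique within the space):

nested_space = hp.choice("model_type", [
    {
        "model": "xgb",
        "eta": hp.loguniform("xgb_eta", np.log(0.01), np.log(0.3)),
        "max_depth": scope.int(hp.quniform("xgb_max_depth", 3, 10, 1)),
    },
    {
        "model": "rf",
        "n_estimators": scope.int(hp.quniform("rf_n_estimators", 100, 1000, 50)),
        "max_depth": scope.int(hp.quniform("rf_max_depth", 3, 20, 1)),
    },
])

# Draw a few samples to see what this nested space produces
for _ in range(3):
    print(stochastic.sample(nested_space))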

Next, we will create an objective function for this nested configuration space. We will replace the missing values in the pdays column with the median for the random forest model.
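A sketch of that objective function, under the nested space defined above (the median imputation and model settings here are assumptions):

from sklearn.ensemble import RandomForestClassifier

# scikit-learn's random forest cannot handle NaN, so impute pdays with the
# training-set median for that branch of the search
pdays_median = X_train["pdays"].median()
X_train_rf = X_train.fillna({"pdays": pdays_median})
X_test_rf = X_test.fillna({"pdays": pdays_median})

def nested_objective(params):
    params = dict(params)
    model_type = params.pop("model")
    if model_type == "xgb":
        full_params = {**default_params, **params}
        num_round = int(full_params.pop("num_round"))
        bst = xgb.train(full_params, dtrain, num_boost_round=num_round)
        auc = roc_auc_score(y_test, bst.predict(dtest))
    else:
        rf = RandomForestClassifier(n_jobs=-1, random_state=42, **params)
        rf.fit(X_train_rf, y_train)
        auc = roc_auc_score(y_test, rf.predict_proba(X_test_rf)[:, 1])
    return {"loss": -auc, "status": STATUS_OK, "model": model_type}

best_nested = fmin(fn=nested_objective, space=nested_space,
                   algo=tpe.suggest, max_evals=100, trials=Trials())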

Find the code file uploaded here.

References

  1. Bergstra, James; Bardenet, Rémi; Bengio, Yoshua; Kégl, Balázs (2011). Algorithms for Hyper-Parameter Optimization.
  2. Bergstra, James; Yamins, Dan; Cox, David D. Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms.
  3. A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning by Will Koehrsen
  4. Parameter Tuning with Hyperopt by Kris Wright
