How to optimize hyperparameters?

Qucit
9 min read · May 31, 2017


At Qucit, we strive to improve the lives of citizens by making cities more efficient and enjoyable. To make this dream a reality, we are building an urban predictive platform. At the core of this platform lies a set of general-purpose machine learning algorithms that, given enough urban data and a phenomenon of interest (say vehicle congestion, parking fraud or shared-car mobility), will predict it at future times. To develop our models, we collect many contextual data streams, run various machine learning experiments and build dashboards and mobile applications to make these models accessible. These models need fine-tuning to work properly, and the tuning step is often laborious, time-consuming and intractable when the model becomes very complex. What are the different strategies commonly employed to solve this problem? Let’s start from the beginning.

A primer on machine learning

Machine learning is a systematic approach to “teach” computers how to automatically learn and extract “insights” from data without relying on hard-coded rules (in contrast to the earlier expert-system approach). In what follows, we will focus our attention on supervised learning. To make things less abstract, here is an example. Let’s say that we want to predict the price of a house. We are given a training dataset where each row contains characteristics of a house (number of rooms, size, number of floors, zip code, construction date…) and its price. The goal of our machine learning model is to reconstruct the price of each house from its features. Once the model is trained, it can predict the price of a new house given its characteristics.
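To make this train-then-predict workflow concrete, here is a minimal sketch using scikit-learn’s `LinearRegression` on a handful of made-up houses. The numbers and feature choices are purely illustrative assumptions, not data from this post.

import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: number of rooms, size in square meters, number of floors
X_train = np.array([[3, 70, 1],
                    [4, 95, 2],
                    [2, 48, 1],
                    [5, 120, 2]])
y_train = np.array([250_000, 340_000, 180_000, 450_000])  # prices

model = LinearRegression().fit(X_train, y_train)   # learn the parameters
print(model.predict([[3, 80, 1]]))                 # predicted price of a new house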

Parameters and hyperparameters

In any ML problem, there are two categories of things to optimize: parameters and hyperparameters. The first category is learned from the training data by making the model fit as closely as possible to that data. The second category relates to the model design, often controls the model complexity, and can’t be learned directly from the training set. As an illustration, let’s again consider the housing example and apply a LASSO model to it. The LASSO is a linear regression model with a penalization term. The loss function we want to optimize is the following:
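(The formula appeared as an image in the original post; for reference, the standard LASSO objective can be written as follows, with y_i the price of house i, x_i its feature vector, β the coefficient vector and λ the penalization strength.)

\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{T}\beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|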

It has two parts: the first is made smaller by varying the `ß` vector whereas the second is controlled via the `λ` term. The `ß` parameters indicate how much the number of rooms, the number of floors and so on contribute to the predicted price; these are the model parameters. The `λ` term controls the number of selected variables: do we consider only the number of rooms and floors, or do we also add the residence zip code when predicting its price? The bigger `λ` is, the fewer features are selected. This is what we call a hyperparameter. For more details about the differences, we refer to this Quora thread.

Model complexity and the curse of overfitting

Obviously, we want our models to perform well. A naive way to improve performance is to make the models more and more complex by adding lots of hyperparameters. However, this approach doesn’t work. In fact, without tuning the hyperparameters, it is easy to fall into the trap of overfitting. This phenomenon is observed when the training and testing errors are far apart. This is what we see on the right side of the graph above, and it happens when the algorithm fits the training data too closely and learns its noise. The algorithm is then unable to generalize, i.e. it can’t predict what happens on new instances of the data. The opposite phenomenon, called underfitting, happens when both the training and testing errors are high. It is what we observe on the left side of the graph above. So how do we avoid these traps while still getting good performance?

Strategies to select hyperparameters

As discussed in the previous paragraph, we would ideally like to select hyperparameters that give the best training performance while still being able to generalize. To do that, ML practitioners often use cross-validation. So how does this technique work?

Cross-validation

Cross-validation is one of the most popular methods to estimate how well an ML model can generalize on unseen data. It consists in dividing the training data into `n-folds` (for example 5 folds as in the graph above). Then, using all but one fold, we train the model and evaluate its performances on the remaining one. This step is repeated `n` times by varying the fold that is used for evaluation. The final score is the average of the `n` scores obtained on each iteration. Applied to the context of hyperparameters, it gives us a method to estimate how well each different model (that differs by its hyperparameters’ values) generalizes and thus select the best one. This evaluation is performed over a set (also called grid) of hyerparameters values. Now that we have our evaluation method, one question remains to be solved: how to choose the hyperparameters grid over which to optimize? Let’s find out.
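Before answering that question, here is a minimal sketch of what cross-validated evaluation of a few candidate hyperparameter values looks like in practice. The model (a LASSO regression) and the candidate values are illustrative assumptions, not code from this post.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

# Evaluate each candidate value of the penalization term with 5-fold CV
for alpha in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(Lasso(alpha=alpha), X, y,
                             cv=5, scoring='neg_mean_squared_error')
    print(alpha, -scores.mean())  # average MSE over the 5 folds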

Choosing the grid points

There are two broad approaches for choosing the hyperparameter values to test. The first consists in choosing each set of values independently: either via a deterministic and exhaustive approach, known as grid search, which tries all possible combinations, or via a randomized approach where grid points are drawn from specified probability distributions. The second and more clever approach builds a model of how the hyperparameter choices affect the evaluation cost: the first values are drawn at random, then the information gathered from each evaluation is used to choose the subsequent ones.
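To make the first family concrete, here is a hedged sketch of grid search and randomized search with scikit-learn; the model and parameter ranges are illustrative assumptions.

from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=500, random_state=0)

# Exhaustive grid search: every combination is evaluated with 5-fold CV
grid_search = GridSearchCV(GradientBoostingRegressor(),
                           param_grid={'n_estimators': [50, 100, 200],
                                       'max_depth': [3, 5, 8]},
                           cv=5).fit(X, y)

# Randomized search: 20 combinations drawn from the specified distributions
random_search = RandomizedSearchCV(GradientBoostingRegressor(),
                                   param_distributions={
                                       'n_estimators': randint(50, 500),
                                       'learning_rate': uniform(0.01, 0.09)},  # uniform on [0.01, 0.1]
                                   n_iter=20, cv=5, random_state=0).fit(X, y)

print(grid_search.best_params_, random_search.best_params_)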

Sequential Model-Based (Global) Optimization

In fact, the second approach uses a general optimization method called “Sequential Model-Based (Global) Optimization” (SMBO for short). It consists in approximating a costly evaluation function with a more tractable “proxy” model. One commonly used approximation method for SMBO is the Tree-structured Parzen Estimators algorithm (TPE for short). It uses Parzen estimators (also known as kernel density estimators) to replace a complex prior with estimated non-parametric densities. For more technical details, we refer to this paper. Fortunately for us, open-source libraries that implement this algorithm exist and help us abstract away the mathematical formulas. We will explore one such example in the following paragraph.
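To give an intuition of the SMBO loop itself, here is a deliberately simplified, runnable sketch. It is not TPE: the surrogate is a random forest and the next point is simply the candidate the surrogate considers best, whereas real implementations use Parzen estimators or Gaussian processes together with a proper acquisition function.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def expensive_loss(x):
    # Stand-in for a costly evaluation (think: cross-validating a model)
    return (x - 3.0) ** 2 + np.random.normal(scale=0.1)

rng = np.random.default_rng(0)
x_history = list(rng.uniform(-10, 10, size=5))   # initial random draws
y_history = [expensive_loss(x) for x in x_history]

for _ in range(20):
    # Fit a cheap surrogate model of the loss on everything seen so far
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(np.array(x_history).reshape(-1, 1), y_history)
    # Ask the surrogate which candidate looks most promising and evaluate it for real
    candidates = rng.uniform(-10, 10, size=1000)
    x_next = candidates[np.argmin(surrogate.predict(candidates.reshape(-1, 1)))]
    x_history.append(x_next)
    y_history.append(expensive_loss(x_next))

print(x_history[int(np.argmin(y_history))])      # should be close to 3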

Hyperopt to the rescue

Hyperopt is an open-source Python library that implements the TPE approach. As stated on its website: “Hyperopt is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.” Here is a complete Hyperopt code example:

# Machine learning imports
import xgboost as xgb
from sklearn.model_selection import (train_test_split, cross_val_score)
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.datasets import make_regression

# Hyperopt imports
from hyperopt import fmin, tpe, STATUS_OK, Trials, hp

# Scientific imports
import numpy as np

MAX_EVALS = 100
SEED = 314  # To make the example reproducible
NUMBER_SAMPLES = 2000
TEST_SIZE = 0.2

hyperopt_hp_grid = {'n_estimators': hp.quniform('n_estimators', 10, 1000, 1),
                    'learning_rate': hp.loguniform('learning_rate', 0.001, 0.1),
                    'max_depth': hp.quniform('max_depth', 3, 15, 1),
                    'gamma': hp.loguniform('gamma', 0.01, 1)}

mse_scorer = make_scorer(mean_squared_error)

# Generate fake regression data
features, targets = make_regression(NUMBER_SAMPLES, random_state=SEED)

# Train/test split
train_features, test_features, train_targets, test_targets = (
    train_test_split(features, targets, test_size=TEST_SIZE,
                     random_state=SEED)
)


def transform_params(params):
    # Map the values returned by Hyperopt back to the intended ranges
    # (log-distributed values come back as exp(value)) and cast the
    # integer hyperparameters to int.
    params["gamma"] = np.log(params["gamma"])
    params["learning_rate"] = np.log(params["learning_rate"])
    params["n_estimators"] = int(params["n_estimators"])
    params["max_depth"] = int(params["max_depth"])
    return params


def loss(params):
    # Cross-validated RMSE of an XGBoost regressor for a given set of
    # hyperparameters; this is the objective that Hyperopt minimizes.
    params = transform_params(params)
    xgb_regressor = xgb.XGBRegressor(silent=False, **params)
    cv_mse = cross_val_score(xgb_regressor, train_features, train_targets,
                             cv=5, verbose=0, n_jobs=4,
                             scoring=mse_scorer)
    rmse = np.sqrt(cv_mse.mean())
    return {'loss': rmse, 'status': STATUS_OK}


def optimize(trials, space):
    best = fmin(loss, space, algo=tpe.suggest,
                trials=trials, max_evals=MAX_EVALS)
    return best


trials = Trials()
hyperopt_optimal_hp = optimize(trials, hyperopt_hp_grid)
hyperopt_optimal_hp = transform_params(hyperopt_optimal_hp)

In order to find the best hyperparameters using Hyperopt, one needs to specify two functions, `loss` and `optimize`, a hyperparameter grid and a machine learning model (here an `XGBoost` regression model, an efficient gradient boosting algorithm). On one hand, the `loss` function defines the evaluation metric used to find the best hyperparameters; it is computed using cross-validation for the `XGBoost` regression model. On the other hand, the `optimize` function specifies the optimization strategy and the search space. Finally, notice that we have introduced a `transform_params` function since Hyperopt doesn’t return the hyperparameter values as expected (`exp(value)` instead of `value` when working with a logarithmic distribution, for example).
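A possible follow-up step, not shown in the snippet above, is to refit an `XGBoost` model on the full training split with the selected hyperparameters and check the error on the held-out test split. Here is a sketch reusing the variables defined above:

# Refit with the selected hyperparameters and evaluate on the test split
best_model = xgb.XGBRegressor(**hyperopt_optimal_hp)
best_model.fit(train_features, train_targets)
test_rmse = np.sqrt(mean_squared_error(test_targets,
                                       best_model.predict(test_features)))
print(test_rmse)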

Predicting flights arrival delays

Among our missions at Qucit, we work on improving transportation systems by providing predictive tools. In particular, we offer the capability of predicting delays, which helps operators better manage their fleets. One area in which delays are hard and expensive to manage is the airline business. As an application of what we have learned in the previous section, we will work with the Airlines Delay dataset. It contains information about delayed US flights and the causes of the delays, with records from 1987 up to 2009. You can get the `DelayedFlights.csv` dataset from Kaggle (you will need to create an account first) or from the following Github repo (it is the zipped file `DelayedFlights.csv.bz2`). To follow along, it is highly recommended to get the accompanying notebook (http://snip.ly/lyzcl). First, let’s load the data:

import pandas as pd


def load_airlines_delays():
    airlines_df = (pd.read_csv('../data/DelayedFlights.csv')
                   .dropna())
    features = airlines_df.drop(['Unnamed: 0', 'ArrDelay', u'CarrierDelay',
                                 u'WeatherDelay', u'NASDelay',
                                 u'SecurityDelay', u'LateAircraftDelay'],
                                axis=1)
    # Don't select the columns having object type
    columns = features.columns[features.dtypes != 'object']
    targets = airlines_df[['ArrDelay']]
    return features.loc[:, columns], targets


features, targets = load_airlines_delays()

The target column we will use is `ArrDelay`. It contains the arrival delay (in minutes) of each flight. As you might have noticed, we have dropped all the columns with `object` type. This is done to simplify data processing. We have also dropped the columns that would leak information about the arrival delay: `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay` and `LateAircraftDelay` sum up to the `ArrDelay` value. Once we have loaded the data, we split it into train and test subsets (we leave the test subset for the final evaluation).

train_features, test_features, train_targets, test_targets = (
    train_test_split(features, targets, test_size=TEST_SIZE,
                     random_state=SEED)
)

We will compare the TPE approach against the grid search method.
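The post does not spell out the grid used for the baseline, so the following `GridSearchCV` sketch over the same four hyperparameters uses illustrative values (note the negated MSE scorer: scikit-learn maximizes the scoring function during model selection).

from sklearn.model_selection import GridSearchCV

# Illustrative baseline grid for the comparison (values are assumptions)
grid_search_hp_grid = {'n_estimators': [10, 50, 100],
                       'learning_rate': [0.05, 0.1, 0.2],
                       'max_depth': [3, 4, 5],
                       'gamma': [0.1, 0.2, 0.5]}

grid_search = GridSearchCV(xgb.XGBRegressor(),
                           param_grid=grid_search_hp_grid,
                           scoring='neg_mean_squared_error',
                           cv=5, n_jobs=4)
grid_search.fit(train_features, train_targets.values.ravel())
grid_search_optimal_hp = grid_search.best_params_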

Results

Running both methods leads to the following results:

| opt_hp | test_rmse | mean_cv_rmse | opt_method |
| --- | --- | --- | --- |
| {'n_estimators': 10, 'learning_rate': 0.2, 'max_depth': 4, 'gamma': 0.2} | 16.270025 | 16.208705 | grid_search |
| {'n_estimators': 783, 'learning_rate': 0.083035126529, 'max_depth': 8, 'gamma': 0.78386596379} | 1.907361 | 1.537235 | hyperopt_tpe |

Compared to grid search, the TPE algorithm gives impressive results on both the test and cross-validation sets: the RMSE scores for the TPE method are roughly 8 to 11 times lower. This shows how important it is to tune hyperparameters well, especially when working with complex models (XGBoost regression in our example). Finally, these results were obtained without much code development for the TPE algorithm (since we use the Hyperopt library).

Conclusion

The approach that we have presented in this blog post seems like magic. It really isn’t. In fact, we had to provide the hyperparameter distributions for the search (the type of each distribution and its support). Moreover, it can’t be distributed as easily as grid search, even though it is possible to do so using MongoDB. Finally, notice that there is another popular alternative to the TPE algorithm that uses Gaussian processes (a generalization of the Gaussian distribution); BayesianOptimization is one such implementation in Python. If you want to learn more about the subject of hyperparameter tuning, I highly recommend reading this blog post. This blog post is also great to learn more about Hyperopt. Stay tuned for our upcoming webinar about this topic. If you want to have a sneak peek at our platform when it is ready, please leave your email for a demo.
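Coming back to the Gaussian-process alternative mentioned above, here is a minimal sketch assuming the `bayes_opt` package (installed as `bayesian-optimization`) and reusing the airline training split from earlier; the tuned hyperparameters and their bounds are illustrative. The package maximizes its objective, so the sketch returns the negative RMSE.

from bayes_opt import BayesianOptimization

def neg_rmse(max_depth, learning_rate):
    # Negative cross-validated RMSE, so that maximizing it minimizes the error
    model = xgb.XGBRegressor(max_depth=int(max_depth),
                             learning_rate=learning_rate)
    cv_mse = cross_val_score(model, train_features,
                             train_targets.values.ravel(),
                             cv=5, scoring='neg_mean_squared_error')
    return -np.sqrt(-cv_mse.mean())

optimizer = BayesianOptimization(f=neg_rmse,
                                 pbounds={'max_depth': (3, 15),
                                          'learning_rate': (0.001, 0.1)},
                                 random_state=SEED)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best score and the corresponding hyperparameters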

About the author

Yassine is a data scientist at Qucit. Among his missions, he makes sure that data science models integrate seamlessly into different internal tools, APIs and user-facing products.

