Narrowing the Search: Which Hyperparameters Really Matter?

Studying Hyperparameter Importance to Speed Up Optimization

Aimee Coelho
data from the trenches
8 min read · May 7, 2020


When performing hyperparameter optimization, we are really searching for the best model we can find within our time constraints. If we searched long enough, with random search we would get arbitrarily close to the optimum model. However, the more hyperparameters we have to tune, the larger our search space and the more infeasible this becomes, especially when we are working with larger real-world datasets and each model we test can take hours to train.

This is why, in AutoML, the best search strategies are those that can quickly home in on the most promising regions of the search space, and therefore find better models in the limited amount of time available to train and test models.

Hyperparameter optimization feels like figuring out a map on a budget (Credit Pexels)

One way to refine the search space is to study which hyperparameters are most ‘important’ and focus on them. For a given machine learning task it is likely that changing the values of some hyperparameters will make a much larger difference to the performance than others. Tuning the value of these hyperparameters can therefore bring the greatest benefits.

By studying the importance of the different hyperparameters over many datasets, we can try to meta-learn good values for the hyperparameters. For the most important hyperparameters we hope to reduce the search range and focus on the best values, while for the least important ones, we could fix them to a single value and thus remove entire dimensions from a grid search.

This is exactly what Jan N. van Rijn and Frank Hutter propose in their paper “Hyperparameter Importance Across Datasets”[1].

We think this is a really interesting approach to crafting a more targeted hyperparameter search when you have limited resources, and it has the potential to offer valuable insights to data scientists as well, so we decided to test it for ourselves on XGBoost to see what insights we could glean.

How Can You Identify Which Hyperparameters Are Important?

To determine which hyperparameters have the largest impact on the model performance, first you need to collect an unbiased dataset of performance evaluations with different combinations of hyperparameters by doing a random search on many different datasets.

Then, for each dataset you can use functional ANOVA [2] to decompose the variance in the performance metric and attribute it to each of the hyperparameters and their interactions. This is a quantitative measure of the importance of each hyperparameter.
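
As a concrete sketch of this step, the snippet below runs functional ANOVA on an illustrative random-search log for a single dataset. It assumes the open-source fanova package from the AutoML group (pip install fanova); the arrays X and Y are hypothetical placeholders for the evaluated configurations and their scores, not real experiment data.

```python
# Minimal sketch: attribute performance variance to hyperparameters with fANOVA.
# Assumes the automl/fanova package; X and Y below are random placeholders
# standing in for one dataset's random-search evaluations.
import numpy as np
from fanova import fANOVA

rng = np.random.default_rng(0)
X = rng.random((225, 4))   # 225 configurations x 4 hyperparameters (scaled to [0, 1])
Y = rng.random(225)        # the metric for each configuration, e.g. cross-validated AUC

f = fANOVA(X, Y)

# Variance contribution of each hyperparameter on its own
for i in range(X.shape[1]):
    print(f"hyperparameter {i}: {f.quantify_importance((i,))}")
```

Passing a tuple of two indices to quantify_importance queries the interaction between a pair of hyperparameters in the same way.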

Figure 1: From the original paper, the marginal contributions over all datasets for SVM (RBF kernel) (left), showing that the hyperparameter gamma was the most important, and for AdaBoost (right), showing that max depth was the most important hyperparameter. [1]

When you aggregate these variance contributions or importances over all the datasets, you can identify which hyperparameters are generally more important. To illustrate this point, in figure 1 the violin plots show the distribution of importances for each hyperparameter, ordered by the median. On the left are the hyperparameters for SVM with the RBF kernel where gamma is shown to generally be the most important hyperparameter, and on the right for the AdaBoost algorithm, max depth was found to be generally the most important hyperparameter.
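
For reference, a plot like figure 1 can be assembled from the per-dataset fANOVA outputs along these lines; the dictionary of importances below is a made-up placeholder, whereas in practice it would hold one variance contribution per dataset for each hyperparameter.

```python
# Sketch: aggregate per-dataset importances into a violin plot ordered by median.
# The values here are placeholders; real ones would come from fANOVA runs.
import numpy as np
import matplotlib.pyplot as plt

importance_by_hp = {
    "min samples leaf": [0.45, 0.60, 0.30, 0.52],
    "max features":     [0.20, 0.10, 0.25, 0.18],
    "n estimators":     [0.05, 0.12, 0.08, 0.06],
}

order = sorted(importance_by_hp,
               key=lambda hp: np.median(importance_by_hp[hp]), reverse=True)
plt.violinplot([importance_by_hp[hp] for hp in order], showmedians=True)
plt.xticks(range(1, len(order) + 1), order, rotation=30)
plt.ylabel("variance contribution")
plt.tight_layout()
plt.show()
```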

Hyperparameter Importance Density Distributions

To study whether there are clear best values for each of these hyperparameters across all the datasets, the ten best models found during the random search on each dataset are selected, and for each hyperparameter independently a one-dimensional kernel density estimator is fitted to these best evaluations to create density distributions. Examples of these density distributions for the algorithms studied in the paper are shown in figure 2. Because these distributions were later used as sampling priors for Hyperband, they are referred to as priors.
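
The density estimation itself is short; here is a sketch using scipy's gaussian_kde (the paper does not tie itself to a particular KDE implementation), with a placeholder array standing in for the pooled top-ten values of one hyperparameter.

```python
# Sketch: a 1-D density ("prior") over the best observed values of one hyperparameter.
# best_values is a placeholder for the top-10 values pooled across all datasets.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

best_values = np.array([1, 1, 2, 2, 2, 3, 4, 5, 8, 12, 30, 45], dtype=float)

kde = gaussian_kde(best_values)
grid = np.linspace(best_values.min(), best_values.max(), 200)

plt.plot(grid, kde(grid))
plt.xlabel("min samples leaf")
plt.ylabel("density")
plt.show()
```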

Figure 2: Examples of density distributions of the top ten evaluations from all datasets for some of the hyperparameters of the algorithms studied in the paper: random forest, AdaBoost, SVM (RBF kernel), and SVM (sigmoid).

More Random Forest Hyperparameters

Here at Dataiku we build the Dataiku Data Science Studio (Dataiku DSS), a collaborative data science platform that allows coders to collaborate with non-coder profiles to explore, prototype, build, and deliver their own data and AI products together. Dataiku DSS offers the possibility to train machine learning algorithms through a visual interface.

We were particularly interested in the hyperparameters that we expose in the visual machine learning interface of Dataiku DSS, so we ran the experiments again while searching over this different set of hyperparameters. We also extended the range for min samples leaf, as we expected that higher values would perform better for larger datasets. We evaluated 225 models for each dataset, and each combination of hyperparameters was evaluated using 3-fold cross-validation with AUC as the metric.

The hyperparameters and the value ranges we searched over were as follows (a sketch of this search appears after the list):

  • Min Samples Leaf: 1–60
  • Max Features: 0.1–1.0
  • Max Depth: 6–20
  • Number of Estimators: 10–200
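
A sketch of how such a random search could be set up with scikit-learn is shown below; the 225 evaluations, the ranges, and the 3-fold AUC scoring come from our setup above, while the choice of distribution objects is illustrative.

```python
# Sketch: 225 random configurations of the four hyperparameters above,
# each scored with 3-fold cross-validated AUC.
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "min_samples_leaf": randint(1, 61),      # 1-60
    "max_features":     uniform(0.1, 0.9),   # 0.1-1.0 (fraction of features)
    "max_depth":        randint(6, 21),      # 6-20
    "n_estimators":     randint(10, 201),    # 10-200
}

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=225,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
# search.fit(X, y)  # X, y: features and binary target of one dataset
```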

We again found the most important hyperparameter to be min samples leaf, followed by max features and number of estimators; the new violin plot can be seen in figure 3.

Figure 3: The violin plot of importances for this set of hyperparameters for random forest, showing min samples leaf to be by far the most important hyperparameter generally.

We found that min samples leaf had a higher importance than in the original experiments because the range was larger, and very high values could in some cases seriously degrade performance. However, lower values were not always better, as can be seen by comparing the two datasets in figure 4.

Figure 4: The marginal performance of min samples leaf for random forest on the mice protein dataset and on the analcatdata DMFT dataset, where higher values of min samples leaf are better.

We created the density distributions to explore the best values of this hyperparameter over the full set of datasets and recovered a similar shape to that in the original experiments. This is shown in figure 5.

Figure 5: One-dimensional density plots of the 10 best values of min samples leaf across all datasets, from the original paper (left) and for our new experiments on OpenML-CC18 data with different hyperparameters (right).

XGBoost Experiments

XGBoost is an algorithm with a very large number of hyperparameters. We used the implementation with the scikit-learn API, which reduces the number of parameters you can change, and we restricted our study to those available to tune in Dataiku DSS. The hyperparameters and the ranges we chose to search over are:

  • Booster: gbtree, dart
  • CPU tree method: exact, hist, approx
  • Max Depth: 3–20
  • Learning Rate: 0.01–0.5
  • Gamma: 0.0–1.0
  • Alpha: 0.0–1.0
  • Lambda: 0.0–1.0
  • Min child weight: 0.0–5.0
  • Subsample: 0.5–1.0
  • Colsample by tree: 0.5–1.0
  • Colsample by level: 0.5–1.0

The fANOVA technique relies on training a model to predict performance from the hyperparameter values; as detailed in the paper [1], this requires a minimum number of evaluations, which increases with the number of dimensions in the search. Since we were searching over 11 hyperparameters, we increased the number of evaluations per dataset from 225 to 1500 to improve the quality of the underlying model. All models were evaluated on AUC averaged over 3-fold cross-validation, and training was stopped after 4 early stopping rounds.
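
Mapped onto xgboost's scikit-learn wrapper, that search space could look roughly like the sketch below; note that alpha and lambda are exposed there as reg_alpha and reg_lambda, and early stopping needs a held-out evaluation set, which we leave out of this snippet.

```python
# Sketch: the XGBoost search space above, expressed for the scikit-learn wrapper,
# with 1500 random configurations scored by 3-fold cross-validated AUC.
from scipy.stats import randint, uniform
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "booster":           ["gbtree", "dart"],
    "tree_method":       ["exact", "hist", "approx"],
    "max_depth":         randint(3, 21),       # 3-20
    "learning_rate":     uniform(0.01, 0.49),  # 0.01-0.5
    "gamma":             uniform(0.0, 1.0),
    "reg_alpha":         uniform(0.0, 1.0),    # "alpha"
    "reg_lambda":        uniform(0.0, 1.0),    # "lambda"
    "min_child_weight":  uniform(0.0, 5.0),
    "subsample":         uniform(0.5, 0.5),    # 0.5-1.0
    "colsample_bytree":  uniform(0.5, 0.5),    # 0.5-1.0
    "colsample_bylevel": uniform(0.5, 0.5),    # 0.5-1.0
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions,
    n_iter=1500,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
# search.fit(X, y)  # early stopping (4 rounds) is configured separately in xgboost
```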

Figure 6 shows the violin plot of the importances over all datasets: learning rate was found to be generally the most important, followed by subsample and min child weight.

Figure 6: A violin plot showing the most important hyperparameters and interactions for XGBoost over all datasets with learning rate generally the most important.

Important Hyperparameters for XGBoost

From our experiments we found that:

  • For learning rate, higher values of 0.2 and above tended to perform better.
  • For subsample, the density is higher above 0.8.
  • For min child weight, there is increased density at lower values. A setting of 0 means no minimum is required on the sum of the instance weights in a child, and therefore no regularization from this parameter; it can also make training much slower. Setting this parameter to 1 (the default value) therefore represents a good trade-off between performance and training time.

Figure 7: The density distributions of the ten best values over all datasets for the hyperparameters learning rate, subsample and min child weight for XGBoost.

What Are Our Takeaways From This Experiment?

This is a very interesting technique that can help data scientists build intuition about the effect of changing the hyperparameters of an algorithm, not only in general, as in studies over benchmark sets of datasets, but also for a particular task, since the effect is data-dependent.

It can help you tune your algorithms faster by focusing on the hyperparameters that bring the largest performance improvement. Generally, that is min samples leaf for random forest, and learning rate, subsample, and min child weight for XGBoost.

Why Is Min Samples Leaf for Random Forest Almost Always Better Set So Low?

While doing this study, we were surprised to find that higher values of min samples leaf did not bring higher performance as often as we expected. This has already been discussed in the scikit-learn community, and it may be due to the way the parameter is implemented in scikit-learn. While one might expect this parameter to limit the depth of the trees, in reality it simply does not consider splits that would leave fewer than the minimum number of samples in a leaf, so the algorithm may choose a worse split instead of halting the splitting.
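
A quick way to see the effect of this setting is to inspect the trees scikit-learn builds for different values; the synthetic dataset below is purely illustrative. The depth typically still shrinks as min samples leaf grows, but only as a side effect of vetoing small-leaf splits, not because of an explicit depth cap.

```python
# Illustration: min_samples_leaf does not cap depth directly; it only rejects
# splits that would leave fewer than that many samples in a leaf.
# Synthetic data, purely for demonstration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for leaf in (1, 20, 60):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X, y)
    print(f"min_samples_leaf={leaf:>2}  depth={tree.get_depth():>2}  "
          f"leaves={tree.get_n_leaves()}")
```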

How to Set the Number of Estimators?

We found the number of estimators for random forest to be only the third most important hyperparameter. We were therefore interested in finding a reasonable number of estimators that is likely to give good performance.

By looking at the marginal performance for each dataset as well as the distribution of best values across all datasets, we see that while 10 estimators frequently led to lower performance, once the number of estimators reached 100 the performance was relatively stable. Figure 8 shows an example of the characteristic shape of the marginal performance on the left and the density distribution of best values over all datasets on the right.

Given that the training time increases with the number of estimators, a sensible default value for this hyperparameter would appear to be 100. This provides empirical evidence for the change of the default from 10 to 100 estimators in version 0.22 of scikit-learn.

Figure 8: An example marginal performance for the number of estimators parameter for random forest on the madelon dataset, and the density distribution of the best values of n estimators over all datasets.

The Importance of a Good Benchmarking Set

Finally, we found that even though the OpenML-CC18 benchmark set [3, 4] was designed to be more challenging, some datasets were still too easy. Tuning the algorithm had little effect on these tasks, so they were removed from the results. We are working on further curating our benchmark set of datasets to be more challenging.

References

[1] van Rijn, J. N. & Hutter, F. Hyperparameter Importance Across Datasets. (2017). doi:10.1145/3219819.3220058

[2] Hutter, F., Hoos, H. & Leyton-Brown, K. An efficient approach for assessing hyperparameter importance. 31st Int. Conf. Mach. Learn. ICML 2014 2, 1130–1144 (2014).

[3] Casalicchio, G., Hutter, F., Mantovani, R. G. & Vanschoren, J. OpenML Benchmarking Suites and the OpenML100. 1–6 arXiv:1708.03731v1

[4] Casalicchio, G., Hutter, F. & Mantovani, R. G. OpenML Benchmarking Suites. 1–6 arXiv:1708.03731v2
