# Evaluating Hyperparameters in Machine Learning

In machine learning (ML), a hyperparameter is a parameter whose value is set by the user to control the learning process. This is in contrast to model parameters, whose values are obtained algorithmically during training.

Hyperparameter tuning, or optimization, is often costly, so software packages invariably provide hyperparameter defaults. Practitioners will often tune these, either manually or through some automated process, to gain better performance, resorting to previously reported “good” values or performing hyperparameter-tuning experiments of their own.

In my recent paper, “High Per Parameter: A Large-Scale Study of Hyperparameter Tuning for Machine Learning Algorithms”, I examined the issue of hyperparameter tuning through extensive empirical experimentation, involving many algorithms, datasets, metrics, and hyperparameters. I wished to assess just how much of a performance gain could be had per algorithm by employing a performant tuning method.

My setup included a fairly large number of ingredients:

- 144 classification datasets and 106 regression datasets
- 26 ML algorithms — 13 classifiers and 13 regressors
- 3 separate metrics for classification problems (accuracy, balanced accuracy, F1), and 3 separate metrics for regression problems (R², adjusted R², complement RMSE)
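For reference, the three classification metrics can be computed with scikit-learn's standard implementations. The sketch below uses toy labels, and the support-weighted F1 averaging is my own choice for the multiclass case; the paper may use a different averaging scheme:

```python
# Computing the three classification metrics with scikit-learn (toy labels).
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls
f1 = f1_score(y_true, y_pred, average="weighted")  # multiclass F1, support-weighted
```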
## Optuna

Optuna is a state-of-the-art automatic hyperparameter optimization software framework. It offers a define-by-run user API with which you can dynamically construct the search space, along with efficient sampling and pruning algorithms. Moreover, my experience has shown it to be fairly easy to set up.

Optuna formulates hyperparameter optimization as the process of minimizing or maximizing an objective function that takes a set of hyperparameters as input and returns a (validation) score. It also provides pruning, namely, automatic early stopping of unpromising trials. Optuna has many uses, both in machine learning and in deep learning.

The overall flow of my experimentation was as follows: For each combination of algorithm and dataset, I ran 30 replicate runs. Each replicate assessed model performance over the respective three classification or regression metrics for two cases, separately: default hyperparameters, and hyperparameters obtained through Optuna. I took care to provide both scenarios with equal computational resources, so that comparisons were fair.
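One replicate's default-vs-tuned comparison might look like the following sketch. This is my reconstruction for illustration only; the dataset and algorithm are arbitrary picks, and the "tuned" values are hypothetical stand-ins for an Optuna result.

```python
# Illustrative sketch of one replicate's comparison (not the paper's code):
# score a model with default hyperparameters and with a tuned configuration,
# giving both the same five-fold cross-validation evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Case 1: default hyperparameters.
default_score = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring="accuracy"
).mean()

# Case 2: tuned hyperparameters (hypothetical values standing in for
# Optuna's output).
tuned_score = cross_val_score(
    DecisionTreeClassifier(max_depth=5, min_samples_leaf=4, random_state=0),
    X, y, cv=5, scoring="accuracy",
).mean()

print(f"default={default_score:.3f}  tuned={tuned_score:.3f}")
```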

All told, I ran a total of 96,192 replicates, each consisting of 300 algorithm runs, with the final tally being 28,857,600 algorithm runs (in fact, this number is even higher, since I used five-fold cross validation per run).

In addition to examining the raw results, I devised a “bottom-line” measure, which I called *hp_score*. A higher value means that the algorithm is expected to gain more from hyperparameter tuning, on average, while a lower value means it is expected to gain less.

These are the final hp_scores of the 26 algorithms I tested:

My main takeaway from the study: For most ML algorithms, we should not expect huge gains from hyperparameter tuning *on average*; however, there may be some datasets for which default hyperparameters perform poorly, especially for some algorithms. In particular, the algorithms at the bottom of the hp_score table would likely *not* benefit greatly from a significant investment in hyperparameter tuning. Simply put, some algorithms are robust to hyperparameter selection, while others are far less so.

We might use the hp_score table in two ways:

- Decide how much to invest in hyperparameter tuning of a particular algorithm
- Select algorithms that require less tuning to hopefully save time — as well as energy

Perhaps the main limitation of my work (as with other studies involving hyperparameter experimentation) concerns the somewhat subjective choice of value ranges for the hyperparameters (though I tried to select commonly used ranges). Such a choice is, to some extent, unavoidable in empirical research of this kind. While this limitation cannot be completely overcome, it is offset by the code being publicly available: anyone can extend the experiment and add further findings. Indeed, I hope this will be the case.

Finally, here is the paper link again:

And the GitHub repo: