Fine-Tuning Machine Learning Models: Unleashing Hyperparameter Mastery with Optuna

Said Rasidin
Data Science Indonesia
7 min read · Nov 25, 2023

Introduction

Machine learning models are part of our daily lives now, from the moment we wake up until we go back to sleep: opening social media apps, booking a ride, ordering food, shopping online. Even as I type this, a model is suggesting the next word.

Using machine learning systems in production is therefore common practice today, and those who don't are likely to be left behind in the competition. I recall the saying: humans with AI will replace humans without AI.

Developing well-generalized machine learning models for production is not easy. Many factors affect performance: data issues (quality, inconsistent labels, class imbalance, sampling bias), code issues (tracking, logging, pipeline robustness), and model issues (under-fitting, over-fitting, scalability, explainability).

There are two common approaches to developing a well-generalized ML model: model-centric and data-centric. In this post, I'll focus on the model-centric approach, specifically tuning its hyperparameters.

Hyperparameters are the parameters that control the learning process and must be set when the model is initialized. Different models have different sets of them: in clustering we decide how many clusters to create, while a learning rate controls how large a step the model takes when updating its weights.
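For example (the values here are arbitrary, just to show where hyperparameters live):

from sklearn.cluster import KMeans
from xgboost import XGBClassifier

kmeans = KMeans(n_clusters=7)                # how many clusters to form
booster = XGBClassifier(learning_rate=0.05)  # how large each boosting update step is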

Hyperparameter optimization aims to find the best set of parameters in a vast search space as efficiently as possible in terms of time, compute, and money. In this post, I'll cover practical guides for different kinds of hyperparameter optimization, including:

  • Grid Search
  • Random Search
  • Tree-Structured Parzen Estimator (Bayesian Method)

Ultimately, we'll compare model performance using the hyperparameters obtained from each of these techniques. First, let's find a dataset.

Datasets

I found this Date Fruit dataset, which suits our purpose of exploring hyperparameter search since the data is already clean and all features are numerical.

Photo by engin akyurt on Unsplash

It comprises 35 features and 898 samples across 7 different types of date: not that large, but enough for our experiment.

As the saying goes, a picture is worth a thousand words, so let's visualize the data first. For high-dimensional data, I use a parallel plot to get a first glimpse. Looking at it, area, perimeter, compactness, and roundness seem like strong predictors of the date type.

Parallel Plot High-Dimensional Date Fruit Data by Author
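A quick way to draw such a plot (a minimal sketch; df is assumed to be the dataset loaded into a pandas DataFrame with a Class column):

import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# scale each feature to [0, 1] so that all 35 axes are comparable
features = df.drop(columns="Class")
scaled = (features - features.min()) / (features.max() - features.min())
scaled["Class"] = df["Class"]

parallel_coordinates(scaled, class_column="Class")
plt.show()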

Since we live in 3-D space, it is hard to visualize anything beyond three dimensions, so let's look at a 3-D plot. To do so, we reduce the 35 features down to 3 using PCA (principal component analysis). The date fruit types cluster quite well, except for the blue (Berhi) and purple (Iraqi) classes.

The Dataset in 3D Components by Author
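For reference, the 3-D projection itself only takes a couple of lines (a sketch; X is assumed to hold the 35 numeric feature columns):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize first so large-scale features (area, perimeter) don't dominate
X_scaled = StandardScaler().fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X_scaled)  # (898, 3) array for the 3-D scatter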

Hyperparameters Search

In this experiment, we will use Optuna as our main tool to optimize the hyperparameters. I have tried scikit-learn's GridSearchCV and RandomizedSearchCV, but they give no control over a time budget: if I stop or interrupt the search, the experiment is lost.
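This is one of the reasons to reach for Optuna: a study can be written to persistent storage and resumed later, so an interrupted run is not lost. A minimal sketch (the SQLite file name is an assumption):

import optuna

study = optuna.create_study(
    study_name='hyper-search',
    storage='sqlite:///optuna_study.db',  # every trial is saved to disk
    load_if_exists=True,                  # resume the study if it already exists
    direction='maximize')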

To keep the comparison fair, every method will tune the same model and the same set of hyperparameters.

  • Model: XGBoost
  • Parameters to optimize: n_estimators, learning_rate, max_depth, subsample, colsample_bytree, min_child_weight
  • Time limit: 15 Minutes

Every method uses the same objective below, with 3-fold stratified cross-validation to keep the class distribution the same in the evaluation folds.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# 3-fold stratified CV keeps the class distribution the same in every fold
skf = StratifiedKFold(n_splits=3)

def objective(trial):
    # search ranges for each XGBoost hyperparameter
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 3000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-2, 0.1, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10)}

    model = XGBClassifier(**params)
    score = []
    # X_train / y_train: arrays of the training split (labels assumed already encoded)
    for train_index, test_index in skf.split(X_train, y_train):
        x_train_fold, x_test_fold = X_train[train_index], X_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]

        model.fit(x_train_fold, y_train_fold)
        y_predict = model.predict(x_test_fold)
        score.append(f1_score(y_test_fold, y_predict, average='weighted'))

    # the study maximizes the mean weighted F1 across the three folds
    return np.mean(score)

RandomSearch

The name says it all: we simply sample parameters at random, try them, keep the best result so far, and sample again. The method does not take the direction of previous results into account, or whether they improved the objective.

As mentioned earlier, we use Optuna since we cannot control the time budget with scikit-learn's RandomizedSearchCV. To do a random search, we specify the sampler in optuna.create_study:

study_random = optuna.create_study(
    sampler=optuna.samplers.RandomSampler(seed=77),
    pruner=optuna.pruners.MedianPruner(),
    study_name='hyper-search-random',
    direction='maximize',
    load_if_exists=False)
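The study is then run against the shared objective. The exact call is not shown in the original snippet, but with the 100-trial cap and 15-minute budget described in this post it would look roughly like this:

study_random.optimize(objective, n_trials=100, timeout=15 * 60)  # stop at 100 trials or 15 minutes, whichever comes first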

The search stops at the 15-minute timeout without reaching 100 trials. Below is the contour plot from the random search; as we can see, the dots are simply sampled at random from the parameter space. The color encodes the objective value, in this case the F1 score.

Random Search by Author

GridSearch

The name is equally straightforward: we search over a grid of all the parameter values we have specified. Unlike random search, which samples the parameters at random, grid search tries every possible combination one by one. It is guaranteed to find the best combination within the grid, but the search space can be far too large for the time we have.

In Optuna, we simply replace the sampler with a GridSampler. For the grid sampler, we have to specify the search space manually and pass it to the sampler.

params_grid = {'n_estimators': [700, 900, 950, 1000, 2500, 3000],
               'learning_rate': [0.05, 0.03, 0.01, 0.1],
               'max_depth': [3, 4, 5, 6, 7],
               'subsample': [0.5, 0.7, 1],
               'colsample_bytree': [0.5, 0.7, 1],
               'min_child_weight': [1, 2, 3, 4]}

study_grid = optuna.create_study(
    sampler=optuna.samplers.GridSampler(search_space=params_grid, seed=77),
    pruner=optuna.pruners.MedianPruner(),
    study_name='hyper-search-grid',
    direction='maximize',
    load_if_exists=False)
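For scale, this grid already defines 6 × 4 × 5 × 3 × 3 × 4 = 4,320 combinations, and each one costs a full 3-fold cross-validated fit, so only a small fraction of the grid can be covered within the 15-minute budget.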

From the contour plot, we can see that the parameter search forms a grid shape, and every combination would eventually be tried if the time budget allowed.

Grid Search by Author

Tree-Structured Parzen Estimator

The TPE algorithm is used to model and optimize the acquisition function in Bayesian optimization. The basic idea is to use a probabilistic model of past trials to guide the search for the optimal solution.

Here’s a high-level overview of how TPE works:

  • Select samples: randomly evaluate a few configurations with different hyperparameters.
  • Divide good and bad distributions: TPE maintains two probability distributions (built with kernel density estimation), one for the 'good' configurations that have led to improvements and one for the 'bad' ones.
  • Update: the sampled configurations are evaluated with the objective function, and TPE updates its distributions to shift the focus toward configurations that are more likely to improve the objective.
  • Keep iterating.

Unlike the previous methods, TPE takes previous trials into account and moves toward better objective values.
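To make the intuition concrete, here is a toy one-parameter sketch of the good/bad density idea. It is purely illustrative (not Optuna's implementation), and the objective and all numbers are made up:

import numpy as np
from scipy.stats import gaussian_kde

def toy_objective(x):
    return -(x - 0.3) ** 2  # maximized at x = 0.3

rng = np.random.default_rng(77)
xs = rng.uniform(0, 1, size=20)  # a few random startup trials
ys = toy_objective(xs)

for _ in range(30):
    # split observed trials into 'good' (top 25%) and 'bad' (the rest)
    threshold = np.quantile(ys, 0.75)
    l = gaussian_kde(xs[ys >= threshold])  # density of promising values
    g = gaussian_kde(xs[ys < threshold])   # density of unpromising values

    # propose candidates from l(x) and keep the one with the largest l(x)/g(x)
    candidates = np.clip(l.resample(64).ravel(), 0, 1)
    best = candidates[np.argmax(l(candidates) / g(candidates))]

    xs = np.append(xs, best)
    ys = np.append(ys, toy_objective(best))

print(xs[np.argmax(ys)])  # converges toward 0.3

Switching Optuna itself to TPE only requires changing the sampler: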

study_tpe = optuna.create_study(
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.MedianPruner(),
    study_name='hyper-search-tpe',
    direction='maximize',
    load_if_exists=False)

From the contour plot, we see that TPE sampled more in the left area and kept exploiting it. This approach tends to work much better than random or grid sampling when the search space is too large to cover exhaustively.

Results

So how do the results compare? Remember that we optimized the parameters under the same 15-minute time constraint. The evaluation uses a held-out split of 30% of the data with the same class distribution.
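For reference, the split itself is the usual stratified hold-out (a sketch; the random seed is an assumption):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=77)  # 30% evaluation set, same class distribution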

The evaluation results indicate that TPE outperforms Random Search and Grid Search within the specified 15-minute time frame. However, it is essential to consider that Grid Search’s effectiveness may be influenced by the manually provided parameter grid.

Evaluation Result by Author
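After each study finishes, the winning configuration and its cross-validated score come straight off the study object, and the final model can be retrained on the full training split with those parameters (a short sketch using the TPE study):

best_params = study_tpe.best_params  # dict of the winning hyperparameters
print(study_tpe.best_value)          # best mean weighted F1 across the CV folds

final_model = XGBClassifier(**best_params).fit(X_train, y_train)
print(f1_score(y_test, final_model.predict(X_test), average='weighted'))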

Conclusion

In this post we saw how to implement different algorithms for hyperparameter search; the goal is not only to find the best parameters but also to respect time constraints.

Optuna is a powerful tool for hyperparameter optimization, offering various samplers such as RandomSampler, GridSampler, and TPESampler. Each sampler has its own characteristics, and the choice of sampler depends on the nature of the problem and the available resources.

Within a limited time frame, TPE may be the first choice to try, as the results show it gives a better score on the evaluation set within the same budget.

The code can be accessed in this Kaggle notebook.
