Using optuna with sklearn the right way — Part 1

Optimizing hyperparameters with sklearn can get ugly fast; let's try to make it simpler

Walter Sperat
11 min read · Jul 11, 2023

Introduction

In Machine Learning, one of the biggest challenges is maximizing performance while minimizing overfitting. This problem is also known as the bias-variance tradeoff: basically, there is a sweet spot in model complexity, where the model is complex enough to learn everything there is to learn, while being simple enough not to over-adapt to the training dataset.

In this part, we'll look at the basics of how to use optuna with sklearn, and in Part 2 we'll extend the functionality in order to optimize just about everything we possibly can.

Hyperparameters

There are several ways to avoid overfitting, but a commonly used technique is hyperparameter tuning, which explicitly determines model complexity.

For the uninitiated, "parameters" are what models learn during training (the coefficients in a logistic regression, the variable-cutoff combination in decision trees, etc.), while "hyperparameters" are the knobs and levers that need to be determined before training.

In decision trees, for example, the max_depth hyperparameter determines the maximum allowed number of levels the tree can grow to, regardless of how much the tree could theoretically learn if it added more leaves. For all the available hyperparameters scikit-learn lets us touch, feel free to take a look at the documentation.
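To make that concrete, here's a tiny illustration (the depth values are arbitrary, not from the article): the same estimator class yields a much simpler or a much more flexible model depending on that single knob.

from sklearn.tree import DecisionTreeClassifier

shallow_tree = DecisionTreeClassifier(max_depth=3)   # at most three levels: a simpler, higher-bias model
deep_tree = DecisionTreeClassifier(max_depth=None)   # grows until leaves are pure: flexible, but prone to overfitting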

Now, considering the number of hyperparameters a DecisionTreeClassifier has, each of which will impact model complexity in a different way, an obvious question is "how do I choose the best combination?". Well, dear reader, if you're asking that question too, then it's your lucky day, for in what follows I'll go over an extremely scalable and reusable way of doing it.

Workflow

I remember when I first started reading data science tutorials and saw people writing classes and functions, and thought "why the hell are they doing that? It's much simpler to just write everything in a single script and have it run from end to end". Several years later, I can now see the error in my ways: correctly grouping behavior helps the code be more reusable and, more importantly, understandable. So, if this is one of the first tutorials you're reading, don't panic, it will all click eventually.

With that out of the way, here's what we're going to do:

  1. Write instantiation functions that will create the necessary objects of the sklearn pipeline with their corresponding initializations. Each of these functions must be pure and do only one thing (i.e. create a transformer/estimator instance).
  2. Write an appropriate objective function, which takes a trial and the data explicitly, uses the previously defined functions to create the model object, evaluates it using cross validation and returns the performance.
  3. Create an optuna study and call the optimize method on the objective function and the data.
  4. Get the best trial out of the finished study.
  5. Pass the best trial to the instantiation functions to get the best model.

A little note before we begin: in the code in this article, we'll add typing when defining functions. This is not strictly necessary, as the CPython interpreter ignores typing information. However, it plays very nicely with IDEs' auto-complete functionality, as well as static analysis tools such as pyre and mypy (which I prefer), which help you find errors before ever running things.

Instantiation functions

To understand the following, you'll need to be familiar with scikit-learn's Pipeline and ColumnTransformer classes.

Ready? Great.

Most real-world datasets have both numerical and categorical columns, as well as missing data, so let's start by instantiating an imputer:

from sklearn.impute import SimpleImputer
from optuna import Trial

def instantiate_numerical_simple_imputer(trial : Trial, fill_value : int=-1) -> SimpleImputer:
    strategy = trial.suggest_categorical(
        'numerical_strategy', ['mean', 'median', 'most_frequent', 'constant']
    )
    return SimpleImputer(strategy=strategy, fill_value=fill_value)

def instantiate_categorical_simple_imputer(trial : Trial, fill_value : str='missing') -> SimpleImputer:
    strategy = trial.suggest_categorical(
        'categorical_strategy', ['most_frequent', 'constant']
    )
    return SimpleImputer(strategy=strategy, fill_value=fill_value)

Oh, didn't I tell you? The model isn't the only thing we'll be optimizing, but the preprocessing too.

If you recall my previous article, optuna trial objects choose values from the provided distributions through the suggest API, and evaluate the objective function with those values. In this particular case, the values correspond to the imputer's hyperparameters.

An interesting quirk to note is that the first argument of the suggest_* functions is the name optuna will use to interact with a particular variable (i.e. hyperparameter). Therefore, it's imperative for variable names to be distinct throughout the entire optimization; that's why I added the variable-type prefixes ('numerical_strategy' and 'categorical_strategy'), because reusing a name would raise an error.
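If you want to see that name-to-value mapping in action outside of an optimization, optuna's FixedTrial can replay the suggest calls with values you choose; this little check is mine, not part of the workflow:

from optuna.trial import FixedTrial

# The keys must match the names used inside the instantiation functions
trial = FixedTrial({'numerical_strategy': 'median', 'categorical_strategy': 'constant'})
instantiate_numerical_simple_imputer(trial)    # SimpleImputer(fill_value=-1, strategy='median')
instantiate_categorical_simple_imputer(trial)  # SimpleImputer(fill_value='missing', strategy='constant')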

Now let's move on to the categorical encoding by using category-encoders:

from category_encoders import WOEEncoder

def instantiate_woe_encoder(trial : Trial) -> WOEEncoder:
    params = {
        'sigma': trial.suggest_float('sigma', 0.001, 5),
        'regularization': trial.suggest_float('regularization', 0, 5),
        'randomized': trial.suggest_categorical('randomized', [True, False])
    }
    return WOEEncoder(**params)

Doesn't that look nice?

Now let's move on to dealing with the numerical variables. In my experience, scaling numerical variables doesn't particularly improve the performance of tree-based models but hey, why not let the data decide?

from sklearn.preprocessing import RobustScaler

def instantiate_robust_scaler(trial : Trial) -> RobustScaler:
    params = {
        'with_centering': trial.suggest_categorical(
            'with_centering', [True, False]
        ),
        'with_scaling': trial.suggest_categorical(
            'with_scaling', [True, False]
        )
    }
    return RobustScaler(**params)

Now we need the model itself. This particular implementation is for an ExtraTreesClassifier, but you can obviously abstract it away for basically any model you like:

from sklearn.ensemble import ExtraTreesClassifier

def instantiate_extra_trees(trial : Trial) -> ExtraTreesClassifier:
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
        'max_depth': trial.suggest_int('max_depth', 1, 20),
        # max_features as a fraction must be strictly positive, so keep the lower bound above 0
        'max_features': trial.suggest_float('max_features', 0.1, 1),
        'bootstrap': trial.suggest_categorical('bootstrap', [True, False]),
        'n_jobs': -1,
        'random_state': 42
    }
    return ExtraTreesClassifier(**params)

Note that the hyperparameters I chose to optimize are based on my own experience, but you can modify the function to suit your needs.

Now let's create the pipeline instantiation functions:

from sklearn.pipeline import Pipeline

def instantiate_numerical_pipeline(trial : Trial) -> Pipeline:
    pipeline = Pipeline([
        ('imputer', instantiate_numerical_simple_imputer(trial)),
        ('scaler', instantiate_robust_scaler(trial))
    ])
    return pipeline

def instantiate_categorical_pipeline(trial : Trial) -> Pipeline:
    pipeline = Pipeline([
        ('imputer', instantiate_categorical_simple_imputer(trial)),
        ('encoder', instantiate_woe_encoder(trial))
    ])
    return pipeline

Finally, let's put it all together:

from sklearn.compose import ColumnTransformer

def instantiate_processor(trial : Trial, numerical_columns : list[str], categorical_columns : list[str]) -> ColumnTransformer:

    numerical_pipeline = instantiate_numerical_pipeline(trial)
    categorical_pipeline = instantiate_categorical_pipeline(trial)

    processor = ColumnTransformer([
        ('numerical_pipeline', numerical_pipeline, numerical_columns),
        ('categorical_pipeline', categorical_pipeline, categorical_columns)
    ])

    return processor

def instantiate_model(trial : Trial, numerical_columns : list[str], categorical_columns : list[str]) -> Pipeline:

    processor = instantiate_processor(
        trial, numerical_columns, categorical_columns
    )
    extra_trees = instantiate_extra_trees(trial)

    model = Pipeline([
        ('processor', processor),
        ('extra_trees', extra_trees)
    ])

    return model

One of the greatest things about this way of working is that if you want to try more models, you only need to modify the final function, the one that instantiates the predictor itself, because everything else is already predefined.
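For instance, trying a RandomForestClassifier instead would only mean writing one more instantiation function and swapping it into instantiate_model; here's a sketch, where the hyperparameter ranges are my own guesses rather than anything from the article:

from sklearn.ensemble import RandomForestClassifier

def instantiate_random_forest(trial : Trial) -> RandomForestClassifier:
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
        'max_depth': trial.suggest_int('max_depth', 1, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 50),
        'n_jobs': -1,
        'random_state': 42
    }
    return RandomForestClassifier(**params)

# In instantiate_model, only the predictor step changes:
# ('extra_trees', instantiate_extra_trees(trial)) -> ('random_forest', instantiate_random_forest(trial))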

"But Walter", you might say "isn't this basically the same as with sklearn's grid and random searches? Aren't we adding unnecessary extra steps by introducing optuna?". Well… yes and no.

With what we've seen so far, the truth is that the only thing we have done is add a new library and learn its API (and it also gives us access to optuna's samplers). However, as we'll see in the following articles, optuna's flexibility will make some apparently hard things extremely comfortable, and I don't know about you, but I'm all for comfort.

Objective Function

So, with all of that out of the way, let's finally define the function optuna will interact with:

from typing import Optional
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import roc_auc_score, make_scorer
from pandas import DataFrame, Series
import numpy as np

def objective(trial : Trial, X : DataFrame, y : np.ndarray | Series, numerical_columns : Optional[list[str]]=None, categorical_columns : Optional[list[str]]=None, random_state : int=42) -> float:
    if numerical_columns is None:
        numerical_columns = [
            *X.select_dtypes(exclude=['object', 'category']).columns
        ]

    if categorical_columns is None:
        categorical_columns = [
            *X.select_dtypes(include=['object', 'category']).columns
        ]

    model = instantiate_model(trial, numerical_columns, categorical_columns)

    kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
    roc_auc_scorer = make_scorer(roc_auc_score, needs_proba=True)
    scores = cross_val_score(model, X, y, scoring=roc_auc_scorer, cv=kf)

    return np.min([np.mean(scores), np.median(scores)])

That's a mouthful, so let's break it down bit by bit.

The function arguments are nothing particularly weird: an optuna trial (don't worry, optuna will deal with it for us), a pandas dataframe named X (this can be suitably changed to a numpy array, a datatable table, or whatever you want, as long as you pass appropriate datatypes to the sklearn pipeline), a numpy array/pandas series with the target values, optionally the names of the numerical and categorical columns in their respective lists, and a random_state for the cross-validation.

For this implementation it's important that, if you decide to pass the column names, X's columns are the same as the passed names, no more, no less. This last point can be changed with a little moving-around, but if you're just copy-pasting, this can save you a couple of headaches. Something else that's different with respect to the objective functions you may find online (including in optuna's documentation) is that this function is pure: ALL of its arguments must be explicitly passed, meaning that it doesn't depend on the state of the program at execution time. In contrast, others (such as here) tend to make the objective function get the data from elsewhere. This is not necessarily incorrect, and works just as well, but if you make a mistake somewhere, it can make debugging a literal nightmare. Additionally, pure functions can be easily unit tested.
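As an example of that last point, a minimal test could look something like the following sketch; the toy data and the fixed hyperparameter values are made up, and FixedTrial simply replays the suggest calls with them:

import numpy as np
import pandas as pd
from optuna.trial import FixedTrial

def test_objective_returns_valid_score():
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        'age': rng.normal(40, 10, 200),
        'income': rng.normal(50_000, 10_000, 200),
        'city': rng.choice(['a', 'b', 'c'], 200)
    })
    y = pd.Series(rng.integers(0, 2, 200))
    trial = FixedTrial({
        'numerical_strategy': 'mean', 'categorical_strategy': 'constant',
        'sigma': 0.1, 'regularization': 1.0, 'randomized': False,
        'with_centering': True, 'with_scaling': True,
        'n_estimators': 50, 'max_depth': 3, 'max_features': 0.5, 'bootstrap': True
    })
    score = objective(trial, X, y)
    assert 0.0 <= score <= 1.0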

With that out of the way, let's get back to the code. First of all, as the column names are optional, the function checks if they were passed or not. If they weren't (i.e. if they are None), it will use pandas' interface to automatically detect the corresponding columns. Again, this implementation uses pandas, but you can modify the code to do it with your table library of choice.

if numerical_columns is None:
    numerical_columns = [
        *X.select_dtypes(exclude=['object', 'category']).columns
    ]

if categorical_columns is None:
    categorical_columns = [
        *X.select_dtypes(include=['object', 'category']).columns
    ]

This next section is where the magic happens:

model = instantiate_model(trial, numerical_columns, categorical_columns)

That simple line will build the entire pipeline, choosing the hyperparameters from the distributions we provided. The truth is, if you understood the general logic of the instantiation functions, there's really not a lot to explain here. That's the good thing about optuna's suggest API: it allows us to encapsulate extremely complicated logic within a couple of lines of code.

The next section is some standard-issue sklearn code:

kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
roc_auc_scorer = make_scorer(roc_auc_score, needs_proba=True)
scores = cross_val_score(model, X, y, scoring=roc_auc_scorer, cv=kf)

A correctly instantiated KFold object will allow us to reproducibly separate between train and test splits for cross-validation.

The following line uses the make_scorer function to turn the roc_auc_score metric into an appropriate scorer, which the cross_val_score function needs (for more information on this I highly recommend the documentation). I arbitrarily chose to optimize for ROC AUC in this particular instance, but you can obviously choose whichever sklearn metric you want, or even create a custom one.

cross_val_score is a nifty little function that conveniently encapsulates performing a reproducible cross-validation on an arbitrary dataset with an arbitrary model and an arbitrary scoring function. The return value of this function will be a numpy array with the scores (the ROC AUC scores in this case) for the test sets of each of the folds.
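If you'd rather optimize a different metric, only the scorer line needs to change; for example (my choice of metric, not the article's), average precision also works on predicted probabilities:

from sklearn.metrics import average_precision_score

# Average precision (PR AUC) instead of ROC AUC; the rest of the objective stays identical
pr_auc_scorer = make_scorer(average_precision_score, needs_proba=True)
scores = cross_val_score(model, X, y, scoring=pr_auc_scorer, cv=kf)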

Finally, we have:

return np.min([np.mean(scores), np.median(scores)])

Now, here's the other place where optuna really shines. As it doesn't really care what is inside the function as long as it returns a number (or array of numbers… to be continued…), optuna basically turns a blind eye to the internals of the objective function. This allows us to calculate just about whatever we want in there.

In this particular case, I set the return value to be the minimum between the mean and the median of the cross-validation scores. The reason for it is simple: mean values tend to be strongly influenced by outliers, so if a single split has a very good performance, but the others don't (which happens more often than one would like), then the score will be artificially high. Let's look at a simple example to illustrate the point:

scores = np.array([0.51, 0.51, 0.51, 0.51, 0.98])
np.mean(scores) # 0.604
np.median(scores) # 0.51

Similarly, the mean is also influenced by very small values, while the median isn't. In this way, taking the minimum of the mean and the median will prevent scores from becoming too optimistic, which may negatively influence surrogate-model optimization (e.g. Bayesian) methods. This also helps prevent overfitting towards an "easy" fold (the one with high performance, which will be the same across all trials).
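The mirror-image case, with one bad fold instead of one great one, shows why the minimum is the safer choice (numbers made up for illustration):

scores = np.array([0.9, 0.9, 0.9, 0.9, 0.5])
np.mean(scores) # 0.82
np.median(scores) # 0.9
np.min([np.mean(scores), np.median(scores)]) # 0.82, the more pessimistic of the two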

You can obviously calculate this same score in sklearn grid and random search objects (after finishing the search) through the cv_results_ attribute. However, that requires manually setting things and selecting hyperparameters one by one, which can very easily get hacky.
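For reference, this is roughly what that manual route looks like; a sketch, assuming a fitted 5-fold GridSearchCV or RandomizedSearchCV stored in a variable called search:

import numpy as np

# cv_results_ has one 'split{i}_test_score' array per fold; stack them into (n_folds, n_candidates)
fold_scores = np.vstack([search.cv_results_[f'split{i}_test_score'] for i in range(5)])
robust_scores = np.minimum(fold_scores.mean(axis=0), np.median(fold_scores, axis=0))
best_candidate = int(np.argmax(robust_scores))
best_params = search.cv_results_['params'][best_candidate]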

The rest

If you recall, the missing steps are as follows:

  1. Create the study,
  2. Get the best trial,
  3. Instantiate and train the best model.

The first part is just this:

from optuna import create_study

study = create_study(study_name='optimization', direction='maximize')

study.optimize(lambda trial: objective(trial, X_train, y_train), n_trials=100)

The names of the functions and objects are pretty self-explanatory: the create_study function creates a study object which "knows" both its own name and the direction to optimize (i.e. loss functions must be minimized, while gain/score functions must be maximized). The "optimize" method is a higher-order function that takes the objective function (here is where the trial object automagically appears) and the number of trials to run the optimization for.
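If the lambda feels awkward, functools.partial does the same job while keeping the argument names explicit; this is just an equivalent way of writing the call above:

from functools import partial

study.optimize(partial(objective, X=X_train, y=y_train), n_trials=100)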

Depending on the size of the dataset, the number of splits selected for cross-validation and learning complexity, the optimize function will take a short or long amount of time (though I dare say it'll be on the longer side).

After that's done, we'll be left with a Study object that contains all of the data/metadata we could ever want about the optimization. If this is your first time using optuna, don't worry, we'll go deeper into the nuances of optuna studies in the next part. For now, let's move on to getting the best combination of hyperparameters out of there.

Study objects contain several attributes:

[Image: attributes of the Study object, taken from the optuna documentation.]

The attributes of interest to us right now are "best_params" and "best_trial", which will contain a dictionary with the best key-value pairs for the chosen hyperparameters and its corresponding trial object, respectively. "direction/s", "system_attrs" and "user_attrs" are study-related metadata (again, don't panic, we'll look at them in the other parts). Regarding "best_trials", it's a way of selecting the best trials for multi-objective optimizations, which we aren't doing right now, so we can safely forget about it for now.

If you want to visually inspect the best combination of hyperparameters, you can do it as follows:

study.best_params

For our particular case, this is a dictionary of the form:

{
    'numerical_strategy': ...,
    'categorical_strategy': ...,
    'sigma': ...,
    'regularization': ...,
    'randomized': ...,
    'with_scaling': ...,
    'with_centering': ...,
    'n_estimators': ...,
    'max_depth': ...,
    'max_features': ...,
    'bootstrap': ...,
}

The good: this is fantastic and understandable. The bad: we can't use it to instantiate a model comfortably! But don't be impatient, we'll get there shortly. The next step is to get the best trial, which can be done like so:

best_trial = study.best_trial

The trial itself is a very particular optuna object that contains all the data and metadata concerning that particular combination of hyperparameters, such as the parameters themselves and the objective function's return value (which in our case is the minimum of the mean and the median of the ROC AUC scores).
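The two attributes we care about most here are exposed directly on the trial:

best_trial.params # the same dictionary as study.best_params
best_trial.value  # the objective's return value, i.e. the minimum of the mean and median CV scores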

This is where the instantiation functions get their moment to shine:

model = instantiate_model(best_trial, numerical_columns, categorical_columns)
model.fit(X_train, y_train)

That's IT!

We can now easily calculate the out-of-sample performance:

probabilities = model.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, probabilities)

And train the model on all the data:

model.fit(X_full, y_full)

Some thoughts

First, if you actually got this far, I'd like to both congratulate and thank you.

As you've seen throughout this article, optuna allows us to very comfortably define anything needed for an arbitrarily complex optimization.

This way of doing things (instantiation functions, pure objective function) allows us to define things once and build a simple-ish module of optimization-related functions which can be recycled as needed on as many projects as we want.

In general, I don’t recommend moving forward without looking at more than one hyperparameter set. In the above, we only used the best trial, but a better practice would be to look at the top 10% and compare them.
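A rough sketch of that idea, sorting completed trials by objective value and keeping the top decile (the 10% cut-off is as arbitrary as any other):

completed_trials = [t for t in study.trials if t.value is not None]
top_trials = sorted(completed_trials, key=lambda t: t.value, reverse=True)
top_10_percent = top_trials[:max(1, len(top_trials) // 10)]

for t in top_10_percent:
    print(t.number, round(t.value, 4), t.params)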

Part 2 will extend what we saw here; in the following articles we'll recycle the methodology presented here and look at some more of optuna's features.

Thanks for reading!
