Using optuna with sklearn the right way — Part 2

Let's take optimizations up a notch

Walter Sperat
6 min read · Jul 12, 2023

Introduction

In the previous article, we learned how to use optuna to optimize the hyperparameters of all the components in a scikit-learn pipeline. In this article, we'll add two new dimensions to the optimization: columns and estimators.

In the first part, we'll see how to let the optimization select an appropriate subset of columns for the model to use. In the second part, we'll take everything to its natural conclusion and let the optimization choose not only the hyperparameters, but also the best preprocessing steps and learning algorithm.

So, for example, we'll give it the full dataset with 150 columns, and the optimization will choose a subset of them (say, 50), as well as the pipeline structure itself. In this way, we don't have to decide in advance how to impute or preprocess the data, and even the model itself (the final step in the pipeline, which makes the predictions) becomes part of the optimization.

Hyper-hyperparameters?

"But Walter", you might say, "why not just do several optimizations for different estimators and just choose the best?". Let me answer with another question: isn't that an optimization with extra steps?

Look at it this way: hyperparameters (like max_depth or n_estimators) are variables that, in one way or another, determine how learning will occur; by definition, then, hyperparameters are not estimated during training. From this perspective, we could add "algorithm", "imputer" and "scaler" as new variables to the optimization, where the first chooses the learning algorithm (logistic regression, random forest, etc.), the second chooses the imputation method (mean, median, etc.), and the last chooses the scaling method (standardization, min-max scaling, etc.).

Are you ready? Great.

Workflow

This article will be structured in the following way:

  1. Write the "choose_columns" function, which will allow optuna to directly interface with the dataset.
  2. Write the "instantiate_learner" function, which will allow optuna to test several different algorithms during optimization.
  3. Subtly reimplement the "instantiate_processor" and "instantiate_model" functions.

From now on, I'll assume you are familiar with the "instantiate_x" way of using optuna that we described in the previous article, so if you find yourself lost, feel free to take a quick look and pop back here; I'll wait.

Column selection

Let's jump right in:

from optuna import Trial

def choose_columns(trial : Trial, columns : list[str]) -> list[str]:
    choose = lambda column: trial.suggest_categorical(column, [True, False])
    choices = [*filter(choose, columns)]
    return choices

Going through it, the code first defines a small function that takes a column name and asks the trial whether to keep it (using the column name itself as the parameter name, which optuna requires to be a string), applies that function to every column with python's fantastic filter, and returns the columns for which the trial suggested True.
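To make the mechanics concrete, here's a minimal, self-contained sketch of how choose_columns behaves inside a study (the column names and the placeholder objective are made up purely for illustration):

import optuna

def objective(trial: Trial) -> float:
    columns = ['age', 'income', 'tenure']  # hypothetical feature names
    selected = choose_columns(trial, columns)
    # in a real objective you would fit a model on X[selected] and return its score
    return float(len(selected))  # placeholder, just to show the mechanics

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=5)
print(study.best_trial.params)  # e.g. {'age': True, 'income': False, 'tenure': True}

Each column becomes its own boolean parameter in the trial, so the selection shows up transparently in study.best_trial.params.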

Now, there are obviously many different ways of doing this. Another one worth mentioning is scikit-lego's ColumnSelector:

from sklego.preprocessing import ColumnSelector

def instantiate_column_selector(trial : Trial, columns : list[str]) -> ColumnSelector:
    choose = lambda column: trial.suggest_categorical(column, [True, False])
    choices = [*filter(choose, columns)]
    selector = ColumnSelector(choices)
    return selector

The first lines are exactly the same as before (we could even reuse the choose_columns function if we wanted), but this version can now be integrated into an sklearn pipeline, which may make thinking about column selection as a hyperparameter a tad more intuitive. All in all, the implementation is up to you.
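Just as a sketch of what that integration could look like (the imputer and classifier here are arbitrary placeholders, not part of the original setup):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

def instantiate_selector_pipeline(trial: Trial, columns: list[str]) -> Pipeline:
    pipeline = Pipeline([
        ('selector', instantiate_column_selector(trial, columns)),  # drops unselected columns; expects a DataFrame
        ('imputer', SimpleImputer(strategy='median')),
        ('model', RandomForestClassifier())
    ])
    return pipeline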

Before moving on, a little side-note on scikit-lego: it's an absolutely amazing library, I strongly encourage you to look into it if you haven't already. It's built by some of the smartest people out there, who really know their stuff. Also, the functionality it encapsulates extends scikit-learn in some great and convenient ways.

Estimator Selection

Having already built several instantiation functions in the previous article, you know how to write them by now. So, when you see "instantiate_logistic_regression", assume you can write it on your own, choosing whatever hyperparameters you want to optimize.
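For reference, one such function might look like this; the hyperparameter and its range below are just an example, and you should tune whatever makes sense for your problem:

from sklearn.linear_model import LogisticRegression

def instantiate_logistic_regression(trial: Trial) -> LogisticRegression:
    params = {
        'C': trial.suggest_float('logistic_C', 1e-3, 1e2, log=True),  # inverse regularization strength
        'max_iter': 1000
    }
    return LogisticRegression(**params)

With a handful of functions like that in place, the dispatcher that ties them together is straightforward: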

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

Classifier = (
    RandomForestClassifier |
    ExtraTreesClassifier |
    SVC |
    LogisticRegression |
    KNeighborsClassifier
)

def instantiate_learner(trial : Trial) -> Classifier:
    algorithm = trial.suggest_categorical(
        'algorithm', ['logistic', 'forest', 'extra_forest', 'svm', 'knn']
    )
    if algorithm == 'logistic':
        model = instantiate_logistic_regression(trial)
    elif algorithm == 'forest':
        model = instantiate_random_forest(trial)
    elif algorithm == 'extra_forest':
        model = instantiate_extra_forest(trial)
    elif algorithm == 'svm':
        model = instantiate_svm(trial)
    elif algorithm == 'knn':
        model = instantiate_knn(trial)

    return model

Similarly, we can do the same for the scaling:

from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
)

Scaler = (
    StandardScaler |
    MinMaxScaler |
    MaxAbsScaler |
    RobustScaler
)

def instantiate_scaler(trial : Trial) -> Scaler:
    method = trial.suggest_categorical(
        'scaling_method', ['standard', 'minmax', 'maxabs', 'robust']
    )

    if method == 'standard':
        scaler = instantiate_standard_scaler(trial)
    elif method == 'minmax':
        scaler = instantiate_minmax_scaler(trial)
    elif method == 'maxabs':
        scaler = instantiate_maxabs_scaler(trial)
    elif method == 'robust':
        scaler = instantiate_robust_scaler(trial)

    return scaler

Simple, right? It should be noted, though, that some of the instantiation functions have nothing to optimize by themselves. The MaxAbsScaler, for example, has no hyperparameters to tune, so its instantiation function is nothing more than an API-unification tool.
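For instance, reusing the scaler classes imported above, the two extremes could look roughly like this (the RobustScaler parameter is just an illustrative choice):

def instantiate_maxabs_scaler(trial: Trial) -> MaxAbsScaler:
    # nothing to tune; the function only exists to keep the API uniform
    return MaxAbsScaler()

def instantiate_robust_scaler(trial: Trial) -> RobustScaler:
    params = {
        'with_centering': trial.suggest_categorical('robust_with_centering', [True, False])
    }
    return RobustScaler(**params)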

Ditto for the encoding strategy using both scikit-learn and category-encoders:

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from category_encoders import WOEEncoder

Encoder = (
    OrdinalEncoder |
    OneHotEncoder |
    WOEEncoder
)

def instantiate_encoder(trial : Trial) -> Encoder:
    method = trial.suggest_categorical(
        'encoding_method', ['ordinal', 'onehot', 'woe']
    )

    if method == 'ordinal':
        encoder = instantiate_ordinal_encoder(trial)
    elif method == 'onehot':
        encoder = instantiate_onehot_encoder(trial)
    elif method == 'woe':
        encoder = instantiate_woe_encoder(trial)

    return encoder

Obviously, these functions can be modified to suit your specific needs; you can use other libraries, add your own estimators, remove the ones I wrote, whatever you like.
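For example, adding gradient boosting to the mix only requires one more instantiation function and one more branch in instantiate_learner (the hyperparameters and ranges below are, again, just placeholders):

from sklearn.ensemble import GradientBoostingClassifier

def instantiate_gradient_boosting(trial: Trial) -> GradientBoostingClassifier:
    params = {
        'n_estimators': trial.suggest_int('gb_n_estimators', 50, 500),
        'learning_rate': trial.suggest_float('gb_learning_rate', 1e-3, 0.3, log=True)
    }
    return GradientBoostingClassifier(**params)

After that, it's just a matter of appending 'gradient_boosting' to the categorical choices in instantiate_learner and adding the corresponding elif branch.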

Final Instantiation

That's all well and good, but we're still not done. To tie everything together, we need to make some simple modifications to the final instantiation functions, as well as bring in the choose_columns function:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def instantiate_numerical_pipeline(trial : Trial) -> Pipeline:
    pipeline = Pipeline([
        ('imputer', instantiate_numerical_simple_imputer(trial)),
        ('scaler', instantiate_scaler(trial))
    ])
    return pipeline

def instantiate_categorical_pipeline(trial : Trial) -> Pipeline:
    pipeline = Pipeline([
        ('imputer', instantiate_categorical_simple_imputer(trial)),
        ('encoder', instantiate_encoder(trial))
    ])
    return pipeline

def instantiate_processor(trial : Trial, numerical_columns : list[str], categorical_columns : list[str]) -> ColumnTransformer:

    numerical_pipeline = instantiate_numerical_pipeline(trial)
    categorical_pipeline = instantiate_categorical_pipeline(trial)

    selected_numerical_columns = choose_columns(trial, numerical_columns)
    selected_categorical_columns = choose_columns(trial, categorical_columns)

    processor = ColumnTransformer([
        ('numerical_pipeline', numerical_pipeline, selected_numerical_columns),
        ('categorical_pipeline', categorical_pipeline, selected_categorical_columns)
    ])

    return processor

def instantiate_model(trial : Trial, numerical_columns : list[str], categorical_columns : list[str]) -> Pipeline:

    processor = instantiate_processor(
        trial, numerical_columns, categorical_columns
    )

    learner = instantiate_learner(trial)

    model = Pipeline([
        ('processor', processor),
        ('model', learner)
    ])

    return model

Wasn't that nice and easy? The good thing about this approach is its simplicity and scalability: by changing just a couple of lines of code, we made the optimization much more flexible.

Now on to the objective function… Hold on! Do we really need to write a new objective function?

If you think about it, we don't: the necessary functions (which conveniently kept their names) are called directly inside the objective function from last time, so that's basically it.
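For completeness, here's a compressed sketch of that objective; the metric and cross-validation settings are only an example, and the version from the previous article may differ in its details:

from sklearn.model_selection import cross_val_score

def objective(trial: Trial, X, y, numerical_columns: list[str], categorical_columns: list[str]) -> float:
    model = instantiate_model(trial, numerical_columns, categorical_columns)
    # any metric and CV scheme will do; these are just placeholders
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
    return scores.mean()

As before, you'd wrap this with a lambda or functools.partial when passing it to study.optimize, since optuna only hands the objective a trial.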

Some Thoughts

Once again, if you actually got this far, I’d like to both congratulate and thank you.

After all of that, I hope to have convinced you that no part of an sklearn pipeline is unoptimizable, and that you can now look at each estimator and transformer as its own hyperparameter. If you think about it, feature engineering could also be appropriately integrated!
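To illustrate that last point (this is just a hypothetical extension, not part of the setup above), a feature-engineering step can be instantiated exactly like any other:

from sklearn.preprocessing import PolynomialFeatures

def instantiate_feature_engineering(trial: Trial) -> PolynomialFeatures:
    params = {
        'degree': trial.suggest_int('poly_degree', 1, 3),
        'interaction_only': trial.suggest_categorical('poly_interaction_only', [True, False])
    }
    return PolynomialFeatures(**params)

Dropped into the numerical pipeline, it becomes just another tuned step.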

Now, a word of warning. As you've probably realized by now, I'm all for automating things; however, no optimization will get magical results. You still need to know what you're doing, and nothing will actually replace domain knowledge. Understanding the problem you're trying to solve will help you much more than any library or algorithm.

With that out of the way, in Parts 3 and 4 we'll orient the discussion towards getting to know optuna a little better, so stay tuned…

Thanks for reading!
