Using optuna with sklearn the right way — Part 3

Pruning unpromising trials to smartly speed up the optimization

Walter Sperat
13 min read · Jul 17, 2023

Introduction

In the previous two articles, we looked at how to use optuna to replace sklearn's GridSearchCV and RandomizedSearchCV classes. This article will follow the conventions established there, so I recommend reading them if you haven't.

Hyperparameter optimization, however, is still an extremely expensive procedure: each fold of the cross validation needs to train an entire model. Multiply that by the number of hyperparameter combinations to evaluate and you get an enormous amount of time spent. The worst part is that most of these combinations will probably be garbage.

Wouldn't it be great if we could just train a simpler version of the model on all the selected combinations, and then train the full model on the most promising ones?

Well, dear reader, that's exactly what we're going to talk about today, so buckle up.

Halving Searches and Pruning

In the past few releases, scikit-learn developers have introduced two new hyperparameter optimization classes: HalvingGridSearchCV and HalvingRandomSearchCV (though at the time of writing they are still technically experimental).

What does halving have to do with anything? Well, follow the procedure and you'll get it:

  1. Choose a complexity-related resource (be patient, we'll define it later),
  2. Define a minimum amount of resources to be used (i.e. the "simpler" version of the model mentioned in the introduction),
  3. Train all candidates with the minimum amount of resources defined above and evaluate performance,
  4. Given the performance metrics of the previous iteration, select the top 50%,
  5. Train the selected models using double the amount of resources as before and evaluate performance,
  6. Given the performance metrics of the previous iteration, select the top 50%,
  7. Rinse and repeat until you either exhaust the amount of resources or are left with only one model.

But what about that "resource"? The idea is for it to be something that roughly equates to computation time, such as the number of rows to train with or the number of learners in an ensemble. The first iterations will get a rough estimate of performance for each hyperparameter combination; then, given the best of those, the resources (a.k.a. computation time) are increased and the models retrained. With each iteration, the performance estimate will become better and better (because more resources is more better), so the optimization will quickly home in on the best performers.

With this in mind, halving searches will train many models (many more than the non-halving classes, mind you), but they will finish much faster, because resource allocation is optimized through the successive model selection.

As a small aside, the "halving" itself is not necessarily literal. By default, a halving search will keep the best third of the candidates and increase the resource by a factor of three. This factor can be controlled from the class constructor, to tune more finely how aggressively we want the optimization to advance. I highly recommend looking at the documentation for further details.
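For reference, a minimal sketch of what the sklearn side looks like (the estimator and parameter grid here are just assumptions for illustration; note the experimental import):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401, enables the class
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.ensemble import ExtraTreesClassifier

search = HalvingRandomSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_distributions={'max_depth': range(1, 21), 'max_features': [0.25, 0.5, 0.75, 1.0]},
    resource='n_estimators',  # the complexity-related resource
    min_resources=100,        # the "simpler" version of the model
    max_resources=1000,
    factor=3,                 # keep the top third, triple the resource at each rung
    random_state=42
)
# search.fit(X, y)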

In this article, we'll look at optuna's "pruner" objects, which conveniently, allow us to do the same, but with more flexibility.


We haven't interacted much with trials so far, but during optimization they can do several things. One of them is calling the "report" method, which stores two values inside the trial and passes them to a pruner (the pruner is defined at the study level).

The procedure is as follows: during a trial run, the model is trained partially (for example, on a subsample of data, by calling the "partial_fit" method, or by setting warm_start=True for models that support it) and evaluated; this evaluation is reported by the trial and internally passed to the pruner, which, according to its particular logic, determines whether further training iterations (training with more data, adding another epoch or adding more trees, respectively) are likely to improve performance. If performance isn't up to the pruner's exacting standards, it will kill the trial and force the study to move on to the next one.
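In code, the general pattern looks like this (train_one_step_and_score is a hypothetical placeholder for whatever partial training you do):

import optuna

def objective(trial: optuna.Trial) -> float:
    score = 0.0
    for step in range(100):
        score = train_one_step_and_score()  # hypothetical: partial_fit, one epoch, one more tree...
        trial.report(score, step)           # store (value, step) and hand them to the pruner
        if trial.should_prune():            # the pruner's verdict, based on this and other trials
            raise optuna.TrialPruned()
    return score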

If you think about it, the successive halving algorithm (SHA) implemented in sklearn is incompatible with the way optuna works, because SHA requires all hyperparameter combinations to be evaluated at a given rung before it can select the best performers and promote them to the next one ("rung" is the term used in the papers).

Optuna trials, however, do not interact with one another and are therefore completely independent (which can be taken as a very loose definition of them being "asynchronous"). Therefore, optuna's equivalent of scikit-learn's SHA procedure is the Asynchronous SHA (ASHA) implementation. This procedure uses a heuristic that allows trials to be promoted early during the process, and assumes that promotion mistakes will vanish asymptotically as the optimization progresses. I highly recommend reading the 2020 paper to understand the algorithm in detail; the section on ASHA is surprisingly user-friendly.
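Optuna exposes the corresponding knobs directly on the pruner; a minimal sketch (the values shown are the library defaults, not recommendations):

from optuna.pruners import SuccessiveHalvingPruner

# ASHA-style pruner: trials are compared rung by rung and promoted asynchronously
pruner = SuccessiveHalvingPruner(
    min_resource='auto',        # resource budget assigned to the first rung
    reduction_factor=4,         # keep roughly the top 1/4 at each rung
    min_early_stopping_rate=0   # how many rungs to wait before pruning can start
)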

Workflow

The article itself will mostly look at the SuccessiveHalvingPruner (though optuna has many more), because it's the equivalent of scikit-learn's halving searches. However, we'll see three different ways of using it:

  1. First we'll look at how to use the full dataset for those models that implement the partial_fit method.
  2. Then, we'll look at the models that have the warm_start option in their constructor.
  3. Finally, we’ll look at how to use pruning for training any model with incrementally more data.

All right, let's dive in.

Epochs as the resource

If you are at all familiar with neural network training, you have probably heard about something called "epochs". During training, the algorithm throws data at the network and tells it to make predictions: when the prediction is correct, nothing happens; when the network makes a mistake, the algorithm hits it and says "bad network, when will you learn", and tells it to adjust its weights in a way that will prevent that mistake from being made again (theoretically, at least). In what is probably the best example of a toxic relationship, the algorithm then repeats the process, giving more examples to the network and verbally abusing it until its mistake rate is down to whatever unrealistic standards the algorithm has. That try-test-abuse-update cycle is called an "epoch".

This successive re-adjusting of the model can improve performance. Adding too many epochs, however, can lead to overfitting. So how can we know how many epochs are really necessary? Well, that's exactly where trial pruning comes in.

Let's look at a simple objective function:

import numpy as np
from typing import Optional
from pandas import DataFrame
from optuna import Trial, TrialPruned
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def objective(trial : Trial, X : DataFrame, y : DataFrame, seed : int=42) -> Optional[float]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, shuffle=True, random_state=seed
    )

    sgd = instantiate_sgd_classifier(trial, random_state=seed)
    n_train_iter = 128

    for epoch in range(n_train_iter):
        # the first call to partial_fit needs the full list of classes
        sgd.partial_fit(X_train, y_train, classes=np.unique(y))

        epoch_score = roc_auc_score(y_test, sgd.predict_proba(X_test)[:, 1])
        trial.report(epoch_score, epoch)

        if trial.should_prune():
            raise TrialPruned()

    return epoch_score

As always, the objective function takes an optuna trial and some data, which it splits into train and validation sets; the learner is then initialized by an appropriately-designed instantiation function. So far so good, nothing new yet.
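The instantiate_sgd_classifier function isn't reproduced here; a minimal sketch in the spirit of the previous articles could look like this (the hyperparameter ranges and the log-loss choice, needed for predict_proba, are my assumptions):

def instantiate_sgd_classifier(trial : Trial, random_state : int=42) -> SGDClassifier:
    # Assumed ranges; 'log_loss' makes predict_proba available for the ROC-AUC score
    params = {
        'loss': 'log_loss',
        'penalty': trial.suggest_categorical('penalty', ['l2', 'l1', 'elasticnet']),
        'alpha': trial.suggest_float('alpha', 1e-6, 1e-1, log=True),
        'random_state': random_state
    }
    return SGDClassifier(**params)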

The following loop calls the partial_fit method of the SGDClassifier which, on its first iteration, makes a single pass over the training data and learns the model parameters accordingly. The classifier is then scored on the test data (the SGDClassifier in particular only has predict_proba for some losses, so be careful with your choice of metric). The following three lines are where the magic happens: the trial calls the report method, which stores the epoch_score and epoch values and passes them on to the pruner internally. By looking at these values, the previous values and other trials' values, the pruner tells the trial whether it should be pruned or not (the pruner is what actually gets called inside the should_prune method). If it returns True, the trial is prematurely killed and the study is told to move on to the next one, for this one was pruned. That wasn't so hard now, was it?

In order to activate pruning during a study, you must instantiate it in the following way:

from optuna import create_study
from optuna.pruners import SuccessiveHalvingPruner
from optuna.samplers import RandomSampler

study = create_study(
    direction="maximize",
    pruner=SuccessiveHalvingPruner(reduction_factor=2),
    sampler=RandomSampler(seed=42)  # not necessary, helps with reproducibility
)
study.optimize(lambda trial: objective(trial, X, y), n_trials=60)

As you can see, the study definition is only different in that a pruner object must be passed. Obviously, any pruner can be passed here.
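For instance, swapping in a different pruner is a one-line change (the parameter values below are just illustrative):

from optuna.pruners import MedianPruner, HyperbandPruner

# Prune a trial whose intermediate value is worse than the median of previous trials at the same step
median_study = create_study(direction="maximize", pruner=MedianPruner(n_warmup_steps=5))

# Hyperband runs several successive-halving brackets with different aggressiveness levels
hyperband_study = create_study(direction="maximize", pruner=HyperbandPruner(reduction_factor=3))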

With respect to the RandomSampler object passed in the constructor, it is not really necessary (the study falls back to a default sampler), but passing it makes the optimization reproducible. So, if you want to try and see what happens when different values are passed (to the reduction_factor, for example), you'll need the sampler for comparisons to be really fair. Don't worry, we'll look at samplers in a future article.

Sub-models as the resource

Now, let's think about ensembles. These models use base (also called "weak") learners to reduce bias, variance or both. Which begs the question: how many base learners are actually necessary? Depending on the chosen algorithm, more base learners may help reduce the chosen type of error; however, adding more models takes time and won't necessarily improve the final model. So, just as before, we'll let the pruner decide when enough is enough.

To this end, we'll recycle the instantiate_extra_trees function from the first article, but adding the warm_start parameter:

from sklearn.ensemble import ExtraTreesClassifier

def instantiate_extra_trees(trial : Trial, warm_start : bool=False) -> ExtraTreesClassifier:
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 1, 20),
        'max_features': trial.suggest_float('max_features', 0, 1),
        'bootstrap': trial.suggest_categorical('bootstrap', [True, False]),
        'warm_start': warm_start,
        'n_jobs': -1,
        'random_state': 42
    }
    return ExtraTreesClassifier(**params)

Now we need to define the objective function, which as you'll see will be a tad tricky. Before going into the code, let's think through the logic we need to implement:

  1. We need to get the amount of models (i.e. n_estimators) the trial wants to train, and store it in a variable.
  2. We need to reset the n_estimators value to a lower number.
  3. Finally, we'll need to train the partial model and score it on the test data, pruning if appropriate.

With that out of the way, here's the code:

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer

def objective(trial : Trial, X : DataFrame, y : DataFrame, seed : int=42) -> Optional[float]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, shuffle=True, random_state=seed
    )

    model = instantiate_extra_trees(trial, warm_start=True)
    n_estimators = model.get_params().get('n_estimators')
    min_estimators = 100

    for num_estimators in range(min_estimators, n_estimators + 1):
        model.set_params(n_estimators=num_estimators)
        model.fit(X_train, y_train)

        score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        trial.report(score, num_estimators)

        if trial.should_prune():
            raise TrialPruned()

    kfold = KFold(shuffle=True, random_state=seed)
    roc_auc = make_scorer(roc_auc_score, needs_proba=True)
    scores = cross_val_score(model, X, y, cv=kfold, scoring=roc_auc)

    return np.min([np.mean(scores), np.median(scores)])

I told you it wasn't going to be trivial. But don't panic, we'll go over it line by line. The first line splits the data so that a non-overfit score can be passed to the pruner; we then instantiate the model using the function, being careful to pass warm_start=True. This is where the tricky part starts:

n_estimators = model.get_params().get('n_estimators')
min_estimators = 100

In case you aren't familiar, scikit-learn estimators have a get_params method that returns a dictionary with all the arguments passed to the __init__ method, so this basically extracts the maximum number of trees that should be trained. Then, the min_estimators variable is set to 100 (though it can obviously be any number you want); this value determines the mandatory number of trees trained in all trials.
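As a quick illustration (a throwaway snippet, not part of the objective):

clf = ExtraTreesClassifier(n_estimators=300, max_depth=5)
clf.get_params()['n_estimators']      # 300
clf.get_params().get('n_estimators')  # same thing: 300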

The loop itself will go from that minimum value up to the total amount of estimators chosen for that trial, adding a single tree at each iteration.

for num_estimators in range(min_estimators, n_estimators + 1):
    model.set_params(n_estimators=num_estimators)
    model.fit(X_train, y_train)

    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    trial.report(score, num_estimators)

    if trial.should_prune():
        raise TrialPruned()

Inside the loop, we use the "set_params" method to set the number of estimators to the value corresponding to the current iteration. We then fit the model on the training data and score it; the trial's "report" method is used to tell the pruner what's up with the resource. Then, the pruner is asked about the promise of this trial through the "should_prune" method. If appropriate, a TrialPruned exception is raised and the trial is stopped.

kfold = KFold(shuffle=True, random_state=seed)
roc_auc = make_scorer(roc_auc_score, needs_proba=True)
scores = cross_val_score(model, X, y, cv=kfold, scoring=roc_auc)

return np.min([np.mean(scores), np.median(scores)])

Finally, if the pruner deemed our trial worthy of running to completion, a cross validation is performed on all the data and its overfitting-preventing value is returned to the study.

"But Walter", you might say, "this is incredibly inefficient, we're training the full model, one tree at a time, only to retrain it five times again in the cross validation". And, esteemed reader, you're absolutely right. However, if you think about the number of trials that will be pruned, this makes the retraining overhead almost negligible. This is because the unpromising (is that even a word) trials will be killed before the model is fully trained, and the cross validation won't be performed.

After that, the study can be created and used just like before:

study = create_study(
    direction="maximize",
    pruner=SuccessiveHalvingPruner(reduction_factor=2),
    sampler=RandomSampler(seed=42)  # not necessary, helps with reproducibility
)
study.optimize(lambda trial: objective(trial, X, y), n_trials=60)

Data as the resource

Now we get to the trickiest part: chopping the dataset up into pieces. This will take some time and thinking, but trust me when I tell you it will be worth it.

First, we will need a helper function that allows us to conveniently take a logarithm in any base and take the integer part. It can be defined either as a lambda function or a normal one:

log_int = lambda x, base: np.floor(np.log(x)/np.log(base)).astype(int)

def log_int(x, base : int=2):
    return np.floor(np.log(x)/np.log(base)).astype(int)

Something important to note here is that the base must be an integer greater than one. As we'll see later, this is related to the pruner's reduction_factor, which has exponential behavior. The function will allow us to stay in line with the pruner's way of deciding when to prune.
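To make the behavior concrete (using the definition above):

log_int(10_000, 2)  # 13, because 2**13 = 8192 <= 10000 < 2**14
log_int(10_000, 3)  # 8,  because 3**8 = 6561 <= 10000 < 3**9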

Now on to what I'll call the "scale logic". This will consist of getting as many powers of the base (in our examples, the base is 2) as we can out of the data, given the number of rungs defined by the user:

def get_sample_numbers(y : DataFrame, base : int, n_rungs : int) -> list[int]:
    data_size = len(y)
    data_scale = log_int(data_size, base)
    min_scale = data_scale - n_rungs
    min_samples = base**min_scale  # sample count for the first (smallest) rung

    return [
        *map(lambda scale: base**scale, range(min_scale, data_scale+1))
    ]

Though it may appear arbitrary, this will allow us to sample from the data in powers of “base” (say, 2). For example, if the dataset has a size (data_size) of 10000 and n_rungs is 4, then data_scale will be 13, min_scale will be 9, and min_samples will be 2**9 (equal to 512). This means that in the first iteration the model will be trained with 512 samples, in the next iteration with 1024, the next one with 2048, etc.
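Plugging in the numbers from that example (only len(y) matters here, so any 10000-row object will do):

import pandas as pd

y_demo = pd.DataFrame({'target': range(10_000)})
get_sample_numbers(y_demo, base=2, n_rungs=4)
# [512, 1024, 2048, 4096, 8192]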

The “base” parameter will allow us to set a different reduction factor in the pruner, and everything will work the same, but with different numbers of samples (and growth speeds). The “n_rungs” parameter lets us choose the number of rungs (or levels) that the ASHA algorithm will consider; this basically equates to how many times the model will be retrained.

With that out of the way, here's the objective function in all its glory:

def objective(trial: Trial, X : DataFrame, y : DataFrame, seed : int=42, base : int=2, n_rungs : int=4) -> float:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, shuffle=True, random_state=seed
    )
    model = instantiate_extra_trees(trial, warm_start=False)

    n_samples_list = get_sample_numbers(y_train, base, n_rungs)

    for n_samples in n_samples_list:
        X_train_sample = X_train.sample(n_samples, random_state=seed)
        y_train_sample = y_train.sample(n_samples, random_state=seed)

        model.fit(X_train_sample, y_train_sample.values.ravel())

        score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        trial.report(score, n_samples)

        if trial.should_prune():
            raise TrialPruned()

    kfold = KFold(shuffle=True, random_state=seed)
    roc_auc = make_scorer(roc_auc_score, needs_proba=True)
    scores = cross_val_score(model, X, y, cv=kfold, scoring=roc_auc)
    return np.min([np.mean(scores), np.median(scores)])

As you can probably notice, the high-level logic is basically the same as the previous two: split the data, instantiate the model, train increasingly complex models in a loop and evaluate them on the dataset not used during training, prune if appropriate, cross_validate and return.

Now let's take a closer look at the things that are different. First, notice that warm_start=False is passed to the instantiation function: since each rung retrains the model from scratch on a freshly sampled subset, there is nothing to warm-start from.

After that, the scale arithmetic is computed using the get_sample_numbers function explained earlier. To reiterate, this function determines how many rows must be sampled at each rung for this particular trial.

Moving on to the loop:

for n_samples in n_samples_list:
    X_train_sample = X_train.sample(n_samples, random_state=seed)
    y_train_sample = y_train.sample(n_samples, random_state=seed)

    model.fit(X_train_sample, y_train_sample.values.ravel())

    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    trial.report(score, n_samples)

    if trial.should_prune():
        raise TrialPruned()

First, we're using the pandas DataFrame's sample method, which randomly takes rows of the dataset (by default, without replacement); because X_train and y_train share the same index and the same random_state, the sampled features and labels stay aligned. This smaller dataset is then used to train a model and score it. The pruner is then asked what should be done with this trial.

Finally, as always, we're doing cross validation on the complete dataset and returning its value:

kfold = KFold(shuffle=True, random_state=seed)
roc_auc = make_scorer(roc_auc_score, needs_proba=True)
scores = cross_val_score(model, X, y, cv=kfold, scoring=roc_auc)
return np.min([np.mean(scores), np.median(scores)])

After this, defining the study and running the optimization seems trivial, doesn't it?

factor = 2
study = create_study(
    direction="maximize",
    pruner=SuccessiveHalvingPruner(reduction_factor=factor),
    sampler=RandomSampler(seed=42)  # not necessary, helps with reproducibility
)
study.optimize(
    lambda trial: objective(trial, X, y, base=factor, n_rungs=4), n_trials=60
)

Some thoughts

That was something, wasn't it? If you've actually made it this far, you really deserve some congratulations.

These methods can be extremely useful when dealing with very complicated models or huge datasets, because they will allow the study to focus its resources on the most promising trials.

The methods presented here are a minimal implementation; depending on your use-case and available compute, you could modify them to do more or fewer iterations, do intermediate cross validations to report to the pruner, or even train unsupervised learning models!

Something else I'd like to point out is regarding the get_sample_numbers function. In the article, the function was only used for the last example, although equivalents exist for the first case (where the function would define the number of partial_fit iterations) and the second (where it would define the number of base learners as powers of the base). In contrast to the third case, however, those two aren't strictly necessary, because the overhead added at each iteration is relatively low (though this is algorithm-specific for ensembles); that's why I only implemented it for the last one. If you think it may be a good improvement, then go for it.
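For the warm-start case, such an adaptation could look roughly like this (my own sketch, not part of the original code; it grows the ensemble in powers of the base, rounding the endpoints down):

# Sketch: add trees in powers of `base` instead of one at a time (warm_start=True assumed)
for scale in range(log_int(min_estimators, base), log_int(n_estimators, base) + 1):
    num_estimators = base**scale
    model.set_params(n_estimators=num_estimators)
    model.fit(X_train, y_train)

    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    trial.report(score, num_estimators)

    if trial.should_prune():
        raise TrialPruned()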

Personally, I consider options 2 (using warm start to incrementally add sub-models) and 3 (refitting the entire model on more and more data) to be more useful. However, you can use the code here and extend it to whatever you are working on.

In the next part we'll look at different optimization methods, how they work and how to correctly leverage them.

I hope you enjoyed reading and find it useful.
