Using optuna with sklearn the right way — Part 4

Let's talk about samplers

Walter Sperat
13 min read · Jul 25, 2023
Photo by Eduardo Buscariolli on Unsplash

Introduction

In parts 1, 2 and 3 we looked in a general way at how optuna works, how to make it interact with scikit-learn in the most flexible way possible, and how to use optuna's fantastic pruning functionality to speed up the optimization. But what has been conspicuously missing from the discussion so far (and was even teased in part 3) is the optimization mechanism itself: what optuna calls the "sampler" object.

In scikit-learn there are four "search" classes that implement optimization: GridSearchCV and RandomizedSearchCV, as well as their halving counterparts discussed in the previous article. A lesser-known class, however, is ParameterSampler, which the random searches use internally to generate parameter combinations.

Optuna gives us many, many more ways of generating parameter combinations. In what follows we'll look at the "why", the "what" and the "how".

Searching hard or hardly searching?

Let's first take a small step back and think about the two search types. A grid search is a brute-force approach that can be summarized as nested for-loops that will exhaustively try every possible parameter combination (i.e. the "grid"). This necessitates that we explicitly pass all the parameters we want it to try, which implies that we are limiting the search space in two ways:

  • First, the edges. These are the maximum and minimum values that can be chosen for each numeric parameter. Sometimes these limits are baked into the method (such as max_features in decision trees, whose float values must be greater than zero and at most 1). For others, we have to decide how "overkill" the largest and smallest values are; regularization parameters tend to be a good example of this, where a value of 500 is probably too large for anything useful, let alone anything bigger.
  • Second, the sample spacing. This is the distance between one parameter value and the next. For example, n_estimators in random forests can be any positive integer; say we define the minimum to be 100 trees and the maximum to be 1000. Will we try every possible value between the extremes? Probably not (see the sketch below).
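
To make this concrete, here is a minimal sketch of a grid that bakes in both decisions; the model and the specific values are my own illustrative choices, not prescriptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# An illustrative grid: the edges (100 and 1000 for n_estimators) and the
# spacing (steps of 100) are both decisions we impose before the search starts.
param_grid = {
    'n_estimators': list(range(100, 1001, 100)),
    'max_features': [0.25, 0.5, 0.75, 1.0],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)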

The spacing in particular massively impacts the number of models the search will need to train in order to find the "best" one: the smaller the step size, the longer the optimization will take. However, larger step sizes can negatively impact the results, because the best hyperparameters may lie between the chosen values (for example, if the optimal value is 1.51 and we only sample 1 and 2). And here is where the random search makes its debut.

Random searches are often talked about only in terms of reducing search times, as a sort of stupid younger sibling to grid search's "exact" method. But the truth is far from it: random searches are a force to be reckoned with, and it has actually been shown (Bergstra and Bengio, 2012) that, for the same budget, they tend to find better models than grid search.

The genius of how scikit-learn's random search is implemented is that the param_distributions argument is a dictionary where numerical parameters can be supplied as lists (just like in grid search) or as scipy distribution objects. If a list is provided, the search will sample uniformly from it; if a distribution is provided, at each iteration the search will call its rvs method (which essentially stands for "random variate sample") to get a value from that particular distribution.

This is great because it gives users a lot of flexibility when choosing how they want their parameters to be sampled, so previous knowledge about good hyperparameter values can be suggested to the search, but other values will be tried too.
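
Here's a minimal sketch of what that looks like; the distributions, ranges and model below are my own illustrative choices:

from scipy.stats import loguniform, randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Lists are sampled uniformly; scipy distributions are sampled via .rvs().
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={
        'n_estimators': randint(100, 1001),       # any integer in [100, 1000]
        'max_features': loguniform(1e-2, 1.0),    # log-uniform floats
        'criterion': ['gini', 'entropy'],         # uniform choice from the list
    },
    n_iter=50,
    random_state=42,
)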

So far, it sounds good enough, so what's the catch? The problem with both approaches is that they are blind to their own results: no trial is informed by the previous ones, so they keep jumping from good regions to bad regions. What this means is that a huge amount of resources will be spent training models that will probably be garbage. This is why smarter samplers can help improve our results.

But what does "smarter" mean in this context? Imagine that we are optimizing a function in the range between 0 and 100, and have the following results so far:

Table of observed parameter/result values

Where should we evaluate next? I don't know about you, but I would start evaluating the function around 59, probably towards the larger values, progressively getting further away. As the optimization continues, I would focus my efforts near the numbers where the optimal results were found. That makes sense, right? We first do a short random search, and then start sampling values closer to the maxima.

Workflow

This article is going to be slightly different from the previous ones, in that the code will be minimal; it is more of a presentation of the different types of samplers available in optuna, as well as a broad description of how each of them works.

  1. First, we'll take a look into optuna sampler classes in general and how they interact with studies and trials.
  2. Then we'll do a relatively-deep dive into three sampling algorithms.
  3. Finally, we'll look at a couple of plots to see how the samplers work in practice.

Samplers

Shortly, we'll look at the available samplers in optuna. However, I first wanted to briefly mention why this article has little-to-no code. First of all, samplers are passed to the study in the following way:

from optuna import create_study
from optuna.samplers import YourFavoriteSampler

study = create_study(sampler=YourFavoriteSampler())

So instantiating them is extremely easy. That is why the code for this article is mostly done. As we'll see below, we already have the necessary code in the previous articles.

"But how do we use samplers?", you might ask. This is where the suggest API comes into play; we as users need only pass the sampler to the study, and it will manage everything for us. Internally, this is what's going on:

The flow of information between study, objective function, trial and sampler. A more detailed explanation can be found here.

The exact details are a bit more complicated, but basically what happens is this: the study instantiates a trial, the sampler takes an element from the search space, and then the objective function is called; after the return value is handed to the study, the trial is stored and the process is repeated with a new trial. If that sentence seems at odds with the diagram, note that the "relative" objects in the image consider the relationships between hyperparameters, while the "independent" ones handle hyperparameters that are unrelated to the others. In general, everything is relative unless you dynamically alter the search space during optimization (yes, you can do that quite easily).
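
If you prefer code to diagrams, here is a rough sketch of that loop written with optuna's ask/tell interface; the toy objective is mine, not from the series:

from optuna import create_study
from optuna.samplers import RandomSampler

study = create_study(sampler=RandomSampler(seed=42))

for _ in range(10):
    trial = study.ask()                    # the study instantiates a trial
    x = trial.suggest_float('x', -10, 10)  # the sampler picks a value from the space
    value = (x - 2) ** 2                   # the objective function is evaluated
    study.tell(trial, value)               # the result is stored in the study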

GridSampler and RandomSampler

As already discussed, the simplest samplers are basically the same as in scikit-learn. The random sampler will always sample using whatever distribution you specified in the suggest API (i.e. a uniform or log-uniform distribution) and tell the trial to evaluate whatever comes out. The RandomSampler __init__ method has only a seed argument to initialize the sampling.
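
For completeness, here's roughly how they are instantiated; note that, unlike in scikit-learn, GridSampler needs the full grid up front (the values below are my own):

from optuna.samplers import GridSampler, RandomSampler

# GridSampler requires the whole search space explicitly, like GridSearchCV;
# RandomSampler only takes a seed.
grid_sampler = GridSampler(search_space={
    'max_features': [0.25, 0.5, 0.75, 1.0],
    'min_samples_leaf': [0.01, 0.05, 0.1],
})
random_sampler = RandomSampler(seed=42)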

Let's move on to more interesting stuff.

Quasi-Monte Carlo Sampler

Quasi-Monte Carlo sequences are sets of numbers with a low discrepancy. That doesn't help? No?

QMC sampling is conceptually similar to random sampling in the sense that we don't explicitly define the hyperparameter values to try. However, the two differ in one key aspect: in QMC, sampling is modified in such a way that the elements are approximately evenly distributed.

Random samples versus QMC samples. Image taken from wikipedia.

Looking at the image we can intuitively understand what's going on, and that this business about "discrepancy" is actually related to how homogeneously distributed the points are. In a sense, the QMCSampler combines the best of the random and grid samplers because it efficiently samples the search space while not being restricted by the grid spacing.

The __init__ arguments are related to the method used when generating the QMC sequence.
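
A minimal sketch; qmc_type selects the sequence family ("sobol" or "halton"), and scramble randomizes it:

from optuna.samplers import QMCSampler

# qmc_type chooses the low-discrepancy sequence; scrambling adds randomness
# while preserving the even coverage of the space.
sampler = QMCSampler(qmc_type='sobol', scramble=True, seed=42)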

Covariance Matrix Adaptation Evolution Strategy Sampler

This is when some really cool non-math concepts start to appear (though the math is still pretty cool too).

The Covariance Matrix Adaptation (CMA) is arguably the scariest part of the name. Let's break it down:

  • Covariance refers to the relationship between two numeric variables. The variance of a variable, for example, can be interpreted as the covariance of the variable with itself.
  • Matrix in this particular case, refers to the matrix that contains the covariance values between all variables.
  • Adaptation refers to the fact that the covariance matrix previously calculated will be modified (i.e. adapted) as the optimization progresses.

At each step, the method will try to fit a Gaussian to the objective function (no matter how poorly), evaluate several points, and refit the Gaussian. The idea is that, given enough iterations, the Gaussian will have homed in on the optimal value of the objective function in that search space.

For a very clear illustration, take a look at this gif from the original repository; on the left you can see the objective function, as well as the evaluated points, and on the right you can see the Gaussian. As is shown in the animation, the gaussian's covariance (i.e. the "size" of the ovals on the right) progressively shrinks as it gets closer to the optimal value.

In order for the optimization to actually optimize anything, the mean and covariance matrix of the Gaussian must actually converge to the desired values. That is where the Evolution Strategy (ES) part comes in.

At each iteration, several elements (called the "population") will be sampled from the current Gaussian and their fitness will be calculated (which basically means that the objective function will be evaluated using them). Then, a weighted mean and covariance of all the population parameters are calculated, where the weighting factor is proportional to the fitness, and a Gaussian with those parameters is built. This essentially implies that those individuals (i.e. parameter combinations) closest to the optimum will have a larger influence on the shape of the resulting Gaussian. For the terminology-avid reader, this process is called weighted intermediate recombination.
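
Here's a toy, single-iteration sketch of that recombination step, assuming a 2-D search space and a made-up objective; real CMA-ES also adapts step sizes and evolution paths, which are omitted here:

import numpy as np

rng = np.random.default_rng(42)
mean, cov = np.zeros(2), np.eye(2)

# Sample a population from the current Gaussian and compute its fitness
# (here, distance to a fictional optimum at (3, 3), to be maximized).
population = rng.multivariate_normal(mean, cov, size=20)
fitness = -np.sum((population - 3.0) ** 2, axis=1)

# Keep the fittest half and give larger weights to fitter individuals.
best = population[np.argsort(fitness)[-10:]]
weights = np.linspace(0.05, 0.15, 10)
weights /= weights.sum()

# Weighted intermediate recombination: the new Gaussian's parameters.
mean = weights @ best
centered = best - mean
cov = (weights[:, None] * centered).T @ centered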

An important disadvantage of this method is that, by design, it is very prone to falling into local maxima/minima. So what optuna's implementation can do is restart the search several times with different (and progressively larger) populations, allowing it to better explore the whole hyperparameter space.
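
A minimal sketch of enabling that behavior; restart_strategy="ipop" turns on restarts with an increasing population size, and sigma0 (my choice here) sets the initial spread of the Gaussian:

from optuna.samplers import CmaEsSampler

# Restarts with a population that grows by a factor of inc_popsize each time
# help the search escape local optima.
sampler = CmaEsSampler(seed=42, restart_strategy='ipop', inc_popsize=2, sigma0=0.2)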

This method, along with the following one, is called a "surrogate model" optimization method. If you think about it, the name makes a ton of sense, because the algorithm is building a representation of the objective function (which in this case is extremely simplified into a Gaussian) and updating that representation according to the values obtained. This internal "model" of the objective function ensures that most of the computing resources are dedicated to "good" combinations of hyperparameters.

Tree-Structured Parzen Estimator Sampler

That is a really scary-looking name, isn't it? Fear not, by the end of this section you'll know exactly what each part corresponds to.

To adequately understand what the TPESampler does, we have to consider two things. The first is that this is another surrogate model algorithm, which means that the optimization will try to learn what the objective function looks like; and the second is that it is a Bayesian approach, which means that all previously known results are incorporated when updating the model.

Like CMA-ES before it, TPE starts by sampling randomly from the available space. To understand what it does next, it helps to first look at the classic Bayesian optimization approach, which uses the calculated objective function values to fit a Gaussian Process (GP) model. I know this sounds complicated (and it is), but a GP can be summarized as a continuous function in which every point has an intrinsic uncertainty that follows a Gaussian distribution.

Example of a Gaussian Process (blue line) fitted on data (red points) with its corresponding credibility intervals (shaded blue region) to estimate the unknown response function (dotted black line). Image taken from here.

The figure above shows one such example, where the observed values are shown in red, and the model is the blue curve and the blue-shaded region. If you were to make a vertical cut on any point, the response function would have a normal (Gaussian) distribution centered around the blue curve, and a standard deviation proportional to the width of the shaded interval.

Now that we have a surrogate model, we need to select the hyperparameter combination to try next. A GP-based optimizer does this with a lot of math that computes a compromise between the expected value (i.e. the blue curve of the figure) and the variance of the GP. It may seem weird at first, but bear with me. The first part is clear enough: we want to maximize (or minimize) the function, so choosing the largest expected value is the obvious answer. But the GP is a model, and this means that it may well be wrong. That is why the uncertainty (modeled by the credibility intervals) is useful here, because it shows where the algorithm is less sure of its estimation. TPE strikes the same compromise in a different way: instead of modeling the objective directly, it splits the trials observed so far into a "good" group and a "bad" group, fits a probability density (a Parzen estimator, essentially a kernel density estimate) to each, and suggests the candidate where the ratio of good density to bad density is largest. This density-fitting procedure is called Adaptive Parzen Estimation.

This is known as the exploration-exploitation compromise:

  • On one hand, we want to exploit (make the most out of) the regions where we are certain (low uncertainty) the function could have an optimum.
  • On the other hand, we want the optimization to efficiently explore as much of the search space as possible, so focusing on the regions where uncertainty is largest will allow the optimization to evaluate otherwise unlikely values. Note in the figure above that at both extremes (near -2 and 10) the blue curve drops downward, significantly deviating from the black line's true value; the large uncertainty will eventually force the optimization to evaluate points there.
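
To make the Parzen-estimator idea above concrete, here is a toy sketch using the earlier 0-to-100 example; the data, the threshold and the KDE choice are all mine:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)                  # observed hyperparameter values
y = -(x - 59.0) ** 2 + rng.normal(0, 500, 50)     # observed objective (maximize)

# Split trials into "good" (top 20%) and "bad", and fit a density to each.
threshold = np.quantile(y, 0.8)
good = gaussian_kde(x[y >= threshold])
bad = gaussian_kde(x[y < threshold])

# Suggest the candidate where the good-to-bad density ratio is largest.
candidates = np.linspace(0, 100, 1000)
next_x = candidates[np.argmax(good(candidates) / bad(candidates))]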

What about the "Tree-Structured" part? In the original paper, the authors designed the algorithm to work with neural networks. If you think about the hyperparameters of a neural network, they are mostly determined by the topology of the network: if you choose a network with ten layers, then you have to optimize the hyperparameters for each layer (activation function, number of neurons, connections, etc.). This means that the search over some hyperparameters is conditioned on the selection of others: you first need to choose the number of layers, then the type of each layer, and then the hyperparameters for its neurons. This hierarchically conditioned structure is a directed graph… also called a tree.
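
In optuna, such a conditional space is expressed naturally inside the objective function; here is a toy sketch (the parameter names and ranges are mine):

from optuna import create_study

def objective(trial):
    # Which hyperparameters exist depends on n_layers, chosen first:
    # exactly the tree structure TPE was designed for.
    n_layers = trial.suggest_int('n_layers', 1, 3)
    units = [
        trial.suggest_int(f'units_{i}', 16, 256, log=True)
        for i in range(n_layers)
    ]
    return sum(units)  # stand-in for a real validation score

study = create_study(direction='minimize')
study.optimize(objective, n_trials=20)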

The TPESampler is what optuna uses whenever you pass no sampler explicitly.
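
If you want to tune the default, two arguments are worth knowing about; a minimal sketch:

from optuna.samplers import TPESampler

# n_startup_trials controls the length of the initial random phase, and
# multivariate=True (experimental) models relationships between
# hyperparameters jointly instead of independently.
sampler = TPESampler(n_startup_trials=10, multivariate=True, seed=42)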

Other samplers

If you look here, you'll see that optuna has many more samplers than we covered (and there are a couple more in the integration module). I chose these because they are the ones I have found most useful in my day-to-day work.

Hands-on

The set-up will be basically the same as in the first article: an instantiation function, an objective function and a study. However, instead of just one study, we'll do one with each sampler. The other thing to take into account is that the instantiation function was modified like so:

from optuna import Trial
from sklearn.ensemble import ExtraTreesClassifier

def instantiate_extra_trees(trial: Trial) -> ExtraTreesClassifier:
    params = {
        'max_features': trial.suggest_float('max_features', 1e-5, 1),
        'min_samples_leaf': trial.suggest_float('min_samples_leaf', 1e-5, 1),
        'n_jobs': -1,
        'random_state': 42
    }
    return ExtraTreesClassifier(**params)

First, as this is for illustrative purposes, we have removed most hyperparameters. Second, the zeros have been replaced with very small values; this is to prevent any of the samplers from acting up and selecting an invalid zero (because you can't build a leaf with zero elements).

RandomSampler

The first study is performed using the old faithful random sampler:

from optuna.samplers import RandomSampler

random_study = create_study(
    direction="maximize",
    sampler=RandomSampler(seed=42)
)
random_study.optimize(lambda trial: objective(trial, X, y), n_trials=100)

After one hundred trials (or as many as you like), we can make use of yet another of optuna's convenience modules: optuna.visualization. This has lots of utility functions that can take information out of the study and plot it nicely for us to interpret. It should be noted that the module uses a plotly backend, which can be a bit of a pain to set up correctly in jupyter lab; if you want a guaranteed way of getting the results, the same functions are available (though experimental) in optuna.visualization.matplotlib:

from optuna.visualization import plot_contour
# or, for the experimental matplotlib backend:
# from optuna.visualization.matplotlib import plot_contour

plot_contour(random_study, params=['max_features', 'min_samples_leaf'])

The x and y axes represent the hyperparameter values, while the z axis (the color) represents the value of the objective function. Every black dot is a hyperparameter combination evaluated using cross validation.

There are two things worth noting in the previous plot. The first is that points tend to be either extremely close to each other or very far apart, so some regions are not properly explored. The second is that there are many dots at the top of the plot, where performance is pretty lousy.

Let's see how the QMC sampler did.

QMCSampler

from optuna.samplers import QMCSampler

qmc_study = create_study(
    direction="maximize",
    sampler=QMCSampler(seed=42),
)
qmc_study.optimize(lambda trial: objective(trial, X, y), n_trials=100)

plot_contour(qmc_study, ['max_features', 'min_samples_leaf'])

The difference with the random sampler is very noticeable: this space is much more extensively explored, with the spacing between points being very homogeneous. However, just like with the random sampler, a lot of resources are wasted exploring the unfruitful region at the top.

What about the surrogate model samplers?

CmaEsSampler

from optuna.samplers import CmaEsSampler

cma_study = create_study(
    direction="maximize",
    sampler=CmaEsSampler(seed=42),
)
cma_study.optimize(lambda trial: objective(trial, X, y), n_trials=100)

plot_contour(cma_study, ['max_features', 'min_samples_leaf'])

That's more like it! The top region is very sparsely populated, and most of the points are clustered towards the mid-lower part of the plot. This means that the optimization learned that this was the most fruitful region.

How does this compare to the Bayesian sampler?

TPESampler

from optuna.samplers import TPESampler

tpe_study = create_study(
    direction="maximize",
    sampler=TPESampler(seed=42),
)
tpe_study.optimize(lambda trial: objective(trial, X, y), n_trials=100)

plot_contour(tpe_study, ['max_features', 'min_samples_leaf'])

This looks very similar to the previous plot, but there is a big difference: points are more tightly clustered in the mid-lower region of the plot. This is neither good nor bad, but it's worth noting how the sampling strategies differ in identical conditions.

Conclusion

Wow, you've made it this far! If you have actually read all the way from part 1 to here, I give you my most sincere congratulations.

The idea for this series was to give a roughly-comprehensive overview of how optuna works and the best way of leveraging it if you use scikit-learn as your main library. Hopefully though, you'll be able to take the ideas presented here and extrapolate them to your favorite library (such as spark's mllib, which has a significantly different API).

Obviously, we just scratched the surface of everything available in optuna, but for now this part will be the end of the series. I may do another on more optuna-specific stuff in the future, so feel free to ask if there is anything in particular you'd like me to write about.

Thanks for reading!
