Using Optuna the wrong way

Because why not manually implement a logistic regression?

Walter Sperat
7 min read · Jun 7, 2023

Unless you have lived under a rock for the last couple of years, you probably already know that Optuna is a Python library aimed at hyperparameter optimization.

In this post, however, I would like to take a step back and do something different. In and of itself, there's nothing in the library that forces us to do machine learning, or even to tune hyperparameters.

The Optuna API

Let's look at the main objects the library exposes for us to use and (broadly) how they interact:

  • Study: this object will centralize and store everything regarding the optimization; it's the main thing we'll interact with.
  • Trial (in all its different versions): each of these objects represents a single instance of function evaluation (parameter values, function results) with its corresponding metadata (start and end time, pruning data, attributes, etc.)
  • Objective function: this is any Python callable (a plain function, a lambda, a functools.partial, or any object that defines a __call__ method) that takes a trial as an argument and returns a number (though it can also return a tuple containing several numbers).

And that's it!

There are obviously a lot more moving parts (such as samplers, which we'll expand on in a couple of sections), but if you learn how to effectively use these three types of objects, you can already call yourself a good Optuna user.

Studies are something you need to instantiate explicitly; objective functions are something you define yourself, and they encapsulate whatever it is you want to optimize. Inside the objective function you write whatever logic is necessary for your particular use case. Furthermore, because the objective function takes a trial as its argument, that's where you get to interact with the trial object through the suggest API.

It may have a scary name, but the suggest API basically lets you hand over to the trial the decision of what value to try next. If this seems a bit too abstract right now, don't worry; we'll see an example shortly.
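To make this a bit more concrete right away, here's a minimal toy sketch (separate from the main example below) that uses all three pieces; the quadratic and the search range are arbitrary choices of mine:

from optuna import create_study

# Toy objective: find the x in [-10, 10] that minimizes (x - 2)**2
def toy_objective(trial):
    x = trial.suggest_float('x', -10, 10)  # the trial hands us a concrete value to try
    return (x - 2) ** 2

toy_study = create_study(direction='minimize')
toy_study.optimize(toy_objective, n_trials=50)
print(toy_study.best_params)  # should land close to {'x': 2.0}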

The "Why?"

Now, notice that at no point in the previous section did I mention machine learning, AI, or any of the other latest buzzwords: function optimization is completely independent of any particular application; it's just finding the largest/smallest value(s) of a function given a set of constraints.

Finding maxima or minima is something we learn to do in high school: take the derivative, set it equal to zero, and look at the solutions. However, functions can be ill-behaved and not have their derivatives defined everywhere, or they can be intractably complex to calculate.
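For instance, for f(x) = (x - 3)² the recipe works perfectly: setting the derivative 2(x - 3) equal to zero immediately gives the minimum at x = 3. The trouble starts when there is no clean derivative to solve for.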

That's where Optuna (and other tools like it) comes in: it's a framework for finding maxima or minima of functions which are, for one reason or another, impractical to handle with analytic methods. This is especially useful when dealing with functions that are very expensive to evaluate (such as machine learning scores, wink wink).

Hands on

So, today I'd like to give a gentle introduction to how Optuna works by using an example from machine learning, but in a pretty unconventional way: we'll find the parameters of a logistic regression semi-manually and compare them with the ones sklearn's implementation found to convince ourselves that it's actually a good result.

First, we'll need to import all the stuff we are going to need:

import numpy as np
from matplotlib import pyplot as plt

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from optuna import create_study, Trial
from optuna.samplers import TPESampler

Next, let's generate and visualize some data:

X, y = make_classification(
    n_features=2,
    n_redundant=0,
    n_informative=2,
    random_state=3,
    n_clusters_per_class=1,
    class_sep=1,
    n_samples=1000
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

plt.scatter(X_train[:,0], X_train[:,1], c=y_train)
Distribution of the training data, with the class represented as the color

Now let's write some functions to help us build the logistic regression. The particular approach we're targeting is maximizing the likelihood with respect to the model's coefficients (the betas). Therefore, the first function is the sigmoid (that is, the logistic function itself), and the second one is the negative log-likelihood (minimizing it is equivalent to maximizing the likelihood, but computationally easier to deal with):

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neg_log_likelihood(X, y, theta):
    m = len(y)
    y_hat = sigmoid(X.dot(theta))  # predicted probabilities
    J = (-1/m) * np.sum(y*np.log(y_hat) + (1-y)*np.log(1-y_hat))  # average negative log-likelihood
    return J

So far we have a way to calculate predictions (by evaluating the sigmoid function) and to calculate the cost of any given set of predictions (with the neg_log_likelihood function).

What about Optuna? Nothing we have done up to now is Optuna-specific. Well, young grasshopper, here comes the crux of the article:

def objective(trial: Trial, X: np.ndarray, y: np.ndarray) -> float:
    X_ = np.hstack((np.ones((X.shape[0], 1)), X))
    beta_1 = trial.suggest_float('beta_1', -10, 10)
    beta_2 = trial.suggest_float('beta_2', -10, 10)
    intercept = trial.suggest_float('intercept', -10, 10)

    B = np.array([intercept, beta_1, beta_2])

    loss = neg_log_likelihood(X_, y, B)

    return loss

That's quite a handful, so let's look at the lines one by one:

  • In case you didn't know, Python allows type annotations for variables and function return values; they are ignored by the interpreter itself, but they help readers understand the code and let IDEs improve their autocomplete suggestions. So the function takes a trial of type optuna.Trial, as well as X and y, which are both numpy arrays, and it returns a float. So far so easy, right?
  • The np.hstack (horizontal stack) function just sticks two arrays together side by side (similar to how pd.concat works by default). In this case, we're gluing a column of ones to our original array; this column is what allows the model to have an intercept term.
  • The next three lines are basically the same, so let's look at their structure: parameter_name = trial.suggest_float('name', min, max). This is a variable assignment like any other, where an object is bound to a name, no mysteries there. But what is the trial doing? How will this work? Ah, that's where Optuna's genius comes in: each time a new trial is created, it draws new parameter values from their corresponding distributions, so the trial.suggest_float method will just return a number between -10 and 10. That's it, nothing more. In this way, the trial now has a concrete value for each of the betas, including the intercept.
  • Then, we stick these estimates into an array for convenience.
  • Finally, we call the neg_log_likelihood function with the appropriate arguments and return its value (see the small sanity-check sketch right after this list).
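Before handing the objective over to a study, we can sanity-check it with hand-picked values. Optuna's FixedTrial stands in for a real trial and simply returns whatever values we give it; the betas below are arbitrary numbers chosen purely for illustration:

from optuna.trial import FixedTrial

# FixedTrial makes every suggest_* call return the value hard-coded here
fixed = FixedTrial({'beta_1': 1.0, 'beta_2': -1.0, 'intercept': 0.0})
print(objective(fixed, X_train, y_train))  # evaluates the loss once, no optimization involved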

It wasn't that hard, was it? Alright, what now? It's study time!

logistic_study = create_study(
    study_name='logistic_study',
    sampler=TPESampler(seed=42),
    direction='minimize'
)

The create_study function just creates a study with the appropriate parameters. If you don't specify a sampler explicitly, the TPESampler will be used anyway, but I prefer to set its seed to ensure reproducibility.
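As an aside, swapping the sampler is just a matter of passing a different object to create_study. For example, a purely random search (shown only for comparison's sake; the study name here is something I made up) would look like this:

from optuna.samplers import RandomSampler

# Same setup as above, but drawing parameters uniformly at random instead of using TPE
random_study = create_study(
    study_name='random_logistic_study',
    sampler=RandomSampler(seed=42),
    direction='minimize'
)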

Now for the fun part: let's find those parameters.

logistic_study.optimize(
    lambda trial: objective(trial, X_train, y_train),
    n_trials=500,
    n_jobs=-1
)

Notice that the optimize method is a higher-order function (meaning its argument is itself a function), and the passed function has only one free argument: the trial, which is where Optuna works its magic. At each step (or "trial"), the study calls the objective function with a new Trial object, which will, in turn, ask the sampler for a value within its bounds. Finally, the function is evaluated and its value returned to the study, which then repeats the process.
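By the way, the lambda is just one way of binding the data arguments; the same call can be written with functools.partial (mentioned earlier). This is shown for illustration only, there's no need to run both:

from functools import partial

# Equivalent to the lambda above: X and y are fixed, only the trial remains free
logistic_study.optimize(
    partial(objective, X=X_train, y=y_train),
    n_trials=500,
    n_jobs=-1
)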

The inner workings of the sampler are outside the scope of this article; for now, feel free to read the documentation.

"Why so many trials?" you may ask. Well, it's my article and I like being thorough, though you can probably get comparable results with much less.

Let's see what we got by asking the study:

logistic_study.best_params
{'beta_1': 3.4065211923063017,
'beta_2': -7.9512792095013705,
'intercept': -5.127111119254792}
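Besides best_params, the study also keeps track of the best objective value it found (the exact number will, of course, depend on your run):

print(logistic_study.best_value)  # lowest negative log-likelihood seen across the 500 trials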

Ok… So what now?

Let's fit sklearn's implementation and look at the coefficients it found:

model = LogisticRegression().fit(X_train, y_train)
np.hstack((model.coef_.ravel(), model.intercept_))
array([ 1.88798021, -5.25490865, -2.8369587 ])

Hm. Comparing coefficients doesn't make our model look great. Alright, what about performance on the test set?

sklearn_likelihood = neg_log_likelihood(
    np.hstack((np.ones((X_test.shape[0], 1)), X_test)),
    y_test,
    np.hstack((model.intercept_, model.coef_.ravel()))  # intercept first, matching the column of ones
)

optuna_likelihood = neg_log_likelihood(
    np.hstack((np.ones((X_test.shape[0], 1)), X_test)),
    y_test,
    np.array([
        logistic_study.best_params['intercept'],
        logistic_study.best_params['beta_1'],
        logistic_study.best_params['beta_2']
    ])  # same order as inside the objective
)

sklearn_likelihood, optuna_likelihood
0.9117186368407011, 0.957710928177615

It's not fantastic, but also not terrible, so let's take the win.
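One caveat when comparing the coefficients directly: sklearn's LogisticRegression applies L2 regularization by default, whereas our objective maximizes the raw likelihood, so some discrepancy is expected. If you want a closer apples-to-apples comparison, you could fit an unpenalized model, assuming a reasonably recent scikit-learn version where penalty=None is accepted:

# Unregularized fit, closer to what the Optuna objective is actually optimizing
unpenalized = LogisticRegression(penalty=None).fit(X_train, y_train)
print(np.hstack((unpenalized.intercept_, unpenalized.coef_.ravel())))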

Now we'll create a new logistic regression object and plug the values Optuna found into it:

model_ = LogisticRegression()

# Setting coef_, intercept_ and classes_ by hand is enough for predict and predict_proba to work
model_.coef_ = np.array([[
    logistic_study.best_params['beta_1'],
    logistic_study.best_params['beta_2']
]])

model_.intercept_ = np.array([
    logistic_study.best_params['intercept']
])

model_.classes_ = np.array([0, 1])

Finally, let's build some plots with the real values (ground truth), sklearn's predictions, and Optuna's predictions:

fig, ax = plt.subplots(1, 3, figsize=(25,10))
ax[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train)
ax[1].scatter(X_train[:, 0], X_train[:, 1], c=model.predict_proba(X_train)[:, 1])
ax[2].scatter(X_train[:, 0], X_train[:, 1], c=model_.predict_proba(X_train)[:, 1])
ax[0].set_title('Ground Truth')
ax[1].set_title('Sklearn LR')
ax[2].set_title('Optuna LR')
plt.show()
Ground truth and predictions for the estimated coefficients

Both decision frontiers seem to be at least approximately well placed. If we compare the two sets of learned coefficients, the Optuna-determined ones move the frontier ever so slightly upward. Looking at some sklearn metrics, we can see that the models are pretty similar:

print(accuracy_score(y_test, model.predict(X_test)))
print(accuracy_score(y_test, model_.predict(X_test)))
print(roc_auc_score(y_test, model.predict_proba(X_test)[:,1]))
print(roc_auc_score(y_test, model_.predict_proba(X_test)[:, 1]))
0.968
0.964
0.9912954429083461
0.9908474142345111

The differences are on the order of 1%. Not bad at all!

The objective of this article was twofold:

  • First, I wanted to show the basic usage of Optuna and how it works.
  • Second, I wanted to show that, unlike things like GridSearchCV, Optuna's API is flexible enough for us to optimize just about anything we can write as a Python function.

The first point is more of a bonus, if you ask me. The library's flexibility is what makes it so powerful and useful in so many areas, especially where machine learning is concerned. Hopefully, I'll be able to write more about this last point.

One final thing before I go, as I feel the need to make a big disclaimer: DO NOT FIND LOGISTIC REGRESSION COEFFICIENTS THIS WAY. The example was done for illustrative purposes and nothing more; I neither condone nor recommend finding model parameters like this, especially when we have efficient implementations written by people who are much smarter than us (such as the sklearn development team).
