plot_rank: A New Visualization Tool in Optuna
--
We are excited to introduce a new feature in Optuna v3.2: `optuna.visualization.plot_rank`. This visualization function provides valuable insights into complex, high-dimensional objective function landscapes.
A `plot_rank` graph uses its axes to represent different parameters, with individual points indicating separate trials. The points are color-coded by the rank of the objective value, which allows for quick identification of higher-performing parameters and offers valuable insight into parameter correlations.
A Simple Example
To better understand this feature, let’s look at the well-known `breast_cancer` dataset. Derived from digitized images of breast masses, this dataset consists of 30 numeric attributes describing the cell nuclei in the images. The task is to predict whether a tumor is “malignant” or “benign” based on these attributes.
We’ll use `RandomForestClassifier` from `sklearn` as our model for this case, aiming to optimize the hyperparameters that most strongly influence the model’s performance:
- `max_depth` (`Mdpth`): The maximum depth of each tree.
- `min_samples_split` (`mspl`): The minimum number of samples needed to split an internal node.
- `min_samples_leaf` (`mlfs`): The minimum number of samples required at a leaf node, affecting the model’s smoothing.
- `min_weight_fraction_leaf` (`mwfr`): The minimum weighted fraction of the total sum of sample weights required at a leaf node.
- `max_features` (`Mfts`): The number of features considered when searching for the best split.
- `max_leaf_nodes` (`Mnods`): Controls tree growth in a best-first manner, where “best” nodes are those yielding the largest decrease in impurity.
- `min_impurity_decrease` (`mid`): A node is split only if the split decreases the impurity by at least this value.
Here’s a Python code snippet detailing the entire process, right from loading the dataset to setting up hyperparameter optimization:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import optuna

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=100,
        criterion="gini",
        max_depth=trial.suggest_int("Mdpth", 2, 32, log=True),
        min_samples_split=trial.suggest_int("mspl", 2, 32, log=True),
        min_samples_leaf=trial.suggest_int("mlfs", 1, 32, log=True),
        min_weight_fraction_leaf=trial.suggest_float("mwfr", 0.0, 0.5),
        max_features=trial.suggest_int("Mfts", 1, 15),
        max_leaf_nodes=trial.suggest_int("Mnods", 4, 100, log=True),
        min_impurity_decrease=trial.suggest_float("mid", 0.0, 0.5),
    )
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Optimize
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=140)
```
We’ll now shift our focus to visualizing which parameters tend to give better outcomes. Given the number of parameters, we’ll first compute their importances and then plot the top four.
```python
# Get parameters sorted by their importance values
importances = optuna.importance.get_param_importances(study)
params_sorted = list(importances.keys())

# Plot the four most important parameters
fig = optuna.visualization.plot_rank(study, params=params_sorted[:4])
fig.show()
```
In this plot, each trial is represented as a unique point, with the color corresponding to its ranking. You can read the actual objective value of each point from the color bar on the right.
The chart offers some useful insights:

- There seems to be no strong correlation between the parameters, suggesting that optimizing each parameter independently might also lead to good results.
- Performance tends to drop significantly when `mid` is large, so it is beneficial to keep `mid` small.
- The objective value seems relatively good when `mwfr`, `Mfts`, and `Mnods` fall in the ranges 0–0.2, 1–4, and 10–30, respectively.
- High values of `Mnods` seem to be underexplored, but that parameter doesn’t look as important as the others.
Drawing on these observations, you might fix `mid` at 0 and optimize the other three parameters within narrower ranges in the next optimization cycle. Also, if there appears to be a correlation between parameters, enabling the `multivariate` option of the default `TPESampler` can often speed up the discovery of good parameters. Finally, if you want to explore the search space more thoroughly, you can increase the number of trials.
Comparison with plot_contour
The new `plot_rank` feature shares similarities with `optuna.visualization.plot_contour`, which was already available in Optuna v3.1 and earlier. However, compared to `plot_contour`, `plot_rank` offers better readability.
An example plot created by `plot_contour` looks something like this:
Unfortunately, this plot is not as easy to read or as insightful. `plot_rank` introduces several improvements over `plot_contour`:
- In `plot_contour`, objective values are color-mapped linearly, making it hard to distinguish good parameters when the study contains a few extremely bad trials. `plot_rank`, as the name suggests, assigns colors based on the rank of the objective values, so most trials receive distinct colors.
- `plot_contour` draws contours by interpolating objective values within the plane, but these contours don’t represent the model used for optimization and lack a theoretical basis. Noisy objective values can also clutter the plot with contours. `plot_rank` eliminates contours altogether, coloring points by rank for an easier read of trends and noise levels.
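The difference between the two color mappings can be sketched with plain NumPy. This is an illustration of the idea, not Optuna’s internal implementation; the values are hypothetical:

```python
import numpy as np

# Objective values from ten hypothetical trials; one is an extreme outlier.
values = np.array([0.91, 0.93, 0.95, 0.92, 0.94, 0.96, 0.90, 0.97, 0.98, 10.0])

# Linear color mapping (as in plot_contour): normalize raw values to [0, 1].
linear = (values - values.min()) / (values.max() - values.min())

# Rank-based mapping (as in plot_rank): normalize the ranks instead.
ranks = values.argsort().argsort()
rank_based = ranks / (len(values) - 1)

# Under the linear mapping, the nine similar trials are squashed into a
# tiny slice of the colormap; under the rank mapping they spread across it.
print(linear[:9].max() - linear[:9].min())
print(rank_based[:9].max() - rank_based[:9].min())
```

With one outlier nearly ten units away, the nine ordinary trials occupy under 1% of the linear color range but almost 90% of the rank-based one, which is exactly why ranking makes good regions visually distinguishable.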
Potential Pitfalls
Although `plot_rank` offers a handy visualization for understanding the general shape of the objective function, some information is inevitably lost when projecting into two dimensions, which can create false impressions. (This pitfall also applies to `plot_slice` and `plot_contour`.)
Consider this hypothetical example to illustrate potential pitfalls.
This figure might give the impression that all parameters are sampled uniformly. Moreover, looking at the graphs in the `z` column (or row), it may appear that the objective function is largest when `z` is small. Viewed in three dimensions, however, the samples are actually distributed as follows:
Even though the projection onto two dimensions appears to show uniform sampling, the actual samples in the high-dimensional space may cover only part of the search space. In such cases, the unsampled region (the corner of the 3D plot with high `x`, `y`, and `z` values in the example above) might contain the best objective values. This risk of misinterpretation is inherent in visualizing high-dimensional spaces in lower dimensions, so careful interpretation is necessary.
Wrapping Up
In this post, we introduced `optuna.visualization.plot_rank`, an enhanced visualization feature available in Optuna v3.2. It projects trials in high-dimensional search spaces onto two-dimensional scatter plots and, compared to the `plot_contour` function, provides improved clarity.
While projecting complex high-dimensional landscapes into two dimensions inevitably loses information, `plot_rank` is very helpful in practical tasks like the scikit-learn example above. It provides valuable insights into the shape of the objective function and correlations between parameters. We hope this new visualization helps you understand the complex landscapes of your objective functions in hyperparameter optimization.