Photo by Eric Ward on Unsplash

Visualizing Hyperparameters in Optuna

Crissman Loomis
Jan 12 · 5 min read

Once you understand what hyperparameters (HPs) are, you may start noticing HPs throughout your code that you hadn't noticed before. In a typical deep learning model, in addition to the frequently discussed choices of optimizer (SGD, Momentum, Adam, etc.), learning rate, and the optimizer's own variables, there are also the number of layers, the activation functions between the layers, the number of nodes in each layer, and so on. Not all of these HPs have a major effect on the algorithm's output, but they can matter in other ways, such as the storage space required or the computation time. Let's look at how Optuna visualizations can clarify how HPs affect algorithm performance.

The Many Faces of Fashion-MNIST

Ok, so what’s the difference between a pull-over and a coat again?

Let's use a simple fully-connected two-layer neural network to classify clothing in the Fashion-MNIST (FM) dataset. For a sampling of possible HPs to tune, we'll start with the common examples (optimizer, learning rate, nodes per layer) and then add some less frequently tuned HPs: the activation function for each layer (ReLU, tanh), the dropout percentage, and the batch size. The PyTorch code used to run Optuna and the Jupyter notebook used to produce the images are available on GitHub. Note that the PyTorch code is functional, not pretty. If you're interested in learning how to use Optuna with PyTorch code, please take a look at the blog post Using Optuna to Optimize PyTorch Hyperparameters.

We'll use Optuna's default sampler, the Tree-structured Parzen Estimator (TPE), which is a good starting point for a wide range of problems, and run 500 trials of twenty epochs each to look for the best HPs.

What’s Important?

Yup, learning rate is a big deal!

Consistent with popular wisdom about neural nets, the learning rate (lr) has the most important effect on overall algorithm performance, followed at a distance by the choice of optimizer and the dropout rates.

Progress Over Time

Siri, please plot blue snow on a red box, thanks.

On the plot, the Objective Value on the y-axis is the accuracy of the model. Looking at the Best Value line, 500 trials was probably far more than necessary for the relatively simple FM HPs. FM is a fairly simple problem compared to modern deep learning models, and accordingly, the nearly right-angled Best Value line shows that Optuna found good HP values in well under twenty trials and was barely able to improve on those results while continuing to tune out to 500 trials.

Slicing It Up

For your eye test, please tell me which HPs were the best…

Since the darkness of a dot corresponds to the number of the trial that produced it, the darker areas correspond to later trials. Let's plot just the number of nodes in the second layer to see them better, with optuna.visualization.plot_slice(study, ['n_units_l1']):

Blue dots be rushing for the exit on the right!

Although we know from the HP importance table that the number of units didn't have a major effect on accuracy, the Slice Plot shows two things worth noting.

First, over almost the entire range, say 30–128 nodes in the second layer, the accuracy was about the same. If accuracy matters less than model size, this implies it might be possible to reduce the number of nodes to shrink the number of parameters (not HPs) that need to be trained.

Second, the best values, visible as the darker dots from the later trials in which Optuna was homing in on them, are crowded to the right of the possible range. This suggests that for the number of nodes, more is better, and if even a slight improvement in accuracy is critical, it might be a good idea to expand the allowed range of nodes upward.

The Shape of Things

Let's take a look at the interaction between the learning rate and the optimizer, using a Contour Plot with optuna.visualization.plot_contour(study, ['lr', 'optimizer']):

Searching for the perfect learning rate is a Blue Ocean pursuit.

The Contour Plot is again a three-dimensional kind of plot, this time with the third dimension of color representing the Objective Value, in this case the accuracy. Judging from the clusters for the various optimizers, RMSProp favored smaller learning rates around 0.0002, SGD about 0.0005, and Adam larger still at about 0.001. Optuna found Adam to be the best-performing optimizer, so there are more data points for Adam.

Now that we've confirmed that the learning rate interacts with the optimizer, we could create a separate learning-rate HP for each optimizer, such as sgd_lr, rmsprop_lr, etc., or use the new optuna.samplers.PartialFixedSampler to fix the optimizer to Adam and search only for the learning rate of that best optimizer, like this:

I hope this gives some idea of the visualization options available in Optuna. For more information, please take a look at the Optuna website or the Optuna GitHub!

