Once you understand what hyperparameters (HPs) are, you may start noticing HPs in your code that you had previously overlooked. In a typical deep learning model, beyond the frequently debated choices of optimizer (SGD, Momentum, Adam, etc.), learning rate, and the optimizer's own variables, there are also the number of layers, the activation functions between the layers, the number of nodes in each layer, and so on. Not all of these HPs have a major effect on the output of the algorithm, but they may matter in other ways, such as the amount of storage required or the computation time. Let's look at how Optuna visualizations can clarify how HPs affect algorithm performance.
The Many Faces of Fashion-MNIST
Let's use a simple fully connected two-layer neural network that identifies clothing items in the Fashion-MNIST (FM) dataset. For a sampling of possible HPs to tune, we'll start with the common examples, such as the optimizer, learning rate, and nodes per layer, and then add some less commonly tuned HPs: the activation function per layer (ReLU, tanh), the dropout percentage, and the batch size. The PyTorch code used to run Optuna and the Jupyter notebook used to look at the images are available on GitHub. Note that the PyTorch code is functional, not pretty. If you're interested in learning how to use Optuna with PyTorch code, please take a look at the blog post Using Optuna to Optimize PyTorch Hyperparameters.
We'll use Optuna's default sampler, the Tree-structured Parzen Estimator (TPE), which is a good starting point for a wide range of problems, and run 500 trials of twenty epochs each to look for the best HPs.
When tuning HPs, one of the first questions is: which HPs affect performance the most? Optuna provides an HP importance function for this purpose, which can be displayed with
Consistent with popular wisdom about neural nets, the learning rate (lr) has the greatest effect on overall algorithm performance, followed by much smaller impacts from the choice of optimizer and the dropout rates.
Progress Over Time
Next, let’s check how the optimization has progressed, with an Optimization History Plot from
On the plot, the Objective Value on the y-axis is the accuracy of the model. Looking at the Best Value line, we can see that 500 trials was probably far more than necessary for the relatively simple FM HPs. FM is a fairly simple problem compared to modern deep learning models, and accordingly the nearly right angle of the Best Value line shows that Optuna found good HP values in well under twenty trials and was barely able to improve on those results as it continued tuning out to 500 trials.
Slicing It Up
A good next plot for understanding how the HPs matter is the Slice Plot, which gives quasi-three-dimensional charts of each tuned HP against the objective value it returned, with the third dimension, the number of the trial that produced each result, shown as the darkening color of the points. All the HPs can be displayed with
Since the darkness of the dots corresponds to the number of the trial that produced them, darker areas correspond to later trials. Let's plot just the number of nodes in the second layer to see those points better with
Although we know from the HP importance results that the number of units didn't have a major effect on accuracy, the Slice Plot reveals two things worth noting.
The first is that over almost the entire range, say 30–128 nodes in the second layer, the accuracy was about the same. If accuracy is less important than the size of the model, this implies it might be possible to reduce the number of nodes to shrink the number of parameters (not HPs) that need to be trained.
The second is that the best values, visible as the darker dots from the later trials in which Optuna was homing in on the best region, are crowded toward the right of the possible range. This suggests that for the number of nodes, more is better, and if even a slight improvement in accuracy is critical, it might be a good idea to raise the upper limit on the number of nodes.
The Shape of Things
You might have noticed that up above I said that "most" of the HPs are uncorrelated. Unfortunately, the two HPs that might be correlated include the HP with the greatest impact on performance: the learning rate and the optimizer. Does searching for a single learning rate while testing different optimizers at the same time actually work?
Let’s take a look at their interaction, using a Contour Plot with
optuna.visualization.plot_contour(study, ['lr', 'optimizer'])
The Contour Plot is again a three-dimensional kind of plot, with the third dimension of color this time representing the Objective Value, in this case the accuracy. Judging from the clusters for the various optimizers, RMSProp favored smaller learning rates around 0.0002, SGD around 0.0005, and Adam larger still at about 0.001. Optuna found Adam to be the best-performing optimizer, so there are more data points for Adam.
Now that we've confirmed that the learning rate interacts with the optimizer, we could create a separate learning-rate HP for each optimizer, such as rmsprop_lr, etc., or use the new optuna.samplers.PartialFixedSampler to fix the optimizer to Adam and search only for the learning rate for that best optimizer, like this: