How to Tune Hyper-Parameters in Deep Learning

DECLARATION: Most of these notes are from the chapter "Improving the way neural networks learn" in Michael Nielsen's book Neural Networks and Deep Learning.

NeilZ
Dec 9, 2017

The goal should be to develop a workflow that enables you to quickly do a pretty good job on the optimization, while leaving you the flexibility to try more detailed optimizations, if that’s important.

Each heuristic is not just a (potential) explanation, it’s also a challenge to investigate and understand in more detail.

General Guidance

  1. If your training and test data come from mismatched distributions, make sure your validation and test data at least come from the same distribution.
  2. Whether a model has high bias (and/or high variance) is a relative judgment: it depends on the error a human or a baseline model can achieve on the same task.
  3. The bias-variance trade-off may not apply to deep learning in the classical way: a larger network can reduce bias without introducing much variance, and more data can reduce variance without hurting bias much.

First Objective

When using neural networks to attack a new problem the first challenge is to get any non-trivial learning, i.e., for the network to achieve results better than chance. This can be surprisingly difficult, especially when confronting a new class of problem.

  • Strip the original problem down to a simpler one, e.g. instead of classifying all 10 digits, start by classifying only 0s and 1s (see the sketch after this list)
  • Strip your network down to the simplest architecture that can do meaningful learning
  • Monitor validation accuracy more frequently, e.g. after every 1,000 training images rather than once per epoch
  • Reduce the size of the validation set. All that matters is that the network sees enough images to do real learning and to give a pretty good rough estimate of performance.
  • Then continue, adjusting each hyper-parameter individually and gradually improving performance.
    For example, first find a good value for the learning rate η, then move on to a good value for the regularization parameter (weight decay) λ. Then experiment with a more complex architecture, say a network with 10 hidden neurons, and repeat the exercise.
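
A minimal sketch of the first two points, using scikit-learn's small digits dataset as a stand-in for MNIST and logistic regression as the "simplest network"; both choices are illustrative assumptions, not part of the original notes.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()                      # 1797 8x8 images, 10 classes
mask = digits.target < 2                    # strip the problem down: keep only 0s and 1s
X, y = digits.data[mask], digits.target[mask]

# A small validation set is enough for a rough estimate of performance.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Simplest model that can do meaningful learning on the stripped-down problem.
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```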

Tuning Learning Rate η

  • The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.
  • Plot the cost on the training data (or the validation accuracy) versus epoch, with the learning rate as the control parameter, i.e. one curve per learning rate
  • First, estimate the threshold value of the learning rate at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing (you may need to run more epochs for each learning rate); see the sketch after this list.
    You can estimate the order of magnitude by starting with η=0.01. If the cost decreases during the first few epochs, successively try η=0.1, 1.0, … until you find a value of η for which the cost oscillates or increases during the first few epochs.
  • Obviously, the actual value of η that you use should be no larger than the threshold value. In fact, if the value of η is to remain usable over many epochs then you likely want to use a value for η that is smaller, say, a factor of two below the threshold. Such a choice will typically allow you to train for many epochs, without causing too much of a slowdown in learning.
  • Try a dynamic learning rate: use a larger learning rate at the beginning of training and lower it in later stages. (Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber, 2010)
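
A sketch of the threshold search described above, assuming a plain numpy logistic regression trained by full-batch gradient descent on toy data; the model, the data, and the crude "did the cost drop in the first few epochs" check are placeholders for whatever network you are actually tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                      # toy data standing in for your real problem
y = (X @ rng.normal(size=20) > 0).astype(float)

def cost_curve(eta, epochs=10):
    """Full-batch gradient descent on logistic regression; returns the cost per epoch."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))   # predictions
        costs.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
        w -= eta * X.T @ (p - y) / len(y)                    # gradient step
    return costs

# Increase eta by factors of 10 until the cost stops decreasing during the first few epochs.
for eta in [0.01, 0.1, 1.0, 10.0, 100.0]:
    costs = cost_curve(eta)
    decreasing = costs[2] < costs[0]                         # crude check on the first few epochs
    print(f"eta={eta:<7} first costs={np.round(costs[:3], 3)} decreasing={decreasing}")
```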

Tuning Regularization Parameter (Weight Decay) λ

  • Start with no regularization (λ=0.0) and determine a value for the learning rate η, as above. Using that choice of η, use the validation data to select a good value for λ. Start by trialing λ=1.0, then increase or decrease it by factors of 10 as needed to improve performance on the validation data (see the sketch below).
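
A sketch of that search using scikit-learn's LogisticRegression on the digits dataset, where the L2 weight-decay strength maps onto the inverse parameter C = 1/λ; the particular model and dataset are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Start at lambda = 1.0 and move up or down by factors of 10,
# keeping whichever value does best on the validation data.
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=500)   # C is the inverse of the L2 penalty
    clf.fit(X_train, y_train)
    print(f"lambda={lam:<6} validation accuracy={clf.score(X_val, y_val):.3f}")
```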

Tuning Mini-Batch Size

  • Using a larger mini-batch can be faster than updating on one sample at a time (online learning), but a smaller batch size lets you update the weights more frequently.
  • Since the maximum batch size is limited by memory, use acceptable (but not necessarily optimal) values for the other hyper-parameters, then trial a number of different mini-batch sizes (remember to scale the learning rate accordingly). Plot the validation accuracy versus time (real elapsed time, not epochs!) and choose whichever mini-batch size gives the most rapid improvement in performance; see the sketch below.
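
A sketch of that comparison using scikit-learn's SGDClassifier trained with partial_fit on mini-batches of the digits data; the linear scaling of the learning rate with batch size is one common heuristic, assumed here for illustration.

```python
import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
classes = np.unique(y_train)

base_eta, base_batch = 0.001, 10
for batch_size in [1, 10, 50, 200]:
    eta = base_eta * batch_size / base_batch       # assumed heuristic: scale eta with batch size
    clf = SGDClassifier(learning_rate="constant", eta0=eta, random_state=0)
    start = time.perf_counter()
    for _ in range(5):                             # a few passes over the training data
        for i in range(0, len(X_train), batch_size):
            clf.partial_fit(X_train[i:i + batch_size], y_train[i:i + batch_size], classes=classes)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:<4} time={elapsed:.2f}s val acc={clf.score(X_val, y_val):.3f}")
```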

Tuning Training Epochs

  • Use early stopping to determine the number of training epochs, so there is no need to tune it directly. Note that you may not want to stop the moment accuracy dips; instead, wait until accuracy has not improved for quite some time, so you can move past accuracy plateaus (see the sketch below).
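
A sketch of early stopping with a "patience" window, again using SGDClassifier trained one pass at a time; the patience of 10 epochs and the cap of 200 epochs are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
classes = np.unique(y_train)

clf = SGDClassifier(random_state=0)
best_acc, best_epoch, patience = 0.0, 0, 10               # patience: epochs to wait past the best result
for epoch in range(1, 201):
    clf.partial_fit(X_train, y_train, classes=classes)    # one pass over the data = one "epoch" here
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch
    elif epoch - best_epoch >= patience:                  # no improvement for `patience` epochs: stop
        break
print(f"stopped after epoch {epoch}; best validation accuracy {best_acc:.3f} at epoch {best_epoch}")
```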

Other Parameters/Variables to Tune

  • Use different cost functions
  • Try different regularization methods: L1, L2 etc.
  • Initialize weights differently
  • Apply various stochastic gradient descent methods
    Momentum-based gradient descent or Adam (see the momentum sketch after this list)
  • Try the Hessian technique (second-order optimization) instead of using the gradient alone
  • Try different activation functions
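
As one example from that list, momentum-based gradient descent keeps a "velocity" for each weight that accumulates past gradients. Below is a minimal numpy sketch of the update rule; the badly conditioned quadratic objective is just a toy for illustration.

```python
import numpy as np

# Toy objective: f(w) = 0.5 * w^T A w, whose gradient is A @ w; minimum at (0, 0).
A = np.diag([1.0, 25.0])                 # deliberately badly conditioned
w = np.array([1.0, 1.0])
v = np.zeros_like(w)                     # the "velocity" term
eta, mu = 0.03, 0.9                      # learning rate and momentum coefficient

for step in range(100):
    grad = A @ w
    v = mu * v - eta * grad              # velocity accumulates a decaying sum of past gradients
    w = w + v                            # the weights move along the velocity
print("final w:", w)                     # should end up close to the minimum at (0, 0)
```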

Strategies to Tune Hyper-Parameters

Try random values of the hyper-parameters, don't use a grid

The reason is that hyper-parameters differ in importance: a grid tries only a few distinct values of the most important parameter, while the same number of random trials explores many more. (Deep Learning Specialization, Andrew Ng)
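
A sketch contrasting the two, with the learning rate and λ as the hyper-parameters and deliberately arbitrary ranges: nine grid trials contain only three distinct learning rates, while nine random trials contain nine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid: 3 values per hyper-parameter -> 9 trials, but only 3 distinct learning rates.
etas = [0.001, 0.01, 0.1]
lambdas = [0.1, 1.0, 10.0]
grid_trials = [(eta, lam) for eta in etas for lam in lambdas]

# Random: 9 trials, and 9 distinct values of each hyper-parameter.
random_trials = [(10 ** rng.uniform(-3, -1), 10 ** rng.uniform(-1, 1)) for _ in range(9)]

print("distinct learning rates, grid:  ", len({eta for eta, _ in grid_trials}))
print("distinct learning rates, random:", len({eta for eta, _ in random_trials}))
```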

Use a coarse-to-fine search process: first sample coarsely over a wide range, then narrow the range around the best-performing values and sample more densely.

Use an appropriate scale to pick hyper-parameters

Example: to search for a learning rate between 0.0001 and 1, if you select values uniformly at random, about 90% of them will fall between 0.1 and 1, so it is better to sample on a log scale (see the sketch below). This matters most when the network is sensitive to the hyper-parameter's value within a narrow part of its range.
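
A small sketch of the difference, comparing uniform and log-scale sampling of a learning rate between 10^-4 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform sampling: roughly 90% of the draws land between 0.1 and 1.
uniform = rng.uniform(1e-4, 1.0, size=10_000)
print("uniform draws >= 0.1:  ", np.mean(uniform >= 0.1))    # ~0.90

# Log-scale sampling: each decade [1e-4, 1e-3), ..., [0.1, 1) gets about 25% of the draws.
log_scale = 10 ** rng.uniform(-4, 0, size=10_000)
print("log-scale draws >= 0.1:", np.mean(log_scale >= 0.1))  # ~0.25
```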

Re-test hyper-parameters occasionally: settings that worked well can go stale as the data, the code, or the other hyper-parameters change.

Reference:

  • Michael Nielsen, Neural Networks and Deep Learning, chapter "Improving the way neural networks learn".
  • Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber, "Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition", 2010.
  • Andrew Ng, Deep Learning Specialization.