Optimization in Deep Learning

Amin Ag
AI³ | Theory, Practice, Business
Sep 15, 2019

SGD with Momentum & Adam optimizer

Our goal is to minimize the cost function by finding the optimal values for the weights. We also need to ensure that the algorithm generalizes well, so that it makes better predictions on data it has not seen before.

Optimizers update the weight parameters to minimize the loss function. The loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

A 3D surface plot demonstrating local and global minima

We need a way to navigate to the bottom of this “valley”. Gradient descent does this by repeatedly updating the weights in the direction that reduces the cost, iteration after iteration, until the cost reaches a minimum. One of the most popular variants of gradient descent is stochastic gradient descent (SGD).

SGD is the same as gradient descent, except that each update is computed on only a small subset of the training data rather than the full dataset. The size of this subset is called the mini-batch size. A minimal sketch of such a loop follows (not code from the original post; the linear model, mean-squared-error loss, and batch size of 32 are assumptions for illustration).
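
```python
import numpy as np

# Hypothetical data: 1,000 examples with 10 features, fit by a linear model.
X, y = np.random.randn(1000, 10), np.random.randn(1000)
w = np.zeros(10)
lr, batch_size = 0.01, 32  # learning rate and mini-batch size

for epoch in range(10):
    idx = np.random.permutation(len(X))           # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # only a slice of the data
        pred = X[batch] @ w
        grad = X[batch].T @ (pred - y[batch]) / len(batch)  # gradient of the MSE loss
        w -= lr * grad                            # SGD update on this mini-batch
```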

SGD with Momentum:

When nearing a minimum we want to converge into it slowly so we do not overshoot it, but if we use a small learning rate from the start, it might take too much time to reach the minimum. In fact, one paper reports that learning rates small enough to prevent bouncing around the ridges might lead the practitioner to believe that the loss isn’t improving at all, and abandon training altogether.

Momentum, or SGD with momentum, is a method that helps accelerate the gradient vectors in the right directions, leading to faster convergence.

Momentum accumulates the gradients of the past steps to determine the direction of the update. The equations of gradient descent are revised as follows:
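
The equations appeared as an image in the original post; in the standard formulation, with velocity v, momentum coefficient γ, and learning rate η, the revised update is:

```latex
v_t = \gamma\, v_{t-1} + \eta\, \nabla_{\theta} J(\theta_{t-1}), \qquad
\theta_t = \theta_{t-1} - v_t
```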

This helps us move more quickly towards the minima. Because the accumulated velocity cancels out gradient components that keep flipping sign, momentum is also referred to as a technique that dampens oscillations in our search.

In practice, the coefficient of momentum is initialized at 0.5, and gradually annealed to 0.9 over multiple epochs.
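
As an illustrative sketch (the placeholder model and the epoch at which momentum is raised are assumptions, not from the original post), this schedule can be expressed with torch.optim.SGD:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

for epoch in range(20):
    # ... usual forward / backward / optimizer.step() loop goes here ...
    if epoch == 5:  # anneal momentum from 0.5 to 0.9 partway through training
        for group in optimizer.param_groups:
            group["momentum"] = 0.9
```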

Adam Optimizer

The name Adam is derived from adaptive moment estimation. Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights iteratively, based on the training data.

Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper (poster) titled “Adam: A Method for Stochastic Optimization”.

Stochastic gradient descent maintains a single learning rate for all weight updates, and the learning rate does not change during training. Adam, in contrast, computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
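
Concretely, following the notation of the Adam paper, with gradient g_t, step size α, and decay rates β₁ and β₂, the moment estimates, bias corrections, and parameter update at step t are:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
```

```latex
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```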

The hyperparameter beta_1 is generally kept around 0.9, beta_2 around 0.999, and epsilon is typically set to 1e-8, the defaults recommended in the Adam paper.
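
As a minimal usage sketch (the placeholder model is an assumption for illustration), these defaults map directly onto torch.optim.Adam:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # alpha, the step size
    betas=(0.9, 0.999),  # beta_1 and beta_2
    eps=1e-8,            # epsilon, for numerical stability
)
```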

References and further reading:

Kingma, D. P. and Ba, J., “Adam: A Method for Stochastic Optimization”, ICLR 2015

Adam Optimization Algorithm by Prof. Andrew Ng (YouTube video)

PyTorch example (torch.optim)
