Types of Optimizers in Deep Learning: From Gradient Descent to Adam

Thiyaneshwaran G
8 min read · Mar 29, 2022


This write-up gives you an understanding of the different optimizers we can choose from in deep learning. Let's dive in.

Let's first understand what a minimum and an optimizer are, and then deep dive into the types of optimizers.

For the sake of simplicity, let us assume our network has only two parameters. In practice, this number would be closer to a billion, but we'll stick to the two-parameter example throughout the post so as not to drive ourselves nuts while trying to visualize things.

The x and y axes represent the values of the two weights. The z axis represents the value of the loss function for a particular pair of weight values. Our goal is to find the weight values for which the loss is minimum. Such a point is called a minimum of the loss function.

We need to find a way to somehow navigate to the bottom of the "valley", to point B, where the loss function has its minimum. So how do we do that?

Optimizers :

Optimizers are algorithms or methods used to change the attributes of your neural network such as weights and learning rate in order to reduce the losses.

Lets now try to deep dive on the Optimizers.

Gradient Descent :

Gradient descent tries to find the lowest value of the cost function by iteratively updating the weights of the learning algorithm, ideally arriving at the parameter values corresponding to the global minimum.

How is it done ?

This is done by moving down the hill against the slope: where the slope is negative, the weight is increased, and where the slope is positive, the weight is reduced.

Backpropagation in neural networks also uses the gradient descent algorithm. Gradient descent is a first-order optimization algorithm, meaning it depends only on the first-order derivative of the loss function. It calculates which way the weights should be altered so that the function can reach a minimum.
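To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent on a toy two-parameter least-squares problem; the data, learning rate and iteration count are illustrative choices, not values taken from this article.

```python
import numpy as np

# Toy problem: recover the weights [3, -2] by minimising the mean squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # 100 samples, 2 parameters
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(2)                                    # initial weights
lr = 0.1                                           # learning rate (alpha)

for epoch in range(200):
    grad = (2 / len(X)) * X.T @ (X @ w - y)        # gradient over the WHOLE dataset
    w -= lr * grad                                 # step against the slope
print(w)                                           # approximately [3, -2]
```

Note that every single update touches the entire dataset, which is exactly why plain gradient descent needs a lot of memory and time on large datasets.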

Advantages :

Easy computation, easy to implement and easy to understand.

Disadvantages :

  1. May get trapped at a local minimum.
  2. Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too large, then it may take a very long time to converge to the minimum.
  3. Requires large memory to calculate gradient on the whole dataset.

Stochastic Gradient Descent

It's a variant of Gradient Descent. "Stochastic", in plain terms, means "random". Yes, you might have guessed it right: the randomness comes in while selecting the data point used at each step to calculate the derivatives. SGD randomly picks one data point from the whole dataset at each iteration, which reduces the computation enormously. It updates the model's parameters much more frequently: the parameters are altered after computing the loss on each training example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one pass over the dataset instead of once, as in Gradient Descent.

θ = θ − α⋅∇J(θ; x(i), y(i)), where (x(i), y(i)) is a single training example.

Because the model parameters are updated so frequently, they have high variance and the loss function fluctuates with varying intensity.
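Here is an illustrative sketch of the same toy problem solved with SGD: one randomly chosen training example per update, so a dataset of 1000 rows gives 1000 parameter updates per pass. The data and learning rate are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(2)
lr = 0.01

for epoch in range(5):
    for i in rng.permutation(len(X)):          # visit the examples in random order
        x_i, y_i = X[i], y[i]
        grad = 2 * x_i * (x_i @ w - y_i)       # gradient on a single example
        w -= lr * grad                         # theta = theta - alpha * grad J(theta; x(i), y(i))
print(w)                                       # close to [3, -2], with some SGD noise
```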

Advantages:

  1. Frequent updates of the model parameters, hence it converges in less time.
  2. Requires less memory, as the gradient is computed on only one example at a time.
  3. May find new minima (the noisy updates can help it escape shallow local minima).

Disadvantages:

  1. High variance in the model parameters.
  2. May overshoot even after reaching the global minimum.
  3. To get the same convergence as gradient descent, the learning rate needs to be reduced slowly.

Mini-Batch Gradient Descent

It is often considered the best among all the variations of gradient descent, as it improves on both SGD and standard gradient descent. It updates the model parameters after every batch: the dataset is divided into batches, and after every batch the parameters are updated.

θ = θ − α⋅∇J(θ; B(i)), where B(i) is a batch of training examples.

So, when we use mini-batch gradient descent, we update our parameters frequently and we can also use a vectorized implementation for faster computation.
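The sketch below shows the same idea with mini-batches (illustrative batch size of 32): the parameters are updated once per batch, and the gradient for each batch is computed in a single vectorized operation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(2)
lr, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(X))                          # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]                  # indices of batch B(i)
        grad = (2 / len(b)) * X[b].T @ (X[b] @ w - y[b])   # vectorized gradient on the batch
        w -= lr * grad                                     # theta = theta - alpha * grad J(theta; B(i))
print(w)                                                   # approximately [3, -2]
```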

Advantages:

  1. Frequently updates the model parameters and also has lower variance than SGD.
  2. Requires a medium amount of memory.

All types of Gradient Descent have some challenges:

  1. Choosing an optimum value of the learning rate. If the learning rate is too small, then gradient descent may take ages to converge.
  2. They use a constant learning rate for all the parameters, but there may be some parameters which we may not want to change at the same rate.
  3. May get trapped at local minima.

Adagrad

This is the Adaptive Gradient optimization algorithm, where the learning rate plays an important role in determining the updated parameter values. Unlike Stochastic Gradient Descent, this optimizer maintains a separate learning rate for each parameter and adapts it at every iteration (epoch), rather than using one fixed learning rate for all the parameters.

Thus it performs smaller updates(lower learning rates) for the weights corresponding to the high-frequency features and bigger updates(higher learning rates) for the weights corresponding to the low-frequency features, which in turn helps in better performance with higher accuracy. Adagrad is well-suited for dealing with sparse data.

So at each iteration, the term α_t is calculated first: it is the sum of the squared gradients seen so far for that parameter, so as the iterations go on and t increases, α_t keeps growing:

α_t = α_{t−1} + g_t², where g_t is the gradient (∂L/∂w) at iteration t.

The learning rate for each parameter is then recalculated at every time step:

η_t′ = η / √(α_t + ε), where η is the initial learning rate and ε is a small constant that avoids division by zero.

As the learning rate changes at every iteration, the formula for updating the weight changes as well:

w_t = w_{t−1} − η_t′ ⋅ g_t

However, there is a disadvantage: the learning rate can effectively vanish. After a lot of iterations the accumulated α_t becomes very large, making the learning rate very small and leaving almost no change between the new and the old weight. The learning rate keeps shrinking and eventually becomes so small that the algorithm is not able to acquire any further knowledge.
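Putting the three formulas together, here is a minimal Adagrad sketch on the same kind of toy problem (the data and the initial learning rate are illustrative assumptions). Note how each weight keeps its own ever-growing accumulator α_t, which is what eventually shrinks the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0])

w = np.zeros(2)
lr, eps = 0.5, 1e-8
alpha = np.zeros(2)                                # running sum of squared gradients, per weight

for t in range(200):
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    alpha += grad ** 2                             # alpha_t = alpha_{t-1} + g_t^2
    w -= lr * grad / (np.sqrt(alpha) + eps)        # per-parameter learning rate lr / sqrt(alpha_t)
print(w)                                           # approximately [3, -2]
```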

AdaDelta:

This is an extension of the Adaptive Gradient optimizer that takes care of its aggressive, monotonic reduction of the learning rate. Instead of accumulating all past squared gradients, the accumulator is defined as a decaying (exponentially weighted) average of past squared gradients, which stops the learning rate from shrinking to a vanishingly small value.

The formula for the new weight remains the same as in Adagrad. However, there are some changes in determining the learning rate at time step t for each iteration.

At each iteration, the weighted average of squared gradients is calculated first:

E[g²]_t = γ ⋅ E[g²]_{t−1} + (1−γ) ⋅ g_t²

The restricting term γ (typically 0.95) is what keeps the accumulator from growing without bound and the learning rate from vanishing.

After which the learning rate is calculated using the formula:

η_t′ = η / √(E[g²]_t + ε)

Thus, because of the restricting term, the weighted average grows at a slower rate, so the learning rate decreases slowly and the weights can keep moving toward the global minimum.
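The sketch below implements the simplified rule described above: the only change from Adagrad is that the accumulator is a decaying weighted average (gamma = 0.95) instead of a plain sum. The full AdaDelta algorithm additionally keeps a decaying average of past squared parameter updates and uses it in place of a global learning rate; that refinement is omitted here, and the data and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0])

w = np.zeros(2)
lr, gamma, eps = 0.1, 0.95, 1e-8
avg_sq_grad = np.zeros(2)                          # E[g^2]_t, the restricted accumulator

for t in range(300):
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    w -= lr * grad / (np.sqrt(avg_sq_grad) + eps)  # accumulator no longer grows without bound
print(w)                                           # close to [3, -2]
```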

Advantages:

  1. The learning rate does not decay to zero, so the training does not stop prematurely.

Disadvantages:

  1. Computationally expensive.

RMSProp:

Both optimizing algorithms, RMSprop (Root Mean Square Propagation) and AdaDelta, were developed around the same time and for the same purpose: to resolve Adagrad's problem of radically diminishing learning rates. Both use the same core idea, an exponentially weighted average of squared gradients, to determine the learning rate at time t for each iteration.

RMSprop is an adaptive learning rate method proposed by Geoffrey Hinton, which divides the learning rate by the root of an exponentially weighted average of squared gradients. Hinton suggests setting gamma to 0.9, as it has shown good results in most cases.

Simply put, RMSprop uses an adaptive learning rate instead of treating the learning rate as a hyperparameter. This means that the learning rate changes over time.
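Since RMSprop relies on the same exponentially weighted average, the illustrative sketch below (gamma = 0.9, toy data and learning rate assumed for the example) prints the effective per-weight learning rate every 50 steps to make the point above visible: the learning rate really does change over time.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0])

w = np.zeros(2)
lr, gamma, eps = 0.05, 0.9, 1e-8
avg_sq_grad = np.zeros(2)

for t in range(1, 201):
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    effective_lr = lr / (np.sqrt(avg_sq_grad) + eps)   # recomputed at every step
    w -= effective_lr * grad
    if t % 50 == 0:
        print(t, effective_lr, w)                      # w ends up close to [3, -2]
```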

Adam:

This is the Adaptive Moment Estimation algorithm, which also computes adaptive learning rates for each parameter at every iteration. It uses a combination of Gradient Descent with Momentum and RMSprop to determine the parameter values.

When the algorithm was introduced, its authors listed a number of attractive benefits of using Adam on non-convex optimization problems, which helped make it the most commonly used optimizer.

It combines the benefits of both Gradient Descent with Momentum and RMSprop: low memory requirements, suitability for non-stationary objectives, good behaviour with large datasets and many parameters, and efficient computation. It uses the same adaptive learning rate methodology and, in addition, stores an exponentially weighted average of the past gradients as well as of the past squared derivatives of the loss with respect to each weight.

It comes with several hyperparameters: β1, β2, and ε (epsilon). β1 and β2 are the exponential decay rates for the Momentum and RMSprop terms respectively; β1 corresponds to the first moment and β2 to the second moment.

For updating the weights with an adaptive learning rate at iteration t, first, we need to calculate the first and second moment given by the following formulae,

VdW = β1 ⋅ VdW + (1−β1) ⋅ dW   (Gradient Descent with Momentum, first moment)

SdW = β2 ⋅ SdW + (1−β2) ⋅ dW²   (RMSprop, second moment)

The bias-corrected VdW and SdW are given by:

VdW_corrected = VdW / (1 − β1^t)

SdW_corrected = SdW / (1 − β2^t)

Therefore the new weight is updated using the formula:

W = W − η ⋅ VdW_corrected / (√SdW_corrected + ε)

The initial value of the learning rate η is to be tuned for better results.

Adam is relatively easy to configure: the default configuration parameters do well on most problems. The proposed defaults are β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸. Studies show that Adam works well in practice in comparison to other adaptive learning algorithms.
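Combining the momentum and RMSprop pieces with the bias corrections gives the sketch below, using the default values mentioned above (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸); the toy data and the learning rate η = 0.05 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0])

w = np.zeros(2)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
v_dw = np.zeros(2)                                 # first moment  (VdW, momentum term)
s_dw = np.zeros(2)                                 # second moment (SdW, RMSprop term)

for t in range(1, 401):
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    v_dw = beta1 * v_dw + (1 - beta1) * grad       # VdW = b1*VdW + (1-b1)*dW
    s_dw = beta2 * s_dw + (1 - beta2) * grad ** 2  # SdW = b2*SdW + (1-b2)*dW^2
    v_hat = v_dw / (1 - beta1 ** t)                # bias-corrected VdW
    s_hat = s_dw / (1 - beta2 ** t)                # bias-corrected SdW
    w -= eta * v_hat / (np.sqrt(s_hat) + eps)
print(w)                                           # close to [3, -2]
```

In practice you would rely on a deep learning framework's built-in Adam implementation, but the loop above captures the core of the update rule.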


Conclusions

Adam is generally the best of these optimizers. If one wants to train a neural network in less time and more efficiently, Adam is the optimizer to choose.

For sparse data, use the optimizers with a dynamic (adaptive) learning rate.
