Optimizers in Deep Learning

Awais Qarni
Published in DiveDeepAI
Aug 10, 2022

What exactly is an optimizer, and why do we need one?

Optimizers adjust the properties of your neural network, such as its weights and learning rate, using the gradients computed during backpropagation, in order to minimize the loss (or cost) function.

  • Gradient Descent

Gradient descent is an iterative optimization algorithm used to minimize the cost function when training deep neural networks; ideally it finds the global minimum, but it may get trapped in a local minimum. Because the gradient is computed over the entire dataset, convergence is slow, and if the dataset contains millions or billions of data points, the computation is both memory- and compute-intensive. The weights are updated only after the gradient has been calculated on the whole dataset, so for very large datasets it can take an extremely long time to converge to the minimum.
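
As a rough sketch (using NumPy and assuming a simple linear model with a mean-squared-error loss, neither of which is specified in the article), batch gradient descent looks at the entire dataset before every single weight update:

```python
import numpy as np

# Minimal sketch of batch gradient descent for a linear model with MSE loss,
# assuming the full dataset (X, y) fits in memory.
def gradient_descent(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = X @ w
        # Gradient of the mean squared error over the ENTIRE dataset
        grad = X.T @ (preds - y) / len(y)
        w -= lr * grad  # only one weight update per full pass over the data
    return w
```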

  • Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates the model’s parameters more frequently: the parameters are adjusted after the loss is computed on each individual training example. So if the dataset contains 1,000 rows, SGD updates the model parameters 1,000 times in one pass over the dataset, instead of once as in batch gradient descent.

Because the parameters are updated so frequently, they have high variance and the loss fluctuates with varying intensity; this noise can also help the model reach new minima. To obtain the same convergence as batch gradient descent, the learning rate needs to be slowly reduced over time.
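
A minimal sketch under the same assumptions as above (NumPy, linear model, squared-error loss); note that the weights are updated once per training example:

```python
import numpy as np

# Minimal sketch of stochastic gradient descent: one update per training
# example, so a 1000-row dataset gives 1000 updates per epoch.
def sgd(X, y, lr=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            xi, yi = X[i], y[i]
            grad = xi * (xi @ w - yi)  # gradient of the squared error on one example
            w -= lr * grad
    return w
```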

Figure: SGD with momentum
  • Mini-Batch Stochastic Gradient Descent

Mini-batch gradient descent is a variation of gradient descent in which each batch contains more than one example but fewer than the full dataset. It is widely used, converges faster, and is more stable; the batch size can vary depending on the dataset. Because each batch averages over several different samples, the noise (the variance of the weight updates) is reduced, which leads to more stable and faster convergence.

Mini-batch gradient descent still has drawbacks: choosing an optimal learning rate is difficult (if it is too low, convergence may take many hours); the learning rate is constant for all parameters, even though some parameters may need to change at a different rate; and the algorithm may still get trapped in a local minimum.
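
A minimal mini-batch sketch under the same assumptions as the earlier snippets; the batch size of 32 is just an illustrative default:

```python
import numpy as np

# Minimal sketch of mini-batch gradient descent: each update averages the
# gradient over a small batch, trading off the noise of SGD against the
# cost of full-batch gradient descent.
def minibatch_gd(X, y, lr=0.01, epochs=10, batch_size=32):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)
            w -= lr * grad
    return w
```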

  • AdaGrad

AdaGrad takes a different approach: for each parameter it accumulates a running sum of squared gradients, g. When a weight is updated, the current gradient is divided by the square root of that accumulated term g. Along directions where the gradients have been small, the sum of squared gradients stays small, so dividing by its root scales the update up; along directions with large gradients, the accumulated sum is large and the update is scaled down.

However, there is a problem with this algorithm: imagine what happens to the sum of squared gradients when training runs for a long time. The term keeps growing, and when the current gradient is divided by this ever-larger number, the weight updates become vanishingly small. In AdaGrad, the learning rate thus effectively changes (and eventually decays) for each trainable parameter.
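
A minimal sketch of the AdaGrad update rule; `g_sum`, `grad`, and `eps` are names I am introducing for the accumulated squared gradients, the current gradient, and a small constant that avoids division by zero:

```python
import numpy as np

# Minimal sketch of one AdaGrad step. w and g_sum are NumPy arrays of the
# same shape as the gradient and are updated in place.
def adagrad_step(w, g_sum, grad, lr=0.01, eps=1e-8):
    g_sum += grad ** 2                        # running sum of squared gradients
    w -= lr * grad / (np.sqrt(g_sum) + eps)   # per-parameter effective learning rate
    return w, g_sum
```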

  • RMS Prop

RMSProp addresses the problem of widely varying gradient magnitudes: some gradients are tiny while others are huge, so a single fixed learning rate is not the best idea. RMSProp adapts the step size individually for each weight by keeping an exponentially decaying average of the squared gradients and dividing the learning rate for each weight by the root of that average. Weights with consistently large gradients therefore take smaller steps, while weights with small gradients take larger ones. The problem with RMSProp is that the base learning rate still has to be chosen manually, and the suggested default value does not work for every application.
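
A minimal sketch of the RMSProp update as described above; `rho` is the decay rate of the moving average, and the default values shown are common choices rather than values prescribed by the article:

```python
import numpy as np

# Minimal sketch of one RMSProp step: an exponentially decaying average of
# squared gradients replaces AdaGrad's ever-growing sum.
def rmsprop_step(w, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    w -= lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq
```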

  • Adadelta

Adadelta is an extension of AdaGrad that addresses the issue of the decaying learning rate. Instead of accumulating all previously squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size w, and in practice uses an exponentially decaying moving average rather than a sum over all past gradients. Its main drawback is that it is computationally more expensive.
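
A minimal sketch of the Adadelta update, which keeps decaying averages of both the squared gradients and the squared parameter updates so that no global learning rate is required (variable names are mine):

```python
import numpy as np

# Minimal sketch of one Adadelta step. avg_sq_grad tracks squared gradients,
# avg_sq_delta tracks squared parameter updates; their ratio sets the step size.
def adadelta_step(w, avg_sq_grad, avg_sq_delta, grad, rho=0.95, eps=1e-6):
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    w += delta
    return w, avg_sq_grad, avg_sq_delta
```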

  • Adam

Adam (Adaptive Moment Estimation) works with estimates of both the first and second moments of the gradients. The idea behind Adam is that instead of rolling so fast that we overshoot the minimum, we should slow down a little to allow a more careful search. The method is fast and converges rapidly, but it is computationally more costly than plain SGD.
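
A minimal sketch of the Adam update, combining a decaying average of the gradients (first moment) with a decaying average of the squared gradients (second moment), plus bias correction for the first few steps; the defaults shown are the commonly used ones, not values taken from the article:

```python
import numpy as np

# Minimal sketch of one Adam step. m and v are the running first- and
# second-moment estimates; t is the 1-based step count.
def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```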

What is the best Optimization Algorithm for Deep Learning?

Adam is generally the best choice: if you want to train a neural network in less time and more efficiently, Adam is the optimizer to use. If you want to stick with a plain gradient descent algorithm, mini-batch gradient descent is the best option. I recommend that you always start with the Adam optimizer, regardless of the neural network architecture or the problem domain you are dealing with.
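
For instance, in PyTorch (my choice of framework here; the article does not name one), starting with Adam is a one-line decision that is easy to change later:

```python
import torch

# Hypothetical tiny model; the article does not specify an architecture.
model = torch.nn.Linear(10, 1)

# Start with Adam; swapping in mini-batch SGD later is a one-line change.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```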

