Optimizers in Deep Learning
What is an optimizer?
Optimizers are algorithms used to minimize an error function (the loss function) or, more generally, to maximize some objective. They operate on the model's learnable parameters, i.e. its weights and biases. An optimizer determines how to change those parameters, and in adaptive methods the effective learning rate as well, in order to reduce the loss.
This post will walk you through the optimizers and some popular approaches.
Types of optimizers
Let's look at the different types of optimizers and how exactly they work to minimize the loss function.
Gradient Descent
Gradient descent is an iterative optimization algorithm that tweaks a model's parameters to drive a given function toward a local minimum (which is also the global minimum when the function is convex). It reduces the loss by repeatedly moving in the direction opposite to the gradient, which is the direction of steepest ascent, so it depends on the derivatives of the loss function to find minima. In its basic (batch) form, it uses the entire training set to compute the gradient of the cost function with respect to the parameters at every update, which requires a large amount of memory and slows down the process.
Advantages of Gradient Descent
- Easy to understand
- Easy to implement
Disadvantages of Gradient Descent
- Because this method calculates the gradient over the entire data set for every single update, each update is very slow.
- It requires a large amount of memory and is computationally expensive.
The size of the steps that gradient descent takes toward the local minimum is determined by the learning rate, which controls how fast or slow we move toward the optimal weights.
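As a minimal sketch of the update rule described above, the following pure-Python example fits a line y = w * x + b to a tiny toy data set with batch gradient descent. The data, the function name `batch_gradient_descent`, and the values chosen for `learning_rate` and `epochs` are assumptions made for illustration only.

```python
# Toy data generated from y = 2x + 1 (hypothetical example data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error over the ENTIRE data set.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move against the gradient; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = batch_gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Every update touches all five examples, which is exactly what becomes expensive when the data set is large.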
Stochastic Gradient Descent
It is a variant of gradient descent that updates the model parameters after each individual training example instead of after a full pass over the data. If the data set has 10K examples, SGD will update the model parameters 10K times per epoch.
Advantages of Stochastic Gradient Descent
- Frequent updates of the model parameters.
- Requires less memory.
- Allows the use of large data sets, as it has to process only one example at a time.
Disadvantages of Stochastic Gradient Descent
- The frequent updates can also result in noisy gradients, which may cause the error to increase instead of decrease.
- High variance.
- Frequent updates are computationally expensive.
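To make the "one update per example" behaviour concrete, here is a small sketch of SGD on the same kind of toy line-fitting problem. The data, the `sgd` function, the seed, and the hyperparameter values are illustrative assumptions. The function also counts its updates so the relationship between data set size and number of updates is visible.

```python
import random

# Toy data generated from y = 2x + 1 (hypothetical example data).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def sgd(xs, ys, learning_rate=0.02, epochs=500, seed=0):
    """Fit y = w * x + b, updating the parameters after every single example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for i in indices:
            # Gradient of the squared error of ONE example.
            error = w * xs[i] + b - ys[i]
            w -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
            updates += 1
    return w, b, updates


w, b, updates = sgd(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}, updates = {updates}")
```

With 5 examples and 500 epochs the parameters are updated 2,500 times, i.e. once per example per epoch.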
Mini-Batch Gradient Descent
It combines the concepts of SGD and batch gradient descent. It simply splits the training data set into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It reduces the variance of the parameter updates, so convergence is more stable. The batches are chosen at random and typically contain between 50 and 256 examples.
Advantages of Mini-Batch Gradient Descent
- It leads to more stable convergence.
- More efficient gradient calculations.
- Requires less memory.
Disadvantages of Mini-Batch Gradient Descent
- Mini-batch gradient descent does not guarantee good convergence.
- If the learning rate is too small, convergence will be slow. If it is too large, the loss function will oscillate around the minimum or even diverge from it.
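The batching scheme can be sketched as follows. This is only an illustration under assumed values: the toy data, the `minibatch_gd` function, and a batch size of 4 (far smaller than the 50 to 256 mentioned above, to keep the example tiny) are not taken from any library.

```python
import random

# Toy data generated from y = 2x + 1 (hypothetical example data).
xs = [i / 4 for i in range(8)]
ys = [2 * x + 1 for x in xs]


def minibatch_gd(xs, ys, batch_size=4, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w * x + b with one update per randomly drawn mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # random batches in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the gradient over the examples in this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


w, b = minibatch_gd(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```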
SGD with Momentum
SGD with Momentum is a stochastic optimization method that adds a momentum term to regular stochastic gradient descent. Momentum simulates the inertia of a moving object: the direction of the previous update is retained to a certain extent, while the current gradient is used to fine-tune the final update direction. This increases stability to a certain extent, speeds up learning, and can help the optimizer escape local optima.
Advantages of SGD with momentum
- Momentum helps to reduce the noise.
- An exponentially weighted average of the gradients is used to smooth the updates.
Disadvantage of SGD with momentum
- An extra hyperparameter is added.
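A minimal sketch of the momentum update, written in the exponentially weighted average form mentioned above. The toy loss f(x, y) = x² + 10y², the function names, and the values of `learning_rate` and `beta` are assumptions. The gradient here is exact rather than stochastic, so the example isolates the effect of the momentum term.

```python
def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + 10 * y**2."""
    return 2 * x, 20 * y


def sgd_momentum(x, y, learning_rate=0.05, beta=0.9, steps=500):
    """Gradient descent where each step follows a running average of gradients."""
    vx, vy = 0.0, 0.0  # velocity: exponentially weighted average of gradients
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Keep a fraction beta of the previous direction, blend in the new gradient.
        vx = beta * vx + (1 - beta) * gx
        vy = beta * vy + (1 - beta) * gy
        x -= learning_rate * vx
        y -= learning_rate * vy
    return x, y


x, y = sgd_momentum(5.0, 5.0)
print(f"x = {x:.2e}, y = {y:.2e}")
```

The extra hyperparameter is `beta`: values closer to 1 retain more of the previous direction.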
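AdaGrad (Adaptive Gradient Descent)
In all the algorithms discussed so far, the learning rate remains constant. The intuition behind AdaGrad is to use a different learning rate for each parameter at each iteration. Every parameter's step is divided by the square root of the sum of its past squared gradients, so parameters that have received large or frequent gradients take smaller steps, and rarely updated parameters take larger ones.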
Advantages of AdaGrad
- The learning rate changes adaptively with iterations.
- It works well with sparse data.
Disadvantage of AdaGrad
- Because the squared gradients keep accumulating, the effective learning rate shrinks throughout training. In long training runs it can become so small that the parameters practically stop updating.
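The per-parameter scaling can be sketched like this. The toy loss f(x, y) = x² + 10y², the function names, and the hyperparameter values are assumptions chosen for illustration.

```python
import math


def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + 10 * y**2."""
    return 2 * x, 20 * y


def adagrad(x, y, learning_rate=1.0, eps=1e-8, steps=500):
    """AdaGrad: divide each step by the root of the accumulated squared gradients."""
    sum_sq_x, sum_sq_y = 0.0, 0.0  # one accumulator per parameter
    for _ in range(steps):
        gx, gy = grad(x, y)
        # The accumulators only ever grow, so the effective rate only shrinks.
        sum_sq_x += gx * gx
        sum_sq_y += gy * gy
        x -= learning_rate * gx / (math.sqrt(sum_sq_x) + eps)
        y -= learning_rate * gy / (math.sqrt(sum_sq_y) + eps)
    return x, y


x, y = adagrad(5.0, 5.0)
print(f"x = {x:.2e}, y = {y:.2e}")
```

Although the gradient in y is ten times larger than in x, each coordinate is normalized by its own history, so neither needs a hand-tuned learning rate of its own.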
RMS-Prop (Root Mean Square Propagation)
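RMS-Prop is a modified version of AdaGrad in which the learning rate is scaled by an exponentially decaying average of squared gradients instead of the cumulative sum of squared gradients. In other words, RMS-Prop applies the exponential averaging idea used in momentum to AdaGrad's accumulator, so the effective learning rate no longer shrinks monotonically.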
Advantages of RMS-Prop
- In RMS-Prop the learning rate is adjusted automatically, and a different effective learning rate is used for each parameter.
Disadvantages of RMS-Prop
- Learning can be slow.
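A sketch of the RMS-Prop update under the same assumptions as before (toy loss f(x, y) = x² + 10y², hypothetical function names, illustrative values for `learning_rate` and `rho`). The only change relative to AdaGrad is that the accumulator is a decaying average rather than a running sum.

```python
import math


def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + 10 * y**2."""
    return 2 * x, 20 * y


def rmsprop(x, y, learning_rate=0.01, rho=0.9, eps=1e-8, steps=2000):
    """RMS-Prop: scale each step by a decaying average of squared gradients."""
    avg_sq_x, avg_sq_y = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Old squared gradients are gradually forgotten (factor rho).
        avg_sq_x = rho * avg_sq_x + (1 - rho) * gx * gx
        avg_sq_y = rho * avg_sq_y + (1 - rho) * gy * gy
        x -= learning_rate * gx / (math.sqrt(avg_sq_x) + eps)
        y -= learning_rate * gy / (math.sqrt(avg_sq_y) + eps)
    return x, y


x, y = rmsprop(5.0, 5.0)
print(f"x = {x:.4f}, y = {y:.4f}")
```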
Adadelta
Adadelta is an extension of Adagrad that tries to counter Adagrad's aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it keeps an exponentially decaying average of them. In Adadelta we do not need to set a default learning rate, because the step size is derived from the ratio of the running average (RMS) of the previous parameter updates to the running average (RMS) of the gradients.
Advantages of Adadelta
- The main advantage of Adadelta is that we do not need to set a default learning rate.
Disadvantages of Adadelta
- Computationally expensive
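The ratio-based step can be sketched as follows. The toy loss, the function names, and the values of `rho`, `eps`, and `steps` are assumptions. Note that there is no learning rate argument at all. With a small `eps`, the first steps are tiny, so this short run is only meant to show the loss going down, not full convergence.

```python
import math


def loss(x, y):
    """Toy loss f(x, y) = x**2 + 10 * y**2."""
    return x ** 2 + 10 * y ** 2


def grad(x, y):
    """Gradient of the toy loss."""
    return 2 * x, 20 * y


def adadelta(x, y, rho=0.9, eps=1e-6, steps=300):
    """Adadelta: step = -(RMS of past steps / RMS of gradients) * gradient."""
    params = [x, y]
    avg_sq_grad = [0.0, 0.0]  # decaying average of squared gradients
    avg_sq_step = [0.0, 0.0]  # decaying average of squared parameter updates
    for _ in range(steps):
        grads = grad(*params)
        for i, g in enumerate(grads):
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g * g
            step = -math.sqrt(avg_sq_step[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g
            avg_sq_step[i] = rho * avg_sq_step[i] + (1 - rho) * step * step
            params[i] += step
    return params[0], params[1]


x, y = adadelta(5.0, 5.0)
print(f"loss before = {loss(5.0, 5.0):.1f}, loss after = {loss(x, y):.1f}")
```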
Adam (Adaptive Moment Estimation)
Adam is one of the most popular gradient descent optimization algorithms. It computes adaptive learning rates for each parameter. It stores both a decaying average of the past gradients, similar to momentum, and a decaying average of the past squared gradients, similar to RMS-Prop and Adadelta. Thus, it combines the advantages of both approaches.
Advantages of Adam
- Easy to implement.
- Computationally efficient.
- Low memory requirements.
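The two decaying averages fit into a few lines of code. This is a sketch under assumptions: the toy loss f(x, y) = x² + 10y², the function names, and a `learning_rate` of 0.1 are illustrative choices, while `beta1`, `beta2`, and `eps` use commonly quoted default values. The bias correction compensates for both averages starting at zero.

```python
import math


def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + 10 * y**2."""
    return 2 * x, 20 * y


def adam(x, y, learning_rate=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: momentum-style average of gradients plus RMS-Prop-style scaling."""
    params = [x, y]
    m = [0.0, 0.0]  # decaying average of gradients (first moment)
    v = [0.0, 0.0]  # decaying average of squared gradients (second moment)
    for t in range(1, steps + 1):
        grads = grad(*params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            # Bias correction: both averages start at zero and are too small early on.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]


x, y = adam(5.0, 5.0)
print(f"x = {x:.2e}, y = {y:.2e}")
```

Only two extra numbers per parameter (`m` and `v`) have to be stored, which is where the low memory requirement comes from.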
How to choose optimizers?
- If the data is sparse, use one of the adaptive methods: Adagrad, Adadelta, RMSprop, or Adam.
- RMSprop, Adadelta, and Adam have similar effects in many cases.
- Adam essentially adds bias correction and momentum on top of RMSprop.
- As the gradients become sparser, Adam tends to perform better than RMSprop.
I hope this article has helped you learn and understand more about these concepts.