Types of Gradient Descent Optimisation Algorithms

Devansh Khandekar
Published in The Startup
Apr 13, 2020 · 11 min read

An optimizer, or optimization algorithm, is one of the key aspects of training an efficient neural network. The loss function (or error function) is the objective function of the optimization; in simple terms, it is a function that describes the difference between the actual and predicted labels y. In machine learning and deep learning, the optimization algorithm tends to minimize the loss function and can therefore be viewed as a minimization procedure, although whether the objective is minimized or maximized depends on the objective function and the problem. For the loss function to be optimizable in this way, it must be differentiable. In simple terms, optimization is the process of searching for the parameter values that make the loss, i.e. the error in the predicted labels, as small as possible. Here the parameters are the weights and bias, and the loss function is a continuous, differentiable function.

Fig.1: Figure showing a non-convex function and a trajectory of an optimizer

Optimization algorithms are of two types:

  1. First-order optimizations
  2. Second-order optimizations

However, this post focuses only on the most widely used and practical optimization strategies, and therefore covers only first-order optimizations. Second-order techniques such as Newton's method are efficient in many cases but are also computationally expensive.

Gradient Descent Approach:

Gradient Descent is the most popular and almost an ideal optimization strategy for deep learning tasks. Let us understand Gradient Descent with some maths. The simple equation for updating weights is as follows:
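The original equation appears as an image in the article; written out in the notation used later in this post (with η as the learning rate and ∂L/∂W as the gradient of the loss with respect to the weights), the standard update is:

Wₜ = Wₜ₋₁ − η · (∂L / ∂Wₜ₋₁)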

The Gradient Descent optimizer tries to learn the optimal parameters (here, the weights) for which the loss function is minimum, i.e. where its gradient approaches zero. A gradient is nothing but the slope, the tangent, or the value of the derivative at that point.

Why Gradient?
A gradient, or derivative, is in simpler terms nothing but the rate of change of a value; in our case, it is the rate of change of the loss function with respect to the weights.
The learning rate is the step size used to jump between parameter values while travelling down the hill of the loss function towards the valley. The loss function is a continuous convex or non-convex hill of uncertainties. This hilly path may have many local minima (valleys), local maxima (peaks), plateaus, and saddle points. However, the goal of our optimizer is to find the global minimum, i.e. the deepest valley.

Fig.2: Optimizer path over Loss Surface

There exist many variations of Gradient Descent, which differ only in how much data is used to compute the gradient of the loss function when updating the weights.

  1. Vanilla Gradient Descent or Simple Gradient Descent: It computes the gradient of the loss function over the entire training dataset. Computing the gradients for all points for a single weight update is a cumbersome task in itself: it can take a very long time, or practically forever, and also requires a large amount of memory to hold the entire dataset and its gradients. Hence, even though Gradient Descent is the ideal way of optimization, it is often not computationally feasible. For convex problems, Gradient Descent converges to the global optimum and is less prone to getting stuck in local minima or at saddle points. However, because of its large time and space complexity even for simple convex functions, Gradient Descent is the least favored choice for practitioners. For non-convex problems it may take even longer and might remain stuck in a local minimum or near a saddle point. A saddle point is a point on the surface where one dimension slopes up and another slopes down, so the gradient is locally at a minimum relative to all nearby points. When the gradient becomes (close to) zero, the weight updates stop, which means the parameters have been learned; this is referred to as convergence.

  2. Stochastic Gradient Descent: It computes the gradient and updates the weights for every (xᵢ, yᵢ) pair. This method is comparatively faster and computationally less expensive. Moreover, it can update the weights with new data points on the fly. Because the parameters are updated for each point, SGD shows many fluctuations, or high variance, on its journey towards the global minimum. This behaviour is both good and bad for optimization. Simple Gradient Descent follows a slow, steady path towards the nearest minimum, which sometimes confines it to a local minimum. SGD, on the other hand, fluctuates because each weight update uses a single random point, which helps it escape one local optimum and discover others. But because it fluctuates randomly, the exploration is still wayward and not very efficient for convergence. It is called Stochastic Gradient Descent because of this random exploration of the minima. The latter problem can, however, be reduced by selecting a smaller learning rate.

  3. Mini-batch Gradient Descent: As the name suggests, mini-batch gradient descent performs the update for a batch of some random points from the dataset. This ensures more stable weight updates than Stochastic Gradient Descent while being computationally less complex than Vanilla Gradient Descent. Modern deep learning libraries utilize the power of mini-batch gradient descent, and it is thus widely used. The size of the batch is a hyperparameter and largely depends on the dataset; a minimal code sketch of this update loop is shown right after this list.
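To make the three variants concrete, here is a minimal NumPy sketch of mini-batch gradient descent (the function and argument names, such as minibatch_gd and grad_fn, are illustrative and not from any particular library). Setting batch_size equal to the dataset size reduces it to Vanilla Gradient Descent, while batch_size = 1 turns it into SGD.

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    """Minimal mini-batch gradient descent sketch.

    grad_fn(w, X_batch, y_batch) is assumed to return dL/dw for that batch.
    """
    w = np.zeros(X.shape[1])                       # initialise weights
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)             # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # indices of one mini-batch
            g = grad_fn(w, X[batch], y[batch])     # gradient on that batch only
            w -= lr * g                            # step against the gradient
    return w
```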

To simplify all of the above theory, I will explain it with the analogy of a trekker. Think of our optimizer as a trekker walking over the mountain ranges of our cost function. The ultimate goal of the trekker is to find us the geo-coordinates (the weights of the loss function) of the deepest valley. Think of our data points as the information the trekker analyzes in order to update the geo-coordinates of his current position, which in turn helps him travel towards the nearest valley. Our optimizer also has a lookup table to validate the input data; think of this lookup table as the actual labels y. In Gradient Descent, the trekker analyzes all the data points, checks them against his lookup table, and only then updates his knowledge of the geo-coordinates; after this enormous analysis and computation the optimizer merely crawls towards the nearest valley, which is why Gradient Descent is slow. In Stochastic Gradient Descent, the optimizer updates its knowledge of the geo-coordinates for every single data point, so it mostly walks a path that deviates from the valley but ultimately reaches the point of convergence. In Mini-batch Gradient Descent, the optimizer updates its parameters after each batch of points.

But this is not all. Our optimizers are not yet efficient enough to trek through all the complex terrain of the cost function and find us the deepest valley. So far we have only tweaked how many input data points are used to update the weights. In deep learning, the cost or loss function is mostly non-convex. We have seen our optimizer crawling, wandering, and walking in its quest to find the ultimate global optimum (i.e. the deepest valley), but these tricks alone do not make our optimizers super-efficient. Our optimization process still can't address certain problems, such as:

i.) How to effectively deal with the problem of getting stuck in suboptimal local minima and saddle points?

ii.) How to decide the learning rate for faster convergence?

iii.) Should the learning rate be kept constant for all features, or should it be made adaptive for dense and sparse features?

iv.) How should the optimizer gain momentum in the direction of the global minimum?

To tackle all the aforementioned issues, researchers came up with many algorithms and strategies built on the gradient descent approach.

Momentum:

The paths explored by SGD and Mini-batch SGD are noisy: the optimizer fluctuates a lot and does not follow a steady, straight path towards the global optimum. This suggests that the optimizer lacks momentum. To denoise the optimizer's exploration trajectory, researchers came up with a method to impart momentum to the update function.

Fig.3: Figure showing the noisy data and the actual function of those data points

Random fluctuations are reduced by taking a weighted average over the past updates. This reduces the oscillation, or random fluctuation, of the optimizer and accelerates it towards the optimal point. It does this by adding a fraction λ of the previous update vector to the current update equation.

Fig.4: Reduced variance in the momentum path, close to the actual function

For simplicity, let us look at the update equation for a single feature. Consider Wₜ and Wₜ₋₁ as scalar weight values at iterations t and t−1, and let the gradient with respect to the weight be gₜ = ∂L / ∂Wₜ₋₁. The update equation is then written as:
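The update shown as an image in the original post takes the standard momentum form (with λ typically around 0.9):

Vₜ = λ Vₜ₋₁ + η gₜ
Wₜ = Wₜ₋₁ − Vₜ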

Fig.5: Momentum reducing the fluctuations from the original SGD path

The term λ Vₜ₋₁ is called the momentum term and η gₜ is called the gradient term. The momentum contribution increases for dimensions whose gradients point in the same direction and reduces the updates for dimensions whose gradients change direction. This helps reduce the optimizer's fluctuations and reach convergence faster.

Nesterov Accelerated Gradient (NAG):

Another improvement over SGD + Momentum is the Nesterov Accelerated Gradient. In the SGD + Momentum technique, the resultant of the momentum and gradient vectors gives the direction of the optimizer's step, and the gradient vector is calculated at the initial point.

Fig.6: NAG update path

In NAG, the optimizer first travels from the initial point (point 1) in the direction of the momentum to reach point 2. The gradient vector at point 2 is then calculated. The resultant of the momentum at point 1 and the gradient at point 2 is the actual step and direction of the optimizer. With the figure in mind, let us walk through the maths behind NAG:

  1. Let the initial point be Wₜ₋₁. First, move in the direction of the momentum; let this new point be W’.
  2. Compute the gradient at point W’ and call it g’ (gₜ remains the gradient at the original point Wₜ₋₁).
  3. The resultant of these two vectors gives the step direction of the optimizer.

It is evident from the figure that gₜ ≠ g’ as opposed to what is assumed in the SGD + Momentum approach.

Fig.7: (SGD + momentum) vs (SGD + NAG )

The equation is thus improved as follows:
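Following the steps above, the improved update (reconstructed here, since the original shows it as an image) can be written as:

W’ = Wₜ₋₁ − λ Vₜ₋₁                        (look-ahead step along the momentum)
Vₜ = λ Vₜ₋₁ + η g’,  with g’ = ∂L / ∂W’   (gradient taken at the look-ahead point)
Wₜ = Wₜ₋₁ − Vₜ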

The typically used value of λ is again 0.9.

AdaGrad:

In SGD and SGD + Momentum based techniques, the learning rate is the same for all weights. For an efficient optimizer, the learning rate has to adapt to the weights; this lets the optimizer tune the step size per feature parameter. Using the same learning rate for sparse and dense features alike is not an efficient way of optimizing. AdaGrad, as the name suggests, is an adaptive gradient approach that effectively tunes the learning rate applied in the gradient term η gₜ.
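The AdaGrad update, shown as an image in the original post, is commonly written as follows (ε is a small constant added for numerical stability):

αₜ = g₁² + g₂² + … + gₜ²        (per-weight sum of squared past gradients)
η’ₜ = η / √(αₜ + ε)
Wₜ = Wₜ₋₁ − η’ₜ gₜ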

As the iteration number increases, the learning rate for that weight reduces; this is an iteration-based decay of the learning rate. The default value of the learning rate η is usually taken as 0.01. The characteristics of adaptive-gradient-based SGD are as follows:

— There’s no need for manual tuning of η for each weight.
— Sparse and Dense features are taken care of by the adaptive gradient such that sparse features have higher learning rates and dense features have smaller learning rate.

— But αₜ can grow large with iteration and this makes η’ gₜ value very small which leads to slower convergence.

RMSprop and AdaDelta:

With AdaGrad, if αₜ is very large, convergence becomes slow.
These shortcomings of AdaGrad are removed in the AdaDelta and RMSprop algorithms. Both algorithms were developed independently around the same time and share the same idea: instead of taking the sum of all past squared gradients in the αₜ computation, an exponentially decaying average of the squared gradients is used.

Fig.8: RMSprop trajectory

So in RMSprop, the learning rate is computed as follows:
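A standard form of the RMSprop update (reconstructed here, since the original shows it as an image; β is the decay rate, typically around 0.9) is:

E[g²]ₜ = β E[g²]ₜ₋₁ + (1 − β) gₜ²        (exponentially decaying average of squared gradients)
Wₜ = Wₜ₋₁ − η gₜ / √(E[g²]ₜ + ε)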

In Adaptive Delta, or AdaDelta, the delta refers to the difference between the current weight and the newly updated weight. AdaDelta uses an exponential moving average of squared deltas in place of the learning rate, thereby removing the need to explicitly specify a learning rate parameter. The AdaDelta update is as follows:
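A standard form of this update (reconstructed here from the description above, with RMS[x]ₜ = √(E[x²]ₜ + ε)) is:

ΔWₜ = − ( RMS[ΔW]ₜ₋₁ / RMS[g]ₜ ) · gₜ
Wₜ = Wₜ₋₁ + ΔWₜ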

But both of these algorithms take the exponentially weighted average of gᵢ² instead of the sum of gᵢ², so as to avoid a large denominator in the effective learning rate.

Adam:

Adam, whose full name is Adaptive Moment Estimation, uses the best of both worlds from Momentum and AdaDelta. It utilizes an exponentially decaying average of past squared gradients, as in AdaDelta, and an exponentially decaying average of past gradients, as in Momentum.

mₜ is the exponentially decaying average of gₜ, and vₜ is the exponentially decaying average of gₜ².

mₜ, the first-order moment, is also referred to as the mean, and vₜ, the second-order moment, is referred to as the (uncentered) variance.
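Written out (reconstructed here, since the original equations appear as images), the Adam update is:

mₜ = β₁ mₜ₋₁ + (1 − β₁) gₜ
vₜ = β₂ vₜ₋₁ + (1 − β₂) gₜ²
m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)        (bias-corrected estimates)
Wₜ = Wₜ₋₁ − η m̂ₜ / (√v̂ₜ + ε)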

Typically η = 0.001, β₁ = 0.9 and β₂ = 0.999.
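As a concrete illustration, here is a minimal NumPy sketch of a single Adam step; the function name adam_step and its arguments are illustrative and not taken from any particular library.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                  # bias correction for the second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
    return w, m, v
```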

Nadam:

Nadam combines Adam with the Nesterov Accelerated Gradient approach. Recall from the Momentum section that the weight update with momentum is Wₜ = Wₜ₋₁ − (λ Vₜ₋₁ + η gₜ).

NAG improves on this by evaluating the gradient at the look-ahead point Wₜ₋₁ − λ Vₜ₋₁, as shown in the NAG section above.

In Nadam, instead of performing the separate look-ahead step of NAG, the look-ahead momentum vector is applied directly when updating the parameters.

The only change here compared to Adam is the use of the momentum λ Vₜ instead of λ Vₜ₋₁ as a look-ahead momentum vector.

The Adam update can also be written by expanding the moment estimate, so that the previous term m̂ₜ₋₁ appears explicitly in the parameter update.

To add Nesterov momentum to Adam, this previous momentum vector is simply replaced with the current one: in Nadam, the previous m̂ₜ₋₁ term is replaced by the current m̂ₜ term when computing the update. The final update equation of Nadam thus looks like:
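Written out in the notation above (reconstructed here, since the original shows it as an image), this gives:

Wₜ = Wₜ₋₁ − η / (√v̂ₜ + ε) · ( β₁ m̂ₜ + (1 − β₁) gₜ / (1 − β₁ᵗ) )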

Performance Visualizations of Different Optimizers:

Fig.9: Optimization trajectories of different algorithms towards the global optimal.
Fig.10: Trajectories of different optimizers in the saddle point.

From the above visualizations and equations, it is clear that adaptive algorithms like Adam, AdaDelta and Nadam converge faster than plain SGD and SGD with Momentum. Moreover, these adaptive algorithms are also less prone to getting stuck in local minima and saddle points, whereas simple SGD is more likely to stall near saddle points.

Conclusion:

In this blog post, we covered the most widely used first-order optimization algorithms. We tried to understand the motivation behind the development of these algorithms and techniques, along with the mathematical intuition involved. Finally, we compared the optimizers through visualizations of their trajectories.

If you have any questions or doubts, feel free to ask them in the comments section below. Please reach out to me if anything is missing in this post, or if something can be improved/corrected!✌🏼
