Into the Depths of Gradient Descent

Vishisht Priyadarshi
IITG.ai
Sep 29, 2019

This post is co-authored with Animesh Tiwari.

“Gradient descent is a first-order iterative optimisation algorithm for finding the minimum of a function.”

Gradient descent is one of the most popular algorithms for performing optimisation and by far the most common way to optimise neural networks. It is often used as a black-box optimiser, but good insight into its variants and the mathematics behind them often comes in handy. Gradient descent can also help in optimising non-convex functions. But what if the contour generated by the function is not so nice (saddle points, valleys, …)?
We can overcome most of these hurdles by building up a few concepts and bringing some mathematical insight into the naive implementation.

But first, let's dive into the various types of gradient descent implementations:

1. Batch Gradient Descent:

Also known as vanilla gradient descent, it computes the gradient of the cost function over the entire training dataset before making any update to the weights.

Source : https://www.textbook.ds100.org/ch/11/gradient_stochastic.html

Mathematically,

θ = θ − η · ∇θ J(θ)

where,

θ = parameters
η = Learning rate
J = Cost Function

The main point to notice is that the parameters receive a single update only after the entire dataset has been traversed to compute the gradient of the cost. As a result, batch gradient descent can become very slow. Moreover, because of this, updating our learning model online becomes very difficult.

Is gradient descent always guaranteed to converge?

If the learning rate is set correctly, yes: batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
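To make the batch update rule concrete, here is a minimal NumPy-style sketch (the helper `grad_fn(theta, X, y)`, assumed to return the gradient of the cost over the whole dataset, and the fixed epoch count are illustrative assumptions, not part of any library):

```python
def batch_gradient_descent(grad_fn, theta, X, y, eta=0.01, n_epochs=100):
    """Batch GD sketch: one parameter update per full pass over the data."""
    for epoch in range(n_epochs):
        grad = grad_fn(theta, X, y)   # gradient over the ENTIRE dataset
        theta = theta - eta * grad    # single update per epoch
    return theta
```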

2. Stochastic Gradient Descent:

In contrast to batch gradient descent, stochastic gradient descent performs a parameter update after each training example.

Source : https://www.textbook.ds100.org/ch/11/gradient_stochastic.html

As we can see, because an update is performed after each training example, the path to the optimum is noisier and, in practice, the exact minimum is never reached. The function keeps oscillating around the minimum due to the high variance of each update. We can overcome this issue by defining an error tolerance that specifies how close we want the function to get to the optimum value.

Its advantage over batch gradient descent is that it does away with redundant computations on its way to the minimum. As a result, it is much faster and can be used to learn from data online.

To make sure that the function converges to the optimum value, the learning rate is slowly decreased over time. With such a schedule, SGD shows almost the same convergence behaviour as batch gradient descent.
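As a rough illustration, here is a minimal NumPy sketch of SGD with per-example updates and a simple decaying learning rate (the per-example `grad_fn(theta, x, y)` helper and the decay factor are assumptions for the example):

```python
import numpy as np

def sgd(grad_fn, theta, X, y, eta=0.01, n_epochs=10):
    """SGD sketch: one update per training example, with a slowly
    decaying learning rate to damp oscillations near the minimum."""
    for epoch in range(n_epochs):
        indices = np.random.permutation(len(X))   # shuffle each epoch
        for i in indices:
            grad = grad_fn(theta, X[i], y[i])     # gradient of ONE example
            theta = theta - eta * grad
        eta *= 0.95                               # illustrative decay schedule
    return theta
```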

3. Mini Batch Gradient Descent:

Mini-batch gradient descent maintains a balance between the above two methods. In each iteration, a fixed number of examples is now chosen for the update instead of a single example, i.e., a mini-batch of examples is created and the parameters are updated once per such mini-batch.

Mathematically,

θ = θ − η · ∇θ J(θ; x^(i:i+n), y^(i:i+n))

where ,

θ = parameters
η = Learning rate
J = Cost Function
n = Number of training examples in a mini batch

The size of the mini-batch depends on the application the optimisation method is used for. It usually varies between 50 and 256.

NOTE:

Before using mini-batch gradient descent, the training data is shuffled, and mini-batches are chosen by iterating over the shuffled data.
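Putting the shuffling and batching together, a minimal NumPy sketch might look like the following (again, `grad_fn(theta, X_batch, y_batch)` is an assumed helper returning the gradient over a mini-batch):

```python
import numpy as np

def minibatch_gd(grad_fn, theta, X, y, eta=0.01, batch_size=64, n_epochs=10):
    """Mini-batch GD sketch: shuffle, then update once per mini-batch."""
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)            # shuffle before slicing batches
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = grad_fn(theta, X[idx], y[idx])  # gradient over the mini-batch
            theta = theta - eta * grad
    return theta
```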

Challenges:

There are some challenges with Vanilla Mini-Batch gradient descent which need to be addressed, because it does not guarantee good convergence:

  1. Choosing a proper learning rate is a challenge. If the learning rate is too small, the loss converges towards the minimum very slowly and training can take a very long time; if it is too large, the updates overshoot the minimum, the loss fluctuates heavily and the function may even diverge.
  2. The learning rate can be adjusted by using learning rate schedules during training, e.g. annealing, where the learning rate is reduced when the change in the objective function falls below a pre-defined threshold in an epoch. However, these schedules have to be defined in advance and therefore cannot adapt to the characteristics of the dataset.
  3. The same learning rate is used to update all the parameters, so every parameter is updated to the same extent. We may not want this: if the dataset is sparse and features occur with very different frequencies, we would like rarely occurring parameters to receive larger updates than frequently occurring ones. For this purpose, we make the modifications described later in the AdaGrad algorithm.
  4. Another challenge for SGD optimisation on non-convex functions is getting trapped in sub-optimal local minima and saddle points. Around a saddle point the derivative is close to zero over a large region, i.e., the point is surrounded by a plateau of nearly constant error, which makes it very hard for SGD to escape.

GRADIENT DESCENT OPTIMISATION ALGORITHMS:

In the upcoming sections, we cover some of the algorithms that help in overcoming the aforementioned challenges:

1. Momentum:

The main problem SGD faces is navigating the slopes of a ravine
(a ravine is an area where the surface curves much more steeply in one dimension than in another). It makes only hesitant progress along the bottom towards the local optimum and oscillates across the slopes along its path.

Source : https://www.slideshare.net/HojinYang3/gradient-descent-optimizer

SGD with Momentum helps in overcoming large variations encountered on the way to optimum value. It acts as a means to dampen the oscillations and accelerate SGD in the relevant direction.

Source : https://www.willamette.edu/~gorr/classes/cs449/momrate.html

The concept of Momentum is based on the idea of Exponentially Weighted Averages. It is a kind of moving average which would denoise the data and bring it closer to the original path the function would have followed in Batch Gradient Descent.

Mathematically,

vₜ = γ vₜ₋₁ + η ∇θ J(θ)
θ = θ − vₜ

where, γ is the momentum term.

Intuitively, we can think of γ like this: we are approximately averaging over the last
1 / (1 − γ) points of the path followed by the function.
So γ is usually set to 0.9 or a similar value.

Another important observation is that the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
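Here is a minimal NumPy sketch of SGD with momentum (the helper `grad_fn(theta)`, returning a stochastic gradient at the current parameters, is an assumption for the example):

```python
import numpy as np

def sgd_momentum(grad_fn, theta, eta=0.01, gamma=0.9, n_steps=1000):
    """SGD with momentum sketch: v keeps an exponentially weighted average
    of past gradients, damping oscillating directions."""
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)          # gradient at the current parameters
        v = gamma * v + eta * g     # v_t = gamma * v_{t-1} + eta * g
        theta = theta - v           # theta = theta - v_t
    return theta
```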

NOTE:
A very good visualisation of the working of SGD with momentum can be found on the following link:

https://distill.pub/2017/momentum/

2. Nesterov Accelerated Gradient :

Nesterov accelerated gradient is a modified version of SGD with momentum. It helps SGD choose a better path down to the local minimum, with fewer oscillations.

Mathematically,

vₜ = γ vₜ₋₁ + η ∇θ J(θ − γ vₜ₋₁)
θ = θ − vₜ

Here, evaluating the gradient at “θ − γ vₜ₋₁” approximates the next position of the parameters, so the update already accounts for where the parameters are about to move and the oscillations are reduced.
γ is set to 0.9 for exactly the same reasons as in SGD with momentum.

Source : https://towardsdatascience.com/learning-parameters-part-2-a190bef2d12

In the above illustration, we can see that with the help of Nesterov accelerated gradient, the function is able to take wiser steps towards the minimum, keeping the number of oscillations to a minimum.

Source : https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Here is a visualisation of Nesterov's method in terms of vectors seeking their way to the local minimum.
NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient and then makes a correction (green vector), while momentum first computes the current gradient (small blue vector) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector).

As a result, NAG helps in speeding up the convergence process and increases effectiveness of the algorithm by deciding whether to make a large update or small update on each iteration.
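A minimal NumPy sketch of NAG, under the same assumption that `grad_fn(theta)` returns the gradient at the point it is given:

```python
import numpy as np

def nag(grad_fn, theta, eta=0.01, gamma=0.9, n_steps=1000):
    """Nesterov accelerated gradient sketch: evaluate the gradient at the
    look-ahead point theta - gamma * v before applying the update."""
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta - gamma * v)   # gradient at the anticipated next position
        v = gamma * v + eta * g
        theta = theta - v
    return theta
```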

3. AdaGrad:

Tuning of the hyperparameters (like the learning rate η) can be very difficult and may vary across datasets. If we set η too small, the parameter updates will be very slow and it will take a very long time to reach an acceptable loss. If we set it too large, the parameters will jump all over the function and may never reach an acceptable loss at all. To make things worse, the high-dimensional, non-convex nature of neural network optimisation can lead to different sensitivities along each dimension: the learning rate could be too small in one dimension and too large in another.

Effects of Different learning rates

AdaGrad addresses this issue of gradient-based optimisation. It adapts the learning rate so that the size of each parameter's update depends on how frequently that parameter occurs: larger updates for infrequent parameters and smaller updates for frequent ones.

Source : https://www.slideshare.net/HojinYang3/gradient-descent-optimizer

Earlier, a single update was performed for all parameters θ at once because every parameter θᵢ used the same learning rate η. AdaGrad instead uses a different learning rate for every parameter θᵢ at every time step t.
Let's dive into the mathematics behind AdaGrad:

Denote the gradient of the objective function w.r.t. the parameter θᵢ at time step t as:

gₜ,ᵢ = ∇θ J(θₜ,ᵢ)

Then the plain SGD update step for every parameter θᵢ at time step t is:

θₜ₊₁,ᵢ = θₜ,ᵢ − η · gₜ,ᵢ

In its update rule, AdaGrad modifies the general learning rate η at each time step t for every parameter θᵢ, based on the past gradients that have been computed for θᵢ:

θₜ₊₁,ᵢ = θₜ,ᵢ − ( η / √(Gₜ,ᵢᵢ + ϵ) ) · gₜ,ᵢ

Gₜ ∈ ℝᵈˣᵈ here is a diagonal matrix where each diagonal element (i, i) is the sum of the squares of the gradients w.r.t. θᵢ up to time step t. The full matrix is Gₜ = ∑ gᵢ gᵢᵀ, where i runs from 1 to t.
ϵ is a smoothing term that avoids division by zero (usually on the order of 1e-8).

Actually, we could use the full matrix Gₜ in the parameter update, but computing the square root of the full matrix is impractical, especially in very high dimensions. On the other hand, computing the square root and the inverse of only the diagonal diag(Gₜ) can easily be done.

Another advantage of AdaGrad is that it does away with manual tuning of the learning rate. We can set a default learning rate (0.01 is a common choice) which is then used for the entire learning process.

But it also has some disadvantages. The main one is that the accumulation of positive terms in the denominator keeps growing during training, which causes the updates to become infinitesimally small. Because of this, the updating process almost comes to a halt and the model stops learning.
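A minimal NumPy sketch of the diagonal AdaGrad update described above (the helper `grad_fn(theta)` is an assumption for the example):

```python
import numpy as np

def adagrad(grad_fn, theta, eta=0.01, eps=1e-8, n_steps=1000):
    """AdaGrad sketch: per-parameter learning rates scaled by the
    accumulated sum of squared gradients (the diagonal of G_t)."""
    G = np.zeros_like(theta)                 # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(theta)
        G += g ** 2                          # grows monotonically -> updates shrink over time
        theta = theta - eta * g / (np.sqrt(G) + eps)
    return theta
```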

4. AdaDelta:

Adadelta is built on AdaGrad and attempts to address some of the above-mentioned disadvantages. AdaDelta accumulates the gradients in the previous time steps using an exponentially decaying average of the squared gradients. This prevents the denominator from becoming infinitesimally small, and ensures that the parameters continue to be updated even after a large number of iterations. It also replaces the global learning rate η with an exponentially decaying average of the squares of the parameter updates over the previous iterations. This method has been shown to perform relatively well when used to train deep networks, and is much less sensitive to the choice of hyper-parameters.

Instead of inefficiently storing the previous w squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average E[g²]ₜ at time step t then depends (via a fraction γ, similarly to the momentum term) only on the previous average and the current gradient:

E[g²]ₜ = γ E[g²]ₜ₋₁ + (1 − γ) gₜ²

We now rewrite our vanilla SGD update in terms of the parameter update vector Δθₜ:

Δθₜ = −η · gₜ
θₜ₊₁ = θₜ + Δθₜ

The parameter update vector of AdaGrad that we derived previously thus takes the form:

Δθₜ = −( η / √(Gₜ + ϵ) ) ⊙ gₜ

We now simply replace the diagonal matrix Gₜ with the decaying average over past squared gradients E[g²]ₜ:

Δθₜ = −( η / √(E[g²]ₜ + ϵ) ) gₜ = −( η / RMS[g]ₜ ) gₜ

The value of γ is generally set to around 0.9, the same as in momentum. We also note that the update above does not have the same units as the parameters, so we replace the learning rate η with the RMS of an exponentially decaying average of the squared parameter updates. Since this RMS is not known for the current step, we approximate it with the RMS up to the previous step, giving the final AdaDelta rule:

Δθₜ = −( RMS[Δθ]ₜ₋₁ / RMS[g]ₜ ) gₜ
θₜ₊₁ = θₜ + Δθₜ

With AdaDelta we do not even need to set an initial default learning rate, since it has been eliminated from the update rule.
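The same rule as a minimal NumPy sketch (ϵ and the decay γ are the usual defaults; `grad_fn(theta)` is again an assumed helper):

```python
import numpy as np

def adadelta(grad_fn, theta, gamma=0.9, eps=1e-6, n_steps=1000):
    """AdaDelta sketch: decaying averages of squared gradients and of
    squared parameter updates replace both G_t and the global learning rate."""
    Eg2 = np.zeros_like(theta)     # E[g^2]
    Edx2 = np.zeros_like(theta)    # E[delta_theta^2]
    for _ in range(n_steps):
        g = grad_fn(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # RMS[dθ]_{t-1} / RMS[g]_t
        Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2
        theta = theta + delta
    return theta
```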

5. RMS Prop:

This algorithm also addresses the problem of the continuously growing sum of squared gradients in the denominator of the update term: RMSProp likewise divides the learning rate by an exponentially decaying average of squared gradients E[g²]. This step is identical to the first update step of AdaDelta.

Source : http://bakbang.blogspot.com/2017/08/rmsprop.html

The decay term γ is typically set to 0.9, and a learning rate of around 0.001 is a suitable default.
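A minimal NumPy sketch of RMSProp with those defaults (again assuming a `grad_fn(theta)` helper):

```python
import numpy as np

def rmsprop(grad_fn, theta, eta=0.001, gamma=0.9, eps=1e-8, n_steps=1000):
    """RMSProp sketch: divide the learning rate by an exponentially
    decaying average of squared gradients."""
    Eg2 = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        theta = theta - eta * g / (np.sqrt(Eg2) + eps)
    return theta
```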

6. Adam:

It is also known as Adaptive Moment Estimation. It computes adaptive learning rates for different parameters: it keeps an exponentially decaying average of past gradients in addition to an exponentially decaying average of past squared gradients,

mₜ = β₁ mₜ₋₁ + (1 − β₁) gₜ
vₜ = β₂ vₜ₋₁ + (1 − β₂) gₜ²

where,
mₜ = estimate of the mean (first moment) of the gradients
vₜ = estimate of the uncentred variance (second moment) of the gradients.

Since the vectors m and v are initialised to zeros and the values of β₁ and β₂ are close to 1, these estimates are biased towards 0 during the initial steps. We therefore introduce two bias-correction terms to counteract this problem; the correction is large initially and approaches 1 as the number of steps grows:

m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)

θₜ₊₁ = θₜ − ( η / (√v̂ₜ + ϵ) ) m̂ₜ

Intuitively, each β averages (smoothens) the fluctuating quantity over roughly the last 1 / (1 − β) terms.

The default values of β1 and β2 are chosen to be around 0.9 and 0.999 respectively .
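Putting the moment estimates, bias correction and update rule together, a minimal NumPy sketch of Adam might look like this (`grad_fn(theta)` is an assumed helper):

```python
import numpy as np

def adam(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam sketch: first and second moment estimates with bias correction."""
    m = np.zeros_like(theta)    # first moment (mean of gradients)
    v = np.zeros_like(theta)    # second moment (uncentred variance)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction: large effect early, ~1 later
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```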

7. AdaMax:

The vₜ term in the denominator of the Adam update is based on the L2 norm of the past and current gradients. We can generalise this to Lₚ normed vector spaces. Norms for large p generally become numerically unstable, which is why the L1 and L2 norms are most common in practice; however, the L(infinity) norm is also stable. In AdaMax we therefore use the L(infinity) norm instead of the L2 norm when computing the second-moment term, replacing vₜ in the denominator with uₜ:

uₜ = max(β₂ · uₜ₋₁, |gₜ|)

We can now plug this into the Adam update equation by replacing √v̂ₜ + ϵ with uₜ to obtain the AdaMax update rule:

θₜ₊₁ = θₜ − ( η / uₜ ) m̂ₜ

Note that we do not need a bias correction for uₜ: since it is computed with a max operation, it does not get biased towards zero the way mₜ and vₜ do in Adam.

The general default values of learning rate η, β₁, β₂ are chosen to be around 0.002, 0.9 and 0.999 respectively.
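A minimal NumPy sketch of AdaMax with those defaults (the small `eps` guard against division by zero at the very first step is an extra assumption of this sketch, not part of the original formulation):

```python
import numpy as np

def adamax(grad_fn, theta, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """AdaMax sketch: the second-moment term is replaced by u_t, an
    infinity-norm based running maximum of past gradients."""
    m = np.zeros_like(theta)
    u = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))          # no bias correction needed for u
        theta = theta - (eta / (1 - beta1 ** t)) * m / (u + eps)
    return theta
```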

NOTE: A nice understanding of various types of norms can be found in following articles -

8. Nadam :

It is also called Nesterov-accelerated Adaptive Moment Estimation. Here we bring the same foresight into Adam that Nesterov momentum adds to SGD: we use the momentum term to look ahead before computing the gradient, so the step taken in the direction of the calculated gradient is more accurate.

Note that rather than applying the momentum step twice (once for updating the gradient gₜ and a second time for updating the parameters θₜ₊₁), we now apply the look-ahead momentum vector directly to update the current parameters.

In order to add the Nesterov term to Adam, we simply replace the previous momentum vector with the current momentum vector, i.e., we use mₜ to look ahead instead of mₜ₋₁.

We can now add Nesterov momentum just as before, by replacing the bias-corrected estimate of the momentum vector of the previous time step with the bias-corrected estimate of the current momentum vector:

θₜ₊₁ = θₜ − ( η / (√v̂ₜ + ϵ) ) ( β₁ m̂ₜ + (1 − β₁) gₜ / (1 − β₁ᵗ) )
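As a minimal NumPy sketch of that rule (same assumed `grad_fn(theta)` helper as before):

```python
import numpy as np

def nadam(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Nadam sketch: Adam with a Nesterov-style look-ahead, mixing the
    bias-corrected current momentum with the bias-corrected current gradient."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        g_hat = g / (1 - beta1 ** t)
        # look-ahead: use the current momentum estimate instead of the previous one
        theta = theta - eta * (beta1 * m_hat + (1 - beta1) * g_hat) / (np.sqrt(v_hat) + eps)
    return theta
```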

VISUALISATIONS OF DIFFERENT OPTIMISATIONS

1. Long Valley :

Algorithms without scaling based on gradient information really struggle to break symmetry here -

Source : https://imgur.com/a/Hqolp

SGD gets nowhere, and Nesterov accelerated gradient / momentum exhibit oscillations until they build up velocity in the optimisation direction. Algorithms that scale the step size based on the gradient quickly break symmetry and begin descent.

2. Beale’s function :

Due to the large initial gradient, velocity-based techniques shoot off and bounce around; AdaGrad almost goes unstable for the same reason.

Source : https://imgur.com/a/Hqolp

Algorithms that scale gradients/step sizes, like AdaDelta and RMSProp, proceed more like accelerated SGD and handle large gradients with more stability.

3. Saddle Point :

Source : https://imgur.com/a/Hqolp

NAG and momentum again explore around, almost taking a different path. AdaDelta, AdaGrad and RMSProp proceed like accelerated SGD.

4. Irregular Contour :

NOTE : Numbers in figure legend indicate learning rate, specific to each Optimizer -

Source : https://github.com/Jaewan-Yun/optimizer-visualization

NOTE:

Following is a link where you can get your hands dirty and visualise the path taken by different optimisers depending on their starting position:

https://bl.ocks.org/EmilienDupont/raw/aaf429be5705b219aaaf8d691e27ca87/

SOME MORE STRATEGIES FOR OPTIMISATIONS

1. Shuffling :

Shuffling of data helps in reducing variance and making sure that the model overfits less.

Generally we shuffle the data when it is sorted by its target. Here shuffling is required to make sure that the training/test/validation sets are representative of the overall distribution of the data.

In mini-batch gradient descent, we want to shuffle the data after each epoch because there is always a risk of creating batches that are not representative of the overall dataset, which would throw our estimate of the gradient off. Shuffling the data after each epoch ensures that we are not “stuck” with too many bad batches.

In regular stochastic gradient descent, where each batch has size 1, we still want to shuffle the data after each epoch to keep the learning general. Shuffling ensures that each data point creates an “independent” change in the model, without being biased by the same points that came before it.

2. Batch Normalisation :

To facilitate learning, we typically normalize the initial values of our parameters by initializing them with zero mean and unit variance.

As training progresses and we update parameters to different extents, we lose this normalization, which slows down training and amplifies changes as the network becomes deeper.

Batch normalization re-establishes these normalizations for every mini-batch and changes are backpropagated through the operation as well. By making normalization part of the model architecture, we are able to use higher learning rates and pay less attention to the initialization parameters.
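As a rough illustration of what this normalisation does for a single mini-batch, here is a minimal NumPy sketch of the batch-norm forward pass (the array shape and the learnable scale `gamma` and shift `beta` are illustrative assumptions):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalisation forward pass sketch for one mini-batch.
    x has shape (batch_size, features); gamma and beta are learnable."""
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # restore representational power
```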

3. Early Stopping :

The figure below shows how the value of the error function changes as the optimisation proceeds: the training error decreases continuously with the number of iterations, while the parameters, which start out small, grow larger as training goes on. The dev-set (cross-validation) error, on the other hand, decreases up to a certain number of iterations and then starts to increase. Therefore, to reduce overfitting, we generally stop training early, as indicated in the figure below.

Source: https://elitedatascience.com
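A minimal sketch of this stopping rule, assuming hypothetical helpers `train_epoch()` (runs one epoch of training) and `val_loss_fn()` (returns the current validation loss):

```python
def train_with_early_stopping(train_epoch, val_loss_fn, patience=5, max_epochs=200):
    """Early-stopping sketch: stop once validation loss has not improved
    for `patience` consecutive epochs."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = val_loss_fn()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation error started rising: stop
    return best_loss
```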

4. Gradient Noise:

Adding Gaussian-distributed random noise to each gradient update can make the model more robust and help the network better distinguish noise from signal. This random noise helps against poor initialisation, aids in training deep and complex networks, and gives the model more chances to escape shallow local minima and find better ones.

Gaussian Distribution — Source: https://www.sfu.ca/
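As a rough sketch of this idea, the noise can be added to each gradient before the parameter update; the annealed variance schedule σₜ² = η / (1 + t)^γ and its constants below are illustrative assumptions:

```python
import numpy as np

def noisy_gradient(g, t, eta=0.3, gamma_decay=0.55):
    """Gradient-noise sketch: add annealed Gaussian noise to the gradient g
    at step t, with variance eta / (1 + t) ** gamma_decay."""
    sigma2 = eta / (1.0 + t) ** gamma_decay
    return g + np.random.normal(0.0, np.sqrt(sigma2), size=g.shape)
```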

CONCLUSION :

In this article, we discussed the various types of gradient descent implementations, the challenges they present, the different optimisation algorithms built on them, and some good practices to adopt to optimise our models further.

Source : https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/

Stochastic gradient descent is powering nearly all deep learning applications today. Furthermore, outside of deep learning, SGD is the main way to train large linear models on very large datasets. With the exponential growth of interest in deep learning, which started in the academic world around 2006, SGD, thanks to its simple implementation and its efficiency on large-scale datasets, has become by far the most common method for training deep neural networks and other large-scale models.
