Optimization Techniques in Deep Learning

sumanth donapati · Published in CodeX · Aug 31, 2021

In my previous post, I discussed how gradient descent and its variations work. However, those optimization techniques might not work well in all cases. There are a few reasons why SGD or mini-batch SGD isn't the best choice every time.

  1. These optimization techniques might not work well for non-convex objective functions because of saddle points, which we encounter a lot in deep learning.
[Figure: difference between convex and non-convex functions]

2. All features do not have the same frequency, so the same learning rate cannot be applied to all the weights.

3. Large learning rates cannot be used because they cause overshooting or divergence from the path.

In this blog post, we will discuss a few other popular optimization techniques that can solve the above problems.

  1. SGD with momentum
  2. NAG
  3. AdaGrad
  4. RMSProp
  5. AdaDelta
  6. Adam

1. SGD with Momentum

We know that SGD and mini-batch SGD don't use the whole dataset for each update. Because of this limited data, they can only approximate the actual path to the optimal point. The direction toward the destination is clear, but the journey there goes through many ups and downs. 🙁

Momentum allows us to damp these oscillations by taking an exponentially weighted average of the gradients:

[Figure: SGD without momentum]
[Figure: SGD with momentum]

With momentum, the optimizer no longer gets stuck at local minima as easily, and convergence also speeds up.

To implement momentum, we only need to change the update function.

W(t) = W(t-1) − η dL/dW(t-1)   (update function in SGD)

Check my previous blog to know more about this update function.

There are two common sets of equations for this exponential averaging. It is better to know both sets to avoid confusion.

Exponential weighted averaging:

V(t) = γ V(t-1) + (1 − γ) dL/dW(t-1) … (1)
W(t) = W(t-1) − η V(t) … (2)

V(t) = γ V(t-1) + η dL/dW(t-1) … (3)
W(t) = W(t-1) − V(t) … (4)

where γ is the momentum factor (0 < γ < 1) and η is the learning rate.

Both sets of equations give the same result, just with different γ and η values. I prefer the second set of equations (i.e., eq (3) & eq (4)) for now.

By substituting eq (3) in eq (4), we get:

W(t) = W(t-1) − γ V(t-1) − η dL/dW(t-1) … (5)   [Source]

In the above equation, i.e., eq (5), γV(t-1) is the momentum step (green line) and ηdL/dW(t-1) is the gradient step (red line). The actual step (blue line) is the sum of the momentum step and the gradient step. We know that the momentum is the exponential average of the gradients. As the number of iterations increases, the momentum step grows (the green line gets longer), and so does the actual step (blue line). This is how it converges faster with every iteration.

Let's see how the weights get updated with momentum over a few iterations.

[Image by Author]
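To make this concrete, here is a hand-worked unrolling of the first few updates using eq (3) & (4), assuming V(0) = 0 and writing g(t) for dL/dW(t):

V(1) = η g(0), so W(1) = W(0) − V(1)
V(2) = γ η g(0) + η g(1), so W(2) = W(1) − V(2)
V(3) = γ² η g(0) + γ η g(1) + η g(2), so W(3) = W(2) − V(3)

Each older gradient gets discounted by another factor of γ, so gradients that keep pointing in the same direction add up (the step grows), while oscillating components largely cancel out.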

This is how SGD with momentum works and how the weights get updated.
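As a minimal sketch (not code from the original post), the update rule in eq (3) & (4) can be implemented as follows, assuming a hypothetical `grad_fn(w)` that returns dL/dW at the given weights:

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=0.01, gamma=0.9, n_iters=100):
    """SGD with momentum: V(t) = gamma*V(t-1) + lr*grad, W(t) = W(t-1) - V(t)."""
    v = np.zeros_like(w)              # V(0) = 0
    for _ in range(n_iters):
        grad = grad_fn(w)             # dL/dW(t-1)
        v = gamma * v + lr * grad     # eq (3): exponential averaging of the gradients
        w = w - v                     # eq (4): combined momentum + gradient step
    return w

# Example: minimizing the toy loss f(w) = w**2, whose gradient is 2*w
w_min = sgd_momentum(np.array([5.0]), lambda w: 2 * w)
```

For the same learning rate, this should reach the minimum of the toy quadratic above in noticeably fewer iterations than plain SGD, because successive gradients keep reinforcing the momentum term.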

2. Nesterov Accelerated Gradient (NAG)

Nesterov Accelerated Gradient is like SGD with momentum, with a minor difference. In SGD with momentum, we compute the gradient term (η dL/dW(t-1)) and the momentum term (γ V(t-1)) and sum them to move to the next point. In Nesterov Accelerated Gradient, we first compute the momentum term and move in its direction, then compute the gradient at that point (dL/dW′), which is not the same as the gradient at the current point (dL/dW(t-1)), and then move in that direction, eventually reaching the actual next point, i.e., W(t).

In simple terms, we first move from W(t-1) to W′ and then to W(t).

[Source]

From SGD with momentum, we know that:

W(t) = W(t-1) − γ V(t-1) − η dL/dW(t-1) … (5)

Now, to move only by the momentum step, we drop the gradient term from the above equation.

So, without the gradient term, we don't move to W(t); instead, we move to W′:

W′ = W(t-1) − γ V(t-1)

Notice that there is no gradient step in it.

Now we compute the gradient at W′ to reach W(t):

W(t) = W′ − η dL/dW′   (the "lookahead" gradient step)

So, the final equation looks like:

W(t) = W(t-1) − γ V(t-1) − η dL/dW′

(with the momentum itself updated as V(t) = γ V(t-1) + η dL/dW′ for the next iteration).

You might wonder: both methods take a momentum step and a gradient step to reach the next point W(t), so what is the point of using Nesterov Accelerated Gradient instead of SGD with momentum?

The gradient that we calculate at W′ is known as the "lookahead" gradient, which lets us peek at where we are heading before committing to the step.

[Source]

In Fig (a), with every update the loss decreases and the optimizer travels farther than in the previous step, which is great. But at the 4th update it overshoots, and updates 5 and 6 are needed to come back to the minimum-loss point. Up to the first three updates, SGD with momentum and NAG behave the same. At update 4, however, you can see the importance of NAG. Recall that NAG takes two partial steps to reach the next point. In the 4th update, the first partial step is 4a, i.e., moving in the momentum's direction, and at 4b we compute the lookahead gradient to see where we are heading and correct the move. So, at the end of update 4, the overshoot in NAG is smaller than in SGD with momentum, and with just one more update, the 5th, it reaches the minimum. In total it takes 5 updates for NAG versus 6 updates for SGD with momentum to reach the same destination. This is how NAG differs from SGD with momentum.
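Here is a minimal sketch of the NAG update described above (again assuming a hypothetical `grad_fn(w)` that returns dL/dW at any point w):

```python
import numpy as np

def nag(w, grad_fn, lr=0.01, gamma=0.9, n_iters=100):
    """Nesterov Accelerated Gradient: take the momentum step first,
    then correct it with the gradient at the lookahead point W'."""
    v = np.zeros_like(w)
    for _ in range(n_iters):
        w_lookahead = w - gamma * v      # W' = W(t-1) - gamma*V(t-1)  (momentum step only)
        grad = grad_fn(w_lookahead)      # "lookahead" gradient dL/dW'
        v = gamma * v + lr * grad        # V(t) = gamma*V(t-1) + lr*dL/dW'
        w = w - v                        # W(t) = W(t-1) - V(t) = W' - lr*dL/dW'
    return w
```

The only change from `sgd_momentum` is where the gradient is evaluated: at the lookahead point instead of the current point.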

3. Adaptive Gradient (AdaGrad)

All the optimization techniques we have discussed so far use a constant learning rate. The core idea of the adaptive gradient method is to have a different learning rate for each weight in every iteration. But why is it so important to have different learning rates?

Basically, features can be sparse (containing few non-zero elements) or dense (containing few zero elements). You can't apply the same learning rate to the weights of sparse and dense features because they don't converge at the same rate.

The update function of SGD:

W(t) = W(t-1) − η dL/dW(t-1)

Here, η is the learning rate, which is the same for all the weights in all the iterations.

Update function using AdaGrad:

W(t) = W(t-1) − η′ dL/dW(t-1), where η′ = η / √(α(t-1) + ε)

Here η’ is the learning rate, which is different for all the weights in every iteration.

ε is a small positive number to avoid a division-by-zero error.

α(t-1) is the sum of the squares of all the previous gradients.

Here, η is the constant learning rate and η′ is the adaptive learning rate, which changes using the information from all the previous gradients.

Let’s see an example of how each weight has a different learning rate in an iteration.

Consider a neural network with two input nodes and an output node.

Here, α is not the same for each weight, and its value changes every iteration, so each weight gets a different learning rate in every iteration. But when we solve the same example using SGD, this is how it looks:

η is the constant learning rate, which is the same for all the weights till the last iteration.
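As a rough sketch of the idea (not the exact worked example above), here is an AdaGrad update over a weight vector, again assuming a hypothetical `grad_fn(w)`; each weight accumulates its own α and therefore gets its own effective learning rate:

```python
import numpy as np

def adagrad(w, grad_fn, lr=0.01, eps=1e-8, n_iters=100):
    """AdaGrad: per-weight adaptive learning rate eta' = lr / sqrt(alpha + eps)."""
    alpha = np.zeros_like(w)                     # running sum of squared gradients, per weight
    for _ in range(n_iters):
        grad = grad_fn(w)                        # dL/dW(t-1), one entry per weight
        alpha += grad ** 2                       # alpha(t) = alpha(t-1) + grad^2
        adaptive_lr = lr / np.sqrt(alpha + eps)  # eta', different for every weight
        w = w - adaptive_lr * grad
    return w
```

Weights tied to sparse features accumulate a small α and keep a relatively large effective learning rate, while frequently updated weights accumulate a large α and take smaller steps.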

So far, everything about AdaGrad looks great, but there is one major disadvantage that cannot be ignored:

As the number of iterations (t) increases, the value of α increases, and so η′ decreases.

Since η′ decreases with every iteration, convergence becomes slower and slower, and at some point η′ becomes almost zero, so learning effectively stops. This is the problem of the decaying (vanishing) learning rate. AdaDelta and RMSProp are other optimization techniques that attempt to solve this problem efficiently.

4. RMSProp

In AdaGrad, to calculate α, we took the running sum of the squared gradients. In RMSProp, we instead take an exponential average of the squared gradients.

By performing this exponential averaging, we control the growth of the denominator term in η′, which solves the decaying learning rate problem.
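A minimal sketch of that change (the decay factor, written here as `beta` ≈ 0.9, is my notation; the post's figures may use a different symbol):

```python
import numpy as np

def rmsprop(w, grad_fn, lr=0.01, beta=0.9, eps=1e-8, n_iters=100):
    """RMSProp: exponential average of squared gradients instead of a running sum."""
    alpha = np.zeros_like(w)
    for _ in range(n_iters):
        grad = grad_fn(w)
        alpha = beta * alpha + (1 - beta) * grad ** 2   # EWA of grad^2 keeps alpha bounded
        w = w - (lr / np.sqrt(alpha + eps)) * grad      # eta' no longer shrinks toward zero
    return w
```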

5. AdaDelta

AdaDelta does the same thing RMSProp does in the denominator part of η′. Additionally, in AdaDelta, we replace the default learning rate η with an exponential average of the deltas.

The delta (D) is the change in the weight at each update, D(t) = W(t) − W(t-1), and we keep an exponential average of D² in the same way we do for the squared gradients.

These equations can be confusing at first glance, so please take your time to process each line.
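Putting the pieces together, here is a sketch of the AdaDelta update following the original paper's formulation (the decay factor `rho` and the hypothetical `grad_fn(w)` are my notation):

```python
import numpy as np

def adadelta(w, grad_fn, rho=0.95, eps=1e-6, n_iters=100):
    """AdaDelta: RMSProp-style denominator, but the fixed learning rate eta
    is replaced by the RMS of the previous parameter updates (the deltas)."""
    avg_sq_grad = np.zeros_like(w)    # exponential average of grad^2
    avg_sq_delta = np.zeros_like(w)   # exponential average of delta^2
    for _ in range(n_iters):
        grad = grad_fn(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
        # RMS of past deltas plays the role of eta; RMS of gradients is the denominator
        delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
        w = w + delta
    return w
```

Notice that no learning rate appears anywhere: the step size is built entirely from the two exponential averages.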

6. Adaptive moment estimation (Adam)

Momentum and Nesterov Accelerated Gradient keep a constant learning rate and focus only on modifying the gradient part. AdaGrad, AdaDelta, and RMSProp are all about adapting the learning rate. Adam is a combination of momentum and RMSProp, modifying both the gradient and the learning rate to get better results.

In statistics, the mean is often called the first moment and the variance the second moment.

In RMSProp, we took the exponentially weighted average (EWA) of gₜ² to get the learning rate η′, and in SGD with momentum, we took the EWA of gₜ to get the momentum.

The EWAs of gₜ and gₜ² are estimates of the mean (first moment) and the uncentered variance (second moment) of the gradient, and we represent them as mₜ and vₜ.

But when we initialize mₜ and vₜ with zeros, they are biased towards zero during the early iterations. So, we apply a bias correction to these first and second moment estimates. (I highly recommend watching Andrew Ng's bias correction video to understand this point better.)

And the update function looks similar to RMSProp's:

W(t) = W(t-1) − η m̂ₜ / (√v̂ₜ + ε), where m̂ₜ and v̂ₜ are the bias-corrected moment estimates.

Because of these combined modifications, Adam works exceptionally well compared to other optimizers in most cases.
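Here is a compact sketch of the full Adam update with the usual default hyperparameters (β₁ = 0.9, β₂ = 0.999), again using a hypothetical `grad_fn(w)`:

```python
import numpy as np

def adam(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=100):
    """Adam: EWA of the gradient (first moment) and of the squared gradient
    (second moment), with bias correction, combined in one update."""
    m = np.zeros_like(w)   # first moment estimate, m_t
    v = np.zeros_like(w)   # second moment estimate, v_t
    for t in range(1, n_iters + 1):
        grad = grad_fn(w)
        m = beta1 * m + (1 - beta1) * grad           # EWA of g_t   (momentum part)
        v = beta2 * v + (1 - beta2) * grad ** 2      # EWA of g_t^2 (RMSProp part)
        m_hat = m / (1 - beta1 ** t)                 # bias correction of the first moment
        v_hat = v / (1 - beta2 ** t)                 # bias correction of the second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # RMSProp-style step with corrected moments
    return w
```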

That's it, we have come to the end. Thanks for reading all the way through, and let me know if you have any suggestions or feedback in the comments section. See you in my next blog! 😊

References:

  1. Sebastian Ruder's blog
  2. CS231n lecture notes
  3. Reference for AdaGrad
