Optimizers: Diving Deep into Neural Networks

Rohan Garg
Vision and Language Group
13 min read · Jun 16, 2021

Waiting for a neural network to train is always frustrating, and even more so when the dataset is large.

Indeed, but simply changing the optimization algorithm can bring a significant improvement in both the training speed and the accuracy of a deep learning model.

Types of Optimization Algorithms

Several optimization algorithms exist, and they fall mainly into two types:

  1. First-order optimization algorithms → use first-order derivatives for optimization.
  2. Second-order optimization algorithms → use second-order derivatives for optimization.

Here we are going to discuss some of the first-order optimization algorithms, with a glimpse of second-order optimization.

First things first, let's break the topic down into pieces for better understanding.

Optimization

Mathematically, the process of minimizing or maximizing an expression is called optimization. In machine learning, we have a loss function that has to be minimized for the model to make good predictions.

The objective of an optimizer is to reach a minimum of the loss function. If the loss function is visualized in three dimensions, these are some of its possible shapes:

Convex and Non-Convex loss functions

In the convex loss function (left figure), starting at any random point we can only end up at the single global minimum. In the non-convex loss function (right figure), however, we may end up at a local minimum instead of the global minimum, depending on the initial position and the optimization algorithm used.

In this blog, we are going to cover the following optimization algorithms:

  1. Gradient descent
  2. Stochastic gradient descent
  3. Hessian optimization
  4. Stochastic gradient descent with momentum
  5. RMSprop
  6. ADAM

Gradient Descent & Stochastic Gradient Descent

When optimizing a curve, calculus lets us find the exact minimum analytically, but modern neural networks can have millions or billions of parameters. At that scale, computing the exact minimum is computationally infeasible, so we have to follow an iterative approach and descend towards the minimum.

In both gradient descent and stochastic gradient descent, we calculate the derivative of the loss function with respect to each model parameter and then iteratively update the parameters using these gradients.

But the main difference lies in the number of examples considered in one complete iteration. In gradient descent, you have to run through ALL the samples in your training set to perform a single parameter update, whereas in stochastic gradient descent you use ONLY ONE training sample per update. The update rule for stochastic gradient descent follows as
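In standard notation, with learning rate α and a single training example (x^{(i)}, y^{(i)}) drawn at random,

w_{t+1} = w_t - \alpha \, \nabla_w L(w_t; x^{(i)}, y^{(i)})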

If you use a SUBSET of the training set instead of a single sample, it is called minibatch stochastic gradient descent. Its update rule follows as
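With a minibatch B of examples, the update averages the per-example gradients:

w_{t+1} = w_t - \frac{\alpha}{|B|} \sum_{i \in B} \nabla_w L(w_t; x^{(i)}, y^{(i)})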

Usually, "stochastic gradient descent" refers to minibatch stochastic gradient descent. Stochastic gradient descent is much faster than gradient descent, especially on larger datasets, since we don't have to go through all the training samples for each update, but it is noisier. The noise here refers to unwanted variation added to the true gradients. In gradient descent the loss usually decreases at every iteration, whereas in stochastic gradient descent the loss may go up in some iterations and come down in others.

Contour plot of the loss function

Here we can see that the path taken by stochastic gradient descent towards the minimum of the contour plot is more zig-zag, whereas the path taken by gradient descent is smooth.
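To make the difference concrete, here is a minimal sketch of minibatch SGD in NumPy. The loss_grad(w, X_batch, y_batch) helper is hypothetical; it stands in for whatever function returns the gradient of your loss averaged over the batch.

```python
import numpy as np

def minibatch_sgd(w, X, y, loss_grad, lr=0.01, batch_size=32, epochs=10):
    """Minimal minibatch SGD sketch.

    `loss_grad(w, X_batch, y_batch)` is a hypothetical helper that returns
    dL/dw averaged over the given batch.
    """
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grad = loss_grad(w, X[batch], y[batch])  # gradient on this minibatch only
            w = w - lr * grad                        # the update rule shown above
    return w
```

With batch_size=1 this reduces to plain single-sample stochastic gradient descent, and with batch_size equal to the dataset size it becomes full gradient descent.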

Pathological Curvature

Pathological curvature

First things first, gradient descent and stochastic gradient descent both work particularly badly when they reach pathological curvature. In this type of curvature, there is one direction in which the gradient is very high but which does not lead to the minimum, and another direction in which the gradient is quite small but which does lead to the minimum, or optimum, position.

The left image shows how plain gradient descent moves through the pathological curvature, while the right image shows the ideal direction to follow. Gradient descent keeps bouncing between the steep walls of the curvature while making very little progress in the ideal direction.

Hessian Optimization

Hessian optimization uses the Hessian matrix, which contains all the second-order derivatives of the curve. As we saw with pathological curvature, the gradient by itself is a poor source of information about curvature; all it can tell us is the rate at which the value of the function changes with respect to each parameter.

Here all three curves have identical gradients but different second derivatives; it is cases like these where the need for the Hessian matrix becomes evident. Mathematically, for a function f of n parameters, the Hessian matrix is defined as
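H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}, \qquad i, j = 1, \dots, n

that is, the n × n matrix of all second-order partial derivatives of f.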

Now, let us look at the procedure of Hessian optimization, for which we first need to understand Newton's method.

For multivariate functions, Newton's method works by approximating the function with its Taylor series as a simple quadratic in the neighborhood of the current point. We then minimize this approximation to obtain a new point on the original curve, and the process is iterated.

In practice, the update step can be written as
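In standard notation,

x_{t+1} = x_t - H^{-1}(x_t) \, \nabla f(x_t)

where \nabla f(x_t) is the gradient and H(x_t) the Hessian of the function evaluated at the current point.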

Here, in Hessian optimization, second-order derivatives play a role in deciding the direction of descent, in contrast to gradient descent, where only first-order derivatives are considered.

However, the Hessian H has to be positive definite for the descent to take place in the optimal direction, and recent studies (Dauphin et al., 2014; Choromanska et al., 2014) have shown that saddle points dominate the optimization landscape of deep neural networks, implying that the Hessian is most likely indefinite.

The Levenberg-Marquardt modification is applied to Newton's method when the Hessian matrix is not positive definite and the resulting direction therefore does not correspond to the optimal descent direction. Here the update rule is modified as
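G = H + \mu I

x_{t+1} = x_t - \alpha \, G^{-1} \nabla f(x_t)

where \mu > 0 is a constant and I is the identity matrix.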

In the first step, we add the identity matrix multiplied by a constant mu to the original Hessian, so that G becomes the modified Hessian matrix.

The Hessian matrix is positive definite when all of its eigenvalues are positive. A sufficiently large value of the constant mu ensures that all eigenvalues of the modified Hessian are positive, so the modified Hessian stays positive definite and the direction of descent remains the optimal descent direction.

In the second step, the inverse of the modified Hessian G is used in the update, together with a step size alpha that prevents divergence.

The step that approximates the curve by a quadratic is the preconditioning step. Preconditioning can be thought of as a geometric solution to the problem of pathological curvature: it aims to locally transform the optimization landscape so that its curvature is equal in all directions.

On a two-dimensional curve, the preconditioner approximates the curve locally by a quadratic, transforming the optimization landscape to have equal curvature in all directions.

In practice, there are several computational drawbacks to using the inverse Hessian as a preconditioner. Neural networks typically have millions of parameters, so the Hessian has millions of rows and columns; storing it, let alone inverting it at every step, is prohibitively expensive, rendering Hessian optimization practically infeasible.

SGD with Momentum

SGD adds noise to the gradients

Our optimizer oscillates in the direction perpendicular to the optimal direction, which leads to slow convergence. SGD with momentum helps to reduce these perpendicular oscillations while still moving in the optimal direction.

SGD with momentum does not rely solely on the current gradient to calculate the direction of descent; it also carries a memory of all the past gradients.

The memory is carried by the function v, and each gradient is multiplied by a weight corresponding to its contribution towards the direction of descent. This function is an exponentially weighted average.
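In one common formulation (the exact scaling of the gradient term varies between implementations),

v_t = \beta \, v_{t-1} + (1 - \beta) \, \nabla_w L(w_t)

w_{t+1} = w_t - \alpha \, v_t

where β is the momentum coefficient and α the learning rate.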

Since beta lies in [0, 1], the contribution of older gradients decays over time, yet every past gradient still contributes to v, which therefore acts as a memory. SGD with momentum exploits this property of exponentially weighted averages.

Therefore, instead of using the raw gradient in the parameter update step, we use its exponentially averaged form, which takes previous gradients into account. This approach also mitigates the problem of pathological curvature to some extent.

Pathological curvature (front)
Pathological curvature (top)

These are the steps taken by gradient descent, each broken down into two components: one pointing along the ideal path of descent (t1, t2, t3) and one perpendicular to it (n1, n2, n3). When the momentum update is applied, the gradients of previous steps are summed. The components (n1, n2, n3) point in opposite directions, so when the gradients are summed they tend to cancel each other out, reducing the component of descent perpendicular to the ideal path; at the same time, (t1, t2, t3) point in the same direction and add up, giving a larger component along the ideal path. This reduces the noise of the gradient and leads to a smoother path towards the optimum.

To form an analogy with momentum in physics: beta acts as friction for the original path of gradient descent, since the larger beta is, the larger the contribution from previous gradients and the larger the deviation from the original path. The function v acts as velocity and the gradient as acceleration: the larger the gradient, the quicker the direction changes, and the larger v is, the larger the steps taken.

Usually, the coefficient of momentum (beta) is taken to be 0.9.
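Here is a minimal sketch of SGD with momentum, reusing the hypothetical loss_grad helper from the minibatch SGD sketch above; the only change is that the update now uses the exponentially weighted average v instead of the raw gradient.

```python
import numpy as np

def sgd_momentum(w, X, y, loss_grad, lr=0.01, beta=0.9, batch_size=32, epochs=10):
    """Minimal SGD-with-momentum sketch; `loss_grad` is the same hypothetical helper."""
    v = np.zeros_like(w)                             # memory of past gradients
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            grad = loss_grad(w, X[batch], y[batch])
            v = beta * v + (1 - beta) * grad         # exponentially weighted average
            w = w - lr * v                           # update uses v, not the raw gradient
    return w
```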

Result

→ It solves the problem of pathological curvature, which cannot be solved by SGD alone.

RMSprop algorithm

In SGD with momentum, the steps build up speed in the direction of decreasing gradients, but when the optimizer reaches the optimal minimum it generally overshoots it and starts oscillating around it.

RMSprop dampens the gradients in the non-ideal descent direction and accelerates movement in the ideal direction, and it does so by adapting the update for each parameter separately.
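In its usual form, with decay rate γ, learning rate α, and a small constant ε, applied element-wise to each parameter:

v_t = \gamma \, v_{t-1} + (1 - \gamma) \, \big(\nabla_w L(w_t)\big)^2

w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t} + \epsilon} \, \nabla_w L(w_t)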

There is a lot happening in these equations, so let's break them down.

The first equation looks much like the exponentially weighted average used in SGD with momentum, but the key difference is that the gradient is squared rather than used directly.

The second equation is the parameter update step, but the gradient is modified by dividing it by the square root of the exponentially weighted average. Moreover, both equations are applied to every parameter separately.

In the first equation, squaring the gradients squashes small gradients to even smaller values and pushes large gradients to even larger values. It also means that opposite gradients no longer cancel each other out; they keep adding up, just like the gradients along the ideal direction of descent. As in the momentum equation, the function v acts as a memory, storing information about the previous squared gradients weighted by their respective coefficients.

In the second equation, the gradient of the loss function is divided by the square root of the exponentially weighted average v. If, for a particular parameter w, the gradient is large, the corresponding exponentially weighted average will also be large, so the gradient term in the second equation becomes small. Similarly, if the gradient is small, its exponentially weighted average will be small, so the gradient term in the second equation becomes large. The gradient term is therefore amplified for small gradients and dampened for large ones.

Here we can see that for small gradients the second equation automatically amplifies the step, while for large gradients it dampens the step to a smaller value. This helps us avoid bouncing between the ridges and move towards the minimum.

Now consider the pathological curvature,

Pathological curvature (top)

Here we can see that the gradients along the ideal direction of descent are small while those perpendicular to it are large. With RMSprop, the steps get reduced in the perpendicular direction and enlarged along the ideal direction. It also prevents the steps along the ideal direction from growing so large that they overshoot the optimum point.

The RMSprop update is applied to each parameter individually, so it has much better control over every dimension of the loss function and can modify any individual gradient without affecting the others.

The hyperparameter gamma is generally chosen between 0.9 and 1, although it may need tuning. The hyperparameter epsilon in the second equation is normally set to around 1e-8; it prevents the denominator from going to zero.
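A minimal per-step sketch of these two equations in NumPy; the gradient is assumed to be computed elsewhere and passed in, and the default values follow the ones mentioned above.

```python
import numpy as np

def rmsprop_step(w, v, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update. `v` is the running average of squared gradients;
    all operations are element-wise, i.e. per parameter."""
    v = gamma * v + (1 - gamma) * grad ** 2          # equation 1: squared-gradient memory
    w = w - lr * grad / (np.sqrt(v) + eps)           # equation 2: scaled parameter update
    return w, v

# Usage inside a training loop (v starts as np.zeros_like(w)):
#   w, v = rmsprop_step(w, v, loss_grad(w, X_batch, y_batch))
```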

Result

→ It solves the problem of overshooting the optimal minimum that arises in SGD with momentum, and it also addresses the problem of pathological curvature.

ADAM Optimizer

As we have seen, SGD with momentum and RMSprop both address the problem of pathological curvature, each with a different approach.

The Adaptive Moment Estimation algorithm, or Adam optimizer, combines both of them into a single algorithm.
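Written out (omitting the bias-correction terms of the original paper for brevity):

m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, \nabla_w L(w_t)

v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, \big(\nabla_w L(w_t)\big)^2

w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t} + \epsilon} \, m_t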

Here we can see that the functions m and v store exponentially weighted averages of the gradient and of the squared gradient, respectively. In the third equation, the two algorithms are combined: the raw gradient is replaced by the momentum term m, which is then divided by the square root of v.

When the two are applied together, the momentum part tries to build up speed and take larger steps towards the optimum, while the RMSprop part tries to restrain the motion by scaling down large gradients; together they give an interesting result. The effective magnitude of the steps taken in parameter space at each timestep is approximately bounded by the stepsize setting α, whereas there is no lower bound on how small the steps can become, which is favorable for the curvature near the optimal minimum.
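A minimal per-step sketch, again omitting bias correction and using the default hyperparameters suggested in the Adam paper (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):

```python
import numpy as np

def adam_step(w, m, v, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update (bias correction omitted for brevity).
    `m` is the momentum memory, `v` the squared-gradient memory."""
    m = beta1 * m + (1 - beta1) * grad               # momentum part
    v = beta2 * v + (1 - beta2) * grad ** 2          # RMSprop part
    w = w - lr * m / (np.sqrt(v) + eps)              # combined update
    return w, m, v

# Usage inside a training loop (m and v start as np.zeros_like(w)):
#   w, m, v = adam_step(w, m, v, loss_grad(w, X_batch, y_batch))
```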

Result

→ It solves the problem of overshooting the optimal minimum that arises in SGD with momentum, and it also provides an upper bound on the effective magnitude of the steps taken in parameter space at each timestep.

Conclusion

We have covered many optimization techniques, but RMSprop and Adam dominate most deep learning models in practice. In the visualizations, you can easily see that momentum initially performs poorly on pathological curvature, but over time it builds up speed in the desired direction and moves very fast, then overshoots the optimal minimum, whereas RMSprop keeps its speed under control throughout.

References

Visualizations: https://imgur.com/a/Hqolp#NKsFHJb, https://www.math3d.org/, https://youtu.be/IHZwWFHWa-w, https://www.desmos.com/calculator, https://rstudio-conf-2020.github.io/dl-keras-tf/08-LSTMs.html#19

Research papers: https://arxiv.org/pdf/1502.04390v2.pdf, https://arxiv.org/pdf/1412.6980.pdf, https://arxiv.org/pdf/1502.04390v1.pdf, https://arxiv.org/pdf/1611.07476.pdf, https://arxiv.org/pdf/1609.04747.pdf, http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf

Blogs: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/, https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a, https://www.cs.ccu.edu.tw/~wtchu/courses/2014s_OPT/Lectures/Chapter%209%20Newton%27s%20Method.pdf, http://people.duke.edu/~kh269/teaching/b553/newtons_method.pdf

Book: https://www.deeplearningbook.org/
