From SGD to Adam

Gaurav Singh
Blueqat (blueqat Inc. / former MDR Inc.)
May 24, 2020

Gradient Descent is the most widely used algorithm for optimizing the parameters of Neural Networks and many other Machine Learning algorithms. Most practitioners reach for gradient descent as the default way to optimize their ML models.

Gradient Descent is a way to minimize a loss function J(θ) parametrized by the model parameters θ ∊ R^n, where n is the number of parameters in the hypothesis function. We optimize the parameters by moving opposite to the direction of the gradient of the loss function J(θ) with respect to the parameters. Let’s picture the loss function as a 3-D hilly terrain with peaks and valleys, like in the figure below:

[Figure: a 3-D loss surface visualized as hilly terrain with peaks and valleys]

We follow the slope of this terrain downhill and try to find the deepest valley. The valleys represent the local and global minima, the points where the loss function, i.e. the squared difference between the predicted and the actual values, is at its minimum.
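For reference, the basic (batch) gradient descent update implied here is the standard one:

θ = θ − α·∇J(θ), where α is the learning rate that controls the step size.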

There are three basic kinds of variants of gradient descent depending on the amount of data we use for the gradient descent:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini-Batch Gradient Descent

We will be focusing on SGD (Stochastic Gradient Descent) and work our way towards one of the most popular gradient descent optimization algorithms, Adaptive Moment Estimation (Adam).

Stochastic Gradient Descent (SGD):

In stochastic gradient descent we update the parameters for each training example x(i) and label y(i) individually, instead of computing the gradient of the cost function with respect to the parameters over the whole training set.

θ = θ − α·∇J(θ; x(i), y(i))

SGD is generally noisier than typical (batch) Gradient Descent: because of the randomness in its descent, it usually takes a higher number of iterations to reach the minimum. Even so, each iteration is much cheaper to compute, so overall SGD is computationally far less expensive than batch Gradient Descent. SGD also avoids the redundancy of batch gradient descent, since it performs one update at a time instead of recomputing gradients over many similar examples. Its frequent updates have high variance, which is why the loss vs. iterations curve fluctuates heavily, as shown below:

[Figure: fluctuating loss vs. iterations curve for SGD. Source: https://en.wikipedia.org/wiki/Stochastic_gradient_descent]

In batch gradient descent we slowly and smoothly descend towards the minimum we are looking for, while in SGD the fluctuations above make us jump around between local minima, which ultimately makes it harder to converge to the exact minimum. This can be improved by slowly decreasing the learning rate over time, which recovers essentially the same convergence behaviour as batch gradient descent.
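One common, purely illustrative way to do this is to shrink the learning rate with the iteration count, e.g. αₜ = α₀ / (1 + k·t), where α₀ is the initial learning rate and k is a small decay constant.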

When coding SGD, the one thing we have to keep in mind is to reshuffle the training data at every iteration. In the pseudocode below, the data variable holds the training data set.

for i in range(iterations):
    np.random.shuffle(data)                  # reshuffle the training data every epoch
    for example in data:
        # evaluate_gradient computes the gradient of the loss for a single training example
        gradient = evaluate_gradient(loss_func, example, parameters)
        # a is the learning rate; parameters are the model weights being updated
        parameters = parameters - a * gradient
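Since evaluate_gradient, data, and parameters above are placeholders, here is a minimal self-contained sketch of the same loop, assuming a toy linear-regression model with squared-error loss on synthetic data:

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 2)                  # 100 examples, 2 features
y = X @ np.array([2.0, -3.0]) + 1.0          # targets generated from known weights and bias

w = np.zeros(2)                              # weights to learn
b = 0.0                                      # bias to learn
a = 0.01                                     # learning rate

for epoch in range(50):
    for i in np.random.permutation(len(X)):  # reshuffle at every epoch
        error = X[i] @ w + b - y[i]          # prediction error for one example
        w = w - a * 2 * error * X[i]         # gradient of the squared error w.r.t. w
        b = b - a * 2 * error                # gradient of the squared error w.r.t. b

print(w, b)                                  # should end up close to [2, -3] and 1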

Although SGD and the other gradient descent variants are a good bet, we still face certain challenges. Say we have a sparse data set where some features occur frequently and others rarely; choosing the same learning rate for all the parameters is then not a good idea. We would want to make larger updates for the rarely occurring features than for the frequently occurring ones. Another challenge is choosing a proper learning rate: with a very large learning rate we oscillate around the minimum, and with a very small one convergence becomes really slow. In the neural networks domain, one additional issue with highly non-convex functions is getting trapped in one of the numerous local minima.

To deal with these challenges, a number of gradient descent optimization algorithms have been developed. Among these, Adaptive Moment Estimation (Adam) is currently one of the most popular choices. Since Adam builds on ideas from Adagrad and RMSprop, we will look at these optimizers in sequence.

Adagrad:

The Adagrad optimizer helps solve the sparse-data challenge we saw above by adapting the learning rate to the parameters: it uses a lower learning rate for parameters associated with frequently occurring features and makes larger updates for parameters associated with infrequent features. One of its big applications was to improve the robustness of SGD in the large-scale neural networks used at Google.

The difference from plain SGD is that in SGD we have one common learning rate for all the parameters and update all of them at once, whereas Adagrad uses a different learning rate for each parameter θᵢ at every timestep. The procedure is therefore to write a per-parameter update and then vectorize it.

gᵢ = ∇J(θᵢ), where gᵢ is the gradient with respect to the parameter θᵢ.

Thus the SGD update comes out to be:

θᵢ = θᵢ − α·gᵢ, where all the parameters are updated with the same learning rate at each timestep.

Adagrad instead modifies the learning rate α for every parameter θᵢ at every timestep, based on all the past gradients that have been computed for that θᵢ.
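Written out (reconstructed here from the standard Adagrad formulation), the per-parameter update is:

θᵢ = θᵢ − (α / √(Gᵢᵢ + ϵ)) · gᵢ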

Here G is a diagonal matrix whose (i, i) entry Gᵢᵢ is the sum of the squares of the gradients w.r.t. θᵢ up to that timestep, and ϵ is a smoothing constant (typically around 10^(-8)) that avoids division by zero.

Next we can vectorize the above equation with a matrix-vector product between G and g, since G is a diagonal matrix holding the sums of squares of past gradients along its diagonal:
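θ = θ − (α / √(G + ϵ)) ⊙ g

where ⊙ denotes element-wise multiplication between the adapted learning rates and the gradient vector g.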

Thus we have a method that automatically tunes the learning rate. The disadvantage is that, since every entry added to the diagonal of G is a squared (hence positive) term, the accumulated sums keep growing during training, and the effective learning rate eventually becomes infinitesimally small.

We therefore turn to another method, RMSprop, to solve Adagrad’s diminishing learning rate problem.

RMSprop:

In order to fix the monotonically decreasing learning rate of the Adagrad algorithm, we take a decaying average of the past squared gradients in a recursive fashion. The running average of the squared gradients then depends only on the previous average and the current gradient; we denote this running average by E[g²]ₜ at timestep t.
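In equation form (following the standard RMSprop formulation, with decay factor γ):

E[g²]ₜ = γ·E[g²]ₜ₋₁ + (1 − γ)·gₜ²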

As we saw, the parameter update term in Adagrad is of the form:
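(α / √(G + ϵ)) ⊙ g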

We now replace the G term with the running average E[g²]ₜ, whose square root is nothing but the root mean square (RMS) of the recent gradients.

For the learning to work well, we divide the gradient by the square root of this running average. A good default value for the decay factor γ is 0.9, so (1 − γ) becomes 0.1.

Thus the parameter update becomes:
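θₜ₊₁ = θₜ − (α / √(E[g²]ₜ + ϵ)) · gₜ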

Adam (Adaptive Moment Estimation):

Adam differs from the above adaptive learning-rate algorithms in that it combines the ideas of RMSprop and Momentum: it keeps both an exponentially decaying average of the past squared gradients and an exponentially decaying average of the past gradients. We thus compute two running averages at every timestep (as defined in the Adam paper):
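mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ²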

where m is the decaying average of past gradients and v is the decaying average of past squared gradients.

Initially m and v are initialized as zero vectors, and since both betas are close to 1 (so the averages move only slowly towards the new gradients), these estimates are biased towards 0 during the first timesteps.

We counteract these biases by computing bias-corrected estimates:
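m̂ₜ = mₜ / (1 − β₁^t)
v̂ₜ = vₜ / (1 − β₂^t)

For example, at t = 1 we have m₁ = (1 − β₁)·g₁ = 0.1·g₁, which underestimates the gradient by a factor of 10; dividing by (1 − β₁^1) = 0.1 recovers g₁.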

And finally we update the parameters, similarly to the previous optimization technique:
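θₜ₊₁ = θₜ − (α / (√v̂ₜ + ϵ)) · m̂ₜ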

According to the authors of Adam, good default values are 0.9 for β₁, 0.999 for β₂, and 10^(-8) for ϵ.

Let’s code the Adam optimizer in Python. We will use the function f(x) = x³ − 3x² + 4x as the objective (matching the code below), take the above values for β₁, β₂ and ϵ, use a learning rate of 0.01, and initialize θ = 0, m_t = 0, v_t = 0.

import math

a = 0.01        # learning rate
b_1 = 0.9       # beta_1: decay rate for the first moment estimate
b_2 = 0.999     # beta_2: decay rate for the second moment estimate
e = 1e-8        # epsilon: smoothing term to avoid division by zero

def function(x):
    return x * x * x - 3 * x * x + 4 * x

def gradient(x):
    # derivative of x^3 - 3x^2 + 4x
    return 3 * x * x - 6 * x + 4

theta = 0.0     # parameter to optimize
m_t = 0.0       # first moment estimate (decaying average of past gradients)
v_t = 0.0       # second moment estimate (decaying average of past squared gradients)
t = 0           # timestep

for i in range(1000):
    t = t + 1
    grad = gradient(theta)
    m_t = b_1 * m_t + (1 - b_1) * grad
    v_t = b_2 * v_t + (1 - b_2) * grad * grad
    m_cap_t = m_t / (1 - b_1 ** t)              # bias-corrected first moment
    v_cap_t = v_t / (1 - b_2 ** t)              # bias-corrected second moment
    theta = theta - (a * m_cap_t) / (math.sqrt(v_cap_t) + e)

print(theta)

Comparison of Adam to other optimization algorithms:

[Figure: comparison of Adam with other optimization algorithms]

Thus we looked into SGD (Stochastic Gradient Descent) and its improvement through the adaptive learning-rate algorithms Adagrad, RMSprop, and finally Adam.
