Understanding Gradient Descent and Adam Optimization

An intuitive view of Adam and why it is more commonly used in Deep Learning

Tamoghno Bhattacharya
Towards Data Science

--

Source: https://pixabay.com/illustrations/artificial-intelligence-brain-think-3382507/

The extent to which artificial intelligence has influenced our daily lives over the past decade is easy to underestimate. From spam filtering and news clustering, to computer vision applications like fingerprint sensors, to natural language processing problems like handwriting and speech recognition, AI and data science play an enormous role in our day-to-day lives. However, with the exponential increase in the amount of data our algorithms must deal with, it is essential to develop algorithms that can keep pace with this rise in complexity. One such algorithm, which has caused a notable change in the industry, is the Adam optimization procedure. But before we delve into it, let us first look at gradient descent and where it falls short.

Source: https://www.researchgate.net/figure/Sphere-function-D-2_fig8_275069197

If you aren't aware of what a cost function is, I would recommend you go through this blog first, which serves as a great introduction to the topic: https://medium.com/@lachlanmiller_52885/understanding-and-calculating-the-cost-function-for-linear-regression-39b8a3519fcb

Gradient Descent

Suppose we have a convex cost function of 2 input variables, as shown above, and our goal is to minimize its value: to find the values of the parameters (x, y) for which f(x, y) is minimum. The gradient descent algorithm starts at a specific point on the surface, uses the negative gradient to find the direction of steepest descent, takes a small step in that direction, and keeps iterating until the value converges.

https://datascience-enthusiast.com/DL/Optimization_methods.html

I personally find the above analogy for gradient descent very cool: a person starting from the top of a hill and climbing down by the path that lets him decrease his altitude the quickest.

Source: https://www.quora.com/What-turns-gradient-descent-into-gradient-ascent-in-deep-learning

The formal definition of gradient descent is given alongside: we keep performing the update θ = θ - α∇f(θ), where α is the learning rate, until convergence is reached. We can check convergence easily by testing whether the difference between f(Xi+1) and f(Xi) is less than some small tolerance, say 0.0001. If so, we say that gradient descent has converged at a local minimum of f.
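As a concrete illustration, here is a minimal NumPy sketch of this loop, assuming the sphere function f(x, y) = x^2 + y^2 from the figure above (the starting point, learning rate, and tolerance are just example values):

import numpy as np

def f(theta):
    # Convex sphere function: f(x, y) = x^2 + y^2
    return np.sum(theta ** 2)

def grad_f(theta):
    # Gradient of the sphere function: [2x, 2y]
    return 2 * theta

theta = np.array([3.0, 4.0])  # starting point on the surface
alpha = 0.1                   # learning rate
prev_cost = f(theta)
while True:
    theta = theta - alpha * grad_f(theta)  # step along the negative gradient
    cost = f(theta)
    if abs(prev_cost - cost) < 0.0001:     # convergence check from the text
        break
    prev_cost = cost
print(theta)  # close to the minimum at (0, 0)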

If you cannot quite grasp the concept of a gradient, or are interested in a more in-depth treatment of cost functions and gradient descent, I strongly recommend the following video from my favorite YouTube channel, 3Blue1Brown -

Where Gradient Descent Falls Short

To perform a single step of gradient descent, we need to iterate over all training examples to find the gradient at a particular point. This is termed batch gradient descent and was done for many years, but with the advent of the era of deep learning and big data, it has become common to have a training set size on the order of millions, and this becomes computationally expensive: it may take a few minutes to perform a single step. So what is commonly done is something called mini-batch gradient descent, where we divide the training set into small batches and perform gradient descent using those batches individually, as sketched below. This often results in faster convergence, but there's a major problem here: we only look at a fraction of the training set while taking a single step, so the step may not be towards the steepest decrease of the cost function. This is because we are minimizing the cost based on a subset of the total data, which is not representative of what's best for the entire training data. Instead of following a straight path towards the minimum, our algorithm now follows a roundabout path, one that does not always lead to an optimum and that most commonly overshoots (goes past the minimum).
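Here is a minimal sketch of that mini-batch loop, where grad_fn is a hypothetical function (named here only for illustration) that returns the gradient estimated from a single mini-batch:

import numpy as np

def minibatch_gradient_descent(X, y, theta, grad_fn, alpha=0.01, batch_size=64, epochs=10):
    n = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(n)          # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            g = grad_fn(X[idx], y[idx], theta)   # noisy gradient from one mini-batch
            theta = theta - alpha * g            # step based on that estimate
    return theta

Each step is cheap, since it touches only batch_size examples, but the gradient g is only an estimate, which is exactly the source of the noisy path described above.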

Source: https://datascience.stackexchange.com/questions/52884/possible-for-batch-size-of-neural-network-to-be-too-small
Source: https://engmrk.com/mini-batch-gd/

The figures alongside show the steps of gradient descent for the three different batch-size cases, and the corresponding changes in how the cost function is minimized. In both figures, it is apparent that the cost function decreases overall, but it oscillates along the way. The question, then, is this: can we somehow "smoothen" out these steps of gradient descent so that they follow a less noisy path and converge faster? The answer, as you might already have guessed, is Adam optimization.

Adam Optimization Algorithm

Source: https://medium.com/@nishantnikhil/adam-optimizer-notes-ddac4fd7218

There’s a lot going on here. Let’s quickly break it down. First, let’s see the parameters involved.

  1. α — Learning rate for the gradient descent step.
  2. β1 — Parameter for the momentum step (also known as the first moment in Adam). Generally 0.9.
  3. β2 — Parameter for the RMSProp step (also known as the second moment in Adam). Generally 0.999.
  4. ϵ — Parameter for numerical stability. Generally 10^-8.
  5. m, v — First and second moment estimates, respectively. Both are initialized to 0.
  6. t — The timestep, used in the bias correction steps.
  7. g and f — Gradient and function values at θ.

Adam can essentially be broken down as a combination of two main algorithms: Momentum and RMSProp. The momentum step is as follows -

m = beta1 * m + (1 - beta1) * g

Suppose beta1 = 0.9. Then the corresponding step calculates 0.9 * current moment + 0.1 * current gradient. You can think of this as a weighted average over roughly the last 1/(1 - beta1) = 10 gradients, which cancels out a lot of the noise. However, the moment is initialized to 0, so the moment at the first step is 0.9*0 + 0.1*gradient = gradient/10, and so on. The moment will fail to keep up with the original gradient, and this is known as a biased estimate. To correct this we do the following, known as bias correction, dividing by 1 - (beta1 raised to the timestep) -

m_corrected = m / (1 - np.power(beta1, t))

Note that 1 - np.power(beta1, t) approaches 1 as t grows with each step, so the correction is strongest in the first few steps and fades out later.
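A quick numeric check of this (using a constant gradient of 1.0, purely for illustration) shows the raw moment lagging behind while the corrected moment tracks the gradient from the very first step:

import numpy as np

beta1 = 0.9
m = 0.0
for t in range(1, 6):
    g = 1.0                                 # constant gradient for the demo
    m = beta1 * m + (1 - beta1) * g         # raw (biased) moment
    m_corrected = m / (1 - np.power(beta1, t))
    print(t, round(m, 4), round(m_corrected, 4))
# t=1: m = 0.1,  m_corrected = 1.0
# t=2: m = 0.19, m_corrected = 1.0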

Source: https://medium.com/datadriveninvestor/exponentially-weighted-average-for-deep-neural-networks-39873b8230e9

The graph alongside pictures this perfectly: the yellow line shows the moment (estimate) obtained with a smaller beta1, say 0.5, while the green line corresponds to a beta1 value closer to 1, say 0.9.

RMSProp does a similar thing, but slightly differently -

v = beta2 * v + (1 - beta2) * np.square(g)
v_corrected = v / (1 - np.power(beta2, t))

It also computes a weighted average, over approximately the last 1/(1 - beta2) gradients, which is 1000 when beta2 = 0.999. But here it is the squares of the gradients (a sort of scaled magnitude) that are averaged, followed by the same bias correction step.

Now, in the gradient descent step instead of using the gradient we use these moments as follows -

theta = theta - learning_rate * m_corrected / (np.sqrt(v_corrected) + epsilon)

Using m_corrected ensures that our steps move in the direction of the general trend and do not oscillate too much, while dividing by the square root of the mean of the squared magnitudes keeps the overall size of the steps fixed and close to unit value. This also adds an adaptive-gradient flavour, which I am not going to discuss in detail; it is a procedure for shrinking the magnitude of the steps as we approach convergence, which helps prevent overshooting. Finally, epsilon is added to the denominator to avoid division by 0 in case the estimates of the gradients encountered are so small that they are rounded off to 0 in floating point. The value is deliberately chosen to be very small, generally of the order of 10^-8, so as not to affect the algorithm.
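Putting the pieces together, here is a minimal sketch of a single Adam update in NumPy, following the steps above (the hyperparameter defaults are the values listed earlier, with alpha = 0.001 as suggested in the original paper):

import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Momentum step: exponentially weighted average of the gradients
    m = beta1 * m + (1 - beta1) * g
    # RMSProp step: exponentially weighted average of the squared gradients
    v = beta2 * v + (1 - beta2) * np.square(g)
    # Bias correction for both moment estimates
    m_corrected = m / (1 - np.power(beta1, t))
    v_corrected = v / (1 - np.power(beta2, t))
    # Update using the smoothed direction and the normalized step size
    theta = theta - alpha * m_corrected / (np.sqrt(v_corrected) + epsilon)
    return theta, m, v

In a training loop, m and v start at 0 and t starts at 1, incrementing after every step, exactly as in the bias correction discussion above.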

Effect on Performance

Adam has been in widespread use in deep learning models since 2015. It was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto in their 2015 ICLR paper "Adam: A Method for Stochastic Optimization". Adam, despite how it may sound, is not named after someone; it is short for "Adaptive Moment Estimation". The following figure shows its effectiveness compared to other optimization algorithms when applied to a neural network model on the MNIST dataset.

Source: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/

Adam has been one of the most remarkable achievements in the field of optimization. In several cases, the training of a large model that used to require days has been reduced to hours by switching to Adam. Since its inception it has become the default optimizer in almost all deep learning libraries. I myself use Adam frequently: on a handwritten digit classification problem, I found that just by changing my optimizer from mini-batch gradient descent to Adam, my training accuracy jumped from 79% to 94%, and the number of iterations required dropped to about one-third. That is a pretty significant change considering that my training set held only about 10,000 examples, nowhere near a million, where the effects would be even more pronounced!

--

An aspiring researcher, currently pursuing a Bachelor's in Computer Engineering, with a special interest in Artificial Intelligence.