A Complete Guide to Adam and RMSprop Optimizer

Sanghvirajit · Published in Analytics Vidhya · Feb 20, 2021

From the algorithms to their implementation.

Optimization is a mathematical discipline that determines the “best” solution in a quantitatively well-defined sense. Mathematical optimization of processes governed by partial differential equations has seen considerable progress in the past decade and has since been applied to a wide variety of disciplines, e.g., science, engineering, mathematics, economics, and even commerce. Optimization theory provides algorithms to solve well-structured optimization problems along with the analysis of those algorithms. A typical optimization problem includes an objective function that is to be minimized or maximized subject to given constraints. In machine learning (especially in neural networks), optimization algorithms aim at minimizing an objective function (generally called the loss or cost function), which intuitively measures the difference between the predicted values and the expected values.

Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space. Many gradient-descent-based optimization algorithms exist in the literature; a common classification is as follows.

First-order optimization algorithms

First-order methods use the first derivative (the gradient) of the function to be minimized.

  1. Momentum
  2. Nesterov accelerated gradient
  3. Adagrad
  4. Adadelta
  5. RMSprop
  6. Adam
  7. Adamax
  8. Nadam
  9. AMSGrad

Second-order optimization algorithms

Second-order methods make use of an estimate of the Hessian matrix (the matrix of second derivatives of the loss function with respect to its parameters).

  1. Newton method
  2. Conjugate gradient
  3. Quasi-Newton method
  4. Levenberg-Marquardt algorithm.

In this article, we will go through Adam and RMSprop, starting from their algorithms to their implementation in Python, and later we will compare their performance.

Let J(θ) be a sufficiently differentiable function parameterized by a model’s parameters θ ∈ R^n, of which one seeks a minimum. The gradient method builds a sequence that should, in principle, approach the minimum. For this, we start from any value θ_0 (a random value, for example) and construct the recurrent sequence by:

θ_{t+1} = θ_t − η ∇J(θ_t)

where η is the learning rate.

Python code for Gradient Descent
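The snippet below is a minimal sketch of the update rule above, not the network trained later in the article; the least-squares objective and the data X, y are purely illustrative assumptions.

import numpy as np

def gradient_descent(grad, theta0, lr=0.01, n_iters=1000):
    # Plain gradient descent: theta <- theta - lr * grad(theta)
    theta = theta0.astype(float).copy()
    for _ in range(n_iters):
        theta -= lr * grad(theta)
    return theta

# Illustrative objective: J(theta) = ||X @ theta - y||^2 / n
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
grad = lambda theta: 2 * X.T @ (X @ theta - y) / len(y)

print(gradient_descent(grad, np.zeros(3), lr=0.1))  # approaches [1.0, -2.0, 0.5]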

In vanilla stochastic gradient descent, the learning rate is kept fixed for the whole recurrent sequence, which can result in slow convergence. In adaptive methods like Adam and RMSprop, the learning rate is instead adapted for each parameter. For a well-chosen learning rate, the method converges to a minimum of the error function even when the input samples are not linearly separable.

Although Adagrad works well in sparse settings, its performance has been observed to deteriorate when the loss function is nonconvex and the gradients are dense, because its learning rate decays rapidly in these settings: the update accumulates all past squared gradients. To tackle this issue, several variants of Adagrad, such as RMSprop, Adam, and Adadelta, have been proposed. They mitigate the rapid decay of the learning rate by using exponential moving averages of squared past gradients, essentially limiting the update’s reliance to only the last few gradients.

RMSprop Optimizer

RMSprop is a gradient-based optimization technique used in training neural networks. It was proposed by Geoffrey Hinton in his Coursera course on neural networks [2]. Gradients of very complex functions like neural networks have a tendency to either vanish or explode as the data propagates through the function (refer to the vanishing gradients problem). RMSprop was developed as a stochastic technique for mini-batch learning.

RMSprop deals with the above issue by using a moving average of squared gradients to normalize the gradient. This normalization balances the step size, decreasing the step for large gradients to avoid exploding and increasing the step for small gradients to avoid vanishing.

Simply put, RMSprop uses an adaptive learning rate instead of treating the learning rate as a hyperparameter. This means that the learning rate changes over time.

RMSprop’s update rule:

E[g²]_t = β E[g²]_{t−1} + (1 − β) g_t²
θ_{t+1} = θ_t − η g_t / (√(E[g²]_t) + ε)

where E[g²]_t is the exponential moving average of the squared gradients, β is the decay rate (typically 0.9), g_t is the gradient at step t, and ε is a small constant added for numerical stability.

Let’s code in Python

Python code for RMSprop
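A minimal NumPy sketch of the update rule above, written as a small class with an update(params, grads) method; the class layout and names are illustrative, not the exact code behind the experiments reported later.

import numpy as np

class RMSprop:
    # Keeps an exponential moving average of squared gradients and
    # divides the step by its square root (see the update rule above).
    def __init__(self, lr=0.001, beta=0.9, eps=1e-8):
        self.lr, self.beta, self.eps = lr, beta, eps
        self.avg_sq_grad = None

    def update(self, params, grads):
        if self.avg_sq_grad is None:
            self.avg_sq_grad = np.zeros_like(params)
        # E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * g_t^2
        self.avg_sq_grad = self.beta * self.avg_sq_grad + (1 - self.beta) * grads ** 2
        # theta_{t+1} = theta_t - lr * g_t / (sqrt(E[g^2]_t) + eps)
        return params - self.lr * grads / (np.sqrt(self.avg_sq_grad) + self.eps)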

ADAM optimizer

Adam (Kingma & Ba, 2014) is a first-order gradient-based algorithm for optimizing stochastic objective functions, based on adaptive estimates of lower-order moments. Adam is one of the state-of-the-art optimization algorithms used by many practitioners of machine learning. The first moment, normalized by the square root of the second moment, gives the direction of the update.

Adam’s update rule:

m_t = β1 m_{t−1} + (1 − β1) g_t
v_t = β2 v_{t−1} + (1 − β2) g_t²
m̂_t = m_t / (1 − β1^t), v̂_t = v_t / (1 − β2^t)
θ_{t+1} = θ_t − η m̂_t / (√(v̂_t) + ε)

where m_t and v_t are the (biased) first and second moment estimates, m̂_t and v̂_t are their bias-corrected versions, and β1, β2 are the exponential decay rates (the paper suggests 0.9 and 0.999).

Figure 1: ADAM algorithm [1]

Let’s code in Python

The Python code of the ADAM optimizer will look as follows,

Python code for ADAM optimizer
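A minimal NumPy sketch following Algorithm 1 of the Adam paper [1], using the same illustrative update(params, grads) interface as the RMSprop sketch above.

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None  # first moment estimate (mean of gradients)
        self.v = None  # second moment estimate (uncentered variance)
        self.t = 0     # time step

    def update(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        # biased moment estimates
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads ** 2
        # bias-corrected estimates
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # theta_{t+1} = theta_t - lr * m_hat / (sqrt(v_hat) + eps)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

In a training loop, each weight matrix keeps its own optimizer instance and is updated after every backward pass, e.g. W = adam_W.update(W, dW) (the variable names here are only for illustration).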

Yeah, it’s that simple.

Results

I have coded a neural network from scratch and implemented the above optimizers. The network was then trained for 100 epochs on the MNIST handwritten digits dataset, which contains 60,000 training examples and 10,000 testing examples.

Figure 2: Results

Extension

Just for reference, I have also implemented the Adamax optimizer, an extension of Adam, as you can see from the results. If you are interested in the details of Adamax, I recommend reading the paper by Diederik P. Kingma and Jimmy Lei Ba, Adam: A Method for Stochastic Optimization [1].
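For reference, here is a minimal sketch of the Adamax update described in that paper, with the same illustrative interface as the sketches above: the second-moment term of Adam is replaced by an exponentially weighted infinity norm of the gradients.

import numpy as np

class Adamax:
    def __init__(self, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None  # first moment estimate
        self.u = None  # exponentially weighted infinity norm
        self.t = 0     # time step

    def update(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.u = np.zeros_like(params)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        # u_t = max(beta2 * u_{t-1}, |g_t|); no bias correction is needed for u_t
        self.u = np.maximum(self.beta2 * self.u, np.abs(grads))
        return params - (self.lr / (1 - self.beta1 ** self.t)) * self.m / (self.u + self.eps)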

Conclusion

In this article, we have seen simple and computationally efficient algorithms for gradient-based optimization. We have seen that the RMSprop and Adam optimizers are straightforward and easy to implement. The experiments confirm the analysis of the rate of convergence on convex problems.

If you are interested in the detailed code of the neural network from scratch, you can find it on my GitHub.

https://github.com/sanghvirajit/Feedforward_Neural_Network

References

[1] Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. 2014.

[2] Tieleman, T. and Hinton, G. Lecture 6.5 — RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, 2012.
