Adam Optimization Algorithm

Published in Optimization Algorithms for Deep Neural Networks, May 27, 2020

Adam is one of the rare optimization algorithms that has held up and proven effective across a wide variety of deep learning models. It essentially combines Momentum and RMSprop. Adam stands for Adaptive Moment Estimation.

How it works

1. First, it calculates an exponentially weighted average of past gradients and stores it in VdW and Vdb (before bias correction) and in VdWcorrected and Vdbcorrected (after bias correction).
2. It then calculates an exponentially weighted average of the squares of past gradients and stores it in SdW and Sdb (before bias correction) and in SdWcorrected and Sdbcorrected (after bias correction).
3. Finally, it updates the parameters in a direction that combines the information from steps 1 and 2.
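To make the idea of an exponentially weighted average and its bias correction concrete, here is a minimal sketch in plain Python. The function name ewa_with_bias_correction and the sample gradient values are hypothetical and chosen only for illustration.

```python
def ewa_with_bias_correction(values, beta):
    """Return the raw and bias-corrected exponentially weighted averages.

    Hypothetical helper for illustration; `values` stands in for a stream
    of gradients and `beta` for the averaging hyperparameter.
    """
    v = 0.0  # the running average starts at zero, which biases early values
    raw, corrected = [], []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x          # exponentially weighted average
        raw.append(v)
        corrected.append(v / (1 - beta ** t))  # bias correction at iteration t
    return raw, corrected


# A constant "gradient" of 1.0 makes the effect of the correction easy to see.
grads = [1.0, 1.0, 1.0, 1.0]
raw, corrected = ewa_with_bias_correction(grads, beta=0.9)
```

With a constant input of 1.0, the raw average starts near 0.1 and climbs only slowly toward 1.0, while the corrected average is approximately 1.0 from the first iteration onward.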

Implementation

1. To implement the Adam optimization algorithm, we first initialize:

VdW = 0, SdW = 0, Vdb = 0, Sdb = 0

2. Then on iteration t:

Compute the derivatives dW and db using the current mini-batch.

3. Compute the Momentum-style exponentially weighted average of the gradients:

VdW = β1 * VdW + (1 - β1) * dW

Vdb = β1 * Vdb + (1 - β1) * db

4. Compute the RMSprop-style exponentially weighted average of the squared gradients (the square is element-wise):

SdW = β2 * SdW + (1 - β2) * dW^2

Sdb = β2 * Sdb + (1 - β2) * db^2

5. A typical Adam implementation applies bias correction. Because V starts at zero, the early averages are biased toward zero, so we compute Vcorrected (V after bias correction), where t is the iteration number:

VdWcorrected = VdW / (1 - β1^t)

Vdbcorrected = Vdb / (1 - β1^t)

6. Similarly, we apply bias correction to S:

SdWcorrected = SdW / (1 - β2^t)

Sdbcorrected = Sdb / (1 - β2^t)

7. Finally, we perform the update, where α is the learning rate:

W = W - α * (VdWcorrected / (sqrt(SdWcorrected) + ε))

b = b - α * (Vdbcorrected / (sqrt(Sdbcorrected) + ε))

where:

Epsilon (ε) is a very small number that avoids division by zero (ε = 10^-8).
β1 and β2 are hyperparameters that control the two weighted averages. In practice, β1 = 0.9 and β2 = 0.999 are used as the default values.
Alpha (α) is the learning rate. It usually needs tuning, so a range of values should be tried to see what works best for each problem.
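The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names init_state and adam_step, the dictionary layout, and the toy objective sum(W^2) + b^2 are all assumptions made for the example. It also assumes ε is added after taking the square root, which is the common convention.

```python
import numpy as np


def init_state(params):
    """Step 1: initialize V and S to zeros with the same shape as each parameter."""
    return {
        "V": {k: np.zeros_like(p) for k, p in params.items()},
        "S": {k: np.zeros_like(p) for k, p in params.items()},
    }


def adam_step(params, grads, state, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update (steps 3 to 7) at iteration t, with t starting at 1."""
    for k in params:
        g = grads[k]
        # Step 3: Momentum-style average of the gradients
        state["V"][k] = beta1 * state["V"][k] + (1 - beta1) * g
        # Step 4: RMSprop-style average of the element-wise squared gradients
        state["S"][k] = beta2 * state["S"][k] + (1 - beta2) * g ** 2
        # Steps 5 and 6: bias correction
        v_corrected = state["V"][k] / (1 - beta1 ** t)
        s_corrected = state["S"][k] / (1 - beta2 ** t)
        # Step 7: parameter update (assumption: eps is added outside the sqrt)
        params[k] = params[k] - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return params, state


# Toy usage (hypothetical problem): minimize f(W, b) = sum(W^2) + b^2,
# whose gradients are 2*W and 2*b.
params = {"W": np.array([1.0, -2.0]), "b": np.array([0.5])}
state = init_state(params)
for t in range(1, 2001):
    grads = {k: 2 * p for k, p in params.items()}  # step 2: compute gradients
    params, state = adam_step(params, grads, state, t, alpha=0.01)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly α regardless of the gradient's magnitude. On this toy problem the parameters should end up close to zero after the loop.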

References:

Deep Learning Specialization by Andrew Ng

Written by Bibek Shah Shankhar