Adam Optimization Algorithm
The Adam optimization algorithm is one of the few optimization algorithms that has held up and proven effective across a wide variety of deep learning models. It essentially ties together Momentum and RMSprop. Adam stands for Adaptive Moment Estimation.
How it works
- First, it calculates and stores an exponentially weighted average of past gradients in VdW & Vdb (before bias correction) and VdWcorrected & Vdbcorrected (with bias correction) variables.
- It then calculates an exponentially weighted average of past gradient squares and stores it in SdW & Sdb (before bias correction) and SdWcorrected & Sdbcorrected (with bias correction) variables.
- Finally, it updates the parameters in a direction that combines the information from the first two steps.
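To make the first two steps more concrete, here is a minimal sketch in plain Python of an exponentially weighted average with and without bias correction. The function name `weighted_average` and the gradient values are made up for illustration, and `beta` plays the role of ß1 or ß2.

```python
def weighted_average(values, beta):
    """Return (raw, corrected) exponentially weighted averages of values."""
    avg = 0.0  # the running average starts at zero, like VdW and SdW
    raw, corrected = [], []
    for t, value in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * value
        raw.append(avg)
        # Bias correction: divide by (1 - beta^t), where t is the step number
        corrected.append(avg / (1.0 - beta ** t))
    return raw, corrected


grads = [2.0, 2.0, 2.0, 2.0]  # hypothetical gradient values, for illustration only

# Average of past gradients (the Momentum part)
V, V_corrected = weighted_average(grads, beta=0.9)

# Average of past squared gradients (the RMSprop part)
S, S_corrected = weighted_average([g ** 2 for g in grads], beta=0.999)
```

Because the averages start at zero, the uncorrected values in `V` and `S` are far too small during the first iterations. With these constant gradients, the corrected values match the true gradient (and squared gradient) from the very first step.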
Implementation
1. To implement the Adam optimization algorithm, we first initialize:
VdW = 0, SdW = 0, Vdb = 0, Sdb = 0
2. Then, on iteration t:
Compute the derivatives dW and db using the current mini-batch.
3. Update the Momentum exponentially weighted average of the gradients:
VdW = ß1 * VdW + (1 - ß1) * dW
Vdb = ß1 * Vdb + (1 - ß1) * db
4. Do the RMSprop update as well, using the element-wise square of the gradients:
SdW = ß2 * SdW + (1 - ß2) * dW^2
Sdb = ß2 * Sdb + (1 - ß2) * db^2
5. A typical Adam implementation applies bias correction, which gives Vcorrected (V after correction of the bias). Here ß1^t means ß1 raised to the power of the iteration number t:
VdWcorrected = VdW / (1 - ß1^t)
Vdbcorrected = Vdb / (1 - ß1^t)
6. Similarly, we apply bias correction to S, with ß2 raised to the power t:
SdWcorrected = SdW / (1 - ß2^t)
Sdbcorrected = Sdb / (1 - ß2^t)
7. Finally, we perform the update:
W = W - α * VdWcorrected / (sqrt(SdWcorrected) + ε)
b = b - α * Vdbcorrected / (sqrt(Sdbcorrected) + ε)
where:
- Epsilon (ε) is a very small number that prevents division by zero (ε = 10^-8).
- ß1 and ß2 are hyperparameters that control the two weighted averages. In practice, ß1 = 0.9 and ß2 = 0.999 are used as the default values.
- Alpha (α) is the learning rate. It usually needs tuning, so try a range of values to see what works best for your problem.
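Putting the steps together, the whole update can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the helper `adam_step` is a hypothetical name, it handles a single scalar parameter, ε is added outside the square root, and the toy loss (W - 3)^2 + (b + 1)^2 is invented so that the gradients are easy to write by hand.

```python
import math


def adam_step(param, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter. Returns (param, v, s)."""
    # Momentum: exponentially weighted average of past gradients
    v = beta1 * v + (1.0 - beta1) * grad
    # RMSprop: exponentially weighted average of past squared gradients
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    # Bias correction, where t is the iteration number starting at 1
    v_corrected = v / (1.0 - beta1 ** t)
    s_corrected = s / (1.0 - beta2 ** t)
    # Parameter update
    param = param - alpha * v_corrected / (math.sqrt(s_corrected) + eps)
    return param, v, s


# Toy problem (an assumption for this example): minimize (W - 3)^2 + (b + 1)^2
W, b = 0.0, 0.0
VdW = SdW = Vdb = Sdb = 0.0

for t in range(1, 1001):
    dW = 2.0 * (W - 3.0)  # derivative of the toy loss with respect to W
    db = 2.0 * (b + 1.0)  # derivative of the toy loss with respect to b
    W, VdW, SdW = adam_step(W, dW, VdW, SdW, t, alpha=0.1)
    b, Vdb, Sdb = adam_step(b, db, Vdb, Sdb, t, alpha=0.1)

print(W, b)  # W should end up close to 3 and b close to -1
```

In a real network, W and b would be arrays and dW, db would come from backpropagation on the current mini-batch, but the update rule would keep the same shape. The learning rate of 0.1 was picked only to make this toy example converge quickly.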
References:
- Deep Learning Specialization by Andrew Ng