Adam optimization algorithm in Deep Learning.

Jelal Sultanov
AI³ | Theory, Practice, Business
3 min read · Sep 1, 2019

Today I will be explaining another optimization algorithm called Adam. Deep learning practitioners are always looking to improve model performance and drive down the loss function value over the epochs of training, and Adam is one of the most widespread optimization techniques in the deep learning field. There are a few main reasons why it works so well, and below I will try to explain some of them.

Adam, short for Adaptive Moment Estimation, was introduced in 2015 by two researchers, Diederik P. Kingma and Jimmy Lei Ba. The algorithm estimates moments of the gradient and uses them to adapt the update for each parameter. It is essentially a combination of gradient descent with momentum and RMSProp (Root Mean Square Propagation).

The Adam algorithm keeps two exponentially weighted moving averages: one of the gradient itself (the first moment) and one of the squared gradient (the second moment). Two decay parameters control the decay rates of these moving averages, and both averages are bias-corrected before being used in the parameter update.
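Concretely, the update rules from the Kingma and Ba paper are as follows, where g_t is the gradient at step t, alpha is the learning rate, and epsilon guards against division by zero:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)
\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
```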

Below I will show an implementation of the Adam optimization algorithm in Python, explaining each part of the script to give you a better understanding of this technique.

Description: This function takes an initial (or previous) value of x, updates it according to the Adam optimization algorithm, and outputs the value of x at the minimum once the precision criterion is satisfied.

Arguments:

x_new: a starting value of x that will be updated based on the learning rate

x_prev: the previous value of x that is being updated to the new one

precision: the precision threshold that determines when the stepwise descent stops

l_r: the learning rate (the size of each descent step)

beta1: the first-moment decay parameter for the momentum part of the Adam optimizer

beta2: the second-moment decay parameter for the RMSProp part of the Adam optimizer

epsilon: a small value chosen to ensure there is no division by zero when the RMSProp term is very small

Output:

1. Prints out the latest value of x, which corresponds to the minimum we are looking for

2. Prints out the number of x values computed, which corresponds to the number of Adam optimizer steps

3. Plots a first graph of the function with the Adam path

4. Plots a second graph of the function with the Adam path zoomed in on the important area

All of these values are chosen arbitrarily for the demonstration.
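The original code gist is not reproduced here, so below is a minimal sketch of what a function matching this description might look like. The quadratic objective f(x) = (x - 2)^2, the starting values, and the iteration cap are my own illustrative assumptions, not taken from the original script:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative objective and derivative (assumed for this sketch); minimum at x = 2.
def f(x):
    return (x - 2) ** 2

def df(x):
    return 2 * (x - 2)

def adam(x_new=3.0, x_prev=4.0, precision=1e-5, l_r=0.05,
         beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=10_000):
    """Minimize f with Adam, stopping once the step size falls below precision."""
    x_list = [x_new]          # every x visited, i.e. the Adam path
    m, v, t = 0.0, 0.0, 0     # first moment, second moment, step counter

    # x_prev only needs to differ from x_new so the loop is entered;
    # max_iter is a safety cap added for this sketch.
    while abs(x_new - x_prev) > precision and t < max_iter:
        x_prev = x_new
        t += 1
        grad = df(x_prev)
        m = beta1 * m + (1 - beta1) * grad        # momentum part
        v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp part
        m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
        x_new = x_prev - l_r * m_hat / (np.sqrt(v_hat) + epsilon)
        x_list.append(x_new)

    print("Local minimum occurs at:", x_new)
    print("Number of steps:", len(x_list))

    # First graph: the function with the full Adam path.
    xs = np.linspace(0, 4, 200)
    plt.plot(xs, f(xs), label="f(x)")
    plt.plot(x_list, [f(x) for x in x_list], "o-", markersize=3, label="Adam path")
    plt.legend()
    plt.show()

    # Second graph: zoomed in around the minimum.
    plt.plot(xs, f(xs), label="f(x)")
    plt.plot(x_list, [f(x) for x in x_list], "o-", markersize=3, label="Adam path")
    plt.xlim(1.8, 2.2)
    plt.ylim(-0.005, 0.05)
    plt.legend()
    plt.show()

adam()
```

The stopping rule follows the description above: the loop ends once consecutive x values differ by less than precision, and the list of visited x values gives both the step count and the path that is plotted.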

Conclusion:

This algorithm has helped machine learning practitioners train their models significantly better than plain gradient descent or stochastic gradient descent often allows. The advantages discussed in this tutorial should help you decide when to apply Adam to your own problems. With Adam, you can drive down your cost function and produce useful models that could change the world.
