Demystifying the Adam Optimizer in Machine Learning

Dagang Wei
5 min read · Jan 30, 2024


Image source: https://www.geeksforgeeks.org/intuition-of-adam-optimizer/

This article is part of the series Demystifying Machine Learning.

Introduction

In the intricate world of machine learning, optimization algorithms are the engines that drive our models towards better accuracy and performance. They guide how a model adjusts its parameters during training to minimize errors. Among the array of optimizers available, Adam stands out as a true powerhouse, widely favored for its efficiency and adaptability. In this blog post, we’ll dive into the heart of the Adam optimizer, unraveling its secrets and shedding light on its uses.

What is Adam?

Adam, short for Adaptive Moment Estimation, is an optimization algorithm that builds upon the strengths of two other popular techniques: AdaGrad and RMSProp. Like its predecessors, Adam is an adaptive learning rate algorithm. This means it dynamically adjusts the learning rate for each individual parameter within a model, rather than using a single global learning rate.
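In practice you rarely implement this by hand; most deep learning frameworks expose Adam directly. As a quick, illustrative sketch (assuming PyTorch is installed and using a toy model created just for this example), switching from a single global learning rate to Adam's per-parameter adaptation is a one-line change:

import torch

# A toy model, used here purely for illustration.
model = torch.nn.Linear(10, 1)

# Plain SGD: one global learning rate shared by every parameter.
sgd_optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Adam: same interface, but the effective step size of each parameter is
# adapted using running estimates of the gradient's first and second moments.
adam_optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                                  betas=(0.9, 0.999), eps=1e-8)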

Why Use Adam?

Here’s why Adam has become so prevalent in machine learning:

  • Speed and Efficiency: Adam leverages past gradient information to accelerate convergence. This often leads to faster training times compared to simpler optimizers like basic Stochastic Gradient Descent (SGD).
  • Handles Sparse Gradients: Datasets where features occur infrequently (sparse gradients) can challenge some optimizers. Adam is designed to be more robust in such scenarios.
  • Adaptive Learning Rates: Because each parameter gets its own effective step size, Adam copes well with problems where gradient magnitudes vary widely across parameters.
  • Minimal Hyperparameter Tuning: Adam generally performs well with minimal tweaking of the default settings, simplifying the model development process.

How Adam Works

Let’s break down the mechanics behind Adam’s magic:

  1. Momentum: Adam keeps track of an exponentially decaying average of past gradients (similar to momentum in SGD). This helps to smooth out the updates and navigate noisy gradients.
  2. Adaptive Learning Rates: Adam also computes an exponentially decaying average of past squared gradients. This is used to scale the learning rate for each parameter, allowing for larger updates for infrequent features and smaller updates for frequent ones.
  3. Bias Correction: Because the moving averages are initialized at zero, they are biased towards zero during the first few iterations. Adam applies a bias correction step to counteract this early bias, as the update equations below make explicit.
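Putting these three steps together, the standard Adam update for a parameter theta at step t looks like this (g_t is the gradient, beta1 and beta2 are the decay rates, lr is the learning rate, and epsilon is a small constant for numerical stability):

m_t     = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_hat_t = m_t / (1 - beta1^t)
v_hat_t = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - lr * m_hat_t / (sqrt(v_hat_t) + epsilon)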

When to Use Adam And When Not to

Adam is an excellent starting point for most machine learning problems, particularly in deep learning. Here’s when it shines:

  • Deep Neural Networks: The speed and adaptability of Adam make it perfect for the complex landscapes of deep neural networks.
  • Computer Vision Tasks: Adam works well with the vast image datasets frequently used in computer vision.
  • Natural Language Processing: NLP models, which often involve sparse gradients from large embedding layers, also benefit from Adam’s adaptive updates.

However, there are scenarios where other optimizers might be better suited:

  • Small Datasets: With smaller datasets, simpler optimizers like SGD might sometimes converge faster and generalize better.
  • Need for Precise Control: For problems requiring fine-grained control over the learning process, consider SGD with a manually tuned learning rate schedule (a sketch follows below).
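For example, a manually tuned schedule in PyTorch (shown here only as an illustrative sketch with a placeholder model) might pair SGD with a step decay:

import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

# SGD with momentum, plus an explicit schedule that multiplies the
# learning rate by 0.1 every 30 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... forward pass, loss.backward(), optimizer.step() for one epoch ...
    scheduler.step()  # decay the learning rate on the chosen schedule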

Python Implementation

Here’s a basic Python implementation of the Adam optimizer. The code is available in this colab notebook.

import matplotlib.pyplot as plt
import numpy as np

# Define a simple loss function
def loss(x):
    return (x - 2)**2

# Define Adam optimizer
def adam_update(param, grad, learning_rate, beta1, beta2, t, m_prev, v_prev, epsilon):
    """
    One step of the Adam update, an optimization method used for training
    machine learning models, particularly neural networks.

    Intuition:
    Adam combines the benefits of two other popular optimization algorithms:
    AdaGrad and RMSProp.

    1. AdaGrad adapts the learning rate to parameters, performing larger updates
       for infrequent parameters and smaller updates for frequent ones. However,
       its continuously accumulating squared gradients can lead to an overly
       aggressive and monotonically decreasing learning rate.

    2. RMSProp modifies AdaGrad by using a moving average of squared gradients to
       adapt the learning rate, which resolves the radically diminishing learning
       rates of AdaGrad.

    Adam takes this a step further by:
    - Calculating an exponentially moving average of the gradients (m) to smooth
      out the gradient descent path, addressing the issue of noisy gradients.
    - Computing an exponentially moving average of the squared gradients (v),
      which scales the learning rate inversely proportional to the square root of
      the second moments of the gradients. This enables adaptive learning rate
      adjustments.
    - Applying bias corrections to the first (m_hat) and second (v_hat) moment
      estimates to account for their initialization at zero, leading to more
      accurate updates at the beginning of training.

    The result is an optimizer that handles sparse gradients and noisy problems
    well, and that scales to large datasets and high-dimensional parameter
    spaces.

    Arguments: m_prev and v_prev are the first and second moment estimates from
    the previous step, and epsilon is a small constant that prevents division by
    zero.
    """
    # Update biased first moment estimate.
    # m is the exponentially moving average of the gradients.
    # beta1 is the decay rate for the first moment.
    m = beta1 * m_prev + (1 - beta1) * grad

    # Update biased second raw moment estimate.
    # v is the exponentially moving average of the squared gradients.
    # beta2 is the decay rate for the second moment.
    v = beta2 * v_prev + (1 - beta2) * grad**2

    # Compute bias-corrected first moment estimate.
    # This corrects the bias in the first moment caused by zero initialization.
    m_hat = m / (1 - beta1**(t + 1))

    # Compute bias-corrected second raw moment estimate.
    # This corrects the bias in the second moment caused by zero initialization.
    v_hat = v / (1 - beta2**(t + 1))

    # Update the parameter using the learning rate, the corrected first moment,
    # and the square root of the corrected second moment.
    # epsilon is a small number to avoid division by zero.
    param -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

    # Return the updated parameter, as well as the first and second moment estimates.
    return param, m, v


# Initialize parameters and optimizer state
param = np.random.randn()
m_prev = np.zeros_like(param)
v_prev = np.zeros_like(param)
learning_rate = 0.01
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8

# Track parameter values and loss over time
param_history = [param]
loss_history = [loss(param)]

# Training loop
epochs = 500
for t in range(epochs):
    # Calculate the gradient: d/dx (x - 2)**2 = 2 * (x - 2).
    grad = 2 * (param - 2)

    # Update the parameter and the optimizer state.
    param, m_prev, v_prev = adam_update(
        param, grad, learning_rate, beta1, beta2, t, m_prev, v_prev, epsilon
    )

    # Track the parameter and the loss.
    param_history.append(param)
    loss_history.append(loss(param))

# Plot results
plt.figure(figsize=(8, 6))

# Plot loss over time
plt.plot(range(len(loss_history)), loss_history, label="Loss")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()

# Plot parameter trajectory
plt.figure(figsize=(8, 6))
plt.plot(range(len(param_history)), param_history, label="Parameter")
plt.axhline(2, color="red", linestyle="--", label="Minimum")
plt.xlabel("Iteration")
plt.ylabel("Parameter Value")
plt.legend()

plt.show()
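If you run the script, the loss curve should drop towards zero and the parameter trajectory should settle near the minimum at x = 2. In real projects you would not hand-roll this loop; framework implementations such as torch.optim.Adam or tf.keras.optimizers.Adam apply the same update rule, but vectorized over all model parameters.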

Conclusion

Adam’s effectiveness, ease of use, and robustness have made it a mainstay in modern machine learning. By understanding its inner workings, you can make informed decisions about its application in your own projects. Remember, the world of optimization is constantly evolving, so keep an open mind and explore other algorithms as well!
