Demystifying Deep Learning Optimizers: Exploring Gradient Descent Algorithms (Part 2)

A Comprehensive Guide to Momentum Gradient Descent

Hemant Rattey
Nerd For Tech
6 min read · Aug 27, 2023


In the first part of this series, we explored the foundational optimization algorithm known as Gradient Descent and its variants — Stochastic Gradient Descent (SGD), Batch Gradient Descent, and Mini-Batch Gradient Descent. We discussed how these algorithms update model parameters to minimize a cost function by iteratively computing gradients and adjusting weights.

In this article, we’ll learn about another advanced optimization technique called Momentum-based Gradient Descent that enhances the convergence speed and stability of gradient descent.

Limitations of Standard Gradient Descent

Standard Gradient Descent, while a fundamental optimization technique, faces several challenges when applied to deep learning models due to the non-convex and complicated nature of the cost function:

Slow Convergence in Flat Regions

In regions of the optimization landscape where the cost function is relatively flat, Gradient Descent can suffer from slow convergence. The algorithm takes small steps along the gradient direction, leading to prolonged training times.

Oscillations and Noise

The updates in Gradient Descent are influenced solely by the gradient of the current iteration. This lack of inertia can cause oscillations, leading to erratic and noisy updates. These oscillations can hinder convergence and make it difficult to find the optimal solution.

Getting Stuck in Local Minima

Standard Gradient Descent is susceptible to getting stuck in shallow local minima, where the gradient approaches zero but the algorithm fails to escape and explore other regions of the optimization landscape.

Momentum-based Gradient Descent to the Rescue

Momentum-based Gradient Descent addresses these limitations by introducing the concept of momentum, which imparts inertia to the optimization process.

The idea behind momentum is that prior gradients should help guide the optimization process. When previous gradients consistently point in a particular direction, momentum speeds up progress toward the minimum along that trajectory, reflecting greater confidence that the minimum lies in that direction. Conversely, when past gradients disagree with the current direction, momentum dampens the update. This inertia lets the algorithm accumulate information from previous iterations and overcome the challenges associated with Standard Gradient Descent:

Overcoming Slow Convergence

Momentum enables the optimization process to “build up speed” in directions with consistent gradients. In flat regions, where the gradients are small, momentum helps the algorithm accelerate and traverse these regions more efficiently, resulting in faster convergence.
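To make the idea of "building up speed" concrete, here is a tiny sketch with made-up numbers, using one common form of the velocity update. It shows how the velocity term grows while the gradient keeps pointing the same way:

```python
# Hypothetical scenario: a long, gently sloping region where the gradient
# stays at a constant 1.0 for several steps in a row.
beta = 0.9   # momentum coefficient (illustrative value)
grad = 1.0   # constant gradient
v = 0.0      # velocity starts at zero

for step in range(1, 6):
    v = beta * v + grad  # velocity accumulates while gradients agree
    print(step, round(v, 4))

# v keeps growing toward 1 / (1 - beta) = 10, so the effective step
# becomes up to 10x larger than a single bare gradient step.
```

If the gradients flipped sign instead of agreeing, the same update would shrink v, which is exactly the dampening behavior described above.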

Reducing Oscillations and Noise

By incorporating information from previous iterations, momentum smooths out the update trajectory. This reduces oscillations and noise in the updates, leading to a more stable optimization path and improved convergence.

Escaping Local Minima

Momentum assists in escaping shallow local minima by allowing the algorithm to accumulate momentum and explore other regions of the optimization landscape. This ability to navigate through challenging regions enhances the algorithm’s chances of finding better solutions.

How to accumulate previous gradients?

Central to Momentum-based Gradient Descent’s prowess is its ability to accumulate previous gradients, instilling a sense of continuity in optimization. This seamless integration is orchestrated through the Exponential Weighted Moving Average (EWMA) technique.

What exactly is Exponential Weighted Moving Average?

Exponential Weighted Moving Average (EWMA) is a moving average in which the most recent data points carry the highest weight, while the weight of older points decays exponentially. It is defined by the following recurrence:

vₜ = β · vₜ₋₁ + (1 − β) · θₜ

where vₜ is the EWMA at time t, β is the decay factor, and θₜ is the data point at time t. Intuitively, you can think of β as taking a moving average over roughly the previous 1/(1 − β) data points: if β = 0.9, the average is effectively over the last 10 points; if β = 0.98, over the last 50. Expanding the recurrence shows why the equation makes sense and how the weights of previous data points decay exponentially:

Expanding the recurrence, the data point from k steps ago ends up multiplied by βᵏ · (1 − β). Since β is always between 0 and 1, higher powers of β shrink toward zero, so older data points contribute less and less, while the most recent point keeps the largest weight. This is the main idea behind how the moving average in EWMA decays exponentially.
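As a quick sketch, the EWMA recurrence can be implemented in a few lines of Python (the input series below is made-up illustrative data):

```python
def ewma(data, beta=0.9):
    """Exponential Weighted Moving Average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    smoothed = []
    for theta in data:
        v = beta * v + (1 - beta) * theta  # recent points weigh more; older ones decay as beta^k
        smoothed.append(v)
    return smoothed

series = [10, 12, 11, 13, 12, 14, 13]  # made-up data points
print(ewma(series))
```

Note that starting from v₀ = 0 biases the early averages toward zero; a common fix is bias correction, dividing each vₜ by (1 − βᵗ).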

Intuition and Working of Momentum-Based Gradient Descent

In standard Gradient Descent, each parameter update is influenced solely by the gradient of the current iteration. However, momentum introduces a notion of inertia, allowing the optimization process to carry forward a portion of the previous update direction. This helps the algorithm “build up speed” in directions with consistent gradients, leading to faster convergence.

Implementation

The momentum-based update rule can be expressed as follows:

vₜ = β · vₜ₋₁ + ∇J(θₜ)

θₜ₊₁ = θₜ − α · vₜ

Where:

  • vt is the velocity vector at iteration t.
  • β is the momentum hyperparameter, typically set between 0 and 1.
  • ∇J(θt​) is the gradient of the cost function with respect to the parameters at iteration t.
  • θt​ represents the parameters at iteration t.
  • α is the learning rate.

Intuition

The equation above is quite simple to understand. The vt​ term represents the velocity of the gradients, indicating the previously accumulated gradients. This concept forms the crux that distinguishes momentum-based gradient descent from vanilla gradient descent.

In the first equation, the velocity term is computed by combining the previous step’s velocity, scaled by β, with the current gradient. This encapsulates both the direction of past momentum and the current gradient’s influence.

The second equation showcases the actual parameter update, where the new parameters (θt+1​) are computed by subtracting the scaled velocity from the current parameters (θt​). This mechanism imparts the accumulated momentum to the parameter updates, guiding the optimization process towards regions of faster convergence.
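Putting the two equations together, here is a minimal NumPy sketch of momentum-based gradient descent. The cost function J(θ) = θ² and the hyperparameter values are illustrative choices, not prescriptions:

```python
import numpy as np

def momentum_gd(grad_fn, theta0, alpha=0.1, beta=0.9, n_iters=200):
    """Momentum-based gradient descent:
        v_t       = beta * v_{t-1} + grad J(theta_t)
        theta_t+1 = theta_t - alpha * v_t
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        v = beta * v + grad_fn(theta)  # accumulate velocity from past gradients
        theta = theta - alpha * v      # update parameters with the scaled velocity
    return theta

# Illustrative cost J(theta) = theta^2, whose gradient is 2 * theta; minimum at 0.
grad = lambda theta: 2 * theta
theta_star = momentum_gd(grad, theta0=[5.0])
print(theta_star)  # close to 0
```

Setting beta = 0 recovers vanilla gradient descent, which makes this a convenient starting point for experimenting with different momentum values.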

Limitations

While Momentum-based Gradient Descent brings considerable advantages to the optimization table, it’s essential to acknowledge its limitations and potential challenges:

  1. Risk of Overshooting: In some cases, especially when the cost function is rugged or the landscape has steep cliffs, Momentum-based Gradient Descent can overshoot the optimal point due to its accumulated momentum. This overshooting may lead to oscillations around the minima and may hinder convergence.
  2. Plateaus and Saddle Points: While Momentum-based Gradient Descent aids in escaping local minima, it might struggle when dealing with plateaus or saddle points, where gradients are nearly zero. The accumulated momentum might inadvertently perpetuate movement in these regions.
  3. Dependence on Learning Rate: The learning rate (α) interacts with the momentum parameter. Incorrectly chosen learning rates might amplify the drawbacks associated with high momentum values or hinder the benefits of lower momentum.

Next Steps

In this article, we took a brief look at another optimization algorithm, Momentum-based Gradient Descent. We discussed the limitations of standard Gradient Descent and how the concept of momentum helps overcome them. Momentum-based Gradient Descent has a notable flaw of its own, however: its accumulated velocity can overshoot the minimum, leading to slower convergence or a suboptimal solution.

But our journey doesn’t end here. We’re moving forward to explore another optimization technique in the next part: Nesterov Accelerated Gradient (NAG). As we learn more, we’re getting better at optimizing, getting our deep learning models to perform even better and more efficiently. Stay tuned!

Please feel free to give me feedback regarding the same.

Come say Hi! to me on Twitter.
