Mastering Gradient Descent: Optimizing Neural Networks with Precision.
Part 6: Momentum for accelerating Gradient Descent.
Jittery Optimization Path: In Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent especially, the path to the minimum can be quite noisy. This is because these algorithms compute the gradient at each step from a subset of the data (a single instance for SGD, a mini-batch for Mini-Batch Gradient Descent). The noise introduced by these subsets produces fluctuating updates, causing the algorithm to bounce around the minimum, which slows convergence and can leave it settled at a suboptimal solution. Consider the following function: standard gradient descent does an excellent job of finding the minimum.
Now, we will add some noise. Notice that it doesn’t roll immediately into the basin, but lingers along the ridge somewhat.
Adding more noise causes it to converge fairly rapidly to the saddle point between the two minima.
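Since those figures aren’t reproduced here, a minimal sketch (assuming a simple quadratic loss f(w) = w² for illustration) shows the same effect: adding noise to the gradient makes the path jitter instead of settling cleanly at the minimum.

import numpy as np

def descend(noise_std, lr=0.1, steps=50, seed=0):
    # Gradient descent on f(w) = w**2, with optional noise added to the
    # gradient to mimic the mini-batch noise of SGD.
    rng = np.random.default_rng(seed)
    w = 5.0  # start away from the minimum at w = 0
    for _ in range(steps):
        grad = 2 * w + rng.normal(0.0, noise_std)
        w -= lr * grad
    return w

print(descend(noise_std=0.0))  # clean gradient: lands essentially at 0
print(descend(noise_std=2.0))  # noisy gradient: keeps bouncing around 0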
This is where Momentum comes in. Momentum is a method that accelerates gradient descent in the relevant direction and dampens oscillations. It does this by adding a fraction of the previous update vector to the current one, creating a momentum effect that moves the algorithm faster towards the minimum. Think of it as a ball rolling down a hill: it starts slow, but it builds up speed (momentum) as it goes. This is the idea behind momentum-based gradient descent.
Imagine you’re rolling a ball down a hill to reach the lowest point (the minimum of the cost function). Gradient descent without momentum adjusts the ball’s direction and speed based only on the slope of the hill at the current point, so the ball follows the steepest path downwards (the negative gradient); this can lead to erratic movements, especially where the surface is uneven. With momentum, the ball doesn’t just follow the gradient: it also carries its previous velocity (the direction it was moving in the previous step), which smooths out its path and keeps it from getting stuck in small bumps or ditches. In terms of the learning process, this means momentum helps the algorithm navigate plateaus and avoid local minima, allowing it to ‘roll over’ small bumps and continue moving towards the global minimum.
The momentum term can be viewed as a moving average of the gradients. The major problem with a plain moving average is that it weights all gradients equally: a gradient from many iterations ago counts as much as the current one. To fix this, we need a weighted average of the past gradients that gives more weight to the recent ones. We can achieve this with an Exponential Moving Average (EMA), a moving average that assigns greater weight to the most recent values. The EMA for a series Y may be calculated recursively:

S(t) = β · S(t−1) + (1 − β) · Y(t)
Where,
- Y(t) is the value at a period t.
- S(t) is the value of the EMA at any period t.
- β is a hyperparameter: the smoothing factor, typically between 0 and 1, that determines the weight given to recent observations.
Note — The momentum term accumulates gradients over time, effectively smoothing out fluctuations in the gradient values. The coefficient β determines how much of the previous updates is retained. If β is close to 1, more of the past gradients are retained, making the moving average effectively longer; if β is smaller, less of the past gradients are retained, making it shorter.
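To make the recursion concrete, here is a minimal sketch (not from the original article; the function name ema and the test signal are illustrative choices) that applies the EMA to a noisy series for two values of β:

import numpy as np

def ema(values, beta):
    # Exponential moving average: S(t) = beta * S(t-1) + (1 - beta) * Y(t),
    # seeded with the first observation.
    s = values[0]
    out = [s]
    for y in values[1:]:
        s = beta * s + (1 - beta) * y
        out.append(s)
    return np.array(out)

rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 3, 100)) + rng.normal(0, 0.3, size=100)
light = ema(noisy, beta=0.5)   # short memory: still follows the noise
heavy = ema(noisy, beta=0.95)  # long memory: much smoother, but lags behind

A larger β trades responsiveness for smoothness, which is exactly the trade-off the note above describes.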
The update rule for the Momentum-based Gradient Optimizer, in the case of a sequence of gradients, can be expressed as follows:

𝓥(t) = β · 𝓥(t−1) + (1 − β) · 𝛿(t)

Where,
- 𝓥(t) is the new weight update done at iteration t (the value of the EMA at time t); the weights are then updated as W(t+1) = W(t) − η · 𝓥(t), where η is the learning rate.
- 𝛿(t) is the gradient at iteration t.

Assume the weight update before any gradient is seen (t = 0) is zero, and seed the average with the first gradient, so 𝓥(1) = 𝛿(1). Unrolling the recursion gives a weighted sum over all previous gradients built up through the iterations, and it makes clear that past gradients have less influence over the current weight update: since 0 < β < 1, higher powers of β shrink towards zero, so the weight on a gradient keeps decreasing the further back it lies.
If we expand this recursion for (t = 3), we get:

𝓥(3) = β² · 𝛿(1) + β(1 − β) · 𝛿(2) + (1 − β) · 𝛿(3)
From this, we can see that the contribution of each gradient decreases as we go back in time, with the rate of decrease determined by (β).
For (β = 0.1), the contributions are (0.1² = 0.01) or 1% for (𝛿(1)), (0.1*(1–0.1) = 0.09) or 9% for (𝛿(2)), and (1–0.1 = 0.9) or 90% for (𝛿(3)).
For (β = 0.9), the contributions are (0.9² = 0.81) or 81% for (𝛿(1)), (0.9 *(1–0.9) = 0.09) or 9% for (𝛿(2)), and (1–0.9 = 0.1) or 10% for (𝛿(3)).
What if β is 0.1? At t = 3, relative to the newest gradient, the gradient at t = 2 carries only 10% as much weight (a factor of β), and the gradient at t = 1 only 1% as much (a factor of β²), so older gradients fade away very quickly.
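These percentages are easy to verify numerically. The sketch below (with the illustrative helper name contribution_weights, using the seeding 𝓥(1) = 𝛿(1) from above) prints the weight each gradient receives at t = 3:

def contribution_weights(beta, t=3):
    # Expansion of V(t) = beta * V(t-1) + (1 - beta) * delta(t), seeded V(1) = delta(1):
    # the oldest gradient gets beta^(t-1); later ones get beta^(t-k) * (1 - beta).
    weights = [beta ** (t - 1)]
    weights += [beta ** (t - k) * (1 - beta) for k in range(2, t + 1)]
    return weights

print(contribution_weights(0.1))  # ≈ [0.01, 0.09, 0.90] → 1%, 9%, 90%
print(contribution_weights(0.9))  # ≈ [0.81, 0.09, 0.10] → 81%, 9%, 10%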
The X-axis represents time: how many steps or iterations back in the optimization process a gradient was computed. The Y-axis represents the weight that gradient still carries, which decays at a rate determined by the (β) parameter in the Momentum-based Gradient Optimizer. Here’s what each line in the graph represents:
- The blue line represents (β = 0.95). As you can see, this line decays very slowly, indicating that a high (β) value like 0.95 gives significant weight to past gradients even far back in time.
- The green line represents (β = 0.90). This line decays at a moderate rate, showing that a (β) value of 0.90 still gives considerable weight to past gradients, but less so than 0.95.
- The orange line represents (β = 0.60). This line decays quite quickly, indicating that a lower (β) value like 0.60 reduces the influence of past gradients more rapidly.
- The red line represents (β = 0.00). This line drops straight down to zero, showing that a (β) value of 0 does not consider past gradients at all.
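The figure itself is not reproduced here, but a short sketch (assuming matplotlib is available, and that the curves simply plot the decay factor βᵏ against the number of steps k back in time) recreates it:

import numpy as np
import matplotlib.pyplot as plt

steps = np.arange(0, 30)
for beta, color in [(0.95, 'blue'), (0.90, 'green'), (0.60, 'orange'), (0.00, 'red')]:
    # Weight carried by a gradient k steps in the past: beta ** k
    plt.plot(steps, beta ** steps, color=color, label=f'beta = {beta:.2f}')

plt.xlabel('Steps back in time (k)')
plt.ylabel('Weight on a gradient k steps old')
plt.legend()
plt.show()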
Let’s highlight two common challenges in gradient descent: the impact of learning rate on convergence speed and the issue of oscillations when the learning rate is too high. When the learning rate is too small, weight updates are tiny, leading to slow convergence. Even when the gradient is high, the small learning rate limits the magnitude of weight updates, prolonging the optimization process. Conversely, when the learning rate is too high, weight updates are too large, causing the optimization process to oscillate around the minimum of the cost function. These oscillations prevent the algorithm from converging to the minimum and may even diverge, leading to instability.
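A quick numerical sketch (again assuming the simple quadratic f(w) = w², whose gradient is 2w) makes both failure modes concrete:

def gd(lr, steps=20, w=5.0):
    # Plain gradient descent on f(w) = w**2, which has its minimum at w = 0.
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(gd(lr=0.001))  # too small: after 20 steps w has barely moved from 5
print(gd(lr=0.90))   # high: each step overshoots 0, so w oscillates in sign
print(gd(lr=1.05))   # too high: the oscillations grow and the iterates diverge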
Now, let’s summarize how momentum helps mitigate these issues (the short sketch after this list demonstrates both cases):
- Case 1: When all the past gradients have the same sign: The momentum term becomes large due to the accumulation of gradients in the same direction. This results in larger steps during weight updates, accelerating the descent even when the learning rate is low. This helps address the issue of slow convergence when the learning rate is too small.
- Case 2: When the gradients alternate between positive and negative: The momentum term becomes small due to the cancellation of positive and negative gradients. This results in smaller weight updates, damping the oscillations around the minima. This helps address the issue of oscillations when the learning rate is too high.
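Here is a minimal sketch of both cases (the helper name momentum_sum is illustrative), feeding hand-picked gradient sequences through the EMA recursion:

def momentum_sum(grads, beta=0.9):
    # Accumulate gradients with the EMA recursion used by the momentum term.
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

same_sign   = [1.0] * 10        # Case 1: every gradient points the same way
alternating = [1.0, -1.0] * 5   # Case 2: the gradient flips sign each step

print(momentum_sum(same_sign))    # ≈ 0.65: the term builds up → larger steps
print(momentum_sum(alternating))  # ≈ -0.03: terms cancel → damped oscillations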
In the realm of optimization algorithms for training neural networks, one size doesn’t fit all. Traditional gradient descent methods often grapple with challenges like selecting an appropriate learning rate and handling oscillations, which hinder their efficiency and convergence speed. To address these shortcomings, a family of adaptive gradient algorithms has emerged, offering tailored solutions to these optimization hurdles.
In Python, you can implement momentum in gradient descent using the following code snippet.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Build a simple linear regression model
model = Sequential()
model.add(Dense(units=1, input_dim=1))
# Compile the model with SGD optimizer and momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss='mean_squared_error')
# Train the model
model.fit(X, y, epochs=1000, verbose=0)
In this example, we use TensorFlow to create a simple linear regression model, compile it with the SGD optimizer and momentum, and then train the model on the generated data.
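As a quick sanity check (the exact numbers will vary with training), the fitted parameters can be read back with get_weights(); they should land near the true slope of 3 and intercept of 4 used to generate the data:

weights, bias = model.get_weights()
print(weights)  # expected to be close to [[3.]]
print(bias)     # expected to be close to [4.]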
Closing note — As we wrap up Part 6, we’ve explored the concept of momentum and its role in optimization. Let this understanding propel you forward in your journey towards mastering gradient descent. Stay focused, stay committed. Part 7 awaits, offering further revelations. Until then, keep striving for excellence, keep pushing boundaries, and let’s continue our exploration of optimization techniques together.