Why Optimize with Momentum
In this post, we'll explain what Momentum is and why it is a simple, effective improvement over Stochastic Gradient Descent. We also show a minimal code example on the MNIST dataset where adding Momentum improves the model's accuracy and lowers its training loss.
1. Exponential Smoothing (or Exponentially Weighted Averages)
When looking at noisy time series data, such as the training/validation error graphs in TensorBoard, you may notice that the raw values jump around quite a bit. Quite often there is still a visible trend, and that trend becomes much more obvious once you add some smoothing to the raw values.
Exponential Smoothing is one of the simplest ways to add smoothing to your data.
Exponential Smoothing Formula:
s_t = momentum * s_(t-1) + (1 - momentum) * x_t
where x_t is the raw value at time step t and s_t is the smoothed value.
In the above equation, momentum specifies the amount of smoothing we want. A typical value for momentum is 0.9. From this equation, we can see that the smoothed value at time step t takes into account the values from time steps (t - 1) and earlier. The weight given to previous time steps drops off exponentially, so the most recent time steps have the greatest impact.
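As a quick illustration, below is a minimal sketch (assuming NumPy is available) that applies this smoothing to a synthetic noisy loss curve. The function name and the synthetic data are made up for illustration only.
import numpy as np

def exponential_smoothing(values, momentum=0.9):
    # s_t = momentum * s_(t-1) + (1 - momentum) * x_t
    smoothed = []
    s = values[0]  # initialize with the first raw value
    for x in values:
        s = momentum * s + (1 - momentum) * x
        smoothed.append(s)
    return np.array(smoothed)

# Synthetic "training loss": a downward trend plus random noise
noisy = np.exp(-np.linspace(0, 3, 200)) + 0.1 * np.random.randn(200)
smooth = exponential_smoothing(noisy, momentum=0.9)
# Plotting `noisy` and `smooth` side by side shows the trend much more clearly.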
2. Momentum Optimizer
The momentum optimizer applies this same idea to the gradients when optimizing a loss function. In this context, the Exponential Smoothing is called Momentum: the optimizer accumulates a velocity from the gradients of previous time steps, so the direction it has been travelling in continues to influence the present update.
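To make this concrete, here is a minimal sketch of the SGD-with-momentum update on a toy one-dimensional problem (a simplified illustration of the update rule, not TensorFlow's exact implementation).
# Toy problem: minimize f(w) = (w - 3)^2 with SGD + Momentum
momentum = 0.9
learning_rate = 0.01
w, velocity = 0.0, 0.0

for step in range(200):
    grad = 2 * (w - 3)                                     # gradient of (w - 3)^2
    velocity = momentum * velocity - learning_rate * grad  # accumulate past gradients
    w += velocity                                          # move along the accumulated velocity

print(w)  # approaches 3.0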
from __future__ import absolute_import, division, print_function, unicode_literals
# Import TensorFlow and load the MNIST dataset
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0
Example Without Momentum
Running the code below, the training loss reaches about 0.4 after the first epoch, and the accuracy reaches roughly 94% after 5 epochs. This gives us a baseline against which we can compare the effect of adding Momentum to Stochastic Gradient Descent.
import matplotlib.pyplot as plt

# Simple fully connected network for MNIST
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Baseline: plain Stochastic Gradient Descent (momentum=0.0)
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0)

model.compile(optimizer=sgd,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)

# Plot training accuracy values
plt.plot(history.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()

# Plot training loss values
plt.plot(history.history['loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()
Example With Momentum
After a single epoch, the optimizer reaches a loss of around 0.17, and the final accuracy after 5 epochs is ~97%. From this experiment, we can see that the momentum optimizer converges faster and reaches a better result on this problem. It is a quick demonstration that Momentum is an easy way to improve upon standard Stochastic Gradient Descent when training neural network models.
import matplotlib.pyplot as plt

# Same network as before
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Stochastic Gradient Descent with Momentum (momentum=0.9, Nesterov variant)
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

model.compile(optimizer=sgd,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)

# Plot training accuracy values
plt.plot(history.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()

# Plot training loss values
plt.plot(history.history['loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()
Conclusion
In this post, we explain what Momentum is and why it's a simple improvement upon Stochastic Gradient Descent. We also give a minimal code example showing how Momentum can improve both the accuracy and the training loss of a Deep Learning model.