A (Very Short) Visual Introduction to Learning Rate Schedulers (With Code)

Théo Martin
6 min read · Jul 9, 2023

Learning rate is one of the most important hyperparameters in the training of neural networks, impacting the speed and effectiveness of the learning process. A learning rate that is too high can cause the model to oscillate around the minimum, while a learning rate that is too low can cause the training process to be very slow or even stall. This article provides a visual introduction to learning rate schedulers, which are techniques used to adapt the learning rate during training.

What is a Learning Rate?

In the context of machine learning, the learning rate is a hyperparameter that determines the step size at which an optimization algorithm (like gradient descent) proceeds while attempting to minimize the loss function.
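To make this concrete, here is a toy sketch (not part of the original code) of gradient descent on the one-dimensional loss (w - 3)^2, where the learning rate directly scales how far each update moves the parameter:

# Toy example: minimize the loss (w - 3)^2 with plain gradient descent.
learning_rate = 0.1
w = 10.0                          # start far from the optimum at w = 3
for step in range(100):
    grad = 2 * (w - 3)            # gradient of the loss with respect to w
    w = w - learning_rate * grad  # the update; the learning rate is the step size
print(round(w, 4))                # approximately 3.0

With this loss, a learning rate above 1.0 makes each update overshoot so badly that w oscillates with growing amplitude and diverges, while a learning rate of, say, 0.001 needs thousands of steps to get close to the optimum.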

Now, let’s move on to learning rate schedulers.

What is a Learning Rate Scheduler?

A learning rate scheduler is a method that adjusts the learning rate during the training process, often lowering it as the training progresses. This helps the model to make large updates at the beginning of training when the parameters are far from their optimal values, and smaller updates later when the parameters are closer to their optimal values, allowing for more fine-tuning.
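As a toy sketch of where a scheduler fits (again, not from the original code), the same kind of loop can recompute the learning rate at the start of every epoch, here with a simple rule that halves it every 10 epochs:

# Toy sketch: the learning rate is recomputed by a schedule at each epoch.
def schedule(epoch, initial_lr=0.4, decay_factor=0.5, step_size=10):
    return initial_lr * decay_factor ** (epoch // step_size)

w = 10.0                   # same toy problem: minimize (w - 3)^2
for epoch in range(50):
    lr = schedule(epoch)   # large steps early, smaller steps later
    grad = 2 * (w - 3)
    w = w - lr * grad
print(round(w, 4))         # approximately 3.0

The schedulers below differ only in how the schedule function computes the learning rate from the epoch index.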

Several learning rate schedulers are widely used in practice. In this article, we will focus on three popular ones:

  1. Step Decay
  2. Exponential Decay
  3. Cosine Annealing

Let’s dive into each of these schedulers with visual examples.

1. Step Decay

Step decay reduces the learning rate by a constant factor every few epochs. The form of the step decay is defined as:

lr = lr_0 * d^floor((1 + epoch) / s)

where:

  • lr_0​ is the initial learning rate,
  • d is the decay rate,
  • s is the step size, and
  • epoch is the index of the epoch.

Let’s visualize this with a toy example.

import numpy as np
import matplotlib.pyplot as plt

# Parameters
initial_lr = 1.0
decay_factor = 0.5
step_size = 10
max_epochs = 100

# Generate learning rate schedule
lr = [
    initial_lr * (decay_factor ** np.floor((1 + epoch) / step_size))
    for epoch in range(max_epochs)
]

# Plot
plt.figure(figsize=(10, 7))
plt.plot(lr)
plt.title('Step Decay Learning Rate Scheduler')
plt.ylabel('Learning Rate')
plt.xlabel('Epoch')
plt.grid()
plt.show()

The plot clearly demonstrates the staircase nature of the step decay scheduler, with the learning rate dropping by a factor of 0.5 every 10 epochs.

2. Exponential Decay

Exponential decay reduces the learning rate continuously, scaling the initial rate by a factor that shrinks exponentially with the epoch. The form of the exponential decay is defined as:

lr = lr_0 * exp(-k * epoch)

where:

  • lr_0​ is the initial learning rate,
  • k is the decay rate, and
  • epoch is the index of the epoch.

# Parameters
initial_lr = 1.0
decay_rate = 0.05
max_epochs = 100

# Generate learning rate schedule
lr = [
    initial_lr * np.exp(-decay_rate * epoch)
    for epoch in range(max_epochs)
]

# Plot
plt.figure(figsize=(10, 7))
plt.plot(lr)
plt.title('Exponential Decay Learning Rate Scheduler')
plt.ylabel('Learning Rate')
plt.xlabel('Epoch')
plt.grid()
plt.show()

The plot shows the learning rate decaying smoothly and exponentially as the number of epochs increases.

3. Cosine Annealing

Cosine annealing reduces the learning rate using a cosine-based schedule. The form of the cosine annealing is defined as:

lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(π * epoch / max_epochs))

where:

  • lr_min​ is the minimum learning rate,
  • lr_max​ is the maximum learning rate, and
  • epoch and max_epochs are the current and maximum number of epochs respectively.

# Parameters
lr_min = 0.001
lr_max = 0.1
max_epochs = 100

# Generate learning rate schedule
lr = [
    lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(epoch / max_epochs * np.pi))
    for epoch in range(max_epochs)
]

# Plot
plt.figure(figsize=(10, 7))
plt.plot(lr)
plt.title("Cosine Annealing Learning Rate Scheduler")
plt.ylabel("Learning Rate")
plt.xlabel("Epoch")
plt.show()

As observed in the plot, the learning rate decreases following a cosine function, starting from the maximum learning rate and going down to the minimum learning rate. This is characteristic of the cosine annealing learning rate scheduler.

Conclusion

Learning rate schedulers are an important tool in the machine learning practitioner’s toolkit, providing a mechanism to adjust the learning rate over time, which can help to improve the efficiency and effectiveness of the training process. The best learning rate scheduler to use can depend on the specific problem and dataset, and it is often helpful to experiment with different schedulers to see which one works best.
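For readers training models with PyTorch, the three schedulers covered above have built-in counterparts in torch.optim.lr_scheduler; a minimal sketch (the model, data, and loss are placeholders):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

# Pick one of the equivalents of the schedulers covered above:
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)

for epoch in range(100):
    # ... forward pass, loss.backward() and optimizer.step() go here ...
    scheduler.step()  # updates the optimizer's learning rate once per epoch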

Bonus

Here are some more learning rate schedules combined in one plot.

import numpy as np
import matplotlib.pyplot as plt


def polynomial_decay_schedule(initial_lr: float, power: float, max_epochs: int = 100) -> np.ndarray:
    """
    Generate a polynomial decay learning rate schedule.

    Args:
        initial_lr: The initial learning rate.
        power: The power of the polynomial.
        max_epochs: The maximum number of epochs.

    Returns:
        An array of learning rates for each epoch.
    """
    epochs = np.arange(max_epochs)
    lr = initial_lr * ((1 - (epochs / max_epochs)) ** power)
    return lr


def natural_exp_decay_schedule(initial_lr: float, decay_rate: float, max_epochs: int = 100) -> np.ndarray:
    """
    Generate a natural exponential decay learning rate schedule.

    Args:
        initial_lr: The initial learning rate.
        decay_rate: The decay rate.
        max_epochs: The maximum number of epochs.

    Returns:
        An array of learning rates for each epoch.
    """
    epochs = np.arange(max_epochs)
    lr = initial_lr * np.exp(-decay_rate * epochs)
    return lr


def staircase_exp_decay_schedule(initial_lr: float, decay_rate: float, step_size: int, max_epochs: int = 100) -> np.ndarray:
    """
    Generate a staircase exponential decay learning rate schedule.

    Args:
        initial_lr: The initial learning rate.
        decay_rate: The decay rate.
        step_size: The step size.
        max_epochs: The maximum number of epochs.

    Returns:
        An array of learning rates for each epoch.
    """
    epochs = np.arange(max_epochs)
    lr = initial_lr * np.exp(-decay_rate * np.floor((1 + epochs) / step_size))
    return lr


def step_decay_schedule(initial_lr: float, decay_factor: float, step_size: int, max_epochs: int = 100) -> np.ndarray:
    """
    Generate a step decay learning rate schedule.

    Args:
        initial_lr: The initial learning rate.
        decay_factor: The decay factor.
        step_size: The step size.
        max_epochs: The maximum number of epochs.

    Returns:
        An array of learning rates for each epoch.
    """
    epochs = np.arange(max_epochs)
    lr = initial_lr * (decay_factor ** np.floor((1 + epochs) / step_size))
    return lr


def cosine_annealing_schedule(lr_min: float, lr_max: float, max_epochs: int = 100) -> np.ndarray:
    """
    Generate a cosine annealing learning rate schedule.

    Args:
        lr_min: The minimum learning rate.
        lr_max: The maximum learning rate.
        max_epochs: The maximum number of epochs.

    Returns:
        An array of learning rates for each epoch.
    """
    epochs = np.arange(max_epochs)
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(epochs / max_epochs * np.pi))
    return lr


def exponential_decay_schedule(initial_lr: float, decay_rate: float, max_epochs: int = 100) -> np.ndarray:
    """
    Generate an exponential decay learning rate schedule.

    Args:
        initial_lr: The initial learning rate.
        decay_rate: The decay rate.
        max_epochs: The maximum number of epochs.

    Returns:
        An array of learning rates for each epoch.
    """
    epochs = np.arange(max_epochs)
    lr = initial_lr * np.exp(-decay_rate * epochs)
    return lr


# Define the learning rate schedules
schedules = {
    "Step Decay": step_decay_schedule(initial_lr=1.0, decay_factor=0.5, step_size=10),
    "Exponential Decay": exponential_decay_schedule(initial_lr=1.0, decay_rate=0.05),
    "Cosine Annealing": cosine_annealing_schedule(lr_min=0.01, lr_max=1.0),
    "Polynomial Decay": polynomial_decay_schedule(initial_lr=1.0, power=2),
    "Natural Exp. Decay": natural_exp_decay_schedule(initial_lr=1.0, decay_rate=0.05),
    "Staircase Exp. Decay": staircase_exp_decay_schedule(initial_lr=1.0, decay_rate=0.05, step_size=10),
}

# Define a color palette
colors = ['b', 'g', 'r', 'c', 'm', 'y']

# Plot with defined colors
plt.figure(figsize=(15, 10))
for color, (schedule_name, schedule) in zip(colors, schedules.items()):
    plt.plot(schedule, label=schedule_name, color=color)

plt.title('Learning Rate Schedules', fontsize=20)
plt.ylabel('Learning Rate', fontsize=15)
plt.xlabel('Epoch', fontsize=15)
plt.grid(True, which='both', linestyle='--', linewidth=0.6)
plt.minorticks_on()
plt.legend(prop={'size': 12})
plt.show()

Théo Martin

Senior Machine Learning Engineer @ unifai.fr, I like writing rather short and straight to the point articles.