Mastering Gradient Descent: Optimizing Neural Networks with Precision.

om pramod
Mar 10, 2024


Part 3: The Role of Learning Rate in Optimization

The learning rate, also known as the step size or alpha (α), is a crucial hyperparameter in the gradient descent algorithm. It sets the size of the steps taken toward the minimum of the function, giving us control over how far the parameters move on each update. The choice of learning rate can greatly affect both the efficiency and the effectiveness of gradient descent.
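To make the update rule concrete, here is a minimal sketch on a toy quadratic f(w) = w² (the function and the value of α below are illustrative, not from any particular model):

```python
# Minimal gradient descent on the toy function f(w) = w^2, whose gradient is 2w.
# The learning rate alpha scales every step taken along the negative gradient.
def gradient_descent(alpha, w0=5.0, steps=20):
    w = w0
    for _ in range(steps):
        grad = 2 * w          # gradient of f(w) = w^2 at the current point
        w = w - alpha * grad  # the core update: w <- w - alpha * gradient
    return w

print(gradient_descent(alpha=0.1))   # approaches the minimum at w = 0
```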

Figure: Results for various learning rates.

Small learning rate: If the learning rate is too small, the algorithm converges very slowly, because each step in the direction of the negative gradient is tiny and it takes many iterations to reach the minimum. The advantage, however, is that the small, careful steps trace the loss surface more thoroughly and make convergence more stable.

Large learning rate: On the other hand, if the learning rate is too large, the algorithm might overshoot the optimal point and diverge, resulting in a failure to converge, or poor performance of the model. This is because with a large learning rate, we’re taking big steps in the direction of the negative gradient and we might skip the minimum point.
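Continuing the toy quadratic from above, a quick comparison of a too-small, a reasonable, and a too-large learning rate (the specific values of α are assumptions chosen for illustration):

```python
# On f(w) = w^2 the update is w <- w - alpha * 2w = w * (1 - 2*alpha),
# so the iterates shrink when |1 - 2*alpha| < 1 and blow up otherwise.
for alpha in (0.01, 0.1, 1.1):
    w = 5.0
    for _ in range(20):
        w = w - alpha * 2 * w
    print(f"alpha={alpha}: w after 20 steps = {w:.4f}")
# alpha=0.01 creeps slowly toward 0, alpha=0.1 converges quickly,
# and alpha=1.1 overshoots on every step and diverges.
```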

The challenge is to find a good balance: a learning rate that is large enough to make rapid progress, yet small enough to avoid overshooting the minimum. This often involves tuning and it can depend on the specific dataset and problem you’re working on. Unfortunately, there is no magic formula to find the right learning rate. Most of the time, you have to fumble around and try several values before you find the right one. This is called hyperparameter tuning, and there are different strategies to do this properly.


Two of the most common techniques for scheduling the learning rate are:

  • Constant learning rate: as the name suggests, we initialize a learning rate and don’t change it during training;
  • Learning rate decay: we select an initial learning rate, then gradually reduce it in accordance with a scheduler.

In practice, it’s common to start with a larger learning rate and then reduce it over time as you get closer to the minimum. This approach, called learning rate decay or learning rate scheduling, dynamically adjusts the learning rate during training.

  • Initial Phase: At the start of the training, we can afford to make larger updates to our parameters, as the initial parameters are usually far from the optimal ones. So, we start with a larger learning rate.
  • Middle Phase: As training progresses, our parameters get closer to the optimal ones. Now, large updates can lead to overshooting the minimum. So, we gradually reduce the learning rate. This is where learning rate decay comes into play.
  • Final Phase: Towards the end of the training, we want to converge to the minimum, so we continue to reduce the learning rate to make smaller and smaller updates.

There are several strategies to reduce the learning rate over time:

  1. Step Decay: Reduce the learning rate by some factor every few epochs. For example, we might halve the learning rate every 5 epochs.
  2. Exponential Decay: The learning rate is decayed exponentially over time, following an exponential decay function.
  3. Inverse Square Root Decay: The learning rate decreases as the inverse square root of the number of training iterations.
  4. Adaptive Learning Rate: The learning rate is reduced based on a performance measure; if the error rate is not decreasing, the learning rate is lowered.
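As a rough sketch of what the first three schedules can look like in code (the decay constants below are arbitrary, illustrative choices):

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=5):
    # 1. Step decay: halve the learning rate every `epochs_per_drop` epochs.
    return lr0 * (drop ** (epoch // epochs_per_drop))

def exponential_decay(lr0, epoch, k=0.1):
    # 2. Exponential decay: lr = lr0 * exp(-k * epoch).
    return lr0 * math.exp(-k * epoch)

def inverse_sqrt_decay(lr0, iteration):
    # 3. Inverse square root decay: lr shrinks like 1 / sqrt(iteration).
    return lr0 / math.sqrt(iteration + 1)

for epoch in (0, 5, 10, 15):
    print(epoch, step_decay(0.1, epoch), round(exponential_decay(0.1, epoch), 5))
```

An adaptive schedule (strategy 4) reacts to the training signal itself; a sketch of that idea appears in the note below.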

Note –

Starting with a small value: A smaller learning rate could get closer to the minimum but would require more iterations to converge. Common starting points are 0.1, 0.01, or 0.001, and it’s often useful to tune this as a hyperparameter.

Adapting the learning rate: If the cost function is reducing very slowly, it might be helpful to increase the learning rate to speed up convergence. Conversely, if the cost function is exploding or being erratic, it might be beneficial to decrease the learning rate to allow the algorithm to converge.
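As a minimal sketch of this plateau-based adjustment (the function name and the patience and factor values here are illustrative; libraries such as PyTorch and Keras ship similar “reduce on plateau” schedulers):

```python
def adjust_learning_rate(lr, loss_history, patience=3, factor=0.5):
    # If the best loss of the last `patience` epochs is no better than the
    # best loss seen before them, training has plateaued: shrink the step.
    if len(loss_history) > patience and \
            min(loss_history[-patience:]) >= min(loss_history[:-patience]):
        return lr * factor
    return lr
```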

The manual tuning of the learning rate involves experimenting with different values and observing the behavior of the optimization process. However, this process can be time-consuming and may not be efficient, especially when dealing with large and complex models.

Several adaptive learning rate algorithms have been proposed to address the challenge of choosing a suitable learning rate automatically. They are designed to converge faster and more robustly. Here are explanations of a few popular adaptive learning rate algorithms:

Gradient Descent with Momentum (momentum-based gradient descent):

In standard gradient descent, the update at a given step depends only on the learning rate and the gradient at that step; it carries no information from previous steps. Momentum addresses this by maintaining a “velocity”, an exponentially decaying average of past gradients, and updating the parameters with that velocity instead of the raw gradient.
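A minimal sketch of the update rule (the alpha and beta values below are conventional defaults, not values from this article):

```python
# Momentum keeps a running "velocity": an exponentially decaying average of
# past gradients, so each update blends the new gradient with past direction.
def momentum_step(w, velocity, grad, alpha=0.01, beta=0.9):
    velocity = beta * velocity - alpha * grad  # accumulate gradient history
    w = w + velocity                           # step along the velocity, not the raw gradient
    return w, velocity
```

Plain gradient descent, lacking this memory of previous steps, can run into the following problems: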

Stagnation at saddle points or plateaus:

The term “saddle point” in the context of machine learning refers to a specific point in the optimization landscape of a cost function where the gradient is zero, but the point is neither a minimum nor a maximum. It’s a point where the surface of the cost function resembles a saddle, with some dimensions curving upward and others downward.

Figure: The red and green curves intersect at a generic saddle point in two dimensions. Along the green curve the saddle point looks like a local minimum, while along the red curve it looks like a local maximum.

When gradient descent encounters a saddle point, the gradients become very small, so the updates to the parameters also become very small. As a result, the algorithm makes little progress and appears to be stuck, or proceeds extremely slowly, as the figure below illustrates:

Figure: Gradient descent proceeds slowly near a saddle point, so the iterates cluster closely around the saddle points.

For example, consider the function f(x) = x³. Its derivative is f′(x) = 3x², which vanishes only at x = 0 (as does the second derivative, f″(x) = 6x). However, f′(x) = 3x² > 0 for every x ≠ 0, so the function is increasing on both sides of x = 0: it takes values below f(0) for x < 0 and above f(0) for x > 0. Therefore x = 0 is neither a local maximum nor a local minimum; it is a saddle point of f(x) = x³.

In a plot of f(x) = x³, the curve rises on both sides of the origin and flattens out at the point of inflection x = 0. A plot of the derivative f′(x) = 3x² touches zero exactly at x = 0, confirming that the gradient vanishes there even though the point is not a minimum.
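A tiny numerical experiment makes the stall visible (the starting point and step size are arbitrary choices):

```python
# Gradient descent on f(x) = x^3, starting just right of the saddle at x = 0.
# The gradient f'(x) = 3x^2 shrinks quadratically as x approaches 0,
# so each step gets smaller and progress all but stops.
x, alpha = 1.0, 0.01
for step in range(1, 1001):
    x = x - alpha * 3 * x**2
    if step in (1, 10, 100, 1000):
        print(f"step {step:>4}: x = {x:.5f}")
```

Even after a thousand steps the iterate is still creeping toward x = 0: the vanishing gradient near the saddle is doing exactly what the text describes.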

In the context of optimization, both saddle points and minima are characterized by a gradient (or derivative) of zero. This means that in both cases, there’s no slope, making it difficult for the algorithm to determine which direction to proceed. A key difference, however, is that at a minimum (either local or global), the function’s value is lower than at all neighboring points, while a saddle point is not a peak or valley — the function’s value can be both higher and lower in different directions. Distinguishing between saddle points and minima is challenging, and optimization algorithms may mistake a saddle point for a minimum.

A plateau is a flat region of the loss landscape where the gradients are very small. This is often the case when using activation functions like the sigmoid or hyperbolic tangent, which have flat regions in their output. When an optimization algorithm such as gradient descent encounters a plateau, the gradients are close to zero, so the parameter updates are also very small. This can significantly slow down the learning process, as the algorithm takes tiny steps without making much progress toward the minimum; in some cases the learning process might even appear to stagnate, with the algorithm “stuck” on the plateau. For example, the sigmoid function takes a real-valued input and squashes it to the range between 0 and 1. When the input is very large or very small, the function saturates at these extremes, causing the gradient to be nearly zero. This leads to a plateau effect during backpropagation, where the weights and biases don’t get updated effectively.

Let’s consider another example: tanh, which takes a real-valued input and squashes it to the range between -1 and 1. Like the sigmoid function, tanh has a nearly flat gradient for large positive or negative inputs, which leads to a plateau during backpropagation.

In both cases, the derivative of the function becomes very small (nearly zero) for large positive or negative inputs. This flat region, or plateau, can slow down learning during the training of neural networks. This is one of the reasons why ReLU (Rectified Linear Unit) and its variants, like Leaky ReLU and Parametric ReLU, are commonly used: they help mitigate the problem of small gradients and the resulting slow learning on the plateaus of the loss landscape.
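A quick check of the derivatives makes the plateau concrete (the sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (0.0, 5.0, 10.0):
    d_sigmoid = sigmoid(x) * (1.0 - sigmoid(x))   # sigmoid'(x)
    d_tanh = 1.0 - np.tanh(x) ** 2                # tanh'(x)
    d_relu = 1.0 if x > 0 else 0.0                # ReLU'(x) for x != 0
    print(f"x={x:>4}: sigmoid'={d_sigmoid:.2e}  tanh'={d_tanh:.2e}  relu'={d_relu}")
# For x = 10, sigmoid' and tanh' are effectively zero (the plateau),
# while ReLU's derivative stays at 1 for any positive input.
```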

Note – while ReLU and its variants can alleviate the problem of vanishing gradients, they are not without their own issues. ReLU is a popular activation function defined as f(x) = max(0, x). It’s simple and efficient, but it has a significant drawback known as the “dying ReLU” problem: a ReLU unit can become “dead”, i.e., stop learning entirely, typically after a large gradient update pushes its weights so that its input is negative for every example. In simple terms, a ReLU “dies” when it starts outputting 0 for all inputs (because max(0, x) = 0 for x < 0), and once a ReLU dies, it’s unlikely to come back to life.

During training, the weights of neural networks are updated using gradient descent-based optimization algorithms. However, if a neuron’s pre-activation is always negative, it consistently outputs zero, and the derivative of ReLU is 0 for negative inputs. During backpropagation, no gradient flows back through the neuron, so the weights associated with it stop updating and the neuron remains inactive, or “dies”. If this happens often, the neuron is likely to remain dead and always output 0.
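A minimal illustration of a dead neuron (the weights and inputs are made up for the example):

```python
import numpy as np

# One ReLU neuron: output = max(0, w @ x + b). If the pre-activation z is
# negative for every input, ReLU'(z) = 0 and no gradient reaches w or b.
x = np.array([1.0, 2.0])
w = np.array([-3.0, -3.0])   # weights pushed far negative, e.g. by one huge update
b = 0.0

z = float(w @ x + b)         # pre-activation: -9.0
relu_grad = 1.0 if z > 0 else 0.0
grad_w = relu_grad * x       # dL/dw is proportional to ReLU'(z) * x = 0 here
print(z, grad_w)             # -9.0 [0. 0.] -> the weights never update; the neuron is dead
```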

Closing note — As we draw to a close in Part 3, you’ve delved into the crucial role of learning rate in optimization. Let this understanding empower your journey towards crafting efficient learning algorithms. Stay steadfast, stay eager. Part 4 is on the horizon, promising further enlightenment. Until then, continue to refine your skills, continue to seek knowledge, and let’s navigate the complexities of machine learning together!
