Learning Rates and the Convergence of Gradient Descent — Understanding Efficient BackProp Part 3

Vaishnavi Nandakumar · Published in The Startup · 11 min read · Jul 27, 2020

Introduction

In the previous article, we dove into the fundamental theory behind standard backpropagation and introduced the different factors responsible for practically optimizing the process. If you need a small recap or haven't read it yet, you can check it out here. The third part of this blog series introduces yet another essential concept, learning rates, and demonstrates theoretically how different choices affect the convergence of a network.

Overview
1. Introduction to Learning Rates
2. Deriving the optimal learning rate in 1D — Taylor series method
3. Deriving the optimal learning rate in 1D — graphical method
4. Deriving the optimal learning rate in multiple dimensions
5. Introduction to Hessian Matrices
6. Eigenvectors and eigenvalues
7. Revisiting input transformations

Please note that all the images and diagrams in this article have been taken from the research paper and the other sources mentioned in the References below.

Introduction to Learning Rates

You have probably come across this term many times while reading different resources such as technical articles and tutorials. Personally, I've found that most of the time this term is described only briefly and thrown into the mix without much explanation. However, as I read more about it, I realized the concept is a lot more important than I thought it was.

So what do we mean by the 'learning rate' of a model?

The learning rate (also known as the 'step size') is a positive hyperparameter that determines how far a model moves along the gradient each time the weights are updated. Hence, the value of the network's learning rate needs to be carefully selected. If the value is too small, training requires more epochs and takes more time, whereas if the value is too large, the weights can overshoot the minimum and oscillate, or even diverge, so the network never converges efficiently.

This is where we introduce another term, the optimal learning rate. This can be defined as the learning rate that moves the weight precisely to its minimum (most efficient) value in a single step. The next few sections of this article will focus on how to derive the optimal learning rate in both one dimension and multiple dimensions.

But before we get on to the derivations, let's see how different learning rates affect the efficiency of a model through the figure given below. (Here, η and ηopt represent the learning rate and the optimal learning rate respectively.)

Gradient Descent for different learning rates (Fig 6(i) in Source Paper)

The figure above illustrates four cases, each showing graphically how the value of the learning rate relative to the optimal learning rate affects convergence. These four cases can be categorized as follows.

Case 1: η < ηopt
The step size is small and the convergence requires multiple time steps.
Case 2: η = ηopt
This is the ideal case where the point of convergence is reached in a single step.
Case 3: ηopt < η < 2ηopt
For learning rates in this range, the weight will oscillate around Wmin but eventually converge.
Case 4: η > 2ηopt
This is the worst-case scenario, which results in divergence. That is, the value of the weight ends up farther from Wmin than before.
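To make these four regimes concrete, here is a minimal numpy sketch (my own toy example, not code from the paper) that runs gradient descent on a one-dimensional quadratic E(w) = ½·a·(w − Wmin)², for which the curvature is the constant a and, as derived below, ηopt = 1/a:

```python
import numpy as np

# Toy quadratic error E(w) = 0.5 * a * (w - w_min)**2, so dE/dw = a * (w - w_min)
# and the second derivative is the constant a, giving eta_opt = 1 / a.
a, w_min = 2.0, 1.0
eta_opt = 1.0 / a

def descend(eta, w=5.0, steps=8):
    """Run gradient descent from w and return the weight trajectory."""
    traj = [w]
    for _ in range(steps):
        w = w - eta * a * (w - w_min)   # w <- w - eta * dE/dw
        traj.append(w)
    return traj

for label, eta in [("Case 1: eta < eta_opt     ", 0.5 * eta_opt),
                   ("Case 2: eta = eta_opt     ", eta_opt),
                   ("Case 3: eta_opt < eta < 2x", 1.5 * eta_opt),
                   ("Case 4: eta > 2 * eta_opt ", 2.5 * eta_opt)]:
    print(label, np.round(descend(eta), 3))
```

Printing the trajectories reproduces the four behaviors: steady convergence, convergence in a single step, damped oscillation around Wmin, and divergence.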

Now that we’ve seen how learning rates can influence the process of convergence, let’s see how we can derive these values in different dimensions.

Deriving the optimal learning rate in one-dimensional space

Method 1 — Taylor Series

The Taylor series of a function is the infinite sum of terms that are expressed in terms of that function's derivatives at a single point. It can be used to evaluate the function at any point if the function's value and all of its derivatives are known at that single point. The general formula for the Taylor series is given by:

f(a) + (f′(a)/1!)(x − a) + (f″(a)/2!)(x − a)² + (f‴(a)/3!)(x − a)³ + … (Source — https://en.wikipedia.org/wiki/Taylor_series)

which can be generalized into,

f(x) = Σ (from n = 0 to ∞) f⁽ⁿ⁾(a)/n! · (x − a)ⁿ (Source — https://en.wikipedia.org/wiki/Taylor_series)

Let's apply this to the error function we have at hand. It's best to have a pen and paper ready because this might be a little hard to follow. In our case, the Taylor series of the error function, expanded about the current weight Wc as the 'single point', is:

E(W) = E(Wc) + (W − Wc)·dE(Wc)/dW + ½(W − Wc)²·d²E(Wc)/dW² + … (Eqn 21 in Source Paper)

Differentiating both sides with respect to W, we get

dE(W)/dW = dE(Wc)/dW + (W − Wc)·d²E(Wc)/dW² + … (Eqn 22 in Source Paper)

NOTE

If E is a quadratic function, the second-order derivative is a constant and the higher-order terms vanish after this step.

If we substitute W = Wmin, use the fact that dE(Wmin)/dW = 0, and rearrange the equation, we are left with

Wmin = Wc − (d²E(Wc)/dW²)⁻¹ · dE(Wc)/dW (Eqn 23 in Source Paper)

Comparing this equation with the weight update formula,

W(t+1) = W(t) − η·dE(W)/dW (Eqn 11 in Source Paper)

We finally get,

ηopt = (d²E(Wc)/dW²)⁻¹ (Eqn 24 in Source Paper)
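As a quick numerical check of this result, here is a short sketch (the quadratic and the finite-difference helpers are my own choices, not from the paper) that builds ηopt from the second derivative and confirms that a single update of Eqn 11 lands exactly on the minimum:

```python
def E(w):
    # Arbitrary quadratic error with its minimum at w = 3
    return 2.0 * (w - 3.0) ** 2

def dE(w, h=1e-4):
    return (E(w + h) - E(w - h)) / (2 * h)            # first derivative

def d2E(w, h=1e-4):
    return (E(w + h) - 2 * E(w) + E(w - h)) / h ** 2  # second derivative

w_c = 10.0
eta_opt = 1.0 / d2E(w_c)           # Eqn 24: eta_opt = (d2E/dW2)^(-1)
w_new = w_c - eta_opt * dE(w_c)    # Eqn 11 with eta = eta_opt
print(eta_opt, w_new)              # ~0.25 and ~3.0: the minimum in one step
```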

Method 2 — Graphical Method

If the last method seemed too confusing, another way to calculate this value is via the graphical representation of the error function we used before.

Fig 6(ii) in Source Paper

The graph above plots the gradient of E as a function of W. Since E is quadratic, the gradient is a straight line, with a value of zero at the minimum and dE(Wc)/dW at the current weight. The slope of this line is given by

d²E(Wc)/dW² = (dE(Wc)/dW) / (Wc − Wmin) (Eqn 25 in Source Paper)

Solving this equation brings us back to the final few steps used in Method 1 after which we get the same value for the optimal learning rate.

NOTE

1. Although the learning rate that gives the fastest convergence is ηopt, the largest value of the learning rate that can be used without causing divergence is 2ηopt.
2. If the error function is not quadratic and has higher-order terms, the equation for Wmin is only an approximation. In that case, multiple iterations are required to find the minimum, though convergence will still be fast, as the sketch below illustrates.
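Here is a small sketch of point 2 on the toy non-quadratic error E(w) = w⁴ (my own example, not from the paper). Recomputing the local ηopt = 1/(d²E/dW²) at every step still converges quickly, but it now takes several iterations instead of one:

```python
# Non-quadratic error E(w) = w**4: the Taylor expansion has higher-order
# terms, so one "optimal" step only approximates the minimum at w = 0
# and the update must be iterated.
dE  = lambda w: 4 * w ** 3
d2E = lambda w: 12 * w ** 2

w = 2.0
for t in range(8):
    eta = 1.0 / d2E(w)      # local optimal rate, recomputed at each step
    w = w - eta * dE(w)     # equivalent to w <- w - dE/d2E (a Newton step)
    print(t, round(w, 5))   # shrinks by a factor of 2/3 per iteration
```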

Deriving the optimal learning rate in multidimensional space

In cases where more than one weight is involved, the error function cannot be represented the way we derived it before. Before we get into this derivation, we need to be clear on two topics: Hessian matrices, and the geometric interpretation of eigenvalues and eigenvectors.

Source — http://mlexplained.com/2018/02/02/an-introduction-to-second-order-optimization-for-deep-learning-practitioners-basic-math-for-deep-learning-part-1/

Introduction to Hessian Matrices

In a multidimensional space, there is a slope along each direction, and the rate of change of this slope in each direction gives a second-order derivative. In very simple terms, the Hessian matrix is the square matrix of all the second-order partial derivatives of a function, and it is defined for a function of any number of variables.

Each row describes how the gradient changes along one direction; equivalently, each row is the gradient of one component of the gradient (and since the Hessian of a twice continuously differentiable function is symmetric, the same holds for the columns).

H(f) =
[ ∂²f/∂x1²    ∂²f/∂x1∂x2  …  ∂²f/∂x1∂xn ]
[ ∂²f/∂x2∂x1  ∂²f/∂x2²    …  ∂²f/∂x2∂xn ]
[ ⋮           ⋮               ⋮         ]
[ ∂²f/∂xn∂x1  ∂²f/∂xn∂x2  …  ∂²f/∂xn²   ]
(Source — https://en.wikipedia.org/wiki/Hessian_matrix)

The matrix above shows the Hessian of a function f of n variables. For the error function E as a function of the weights, this can be generalized as

(H(W))ij = ∂²E(W) / ∂wi∂wj (Eqn 27 in Source Paper)

These second-order derivatives represent the curvature of a function, which is the basic concept behind Hessian matrices. Let us now learn how this can be applied to optimizing a neural network. For this, we need to understand the geometric theory behind it.
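As a hands-on illustration (the finite-difference helper below is my own, not something from the paper), the Hessian of a function of several variables can be estimated numerically and compared against the known answer:

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Estimate the Hessian of f at point x with central differences."""
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n) * h
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + I[i] + I[j]) - f(x + I[i] - I[j])
                       - f(x - I[i] + I[j]) + f(x - I[i] - I[j])) / (4 * h ** 2)
    return H

# Example: E(w) = w0^2 + 3*w0*w1 + 2*w1^2 has the constant Hessian [[2, 3], [3, 4]]
E = lambda w: w[0] ** 2 + 3 * w[0] * w[1] + 2 * w[1] ** 2
print(hessian(E, np.array([1.0, 1.0])))
```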

Eigenvectors and eigenvalues

The eigenvectors of a matrix are vectors whose direction does not change when multiplied by the matrix, and the corresponding eigenvalues give the factor by which each eigenvector is scaled. This can be represented as,

H·vi = λi·vi

where vi is an eigenvector and λi is the corresponding eigenvalue.

Now there are a few important points to be noted here. For the Hessian of an error function, each eigenvector represents a direction along which the curvature is independent of the other directions, while the curvature along each eigenvector is given by its eigenvalue.

Lines of the Error Function (Fig 7 in Source Paper)

This is the graph we get when we plot the contours of the loss function for a quadratic cost in two dimensions. The contours are sets of points where the loss function takes the same value. As we can see, the contours are ellipses whose axes are formed by the eigenvectors, while the eigenvalues give the curvature along each axis. I came across a really cool article on this concept which explains it in detail. A proper derivation of how Hessian matrices affect optimization, along with more theory on the loss function, eigenvalues, and eigenvectors, can be found here.
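These facts are easy to verify numerically. In the sketch below, the 2×2 Hessian is an arbitrary matrix of my own choosing; np.linalg.eigh returns its eigenvalues and eigenvectors, and the quadratic form vᵀHv shows that the curvature along a unit eigenvector is exactly its eigenvalue:

```python
import numpy as np

# An arbitrary symmetric Hessian; the off-diagonal terms mean the
# eigenvectors (the axes of the elliptical contours) are rotated
# away from the weight axes.
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, vecs = np.linalg.eigh(H)   # eigenvalues ascending; columns are eigenvectors
print("eigenvalues (curvatures):", lam)
print("eigenvectors (contour axes):\n", vecs)

# Curvature of E(W) = 0.5 * W^T H W along a unit eigenvector v equals lambda.
v = vecs[:, 0]
print(v @ H @ v, "==", lam[0])
```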

NOTE

1. The larger the eigenvalue, the faster the convergence along the direction of its corresponding eigenvector.
2. Every eigenvalue of H is also a measure of the covariance, or spread, of the input along the corresponding eigendirection.

So in multidimensional spaces, the ideal case is to have a different learning rate along each direction. If the eigendirections were lined up with the coordinate axes of the weights, we could assign different learning rates to different weights based on the eigenvalues. However, if the weights are correlated, a different approach is required: the Hessian matrix must be diagonalized so that the coordinate axes line up with the eigendirections.

Consider the rotation matrix Θ that diagonalizes the Hessian,

Λ = Θ H Θᵀ

where Λ is diagonal and ΘᵀΘ = I.

Using the quadratic approximation of the cost function in multiple dimensions,

E(W) = E(Wmin) + ½ (W − Wmin)ᵀ H (W − Wmin)

we can substitute H = Θᵀ Λ Θ to get

E(W) = E(Wmin) + ½ (W − Wmin)ᵀ Θᵀ Λ Θ (W − Wmin) (Eqn 35 in Source Paper)

Introducing the change of coordinates v = Θ(W − Wmin) then brings the equation to

E(v) = E(Wmin) + ½ vᵀ Λ v (Eqn 36 in Source Paper)

Applying the weight update formula in these new coordinates, the update is transformed into

v(t+1) = (I − ηΛ)·v(t) (Eqn 37 in Source Paper)

I know this derivation is not easy to understand; I spent hours trying to figure it out before it finally clicked. So in case you find yourself stuck somewhere, I'd recommend working through the core concepts behind the theory discussed in this article. For example, the article on Hessian matrices I linked before was a great help. Do give it a read!
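One thing that helped me was checking the algebra numerically. The sketch below (with an arbitrary Hessian and minimum of my own choosing) verifies that the change of coordinates v = Θ(W − Wmin) turns the coupled quadratic cost of Eqn 35 into the diagonal form of Eqn 36:

```python
import numpy as np

H = np.array([[2.0, 1.0],        # Hessian of the quadratic cost
              [1.0, 3.0]])
W_min = np.array([1.0, -1.0])    # location of the minimum

lam, Q = np.linalg.eigh(H)       # H = Q @ diag(lam) @ Q.T
Theta = Q.T                      # rotation with Theta @ H @ Theta.T = Lambda

def E(W):
    d = W - W_min                # quadratic cost, taking E(W_min) = 0
    return 0.5 * d @ H @ d

W = np.array([4.0, 2.0])         # an arbitrary weight vector
v = Theta @ (W - W_min)          # the change of coordinates
print(E(W))                      # coupled form (Eqn 35)
print(0.5 * v @ np.diag(lam) @ v)  # diagonal form (Eqn 36): the same number
```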

CONCLUSION

1. I − ηΛ is diagonal with diagonal components 1 − ηλi.
2. The equation will converge when |1 − ηλi| < 1 for every i, that is, when η < 2/λi along each eigendirection.
3. In order to avoid divergence when constrained to a single scalar learning rate, we must require η < 2/λmax.
4. For fastest convergence, we have η = 1/λmax.
5. Convergence time is proportional to λmax/λmin. That is, if λmin is much smaller than λmax, convergence will be slow along the direction of λmin.
6. Since we have rotated H to align with the coordinate axes, the update v(t+1) = (I − ηΛ)·v(t) consists of N independent one-dimensional equations. Therefore, we can choose a learning rate for each weight independently of the others, where the optimal rate for the ith component vi is ηi = 1/λi.
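The sketch below works directly in the rotated coordinates (the eigenvalues are my own illustrative values). It contrasts a single scalar rate, which leaves the small-eigenvalue direction converging slowly as in point 5, with the per-weight rates ηi = 1/λi of point 6, which reach the minimum in one step:

```python
import numpy as np

lam = np.array([0.5, 4.0])   # eigenvalues of H, i.e. the diagonal of Lambda
v0 = np.array([3.0, 3.0])    # initial error in the rotated coordinates

# A single scalar rate must satisfy eta < 2 / lam.max() = 0.5; even at
# eta = 1 / lam.max(), the small-eigenvalue direction decays slowly
# (point 5: convergence time scales with lam_max / lam_min).
eta = 1.0 / lam.max()
v = v0.copy()
for _ in range(5):
    v = (1.0 - eta * lam) * v             # Eqn 37, componentwise
print("scalar rate:    ", np.round(v, 3))  # second component 0, first still large

# Per-weight rates eta_i = 1 / lam_i (point 6): one step reaches the minimum.
eta_i = 1.0 / lam
v = (1.0 - eta_i * lam) * v0              # every component is multiplied by 0
print("per-weight rates:", v)             # exactly [0., 0.]
```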

DEMONSTRATION

In this example, we have 100 points drawn from two Gaussian classes, as shown in the graph below. The mean of each Gaussian determines where the class is centered: the two classes are centered at (0.4, -0.8) and (0.4, 0.8). The eigenvalues of the input covariance matrix are 0.84 and 0.036.

Now we take a simple linear network with two inputs, two weights, one output, and one bias, and train it with the mean squared error in batch mode. From point 3 given above, in order to avoid divergence the learning rate must satisfy η < 2/λmax.

Thus, in this case, η < 2/λmax = 2/0.84 ≈ 2.38.

The learning rates used in this example are 1.5 and 2.5. When implemented, their weight trajectories are represented in the graphs below.

Fig 10 in Source Paper

If batch mode is replaced by stochastic learning, the trajectory becomes a lot noisier, because only an estimate of the gradient is used at each iteration. Another difference to note is that in stochastic learning one epoch corresponds to 100 weight updates, whereas in batch mode one epoch corresponds to a single weight update.
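If you want to experiment with this setup yourself, here is a rough reproduction in numpy. The class centers match the example, but the spread of the Gaussians is my own guess, so the covariance eigenvalues, and hence the exact divergence threshold, will not match the 0.84 and 0.036 quoted above; the qualitative contrast between the two learning rates still appears:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes of 50 points each, centered as in the example.
# The spread (0.3) is a guess, not the paper's covariance.
X = np.vstack([rng.normal([0.4, -0.8], 0.3, size=(50, 2)),
               rng.normal([0.4,  0.8], 0.3, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

def train(eta, epochs=30):
    w, b = np.zeros(2), 0.0
    for _ in range(epochs):                   # batch mode: one update per epoch
        err = X @ w + b - y                   # linear unit with bias
        w = w - eta * (X.T @ err) / len(y)    # gradient of 0.5 * mean(err^2)
        b = b - eta * err.mean()
    return w, b

for eta in (1.5, 2.5):                        # one below, one above 2 / lambda_max
    w, b = train(eta)
    print(f"eta = {eta}: w = {np.round(w, 2)}, b = {np.round(b, 2)}")
```

With these settings the smaller rate settles to sensible weights while the larger one blows up, mirroring the batch-mode trajectories in the figure above.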

The paper also goes on to include a small section on multilayer networks. However, it is quite brief and only outlines the basic architecture. If you are interested in knowing more about it, you can check it out in the source paper linked in the References given below.

Revisiting input transformations

In the previous article we read all about different ways to practically optimize the process of backpropagation. The theory discussed in this article can be used to justify these techniques.

1. Subtracting mean from input variables
A nonzero mean in the input variables creates a very large eigenvalue in the Hessian. This makes convergence very slow, as the cost surface becomes very steep in some directions and shallow in others. This can be avoided by subtracting the mean from the input variables.

2. Normalization

Fig 8 in Source Paper

Consider the graph given above. Inputs with a large variation in spread along different directions lead to slow learning; this is why the variance of the input variables is normalized. However, even with normalized variances, correlated input variables cause the eigenvectors of H to be rotated away from the coordinate axes, so the error surface remains eccentric. This is represented in the comparison given below.

Fig 7 in Source Paper

3. Decorrelation
The graphical representation above explains why decorrelation is necessary. However, even after the input variables have been decorrelated, the gradient is not necessarily in the best descent direction. This problem is solved by assigning each weight its own learning rate, which makes the descent direction point directly towards the minimum.
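Putting the three tricks together, here is a compact preprocessing sketch (standard PCA-style whitening, not code from the paper). It decorrelates before rescaling so that the final covariance comes out as the identity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Inputs with a nonzero mean and correlated components (values are my own).
X = rng.multivariate_normal([2.0, -1.0],
                            [[4.0, 1.5],
                             [1.5, 1.0]], size=500)

X = X - X.mean(axis=0)            # 1. subtract the mean

cov = np.cov(X, rowvar=False)     # 3. decorrelate: rotate onto the
lam, Q = np.linalg.eigh(cov)      #    eigenvectors of the input covariance
X = X @ Q

X = X / X.std(axis=0)             # 2. normalize each variance to 1

print(np.round(X.mean(axis=0), 3))           # ~[0, 0]
print(np.round(np.cov(X, rowvar=False), 3))  # ~identity matrix
```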

Summary

This article started off with an introduction to optimal learning rates and their derivation in one and multiple dimensions using several methods. We were also able to explain the basics behind Hessian matrices and how their eigenvalues and eigenvectors govern the efficiency of the convergence process. The last section revisited some topics explored in the previous articles and showed how they relate to the concepts introduced in this one. Our next article will be the last one in the series. It will present the various second-order algorithms that can be used for optimization and offer a general comparison between them.

Thank you for reading!
