Standard backpropagation and practical optimization - Understanding Efficient BackProp Part 2

Vaishnavi Nandakumar
12 min read · Jul 19, 2020


Introduction

In the previous article, we explored the concepts behind generalization and how a simple predictive learning algorithm works by demonstrating an example and visualizing its outputs. If you haven't read it yet, you can find it here. This second part of the blog series focuses on the theory behind backpropagation and illustrates how it functions in practice. It also covers some tricks that can be used to improve the efficiency of the model and enable faster training.

Overview
1. Deriving standard backpropagation equations
2. Optimizing backpropagation
3. Stochastic vs Batch Learning
4. Shuffling Examples
5. Normalizing Inputs
6. The Sigmoid Function
7. Choosing target values
8. Initializing weights
9. Choosing learning rates
10. Radial basis functions vs Sigmoid units

In this article, I'll be covering the topics mentioned in the overview above. Do note that sections three to ten are not demonstrated in detail here; their theoretical and practical implementation will be explained further in upcoming posts.

Please note that all the images/diagrams in this article are taken from the research paper and the other sources mentioned in the references.

Standard backpropagation

Backpropagation is a very commonly used term in machine learning, and the method it names is fundamental to virtually every application. In very simple terms, backpropagation is the procedure used during training that repeatedly adjusts the model's parameters to minimize the loss function and improve its accuracy. Practically, backpropagation can be implemented in a few lines inside the training function. Understanding the concept theoretically, however, requires much more effort.

While getting to know the underlying mechanics of backpropagation, I came across an incredible article that really helped me understand the concept. The author also provides a GitHub repository implementing a practical example that demonstrates the theory involved. You can find both of them here.

So let’s start by deriving the equations required for backpropagation. Consider a multilayer feed forward neural network that implements the following function:
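(In the paper's notation, Wn is the weight matrix of the n-th layer and F is the activation function.)

```latex
Y_n = W_n X_{n-1}, \qquad X_n = F(Y_n)
```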

Here, Yn denotes the output value of the forward pass for layer n. This value is then passed to an activation function. These functions vary from problem to problem, but the resulting output (Xn) is used as the input to the next layer.

Depending on the number of hidden layers, the forward pass is applied repeatedly until a final output value is obtained. The model then computes the discrepancy between the predicted and expected output values using an error function. In most cases, the mean squared error (MSE) is used, given by the formula:
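In its usual form, with d_i the desired output, y_i the predicted output, and N the number of samples:

```latex
MSE = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2
```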

Source — https://www.dataquest.io/blog/understanding-regression-error-metrics/

Once the total cost is obtained, the model has enough information to perform the backward pass. In multilayer networks, manually computing each gradient individually is very difficult, so the chain rule is used. In fact, another way to look at backpropagation is as a repeated application of the chain rule.

According to the chain rule, given two differentiable functions f(x) and g(x):

1. If F(x) = (f ∘ g)(x), then F′(x) = f′(g(x)) · g′(x)

2. If y = f(u) and u = g(x), then dy/dx = (dy/du) · (du/dx)

With that in mind, if we apply the chain rule to the original forward-pass equations, we obtain the classical equations of backpropagation.
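In the paper's notation, these are (with F′ the derivative of the activation function, applied elementwise):

```latex
\frac{\partial E}{\partial Y_n} = F'(Y_n)\,\frac{\partial E}{\partial X_n}, \qquad
\frac{\partial E}{\partial W_n} = X_{n-1}\,\frac{\partial E}{\partial Y_n}, \qquad
\frac{\partial E}{\partial X_{n-1}} = W_n^{\top}\,\frac{\partial E}{\partial Y_n}
```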

Backpropagation works by calculating the first derivative of the error function with respect to each network weight. The derivative of the total error with respect to one weight measures how much that particular weight contributes to the total error. Hence, the weights most responsible for the overall error produce larger derivative values and are changed the most when gradient descent is applied. The weights are iteratively adjusted according to the following formula.
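In the simple gradient-descent case, this update is:

```latex
W(t) = W(t-1) - \eta \frac{\partial E}{\partial W}
```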

The value of the variable η depends on the procedure used. It can be a scalar denoting the learning rate, or it can be a diagonal matrix. This value is important to the model, and how to choose it well is discussed later.

The practical example I mentioned before demonstrates each step of this process perfectly, and I'd recommend reading it. The neural network used in that example is given below:

Source — https://mattmazur.com

I implemented the Python code provided and modified it a little to observe how the weight values change. The update shown in the output below is the result of training the model for just one epoch; the comparison shows how much the values already differ from the randomly initialized weights.
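If you'd like to try this yourself, here is a minimal sketch of that kind of experiment (not the exact code from the repository: a 2-2-2 network like Matt Mazur's, without bias terms for brevity, updated for a single step):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny 2-2-2 network, squared-error loss, plain gradient descent.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input -> hidden weights
W2 = rng.normal(size=(2, 2))   # hidden -> output weights
x = np.array([0.05, 0.10])     # single training input
d = np.array([0.01, 0.99])     # desired output
eta = 0.5                      # learning rate

print("W1 before:\n", W1, "\nW2 before:\n", W2)

# Forward pass: Yn = Wn @ Xn-1, Xn = F(Yn)
y1 = W1 @ x;  h = sigmoid(y1)
y2 = W2 @ h;  o = sigmoid(y2)

# Backward pass via the chain rule
dE_dy2 = (o - d) * o * (1 - o)          # dE/dY2 = F'(Y2) * dE/dX2
dE_dW2 = np.outer(dE_dy2, h)            # dE/dW2 = dE/dY2 * X1
dE_dy1 = (W2.T @ dE_dy2) * h * (1 - h)  # propagate the error to the hidden layer
dE_dW1 = np.outer(dE_dy1, x)

# Gradient-descent update: W(t) = W(t-1) - eta * dE/dW
W1 -= eta * dE_dW1
W2 -= eta * dE_dW2

print("W1 after one update:\n", W1, "\nW2 after one update:\n", W2)
```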

Optimizing Backpropagation

The small demonstration above covers just one example. But as the complexity of a model increases, there is a real need to optimize it for efficient training. We'll now discuss some practical tricks that can make a huge difference if implemented properly.

1. Stochastic vs Batch Learning

Before we can get into which method is more efficient, it’s important to understand the difference between the two.

In the example shown before, the final equation used to determine the adjusted weights required completing the forward pass over the whole dataset. Since an entire 'batch' of data is processed before computing the gradient that determines the new weights, this approach is known as batch learning.

In stochastic learning, by contrast, the paper states that a single example is chosen at random from the training set at each iteration t. An estimate of the true gradient is computed from the error Ei of that example alone, and the weights are updated according to the formula mentioned above.
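To make the contrast concrete, here is a rough sketch of the two update styles; `grad` is a stand-in for the per-example gradient your backward pass would compute:

```python
import numpy as np

def grad(W, x, d):
    # Per-example gradient of the squared error for a linear unit y = W @ x.
    # (A stand-in for the full backprop gradient of a real network.)
    return np.outer(W @ x - d, x)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 training inputs with 3 features
D = rng.normal(size=(100, 2))   # 2 target values per example
W = np.zeros((2, 3))
eta = 0.01

# Batch learning: average the gradient over the WHOLE training set,
# then take a single step.
batch_grad = np.mean([grad(W, x, d) for x, d in zip(X, D)], axis=0)
W_batch = W - eta * batch_grad

# Stochastic learning: pick ONE random example at iteration t and
# update immediately on its (noisy) gradient estimate.
t = rng.integers(len(X))
W_sgd = W - eta * grad(W, X[t], D[t])
```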

At a glance, the advantages of both techniques are given below. We'll elaborate on them further.

Stochastic Learning

Faster than batch learning
Consider a large, redundant training set consisting of M copies of the same N samples, so that the total size of the dataset becomes M×N. Since the dataset is just redundant copies of one sample set, the average gradient over all M×N values is identical to the gradient over the first N values alone. Computing the batch gradient therefore wastes both time and computation. Especially for large, redundant datasets, stochastic learning is preferred because of this speed advantage.

Better solutions
The gradient estimate used in stochastic learning was initially considered a disadvantage, since the noise it carries reduces the precision of each weight update. However, for nonlinear networks with multiple local minima, this noise works to our advantage: it lets the weights jump out of shallow local minima and settle into deeper ones, reducing the chance of a suboptimally trained model. If you're unfamiliar with the concept of local and global minima, you can read more about it here.

Source — https://mlfromscratch.com/content/images/2019/12/gradient-descent-optimized.gif

Tracking changes
Practical and industrial models where the data distribution changes over time require a network that keeps generalizing. Stochastic learning continuously tracks such changes and yields good approximations, while the drift generally goes undetected in batch learning.

Batch Learning

Conditions of convergence are well understood
Ironically enough, the noisy updates that benefit stochastic learning also prevent full convergence to the minimum, because they keep the weights fluctuating. The size of these fluctuations depends on the learning rate η, and the convergence problem can be tackled by either decreasing the learning rate or using an adaptive batch size.

Convergence rates are simpler / many acceleration techniques can be applied
The fluctuations that stall convergence depend on the degree of noise. One compromise is to use 'mini-batches', progressively increasing the batch size as training proceeds; a drawback of this method is the risk of overtraining. Another advantage of batch learning is the ability to speed up the learning process with second-order methods. These accelerate training by estimating the curvature of the cost surface, which helps locate the approximate position of the actual minimum much faster.

2. Shuffling examples

The paper states that the best way to make a network learn faster is to shuffle the training set so that successive examples rarely belong to the same class. Randomizing the order of the training values, and presenting more often those input samples that produce a large error, helps the network learn efficiently. The drawback of presenting certain inputs more frequently is the risk of the network over-emphasizing unrepresentative features, which decreases its overall competence.
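In practice this is often just a fresh random permutation of the training set at the start of every epoch, as in this sketch (the update step is left abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # training inputs
D = rng.integers(0, 2, size=100)   # class labels

for epoch in range(10):
    order = rng.permutation(len(X))    # new random order every epoch
    for i in order:
        x, d = X[i], D[i]
        # ...forward pass, backward pass, and weight update on (x, d)
```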

3. Normalizing inputs

Normalization is a preprocessing technique that rescales the input values to improve the convergence properties of a network. Convergence tends to become faster as the average of each input variable over the training set approaches zero. Consider a dataset whose attributes are all scaled over different ranges: the weights attached to different inputs then learn at very different speeds, which makes the network harder to train and slows it down.

The input transformation thus involves shifting and scaling the inputs so that their averages approach zero and their covariances are about the same. Performing this transformation balances the rates at which the weights connected to the input nodes learn. Another method, effective but comparatively harder to implement, is decorrelating the inputs. Correlated inputs are simply values that depend on each other. Observe the figure given below.

Linearly Dependent Inputs (Fig 2 in Source Paper)

This is an example of a very simple network with inputs z1 and z2 that are uncorrelated. In this case, it's easy to find the value of w1 with the least cost regardless of what happens to w2, and vice versa, precisely because the inputs are independent of each other. In a network with correlated inputs, both values must be found simultaneously, which makes the problem much harder. In the extreme case of linearly dependent inputs, the network output stays constant along certain directions in weight space, so the gradient along those directions is zero.

A technique used to remove linear correlations is principal component analysis, also known as the Karhunen-Loève expansion. The input transformation for such a scenario is given in the diagram below.

Transformation of Inputs (Fig 3 in Source Paper)
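Assuming a plain NumPy pipeline, the three steps of Fig 3 (shift the mean to zero, decorrelate with PCA, equalize the covariances) might look like this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with very different scales and a nonzero mean.
X = rng.normal(size=(500, 3)) * [1.0, 10.0, 0.1] + [5.0, -2.0, 0.0]

# 1. Shift the inputs so each variable has zero mean.
X = X - X.mean(axis=0)

# 2. Decorrelate with PCA / the Karhunen-Loeve expansion:
#    project onto the eigenvectors of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X = X @ eigvecs

# 3. Equalize the covariances so every input varies on the same scale.
X = X / np.sqrt(eigvals + 1e-12)

print(np.cov(X, rowvar=False).round(3))  # ~ identity matrix
```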

4. The Sigmoid Function

The sigmoid is the family of activation functions that most commonly gives neural networks their nonlinear capabilities. Common examples are the standard logistic function and the hyperbolic tangent, shown below. The logistic function is generally not recommended because its outputs are always positive: the data passed to the next layer is then never normalized, since its average can never be close to zero. Sigmoids that are symmetric about the origin are therefore preferred, and they also converge faster. One disadvantage of these sigmoids is that the error surface can get very flat near the origin. In such cases, it helps to add a small linear term (also known as a twisting term) to prevent the sigmoid from saturating into flat spots.

(a) Standard Logistic Function (b) Hyperbolic Tangent (Fig 4 in Source Paper)
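For reference, here is a small sketch of the variants discussed: the standard logistic, the symmetric sigmoid f(x) = 1.7159 tanh(2x/3) recommended in the paper, and a tanh with a small twisting term (the coefficient a = 0.01 is an arbitrary illustrative choice):

```python
import numpy as np

def logistic(x):
    # Standard logistic: outputs in (0, 1), always positive.
    return 1.0 / (1.0 + np.exp(-x))

def scaled_tanh(x):
    # The symmetric sigmoid recommended in the paper:
    # f(x) = 1.7159 * tanh(2x/3), chosen so that f(1) = 1 and f(-1) = -1.
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def twisted_tanh(x, a=0.01):
    # A small linear "twisting" term keeps the derivative from
    # vanishing in the flat, saturated regions.
    return np.tanh(x) + a * x
```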

5. Choosing target values

This section of the paper deals with the drawbacks of setting the target values of a classification problem at the sigmoid's asymptotes. Doing so drives the weights to larger and larger values, into the region where the sigmoid's derivative is close to zero. Since the gradients are multiplied by this derivative, the weight updates shrink towards zero and the weights get stuck.

The second drawback is the lack of a confidence score once the outputs saturate: the model gives no indication of uncertainty when the output class is ambiguous, and can make confidently wrong predictions. The paper states that the solution to both problems is to set the target values within the range of the sigmoid rather than at its asymptotic values.

6. Initializing weights

In the example demonstrated in Part 1, we mentioned that the weights were randomly initialized. A poor choice here can have a negative effect, as the starting values of the weights have a significant impact on the training process. The paper states that the weights should be chosen randomly such that the sigmoid is activated in its linear region. This ensures that the gradients are large enough for learning to proceed and that the network learns faster.

For weight initialization, assuming that the training set has been normalized and an appropriate sigmoid (like the one given in Fig 4b) is used, the weights should be randomly drawn from a distribution with mean zero and standard deviation σ = 1/√m, where m denotes the number of inputs (the fan-in) connected to the node.
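As a sketch, for a layer with fan-in m this could be:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 64            # fan-in: number of inputs connected to each node
n_units = 32      # nodes in this layer (an arbitrary illustrative size)

# Zero-mean weights with standard deviation 1/sqrt(m), so the weighted
# sums initially fall in the linear region of the sigmoid.
W = rng.normal(loc=0.0, scale=m ** -0.5, size=(n_units, m))
```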

7. Choosing learning rates

This section of the paper deals with finding the ideal learning rate η. The problem with some of the methods proposed for this purpose is that the weight fluctuations they involve make them unsuitable for stochastic gradient methods. The paper goes on to state that giving each weight its own learning rate improves convergence, since the right rate depends on the curvature of the error surface at that weight: some weights require a small learning rate to avoid divergence, while others need a larger one to converge at a reasonable speed. For this reason, learning rates in the lower layers should generally be larger than in the higher layers, and in models where shared weights are used, η should be proportional to the square root of the number of inputs. The paper then provides further practical tricks to improve convergence, including momentum and adaptive learning rates, which I found a bit hard to understand. I'm planning to write a separate article on them, as there's more to explain.
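For reference until then, the classical momentum update has the form (with μ the momentum coefficient):

```latex
\Delta w(t+1) = \eta \frac{\partial E_{t+1}}{\partial w} + \mu\, \Delta w(t)
```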

8. Radial basis vs Sigmoid units

The final segment of the practical-tricks chapter introduces the concept of radial basis functions, or RBFs. In an RBF unit, the dot product between the weight and input vectors is replaced by the Euclidean distance between them, and the sigmoid is replaced by an exponential function. The output is computed by the given formula:
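A standard form of this computation (in my notation, with v_i the center and σ_i the width of RBF unit i) is:

```latex
g(x) = \sum_{i} w_i \exp\!\left(-\frac{\lVert x - v_i \rVert^2}{2\sigma_i^2}\right)
```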

A more detailed explanation for this function is given here.

The difference between a sigmoid unit and an RBF unit is that a single RBF unit covers only a small, local region of the input space, whereas a sigmoid unit covers the entire space. The advantage of this locality is faster learning. In high-dimensional spaces, however, it becomes a drawback, since many units are needed to cover the space. Hence, for a given network, RBF units are more appropriate in the low-dimensional upper layers, whereas sigmoid units are recommended for the high-dimensional lower layers.

Summary

In this article, we started off with the basics of standard backpropagation and demonstrated a small example of how it works. The second section focused on how to optimize this process, covering eight techniques ranging from the type of learning to the choice of activation function. In the next article, we'll take a deeper look at the convergence of gradient descent; it will include more theory and derivations, and will provide an introduction to Hessian matrices.

Thank you for reading!
