Learning and Generalization — Understanding Efficient BackProp Part 1

Vaishnavi Nandakumar · Analytics Vidhya · Jul 12, 2020

Introduction

Ever since I started exploring fundamental concepts in the vast field of machine learning, I have come across some incredible research papers that fascinated me enough to dive deeper into them. ‘Efficient BackProp’ by Yann LeCun, Léon Bottou, Genevieve Orr, and Klaus-Robert Müller is one such paper, and it has motivated me to write about it in an attempt to understand it better.

Considering that I’m relatively new to this field, I’ve written this blog in the process of understanding the paper myself, so please do let me know if there are any errors I may have overlooked. Since the research paper is quite long, this blog series will be divided across different topics to make it more structured and simpler to follow.

Topics explored in ‘Efficient BackProp’ by Yann LeCun

  1. Learning and Generalization
  2. Standard Backpropagation
  3. Convergence of Gradient Descent
  4. Classical second order optimization methods
  5. Analysis of Hessian in multilayer networks
  6. Applying second order methods to multilayer networks

Gradient-based learning

‘Employing learning algorithms for predictive models based on neural networks is basically an optimization problem: obtaining the parameters that result in a state of minimum error energy.’

Gradient-based learning is the first topic introduced in the paper. It starts off by stating that most successful approaches to automatic machine learning involve a gradient-based learning method. But what does that mean? A very generalized way to explain the concept is through the diagram given below.

Gradient-based learning machine

This is a first-order optimization model. The term ‘first order’ here refers to an algorithm that requires at least the first derivative (gradient). In the figure above, the input patterns are given as Z0, Z1, Z2 and so on, and W refers to the weights of the model, which represent its adjustable parameters. As the model is trained, a cost function (usually the mean squared error) measures the difference between the predicted and desired outputs. The average of these errors is calculated over the training set and the weights are adjusted accordingly. The process repeats until the cost function is minimized.
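As a rough sketch (mine, not the paper’s), this loop might look like the following in NumPy for a linear model; the function names and learning rate are illustrative assumptions:

```python
import numpy as np

def cost(W, Z, y):
    # Mean squared error between predicted and desired outputs
    return np.mean((Z @ W - y) ** 2)

def train_step(W, Z, y, lr=0.01):
    # Gradient of the MSE cost with respect to the weights W
    grad = 2 * Z.T @ (Z @ W - y) / len(y)
    # First-order update: step the weights against the gradient
    return W - lr * grad
```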

If this was confusing to follow, let’s illustrate this process with a small example.

So I made a very small dataset with ten values, each containing one input (x) and one output (y) value. Now, we know that the two are related by a simple linear formula of the form y = w1·x + w0.

But all our model is presented with are the values of x and y. It has no prior knowledge of the linear relationship that exists between the two. The only way the system can learn on its own is by continuously updating its parameters until its predictions fit the given dataset.

The values of x correspond to the input values Zn in the figure. The weights W are randomly initialized and fed to the linear regression model used in this example.
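A minimal reconstruction of this setup might look like the snippet below; the dataset values are illustrative stand-ins, since the article’s actual numbers only appear in its figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the article's ten-point linear dataset
x = np.arange(1.0, 11.0)
y = 3.0 * x + 2.0            # assumed coefficients, not from the article

# Add a column of ones so the model can also learn an intercept
Z = np.column_stack([x, np.ones_like(x)])

# Randomly initialized weights: W = [slope, intercept]
W = rng.normal(size=2)
print("Outputs before training:", Z @ W)
```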

Output values for the initialized weights before training

Next, we run this model for 100 epochs and list the updated parameters. The readjusted weights come out as follows.
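Continuing the sketch above (and repeating its setup so the snippet runs on its own), the 100-epoch training loop could be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)
y = 3.0 * x + 2.0                          # same illustrative data as above
Z = np.column_stack([x, np.ones_like(x)])
W = rng.normal(size=2)                     # random initial weights

lr = 0.01
for epoch in range(100):
    grad = 2 * Z.T @ (Z @ W - y) / len(y)  # gradient of the MSE cost
    W -= lr * grad                         # first-order weight update

print("Weights after 100 epochs:", W)
```

With this kind of setup the slope tends to converge quickly while the intercept lags behind, which mirrors what happens next in the article.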

As explained before, this adjustment happens because the cost function measures the error between the predicted and desired values, and the weights are updated to reduce it. The entire process of computing these updates is called backpropagation, but that will be explained in more detail later.

Now, from the output it’s easy to see that the first coefficient is already close to the actual one, but the second value is still way off. So we repeat the process.

The updated weight values after 200, 400 and 600 epochs respectively

Although the values aren’t exactly the same, it’s clear that the model has now begun to capture the relationship between the two variables and can progressively produce more accurate results with a lower loss.

Now, this was a small example demonstrating a simple linear relationship. What happens when the dataset grows and different sets of samples are provided? Or how do we optimize a model that requires a classification algorithm instead of regression? In such cases, merely minimizing the cost function on the training set is not enough. This is where the concept of generalization comes in.

Generalization

Generalization is the ability of a model to predict the correct target values for an entirely new dataset it has not been trained on. Learning is typically set up to minimize the average error over the training set, a process known as Empirical Risk Minimization (ERM). The basic principle behind ERM is to convert the problem of training a model into an optimization problem. It generally involves two parts: a loss function, and a regularizer that penalizes certain values of the parameters. A hyperparameter maintains the balance between the two. This is a very brief explanation, but you can read more about it in the ERM article linked in the references, which gives a concise explanation along with demonstrations of loss functions, regularizers and some special cases.
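As a hedged sketch of the objective ERM minimizes (the symbols and the λ value are illustrative, not from the paper):

```python
import numpy as np

def empirical_risk(W, Z, y, lam=0.1):
    # Average loss over the training set (here, mean squared error)
    loss = np.mean((Z @ W - y) ** 2)
    # L2 regularizer penalizing large parameter values
    penalty = np.sum(W ** 2)
    # lam is the hyperparameter balancing data fit against the penalty
    return loss + lam * penalty
```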

The analysis of most minimization processes is done by decomposing the generalization error into two terms: bias and variance. (Note that ‘bias’ here is a statistical quantity, not the learnable bias parameter of a neuron.) Bias measures how much the average of the network’s outputs, taken across different samples of training datasets, differs from the desired output. Variance, meanwhile, measures how much the network’s predictions deviate from that average. Minimal generalization error can only be achieved when both of these values are low.
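This decomposition can be estimated numerically. Below is a small illustrative sketch (the true function and noise level are my assumptions, not the paper’s) that fits a line to many noisy resamples of a dataset and measures the squared bias and the variance of the prediction at one test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(t):
    # Assumed 'desired' function generating the data
    return 3.0 * t + 2.0

def fit_line(x, y):
    # Least-squares fit of [slope, intercept] to the data
    Z = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

x = np.linspace(0.0, 1.0, 20)
x_test = 0.5                          # fixed test point

preds = []
for _ in range(200):                  # many resampled training sets
    noisy_y = true_fn(x) + rng.normal(scale=0.5, size=x.shape)
    slope, intercept = fit_line(x, noisy_y)
    preds.append(slope * x_test + intercept)

preds = np.array(preds)
bias_sq = (preds.mean() - true_fn(x_test)) ** 2  # squared bias
variance = preds.var()                           # variance of predictions
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```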

From the graph shown above, we can see that the initial phase of training has a large bias and a low variance. This scenario soon reverses as training progresses, because the network begins to learn better. But does this mean that the longer a model is left to train, the more efficient it gets? No. An important point to note is that if the network is trained for too long, it also learns the noise (meaningless information) and adapts to features specific to the training dataset, a problem known as ‘overfitting’. Early stopping and regularization are some of the techniques used to prevent it, and they will be covered in detail later. This shows that even with excellent cost-minimization strategies, improper model selection that hurts generalization makes the whole network inefficient.
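As a quick illustration of the early-stopping idea (a generic sketch, not the paper’s procedure; the helper functions are assumed to be supplied by the caller):

```python
def train_with_early_stopping(step, val_error, max_epochs=1000, patience=10):
    # step() runs one training epoch; val_error() returns the current
    # validation error. Both are assumed to be supplied by the caller.
    best, waited = float("inf"), 0
    for _ in range(max_epochs):
        step()
        err = val_error()
        if err < best:
            best, waited = err, 0      # validation error improved; keep going
        else:
            waited += 1                # no improvement this epoch
            if waited >= patience:
                break                  # likely overfitting; stop training
    return best
```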

Summary

This article was based mainly on two topics: learning and generalization in machine learning. We went through some fundamental concepts of gradient-based learning, with a small example to illustrate how it works. The importance of generalization in terms of accuracy and optimization was also discussed, along with a brief explanation of empirical risk minimization and the bias and variance terms into which the generalization error is decomposed. I hope this article helped you understand, or at least gave you an idea about, the topics covered. The next article will focus on standard backpropagation as well as some practical tricks, including initializing weights, shuffling examples, normalizing input values and so on.

Thank you for reading!

References

  1. Efficient BackProp — Yann LeCun, Léon Bottou, Genevieve Orr, and Klaus-Robert Müller
  2. Comparing gradient based learning methods for optimizing predictive neural networks
  3. Empirical Risk Minimization
  4. Cost Functions and Gradient Descent
  5. Generalization in Machine Learning
  6. Generalization, Regularization, Overfitting, Bias and Variance in Machine Learning
