What is backpropagation in neural networks in a pure mathematical sense?
Backpropagation is used by computers to learn from their mistakes and get better at doing a specific thing. So using this computers can keep guessing and get better and better at guessing like humans do at one particular task.
Then these computers can club a lot of small tasks that they are very good at to make a system that can do bigger things like drive a car.
High School Student:
Artificial intelligence can be obtained by giving a computer a lot of data along with the correct solution provided by humans(called labelled data) and training a model like a Linear Classifier, Neural Network etc which can generalize to data it hasn’t seen before. Backpropagation is the technique used by computers to find out the error between a guess and the correct solution, provided the correct solution over this data.
Backpropagation has historically been used to traverse trees in a fast way and in data structures. We have the computer guess a value and use calculus to calculate the error via partial derivatives. Then we fix this error to get a better guess and backpropagate again to find a more fine tuned error. This method of repeatedly guessing shows the recursiveness of backpropagation. The training of an artificial intelligence model takes a long time and a lot of computation and it needs everything it can get to speed it up.
Some people also think that backpropagation is a bad methodology and should be replaced by solutions that involve integrating the partial derivatives to directly get the final error in one go instead of in steps that take a lot of computational time, or some other algorithms. Replacing backpropagation with something else is still an area of research.
The complete algorithm for this is known as Gradient Descent. The part of Gradient Descent(or similar algorithms) where you infer the error(usually with calculus) and correct it is known as Backpropagation.
So backpropagation in Computer science is the algorithmic way in which we send the result of some computation back to the parent recursively.
In Machine learning, backpropagation sends feedback to the neural net.
So any training step includes calculating the gradient(differentiation in calculus) and then doing backpropagation(integrating the gradient to get back the way the weights should change).
Simple Case study to understand the calculus: Apples vs Oranges
So let’s say in a simple example we are training a simple line classifier that draws `y = mx + c`. The goal of the classifier is to figure out the correct m and c values that does binary classification on apples and oranges given the radius of the fruit(x axis) and rgb value(y axis). So the classifier has to draw a straight line on the x-y plane to divide it into two parts, one for apples and one for oranges.
The gradient will be d/dx(mx+c) => m, which is the partial derivative of dy/dx, in this case the slope of the line.
Then we calculate the loss on some function we have to optimize, say RMS prop(root mean square of correct values vs predicted values) in which we apply the data to our predicted gradient values by substituting real points like (x1, y1), (x2, y2), … (xn, yn) to the equation.
loss =RMS prop(guess_m, guess_c)(data, correct_answers).
The reason for calling this loss is because it gives us the error between the guess and the correct answers. The loss is calculated by using calculus on the optimisation function of RMS prop.
This will give us a “delta” by which our current m and c values have to change.
Backpropagation step is when we calculate that “delta” and use it to update m and c values.
Now in more complex scenario say neural network
Every neuron/perceptron in the neural network consists of a weight which represents the data/bias it accumulates in training. This weight can be for example a 128 x 64 matrix of 0's and 1’s. The 128 in this example will be the number of nodes in the previous layer as input and the 64 will be the number of output nodes in the next layer.
In the Adam optimization function we have equations having logarithmic value to normalize scale. This means doing integration and differentiation on “log”, or equations with “e” which can be represented as an infinite series in mathematics.
The Adam optimization function has an RMS prop value and a momentum function which it gets from AdaGrad. For the sake of backpropagation, I assume you are familiar with Gradient Descent which will be explained better in a lot of other places.
The backpropagation step will send back the “delta” of the values given in the Wikipedia link.
m = momentum,
v = velocity
So weight(t+ 1) = weight(t) - delta.
In reality, the weights of every node in a layer of the neural net are calculated at the same time in parallel by a technique called vectorization which increases its performance via parallelisation on GPUs or TPUs or some microprocessor by using the matrix multiplication flag.
So a bit more on the
backpropagation(integrating the gradient to get back the way the weights should change)
Let’s say you train the model 100 times on the y=mx + c
So in iteration 1:
m1 = m0(random init) + delta(RMS loss equation), where the delta is obtained by partial differentiation of RMS loss equation for m
c1 = c0(random init) + delta(RMS loss equation), where the delta is obtained by partial differentiation of RMS loss equation for c
In iteration 2:
m2 = m1 + delta(RMS loss equation)
c2 = c1 + delta(RMS loss equation)
In iteration 1 to 100:
m100 = m0 + summation(delta_from 1 to 100(RMS prop equation))
which is the same as
m100 = m0 + integration(delta_from 1 to 100(RMS prop equation)), which works well for our simple apples vs oranges, except if the network is deep you would not be able to do it in one go but layer by layer.
So now that you understand this simple process which will also work on a single node of 128 x 64 weights, we have to discuss how to apply it to a network of such nodes in a neural network. For a quick refresher on partial derivatives and how the calculus part is computed check out Backpropagation in 5 minutes which you will now understand better.
If we are able to apply backpropagation to more complex neural networks like those described in Backpropagation through time, used in LSTM’s, RNN’s or GRU’s we can do text translation, generation of music etc.
So what we discussed until now was for training weights of one node which has output function f.
In a deep neural network(deep means many layers, the width of a layer is the number of nodes in it), you would have to forward propagate through every layer to get the predicted value, calculate the error and then backpropagate the error(update the weights by delta) across every layer.
So the delta will be calculated for every layer in reverse order as and the network will train as shown below. If a network has 5 layers, to calculate the backpropagated error for layer 1, we would have to forward propagate from layer 1 => layer 2 => … => layer 5 => output activation and then backpropagate from output activation => layer 5 => layer 4 … => layer 1 and then using this delta fix the weights of every node in layer 1.
or for Convolutional Nets you would have additional difficulties like backpropagating through a pooling layer
So in deep nets, after forward propagation, you would have to backpropagate the activation function(on the last layer) first like
y = x, if x>0
= 0, if x<0
or Softmax which is multiple sigmoids stacked together
or Sigmoid => y = 1/(1+pow(e, -x))
and then you would pass the “delta” as the loss function to the second last layer which will calculate the delta of delta and so on. The functions described above seem easy to differentiate no?
And an even more complex scenario would be having to backpropagate a deep net which is being trained batch-wise or backpropagating through time(like in RNN’s or LSTM’s or GRU’s).
These Recurrent Neural Networks or RNN’s have something known as a “skip connection” which allows them to propagate changes easily in very deep networks. Also Gated Recurrent Units or GRU’s have a “memory gate” which keeps track of some “context” in the data and a “forget gate” to mark where the context is irrelevant. LSTM’s also have an extra “output gate” to the GRU gates and perform much better on larger datasets.
These LSTM’s can be used to build networks that can predict ethnicity, gender, translate between languages, generate music and mix art styles. Similarly, LSTM’s too have a unique way to backpropagate their errors you can estimate by partially differentiating their equations for the various gates. Look at this paper on Backpropagation through time for more details.
The fundamentals of backprop are that there will be a random Weight matrix W for every node, which backprop will update as W = W-delta, where delta is the derivative of the output function of that node which represents the error to be corrected in a step in the node.
Also check out this video: Backpropagation in 5 minutes
For code references on getting a feel for implementing it yourself, check the medium articles at the bottom.
Postgraduate & PhD:
It is best to understand this concept in greater detail from research papers and videos by experts. I’ll provide some links for this.
Here is the paper for Adam.
Here is a video of Backpropagation by Yoshua Bengio who has supported backpropagation a lot over the decades.
Here is an article about why Geoffrey Hinton thinks we should abandon backpropagation and find a better method.
Tsvi Achler, M.D./PhD Computational Neuroscience & Neurology, University of Illinois at Urbana-Champaign (2006)
I don’t think Hinton goes far enough. Backprop is not the fundamental problem, it is network structures. Neural networks should not be limited to a feedforward configuration* . Backprop can only train feedforward networks and will remain one of the best solutions as long as networks are feedforward.
Also for a broader intuition of how you can get intelligence from randomness and repeated guesswork take a look at Monte Carlo methods and Markov chains.
If you do the course by Andrew Ng on Deep learning, he makes you implement backpropagation in code on various equations, my favorite loss function is
L(y’,y) = -(y*log(y’) + (1-y)*log(1-y’))
because it is used in a lot of machine learning algorithms as:
y = 1==>
L(y',1) = -log(y')==> we want
y'to be the largest ==>
y' biggest value is 1
y = 0==>
L(y',0) = -log(1-y')==> we want
1-y'to be the largest ==>
y'to be smaller as possible because it can only has 1 value.
and then we do differentiation on L(y’, y) to get the gradients.
Part of being a Postgrad or Doctor means being able to read research papers to gain insight. NIPS and ICLR are famous events in this space to learn more and will help you with a lot of material you can go through.
I haven’t included code examples for backpropagation in numpy because you will usually end up using a library like PyTorch or Tensorflow which implements it for you. Also since backpropagation is closely tied to the type of network, there is usually a different implementation per node type but it remains constant in the fact that you use calculus to get the errors for the mathematical function you have, so you have to read all the papers to understand each backpropagation implementation. The point of learning backpropagation for most people is to understand the mathematics behind it for research or what their libraries are doing under the hood. I hope this article gives you intuition on this concept before you checkout some other advanced articles that start off at the difficulty of implementing it directly like
or some articles that explain gradient descent intuition like https://medium.com/datathings/neural-networks-and-backpropagation-explained-in-a-simple-way-f540a3611f5e
That's all folks!