Do You Understand Gradient Descent and Backpropagation? Most Don’t.
A simple mathematical intuition behind one of the commonly used optimization algorithms in Machine Learning.
Binary classification is very common in machine learning. It is a good example of getting into the dark world of gradient descent and back-propagation.
Gradient descent in logistic regression
We recall that in a neural network for binary classification, the input goes through an affine transformation, and the result is fed into a sigmoid activation. The output is therefore a value between 0 or 1, the likelihood of predicting positive or negative class.
Below we see the expression of this classification output P(Y=1). Note that the affine transformation on the predictor X depends on two parameters β0 and β1.
During training, for any input Xi, the neural network can compute the likelihood Pi and compare it to the true value Yi. The error is typically calculated using binary cross-entropy. It does the same job as the mean squared error used for regression.
Binary cross-entropy measures how far away from the true value Yi (which is either 0 or 1) is the prediction Pi. The formula used to calculate that binary cross-entropy Li is given below.
As you can see, the loss depends on the weights β0 and β1. After the gradient descent optimizer has chosen random weights, it picks a random Xi, then does a forward pass to calculate the loss.
The optimizer repeats this calculation for all input data points if we are using the basic gradient descent, or just for small batches of data if we are using the stochastic version. The full loss is obtained by summing up all individual losses.
After the optimizer has calculated the total loss, it will compute its derivative for the weights. Based on the sign of the derivative, it will update the weight either positively or negatively.
We know that we want to go in the opposite direction of the derivative and we know we want to be making a step proportionally to the derivative. The learning rate λ is what controls that proportion for every weight W (β0 or β1).
Back-propagation in logistic regression
As you can see in the image above, updating the weights requires calculating the partial derivatives of the loss concerning each weight.
How do we calculate derivatives? Well, we can use high school math to do this.
In the image below we try to compute the derivation of the cross-entropy loss function for β0 and β1.
For a network with one neuron, this is a great solution. But imagine a network with hundreds of neurons as we usually encounter in deep learning. I bet you don’t want to calculate the resulting derivative.
Even if you succeed in doing so, you will have to update your formulas every time the architecture of the network changes, even just a little bit. Here is where backpropagation comes into play.
The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn’t fully appreciated until a 1986 paper by Geoffrey Hinton.
Backpropagation uses the chain rule, which is convenient mnemonics for writing derivatives of nested functions.
For example, if we have a network with one neuron that feeds into a second and finally into a third to get an output. The total loss function f is a function of the loss function g of the first two neurons, similarly, g is a function of the loss function h of the first neuron.
While we go forward from the inputs calculating the outputs of each neuron up to the last neuron, we also evaluate tiny components of the derivative already.
In the example above, we can calculate dh/dx already when going forward through the first neuron.
Next, we can calculate dg/dh when going forward through the second neuron.
Finally, we start calculating df/dg going backward through the neurons and by reusing all of the elements already calculated.
That’s the origin of the name backpropagation. There are several implementations and flavors of this technique. For the sake of simplicity, we keep it simple here.
To illustrate how the chain rule and backpropagation work, let’s return to the loss function for our 1-neuron network with sigmoid activation.
The loss function was defined as binary cross-entropy, which can be separated into two parts A and B, as shown below.
Let’s have a closer look at part A of the loss function. It can be divided into blocks highlighted with red boxes on the image below.
Backpropagation requires to compute the derivative of that function at any given data point X for any given weight W.
This is done by calculating the derivative of each block and putting it all together using the chain rule.
Below we see how this would work for X=3 and W=3.
All we need is to be able to calculate derivatives of the small blocks (called variables above).
Such blocks are known because activation functions are usually known. They can be sigmoid, linear, ReLu, …etc.
These are differentiable functions with known derivatives. You can find the full list of trendy activation functions here.
Therefore the calculations above can be built up during run time using a computational graph.
High-level APIs such as Keras can look at your network architecture and the activation functions used at each neuron, to build a computational graph during model compilation.
That graph is used during training to perform forward pass and backpropagation.
An example of a computational graph for the cross-entropy loss function is presented below.
Understanding how gradient descent and backpropagation work is a great step in understanding why deep learning beautifully works. In this article, I showcased the learning process in a binary classification context.
Thanks for reading.