Machine Learning

Do You Understand Gradient Descent and Backpropagation? Most Don’t.

A simple mathematical intuition behind one of the most commonly used optimization algorithms in machine learning.

Michel Kana, Ph.D
Jun 29 · 5 min read

Binary classification is very common in machine learning. It also makes a good entry point into the dark world of gradient descent and back-propagation.

[Photo by Mikhail Vasilyev on Unsplash]

Gradient descent in logistic regression

We recall that in a neural network for binary classification, the input goes through an affine transformation, and the result is fed into a sigmoid activation. The output is therefore a value between 0 and 1, which we can read as the probability of the positive class.

Below we see the expression of this classification output P(Y=1). Note that the affine transformation on the predictor X depends on two parameters β0 and β1.

[Figure: The sigmoidal logistic function]
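
Written out (this is the standard form of the logistic model, reconstructed from the description rather than copied from the figure), that output is

P(Y=1 | X) = σ(β0 + β1·X) = 1 / (1 + e^−(β0 + β1·X))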

During training, for any input Xi, the neural network can compute the likelihood Pi and compare it to the true value Yi. The error is typically measured with binary cross-entropy, which plays the same role that the mean squared error plays in regression.

Binary cross-entropy measures how far the prediction Pi is from the true value Yi (which is either 0 or 1). The formula for that per-example loss Li is given below.

[Figure: The binary cross-entropy loss function]
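
In its usual form (again reconstructed from the description, not copied from the figure), the per-example loss is

Li = −[ Yi·log(Pi) + (1 − Yi)·log(1 − Pi) ]

so Li is small when Pi is close to Yi and grows without bound as Pi moves toward the wrong label.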

As you can see, the loss depends on the weights β0 and β1. After the gradient descent optimizer has initialized the weights at random, it takes an input Xi and does a forward pass to calculate the loss.

The optimizer repeats this calculation for all input data points if we are using basic (full-batch) gradient descent, or only for a small batch of points if we are using the stochastic version. The total loss is obtained by summing up the individual losses.
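
As a minimal NumPy sketch of that forward pass and loss summation (the function and variable names here are mine, not from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_loss(beta0, beta1, X, Y):
    """Forward pass for every point, then sum the per-example
    binary cross-entropy losses Li over the whole dataset."""
    P = sigmoid(beta0 + beta1 * X)                   # predicted probabilities Pi
    L = -(Y * np.log(P) + (1 - Y) * np.log(1 - P))   # per-example losses Li
    return L.sum()
```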

After the optimizer has calculated the total loss, it computes the partial derivatives of that loss with respect to the weights. Based on the sign of each derivative, it updates the corresponding weight upward or downward.

[Figure: The partial derivative of the binary cross-entropy loss function]
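
For the sigmoid combined with binary cross-entropy, those derivatives collapse to a remarkably simple form (my reconstruction; the figure may write it differently):

∂L/∂β1 = Σi (Pi − Yi)·Xi,   ∂L/∂β0 = Σi (Pi − Yi)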

We want to move in the direction opposite to the derivative, taking a step whose size is proportional to the derivative. The learning rate λ is what controls that proportion for every weight W (β0 or β1).

[Figure: The weight update during gradient descent]
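
Putting the pieces together, a minimal (full-batch) gradient descent loop for this 1-neuron model might look like the sketch below; it reuses the simplified gradients given above, and the names and toy data are again mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, lr=0.1, epochs=100):
    beta0, beta1 = np.random.randn(2)        # random initial weights
    for _ in range(epochs):
        P = sigmoid(beta0 + beta1 * X)       # forward pass
        grad_b0 = np.sum(P - Y)              # dL/d(beta0)
        grad_b1 = np.sum((P - Y) * X)        # dL/d(beta1)
        beta0 -= lr * grad_b0                # step against the gradient,
        beta1 -= lr * grad_b1                # scaled by the learning rate
    return beta0, beta1

# toy usage: points below 0 belong to class 0, points above 0 to class 1
X = np.array([-2.0, -1.0, 1.0, 2.0])
Y = np.array([0, 0, 1, 1])
b0, b1 = gradient_descent(X, Y)
```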

Back-propagation in logistic regression

As you can see in the image above, updating the weights requires calculating the partial derivatives of the loss with respect to each weight.

How do we calculate these derivatives? Well, for a model this small we can use high-school calculus.

In the image below we compute the derivatives of the cross-entropy loss function with respect to β0 and β1.

[Figure: How to calculate the partial derivatives by hand]
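
One way the hand calculation can go, writing zi = β0 + β1·Xi so that Pi = σ(zi), is to chain three pieces:

∂Li/∂β1 = (∂Li/∂Pi)·(∂Pi/∂zi)·(∂zi/∂β1)
        = (−Yi/Pi + (1 − Yi)/(1 − Pi)) · Pi(1 − Pi) · Xi
        = (Pi − Yi)·Xi

and the same steps with ∂zi/∂β0 = 1 give ∂Li/∂β0 = Pi − Yi.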

For a network with one neuron, this is a great solution. But imagine a network with hundreds of neurons as we usually encounter in deep learning. I bet you don’t want to calculate the resulting derivative.

Even if you succeed in doing so, you will have to update your formulas every time the architecture of the network changes, even just a little bit. Here is where backpropagation comes into play.

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn’t fully appreciated until a 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.

Backpropagation uses the chain rule, a convenient rule for writing the derivatives of nested functions.

For example, suppose we have a network in which one neuron feeds into a second, which feeds into a third to produce the output. The total loss f is then a function of the output g of the first two neurons, and g is in turn a function of the output h of the first neuron.

[Figure: The chain rule]
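
For the composition f(g(h(x))), the chain rule says

df/dx = (df/dg) · (dg/dh) · (dh/dx)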

As we go forward from the inputs, calculating the output of each neuron up to the last one, we can already evaluate small pieces of this derivative.

In the example above, we can calculate dh/dx already when going forward through the first neuron.

Next, we can calculate dg/dh when going forward through the second neuron.

Finally, we start calculating df/dg going backward through the neurons and by reusing all of the elements already calculated.

That’s the origin of the name backpropagation. There are several implementations and flavors of this technique; for the sake of clarity, we keep things simple here.
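
Here is a toy sketch of that forward/backward bookkeeping for a composition f(g(h(x))); the specific functions are made up for illustration and are not from the article:

```python
import math

def forward_backward(x):
    """Toy composition f(g(h(x))): the forward pass caches each block's
    local derivative, the backward pass multiplies them (chain rule)."""
    # forward pass
    h = 3.0 * x              # h(x) = 3x
    dh_dx = 3.0
    g = math.tanh(h)         # g(h) = tanh(h)
    dg_dh = 1.0 - g ** 2
    f = g ** 2               # f(g) = g^2
    df_dg = 2.0 * g
    # backward pass: reuse the cached pieces
    df_dx = df_dg * dg_dh * dh_dx
    return f, df_dx
```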

To illustrate how the chain rule and backpropagation work, let’s return to the loss function for our 1-neuron network with sigmoid activation.

The loss function was defined as binary cross-entropy, which can be separated into two parts A and B, as shown below.

[Figure: The binary cross-entropy loss function, separated into parts A and B]

Let’s have a closer look at part A of the loss function. It can be divided into blocks, highlighted with red boxes in the image below.

[Figure: The first part (A) of the binary cross-entropy loss function, broken into blocks]
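
My reading of those blocks, for a single weight W acting on the input X (as in the numerical example that follows): an affine block u = W·X, a sigmoid block v = σ(u) = 1 / (1 + e^−u), and an outer block A = −Y·log(v).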

Backpropagation requires computing the derivative of that function at any given data point X, for any given weight W.

This is done by calculating the derivative of each block and putting it all together using the chain rule.

Below we see how this would work for X=3 and W=3.

[Figure: Calculations made during backpropagation]
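
A quick numerical sketch of those calculations, assuming the true label is Y = 1 so that only part A of the loss contributes (that assumption is mine, since the figure is not reproduced here):

```python
import numpy as np

X, W, Y = 3.0, 3.0, 1.0

# forward pass through the blocks
u = W * X                       # affine block:  u = 9
v = 1.0 / (1.0 + np.exp(-u))    # sigmoid block: v ≈ 0.99988
A = -Y * np.log(v)              # loss block:    A ≈ 0.00012

# local derivatives of each block
dA_dv = -Y / v                  # ≈ -1.00012
dv_du = v * (1.0 - v)           # ≈ 0.000123
du_dW = X                       # = 3

# chain rule: dA/dW = dA/dv * dv/du * du/dW = (v - Y) * X ≈ -0.00037
dA_dW = dA_dv * dv_du * du_dW
print(A, dA_dW)
```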

All we need is the ability to calculate the derivatives of the small blocks (the variables above).

Such blocks are known because the activation functions are usually known: sigmoid, linear, ReLU, and so on.

These are differentiable functions with known derivatives; lists of popular activation functions and their derivatives are easy to find.
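
For instance, a small (purely illustrative) table of activations and their derivatives in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (activation, derivative) pairs for a few common choices
activations = {
    "linear":  (lambda z: z,                lambda z: np.ones_like(z)),
    "sigmoid": (sigmoid,                    lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "relu":    (lambda z: np.maximum(0, z), lambda z: (z > 0).astype(float)),
    "tanh":    (np.tanh,                    lambda z: 1 - np.tanh(z) ** 2),
}
```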

Therefore the calculations above can be built up during run time using a computational graph.

High-level APIs such as Keras can look at your network architecture and the activation function used at each neuron, and build a computational graph during model compilation.

That graph is used during training to perform forward pass and backpropagation.

An example of a computational graph for the cross-entropy loss function is presented below.

[Figure: Example of a computational graph generated by Keras]
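
For completeness, here is a minimal Keras version of the 1-neuron model discussed in this article (a sketch; the dataset and hyperparameters are placeholders, not from the article). Compiling the model is when Keras builds the graph it later uses for the forward pass and backpropagation:

```python
import numpy as np
import tensorflow as tf

# one neuron with a sigmoid activation = logistic regression
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(1,))
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="binary_crossentropy")

# tiny synthetic dataset, purely for illustration
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([0, 0, 1, 1])
model.fit(X, Y, epochs=50, verbose=0)
```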

Conclusion

Understanding how gradient descent and backpropagation work is a great step toward understanding why deep learning works so well. In this article, I showcased the learning process in a binary classification setting.


Thanks for reading.
