Backpropagation (1/2) — Working Backwards To Push Innovation Forward — Intuitive Approach

Rohan Jagtap
Published in CodeX · 6 min read · Apr 1, 2021

Introduction

This article focuses on a concept called Backpropagation, the process through which an Artificial Neural Network trains itself to become what is commonly referred to as “Intelligent”.

Backpropagation is the process of calculating the gradients that, over numerous iterations, let our model update its weights toward the values that minimize the cost function. Since backpropagation has a reputation for being an intimidating topic, I’ll go over the intuition behind it first and cover the necessary calculus in a future article.

Background

In the article where I introduced neural networks, I covered a concept called Forward Propagation. The concept explains how input values make their way from the input layer, through the hidden layers, and finally out of the output layer of the Artificial Neural Network.

Remember that in forward propagation, the value of each node is the output of an activation function applied to the sum of all the node’s inputs multiplied by their respective connection weights, plus the respective bias. This process continues for each node in each layer until the output layer is reached. At the output layer, the model outputs the node with the highest probability; for a given set of input values, that node represents the class the model thinks those values correspond to.
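To make this concrete, here is a minimal sketch of forward propagation in NumPy. The layer sizes, the sigmoid activation, and the random weights are all made up for illustration; the network in the diagrams may be shaped differently.

```python
# A minimal forward-propagation sketch (hypothetical 4-5-3 network).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)  # input -> hidden
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)  # hidden -> output

x = rng.standard_normal(4)           # one set of input values

h = sigmoid(W1 @ x + b1)             # weighted sum + bias, passed through the activation
y = sigmoid(W2 @ h + b2)             # same again for the output layer

predicted_class = int(np.argmax(y))  # the output node with the highest value
print(y, predicted_class)
```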

The orange connections illustrate all the connections for the last node in the first hidden layer and the green connections display a possible path for the value of the first node in the input layer.

Once we obtain the output for a given set of input values, we can calculate the loss. There are numerous loss functions for different situations, but in essence a loss function provides a quantitative measure of how “off” the model was from the correct answer. For simplicity’s sake, this article will assume that the loss function is the difference between the value the model outputs and the actual value from the dataset.
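As a tiny example of that simplified loss (the numbers below are made up; in practice a squared-error or cross-entropy loss is more common):

```python
# The article's simplified loss: how far the output is from the actual value.
actual = 1.0      # the value the model should have produced
predicted = 0.73  # the value the model actually produced
loss = actual - predicted
print(loss)       # 0.27 -- a quantitative measure of how "off" the model was
```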

Once we obtain the loss value, we can use gradient descent, a process aimed at minimizing the cost function by updating the weights of the model to reduce the error in the output. Gradient descent works by taking the derivative of the loss function with respect to the weight parameters, and this is where backpropagation comes into action: backpropagation is the tool that gradient descent uses to calculate each gradient of the loss function.
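Here is a sketch of a single gradient-descent step for one weight. The learning rate and the gradient value are hypothetical; the gradient is the quantity backpropagation computes for us.

```python
# One gradient-descent update for a single weight (all numbers hypothetical).
learning_rate = 0.1
w = 0.8        # current value of the weight
dL_dw = 0.35   # derivative of the loss with respect to this weight,
               # i.e. what backpropagation supplies

w = w - learning_rate * dL_dw  # step the weight in the direction that lowers the loss
print(w)       # 0.765
```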

How Does It Work? — An Intuitive Explanation

In our example, let’s say that the top node in the output layer is the correct output for our model based on a set of input values. The model will then be able to recognize that the value of the top node should increase, while the values of the rest of the nodes in the output layer should decrease.

To change the values of the output layer according to the arrows in the image, we need to first recall how these values are calculated. These values in the output layer are calculated by multiplying the outputs of the second hidden layer by the weights of each connection to each of the corresponding nodes, adding the respective bias values, and passing this sum through an activation function. To change the output values based on how they are calculated, we could do one of three things:

  • Change the bias values
  • Change the values in the activation function
  • Change the weights of each connection

Since we cannot directly change any of the options that involve values we are given, we change the weights of each connection, which in turn changes the values that come out of the activation function.

The connections highlighted in purple are the weights that need to be changed.

Now, remember that the values of the second hidden layer are calculated in the same way as the output layer: the values outputted by the first hidden layer are multiplied by the weights on the corresponding connections into the second hidden layer, the bias is added, and this sum is passed through an activation function. This means that we will also have to change the weights on the connections into the second hidden layer, because those weights determine the values that flow through the connections currently highlighted in purple.

We change the weights in the earlier layers because the node values in the later layers depend on them: changing a weight in an earlier layer influences the values that every layer after it produces.

This process of adjusting the connections of the previous layer in order to change what comes out of the later layers is what puts the “back” in “backpropagation”. We continue working backwards in this way until we reach the connections coming out of the input layer.
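The sketch below shows that backwards order for a tiny, made-up 3-4-2 network with sigmoid activations and a squared-error loss. The gradient formulas themselves come from the chain rule, which the next article derives properly; the point here is simply that we compute nudges for the output layer first and then reuse its error signal for the layer before it.

```python
# A sketch of the backwards pass (hypothetical 3-4-2 network, sigmoid, squared error).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = np.array([0.5, -1.0, 2.0])
target = np.array([1.0, 0.0])        # the top output node is the "correct" one

# Forward pass, as covered earlier.
h = sigmoid(W1 @ x + b1)
y = sigmoid(W2 @ h + b2)

# Backward pass: the output layer first...
delta2 = (y - target) * y * (1 - y)  # error signal at the output layer
dW2 = np.outer(delta2, h)            # nudges for the last layer of weights

# ...then the layer before it, reusing the error signal we just computed.
delta1 = (W2.T @ delta2) * h * (1 - h)
dW1 = np.outer(delta1, x)            # nudges for the first layer of weights
```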

Once this process of backpropagation is completed, the values of weights in the model should be nudged in the right direction. Specifically, the values of weights corresponding to the correct node will be increased and the values of the weights corresponding to the incorrect nodes will be decreased.

Another interesting note is that, in addition to nudging the weights in the right direction, backpropagation also determines how strongly each weight should be updated so that the updates lower the cost function as efficiently as possible. This means that some weights are updated proportionally more or less than others, depending on how much effect each update has on the rest of the neural network and on lowering the cost.

The value we end up with for each weight is the derivative of the cost function with respect to that individual weight. Although we went through this example for a single node (the top node in the output layer), the exact same process applies to all the other nodes; the only difference is the values involved in each calculation.

Since each node in the previous layer is connected to all of the nodes in the next layer, conflicting nudges may arise: one output node may ask a particular weight to increase while another asks the same weight to decrease. Once the backpropagation process is done for all of the nodes, these nudges are averaged, and it is this average nudge that is applied to each weight.
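As a tiny illustration of how conflicting nudges get resolved for one shared weight (the numbers are made up):

```python
# Nudges that three different output nodes request for the same weight:
# one wants it to increase, two want it to decrease.
nudges = [+0.40, -0.10, -0.05]
average_nudge = sum(nudges) / len(nudges)
print(average_nudge)  # 0.0833... -- the single change actually applied to this weight
```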

These average nudges are the gradients of the loss function with respect to each weight. The process of computing the gradients and updating the weights is then repeated until the cost function is minimized, and it is at this point that we say a neural network has been “trained”.
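Finally, here is a sketch of that repeated loop for a single made-up layer and training example. Instead of the analytic gradients backpropagation would provide, each derivative is estimated numerically (a slow but intuition-friendly stand-in); the point is the repeat-until-the-cost-stops-falling structure.

```python
# Repeated gradient computation and weight updates (hypothetical 3-input, 2-output layer).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, x, target):
    # Squared-error cost for a single made-up training example.
    return np.sum((sigmoid(W @ x) - target) ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
x = np.array([0.5, -1.0, 2.0])
target = np.array([1.0, 0.0])          # the first output node is the correct one

learning_rate, eps = 0.5, 1e-6
for step in range(500):
    grads = np.zeros_like(W)
    for i in range(W.shape[0]):        # estimate dC/dw for every individual weight
        for j in range(W.shape[1]):
            W_nudged = W.copy()
            W_nudged[i, j] += eps
            grads[i, j] = (cost(W_nudged, x, target) - cost(W, x, target)) / eps
    W -= learning_rate * grads         # apply the nudges
print(cost(W, x, target))              # the cost should now be far smaller than at the start
```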

That was quite a lot of intuition and theoretical knowledge to be thrown at someone. In the next article, I will cover the nitty-gritty calculus which will perform all of the computations mentioned in this article.

More about me — my name is Rohan, I am a 16-year-old high school student learning about disruptive technologies and I’ve chosen to start with A.I. To reach me, contact me through my email or my LinkedIn. I’d be more than glad to provide any insight or learn the insights you may have. Additionally, I would appreciate it if you could join my monthly newsletter. Until the next article 👋!
