Backpropagation Explained

Jonathan Mitchell
Jan 31, 2017


When we train a neural network we need to figure out how to alter each parameter to minimize the cost/loss. The first step is to find out what effect that parameter has on the loss. Then we take the gradient that has accumulated up to that parameter's point in the network and apply the gradient descent update equation to that parameter.

X = X − η · dLoss/dX (gradient descent update equation)
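As a minimal sketch of that update for a single scalar parameter (the learning rate and gradient values here are illustrative, not taken from the article's figures):

```python
# Gradient descent update for a single parameter.
eta = 0.01                 # learning rate η (illustrative value)
x = 2.0                    # current parameter value
dloss_dx = 0.5             # gradient of the loss with respect to x (assumed)

x = x - eta * dloss_dx     # gradient descent update: x ← x − η · dLoss/dx
print(x)                   # 1.995
```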

We have to know how much, and in which direction, to change each parameter value so that our network will converge.

We can compute a gradient, which tells us the effect that each parameter has on the cost.

We use the relationship that two nodes have to the gate connecting them, together with the chain rule from calculus, to compute the effect that a specific node has on the total cost.
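For instance (this gate is hypothetical, not one of the article's figures), suppose two nodes x and y feed a multiply gate z = x · y. The local gradients are dz/dx = y and dz/dy = x, and each one gets multiplied by the gradient flowing back into the gate:

```python
# Backprop through a single multiply gate z = x * y.
x, y = 3.0, -4.0
z = x * y                  # forward pass: z = -12.0

dloss_dz = 2.0             # upstream gradient arriving from later in the network (assumed)
dloss_dx = dloss_dz * y    # chain rule: dLoss/dx = dLoss/dz · dz/dx = 2.0 * (-4.0) = -8.0
dloss_dy = dloss_dz * x    # chain rule: dLoss/dy = dLoss/dz · dz/dy = 2.0 * 3.0 = 6.0
```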

To start out, let's review the chain rule:

Basic derivative: df/dx = lim (h → 0) [f(x + h) − f(x)] / h

Chain rule: df/dx = (df/dg) · (dg/dx)

Chain rule: we compute the change in f with respect to x by relating the change in f with respect to g and multiplying it by the change in g with respect to x. Pretend f and x are nodes that are both connected to g but not to each other.
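As a concrete (made-up) example, take g(x) = 3x + 1 and f(g) = g². The chain rule gives df/dx = (df/dg) · (dg/dx) = 2g · 3, which we can check numerically:

```python
# Chain rule check for f(g(x)) with g(x) = 3x + 1 and f(g) = g**2.
x = 2.0
g = 3 * x + 1              # g(2) = 7.0
f = g ** 2                 # f = 49.0

df_dg = 2 * g              # df/dg = 2g = 14.0
dg_dx = 3.0                # dg/dx = 3
df_dx = df_dg * dg_dx      # chain rule: 14.0 * 3 = 42.0

# Numerical check with a small finite difference.
h = 1e-5
numerical = ((3 * (x + h) + 1) ** 2 - (3 * x + 1) ** 2) / h
print(df_dx, round(numerical, 3))   # 42.0 ≈ 42.0
```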

X, W1, b1 are just neurons / nodes.

We seek to update node X in this example with the gradient descent update equation (see above).

The term dLoss/dS1 is the accumulator of all the gradients up to that point. Remember, we move backwards during backpropagation (red arrows).

To update X we must multiply the accumulated gradient by the local gradient with respect to the X node.

We know that L1 is a common node between X and S1. For this reason we can use the chain rule and relate our differentials through L1 in order to solve for the gradient at node X. Then we multiply the gradient due to X (blue box) by the accumulated gradient (red box) to obtain the total gradient at X. Finally, we use our gradient descent update equation to update X for the next run.
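Here is a minimal sketch of that procedure, assuming (since the figure is not reproduced here) that the graph is L1 = X · W1 followed by S1 = L1 + b1, with an accumulated gradient dLoss/dS1 arriving from the rest of the network. All numeric values are illustrative:

```python
# Sketch of the update described above, assuming L1 = X * W1 and S1 = L1 + b1.
eta = 0.01                     # learning rate η
X, W1, b1 = 1.5, -2.0, 0.5     # node values (illustrative)

# Forward pass.
L1 = X * W1                    # L1 = -3.0
S1 = L1 + b1                   # S1 = -2.5

# Backward pass.
dloss_dS1 = 0.8                # accumulated gradient flowing back to S1 (red box, assumed value)
dS1_dL1 = 1.0                  # the addition gate passes the gradient through unchanged
dL1_dX = W1                    # local gradient of the multiply gate with respect to X (blue box)
dloss_dX = dloss_dS1 * dS1_dL1 * dL1_dX   # chain rule: 0.8 * 1.0 * (-2.0) = -1.6

# Gradient descent update for X.
X = X - eta * dloss_dX         # 1.5 - 0.01 * (-1.6) = 1.516
```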
