Glossary of Deep Learning: Backpropagation

Jaron Collis
Published in Deeper Learning
Apr 12, 2017

The mathematical foundations of neural networks are differentiable graphs — chains of computations whose derivatives can be calculated. It is this property that allows the weights between nodes to be optimised through a process called Backpropagation, and that allows a network to learn from its mistakes.

Backpropagation is so named because the weights are updated backwards, from output towards input. This involves the following steps:

  1. First, pass the training batch through our neural network. This is known as the forward pass, and it generates predictions.
  2. Calculate the error (the difference between the known labels and the predictions).
  3. Calculate the gradient (the dot product of the transposed input and the delta).
  4. Finally, adjust the weights (backwards), from layer i to the preceding layer i-1.

Let’s consider each step in turn.

1 The Forward Pass

Say we have a simple two-layer network with a linear node l1, a sigmoid node s, and another linear node l2, followed by a node that calculates the cost, C.
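Since the graph figure isn't reproduced here, a minimal Python sketch of that forward pass may help. The weight values and the squared-error cost below are assumptions for illustration, not taken from the article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical parameters for the l1 -> s -> l2 -> C graph
w1, b1 = 0.5, 0.1   # weight/bias of the first linear node l1
w2, b2 = -0.3, 0.2  # weight/bias of the second linear node l2

def forward(x, y):
    """One forward pass: returns the intermediate values and the cost."""
    l1 = w1 * x + b1          # first linear node
    s = sigmoid(l1)           # sigmoid node
    l2 = w2 * s + b2          # second linear node (the prediction)
    C = 0.5 * (y - l2) ** 2   # squared-error cost (an assumed cost function)
    return l1, s, l2, C
```

Each intermediate value is kept because the backward pass will reuse them when computing gradients.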

2 Calculate The Error

The forward pass generates predictions for each sample in the batch, which we can compare to the true label y. This gives us the information necessary to calculate the output layer error, which is the difference between desired target and actual output.
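For a whole batch this is just an element-wise subtraction. The numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical batch: true labels y and forward-pass predictions
y = np.array([1.0, 0.0, 1.0])
predictions = np.array([0.8, 0.2, 0.6])

# Output-layer error: difference between desired target and actual output
output_errors = y - predictions   # array([ 0.2, -0.2,  0.4])
```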

3 Calculate The Gradient

As values flow forwards through the network, each node's output eventually contributes to the cost C. Accordingly, a change in l2 will produce a change in C, and the relationship between these changes can be written as a gradient: ∂C / ∂l2

The gradient is a slope: it tells us the change in the cost, ∂C, produced by a change in l2, ∂l2. So a node with a larger gradient with respect to the cost is going to contribute a larger change to the cost. In this way, we can assign blame for the cost to each node. The larger the gradient for a node, the more “blame” it gets for the final cost — and the more we’ll update this node in the gradient descent step.
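If we assume the squared-error cost C = 0.5·(y − l2)², the gradient ∂C/∂l2 works out to (l2 − y), which we can sanity-check numerically with a finite difference (a standard way to verify any hand-derived gradient):

```python
def cost(l2, y):
    """Assumed squared-error cost."""
    return 0.5 * (y - l2) ** 2

def grad_cost(l2, y):
    """Analytic gradient dC/dl2 = l2 - y."""
    return l2 - y

# Compare the analytic gradient to a numerical slope
l2, y, eps = 0.7, 1.0, 1e-6
numeric = (cost(l2 + eps, y) - cost(l2 - eps, y)) / (2 * eps)
assert abs(numeric - grad_cost(l2, y)) < 1e-6
```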

4 Backpropagate, updating weights

Updating a particular node’s weights with gradient descent requires the gradient of the cost with respect to those weights. Hence in the second layer, l2, we calculate the gradient of C with respect to w2, written: ∂C / ∂w2

As you can see in the graph, weight w2 contributes to the output of l2, so any change in w2 will propagate and create a change in C. Ideally, this will be a constructive change and the error will decrease. This is achieved by assigning blame to w2 by sending the cost gradient back through the network: first, how much l2 affected C; then, how much w2 affected l2. Multiplying these gradients together gives the total blame attributed to w2.
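That two-factor multiplication is easy to check on a scalar version of the graph. The weight values and the squared-error cost are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical scalar graph: l1 = w1*x, s = sigmoid(l1), l2 = w2*s, C = 0.5*(y - l2)**2
x, y = 1.0, 1.0
w1, w2 = 0.5, -0.3

s = sigmoid(w1 * x)
l2 = w2 * s

dC_dl2 = l2 - y             # how much l2 affected C
dl2_dw2 = s                 # how much w2 affected l2
dC_dw2 = dC_dl2 * dl2_dw2   # total "blame" attributed to w2

# Verify the chain-rule product against a numerical slope in w2
eps = 1e-6
def C(w2_):
    return 0.5 * (y - w2_ * s) ** 2
numeric = (C(w2 + eps) - C(w2 - eps)) / (2 * eps)
assert abs(numeric - dC_dw2) < 1e-6
```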

This is the basic idea behind backpropagation. To find the gradient for a particular node, we multiply the gradients for all nodes in front of it, going backwards from the cost. The gradients are passed backwards through the network and used within the gradient descent calculation to update the weights and biases. If a node has multiple outgoing nodes, we just sum up the gradients from each subsequent node.
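The summing rule for nodes with multiple outgoing edges can be seen in a tiny hand-made example (the function below is an illustration, not from the article): in f = x·y + x·z, the value x feeds two paths, so its gradient is the sum of the two path gradients, y + z.

```python
def f(x, y, z):
    # x flows into two downstream multiplications
    return x * y + x * z

x, y, z = 2.0, 3.0, 4.0
df_dx = y + z   # gradient contributions from the two paths, summed

# Check against a numerical slope in x
eps = 1e-6
numeric = (f(x + eps, y, z) - f(x - eps, y, z)) / (2 * eps)
assert abs(numeric - df_dx) < 1e-6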

What about layers further back? To calculate the gradient for w1 we use the same method as before, walking backwards through the graph and multiplying together the derivatives encountered along the way.
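Assuming the l1 → s → l2 → C graph from earlier, the full chain of derivatives for w1 would read:

```latex
\frac{\partial C}{\partial w_1}
  = \frac{\partial C}{\partial l_2}
    \cdot \frac{\partial l_2}{\partial s}
    \cdot \frac{\partial s}{\partial l_1}
    \cdot \frac{\partial l_1}{\partial w_1}
```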

As CS231n explains, backpropagation is an elegantly local process. During the forward pass, each node receives its own local inputs and can immediately compute both its output value and the local gradient of its output with respect to its inputs. Nodes can do this independently, without being aware of any of the details of the wider network they are embedded in.

Once the forward pass is over, backpropagation ensures each node eventually learns about the gradient of its output value on the final output of the entire network. Applying the Chain Rule, each node takes that gradient and multiplies it into every local gradient it normally computes for its inputs. This extra multiplication (one for each input) can turn a single, seemingly useless node into a key cog in the complex distributed machine that is the entire neural network.

Backpropagation in Python using basic linear algebra functions from NumPy looks like this (full code here):

# Convert inputs list to 2d array
inputs = np.array(inputs_list, ndmin=2).T
targets = np.array(targets_list, ndmin=2).T

### Step 1: Implementation of the Forward pass ###

# Hidden layer: calculate signals into and out of the hidden layer
hidden_inputs = np.dot(self.weights_input_to_hidden, inputs)
hidden_outputs = self.activation_function(hidden_inputs)

# Output layer: calculate signals into the final output layer
final_inputs = np.dot(self.weights_hidden_to_output, hidden_outputs)
# The output node is linear, so its output equals its input
final_outputs = final_inputs

### Implementation of the Backward pass ###

# Step 2: Compute output layer error (difference between known target and network output)
output_errors = targets - final_outputs

# Step 3: Compute the backpropagated error
# Errors propagated back to the hidden layer
hidden_errors = np.dot(self.weights_hidden_to_output.T, output_errors)
# Derivative of the sigmoid activation at the hidden layer
hidden_gradients = hidden_outputs * (1 - hidden_outputs)

# Step 4: Backpropagate and update the weights
# First, update hidden-to-output weights with a gradient descent step
self.weights_hidden_to_output += self.lr * np.dot(output_errors, hidden_outputs.T)
# Then update input-to-hidden weights with a gradient descent step
self.weights_input_to_hidden += self.lr * np.dot(hidden_errors * hidden_gradients, inputs.T)
