[DL] 3. Backpropagation

Mar 6, 2020

1. Error Backpropagation

In the last chapter, ‘[DL] 2’, we learned how to update the weight w: take a step in the direction of the negative gradient of the error function E(w) with respect to the weight w.

There are several ways to calculate the gradient of E(w). In this chapter, we will deal with the following two ways.

  • Numerical differentiation
Figure 1. Equation to compute gradients in numerical way

This is an alternative way to compute the gradients. It requires no complicated equations and is straightforward; however, it is far less efficient.
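To make this concrete, here is a minimal NumPy sketch of the numerical approach, assuming the central-difference form for Figure 1. The helper `numerical_gradient` and the toy error function are illustrative, not part of the original figure.

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-5):
    """Central-difference approximation of dE/dw for a flat weight vector w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)  # two error evaluations per weight
    return grad

# Toy check: E(w) = ||w||^2 has the exact gradient 2w
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v ** 2), w))  # ~[ 2. -4.  6.]
```

Every weight needs two full evaluations of the error function, which is exactly why this approach is mostly used as a sanity check (a “gradient check”) rather than for training.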

  • Backpropagation with Chain rule
Figure 2. Structure of the network

Suppose a simple neural network as depicted in the figure above. The network outputs one prediction 𝒚 and compares it with the target value t. The detailed mathematical representation is given below.

Figure 3. Mathematical representation for nodes in figure 2

We have two different weight matrices here, W(L) and W(L−1). The gradients with respect to these weights are calculated with the chain rule as follows.

Figure 4. Calculating gradients with respect to weights using backpropagation with chain-rule
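The following is a minimal sketch of this chain-rule computation for a tiny two-layer network. It assumes a sigmoid hidden activation, a linear output, and the squared error E = ½(y − t)²; the exact definitions in Figure 3 may differ, so treat the numbers and shapes as illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x  = np.array([0.5, -1.2])              # input
W1 = np.array([[0.1, 0.4],              # W^(L-1): weights into the hidden layer
               [-0.3, 0.2]])
W2 = np.array([0.7, -0.5])              # W^(L): weights into the output
t  = 1.0                                # target

# forward pass
a1 = W1 @ x                             # hidden pre-activations
z1 = sigmoid(a1)                        # hidden activations
y  = W2 @ z1                            # prediction
E  = 0.5 * (y - t) ** 2                 # squared error

# backward pass: chain rule, starting at the output and moving backwards
dE_dy  = y - t                          # dE/dy
dE_dW2 = dE_dy * z1                     # dE/dW^(L)   = dE/dy * dy/dW^(L)
dE_dz1 = dE_dy * W2                     # error propagated to the hidden layer
dE_da1 = dE_dz1 * z1 * (1 - z1)         # sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
dE_dW1 = np.outer(dE_da1, x)            # dE/dW^(L-1) = dE/da * da/dW^(L-1)
```

Note how the gradient at the output layer is computed first and then reused for the layer before it; nothing is recomputed from scratch.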

Why is backpropagation so widely used?

Backpropagation is an efficient method to compute the gradients (derivatives) and is applicable to any type of network and error function.

Figure 5. Forward and backward flow of information in the network (from Bishop)

The blue arrows denote the direction of the feed-forward pass (the flow of information in forward propagation), whereas the red arrows indicate the direction of backpropagation (the flow of error propagation). We calculate the gradients of the final layer of the network first and then propagate them to the preceding layers in order to calculate the corresponding gradients in those layers.

Because of this way of calculating gradients, shown in the figure above, backpropagation can be applied to any sort of network, and it is more efficient than the numerical approach since it requires less computation and time.

In addition, we need to calculate these gradients for training and updating the weights; this is why we specified earlier that the activation functions are supposed to be differentiable.

Backpropagation in matrix form

  • As the error function is a scalar, its gradient with respect to a particular weight w must have the same shape as w.

Since we use matrices for the inputs (think of the SGD we learned in the last chapter) and for the weights in most cases when we use neural networks, we can use this property to calculate the gradients by backpropagation in matrix form.

Suppose another example in the figure below.

X is a layer with D nodes and the batch size is N. W denotes a weight matrix of size [D × M], and the layer resulting from X and W is the layer Y, which consists of M nodes.

Figure 6. Chain-rule of Backpropagation in Matrix example

If we directly apply the chain rule, as we can see above, the dimensions on both sides don’t match. In order to resolve this issue, we need to transpose the Jacobian matrix (∂Y/∂X).

Figure 7. partial derivatives of Error E with respect to the input matrix X
Figure 8. partial derivatives of Error E with respect to the weight matrix W
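A minimal NumPy sketch of the matrix-form results in Figures 7 and 8, assuming the layer is the linear map Y = XW from Figure 6 and that the upstream gradient ∂E/∂Y has already been computed:

```python
import numpy as np

N, D, M = 4, 3, 2                 # batch size, input nodes, output nodes

X = np.random.randn(N, D)         # input batch,   shape [N x D]
W = np.random.randn(D, M)         # weight matrix, shape [D x M]
Y = X @ W                         # forward pass,  shape [N x M]

dE_dY = np.random.randn(N, M)     # upstream gradient, same shape as Y

# Figure 7: dE/dX = dE/dY . W^T  -> shape [N x D], same as X
dE_dX = dE_dY @ W.T

# Figure 8: dE/dW = X^T . dE/dY  -> shape [D x M], same as W
dE_dW = X.T @ dE_dY

assert dE_dX.shape == X.shape and dE_dW.shape == W.shape
```

The transposes are forced by the shape rule above: the only way to multiply the upstream gradient into something with the same shape as X (or W) is to bring in W.T (or X.T).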

Why is the sigmoid function not used as an activation function anymore?

We can find the answer from its derivative.

Figure 9. Sigmoid function

The derivative of the sigmoid function goes to zero when the sigmoid itself goes to either zero or one. On top of that, as shown in Figure 9, the sigmoid function returns a value close to zero when its input is smaller than −6 and close to one when its input is larger than 6, so its gradient is close to zero in both regimes. This means that the sigmoid function saturates easily.
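A quick numerical check of this saturation, using the standard identity sigmoid′(a) = sigmoid(a)·(1 − sigmoid(a)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1 - s)            # peaks at 0.25 when a = 0

for a in [-6.0, 0.0, 6.0]:
    print(f"a = {a:5.1f}  sigmoid = {sigmoid(a):.4f}  gradient = {sigmoid_grad(a):.4f}")
# a =  -6.0  sigmoid = 0.0025  gradient = 0.0025   <- saturated
# a =   0.0  sigmoid = 0.5000  gradient = 0.2500
# a =   6.0  sigmoid = 0.9975  gradient = 0.0025   <- saturated
```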

  • Why is a gradient of zero a problem?

Recall the way we calculated the partial derivatives of the error function (the loss or cost function). We accumulated the calculated gradients and passed them to the preceding layers based on the chain rule, in order to compute the gradients of the weights in those layers. Now, think about the case when the accumulated gradient is zero. Since the chain rule works by multiplication, no matter how far back we pass the gradients to the early layers of the network, the result will always be zero. Therefore, there will be no updates to the weights and training stops.

Figure 10. Suppose L is the last layer and L−1 is the layer right before L

This is the reason why we don’t use functions that saturate easily as our activation function.
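To see how quickly this happens, here is a purely illustrative sketch in which every layer contributes the near-zero local gradient (~0.0025) of a saturated sigmoid from Figure 9; the chain rule multiplies these factors together on the way back:

```python
# Illustrative only: each backward step through a saturated sigmoid
# multiplies the error signal by roughly 0.0025 (see Figure 9).
local_grad = 0.0025
upstream = 1.0
for layer in range(5, 0, -1):
    upstream *= local_grad
    print(f"gradient reaching layer {layer}: {upstream:.3e}")
# The printed gradient shrinks from ~2.5e-03 down to ~1e-13 after only five layers.
```

After only a handful of layers the gradient is effectively zero, so the early weights receive essentially no update.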

2. References

[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

[2] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

[3] http://cs231n.stanford.edu/vecDerivs.pdf

Any corrections, suggestions, and comments are welcome.

The contents of this article are based on Bishop [1] and Goodfellow [2].
