Deep Learning: Back Propagation

Now the problem that we have to solve is to update weight and biases such that our cost function can be minimised. For computing gradients we will use Back Propagation algorithm.

Why Back Propagation?

This seem to be the first question for every beginner and if we consider one of the scenario where you choose to decide to regard the cost as a function of the weights C=C(w) alone. You number the weights w₁,w₂,…, and want to compute ∂C/∂wᵢ for some particular weight wᵢ. An obvious way of doing that is to use the approximation

Equation 1

where ϵ>0 is a small positive number, and eᵢ is the unit vector in the iᵗʰ direction. In other words, we can estimate ∂C/∂wᵢ by computing the cost C for two slightly different values of wᵢ, and then applying Equation 1. The same idea can be used to compute the partial derivatives ∂C/∂b with respect to the biases. This approach looks very promising, simple to understand and would not take more than 3–4 line to code but then what’s the hack?

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight wᵢ we need to compute C(w+ϵeᵢ) in order to compute ∂C/∂wᵢ. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example).

Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient. What’s clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives ∂C/∂wᵢ using just one forward pass through the network, followed by one backward pass through the network. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes of the previous method.

Back Propagation Algorithm

Lets see what Back propagation Algorithm doing?

Figure 1: The real-valued “circuit” on left shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit. (Image Courtesy: Andrej Karpathy slides, CS231n)

Now we know that chain rule will take away our misery, lets formulate our algorithm? Lets now start to consider more complicated expressions that involve multiple composed functions, such as f(x, y, z) = (x+y)z. This expression is still simple enough to differentiate directly, but we’ll take a particular approach to it that will be helpful with understanding the intuition behind back propagation. In particular, note that this expression can be broken down into two expressions: q = x+y and f = qz(Figure 1). Moreover, we know how to compute the derivatives of both expressions separately, as seen in the previous section. f is just multiplication of q and z, so ∂f/∂q=z, ∂f/∂z=q and q is addition of x and y so ∂q/∂x=1,∂q/∂y=1.

However, we don’t necessarily care about the gradient on the intermediate value q — the value of ∂f/∂q is not useful. Instead, we are ultimately interested in the gradient of f with respect to its inputs x, y, z. The chain rule tells us that the correct way to “chain” these gradient expressions together is through multiplication. For example, ∂f/∂x=(∂f/∂q)*(∂q/∂x). In practice this is simply a multiplication of the two numbers that hold the two gradients.

At the end we are left with the gradient in the variables [df/dx,df/dy,df/dz], which tell us the sensitivity of the variables x,y,z on f!. This is the simplest example of back propagation.

Back Propagation Intuitively

Notice that back propagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things: 1. its output value and 2. the local gradient of its inputs with respect to its output value.

Lets get an intuition for how this works by referring again to the example(Figure 1). The add gate received inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12. During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4. If we anthropomorphise the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as “wanting” the output of the add gate to be lower (due to negative sign), and with a force of 4. To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both x and y 1* -4 = -4). Notice that this has the desired effect: If x, y were to decrease (responding to their negative gradient) then the add gate’s output would decrease, which in turn makes the multiply gate’s output increase.

Back propagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.

Back Propagation (Neural Network)

I won’t be explaining mathematical derivation of Back propagation in this post otherwise it will become very lengthy. If you want to see mathematical proof please follow this link.

Before discussing about algorithm lets first see notations that I will be using for further explanation. We’ll use wˡⱼₖ to denote the weight for the connection from the kᵗʰ neuron in the (l−1)ᵗʰ layer to the jᵗʰ neuron in the lᵗʰ layer. We use bˡⱼ for the bias of the jᵗʰ neuron in the lᵗʰ layer. And we use aˡⱼ for the activation of the jᵗʰ neuron in the lᵗʰ layer.

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives ∂C/∂wˡⱼₖ and ∂C/∂bˡⱼ. We first introduce an intermediate quantity, δˡⱼ, which we call the error in the jᵗʰ neuron in the lᵗʰ layer.

Solving it with the help of chain rule we finally get the following algorithm.

Let’s explicitly write this out in the form of an algorithm:

Various Approaches to Back Propagation
Figure 2

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach “symbol-to-number” differentiation. This is the approach used by libraries such as Torch(Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012)and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in Figure 2. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to runback-propagation again, differentiating the derivatives in order to obtain higher derivatives.

Please provide your feedbacks, so that I can improve in further articles.

Articles in Sequence:

  2. Deep Learning: Basic Mathematics for Deep Learning
  3. Deep Learning: Feedforward Neural Network
  4. Back Propagation

One last thing…

If you liked this article, click the💚 below so other people will see it here on Medium.