Welcome to part 7 of the self-driving car course. This blog series introduces us to the world of self-driving cars: how they work, their pros and cons, and the companies building them. By now I hope the terminology used in building neural networks is familiar; if you haven't read them already, you can visit part 1, part 2 and part 3.

This section is about backpropagation; we will dig into the details and understand it with a numerical example. I want to make sure our base is strong, which will help us in understanding the concepts behind self-driving cars.

Let's take the example above. Suppose the image above is a small part of a huge computational graph, and we are looking at just one node that takes inputs x & y. A function is applied to both inputs and an output z is produced, and at the end of the graph the loss (the error) is computed. One thing we can do as soon as the node computes z is to also compute the local gradients of z with respect to x & y. This becomes really helpful when we compute the gradient of the loss w.r.t. the inputs x & y via the chain rule, dL/dx = dL/dz * dz/dx. The sign of this gradient tells us whether the input's influence on the loss is positive or negative.
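To make the chain rule at a single node concrete, here is a minimal Python sketch for a hypothetical multiply node z = x * y (the function name and the numbers are mine, purely for illustration):

```python
def multiply_node_backward(x, y, dL_dz):
    """Return (dL/dx, dL/dy) for z = x * y, given the upstream gradient dL/dz."""
    dz_dx = y                        # local gradient of z with respect to x
    dz_dy = x                        # local gradient of z with respect to y
    return dL_dz * dz_dx, dL_dz * dz_dy   # chain rule: local * upstream

print(multiply_node_backward(x=3.0, y=-4.0, dL_dz=2.0))  # (-8.0, 6.0)
```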

We will now work through an example in detail, as shown below. I have taken a function, translated it into a computational graph, and initialized it with random weights. I have also written the derivative expressions at the bottom of the image below; these are the ones we will use to compute the gradients in this problem, and they can be derived easily with basic calculus.

Example for backpropagation

Here we have 5 inputs and 1 output. The nodes marked (+) can be considered binary plus gates; I have made up these gates for simplicity. The weights are written on the edges, and the final output generated by the forward pass is 0.73. Now let's start backpropagation.
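If you want to reproduce the forward-pass numbers, here is a short Python sketch. The values of w0, x0 and w2 are the ones quoted in the walkthrough; w1 = -3.0 and x1 = -2.0 are my assumption for the remaining inputs, chosen because they reproduce the 4.0 and 0.73 shown in the figure:

```python
import math

# Forward pass for the example graph:
# f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0    # assumed values (not quoted in the text)
w2 = -3.0

dot   = w0 * x0 + w1 * x1 + w2   # (-2.0) + 6.0 + (-3.0) = 1.0
neg   = -1.0 * dot               # -1.0
e     = math.exp(neg)            # ~0.37
plus1 = 1.0 + e                  # ~1.37
out   = 1.0 / plus1              # ~0.73

print(round(out, 2))             # 0.73
```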

Step 1:

Consider the actual output to be 1. Now we will backpropagate through the 1/x operation. When the value 1.37 went through the forward pass, the node could compute its local gradient using the formula f(x) = 1/x → df/dx = -1/x². The x value here is 1.37, so the local gradient at that point is -1/(1.37)² = -0.53. Always remember that there are two components to the chain rule at each step: first, the local gradient, which is -0.53, and second, the gradient coming from above, in this case 1.0.

In order to compute the gradient at 1.37, we multiply the local gradient by the gradient coming from the layer above. Hence the gradient at the edge carrying 1.37 is -0.53. Since this value is negative, that edge has a negative effect on the output. I hope we are together on this; if you did not understand what's going on, read through the concept again until it's clear. Let's continue the backpropagation.
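For reference, a tiny sketch of the arithmetic in this step (variable names are mine):

```python
# Step 1 in code: the 1/x gate saw x = 1.37 during the forward pass.
x = 1.37
local_grad    = -1.0 / x**2   # df/dx for f(x) = 1/x, about -0.53
upstream_grad = 1.0           # the gradient arriving from the loss
print(round(upstream_grad * local_grad, 2))  # -0.53
```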

Step 2:

Let's take a look at step 2, moving in the backward direction. The node is just a (+1) gate, meaning it simply adds a constant to the value that flows through it in the forward pass. So we can use the following formula to compute the local gradient.

f(x) = c + x → df/dx = 1

The computed local gradient is 1, and the gradient flowing from above is -0.53, which comes from step 1. Applying the chain rule, the gradient at the edge carrying 0.37 is -0.53. I hope we now have some idea of how gradients are computed in the backpropagation step. Still not? Let's try another one.
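The same step in code (again, just a sketch with my own variable names):

```python
# Step 2 in code: the (+1) gate has a local gradient of 1, so the upstream
# gradient of -0.53 passes through unchanged.
local_grad    = 1.0     # df/dx for f(x) = c + x
upstream_grad = -0.53   # from step 1
print(upstream_grad * local_grad)  # -0.53
```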

Step 3:

Moving backwards :P the next operation is exp, and it received the input -1 during the forward pass. To compute the local gradient we use the formula highlighted in the image below.

We get e⁻¹, and the gradient we received from step 2 is -0.53. Applying the chain rule to compute the final gradient for the edge carrying -1.0, we get -0.20. Huuussshhhh… that was some math. Since we have used 3 of the 4 formulas, let's also try the fourth one. :)
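Here is the same computation as a sketch; I carry the unrounded upstream gradient so the result rounds to the -0.20 quoted above:

```python
import math

# Step 3 in code: the exp gate received -1.0 during the forward pass.
x = -1.0
local_grad    = math.exp(x)      # df/dx for f(x) = e^x, i.e. e^-1, about 0.37
upstream_grad = -1.0 / 1.37**2   # about -0.53, carried unchanged through the (+1) gate
print(round(upstream_grad * local_grad, 2))  # -0.2
```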

Step 4:

The next operation at the node is the (*-1) operation, which I also call the flipper operation, because whatever input it gets in the forward pass, it flips the value to the opposite sign.

As I said, we will be using the last formula on the list to compute the local gradient. The a in the formula is -1 and the x is 1.0, hence the local gradient is -1, and the gradient from the step above is -0.20. Computing the final gradient using the chain rule, we get 0.20. This makes sense because the operation is a flipper operation, as I said: in the backward step it received -0.20 and flipped it to 0.20. The next step is important, as we encounter a different scenario.
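In code, the flipper behaviour looks like this (sketch only):

```python
# Step 4 in code: the (*-1) "flipper" gate. With f(x) = a*x and a = -1, the
# local gradient is a, so the upstream gradient simply changes sign.
a = -1.0
local_grad    = a       # df/dx for f(x) = a*x
upstream_grad = -0.20   # from step 3
print(upstream_grad * local_grad)  # 0.2
```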

Step 5:

Now I am sure we all understand backprop, but there is one more scenario that needs to be covered. The next backward step is a (+) node that receives two inputs coming from different edges. First, let's compute the gradient for the edge connecting w2. The value -3 passes through that edge during the forward pass; the local gradient of the (+) gate with respect to that input is 1.0, and the gradient coming from above is 0.20. Thus we get the final gradient to be 0.20 for the edge connecting w2.

Similarly, the final gradient for the other edge, the one carrying the value 4.0, will also be the same, because the (+) gate's local gradient with respect to each of its inputs is also 1. So we can say that a (+) gate simply distributes its gradient to all the edges connected to it in the backward pass.
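A short sketch of that distribution:

```python
# Step 5 in code: a (+) gate has a local gradient of 1 on each input edge, so
# it simply routes the upstream gradient to both of them.
upstream_grad = 0.20                 # arriving from the (*-1) gate
grad_w2  = 1.0 * upstream_grad       # edge carrying w2 = -3.0
grad_sum = 1.0 * upstream_grad       # edge carrying the value 4.0
print(grad_w2, grad_sum)             # 0.2 0.2
```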

Final step

We can now look at the final step, in which we compute the final gradients for the inputs. A multiply gate uses the other input as its local gradient, so the inputs effectively swap places, and we can see that in the computation of the gradients of x0 and w0: to compute the gradient of x0 we multiply 2.0 (the value of w0) by the gradient coming from above, and to compute the gradient of w0 we multiply -1 (the value of x0) by the gradient coming from the step above.
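In code, with the values quoted above:

```python
# Final step in code: a multiply gate uses the *other* input as its local
# gradient, so the inputs effectively swap places when computing gradients.
upstream_grad = 0.2            # arriving at the w0*x0 node from the (+) gates
w0, x0 = 2.0, -1.0
grad_x0 = w0 * upstream_grad   #  2.0 * 0.2 =  0.4
grad_w0 = x0 * upstream_grad   # -1.0 * 0.2 = -0.2
print(grad_x0, grad_w0)        # 0.4 -0.2
```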

One thing that I wanted to mention over here is that we can collapse some of these gates into a single gate. As seen in the diagram below, the entire 4 operations can be combined into one, which is nothing but the sigmoid function that we read about in the earlier part of this series. And the derivative of the sigmoid function is (1 - sigma(x)) * sigma(x).

In our scenario sigma(x) = 0.73, and computing the gradient for the entire sigmoid cell gives (1 - 0.73) * 0.73 ≈ 0.2, which is nothing but the gradient for the edge carrying the input value 1.0.
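A one-line check of the shortcut:

```python
# The same 0.2 drops out in one step from the sigmoid derivative
# d(sigma)/dx = (1 - sigma(x)) * sigma(x), with sigma(x) = 0.73 from the forward pass.
sigma_x = 0.73
upstream_grad = 1.0
print(round(upstream_grad * (1 - sigma_x) * sigma_x, 2))  # 0.2
```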

I hope I was able to make you guys understand how gradients flow in a neural network.

It is completely OK if you were unable to follow all of these backprop and calculus concepts, since we have amazing libraries like TensorFlow, Keras, PyTorch, etc. that provide automatic differentiation. In other words, these packages perform backpropagation with just a line of code, which we will deal with in the next section. Till then, let's keep programming self-driving cars with this self-driving car course.
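As a small teaser for the next section, here is a minimal sketch of what that looks like with PyTorch's autograd, reusing the example values (w1 and x1 are the ones assumed in the earlier forward-pass sketch):

```python
import torch

# The same graph, with autograd doing the backward pass for us.
w = torch.tensor([2.0, -3.0, -3.0], requires_grad=True)   # w0, w1, w2
x = torch.tensor([-1.0, -2.0])                            # x0, x1 (x1 assumed)

out = torch.sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])     # forward pass, ~0.73
out.backward()                                            # backpropagation in one line

print(w.grad)   # approx. [-0.20, -0.39, 0.20]
```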
