Building Self-Driving Cars Course #6 — Neural Networks (Part 3)

Chandan Verma
Autonomous Machines
11 min read · Mar 7, 2019

Welcome to part 6 of the self-driving car course. I hope we now understand the terminology used in building neural networks; if you haven’t read them already, you can visit part 1 and part 2. In this section, we will look at how to build these non-linear models, or neural networks, and how deep neural networks function. This section is going to be math heavy and will form the basis of building self-driving cars. So let’s get started.

Neural Networks

So let’s understand what’s behind these neural networks. Neural networks combine multiple linear models and transform them into a non-linear model. As we can see in the image below, it is just a mathematical operation performed on two linear models, say

linear model 1 + linear model 2 = non-linear model 1

Let’s try to understand it mathematically. A linear model gives us the probability of each point being blue. Suppose the first linear model gives a probability of 0.7 for a point being blue and the second linear model gives a probability of 0.8 for the same point. Now the question arises: how do we combine these two models?

The first thing that comes to mind is simply taking the sum of the probability scores, i.e. 0.7 + 0.8 = 1.5. The resulting score of 1.5 doesn’t look like a probability, since a probability must lie between 0 and 1. So how do we convert this score into a probability? We have come across a similar situation before: we pass the score through a sigmoid function, which gives us a probability of about 0.82. Similarly, we compute the probability for every point in the space. Simple enough, right? Let’s make it a bit more complex.

What if we want the first linear model to have more influence (say a weight of 7) and the second model to have less influence (a weight of 5)? We would then take a weighted sum, something like

(7 * 0.7) + (5 * 0.8) - 6 (bias term) = 2.9

Then we apply a sigmoid function to this score and get a probability of about 0.95. The resulting model is nothing but a linear combination of the two models, as we can see above. I hope we are all on the same page about what neural networks are made of. We can now represent any complex relationship between the variables as a combination of these linear models, resulting in a complex non-linear model. But our goal is not only to understand the concept behind these neural networks but also to implement it, which is the intention of this self-driving car course.
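To make the arithmetic concrete, here is a minimal Python sketch (the article itself contains no code, so the language and helper names are my own) that reproduces both combinations: the plain sum of 0.7 and 0.8, and the weighted sum with weights 7 and 5 and a bias of -6.

```python
import math

def sigmoid(z):
    # Squash any real-valued score into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities from the two linear models for the same point.
p1, p2 = 0.7, 0.8

# Plain sum: 1.5 is not a valid probability, so pass it through the sigmoid.
print(sigmoid(p1 + p2))        # ~0.82

# Weighted sum with weights 7 and 5 and a bias of -6.
score = 7 * p1 + 5 * p2 - 6    # = 2.9
print(sigmoid(score))          # ~0.95
```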

Now many of you might be thinking that the diagram above looks like a perceptron. Isn’t it? Yes, you are correct. Perceptrons are the building blocks of neural networks. Let’s dig a bit deeper, take the same example as above, represent the two linear models as perceptrons, and see how they form the basic building blocks of a neural network. Suppose the equation of the first linear model is (5*x1 - 2*x2 + 8) and that of the second model is (7*x1 - 3*x2 - 1), and the outputs of these models are then combined with a weighted sum (7 * output of model 1) + (5 * output of model 2). This combining of different perceptrons through weighted sums is what forms a neural network.

The edges on the left tell us what equations the linear models have, and the edges on the right tell us in what linear combination these models should be combined to get the non-linear curve on the right. The biases 8 and -1 can also be represented separately, with a constant node of 1 and the values 8 and -1 on its edges, as seen below.
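Putting the two perceptrons together in code, here is a small sketch of how they feed a third perceptron. The equations 5*x1 - 2*x2 + 8 and 7*x1 - 3*x2 - 1 and the combining weights 7 and 5 come from the text above; the output bias of -6 is carried over from the earlier example and is an assumption, since the article does not restate it here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(weights, bias, inputs):
    # Weighted sum of the inputs plus a bias, squashed by a sigmoid.
    score = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(score)

def two_layer_model(x1, x2):
    # Hidden layer: the two linear models from the text.
    h1 = perceptron([5, -2], 8, [x1, x2])    # 5*x1 - 2*x2 + 8
    h2 = perceptron([7, -3], -1, [x1, x2])   # 7*x1 - 3*x2 - 1
    # Output layer: combine them with weights 7 and 5 (bias of -6 assumed).
    return perceptron([7, 5], -6, [h1, h2])

print(two_layer_model(0.4, 0.6))  # probability that the point is blue
```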

Layers

Neural networks have a special architecture that consists of layers. In the diagram above, the nodes x1 and x2 form the input layer. The middle layer consists of the set of linear models created from these input nodes and is known as a hidden layer. We may have one or many hidden layers, depending on the complexity of the problem. The final layer, where the linear models combine to give a non-linear model, is known as the output layer.

Here, since we have only 2 input nodes, the model lives in a 2D space. But what if we have 3 input nodes, or even more? With 3 inputs, the boundary will be a plane in 3D space. So far we have looked at binary classification, where a point is labelled blue or red. What if the output has more than 2 possible outcomes? Such problems are known as multiclass classification problems. So if we want to classify points as red, green, or blue, we have one output node for each class, giving the probability score for that class.

We have talked about more nodes in the input layer and more nodes in the output layer, but we might also have more nodes in the hidden layers. In fact, we may have more than one hidden layer, through which we can model more and more complex relationships, which is essential in self-driving cars, game-playing agents, etc. Such networks are called deep neural networks.
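To illustrate what such an architecture looks like in practice, here is a minimal Keras sketch of a deep network with two hidden layers and one output node per class. The layer sizes, the softmax output and the loss are assumptions made for the sake of the example; the article does not specify them.

```python
import tensorflow as tf

# A small deep network: 2 input features, two hidden layers,
# and one output node per class (red, green, blue).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="sigmoid"),   # 1st hidden layer
    tf.keras.layers.Dense(8, activation="sigmoid"),   # 2nd hidden layer
    tf.keras.layers.Dense(3, activation="softmax"),   # one probability per class
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary()
```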

Feedforward

Feedforward is the process neural networks use to turn the input into an output. Let’s study it more carefully before we dive into how to train the networks.

Training a neural network means tuning the weights on its edges so that the network solves the problem at hand. In order to understand how to train these networks, we must first understand how they process inputs to produce outputs.

We should be comfortable with the image above by now. Before looking at what a feedforward network does, let’s understand the notation in the image. Here x1, x2 and 1 (the bias node) are the input nodes. The weight matrix for the first layer is denoted W¹, and an entry such as W¹₁₁ is read as follows: the superscript 1 is the layer, the first subscript is the index of the input node and the second subscript is the index of the hidden node. Similarly, W² is the weight matrix for the second layer. The values in this matrix tell us in what combination the linear models from layer one must be combined in order to get the optimal solution.

The feedforward network multiplies the inputs by the weight matrix W¹. The output is then passed through a sigmoid function that gives probability values. These values are then multiplied by W² and passed through another sigmoid that yields the final prediction ȳ. Once we have this prediction, we evaluate the performance of the model using the error function. If you don’t know what an error function is, you can read about it here. If you are getting a bit confused, don’t worry, it’s normal. We will go through a numerical example to help clear things up.
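In code, that whole forward pass is just two matrix multiplications, each followed by a sigmoid. A minimal NumPy sketch, with placeholder values for W¹ and W² since the actual numbers live in the figure (biases omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W1, W2):
    # Layer 1: the linear models followed by a sigmoid.
    hidden = sigmoid(W1 @ x)
    # Layer 2: combine the linear models and squash again.
    return sigmoid(W2 @ hidden)

# Placeholder weights (the real values come from the diagram / training).
W1 = np.array([[0.5, -0.2], [0.7, -0.3]])  # 2 hidden nodes x 2 inputs
W2 = np.array([[0.7, 0.5]])                # 1 output x 2 hidden nodes
x = np.array([0.4, 0.6])

print(feedforward(x, W1, W2))  # the prediction ȳ
```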

Numerical implementation

We will look at how to implement the multilayer perceptron feedforward network indicated in the diagram below: the input layer (i1, i2, i3), the 1st hidden layer (j1, j2, j3), the 2nd hidden layer (k1, k2, k3) and the output layer l, which indicates whether the given point is blue.

Here the input is [0.1, 0.2, 0.7] and the output is [1.0], which indicates that the given point is blue. Since there are 2 hidden layers, we will have 3 weight matrices: one between the input and the first hidden layer (Wij), a second between the first and the second hidden layer (Wjk) and a third between the second hidden layer and the output layer (Wkl). We have initialized these weight matrices with random numbers for illustration purposes. To decide the dimensions of the weight matrix Wij, we set the number of rows in Wij to the number of nodes in j (i.e. 3) and the number of columns in Wij to the number of nodes in i (i.e. 3).

Let’s start with the matrix computation for the first layer, as seen in the image below.

After multiplying the input vector by the 1st weight matrix Wij, the sigmoid function is applied to the result, which gives the values of [j1, j2, j3]. These values are then multiplied by the 2nd weight matrix Wjk, followed again by the sigmoid function, to give [k1, k2, k3].

And finally, we multiply these outputs by Wkl to get the final prediction (ȳ). This ȳ is the probability that the given point is blue. As we can see, the model generates a good prediction, saying that it is 91% confident that the given point is blue. But there is still an error, which can be computed using the error formula.
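Here is a sketch of that numerical feedforward in NumPy. The input and the network shape come from the text; the weight values below are freshly drawn random numbers (the actual matrices used in the article appear only in its images), so the resulting ȳ will differ from the 0.91 above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x = np.array([0.1, 0.2, 0.7])   # input layer (i1, i2, i3)
y = np.array([1.0])             # target: the point is blue

# Random weight matrices, shaped (#nodes in next layer, #nodes in previous layer).
W_ij = rng.random((3, 3))       # input -> 1st hidden layer
W_jk = rng.random((3, 3))       # 1st hidden -> 2nd hidden layer
W_kl = rng.random((1, 3))       # 2nd hidden -> output layer

j = sigmoid(W_ij @ x)           # [j1, j2, j3]
k = sigmoid(W_jk @ j)           # [k1, k2, k3]
y_hat = sigmoid(W_kl @ k)       # prediction ȳ

print(y_hat)
```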

In our case, if we substitute the values of ȳ and y we get an error of 0.041. Since we are using only one example here, we don’t take the average of the error. We already know that the goal in any machine learning or deep learning problem is to minimize this error. But how do we do that? Let me introduce the most important concept in training neural networks: backpropagation.

Backpropagation

As we saw, the prediction generated by the network is pretty good, and we can further tune its weights to get even more accurate predictions. Now let’s assume instead that a point is predicted badly, as seen in the figure below.

The misclassified blue point asks the curve to move closer to it. Looking at the linear models in the hidden layer, the first model classifies the point incorrectly while the second model correctly predicts the point to be blue. We would want to listen more to the second model and less to the first, i.e. we reduce the weight coming from the first linear model and increase the weight coming from the second. We can do even more by looking at the two linear models themselves: we can classify more accurately by moving the line of the first linear model towards the point and the line of the second model further away from it. This change in the models is achieved by updating the weights of the two linear models. We haven’t considered the bias here for simplicity, but during backpropagation we update both the weights and the biases. I hope you now have a high-level idea of what backpropagation is. Now let’s dig a bit deeper. Tighten your seat belts, maths coming in.

The output of the feedforward pass gives us the prediction (ȳ). We then compute the error with the error function and compute the gradient of that error function. Before computing the gradient, let’s understand the chain rule, something we studied in our college days. Understanding the chain rule is a prerequisite for understanding the gradient computation.

The chain rule states that if you have a variable x and apply a function f to get f(x), which we call A, and then apply another function g to obtain g(f(x)), which we call B, then the partial derivative of B with respect to x is the partial derivative of B with respect to A multiplied by the partial derivative of A with respect to x. As we know, the feedforward pass is nothing but functions applied to other functions, and so on, while backpropagation is a series of partial derivatives taken at each of those functions. Confused? Don’t worry, we will look at a numerical implementation where things will become clearer.
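Here is a tiny numerical check of the chain rule in Python, with f and g chosen arbitrarily for illustration: the analytic product (dB/dA) * (dA/dx) is compared against a finite-difference estimate of dB/dx.

```python
import math

def f(x):            # A = f(x) = x**2
    return x ** 2

def g(a):            # B = g(A) = sigmoid(A)
    return 1.0 / (1.0 + math.exp(-a))

x = 0.5
A = f(x)
B = g(A)

# Chain rule: dB/dx = dB/dA * dA/dx
dB_dA = g(A) * (1 - g(A))   # derivative of the sigmoid
dA_dx = 2 * x               # derivative of x**2
chain = dB_dA * dA_dx

# Numerical check with a small finite difference.
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

print(chain, numeric)       # the two values agree closely
```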

Now let’s continue with the computation of the gradient of the error function. We start with the derivative of the error function. The error function is a function of ȳ, while ȳ is a function of the weights W¹ (W11, W12) and W² (W21, W22). Hence the error function can be written as a function of all the weights Wij. So the gradient of the error function is the vector formed by the partial derivatives of the error function E with respect to each of the weights Wij, as seen in the diagram below.

Once we compute the derivative of E with respect to each of the weights Wij, we subtract from the current weights the partial derivative of E with respect to Wij multiplied by a small factor (α), the learning rate, which we discussed here.

The new weights and biases give us a new model that should be a bit more accurate in its predictions. This updating of the weights and biases is repeated until we minimize the error function and find the optimal solution.
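To tie the forward pass, the chain rule and the weight update together, here is a hedged end-to-end sketch: the same 3-3-3-1 network as in the numerical example, trained on the single point [0.1, 0.2, 0.7] → [1.0] with plain gradient descent. The squared-error loss E = ½(y - ȳ)², the learning rate and the random initial weights are all assumptions made for the sake of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.1, 0.2, 0.7])
y = np.array([1.0])

# Random initial weights, shaped (#nodes in next layer, #nodes in previous layer).
W_ij = rng.random((3, 3))
W_jk = rng.random((3, 3))
W_kl = rng.random((1, 3))

alpha = 0.5   # learning rate (assumed)

for step in range(1000):
    # Forward pass.
    j = sigmoid(W_ij @ x)
    k = sigmoid(W_jk @ j)
    y_hat = sigmoid(W_kl @ k)

    # Squared error E = 0.5 * (y - ȳ)^2 (an assumed error function).
    error = 0.5 * np.sum((y - y_hat) ** 2)

    # Backward pass: the chain rule applied layer by layer.
    delta_l = (y_hat - y) * y_hat * (1 - y_hat)   # dE/d(score) at the output
    delta_k = (W_kl.T @ delta_l) * k * (1 - k)    # propagated to 2nd hidden layer
    delta_j = (W_jk.T @ delta_k) * j * (1 - j)    # propagated to 1st hidden layer

    # Gradient descent update: w <- w - alpha * dE/dw.
    W_kl -= alpha * np.outer(delta_l, k)
    W_jk -= alpha * np.outer(delta_k, j)
    W_ij -= alpha * np.outer(delta_j, x)

print(error, y_hat)   # the error shrinks and ȳ approaches 1.0
```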

Huuusssshhh….. That was a lot to digest, and I assume it was the same for you. But I hope we now understand what neural networks are and the mathematics behind them, and how the forward pass and the backward pass work. If you are still confused about backpropagation, the numerical implementation above should help clear things up. And if you are not interested in the mathematics behind backpropagation, you can skip it, because libraries like Tensorflow, Keras, Pytorch and many more support automatic differentiation and compute the derivatives automatically. We don’t have to code the complex steps of backpropagation ourselves, but it’s good to know what’s going on behind the scenes. So see you soon, and let’s build self-driving cars using this self-driving car course. Full article published on https://theautonomousmachines.com/building-self-driving-cars-course-6/
