NEURAL NETWORK LEARNING INTERNALS (LINEAR REGRESSION)

Serban Liviu
22 min read · Oct 30, 2017


The main frustration I had when I was trying to learn about neural networks was that most tutorials are either too math oriented and too complex, so you feel like you will never understand what they are actually talking about, or too superficial, just presenting you with a framework that does everything for you. In both cases you end up not really understanding what is going on under the hood. Most books will tell you about the back-propagation algorithm, which is the standard algorithm for training neural nets, but they will not give you an intuitive explanation of how it works and what exactly it is doing. Try to find some tutorials about neural nets on the web and you will instantly know what I’m talking about here. Many of them will just tell you that a neural net is a stack of layers, and each layer has from one to as many neurons as you want. Each neuron in one layer connects to each neuron in the next layer, yielding a complete-graph-like structure. A neuron gets some inputs, each input is multiplied by a weight (i.e. a real-valued number), all these products are summed up, the sum goes through a sigmoid-like function, and this process is repeated for each neuron in each layer. And by using the back-propagation algorithm, somehow, magically, this giant multiplying structure will be able to recognize faces in pictures, translate text from English to French, and so on. That surely looks very magical and forbidding when you try to understand why a bunch of multiplications and sums of real numbers can really do the tricks that a regular algorithm could only dream of doing. Although the standard neuron-like structure looks very simple, for some reason, if you have enough of them and enough training data, these very simple structures will compute the hardest things: things that would be impossible to compute with a normal algorithm, no matter how complex and sophisticated it is.

The main goal of this post is to go a little bit into the details of what it really means to train a neural network: to understand what is happening and how the weights are changed in order to make this rather simple system of neuron-like structures so effective at computing almost anything. We are not going to present any framework here, and the only code you will see is a handful of tiny sketches meant to make the calculations concrete. This lecture is meant to explain the theoretical aspects of neural net training and why some things are done the way they are done, without going into the very complicated mathematical or statistical aspects of it. I have tried as much as I could to keep the math as simple as possible. For instance, many tutorials will tell you that back-propagation relies heavily on partial differentiation, but after reading them you are still not going to understand why partial differentiation is needed. It is presented as a black box that will do the magic for you. You just have to trust it somehow. At least that was my feeling.

Before you start, be aware that you should be familiar with some mathematics such as derivatives and partial derivatives. If you have problems with that, you should try the calculus course on Khan Academy. There is no need to go through the whole course; at least cover the parts on function limits and derivatives. Check the playlist here

So I guess we should start. Have fun :)

Before going into any details we should kick off with an example of what a very basic neural network can do. For instance, a rather simple neural net that can recognize handwritten digits will take as input images of 28*28 pixels each and will output the digit displayed on that image. The input can be thought of as an array of 784 (28*28) entries, each representing a pixel intensity as a number between 0 (black) and 255 (white), and the output will be an array of 10 entries (all digits from 0 to 9) that are either 0 or 1. So if, for instance, we give the neural net as input an image with a handwritten 1 on it, we expect the output vector to be all 0 except the second position, which will be 1. So you get the point. Between the input layer of 784 neurons and the output layer of 10 neurons, we can have as many layers as we want, each of different sizes. In this case we are going to use one hidden layer of 300 neurons. Below is a representation of this neural net. Obviously I couldn’t represent all 784 neurons in the input layer nor the 300 in the hidden layer.
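For readers who prefer shapes to diagrams, here is a minimal sketch of a single forward pass through such a 784-300-10 network. Only the layer sizes come from the text; the random weights, the sigmoid activation and the variable names are illustrative assumptions, not a particular library's API:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(300, 784)), np.zeros(300)  # input layer -> hidden layer
W2, b2 = rng.normal(size=(10, 300)), np.zeros(10)    # hidden layer -> output layer

x = rng.random(784)             # a stand-in "image": 784 pixel intensities
hidden = sigmoid(W1 @ x + b1)   # each hidden neuron: weighted sum + bias, then squash
output = sigmoid(W2 @ hidden + b2)
print(output.shape)             # (10,) - one value per digit
```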

This is a sample image that could be recognized by the neural net:

This network, although it is very small among neural nets, is still a very complicated system. We can think of the problem of recognizing handwritten digits in more abstract terms, as a function of this type:

We take as input a tuple of 784 real numbers (the intensity of each pixel in the image) and we output a tuple of 10 numbers that can only have the values 0 or 1. In our case only one value will be 1 and the rest will be 0. We are not talking about the special case when the input image does not contain any digit at all; we are going to keep things simple here. Given enough annotated training data, the neural network will attempt to approximate this very complicated function. In mathematical terms a neural network is actually a universal function approximator. That’s what makes them so powerful and effective. One example of function approximation to a certain degree of error is a statistical method known as regression, which is heavily used in machine learning. You can find more about regression on the web. In this post we are going to limit ourselves to the simplest type of regression: linear regression.

Linear regression

In general we have a bunch of data that we will call the training data. The training data contains pairs of inputs and their desired outputs. In our handwritten-digit example, the training data will contain a few thousand pairs (X, y) where X is an array of 784 values and y is an array of 10 values that are either 0 or 1. We then construct our neural net with all the weights initialized to some random values. Since the values are random, we expect that when we give the neural net an image as input, the result won’t match the desired output for that particular image. Therefore the job of training a neural net is to find a way to adjust the weights such that the result of the neural net, when we give it any image as input, is as close as possible to the desired output for that image from the training data.

Now, instead of thinking in very high dimensions, it is better to reduce the problem a bit and think about it in just two dimensions, where it is much easier to understand.

Instead of a training set of thousands of images, we are going to have a “training set” of just 8 points in the xOy plane. Each point will be a pair of the form (x, y) where the x coordinate is the “data” and the y coordinate is the desired output.

Since an image is worth a thousand words, I think it is better to actually see an example:

A handful of points plotted

As you can see, there are 8 points plotted here. The big challenge is that we want to find a function, or a model, or a neural network if you will, that can approximate this set of points. To put it simply, we want to find a straight line (the straight line will be our “model”) that best approximates these points. What does it mean to approximate these points? Well, first off, every line in the plane has an equation of the form y = mx + b. Here x is the input parameter: a value on the Ox axis. Based on that value, we can use the formula above to find the corresponding point that sits on the line, i.e. the point with coordinates (x, y). Without going into too much detail, since you can find plenty of material on the web, every line in the plane has an intercept and a slope. For each value on the Ox axis, we can find the corresponding y value of the point on the line using the line equation. The m is the line’s slope, while the b is the intercept, i.e. the value on the Oy axis where the line intersects it.

We can think of the m and the b values as the parameters of our very simple model. In fact, mx + b can be thought of as a single neuron/perceptron with one input: x is the input, it is multiplied by the value m (the line’s slope, or the weight in neural net terminology), and then we add the line’s intercept (the bias in neural net terminology). We are not using any activation function in this example, because we want to keep it simple.
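If it helps to see this in code, here is a minimal sketch of that one-input “neuron”. The function name is mine, and the sample slope and intercept simply anticipate the line used in the example below:

```python
def predict(x, m, b):
    # A single "neuron" with one input: weight m (the slope) and bias b (the intercept).
    return m * x + b

print(predict(1, 1/8, 1))  # 1.125, the value the example line below produces for x = 1
```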

Now, given the points that we plotted earlier, let’s draw a random line. By random line I mean that we choose the slope and the intercept randomly. Below is an example of that:

Now let’s look, for instance, at the first point. Its x value is 1 while its y value is 2. The corresponding point on the line (by corresponding I mean the point that sits on the line and has the same x value) has a y value of 1.125. We calculated that using the line equation, which in this case is y = x/8 + 1. See the image below for all the points on the line that correspond to the plotted points:

So if the value given to our “model” is 1, for example, the desired output is 2. We use the line equation, our “model”, to calculate the output and we get the value 1.125, which is different from 2. Not only is it different, it is not even close to 2, the desired value.

As we can see in the above image, there is quite a distance between any given point and its corresponding point on the line. The challenge for us is to find a new slope and a new intercept that minimize the distance between the desired values and what the line produces for a given x value. In order to do that, we first have to come up with a formula that measures the whole error.

In the image above, the red bars (also known as the error margins or residual errors) represent the distances between each given point and its corresponding point on the line. We can sum up all these differences, and we will get something like this:

This sum will give us an error estimate of our approximation. In the above case, it is obvious that the error will be quite high, which would mean that this line, with this equation, might not be the best approximation for the given points. Let’s calculate the error to see for ourselves.

Given that the line equation is f(x) = x/8 + 1, we have the following values:

First point: y1 = 1/8 + 1 => y1 = 1.125 => the distance between the plotted point and the point on the line (same x coordinate) is 2 - 1.125 = 0.875.

For the second point, the distance (error) is 1.75. Third point error: 2.15. Fourth point error: 0.5. Fifth point error: 1.0625. Sixth point error: 1.55. Seventh point error: 1.65. And finally, the last point’s error is 2.5375.

Now if we use the above formula to calculate the whole error we would obtain the following:

0.875 + 1.75 + 2.15 + 0.5 + 1.0625 + 1.55 + 1.65 + 2.5375 = 12.075.

Now, dividing this by the number of points, we get 1.509375. That is the total error for this line trying to approximate the plotted points. Note that we divide by the number of points because we are interested in the average error. If, say, a few points have a very small error, that would not mean much on its own, since we care about the average error across all the points. We want the error to be small on average. The above formula is known as the MSE or Mean Squared Error. For more info check this link: Mean Squared Error. The square root in the formula comes from the Euclidean distance formula, but since each point and its counterpart on the line share the same x coordinate, that part of the distance formula is zero. You may ask why we need a square root around a squared difference. The two operations, square and square root, being inverses of each other, cancel out. The reason for them is that we want the resulting error to be a positive number. We can either use this or the absolute value of the difference, such as:

It makes no difference which of the two formulas we use. So we have just computed the total error based on some particular values for the slope and the intercept of the line that we drew. What if we change the line’s slope or its intercept? We will get a different cost value. If the line is moved from its current position “closer” to the points, we will end up with a lower cost, and therefore a better approximation. So, in other words, the cost function shown above is a function of the slope and the intercept of the line: a function of two variables, in this case m and b. Obviously, in real-world situations such as recognizing handwritten digits, we are going to deal with neural networks with many, many parameters, so the cost function will be a function of thousands or even millions of parameters. In our case the cost function, let’s call it C(m,b), will be the following:

where y’(m,b) is the Y coordinate of the point on the line (computed by the “model”), and y is the Y coordinate of a given point, or the desired value from the training data if you will.
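To make this concrete, here is a small sketch of C(m, b) in Python. The eight (x, y) points are my reconstruction from the residuals worked out in this post (the original plots are not reproduced here), so treat the exact coordinates as an assumption; the function itself just averages the absolute differences.

```python
# Eight (x, y) points reconstructed from the residuals listed in the text
# (an assumption: the original plot is not reproduced here).
points = [(1, 2), (2, 3), (2.8, 3.5), (4, 2), (4.3, 2.6),
          (5.2, 3.2), (6, 3.4), (6.9, 4.4)]

def cost(m, b):
    # Average absolute difference between the line's output and the desired y.
    return sum(abs((m * x + b) - y) for x, y in points) / len(points)

print(cost(1/8, 1))  # ~1.509375, the value computed above
```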
Let’s play a little bit with both the intercept and the slope in order to get a more accurate picture of what is happening, and of what the cost function itself looks like.
Let’s fix the slope to a certain value, vary the intercept, and see what we get.
When m = 1/8 and b = 0 we will have the following situation:

The cost in this case is:

(1.875 + 2.75 + 3.15 + 1.5 + 2.0625 + 1.95 + 2.65 + 3.5375)/8 = 2.434375. You can calculate that yourselves.

When m = 1/8 and b = 1 we will have the following situation:

Clearly, you do not even have to calculate the cost to see that it is smaller than the previous cost; the line is closer to the plotted points than it was in the previous picture. We will calculate the cost anyway just to make sure.

(0.875 + 1.75 + 2.15 + 0.5 + 1.0625 + 1.55 + 1.65 + 2.5375)/8 = 1.509375

When m = 1/8 and b = 1.5 we have the following:

The cost now is (0.375 + 1.25 + 1.65 + 0 + 0.5625 + 1.05 + 1.15 + 2.0375)/8 = 1.009375
When m = 1/8 and b = 2 we have the following:

The cost is now (0.125 + 0.75 + 1.15 + 0.5 + 0.0625 + 0.55 + 0.65 + 1.5375)/8 = 0.665625
As you can see, the line is basically in the “middle” of the points, which makes it closer to all the points than in any other picture. So from b = 0 to b = 2 the cost has decreased. It is obvious why: when b was 0, the line was farther from the plotted points than it is when b = 2. Now, if we continue to vary b even more, i.e. increase its value beyond 2, the cost will start to increase, which again makes sense because the line will move farther away from the plotted points. Let’s look at a picture to see exactly what happens.

When m = 1/8 and b = 3.5 we have the following:

The cost is now (1.625 + 0.75 + 0.35 + 2 + 1.4375 + 0.95 + 0.85 + 0.0375)/8 = 1. A much larger value than when b was 2.

When m = 1/8 and b = 4 we have the following:

The cost is now (2.125 + 1.25 + 0.85 + 2.5 + 1.9375 + 1.45 + 1.35 + 0.4625)/8 = 1.490625. Even higher than before. By now you can see a trend. When the slope is fixed and we vary the intercept from small values to higher values, the cost function starts from a high value; as we increase b, the line gets closer to the plotted points and, as a consequence, the cost decreases until it reaches a minimum. From that point on, if we continue to increase the value of b, the cost starts to increase again. A high value, decreasing to some minimum, then increasing again: that’s a parabolic pattern. If we think of the cost function as a function of the parameters m and b, then we can plot the points representing all the cost values computed above when m = 1/8 and b varies from 0 to 4. Let’s do that:

Above we have plotted the 6 points that represent the value of the cost function when m was fixed to 1/8 and b had the values 0, 1, 1.5, 2, 3.5 and 4.
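If you want to check these numbers without redoing the arithmetic by hand, you can reuse the small cost function (and the reconstructed points) from the earlier sketch:

```python
# Reuses cost() and points from the earlier sketch.
for b in [0, 1, 1.5, 2, 3.5, 4]:
    print(b, round(cost(1/8, b), 6))

# The cost starts high, drops to its minimum around b = 2, then rises again,
# matching the pattern worked out above (small differences can come from
# reading the point coordinates off the plots).
```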
Now let’s play exactly as before, but this time we will fix the value of b = 2 and we will vary the value of m instead.
When m = 1 and b = 2 we will have the following situation:

The cost function is now: (1 + 1 + 1.3 + 4 + 3.7 + 4 + 4.6 + 4.5)/8 = 3.0125

Let’s change m further. This time m will be 0.7. And we will get this picture:

The cost is now: (0.7 + 0.4 + 0.46 + 2.8 + 2.41 + 2.44 + 2.8 + 2.43)/8 = 1.805. Clearly much better than in the previous setting. Let’s continue tweaking m. If we change m from 0.7 to 0.3 we get this:

The cost is now: (0.3 + 0.4 + 0.66 + 1.2 + 0.69 + 0.36 + 0.4 + 0.33)/8 = 0.5425. As you can see, it keeps getting better. Now let’s decrease the value of m to zero. This is what we get:

Now the cost in this setting, with m = 0, is: (0 + 1 + 1.5 + 0 + 0.6 + 1.2 + 1.4 + 2.4)/8 = 1.0125.

Well, it seems that the cost is now increasing again. Let’s try one final value for m. We will set it to -0.5:

The cost is now: (0.5 + 2 + 2.9 + 2 + 2.75 + 3.8 + 4.4 + 5.85)/8 = 3.025.

Well, that’s pretty bad. The more we decrease the value of m past this point, the worse the cost will get. So, just like when tweaking the b value, we see the same parabolic pattern: the cost goes down, and then goes up again. We also decided to plot the above cost values in 3D to see what they look like:

Since the cost function is a parabola in m and a parabola in b, we can conclude that the graph of the cost function is a bowl-shaped surface, like in the image below:

The blue point that you see in the graph is the minimum of this function. So, to recap what we have done so far: we started with a random value for m and a random value for b. With those values in place, the cost function had a rather large value, which corresponds to a specific point on the bowl-shaped surface of the cost function. Our stated mission is to find values for m and b such that the error is the lowest possible, i.e. the minimum. Voilà: the thing we need to do is find the values of m and b for which the value of the error function is the blue point that we see in the picture. Now suppose that we start at the upper point in the image below. Our mission is to get to the point at the bottom of the bowl.
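If you would like to see this bowl for yourself, here is a small plotting sketch. It reuses the cost function and the reconstructed points from the earlier sketch, and it assumes matplotlib is installed; everything else about it (grid ranges, colors) is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Reuses cost() (and therefore the reconstructed points) from the earlier sketch.
ms = np.linspace(-0.5, 1.0, 60)
bs = np.linspace(0.0, 4.0, 60)
M, B = np.meshgrid(ms, bs)
Z = np.array([[cost(m, b) for m in ms] for b in bs])  # cost at every (m, b) pair

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(M, B, Z, cmap="viridis")
ax.set_xlabel("m"); ax.set_ylabel("b"); ax.set_zlabel("C(m, b)")
plt.show()
```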

One way to do that is by using derivatives, or more precisely gradient descent. That is actually the mathematical tool used by the back-propagation algorithm to train a neural net. The main idea of gradient descent is to imagine that we have a ball initially situated at the upper point, and we let the ball roll all the way down to the bottom of the bowl-shaped surface. When the ball arrives at the bottom of the surface, it simply stops. I’m not going into the details of what derivatives are; there is plenty of great material on the web about that. The best, in my opinion, can be found on Khan Academy in the calculus course. But we are going to talk a little bit about derivatives and partial derivatives just to get the big picture. So, in a nutshell, the derivative of a function is the rate of change of that function, or in more geometric terms, the derivative of a curve at a point is the slope of the tangent line to that curve at that point. So we can use the derivative to change our position on a curve. I think some images will help to convey what I mean here. Let’s suppose that we have a function such as the classic parabola f(x) = x²:

The graph of this function is the parabola. Now imagine that we are positioned at the blue point on the graph as in the following example:

Since the derivative of x² is 2x and we are at x = 4, the derivative, or more precisely the slope of the tangent line to the curve at the blue point, is 2*4 = 8. Gradient descent will use this derivative information to move in one of the two directions, i.e. up or down. Since the derivative is positive, if we add the derivative at that point to the x value of that point, we will get to a new value higher up on the curve. If we instead subtract the derivative value from the x value, we will get somewhere lower down on the curve. In this case 4 - 8 gives a new value for x of -4. Not very bright: we overshot to the other side of the graph. We do not want that. We want a smooth slide from our current position all the way to the bottom of the curve. So maybe it’s not wise to use the whole value of the derivative, but only a small fraction of it. And this is how we are going to do it:

Here α is a small value in (0,1) whose role is to take only a small fraction of the derivative value. So let’s say that α is 0.1 in this case. Using this formula, the next value on the curve will be x = 4 - 0.1*8 => x = 4 - 0.8 => x = 3.2. So we have updated our current position from 4 to 3.2. In the image below you can see that we have moved from the blue point to the new red point. And we did that using the gradient descent formula.
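Here is that update rule as a few lines of Python, starting at x = 4 with α = 0.1 and applied repeatedly (the loop length is an arbitrary choice of mine):

```python
x = 4.0
alpha = 0.1

for step in range(25):
    derivative = 2 * x          # f(x) = x**2, so f'(x) = 2*x
    x = x - alpha * derivative  # move a small step against the slope
    print(step, round(x, 4))

# x goes 3.2, 2.56, 2.048, ... and keeps sliding towards 0,
# the bottom of the parabola, without overshooting.
```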

If we use even smaller values for α, the sliding will be even smoother. For α = 0.01 the next position for x will be 3.92, which is very close to the original position. If we apply the gradient formula iteratively, we keep moving from the current position to the next, and so on, until we reach a position where we are stuck: a point on the graph with derivative zero. In our example that is the bottom of the bowl. At that point we have found the minimum of the function. The same algorithm works for multivariable functions; the difference is that we use partial derivatives. Since we have a function of the type f(x, y), we will have two derivatives: the derivative with respect to x and the derivative with respect to y. The first one tells us the rate of change in the x direction, while the other one tells us the rate of change in the y direction. The most intuitive way to understand partial derivatives is to see some pictures.

You can think of the partial derivative with respect to x as taking a plane parallel to the xOz plane. The yellow plane in the image cuts the surface at a constant value of y. The intersection between the plane and the surface gives us a curve, highlighted in red. Taking the derivative of that curve at the blue point gives the partial derivative of the function with respect to x at the blue point. The derivative of that curve at any point gives us the partial derivative of f(x, y) with respect to x at that particular value of x, with y held constant. The same is true for the partial derivative of f(x, y) with respect to y.

In the above image, we have a plane that is parallel to the yOz plane (x has a constant value), and the intersection between the surface and the plane forms a parabola. We can take the derivative of that parabola at any point on it, and we say that we are taking the partial derivative of the function f(x, y) with respect to y at that particular value of y. Below we have both planes, one for the x direction and one for the y direction, cutting the surface and intersecting at the blue point.

As we saw in the 2D example, the gradient equation gives us a new value for x by subtracting from the current value of x a small fraction of the derivative of the curve at that point. In other words, using the gradient, we move along the Ox axis in very small increments towards the closest point with zero derivative, a local/global minimum. Those small increments can be thought of as vectors that represent the movement from the current point to the new position. Below you can see that when we moved from the blue point to the red one, we have an equivalent green vector on the Ox axis that represents the size and the direction of the movement.

The same is true in 3D space, with the exception that we are going to have a vector for each dimension:

In other words, the gradient is now a vector whose components are the partial derivatives with respect to both x and y. To get from the current upper point to the next point at a lower altitude, we subtract a fraction of the partial derivative w.r.t. x from the current x value, and we also subtract a fraction of the partial derivative w.r.t. y from the current y value, like this:

The next point will be :

which is the lower point on the surface. The small changes, or nudges, in both the x direction and the y direction can be thought of as vectors. Getting from the current point to the next lower point can be thought of as adding those two red vectors; their sum is the yellow vector in the image.
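To make the two nudges concrete, here is a tiny numeric sketch of one such step on the simple bowl f(x, y) = x² + y². The function and the starting point are my own picks, chosen only because the partial derivatives are obvious:

```python
def f(x, y):
    # A simple bowl-shaped surface; its minimum is at (0, 0).
    return x**2 + y**2

x, y, alpha = 4.0, 3.0, 0.1
dfdx = 2 * x          # partial derivative with respect to x
dfdy = 2 * y          # partial derivative with respect to y

# Two small nudges, one per axis; applying both together is the "yellow vector".
x_new = x - alpha * dfdx
y_new = y - alpha * dfdy
print((x_new, y_new), f(x_new, y_new), "<", f(x, y))
```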

For more in-depth information about gradient ascent and gradient descent, check out Professor Leonard’s YouTube channel, or this course:

So gradient descent is a method of moving from a point on the surface of the error function (the mean squared error in this case) to the closest minimum point, or critical point (a point where both derivatives are zero). The way to do that is to calculate the partial derivative w.r.t. each parameter (in our case only two of them, m and b) and to subtract from the current value of each parameter a fraction of its partial derivative. By doing that we simulate a ball rolling down a hill. This is a very simple presentation of how a single neuron with only two parameters, m and b, learns how to fit a line such that it is the best possible approximation of the points plotted at the beginning of this lecture. In reality, neural networks have lots and lots of neurons with hundreds or thousands of inputs each. For instance, the neural network designed to recognize handwritten digits from the beginning of this post has over 238,000 parameters. The input is an array of 784 neurons and the hidden layer has 300 neurons. That means we have 784*300 = 235,200 weights between the input layer and the hidden layer alone, plus 3,000 weights between the 300-neuron hidden layer and the 10-neuron output layer, plus the 300 + 10 biases. Obviously we cannot plot a cost function in 238,000 dimensions; we cannot even do that in 4 dimensions, let alone 238,000. That’s why it is better to understand the internals of these systems by showing how the simplest type of neural network learns: a neural net with one neuron, one weight and one bias.
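Since the whole procedure fits in a few lines, here is a hedged sketch of it applied to our one-neuron model and the reconstructed points used earlier. For simplicity I keep the squared differences instead of the absolute values (the article’s own point is that the two are interchangeable as error measures), because the squares are easier to differentiate; the learning rate and the number of iterations are arbitrary choices of mine.

```python
# Gradient descent on the squared-error cost C(m, b) = (1/N) * sum((m*x + b - y)**2),
# using the same reconstructed points as in the earlier sketches.
m, b = 0.0, 0.0          # start from an arbitrary line
alpha = 0.01             # learning rate: a small fraction of each derivative
N = len(points)

for step in range(5000):
    dC_dm = sum(2 * (m * x + b - y) * x for x, y in points) / N  # partial derivative w.r.t. m
    dC_db = sum(2 * (m * x + b - y) for x, y in points) / N      # partial derivative w.r.t. b
    m -= alpha * dC_dm
    b -= alpha * dC_db

print(m, b)  # a slope and intercept that sit near the bottom of the bowl
```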

To be continued…
