Feed forward and back propagation back-to-back — Part 1 (Linear Equation as a Neural Network building block)
Preface
Understanding feed forward and back propagation is key to mastering how a neural network works. While these two concepts are simple, many of us struggle with the math, especially the math behind back propagation. If this is your case, as it was mine, this series of posts is for you!
First, a note of comfort to the reader. I won’t let terms like gradient and gradient descent, calculus and multivariate calculus, derivatives, chain rule, linear combination and linear equation become boulders blocking your path. By the end of this series, hopefully, you will perceive these terms as the powerful tools they are and see how simply they are applied to building neural networks.
Linear equation
If you remember the graph of a line you drew so many times during your last years of K-12 math classes, you know exactly what a linear equation is. If you don’t remember, I will refresh your memory.
The equation of a line is commonly written as shown below.

f(x) = ax + b (Equation 1)
Equation 1 is in 2-dimensional space because there are two variables: x, which is called the independent variable; and f(x), which is the dependent variable, for its value depends on x. This line is plotted on a plane and the plane has two dimensions: width and height, thus 2-dimensional space.
Figure 1, below, contains the chart of the linear equation f(x) = 2x + 1. Another common notation for the same equation is y = 2x + 1, which implies that y = f(x). If we assign x the value 4, y = 2*4 + 1, yielding y = 9. We can say that the point (4, 9) is exactly on the line. Making a similar calculation for x = 8 yields y = 17; in other words, the point (8, 17) is also on the line. Both points are plotted in figure 1.
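To make the arithmetic concrete, here is a minimal Python sketch (the function name f is just illustrative) that evaluates f(x) = 2x + 1 at the two x values used above:

```python
# f(x) = 2x + 1, the line plotted in figure 1
def f(x):
    return 2 * x + 1

print(f(4))  # 9  -> the point (4, 9) is on the line
print(f(8))  # 17 -> the point (8, 17) is on the line
```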
What is the interpretation of the value a in a linear equation? It is the rate of variation of the points along the line. In layman’s terms: as we vary the value of x, y varies proportionally to a. The rate of variation a is also known as the slope, or angular coefficient, of the line. Angular coefficient? Of what angle?
Remember that in a right triangle the tangent of an angle is the opposite side of the triangle divided by the adjacent one? The angular coefficient a is the tangent of the angle between the line and the x-axis, and it is geometrically interpreted as the inclination of the line. So a is:

a = Δy / Δx
The symbol Δ is the Greek letter delta, and one should interpret it as the sign that represents the difference between two numbers. So a is the difference between the y coordinates of two given points divided by the difference between their x coordinates. Go back to figure 1 (I find it helpful to see the geometric interpretation of these concepts) and verify that the segment labeled Δx, in green, is parallel to the x-axis while Δy, also in green, is parallel to the y-axis. With the points plotted in it, we calculate a as shown below:

a = Δy / Δx = (17 - 9) / (8 - 4) = 8 / 4 = 2
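As a quick sanity check, here is a short Python sketch (the helper name slope is just illustrative) that computes a from the two points of figure 1:

```python
# slope a = Δy / Δx, computed from two points known to be on the line
def slope(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

print(slope((4, 9), (8, 17)))  # 2.0, matching the a in f(x) = 2x + 1
```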
Figure 3 contains another example of a linear equation, only this time a = -2.
If we walk from left to right, a positive a (as in figure 1) means we are going uphill. If, instead, the value of a is negative (as in figure 3), we are walking downhill.
This simple interpretation will be key to defining how to nudge a (the weight, in neural network terms) and b (the bias, in neural network terms) to minimize the output error of f(x). Error of f(x)? Yes! Point coordinates (the data, in neural network terms), such as those in figure 1, are given to the neural network, while a and b are not. So, based on these points, and with an initial guess for a and b, the network learns their approximate values by minimizing the error, which is the difference between the real f(x) and the computed one.
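To give a taste of what the later parts will do, here is a minimal sketch of that idea, assuming an arbitrary initial guess for a and b (the names and values are illustrative; this only measures the error, it is not yet the learning step itself):

```python
# Points (the data) generated by the "real" line f(x) = 2x + 1
points = [(4, 9), (8, 17)]

# Initial guesses for the weight (a) and bias (b); the network does not know 2 and 1
a, b = 0.5, 0.0

# Error: difference between the real f(x) (the given y) and the computed one
for x, y in points:
    y_hat = a * x + b  # computed f(x) with the guessed a and b
    print(f"x={x}  real y={y}  computed y={y_hat}  error={y - y_hat}")
```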
Another way of expressing the equation of a line is by using what is called its “general form”, as follows:

ax - y + b = 0
So the general form of y = 2x + 1 is 2x - y + 1 = 0. Substituting x and y with the coordinates of point (4, 9) plotted in figure 1 we have 2*4 - 9 + 1 = 0, resulting in 8 - 9 + 1 = 0, which is true. The same applies to point (8, 17).
Now let’s substitute, for example, two points with the following coordinates into 2x - y + 1: (5, 5) and (5, 15). We get 2*5 - 5 + 1 = 6, which is greater than zero, for the first point, and 2*5 - 15 + 1 = -4, i.e., less than zero, for the second. If you go back to figure 1, clearly the first point is below the line while the second is above.
If we substitute the coordinates of the points (5, -9) and (0, 1) into the line of figure 3 we get -2*5 + 9 + 1 = 0 and -2*0 - 1 + 1 = 0, meaning these points are on the line. And what about the points (6, -5) and (6, -15)? Let’s see. For the first, -2*6 + 5 + 1 = -6, which is less than zero. For the second, -2*6 + 15 + 1 = 4, which is greater than zero. Go back to figure 3 and you will verify that the first is above the line while the second is clearly below.
With these examples we can conclude that, using the general form of a linear equation: any point that yields a negative number when plugged into it is above the line; any point that yields zero is on the line; and any point that yields a positive number is below the line. This concept is fundamental for classification.
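Here is a minimal Python sketch of that classification idea (the function name side_of_line is illustrative; the sign convention follows the general form ax - y + b used above):

```python
# General form of y = 2x + 1 written as 2x - y + 1
def side_of_line(x, y, a=2, b=1):
    value = a * x - y + b
    if value > 0:
        return "below the line"
    if value < 0:
        return "above the line"
    return "on the line"

print(side_of_line(4, 9))   # on the line    (value 0)
print(side_of_line(5, 5))   # below the line (value 6 > 0)
print(side_of_line(5, 15))  # above the line (value -4 < 0)
```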
Practical application — Non-uniform motion
In Newtonian mechanics, velocity in rectilinear motion with constant acceleration is given by the following formula:

v = v₀ + a*t
In the above formula, v is the speed we want to calculate (the dependent variable), t is the time at which we are calculating it (the independent variable), v₀ is the speed at t = 0, and a is the acceleration. So a is the rate of variation (or the angular coefficient, in its geometrical interpretation) of speed over time.
What about b?
Make x = 0 in any of the above equations. This yields f(x) = b. So the point (0, b) is where the line crosses the y-axis, which is why b is often called the y-intercept.
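Putting the two interpretations together for the velocity example, here is a short sketch (the numeric values are chosen arbitrarily for illustration):

```python
# v = v0 + a*t is a linear equation in t:
#   a  (acceleration)       plays the role of the slope / weight
#   v0 (speed at t = 0)     plays the role of b / the bias: where the line crosses the v-axis
def velocity(t, v0=3.0, a=2.0):
    return v0 + a * t

print(velocity(0))  # 3.0  -> the point (0, v0), the intercept
print(velocity(5))  # 13.0 -> speed after 5 time units
```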
Case closed!
Epilogue of part 1
In this part we understood what a linear equation is. Isn’t it simple? Yes! This simple and powerful concept is one of the fundamental building blocks of neural networks. We shall see in part 2 of this series how to combine one or more linear equations exactly like the ones we saw here, squishing the resulting value of f(x) into a range, let’s say between 0 and 1, to perform a feed forward pass of a neural network.