Neural Network from scratch in Python with mathematical proofs and code explanation
If the length of this article or the equations toward the bottom scare you, you might want to have a look at a simpler neural network first, one with only a single neuron.
Let’s run the code first and then analyze every part
Let's clone this GitHub project, or simply run the following code in your favorite IDE.
If you need help setting up the environment to run the code, see this article on getting started with PyCharm. If everything goes okay, you will see a graph like this.
The Problem — Fahrenheit from Celsius
We will train our machine to predict Fahrenheit from Celsius. As you can understand from the code (or the graph), the blue line is the actual Celsius-Fahrenheit relation. The red line is the relation predicted by our baby machine without any training. Finally, we train the machine, and the green line is the prediction after training.
The magic is in Line#69, where we are training the machine. You can see that before and after training, it predicts using the same function (get_predicted_fahrenheit_values()). So what magic is train() doing? Let's find out.
Network Structure
Input: A float representing a Celsius temperature
weights1: 1 X 2 matrix representing 2 weights from input to layer1
biases1: 1 X 2 matrix representing 2 biases from input to layer1
weights2: 2 X 1 matrix representing 2 weights from layer1 to output
biases2: 1 X 1 matrix representing 1 bias from layer1 to output
Output: A float representing the predicted Fahrenheit value
So, we have a total of 7 parameters (2 weights and 2 biases in the first layer, and 2 weights and 1 bias in the second layer).
Forensic Analysis of the Code
Now let's explain the code part by part:
In Line#12, we are generating an array of 100 numbers between -50 and +50.
In Line#13, we are converting the one-dimensional array into a 100 X 1 matrix, which will be helpful for feeding the data to the machine, as you will see shortly.
To get an understanding of how reshape() works: if x is the array [10, 20, 30], x.reshape(3, 1) will produce a 2D matrix with a single column, each row containing one element. In this case, the 1st row will be [10], the 2nd row [20], and the 3rd row [30].
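A quick illustration in NumPy:

```python
import numpy as np

x = np.array([10, 20, 30])
print(x.reshape(3, 1))
# Output:
# [[10]
#  [20]
#  [30]]
```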
In Line#14, we are generating the Fahrenheit value for each Celsius value.
From Line#16–20, we are initializing our network parameters with random values.
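The exact code is in the repository, but a minimal sketch of this setup could look like the following (SAMPLE_SIZE is a name I am assuming for the sample count of 100, and np.linspace is just one way to get 100 numbers between -50 and +50; the repository may generate them differently):

```python
import numpy as np

SAMPLE_SIZE = 100

# Line#12-14 (sketch): 100 Celsius values between -50 and +50, reshaped to
# a 100 X 1 matrix, and the corresponding actual Fahrenheit values
celsius_values = np.linspace(-50, 50, SAMPLE_SIZE).reshape(SAMPLE_SIZE, 1)
fahrenheit_values = celsius_values * 9 / 5 + 32

# Line#16-20 (sketch): initialize the 7 parameters with random values
weights1 = np.random.rand(1, 2)  # 2 weights from input to layer1
biases1 = np.random.rand(1, 2)   # 2 biases from input to layer1
weights2 = np.random.rand(2, 1)  # 2 weights from layer1 to output
biases2 = np.random.rand(1, 1)   # 1 bias from layer1 to output
```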
train()
We are running 2000 iterations of training here. Each iteration is made of:
- forward (Line#58) pass
- backward (Line#59) pass
- update_parameters (Line#60)
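The exact signatures below are assumptions based on this description (the repository's code may differ slightly), but train() is likely shaped like this:

```python
NUM_ITERATIONS = 2000  # assumed name for the iteration count

def train():
    for _ in range(NUM_ITERATIONS):
        # forward pass: compute both layer outputs and the loss
        output1, output2, loss = forward(celsius_values, fahrenheit_values)
        # print(loss)  # uncomment to watch the loss decrease over iterations
        # backward pass: compute the gradient of every parameter
        grad_weights_1, grad_biases_1, grad_weights_2, grad_biases_2 = backward(
            output1, output2, celsius_values, fahrenheit_values)
        # nudge every parameter against its gradient
        update_parameters(grad_weights_1, grad_biases_1,
                          grad_weights_2, grad_biases_2)
```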
If you are new to Python, it might look a bit strange to you: Python functions can return multiple values as a tuple.
Notice that update_parameters is the only thing we are interested in. Everything else we are doing here is evaluating the arguments of that function, which are the gradients (we will explain below what gradients are) of our weights and biases. Each gradient variable here is a matrix with the same dimensions as the corresponding parameter variable. For example, since weights1 is a 1 X 2 matrix, so is grad_weights_1.
- grad_weights_1: 1 X 2 matrix. Gradients of 2 weights from input to layer1
- grad_biases_1: 1 X 2 matrix. Gradients of 2 biases from input to layer1
- grad_weights_2: 2 X 1 matrix. Gradients of 2 weights from layer1 to output
- grad_biases_2: 1 X 1 matrix. Gradients of 1 bias from layer1 to output
We get these values by calling backward, but it requires:
- output1: Output of layer1
- output2: Output of layer2. The final predicted Fahrenheit value.
And, to get these values, we need to run forward with the celsius values and corresponding actual Fahrenheit values.
forward()
Notice that here celsius_values and fahrenheit_values are matrices of 100 rows each (the full function is sketched at the end of this section):
After executing Line#24, for an element in celsius_values, say 42.92929, output1 will contain 2 numbers:
output1[0] = 42.92929 * weights1[0][0] + biases1[0][0]
output1[1] = 42.92929 * weights1[0][1] + biases1[0][1]
So, for the 100 elements in celsius_values, output1 will basically be a 100 X 2 matrix, and we can evaluate it with a dot product and matrix addition as shown in the code.
Line#25 is doing the same operation, but here the input is the output of the previous layer (output1).
Line#26 is calculating the loss using the Mean Squared Error (MSE) loss function, which is just a fancy name for the sum of the squares of all the differences divided by the number of samples (100 in this case).
A smaller loss means a better prediction. If you keep printing the loss in every iteration, you will see that it decreases as training progresses.
Finally, in Line#27 we are returning these values.
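Putting Lines#24-27 together, forward() might look like this sketch (continuing the setup code above):

```python
def forward(celsius_values, fahrenheit_values):
    # Line#24 (sketch): (100 X 1) dot (1 X 2) + (1 X 2) -> 100 X 2
    output1 = np.dot(celsius_values, weights1) + biases1
    # Line#25 (sketch): (100 X 2) dot (2 X 1) + (1 X 1) -> 100 X 1
    output2 = np.dot(output1, weights2) + biases2
    # Line#26 (sketch): Mean Squared Error over the 100 samples
    loss = np.mean(np.square(output2 - fahrenheit_values))
    # Line#27 (sketch): return multiple values as a tuple
    return output1, output2, loss
```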
backward
Now things are going to get real messy. Grab a coffee if you need it, and give me your full attention.
Remember, we are only interested in updating our weights and biases. To update those values, we have to know their gradients, and that is what we are calculating here. That is the only reason for this backward function, and possibly the only reason for writing the whole article.
Notice that the gradients are calculated in reverse order (the last layer is calculated first), hence the name “backpropagation”. The reason is that to calculate the gradients in a layer, we need some information from the next layer, so that we can use it in the chain rule formula.
Now let's have a look at what the gradient and the chain rule are.
Gradient
For the sake of simplicity, consider that we have only one value in celsius_values and fahrenheit_values: 42.92929 and 109.27273, respectively.
Now, if you take a piece of paper and break down the calculation in Line#26, you will find:
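loss = ((42.92929 * weights1[0][0] + biases1[0][0]) * weights2[0][0]
      + (42.92929 * weights1[0][1] + biases1[0][1]) * weights2[1][0]
      + biases2[0][0] - 109.27273)²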
As you see, loss depends on 7 parameters. Consider only one parameter for now: the first weight of the first layer. In the code, it is weights1[0][0]. Imagine we initialized it with a random value, say 0.8, and after evaluating the equation above, we got 123.45 as the value of loss. Based on this loss value, you have to decide how to update weights1[0][0]. Should you make it 0.9, or 0.7?
You have to update weights1[0][0] in such a way that in the next iteration you get a lower value for loss (remember, minimizing loss is the ultimate goal). So, if increasing weights1[0][0] increases loss, we will decrease it. And if increasing weights1[0][0] decreases loss, we will increase it.
Now, the question is: how do we know whether increasing weights1[0][0] will increase or decrease loss? This is where the gradient comes in. Broadly speaking, the gradient is defined by the derivative. Remember from your high school calculus that ∂y/∂x (the partial derivative/gradient of y with respect to x) indicates how y will change with a small change in x.
If ∂y/∂x is positive, it means a small increment in x will increase y.
If ∂y/∂x is negative, it means a small increment in x will decrease y.
If ∂y/∂x is big, a small change in x will cause a big change in y.
If ∂y/∂x is small, a small change in x will cause a small change in y.
So, from the gradient, we get two pieces of information: in which direction the parameter has to be updated (increase or decrease), and by how much (a big change or a small one).
Chain Rule
Informally speaking, the chain rule says:
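∂loss/∂x = ∂loss/∂y * ∂y/∂x (where loss depends on y, and y depends on x)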
Or let’s go a bit further,
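∂loss/∂x = ∂loss/∂z * ∂z/∂y * ∂y/∂x (where loss depends on z, z depends on y, and y depends on x)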
See why it is called the chain rule?
Consider the example of weights1[0][0] above. We need to calculate grad_weights_1[0][0] to update this weight, and it will be calculated by:
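grad_weights_1[0][0] = ∂loss/∂weights1[0][0]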
With the chain rule formula, we can derive it:
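∂loss/∂weights1[0][0] = ∂loss/∂output2 * ∂output2/∂output1[0] * ∂output1[0]/∂weights1[0][0]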
Let's break down all gradients for all 7 parameters, and draw a dependency diagram.
You don’t need to follow all the boxes here.
We are interested only in the blue boxes, which are the gradients for the parameters. We need to calculate the other boxes only to evaluate those 7 boxes. Notice a few interesting aspects:
- To calculate the gradient for a parameter in a layer (for example, ∂loss/∂weights1[0][0]), we need only 2 pieces of information:
(a) the derivative of the output of the layer with respect to the parameter (∂output1/∂weights1[0][0]), and
(b) the derivative of the loss with respect to the output of the layer (∂loss/∂output1).
- To calculate the output gradient (∂loss/∂output1) for a layer, we need the output gradients of all subsequent layers (∂loss/∂output2 in this case). Meaning, all gradient calculations here depend on the gradient of the last layer's output. That is why we are calculating ∂loss/∂output2 first in the backward pass (Line#31).
- The gradient for a parameter does not depend on anything from the previous layer. For example, ∂loss/∂weights2[0][0] does not depend on ∂loss/∂output1.
- The gradient for a parameter does not depend on the gradient of any other parameter, in this or any other layer. For example, ∂loss/∂weights1[0][0] does not depend on ∂loss/∂weights2[0][0] (it does use the value of weights2[0][0], but not its gradient).
- So, from these observations, we understand that during the backward pass, at every layer, we calculate the output gradient (for this example, ∂loss/∂output2) for 2 possible reasons:
(a) To calculate the gradients of the parameters in this layer, meaning for this example:
∂loss/∂weights2[0][0],
∂loss/∂weights2[1][0], and
∂loss/∂biases2[0][0]
(b) To pass this gradient (∂loss/∂output2) to the previous layer, so it can be used in the chain rule to calculate the gradients of the parameters in that layer (weights1[0][0], weights1[0][1], biases1[0][0], and biases1[0][1] in this case).
In high-level ML frameworks, for example PyTorch, you don't have to write code for the backward pass! During the forward pass, the framework creates a computational graph, and during the backward pass, it traverses the graph in the opposite direction and calculates the gradients using the chain rule.
∂loss/∂output2
We call this variable grad_output2 in the code, and as you see in Line#31, it is calculated this way:
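Sketching that line with the names used in the rest of the article (SAMPLE_SIZE being the number of samples, 100):

grad_output2 = 2 * (output2 - fahrenheit_values) / SAMPLE_SIZE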
Let’s find out the reason behind the formula.
Remember, we are feeding all 100 celsius_values into the machine together. So, grad_output2 will be a 100 X 1 matrix, each element containing the gradient of the loss with respect to output2 for the corresponding element in celsius_values. For simplicity, let us consider that there are only 2 items in celsius_values.
So, breaking down Line#26, we get:
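loss = ((output2_1 - fahrenheit_values_1)² + (output2_2 - fahrenheit_values_2)²) / 2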
where,
output2_1 = output2 value for the 1st celsius value
output2_2 = output2 value for the 2nd celsius value
fahrenheit_values_1 = actual Fahrenheit value for the 1st celsius value
fahrenheit_values_2 = actual Fahrenheit value for the 2nd celsius value
Now, the resulting variable grad_output2 will contain 2 values: the gradients of the loss with respect to output2_1 and output2_2, meaning:
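grad_output2 = [∂loss/∂output2_1,
                ∂loss/∂output2_2]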
Let us calculate the gradient for output2_1 only, and then we can apply the same rule to the other.
Calculus time!
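∂loss/∂output2_1 = ∂/∂output2_1 [((output2_1 - fahrenheit_values_1)² + (output2_2 - fahrenheit_values_2)²) / 2]
                 = 2 * (output2_1 - fahrenheit_values_1) / 2

(The second term vanishes because it does not contain output2_1.) With 100 samples instead of 2, the denominator is simply the sample count, giving 2 * (output2_1 - fahrenheit_values_1) / SAMPLE_SIZE.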
Which is the same as Line#31.
Gradients of second layer weights
Let's calculate the gradient of the first weight in the 2nd layer:
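∂loss/∂weights2[0][0] = ∂loss/∂output2 * ∂output2/∂weights2[0][0] = grad_output2 * output1[0]

since output2 = output1[0] * weights2[0][0] + output1[1] * weights2[1][0] + biases2[0][0], so ∂output2/∂weights2[0][0] = output1[0].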
Which is the same as Line#32. Notice that, after the dot product, for each weight, the gradient values for each of the 100 input values are summed up. This is okay, since we multiply all the gradients by a small factor before updating the parameters (see the last section, Updating Parameters).
Gradients of second layer bias
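∂loss/∂biases2[0][0] = ∂loss/∂output2 * ∂output2/∂biases2[0][0] = grad_output2 * 1 = grad_output2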
Which is the same as Line#33. Like the weight gradients, these values for each of the 100 inputs are summed up. Again, this is fine, since the gradients are multiplied by a small factor before updating the parameters (see the last section, Updating Parameters).
Gradients of first layer weights
Let's calculate the gradient of the first weight in the 1st layer:
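∂loss/∂weights1[0][0] = ∂loss/∂output1[0] * ∂output1[0]/∂weights1[0][0] = grad_output1_1 * celsius

since output1[0] = celsius * weights1[0][0] + biases1[0][0], so ∂output1[0]/∂weights1[0][0] = celsius.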
Doesn't it look the same as Line#36? In the code, grad_output1 is a matrix representing the gradients of the loss with respect to the 2 outputs of the hidden layer, and grad_output1_1 in the equation above is its first element, meaning the gradient for the first neuron in the hidden layer. Let's evaluate it:
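grad_output1_1 = ∂loss/∂output1[0] = ∂loss/∂output2 * ∂output2/∂output1[0] = grad_output2 * weights2[0][0]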
Which is the same as Line#35. On that line, the output gradients for both neurons in the hidden layer are calculated in one operation with the help of a matrix dot product.
Gradients of first layer bias
Let's calculate the gradient of the first bias in the first layer:
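∂loss/∂biases1[0][0] = ∂loss/∂output1[0] * ∂output1[0]/∂biases1[0][0] = grad_output1_1 * 1 = grad_output1_1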
grad_output1_1 in the equation is the gradient of the output of the first neuron in the hidden (first) layer. So, for every bias, the gradient is equal to the gradient of the output of the corresponding neuron, which is exactly what Line#37 in the code does. Look above: the same thing applies to the bias in the output layer as well!
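Putting all of the derivations together, here is a sketch of backward() consistent with the equations above (again, the exact code is in the repository; the dot products and sums implement the per-sample summations mentioned earlier):

```python
def backward(output1, output2, celsius_values, fahrenheit_values):
    # Line#31 (sketch): ∂loss/∂output2 for every sample -> 100 X 1
    grad_output2 = 2 * (output2 - fahrenheit_values) / SAMPLE_SIZE
    # Line#32 (sketch): grad_output2 * output1, summed over samples -> 2 X 1
    grad_weights_2 = np.dot(output1.T, grad_output2)
    # Line#33 (sketch): grad_output2 summed over samples -> 1 X 1
    grad_biases_2 = np.sum(grad_output2, axis=0, keepdims=True)
    # Line#35 (sketch): ∂loss/∂output1 = grad_output2 * weights2 -> 100 X 2
    grad_output1 = np.dot(grad_output2, weights2.T)
    # Line#36 (sketch): grad_output1 * celsius, summed over samples -> 1 X 2
    grad_weights_1 = np.dot(celsius_values.T, grad_output1)
    # Line#37 (sketch): grad_output1 summed over samples -> 1 X 2
    grad_biases_1 = np.sum(grad_output1, axis=0, keepdims=True)
    return grad_weights_1, grad_biases_1, grad_weights_2, grad_biases_2
```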
Updating Parameters
Finally, we are updating the parameters. Notice that the gradients are multiplied by a small factor (LEARNING_RATE) before being subtracted, to make the training stable. A big value of LEARNING_RATE will cause overshooting, and an extremely small value will make the training slower, requiring many more iterations. We have to find an optimal value for it with some trial and error. There are many online resources on the learning rate, including this one.
Notice that the exact amount we adjust by is not extremely critical. For example, if you tune LEARNING_RATE a bit, the descent_ variables (Line#45–48) will change, but the machine might still work. The important thing is making sure these amounts are derived by scaling down the gradients by the same factor (LEARNING_RATE in this case). In other words, "keeping the descent proportional to the gradients" matters more than "how much they descend".
Also notice that these gradient values are actually sums of the gradients evaluated for each of the 100 inputs. But since they are all scaled by the same value, it is fine, as mentioned above.
In order to update the parameters from inside the function, we have to declare them with the global keyword (in Line#43).
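A sketch of update_parameters() consistent with this section (the LEARNING_RATE value here is an assumption; the actual one is in the repository):

```python
LEARNING_RATE = 0.0001  # assumed value; tune with trial and error

def update_parameters(grad_weights_1, grad_biases_1,
                      grad_weights_2, grad_biases_2):
    # Line#43 (sketch): the parameters are module-level, so declare them global
    global weights1, biases1, weights2, biases2
    # Line#45-48 (sketch): the descent_ amounts, i.e. the gradients scaled
    # down by the same small factor
    descent_weights_1 = LEARNING_RATE * grad_weights_1
    descent_biases_1 = LEARNING_RATE * grad_biases_1
    descent_weights_2 = LEARNING_RATE * grad_weights_2
    descent_biases_2 = LEARNING_RATE * grad_biases_2
    # subtract each descent amount from the corresponding parameter
    weights1 = weights1 - descent_weights_1
    biases1 = biases1 - descent_biases_1
    weights2 = weights2 - descent_weights_2
    biases2 = biases2 - descent_biases_2
```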
Where to go from here
If you didn't have a clear understanding of backpropagation before reading this article, and after reading it you think you understood everything, you are definitely wrong. It is impossible to truly grasp backpropagation until you run the code and trace the variables yourself, which is what you should be doing now.
Any questions? Please leave a comment; I will surely address it.