Intuition + Mathematics + Python Behind a Basic 3-Layer Neural Network

Marvin Martin · Published in The Startup
Mar 4, 2020 · 10 min read

Let’s get an overall idea of what Neural Networks are, and then let’s get to the mathematics. In this article, the architecture of the feed-forward Neural Network is fixed to a 3-layer network (Input Layer + Hidden Layer + Output Layer). After the mathematics, let’s code! (Beginner Tutorial)

Photo by Moritz Kindler on Unsplash

Intuition

Neural Networks are one of the most popular machine learning methods, especially thanks to Python libraries that have made them very easy to use. Before getting into the mathematics, it is important to give a general overview of what Neural Networks are.

The easiest way to see a Neural Network is to consider it as a function with several parameters. The purpose of this function is to predict a target: it takes some inputs and outputs a result. For example, it takes the pixels of an input image and returns the probability that this image is a cat. The main idea is to find the best parameters so that this function becomes really good at detecting cats.

The question now is: how do we find these parameters? The answer is not so complicated: we find them through a “learning process”, by examining a massive amount of data. Before the training process, these parameters are initialized randomly (which means that our function is very bad at its job), and at the end of the training, the parameters will be optimized to produce the desired answer.

The Training process

The training process is based on 3 steps: the forward propagation, the error computation, and the backward propagation.

  • The forward propagation is basically a prediction (Y) for a given input (X) and parameters (W and B).
  • The error computation is just comparing this prediction (Y) with the actual answer (Y*, also called the “label”).
  • The backward propagation is simply finding new parameters that minimize that error. There are many optimization algorithms; here we will use the most famous one: gradient descent.

We are going to iterate these steps over each sample of the dataset (of size N). In general, going through the dataset once is not enough, so we iterate again over several “epochs”, meaning that we are going to update the parameters N × Epochs times.

After the training process, the W and B parameters will be optimal to perform the task. Now it is time to make predictions on unseen data.

Training and predicting processes

Things are getting interesting, aren’t they? There are still a couple of ideas to develop before diving into the mathematics.

The Cost function & Gradient Descent

I will be brief on this subject because it is not the main topic. However, it is quite important to understand what a cost function is and how to minimize it with the gradient descent algorithm.

As said previously, to “learn”, our Neural Network will have to minimize the error. The error is measured by a cost function, which is generally represented as a convex function. There are several functions that can model the error; here are 2 examples:

Sum of Squared Errors (SSE) & Mean Squared Error (MSE): (squaring yields convex functions)
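Since the formulas from the figure are not reproduced in the text, here is one common way to write them, for a single prediction with k outputs and a dataset of N samples (the exact constants, such as the 1/2 factor that simplifies the derivative, may differ from the figure):

$$E_{\text{SSE}} = \frac{1}{2}\sum_{k}\left(y_k - y_k^{*}\right)^2
\qquad\qquad
E_{\text{MSE}} = \frac{1}{N}\sum_{n=1}^{N}\sum_{k}\left(y_k^{(n)} - y_k^{*(n)}\right)^2$$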

The Gradient Descent Algorithm:

An easy way to picture the gradient descent algorithm is to imagine a tennis ball rolling down a mountain. The ball follows the slope of the mountain. In mathematical terms, the slope corresponds to the derivative of the cost function at a given point. Gradients are just derivatives!

Don’t forget, we want to minimize the cost function to find the parameters W and B. So, we look at the derivative of the cost function with respect to those parameters. If you look at it carefully (especially with the visualization), you can easily see that the sign of that derivative indicates the direction you should move along the abscissa (the parameters W and B).

  • If the derivative is positive (/), you want to move to the left, so you need to step in the direction opposite to the derivative.
  • If the derivative is negative (\), you want to move to the right, which is again the direction opposite to the derivative (a positive step this time).

If we generalize the situation, we minimize the cost function by updating the parameters using the gradients:

Gradient descent
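The update rule shown in the figure can be summarized as follows (each parameter moves a small step in the direction opposite to its gradient):

$$W \leftarrow W - \alpha\,\frac{\partial E}{\partial W}
\qquad\qquad
B \leftarrow B - \alpha\,\frac{\partial E}{\partial B}$$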

Here, α is called the learning rate. It is a constant that controls how large a fraction of the gradient is applied at each step. It is generally a very small value (for example α = 0.01). If the learning rate is too high, the algorithm may diverge.
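To make the effect of α concrete, here is a tiny runnable sketch (not from the article) that runs gradient descent on the 1-D convex function f(w) = (w - 3)^2, whose derivative is 2(w - 3); the target value 3, the starting point and the learning rates are arbitrary choices for illustration:

```python
# Gradient descent on f(w) = (w - 3)**2, a simple convex cost with its minimum at w = 3.
def gradient_descent(alpha, steps=100, w=10.0):
    for _ in range(steps):
        grad = 2 * (w - 3)    # derivative (slope) of the cost at the current point
        w = w - alpha * grad  # step in the direction opposite to the slope
    return w

print(gradient_descent(alpha=0.01))   # small learning rate: slow convergence toward 3
print(gradient_descent(alpha=0.1))    # larger learning rate: much closer to 3
# print(gradient_descent(alpha=1.1))  # too large: the updates overshoot and diverge
```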

The Neural Network Architecture

This is one of the most important parts of the article: we are going to define the variables used in the math (most of these variables are matrices and vectors).

A Neural Network is a set of layers composed of “neurons” (which are just numbers) linked together by weighted connections. In this article, we are going to focus on the simplest architecture, which has only 3 layers:

  • The Input Layer (1, i) has as many neurons as there are inputs (i). For example, if the input is an image of size 28x28 (pixels), there will be 28x28 = 784 neurons in the input layer.
  • The Hidden Layer (1, j) has an arbitrary number of neurons (j), decided by the developer. In general, the number of neurons depends on the complexity of your task (for example, large-scale image classification tasks require a large number of neurons in their hidden layers).
  • The Output Layer (1, k) has as many neurons as there are outputs (k) required. For example, if your Neural Network is designed to output a single prediction, your output layer will contain only one neuron. For classification tasks, there will be as many neurons as there are classes (if there are 2 classes, cats and dogs, there will be 2 neurons in the output layer).

The parameters W and B link consecutive layers together, so if there are 3 layers, there will be 4 parameters: W(1), B(1), W(2), B(2).
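As a sketch of what this means in code (not taken from the article), here is a possible NumPy initialization of the four parameters; the layer sizes reuse the examples above (784 inputs, 2 output classes), while the hidden size of 16 and the use of np.random.randn are arbitrary choices:

```python
import numpy as np

i, j, k = 784, 16, 2  # input, hidden and output sizes (hidden size chosen arbitrarily)

# Random initialization of the 4 parameters; the shapes follow the layer sizes.
W1 = np.random.randn(i, j)  # weights linking Input Layer  -> Hidden Layer
B1 = np.random.randn(1, j)  # one bias per hidden neuron
W2 = np.random.randn(j, k)  # weights linking Hidden Layer -> Output Layer
B2 = np.random.randn(1, k)  # one bias per output neuron
```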

Let’s see, in mathematical terms, what happens inside a single neuron:

Activation function
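The figure is not reproduced here, but in words: each neuron computes a weighted sum of its inputs plus a bias, and then applies an activation function σ to that sum:

$$\hat{h} = \sum_{i} w_i\,x_i + b
\qquad\qquad
h = \sigma\!\left(\hat{h}\right)$$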

Why does a Neuron need to be Activated?

Complex problems are usually not well modeled by linear functions; non-linearity is a key element of Neural Networks. Thus, the main goal of an activation function is to add non-linearity to the model. It also converts an input signal (from -∞ to +∞) into an output signal (generally ranging from -1 to 1 or from 0 to 1, depending on the function used) which becomes the input of the next layer. Therefore, activation functions can be considered as transfer functions. The term “activated” means that a neuron is ready to proceed to the next step of the forward propagation. The 3 most used activation functions are:

https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044

Remark: each layer can have its own activation function; the choice of activation function depends on the problem you want to solve.
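As a sketch (my own code, which may differ from the article’s), here are NumPy definitions of the three usual activations, sigmoid, tanh and ReLU, together with their derivatives, which will be needed for the backward propagation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squashes any real number into (0, 1)

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)            # derivative of the sigmoid

def tanh(z):
    return np.tanh(z)             # squashes any real number into (-1, 1)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2    # derivative of tanh

def relu(z):
    return np.maximum(0, z)       # keeps positive values, zeroes out the rest

def relu_prime(z):
    return (z > 0).astype(float)  # derivative of ReLU (0 or 1)
```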

Mathematics

Things are really exciting! It is time to dive into the mathematics. Don’t get scared, if you take equations step by step, everything will be crystal clear.

Forward Propagation

Here is the mathematics behind the Forward propagation.

  • From Input Layer to Hidden Layer:
Forward Propagation ( Input Layer -> Hidden Layer )

H(^) denotes the pre-activated neurons of the Hidden Layer. The vector H(^) depends on the values of the neurons X (from the Input Layer) and on the parameters W(1) and B(1), which link the Input Layer and the Hidden Layer together.

Then, we need to activate H(^) to get H. H therefore contains the activated neurons of the Hidden Layer, and these neurons will be considered as the input of the next layer. Note that the size of H is equal to the number of neurons in the Hidden Layer (j).
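In equation form (writing Ĥ for H(^) and σ for the hidden layer’s activation function), this step reads, with X of shape (1, i), W(1) of shape (i, j), and B(1), Ĥ, H of shape (1, j):

$$\hat{H} = X\,W^{(1)} + B^{(1)}
\qquad\qquad
H = \sigma\!\left(\hat{H}\right)$$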

  • From Hidden Layer to Output Layer:
Forward Propagation ( Hidden Layer -> Output Layer )

Y(^) denotes the pre-activated neurons of the Output Layer. The vector Y(^) depends on the values of the neurons H (from the Hidden Layer) and on the parameters W(2) and B(2), which link the Hidden Layer and the Output Layer together.

Then, we need to activate Y(^) to get our final layer Y. These neurons are the final result of the Neural Network. Note that the size of Y is equal to the number of neurons in the Output Layer (k). Our forward propagation is now done for a single data point. We can compare this result to the label and compute the error.
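In equation form (writing Ŷ for Y(^)), with H of shape (1, j), W(2) of shape (j, k), and B(2), Ŷ, Y of shape (1, k):

$$\hat{Y} = H\,W^{(2)} + B^{(2)}
\qquad\qquad
Y = \sigma\!\left(\hat{Y}\right)$$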

Error Computation

How good is our prediction based on the current model?

In practice, we add the error to a variable at each iteration, and once we have been through the whole dataset (of size N), we divide this variable by N. That way, at each epoch, you can check whether your overall error is close to 0. You will see this in the coding part.
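In formula form, the value reported at each epoch is the per-sample cost averaged over the N samples (here written with the SSE cost from above):

$$E = \frac{1}{N}\sum_{n=1}^{N}\;\frac{1}{2}\sum_{k}\left(y_k^{(n)} - y_k^{*(n)}\right)^2$$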

Backward Propagation

If you are still alive at this stage (I really hope you are), take a deep breath because the hardest part is yet to come.

The purpose of the Backward propagation is to find some new parameters W and B which minimize the error using Gradient Descent.

Gradient Descent formulas

However, the notation used for gradient descent above was not detailed enough. Here is a more precise version.
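Since the figure is not reproduced, here is what those detailed update rules look like, one per parameter, each gradient having the same shape as the parameter it updates:

$$W^{(2)} \leftarrow W^{(2)} - \alpha\,\frac{\partial E}{\partial W^{(2)}}
\qquad
B^{(2)} \leftarrow B^{(2)} - \alpha\,\frac{\partial E}{\partial B^{(2)}}
\qquad
W^{(1)} \leftarrow W^{(1)} - \alpha\,\frac{\partial E}{\partial W^{(1)}}
\qquad
B^{(1)} \leftarrow B^{(1)} - \alpha\,\frac{\partial E}{\partial B^{(1)}}$$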

You have 4 parameters to update. Each of these parameters has a fixed shape determined by its position in the network. If you look carefully at these formulas, the only unknowns are the gradients (the derivatives of E with respect to the parameters). Let’s compute them!

Okay, here we need to update the parameters in the right order, from the Output Layer back to the Input Layer. Therefore, we will update them in this order: W(2), B(2), W(1), B(1). This order matters because, as you will see in the mathematics, the gradients for W(1) and B(1) depend on W(2). Let’s start!

The gradient of E for W(2)

The derivative dE/dW(2) is written as a sum on the first line but not on the last. Why? Because when we take the derivative with respect to a single weight w(2)jk, every term of the sum vanishes except the one with the matching indexes (j, k).
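The derivation itself is in the figure; with the SSE cost and an activation σ (σ′ being its derivative), the standard final result is:

$$\frac{\partial E}{\partial w^{(2)}_{jk}} = \left(y_k - y_k^{*}\right)\,\sigma'\!\left(\hat{y}_k\right)\,h_j$$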

The gradient of E for B(2)

The derivative dE/dB(2) is written as a sum on the first line but not on the last. Why? Because when we take the derivative with respect to a single bias b(2)k, every term of the sum vanishes except the one involving that bias.
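Again, the derivation is in the figure; the standard final result is the same expression as before, without the h_j factor:

$$\frac{\partial E}{\partial b^{(2)}_{k}} = \left(y_k - y_k^{*}\right)\,\sigma'\!\left(\hat{y}_k\right)$$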

We’re halfway there! We have already updated all the parameters between the Output Layer and the Hidden Layer. Now, let’s move “backward” even further, between the Hidden Layer and the Input Layer.

There is an additional difficulty here because W(1) is only indirectly linked to yk (you have to go through hj), so you have a double chain-rule decomposition to perform. Don’t run away, we are almost there!

The gradient of E for W(1)
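For reference, the standard result of that double chain-rule decomposition (the figure’s derivation) is:

$$\frac{\partial E}{\partial w^{(1)}_{ij}} = \left[\sum_{k}\left(y_k - y_k^{*}\right)\,\sigma'\!\left(\hat{y}_k\right)\,w^{(2)}_{jk}\right]\sigma'\!\left(\hat{h}_j\right)\,x_i$$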

Come on, one more to go! Let’s update B(1) and you’ll be ready to implement that in your code.

The gradient of E for B(1)
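As before, the standard result is the same expression as for W(1), without the x_i factor:

$$\frac{\partial E}{\partial b^{(1)}_{j}} = \left[\sum_{k}\left(y_k - y_k^{*}\right)\,\sigma'\!\left(\hat{y}_k\right)\,w^{(2)}_{jk}\right]\sigma'\!\left(\hat{h}_j\right)$$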

We are officially done with the math! Here is a quick summary of the general formulas we need for the backward propagation of a 3-layer neural network:

Backward Formulas
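For implementation purposes, the same formulas can be grouped in matrix form (⊙ denotes the element-wise product); this vectorized version, which should match the figure up to notation, is the one used in the code sketch below:

$$\delta^{(2)} = \left(Y - Y^{*}\right)\odot\sigma'\!\left(\hat{Y}\right)
\qquad
\delta^{(1)} = \left(\delta^{(2)}\,{W^{(2)}}^{\top}\right)\odot\sigma'\!\left(\hat{H}\right)$$

$$\frac{\partial E}{\partial W^{(2)}} = H^{\top}\delta^{(2)}
\qquad
\frac{\partial E}{\partial B^{(2)}} = \delta^{(2)}
\qquad
\frac{\partial E}{\partial W^{(1)}} = X^{\top}\delta^{(1)}
\qquad
\frac{\partial E}{\partial B^{(1)}} = \delta^{(1)}$$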

Python3 implementation

I hope you survived the math! It’s coffee time, take a good break and come back ready for the coding part!

FFNN.py
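The embedded FFNN.py gist is not reproduced here, so below is a minimal reconstruction that follows the formulas above, trained on the XOR problem shown in the results. The hyperparameters (hidden size 4, α = 0.5, 10,000 epochs, random seed), the variable names and the printing cadence are my own choices; the author’s original file is in the GitHub repository linked below.

```python
import numpy as np

# Sigmoid activation and its derivative (used for both layers here).
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# XOR dataset: 4 samples, 2 inputs, 1 output.
X_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_data = np.array([[0], [1], [1], [0]])

i, j, k = 2, 4, 1            # input, hidden and output sizes
alpha, epochs = 0.5, 10000   # learning rate and number of epochs

np.random.seed(0)
W1, B1 = np.random.randn(i, j), np.random.randn(1, j)
W2, B2 = np.random.randn(j, k), np.random.randn(1, k)

for epoch in range(1, epochs + 1):
    total_error = 0
    for X, Y_star in zip(X_data, Y_data):
        X = X.reshape(1, i)
        Y_star = Y_star.reshape(1, k)

        # 1. Forward propagation
        H_hat = X @ W1 + B1
        H = sigmoid(H_hat)
        Y_hat = H @ W2 + B2
        Y = sigmoid(Y_hat)

        # 2. Error computation (SSE for this sample)
        total_error += 0.5 * np.sum((Y - Y_star) ** 2)

        # 3. Backward propagation (vectorized formulas from the summary above)
        delta2 = (Y - Y_star) * sigmoid_prime(Y_hat)     # shape (1, k)
        delta1 = (delta2 @ W2.T) * sigmoid_prime(H_hat)  # shape (1, j)

        W2 -= alpha * (H.T @ delta2)
        B2 -= alpha * delta2
        W1 -= alpha * (X.T @ delta1)
        B1 -= alpha * delta1

    if epoch % 1000 == 0:
        print(f"[INFO]: epoch = {epoch} | error = {total_error / len(X_data)}")

# Predictions on the training inputs after training.
H = sigmoid(X_data @ W1 + B1)
Y = sigmoid(H @ W2 + B2)
print("Input", X_data, "Prediction", np.round(Y), "Label", Y_data, sep="\n")
```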

Results

[INFO]: epoch = 1 | error = 0.17833722318123282
[INFO]: epoch = 2 | error = 0.17569493660525415
...
[INFO]: epoch = 9998 | error = 0.0017202127419873423
[INFO]: epoch = 9999 | error = 0.0017198195936561878
[INFO]: epoch = 10000 | error = 0.0017194266110162218

Input
[[0 0]
 [0 1]
 [1 0]
 [1 1]]

Prediction
[[0.]
 [1.]
 [1.]
 [0.]]

Label
[[0]
 [1]
 [1]
 [0]]

GitHub Repository

What’s Next?

“Create your own ML library”

Do not hard-code the architecture of your Neural Network anymore: check out my friend Omar’s Medium article. You will be able to use as many hidden layers as you want and train stronger Neural Networks.

Thank you for reading, don’t hesitate to send me your questions! Leave a clap (or several) if you enjoyed it ;)
