How to create a neural network from scratch in Python — Math & Code

May 16 · 11 min read

In this article, I explain, step by step and with the underlying math, how a simple 2-layered neural network works, by coding one from scratch in Python. I wrote it as much to help you understand what happens behind the scenes of such a popular algorithm as to give myself a cheat sheet that explains, in my own words, how a neural network works.

First, we study the rationale behind these algorithms and the mathematical intuition that supports them. Then, we dive into the coding of a neural network (mixing Python code and mathematical equations). Finally, we consider how we could generalize our model and make it more adaptable to complex real-life problems.

Why do we need neural networks?

Machine learning can be defined as the set of techniques used to train a computer to perform a task without being explicitly programmed to do it. While this definition might sound esoteric, it is easy to understand.

Let’s say we have a simple equation as follows:

y = f(x)

In this equation, x is a given variable and y is a variable that depends on x. f is an unknown function. For instance, in a self-driving car, x could be the environment surrounding the car (the other cars, the road signs, the color of the traffic lights…) and y how the car behaves with respect to those elements. In machine learning, since we don’t know f, we use statistical methods to come as close as possible to it.

f can take many shapes. The simplest case is when f is linear. In this case, f looks like:

f(x) = a * x + b

Where a and b are real numbers that we estimate with statistical methods. In particular, we can use the well-known linear regression method to determine the two coefficients a and b.
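To make the linear case concrete, here is a minimal sketch that estimates a and b by least squares with NumPy. The data is hypothetical: it is generated from y = 2x + 1 with a little noise, purely for illustration.

```python
import numpy as np

# Hypothetical data: points drawn from y = 2x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Least-squares fit of a degree-1 polynomial: returns [a, b] such that y ≈ a*x + b.
a, b = np.polyfit(x, y, deg=1)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")
```

The estimated coefficients should land close to the values used to generate the data, which is all linear regression promises when f really is linear.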

But what if f is not linear?

How can we know which shape it has? Is it exponential? Quadratic? For most real-world functions, we simply can’t know in advance.

Here, the trick comes from a theorem proved by Kurt Hornik, called the Universal Approximation Theorem:

Figure: The Universal Approximation Theorem, as proved by Kurt Hornik

For those of us who are not mathematicians, it simply says that by summing enough linear functions, each transformed by the same non-linear function, we can approximate any continuous function f (on a bounded domain) as closely as we want.

Simply put, by selecting the right weights and biases, we can approximate any non-linear function. And that’s a big deal since in our world many relationships between an input and an output are non-linear.

Building a neural network from scratch

Now that we have a sense of what a neural network is used for, let’s try to code one. For greater clarity, we break down how a neural network works into several steps:

- The structure

- The feedforward propagation

- The loss function

- The gradient back-propagation and the underlying mathematical equations

- Making our neural network run

The structure

In this article, we will only create a two-layered neural network, but the idea remains the same for neural networks with more layers.

As you can see in the picture, our neural network is composed of three layers: an input layer, a hidden layer and an output layer.

NB: We usually omit the input layer, hence the name “two-layered neural network”.

The connections made between the input and the hidden layers as well as between the hidden and output layers are called weights.

To calculate the values in the cells (we call them activations), we use a mathematical equation we are familiar with:

a = φ(x * W + b)

Where a is the activation, x the input, W the weights, b the biases and φ a non-linear function called the activation function (also called a non-linearity).

The activation function creates non-linearity in our model. You should recall the Universal Approximation Theorem here: we find the same structure. So, to create the structure of our neural net, we need:

- An input x (the red cells in the picture above)

- A target output y

- An output layer (the green cells in the picture above)

- Weight vectors

- Bias vectors

In the following snippet of code, we define the __init__ function of our NeuralNetwork class.
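A minimal sketch of what such a constructor could look like is given here. The attribute names, the default of 4 hidden units, the learning rate value and the seed are assumptions made for illustration, not necessarily the author’s exact choices.

```python
import numpy as np


class NeuralNetwork:
    def __init__(self, x, y, hidden_units=4, learning_rate=0.1, seed=0):
        # hidden_units, learning_rate and seed are illustrative defaults (assumptions).
        rng = np.random.default_rng(seed)
        self.input = x                      # shape (n_samples, n_features)
        self.y = y                          # shape (n_samples, n_outputs)
        self.learning_rate = learning_rate
        n_features = x.shape[1]
        n_outputs = y.shape[1]
        # Small random weights, so that hidden units start out different from each other.
        self.weight1 = rng.normal(scale=0.1, size=(n_features, hidden_units))
        self.weight2 = rng.normal(scale=0.1, size=(hidden_units, n_outputs))
        # Biases can safely start at zero.
        self.bias1 = np.zeros((1, hidden_units))
        self.bias2 = np.zeros((1, n_outputs))
        # Placeholder for the predictions of the output layer.
        self.output = np.zeros(y.shape)
```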

You can also observe a learning rate. I assume familiarity with the basic concepts of machine learning. If you don’t know what a learning rate is, I recommend Andrew Ng’s well-known Machine Learning course from Stanford University, which you can find on Coursera.

As you can see, I initialize the weights randomly with small numbers, since we will soon implement a gradient descent algorithm to optimize our weights and biases. We can’t set all our weights to zero: every hidden unit would then compute the same value and receive the same gradient update, so the units would never learn different features. If you want to know more about it, read https://machinelearningmastery.com/why-initialize-a-neural-network-with-random-weights/ .

Furthermore, we create a weight1 matrix of shape (number of units in the input layer, number of units in the hidden layer). Similarly, the weight2 matrix has shape (number of units in the hidden layer, number of units in the output layer). You can see this on the image above, where the weight matrices make the connections between the layers.

The feedforward propagation

To train a neural network, there are three basic steps.

First, you calculate all your activations. Second, you calculate your error. Third, you optimize the weights to minimize the error of your model. You repeat these three steps a few hundred times (each pass is called an epoch) in order to build a model that is accurate enough.

The feedforward propagation aims at calculating the activations. Remember, we first compute the vectorized equation z = x*W+b and then we apply the activation function φ.

The activation function that we chose is the sigmoid function. It has remarkable properties, one of which is particularly helpful: its derivative can be expressed in terms of the function itself:

sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

As you will see, this property greatly simplifies our calculations. Another interesting property is that the sigmoid outputs real numbers between 0 and 1 (it is the activations, not the weights, that are bounded). Thus, the values flowing through our neural network stay fairly small, and the final output can be read as a probability.
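As a quick sanity check of this derivative property, the sketch below compares the closed-form expression with a numerical (central finite difference) derivative. The function names are mine, chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative(z):
    # Closed form: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
    s = sigmoid(z)
    return s * (1.0 - s)


z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
# Central finite difference approximation of the derivative.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_derivative(z)))
print(f"largest difference between the two derivatives: {max_gap:.2e}")
```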

NB: Nowadays, most networks use the ReLU function (Rectified Linear Unit) or one of its variants as the activation function of their hidden layers; the last layer usually uses a different function.

Our activations are computed as follows: a = sigmoid(z).

Hence the code:
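A minimal sketch of the feedforward step, written as a standalone function so that it runs on its own. The function and argument names are assumptions; in the class, the same lines would operate on the instance attributes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def feedforward(x, w1, b1, w2, b2):
    # Hidden layer: weighted sum, then activation.
    z1 = x @ w1 + b1
    a1 = sigmoid(z1)
    # Output layer: same pattern, applied to the hidden activations.
    z2 = a1 @ w2 + b2
    output = sigmoid(z2)
    return a1, output


# Tiny usage example with hypothetical shapes: 3 samples, 2 features, 4 hidden units.
x = np.ones((3, 2))
a1, output = feedforward(x, np.zeros((2, 4)), np.zeros((1, 4)),
                         np.zeros((4, 1)), np.zeros((1, 1)))
```

With all weights and biases at zero, every weighted sum is 0 and every activation is sigmoid(0) = 0.5, which is an easy case to check by hand.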

The loss function

Now that we have defined the basic process of our algorithm, we need a way to measure the error of the network. Remember, training a neural network is a three-step process: first we calculate an output, then an error, and finally we minimize the error. To calculate the error, we use a loss function. There are many loss functions; here we take the cross-entropy loss function which, for a target y (0 or 1) and a predicted output between 0 and 1, is:

J = -(y * log(output) + (1 - y) * log(1 - output))

Over several training examples, this quantity is typically averaged.

If you want to know more about the origin of this function and have an intuition of it, go and check Shannon’s entropy: https://en.wiktionary.org/wiki/Shannon_entropy

We will call this function at each epoch of the training process to calculate the loss.
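Here is a minimal sketch of a binary cross-entropy function. It assumes the loss is averaged over the samples and clips the predictions to avoid log(0); both choices are mine, not necessarily the author’s.

```python
import numpy as np


def cross_entropy(y, output, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 so that log() stays finite.
    output = np.clip(output, eps, 1.0 - eps)
    # Mean binary cross-entropy over all samples.
    return float(-np.mean(y * np.log(output) + (1.0 - y) * np.log(1.0 - output)))


# The target is 1 in all three cases; only the confidence of the prediction changes.
y = np.array([[1.0]])
loss_confident_right = cross_entropy(y, np.array([[0.99]]))
loss_unsure = cross_entropy(y, np.array([[0.5]]))
loss_confident_wrong = cross_entropy(y, np.array([[0.01]]))
print(loss_confident_right, loss_unsure, loss_confident_wrong)
```

The three values grow sharply as the prediction moves from confidently right to confidently wrong, which is the behavior that makes this loss a good fit for classification.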

NB: For regression problems (i.e. problems where continuous values are predicted), the mean squared error loss function is often used. We prefer the cross-entropy function since we are dealing with a classification problem. The cross-entropy function heavily penalizes wrong discrete predictions made with high confidence, which makes it a much better fit for classification problems than the mean squared error function.

The back-propagation process and the underlying mathematical equations

Now here comes the math part. Simply put, the whole idea of our neural network is to minimize the loss function by finding the optimal weights and biases. To do so, we use an optimization algorithm called the Gradient Descent algorithm.

Given a function J, a weight matrix W and a learning rate α, we can minimize our function by computing iteratively:

W := W - α * ∂J/∂W

Here, ∂J/∂W is the gradient of the loss function with respect to the weights.
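To see the update rule at work outside of a neural network, here is a minimal sketch that minimizes a hypothetical one-dimensional function, J(w) = (w - 3)², whose gradient is 2(w - 3). The function name and default values are assumptions for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    # Repeatedly apply w := w - learning_rate * dJ/dw.
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# J(w) = (w - 3)^2 has its minimum at w = 3, and dJ/dw = 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Each step shrinks the distance to the minimum by a constant factor here, so the result ends up very close to 3. A learning rate that is too large would make the iterates overshoot instead.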

But before diving into the math, let’s gain an intuition about the chain rule. What we are trying to compute is how much w1 and w2 affect our loss function, that is, the error of our network. However, as you can see in the expression of the loss function, there is no direct relationship between our loss function J and our weights: J is defined with respect to a target and an output. Hence, we need to “chain” our results to derive the variation of J induced by a variation of w1. We first derive the variation of our output induced by a variation of the weights, and then the variation of our loss function induced by a variation of our output. By chaining everything together, we find the variation of J induced by a variation of w1.

The following equations show how to derive the gradient using the chain rule.

The first step consists in deriving the gradient of the error with respect to each weight connecting the hidden layer to the output layer. Those weights are the coefficients of the matrix weight2 that we created previously.

∂J/∂w2 = ∂J/∂output * ∂output/∂z2 * ∂z2/∂w2

where z2 is the dot product of the hidden activations a1 and the weights connecting the hidden layer to the output layer, plus the bias: z2 = a1 * w2 + b2.

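Then, we can examine each factor. First:

∂J/∂output = -(y / output - (1 - y) / (1 - output)) = (output - y) / (output * (1 - output))

We have differentiated the cross-entropy function with respect to output.

Second:

∂output/∂z2 = output * (1 - output)

Let’s recall that output = sigmoid(z2). Plus, don’t forget the specific property of the derivative of the sigmoid function that I explained above.

Third:

∂z2/∂w2 = a1

since z2 = a1 * w2 + b2

By combining the 3 equations above, the factors output * (1 - output) cancel out and we eventually get:

∂J/∂w2 = a1ᵀ * (output - y)

where a1ᵀ is the transpose of a1, so that the result has the same shape as weight2.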

For the second step, we need to compute the gradient of the error with respect to each weight connecting the input layer to the hidden layer (or, in a larger neural network, the hidden layers in between). This time, the weights are the coefficients of the matrix weight1 that we created previously in the __init__ method of our NeuralNetwork class.

To do so, we reuse the calculations that we did before. We are chaining the results, since we combine every calculation we did so far to finally arrive at weight1. In a larger neural network, we would do exactly the same, with more weight matrices to compute.

Similarly, we take z1 = x * w1 + b1, the weighted input sum of our neural network, and a1 = sigmoid(z1), the activations of the hidden layer.

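We start by stating that:

∂J/∂w1 = ∂J/∂output * ∂output/∂z2 * ∂z2/∂a1 * ∂a1/∂z1 * ∂z1/∂w1

Here we go again, let’s study in depth the derivatives that compose ∂J/∂w1.

∂J/∂output * ∂output/∂z2 = output - y

We get the above by following the calculations that we did previously.

∂z2/∂a1 = w2

since z2 = a1 * w2 + b2.

∂a1/∂z1 = a1 * (1 - a1)

Again, remember the derivative of the sigmoid function.

∂z1/∂w1 = x

since z1 = x * w1 + b1.

Hence by combining everything, we obtain:

∂J/∂w1 = xᵀ * (((output - y) * w2ᵀ) ⊙ a1 ⊙ (1 - a1))

where ⊙ denotes the element-wise product and ᵀ the transpose.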

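Finally, we need to update the biases of our neural network. The basic idea remains the same: we need to calculate the gradient of the loss function with respect to b2 and b1 and update our biases. Here we go again, but this time the calculations are much simpler. Let’s take a vector b(i) that represents the bias of layer i, and use the chain rule again:

∂J/∂b(i) = ∂J/∂z(i) * ∂z(i)/∂b(i)

Now we have, from the calculations above:

∂J/∂z2 = output - y

∂J/∂z1 = ((output - y) * w2ᵀ) ⊙ a1 ⊙ (1 - a1)

But:

∂z(i)/∂b(i) = 1

since z(i) = a(i-1) * w(i) + b(i)

Hence, we get:

∂J/∂b(i) = ∂J/∂z(i)

When several samples are processed at once, this quantity is summed over the samples.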

Thus, the final code:
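The equations above can be sketched as a standalone function. This is a minimal version under stated assumptions: the loss is the mean cross-entropy over the n samples (hence the division by n), and the function and variable names are mine rather than the author’s.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def backprop(x, y, w1, b1, w2, b2):
    """Gradients of the mean cross-entropy loss for a 2-layer sigmoid network."""
    n = x.shape[0]
    # Forward pass, needed to get the activations.
    a1 = sigmoid(x @ w1 + b1)
    output = sigmoid(a1 @ w2 + b2)
    # dJ/dz2 = (output - y), averaged over the samples.
    delta2 = (output - y) / n
    grad_w2 = a1.T @ delta2
    grad_b2 = delta2.sum(axis=0, keepdims=True)
    # dJ/dz1: propagate delta2 back through w2, then through the sigmoid.
    delta1 = (delta2 @ w2.T) * a1 * (1.0 - a1)
    grad_w1 = x.T @ delta1
    grad_b1 = delta1.sum(axis=0, keepdims=True)
    return grad_w1, grad_b1, grad_w2, grad_b2


def gradient_step(params, grads, learning_rate):
    # Gradient descent update: p := p - learning_rate * dJ/dp for every parameter.
    return [p - learning_rate * g for p, g in zip(params, grads)]
```

A useful way to gain confidence in such code is to compare each gradient with a finite-difference estimate of the loss on a small random example.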

As you can see, the code also includes the learning rate, as required by the Gradient Descent algorithm: each weight and bias is updated by subtracting the learning rate times its gradient.

Making our neural network run

Now that we have a fully functional neural network with the feedforward and back-propagation processes implemented, we can create an instance of our Neural Network class and let it update its weights and biases for a pre-defined number of times (the epochs).

Here, 1500 epochs might be a good start to get a reasonably accurate neural network.

We feed our neural network with a training set; here we can pass a table of Boolean data about the key factors that lead a person to have diabetes. We make our neural network learn the patterns that predict whether or not a person has diabetes.

Find below the entire code and, as a bonus, a function to plot the evolution of the cost with respect to the number of epochs:
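The following is a minimal end-to-end sketch that puts the previous steps together. It is an illustration under stated assumptions, not the author’s original listing: the Boolean table is a tiny hypothetical dataset (not the diabetes data), the label simply equals the first feature, and the hyperparameters (4 hidden units, learning rate 0.5, weight scale 0.5) are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class NeuralNetwork:
    def __init__(self, x, y, hidden_units=4, learning_rate=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.x, self.y = x, y
        self.learning_rate = learning_rate
        self.weight1 = rng.normal(scale=0.5, size=(x.shape[1], hidden_units))
        self.weight2 = rng.normal(scale=0.5, size=(hidden_units, y.shape[1]))
        self.bias1 = np.zeros((1, hidden_units))
        self.bias2 = np.zeros((1, y.shape[1]))
        self.losses = []

    def feedforward(self, x):
        self.a1 = sigmoid(x @ self.weight1 + self.bias1)
        return sigmoid(self.a1 @ self.weight2 + self.bias2)

    def loss(self, output):
        # Mean binary cross-entropy, clipped to keep log() finite.
        out = np.clip(output, 1e-12, 1 - 1e-12)
        return float(-np.mean(self.y * np.log(out) + (1 - self.y) * np.log(1 - out)))

    def backprop(self, output):
        n = self.x.shape[0]
        delta2 = (output - self.y) / n
        # delta1 is computed before weight2 is updated, as the chain rule requires.
        delta1 = (delta2 @ self.weight2.T) * self.a1 * (1 - self.a1)
        self.weight2 -= self.learning_rate * (self.a1.T @ delta2)
        self.bias2 -= self.learning_rate * delta2.sum(axis=0, keepdims=True)
        self.weight1 -= self.learning_rate * (self.x.T @ delta1)
        self.bias1 -= self.learning_rate * delta1.sum(axis=0, keepdims=True)

    def train(self, epochs=1500):
        for _ in range(epochs):
            output = self.feedforward(self.x)
            self.losses.append(self.loss(output))
            self.backprop(output)
        return self

    def predict(self, x):
        # Threshold the predicted probability at 0.5.
        return (self.feedforward(x) >= 0.5).astype(int)


def plot_loss(losses):
    # matplotlib is imported lazily so the rest of the code runs without it.
    import matplotlib.pyplot as plt
    plt.plot(losses)
    plt.xlabel("Epoch")
    plt.ylabel("Cross-entropy loss")
    plt.show()


# Hypothetical Boolean table (NOT the diabetes data): the label is the first column.
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1],
              [1, 1, 1], [0, 0, 0], [1, 1, 0]], dtype=float)
y = x[:, [0]].copy()
net = NeuralNetwork(x, y).train(epochs=1500)
print("final loss:", net.losses[-1])
```

Calling plot_loss(net.losses) afterwards draws the loss curve, provided matplotlib is installed.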

If we plot the evolution of our loss, we get a smoothly decreasing curve, which is what we want. It suggests that the neural network is learning correctly and that the loss is progressively converging towards a minimum, although the curve alone does not prove that this minimum is the global one.

As you can see, our loss stops decreasing noticeably around the 800th epoch. Hence, we can lower the number of epochs to save computation time.

Where to go from here?

Now that you have a clear understanding of how a neural network works, you can generalize the neural network by changing the number of neurons in each layer and even by adding other hidden layers.
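As a starting point for that generalization, here is a minimal sketch of a forward pass driven by a list of layer sizes. The helper names and the sizes [3, 8, 8, 1] are hypothetical; back-propagation would need the same kind of loop, run in reverse.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layers(layer_sizes, seed=0):
    # One (weights, bias) pair per connection between consecutive layers.
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros((1, n_out)))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(x, layers):
    # Apply "weighted sum, then activation" once per layer.
    a = x
    for w, b in layers:
        a = sigmoid(a @ w + b)
    return a


# Hypothetical architecture: 3 inputs, two hidden layers of 8 units, 1 output.
layers = init_layers([3, 8, 8, 1])
out = forward(np.zeros((5, 3)), layers)
print(out.shape)
```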

You can also implement cyclical learning rates, a fairly simple technique that can help training, as highlighted in the fast.ai course that I highly recommend.
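To give an idea of what a cyclical learning rate looks like, here is a minimal sketch of the triangular schedule commonly used for it: the learning rate rises linearly from a lower bound to an upper bound, then falls back, and repeats. The function name and the default bounds and step size are assumptions for illustration.

```python
import math


def triangular_lr(iteration, base_lr=0.01, max_lr=0.1, step_size=100):
    # One full cycle lasts 2 * step_size iterations: step_size up, step_size down.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x goes 1 -> 0 -> 1 over a cycle, so (1 - x) goes 0 -> 1 -> 0.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Learning rate at the start, quarter, half and end of the first cycle.
schedule = [triangular_lr(i) for i in (0, 50, 100, 200)]
print(schedule)
```

In a training loop, the value returned for the current iteration would replace the fixed learning rate in the weight and bias updates.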

I hope you enjoyed this article; you can find below all the amazing references that I used. All credits to them. If you have any questions or remarks, please feel free to comment below. I’m far from being an expert in machine learning and any corrections or tips are welcome!

Sources and references

https://www.ics.uci.edu/~pjsadows/notes.pdf

http://neuron.eng.wayne.edu/tarek/MITbook/chap2/2_3.html

https://towardsdatascience.com/how-to-build-a-simple-neural-network-from-scratch-with-python-9f011896d2f3

https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6

https://towardsdatascience.com/coding-a-2-layer-neural-network-from-scratch-in-python-4dd022d19fd2

Written by Pierre-Antoine Bannier

HEC Paris graduate student. Currently following the Fast.ai course.
