# Understanding the math behind Neural Networks

Neural Networks (NNs) are the standard algorithms used in deep learning tasks. Intuitively, they are popular because of their ‘deep’ understanding of data, which comes from their peculiar structure. NNs are loosely modeled on the neurons of the human brain, and they aim to mimic the way those biological networks send and receive impulses. In other words, NNs imitate, in a simplified form, the way the human brain works.

Another interesting property of NNs is their flexibility in terms of structure and complexity. As you will see soon, they need only a few basic elements to work; anything extra might make them perform better, but the inner structure remains the same.

In this article, I will explain the math behind a simple NN, building it from scratch without any programming language, armed only with matrix calculations (do not panic: the data will be extremely simple and easy to handle). Of course, useful and powerful NNs are usually built with TensorFlow, Keras or PyTorch. Nevertheless, once you understand the idea behind them, you will find it far easier to build your algorithm in code.

So let’s start. As anticipated, NNs replicate the structure of our biological neurons, and they look like this:

So the circles are called neurons. One list of neurons is a layer: the first layer is the input layer, while the last one (in our case, represented by only one neuron) is the output layer; those in the middle, on the other hand, are the hidden layers. Finally, all the interconnections among neurons are our synapses.

In short, the main elements of an NN are neurons and synapses, both in charge of performing mathematical operations. Indeed, NNs are nothing but a series of mathematical computations: each synapse holds a weight, while each neuron computes a weighted sum of its inputs using the weights of its incoming synapses.

Let’s visualize it with a smaller structure (for the sake of simplicity, let’s assume that the last three synapses have no weights, or weight equal to 1):

It can also be displayed in a matrix format:

Then, by multiplying the output of our hidden layer with a vector of ones:

We obtain the output value. This is the core structure of an NN, and you will see how understanding these basic concepts greatly simplifies the building procedure. Basically, the idea is that the output of each layer becomes the input of the next one.
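The computation described above can be sketched in a few lines of plain Python. This is only an illustration: the input values, the weights and the choice of two inputs feeding three hidden neurons are assumptions made for the example, not the numbers from the figures.

```python
# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 2.0]              # input layer: two features
W = [[0.5, -1.0, 2.0],      # weights from input 1 to the three hidden neurons
     [1.5, 0.5, -0.5]]      # weights from input 2 to the three hidden neurons


def forward(x, W):
    """Return (hidden, output) for a network with one linear hidden layer."""
    n_hidden = len(W[0])
    # Each hidden neuron computes a weighted sum of the inputs (x times W).
    hidden = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_hidden)]
    # The last synapses all have weight 1, so the output is the hidden
    # layer multiplied by a vector of ones, i.e. a plain sum.
    ones = [1.0] * n_hidden
    output = sum(h * o for h, o in zip(hidden, ones))
    return hidden, output


hidden, output = forward(x, W)
print(hidden, output)
```

Note how the output of the hidden layer is passed on as the input of the output neuron, which is exactly the layer-to-layer pattern described above.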

Of course, there are many further concepts which need to be introduced, but this time I’ll do so while dealing with numerical data, since armed with those few bits of knowledge we can already start building our algorithm.

For this purpose, consider the following problem:

So we have four observations characterized by two features (hours of study=x1 and GPA=x2) and a target categorical value (exam result=y), hence we are facing a classification problem. Let’s now look at the first entry and build our NN:

As you can see, I put some numerical values in place of the weights. How did I decide their values though? Well, weights are the parameters of our algorithm, that is, the values which are going to be optimized in order to reduce the loss. However, they need to be initialized before being optimized, so you have to assign them an initial value. This value can be determined with different techniques, and I recommend reading my article about all the elements of NNs if you want to learn more about this step. Here, I just put random values, which we are going to optimize later on.
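As a rough sketch of what random initialization might look like in code, the snippet below draws each weight uniformly from [-1, 1]. The range, the helper name `init_weights` and the 2-by-3 shape are assumptions for illustration; the article does not state how its random values were drawn.

```python
import random


def init_weights(n_inputs, n_hidden, seed=None):
    """Return an n_inputs x n_hidden matrix of random starting weights."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # Assumption: uniform values in [-1, 1]; other schemes are possible.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
            for _ in range(n_inputs)]


W = init_weights(n_inputs=2, n_hidden=3, seed=0)
print(W)
```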

So we obtained an output value of 5.815. How can we relate it to our categorical output ‘pass’ or ‘fail’? The fact is that the mathematical operations return a continuous value, while we need a categorical one. So, we need to convert that 5.815 into something meaningful, and to do so, we are going to introduce a new element of NNs: the activation function. As the name suggests, activation functions convert the value a neuron receives into the value it passes on. There are many activation functions (again, you can learn more here), but one in particular helps here.

It is the Sigmoid function, and it converts any continuous value into an output within the range (0, 1), so that it can be read as a probability. More specifically, if we assume that ‘pass’=1 and ‘fail’=0, an output equal to 0.75 means that our observation will be ‘pass’ with 75% probability. So, if we apply this activation function to our output, we obtain 0.997 (that is, our first student will pass the exam with 99.7% probability).
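To check the number above, here is a minimal sketch of the Sigmoid function, written in its standard form 1 / (1 + e^(-z)) and applied to the output 5.815.

```python
import math


def sigmoid(z):
    """Map any real value z to a number strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


p = sigmoid(5.815)
print(round(p, 3))   # 0.997
print(sigmoid(0.0))  # 0.5: an output of zero means a 50/50 prediction
```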

The result is nice, but could we have done better? Of course, like any other algorithm, NNs need to be evaluated with a loss function and optimized accordingly. For this purpose, I will use the cross-entropy loss function (the one commonly used in classification problems):
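Assuming the standard binary form for a single observation, with the natural logarithm, the loss can be written as:

$$
L(y, p) = -\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]
$$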

Where y is the binary value (1 or 0) and p is the probability that our observation belongs to class 1. Applying this function to our output, with y = 1 and p rounded to 0.997, returns a loss of 0.003005. Armed with this value, how can we set up a strategy which minimizes it? The answer is the concept of Gradient Descent. Gradient descent relies on a step called Backpropagation, which computes how the loss changes with respect to each weight; all the weights are then recalibrated according to this equation:
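In its usual form, with η denoting the learning rate and L the loss (this notation is assumed here), the update rule reads:

$$
w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w}
$$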

Indeed, if we plot a generic loss function with respect to only one weight, we can see that there is a minimum where the loss is minimized.

Our goal is to reach that minimum, and we do that by changing the weight’s value towards that minimum.

So we need a direction (given by the sign of the first derivative, since we move against the slope) and a step size that controls how much the weight changes at each update (called the learning rate).
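The one-weight picture can be sketched as follows. The loss below is a made-up convex function with its minimum at w = 3, chosen only so that the result is easy to verify; it is not the loss of our network, and the learning rate and number of steps are likewise arbitrary.

```python
def loss(w):
    # Hypothetical loss in a single weight, minimized at w = 3.
    return (w - 3.0) ** 2


def gradient(w):
    # First derivative of the hypothetical loss above.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly move w against the slope, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))
```

With a small learning rate the weight creeps towards the minimum; with one that is too large, the updates can overshoot it.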

Once the backward propagation is done, we will have a new set of weights, so that we can compute a new loss value: if it is smaller than the previous one, it means our weights have been updated in the right direction. This re-calibration procedure is repeated many times until the loss stops improving.
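Putting the pieces together, the loop below is a minimal sketch of this repeated re-calibration. It assumes the small architecture used so far (a linear hidden layer, output synapses fixed at 1, a Sigmoid on the output), and the observation, starting weights, learning rate and number of epochs are all hypothetical. For this particular setup, the derivative of the cross-entropy loss with respect to a weight W[i][j] works out to (p - y) * x[i].

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, p):
    """Binary cross-entropy for one observation (natural logarithm)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def predict(x, W):
    # Linear hidden layer, output synapses fixed at 1, Sigmoid on top.
    hidden = [sum(x[i] * W[i][j] for i in range(len(x)))
              for j in range(len(W[0]))]
    return sigmoid(sum(hidden))


def train(x, y, W, learning_rate=0.1, epochs=20):
    """Update W in place and return it with the loss recorded at each epoch."""
    losses = []
    for _ in range(epochs):
        p = predict(x, W)                  # forward pass
        losses.append(cross_entropy(y, p))
        for i in range(len(x)):            # backward pass and update
            for j in range(len(W[0])):
                # dL/dW[i][j] = (p - y) * x[i] for this architecture.
                W[i][j] -= learning_rate * (p - y) * x[i]
    return W, losses


# Hypothetical observation (two features, label 1) and starting weights.
x, y = [1.0, 0.5], 1
W = [[0.1, -0.2, 0.3], [0.0, 0.2, -0.1]]
W, losses = train(x, y, W)
print(losses[0], losses[-1])
```

The recorded loss shrinks from one epoch to the next, which is the sign that the weights are moving in the right direction.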

So, in this article, we explored the basic elements of Neural Networks. Keep in mind that, as mentioned above, you can customize your NN with as many layers, neurons and activation functions as you want. Further, you will often need to proceed by trial and error before finding the best structure. To get a feel for this, you can visit the TensorFlow Playground and try different configurations of your algorithm.