Fundamentals of Neural Networks

Inspired by the human brain, Artificial Neural Networks are powerful models that excel at recognizing patterns. Recently, thanks to advances in parallel computing, these models have been widely used to solve complex problems in fields such as Computer Vision, Natural Language Processing, Robotics, and Drug Discovery. In this article, we describe the theoretical fundamentals of Artificial Neural Networks.

Published in Semantix · 8 min read · Jan 5, 2022

Artificial Neural Networks, or Neural Nets for short, are a group of mathematical models for nonlinear problems. Historically, these algorithms were inspired by the biological behaviour of the human brain. In the 1940s, McCulloch and Pitts [1], aiming to create an electronic brain, inaugurated the theory of “artificial neural networks” by proposing the first mathematical model of the biological neuron, as illustrated in Figure 1. This neuron model is the basis of the perceptron.

Fig. 1. Diagram of the mathematical model of the neuron.

In this model, each input xᵢ is associated with a weight wᵢ. The neuron takes the inputs xᵢ and aggregates them by computing the weighted sum:

z = w₁x₁ + w₂x₂ + … + wₙxₙ

Then, the aggregated value z is passed to the activation function f(z), which decides whether to activate the output or not. Initially, the model treated both xᵢ and f(z) as binary values, so each is either 0 or 1, meaning inactive or active. However, the model can easily be extended to real values.

To illustrate how this model works, let us consider the AND operator. We want to model the logical expression x AND y. Which weights and activation function give the correct output for all values of x and y? There are infinitely many correct answers to this question. One possibility is to choose the weights w₁=2, w₂=2, w₃=-3, where w₃ multiplies a constant input equal to 1, and the activation function f(.) as a step function that returns 1 for all z ≥ 0 and 0 otherwise. We show this model graphically in Figure 2.

Fig. 2. Illustration of the chosen neuron and the activation function.

From the chosen wᵢ’s, z is defined as z = 2x+2y-3. This model gives the correct output for all possible values of x and y, as indicated in the truth table in Figure 3.
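To make this concrete, here is a minimal Python sketch of the neuron above. It uses z = 2x+2y-3 from the text, treating -3 as the weight of a constant input equal to 1. The function names `step`, `neuron` and `and_gate` are ours, chosen only for illustration.

```python
def step(z):
    """Step activation: 1 when z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def neuron(inputs, weights, activation=step):
    """Aggregate the inputs as a weighted sum, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)


def and_gate(x, y):
    # The third input is a constant 1, so its weight (-3) acts as a bias.
    return neuron([x, y, 1], [2, 2, -3])


# Evaluate the neuron on every row of the AND truth table.
truth_table = {(x, y): and_gate(x, y) for x in (0, 1) for y in (0, 1)}
for (x, y), output in truth_table.items():
    print(x, y, output)
```

Only the input (1, 1) produces a non-negative z (2+2-3 = 1), so it is the only row for which the neuron outputs 1.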

Now that we understand how to model the AND operator using an artificial neuron, let us try to create a model of the XOR operator! Well… Not so fast! This problem remained open for almost 30 years, until the 1960s. The reason is that the XOR operator can’t be modelled using only one neuron, because its outputs are not linearly separable. As proposed by Minsky and Papert [2] in 1969, the solution to this problem requires 3 neurons organized in 2 layers. A model with multiple neurons organized in multiple layers constitutes a neural network, and this model gave rise to the Multi-layer Perceptron. In Figure 4, we show a neural net for the XOR operator. In this example, we have:

z₁ = 2x-2y-1
z₂ = -2x+2y-1
z₃ = f(z₁)+f(z₂)-1

ŷ = f(z₃) (output)

Fig. 4. A neural model for the XOR operator.

Again, as we can see in the truth table in Figure 5, this model has the correct output for all values of x and y.
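The three equations for z₁, z₂ and z₃ can be checked with a few lines of Python. This is only a sketch: it assumes the same step activation as in the AND example (1 for z ≥ 0, 0 otherwise), and the name `xor_net` is ours.

```python
def step(z):
    """Step activation: 1 when z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def xor_net(x, y):
    # Hidden layer: two neurons, each one detects a single "on" input.
    h1 = step(2 * x - 2 * y - 1)   # z1, active only for (x, y) = (1, 0)
    h2 = step(-2 * x + 2 * y - 1)  # z2, active only for (x, y) = (0, 1)
    # Output layer: one neuron that fires if either hidden neuron fired.
    z3 = h1 + h2 - 1
    return step(z3)


xor_table = {(x, y): xor_net(x, y) for x in (0, 1) for y in (0, 1)}
for (x, y), output in xor_table.items():
    print(x, y, output)
```

Each hidden neuron solves a linearly separable sub-problem, and the output neuron combines them, which is something a single neuron cannot do.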

At this point, we can note that some problems require not only many neurons but also an organization in multiple layers, forming a neural network. We use specific names for some layers. The first layer, with the X values, is the input layer. The last layer, with the predictions or ŷ values, is the output layer. All the layers between the input and output layers are called hidden layers.

The way the layers of a neural network are organized is called its architecture. For each problem there are several possible architectures, and choosing the best one is an art. Recently, several studies on the design of neural net architectures have appeared. This field was inaugurated by the paper of LeCun et al., “Deep Learning” [3], one of the most cited papers in Nature, with more than 45 thousand citations.

Forward Pass

The process of taking the input and generating the prediction is called the Forward Pass. So far, we have shown it using scalar operations, but to optimize this process by leveraging GPUs and parallel computing, we can rewrite the equations using matrix notation. Let us take the XOR example. We can write the weights between the input and hidden layers as the matrix W₁. Similarly, we denote the weights between the hidden and output layers by W₂. Each layer's output is then a matrix-vector product followed by the activation function.

Note that we can get the same output using only matrix operations. With these equations, the neural net takes an input value and maps it to an output, the prediction.
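As a sketch of the matrix form, the XOR network can be written with two weight matrices. The layout below is an assumption on our part: one row per neuron, with the last column holding the weight of a constant input equal to 1. The helper `matvec` is a plain-Python stand-in for the matrix product that a library such as NumPy or a GPU framework would perform in practice.

```python
def step(z):
    """Step activation: 1 when z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Assumed layout: one row per neuron, last column is the bias weight.
W1 = [[2, -2, -1],    # z1 = 2x - 2y - 1
      [-2, 2, -1]]    # z2 = -2x + 2y - 1
W2 = [[1, 1, -1]]     # z3 = f(z1) + f(z2) - 1


def forward(x, y):
    a0 = [x, y, 1]                          # inputs plus the constant 1
    a1 = [step(z) for z in matvec(W1, a0)]  # hidden layer activations
    a1.append(1)                            # constant 1 for the output layer
    z3 = matvec(W2, a1)[0]
    return step(z3)


predictions = [forward(x, y) for x in (0, 1) for y in (0, 1)]
print(predictions)
```

The result matches the scalar version of the XOR network, but every layer is now a single matrix-vector product, which is the operation that parallel hardware accelerates.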

Training a Neural Net

Now that we know the building blocks of a neural network, a question arises: how do we choose the weights of a neural net?

This question remained open for years during what is known as the AI Winter. In 1986, Rumelhart, Hinton and Williams [5] showed how to train multi-layer neural networks using simple stochastic gradient descent with backpropagation, previously proposed by Werbos [4].

Gradient Descent

Figure 6 shows the pseudocode of the Stochastic Gradient Descent algorithm (see “Gradient Descent Algorithm - a deep dive”). Despite its simplicity, it is a very powerful method. It takes the model prediction (y_pred) and the actual output value (y_actual) for a given input x and computes the loss J, also known as the error. It then updates each weight wᵢ by subtracting the learning rate α multiplied by the gradient of the loss with respect to wᵢ, and repeats this process until convergence.
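The update rule described above can be sketched in Python. This is not the pseudocode from the figure but a minimal illustration under our own assumptions: the model is a single linear neuron y_pred = w·x + b, the loss is the squared error J = (y_pred - y_actual)², and the toy data follows y = 2x + 1. The function name `sgd_linear` and its default parameters are hypothetical.

```python
import random


def sgd_linear(data, alpha=0.1, epochs=500, seed=0):
    """Fit y = w*x + b with stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initial weights
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a random order
        for x, y_actual in samples:
            y_pred = w * x + b
            error = y_pred - y_actual
            # J = error**2, so dJ/dw = 2*error*x and dJ/db = 2*error.
            w -= alpha * 2 * error * x
            b -= alpha * 2 * error
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
data = [(i / 4, 2 * (i / 4) + 1) for i in range(5)]
w, b = sgd_linear(data)
print(round(w, 3), round(b, 3))
```

Here a fixed number of epochs stands in for the convergence check. After training, w and b are very close to 2 and 1, the values used to generate the data.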

This algorithm relies on the fact that the gradient of a function at a particular point is a vector pointing in the direction of steepest ascent. So, when we walk in the opposite direction of the gradient, we move towards a local (potentially the global) minimum of the function, as shown in Figure 7. It is worth highlighting that the learning rate α is an important hyperparameter: if we choose values that are too large, the algorithm can easily diverge, and if we choose values that are too small, it may take too long to converge. There are many variations of Gradient Descent that converge faster by dynamically reducing the value of α in the final steps. If you are interested in better understanding optimization algorithms, I highly recommend the blog post “An overview of gradient descent optimization algorithms”.

Fig. 7. Illustration of Stochastic Gradient Descent running. Image from gfycat.
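The effect of the learning rate is easy to see on a toy problem. The sketch below minimizes J(w) = w², a function we chose only for illustration, whose gradient is 2w and whose minimum is at w = 0. The three values of α are arbitrary examples of a reasonable, a too large and a too small learning rate.

```python
def gradient_descent(alpha, steps=50, w=1.0):
    """Minimize J(w) = w**2, whose gradient is dJ/dw = 2*w."""
    for _ in range(steps):
        w -= alpha * 2 * w
    return w


w_good = gradient_descent(alpha=0.1)     # shrinks steadily towards 0
w_large = gradient_descent(alpha=1.1)    # overshoots further on every step
w_small = gradient_descent(alpha=0.001)  # has barely moved after 50 steps
print(w_good, w_large, w_small)
```

Each step multiplies w by (1 - 2α). With α = 0.1 this factor is 0.8 and w decays quickly. With α = 1.1 the factor is -1.2, so w oscillates with growing magnitude. With α = 0.001 the factor is 0.998, and 50 steps are far too few to get close to the minimum.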

Backpropagation

Fig. 8. Diagram showing how the loss is computed in a neuron.

Now, let us use the Gradient Descent algorithm to train a neural net. The first problem we face is that the gradient of the loss function with respect to wᵢ can’t be computed directly, because J depends on ŷ, which depends on z, which in turn depends on wᵢ. Thus, we have to recall the chain rule from Calculus.

The loss function measures how well (or badly) the model performs. It is used to compute the error between the prediction given by the model and the ground truth. For each problem, many loss functions can be optimized (see “What are Loss Functions?”). Choosing a good loss function is an important decision because it guides the algorithm along a good path toward an optimal solution. Once we choose the loss function, we know how to compute the gradient of J with respect to ŷ. Similarly, because we choose the activation function, we know how to compute the gradient of ŷ with respect to z.

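Applying the chain rule, we obtain:

∂J/∂wᵢ = (∂J/∂ŷ) · (∂ŷ/∂z) · (∂z/∂wᵢ)

where ∂z/∂wᵢ = xᵢ, because z is the weighted sum of the inputs.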

Applying the chain rule to find the gradient of the loss function with respect to the weights wᵢ is the Backpropagation algorithm. It can be used in conjunction with the Gradient Descent algorithm to train a single neuron or even a complex neural network, which requires computing many more derivatives.
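For a single neuron, the chain rule can be written out and checked numerically. The sketch below assumes a sigmoid activation and a squared-error loss, which are our choices for illustration since the article does not fix either. It compares the chain-rule gradient with a finite-difference estimate. The input values and weights are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(weights, inputs, y_actual):
    """Squared error of a single sigmoid neuron."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return (sigmoid(z) - y_actual) ** 2


def gradient(weights, inputs, y_actual):
    """dJ/dw_i = dJ/dy_pred * dy_pred/dz * dz/dw_i (chain rule)."""
    z = sum(w * x for w, x in zip(weights, inputs))
    y_pred = sigmoid(z)
    dJ_dy = 2.0 * (y_pred - y_actual)  # derivative of the squared error
    dy_dz = y_pred * (1.0 - y_pred)    # derivative of the sigmoid
    return [dJ_dy * dy_dz * x for x in inputs]  # dz/dw_i = x_i


def numerical_gradient(weights, inputs, y_actual, eps=1e-6):
    """Central finite differences, used only to check the chain rule."""
    grads = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += eps
        down[i] -= eps
        diff = loss(up, inputs, y_actual) - loss(down, inputs, y_actual)
        grads.append(diff / (2 * eps))
    return grads


weights, inputs, y_actual = [0.5, -0.3, 0.1], [1.0, 2.0, 1.0], 1.0
analytic = gradient(weights, inputs, y_actual)
numeric = numerical_gradient(weights, inputs, y_actual)
print(analytic)
print(numeric)
```

The two lists agree to several decimal places, which is a common sanity check when derivatives are written by hand.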

To illustrate how backpropagation works in a neural network, we created a net with two hidden layers, as illustrated in Figure 9. Let us use Gradient Descent to train this neural net. The first step is to predict the output value, which is done according to the equations in Eq. 1. Note that, at first, all the weights are initialized with random values. In the following step, the method computes the loss. Then, to update the weights, we need to compute the gradient of the loss with respect to each weight.

Fig. 9. Diagram showing how the loss is computed in a neural net.

For simplicity, let us show how to compute the gradient for the weight w¹₂₁ in the first layer. In Eq. 2, we show how to compute the gradient of the loss with respect to w¹₂₁ by applying the chain rule. Note that the more layers there are in the architecture, the more terms the chain rule contains.
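To see how the chain grows with depth, here is a small sketch of backpropagation through a net with two hidden layers. The layer sizes, weight values, sigmoid activation and squared-error loss are all our own assumptions, and biases are omitted to keep the code short. The gradient of one first-layer weight is then checked against a finite-difference estimate.

```python
import copy
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, x):
    """Return the activations of every layer, starting with the input."""
    activations = [list(x)]
    for W in weights:
        prev = activations[-1]
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, prev))) for row in W]
        )
    return activations


def loss(weights, x, y_actual):
    return (forward(weights, x)[-1][0] - y_actual) ** 2


def backprop(weights, x, y_actual):
    """Gradient of the loss with respect to every weight, layer by layer."""
    activations = forward(weights, x)
    y_pred = activations[-1][0]
    # delta[j] holds dJ/dz_j for the current layer, starting at the output.
    delta = [2.0 * (y_pred - y_actual) * y_pred * (1.0 - y_pred)]
    grads = [None] * len(weights)
    for layer in range(len(weights) - 1, -1, -1):
        prev = activations[layer]
        # dz_j/dw_ji is the i-th activation of the previous layer.
        grads[layer] = [[d * a for a in prev] for d in delta]
        if layer > 0:
            # One more chain-rule factor per layer we move back through.
            delta = [
                sum(weights[layer][j][i] * delta[j] for j in range(len(delta)))
                * prev[i] * (1.0 - prev[i])
                for i in range(len(prev))
            ]
    return grads


def numerical_grad(weights, x, y_actual, layer, j, i, eps=1e-6):
    up, down = copy.deepcopy(weights), copy.deepcopy(weights)
    up[layer][j][i] += eps
    down[layer][j][i] -= eps
    return (loss(up, x, y_actual) - loss(down, x, y_actual)) / (2 * eps)


# Arbitrary example weights: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
weights = [
    [[0.5, -0.3], [0.8, 0.2]],
    [[-0.6, 0.4], [0.1, 0.9]],
    [[0.7, -0.5]],
]
x, y_actual = [1.0, 0.5], 1.0

grads = backprop(weights, x, y_actual)
# First layer, second neuron, first input (our reading of the index order).
analytic = grads[0][1][0]
numeric = numerical_grad(weights, x, y_actual, layer=0, j=1, i=0)
print(analytic, numeric)
```

The loop in `backprop` multiplies in one extra factor for every layer it crosses, which is exactly why deeper architectures lead to longer chain-rule expressions.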

As you can see, training a Neural Network requires a solid mathematical background. Also, applying the chain rule many times by hand to compute the derivatives is error-prone, which can make it tough to train the model. Fortunately, in practice, we use frameworks such as PyTorch and TensorFlow, which do all this work for us. These frameworks perform the computations very fast using matrix operations, optimized with GPUs and parallel computing.

Summary

To develop a neural network model, you need to:

Choose a Deep Learning framework of your preference.
Define the overall architecture by designing how the layers will be organized and the number of neurons in each layer.
Decide which activation function you will use (see “Introduction to Different Activation Functions for Deep Learning”).
Define the loss function (see “What are Loss Functions?”).
Define the optimizer (see “An overview of gradient descent optimization algorithms”).
Finally, train your model.
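The steps above can be sketched end to end without any framework on the smallest example from this article, the AND operator. Every choice below is an assumption made for illustration: the architecture is a single neuron, the activation is a sigmoid, the loss is the cross-entropy, and the optimizer is plain full-batch Gradient Descent with α = 0.5 for 5000 steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Architecture: one neuron with two inputs plus a constant input of 1.
def predict(weights, x, y):
    return sigmoid(weights[0] * x + weights[1] * y + weights[2])


# Training data: the truth table of the AND operator.
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def train(alpha=0.5, steps=5000):
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grads = [0.0, 0.0, 0.0]
        for (x, y), target in DATA:
            # For a sigmoid output with cross-entropy loss, the chain rule
            # simplifies to dJ/dz = y_pred - target.
            dJ_dz = predict(weights, x, y) - target
            for i, inp in enumerate((x, y, 1)):
                grads[i] += dJ_dz * inp / len(DATA)  # dz/dw_i = input i
        # Gradient Descent update with learning rate alpha.
        weights = [w - alpha * g for w, g in zip(weights, grads)]
    return weights


weights = train()
learned = {inputs: round(predict(weights, *inputs)) for inputs, _ in DATA}
print(weights)
print(learned)
```

After training, rounding the neuron's output reproduces the AND truth table. With a framework, the gradient computation inside the loop would be handled automatically.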

References

[1] McCulloch, W.S., Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133 (1943). https://doi.org/10.1007/BF02478259

[2] Minsky, M., Papert, S. Perceptrons: An Introduction to Computational Geometry. 2nd edition with corrections (first edition 1969). The MIT Press, Cambridge MA (1972). ISBN 0–262–63022–2.

[3] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015). https://doi.org/10.1038/nature14539

[4] Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard Univ. (1974).

[5] Rumelhart, D., Hinton, G. & Williams, R. Learning representations by back-propagating errors. Nature 323, 533–536 (1986). https://doi.org/10.1038/323533a0