Neural Networks, Multilayer Perceptron and the Backpropagation Algorithm

Tiago M. Leite
8 min read · May 10, 2018



Have you ever wondered how image recognition systems work? How a mobile app detects faces, or how a smart keyboard suggests the next word? It turns out that neural networks have been widely used to perform such tasks, and they have also proved useful in other areas, like function approximation, time series forecasting and natural language processing.

In this article, I’m going to explain how a basic type of neural network works: the Multilayer Perceptron, as well as the fascinating algorithm responsible for its learning, called backpropagation. This model has served as the basis for more complex models in use nowadays, like Convolutional Neural Networks, which are the state of the art for image classification.

If you studied Calculus in college and haven’t forgotten it yet, you’ll have no problem understanding the mathematics here. Otherwise, don’t be put off by the formulas; just focus on the main ideas. Some real-world analogies are given along the way to illustrate the concepts and help you understand them.

Inspired by how biological neurons in the animal nervous system work, a computational model of the neuron was established in Artificial Intelligence, according to the following illustration:

Computational model of a neuron

The input signals are represented by the array x = [x1, x2, x3, …, xN], whose elements can correspond, for instance, to the pixels of an image. As they are fed to the neuron, they are multiplied by the corresponding synaptic weights, the elements of the array w = [w1, w2, w3, …, wN], generating the value z, usually called the “activation potential”, according to the following expression:

z = w1·x1 + w2·x2 + … + wN·xN + b

The additional term b, which is not affected by the input array, is called the bias and provides one more degree of freedom to the model. The value z is then passed to a (preferably non-linear) activation function σ, responsible for limiting that value to a certain range and producing the neuron’s final output y. Some typical activation functions are the sigmoid, the hyperbolic tangent, softmax and ReLU (Rectified Linear Unit).
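In code, the neuron above is just a dot product plus a bias, followed by the activation function. Here is a minimal NumPy sketch; the input and weight values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input signals and synaptic weights
x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.2])   # weights w1, w2, w3
b = 0.3                          # bias

z = np.dot(w, x) + b             # activation potential z
y = sigmoid(z)                   # final neuron output
```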

One single neuron can’t do much, but we can combine many of them in a layered structure, each layer having a different number of neurons, forming a neural network known as a Multilayer Perceptron (MLP). The input array x passes through the first layer, whose output values are connected to the inputs of the next layer, and so on, until the network produces the outputs of its last layer as the result. Arranging the network in several layers makes it deep, allowing it to learn ever more complex patterns and relationships present in the data.
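The forward pass through an MLP is just the single-neuron computation repeated layer by layer. A short sketch, assuming a sigmoid activation everywhere and an arbitrary 3-4-2 architecture with random weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate x through a list of (W, b) layer parameters."""
    y = x
    for W, b in layers:
        z = W @ y + b      # activation potentials of the whole layer
        y = sigmoid(z)     # layer outputs, fed to the next layer
    return y

rng = np.random.default_rng(0)
# A 3-4-2 network: 3 inputs, a hidden layer of 4 neurons, 2 outputs
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
output = forward(np.array([0.5, -1.0, 2.0]), layers)
```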

Neurons arranged to form a network. Each circle represents a neuron like the one described previously.

The network needs to be trained in order to work. Training happens in the context of supervised machine learning, where each data sample has a label indicating the class it belongs to. The general idea, then, is to make the network learn the patterns associated with each class of objects, so that when a new, unknown sample is given to the network, it can predict which class that sample belongs to. How can this be done?

The idea of the backpropagation algorithm is, based on the calculation of the error (or loss), to recalculate the weight array w of the last layer of neurons, and to proceed this way towards the previous layers, from back to front. That is, we update all the weights w of each layer, from the last one back to the input layer, by backpropagating the error obtained by the network. In other words, we calculate the error between what the network predicted and the true answer, then recalculate all the weight values, from the last layer to the very first one, always aiming to decrease the neural network’s error.

Translating that idea into mathematical terms, backpropagation consists of the following steps:

  • Initialize all the weights with small random values;
  • Feed data into the network and compute the value of the error function, obtained by comparison with the expected output. Since the learning process is supervised, we know the correct answer beforehand. It’s important that the error function be differentiable;
  • In order to minimize the error, the gradient of the error function with respect to each weight is calculated. It’s known from Calculus that the gradient vector points in the direction of greatest increase of a function; since we want to move the weights in the direction of greatest decrease of the error function, we just take the opposite direction of the gradient vector and… voilà! There is an excellent path to walk;
  • Once the gradient vector has been calculated, each weight is updated iteratively, and we keep recalculating the gradients at the beginning of each training iteration, until the error falls below a certain established threshold or the maximum number of iterations is reached, when the algorithm finally ends and, hopefully, the network is well trained.

Thus, the general formula to update the weights is:

w ← w − η · ∂E/∂w

That is, the weight value at the current iteration is its value at the previous iteration minus a quantity proportional to the gradient. The negative sign indicates that we are taking the direction opposite to the gradient vector, as previously mentioned. The parameter η is the learning rate of the network, which controls the step size taken when updating the weights. To make an analogy, imagine that you are in a hilly, dark area and wish to get down as fast as possible to a valley, hoping to find a river of crystal-clear water there. You could walk in many directions, but you try to figure out the best one by feeling the soil around you and taking the direction of steepest descent; in mathematical terms, you are going in the direction opposite to the gradient vector. Also consider that you can control your step size: steps that are too large may carry you onto the next hill ahead, skipping the valley, while very short steps will make you take a long time to descend and find the river.
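The update rule is easy to see in action on a one-dimensional toy problem; this sketch minimizes the made-up error function E(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum sits at w = 3:

```python
# Gradient descent on the toy error function E(w) = (w - 3)**2.
eta = 0.1   # learning rate: the step size
w = 0.0     # initial weight
for _ in range(100):
    grad = 2 * (w - 3)   # dE/dw
    w = w - eta * grad   # step in the direction opposite to the gradient
# After the loop, w has descended very close to the minimum at 3
```

A larger η would overshoot back and forth around 3 (or diverge for η > 1 here), while a tiny η would need far more iterations: exactly the hill-descending trade-off described above.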

The key point in the previous equation is the expression ∂E/∂w, which consists of the partial derivatives of the error function E with respect to each weight of the array w. To help us, consider the following picture, which illustrates an MLP network with 2 layers and will serve as the basis for the explanation of backpropagation. A connection between a neuron j and a neuron i of the next layer has weight w[j,i]. Notice that the superscript index, inside parentheses, indicates the number of the layer the variable in question belongs to.

MLP network used for the illustration. The weights of the neurons at the bottom were omitted so as not to clutter the picture.

Feel free to refer back to the previous picture as you follow the next steps. Let y be the expected output and ŷ the output computed by the network; then the error function can be defined as:

E = ½ · Σᵢ (y[i] − ŷ[i])²

That is, we are just summing the squared differences between the corresponding elements of both arrays (the factor ½ is there only to simplify the derivatives). Now, the partial derivative of the error with respect to the network output ŷ:

∂E/∂ŷ[i] = ŷ[i] − y[i]

Following the network structure leftwards, we proceed by calculating the derivative of the error with respect to the activation potential z of layer (2), using the Chain Rule:

∂E/∂z[i](2) = ∂E/∂ŷ[i] · ∂ŷ[i]/∂z[i](2) = (ŷ[i] − y[i]) · σ′(z[i](2))

Notice that σ′ is the derivative of the neuron’s activation function. For instance, when using the sigmoid function, its derivative is σ′(z) = σ(z)(1 − σ(z)).
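That identity is easy to verify numerically; the sketch below compares σ′(z) = σ(z)(1 − σ(z)) against a central finite-difference approximation of the derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # the identity σ'(z) = σ(z)(1 − σ(z))

# Central finite-difference check at an arbitrary point
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_prime(z)
# numeric and analytic agree to many decimal places
```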

That’s the local gradient with respect to the i-th neuron of layer (2); to keep the formulas short, it will be simply denoted δ:

δ[i](2) = (ŷ[i] − y[i]) · σ′(z[i](2))

Finally, using the Chain Rule again, we calculate the partial derivative of the error with respect to the weight w[j,i] of layer (2):

∂E/∂w[j,i](2) = δ[i](2) · y[j](1)

The calculation with respect to the bias is similar, and yields:

∂E/∂b[i](2) = δ[i](2)

Once we have these results, it’s possible to apply the general formula to update the weights of the neurons in layer (2):

w[j,i](2) ← w[j,i](2) − η · δ[i](2) · y[j](1)

Similarly, for the biases:

b[i](2) ← b[i](2) − η · δ[i](2)
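Putting the layer-(2) formulas together in code: the sketch below computes δ for a single sigmoid output neuron (all numbers are made up) and applies one update step, which should reduce the error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of layer (1) and parameters of layer (2)
y_prev = np.array([0.2, 0.9])    # previous-layer outputs y[j](1)
W = np.array([[0.5, -0.3]])      # weights w[j,i](2), one output neuron
b = np.array([0.1])              # bias b(2)
y_true = np.array([1.0])         # expected output y
eta = 0.5                        # learning rate

y_hat = sigmoid(W @ y_prev + b)                  # network output ŷ
err_before = 0.5 * np.sum((y_true - y_hat) ** 2)

delta = (y_hat - y_true) * y_hat * (1 - y_hat)   # local gradient δ(2)
W = W - eta * np.outer(delta, y_prev)            # weight update
b = b - eta * delta                              # bias update
```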

That’s all for the last layer; now we need to do the same for the previous one. Thus, we need to calculate the partial derivative of the error with respect to the output y[i] of layer (1). The key insight here is to realize that y[i] influences all the neurons of the next layer, so we need to sum the error contributions that y[i] propagates into that layer:

∂E/∂y[i](1) = Σₖ ∂E/∂z[k](2) · ∂z[k](2)/∂y[i](1)

But:

∂z[k](2)/∂y[i](1) = w[i,k](2)

Which results in:

∂E/∂y[i](1) = Σₖ δ[k](2) · w[i,k](2)

Going ahead, it turns out that:

∂E/∂z[i](1) = ∂E/∂y[i](1) · σ′(z[i](1)) = σ′(z[i](1)) · Σₖ δ[k](2) · w[i,k](2) = δ[i](1)

The δ we have set the result equal to follows the same idea as before: it’s the local gradient with respect to the i-th neuron of layer (1). Finally, we have:

∂E/∂w[j,i](1) = δ[i](1) · x[j]

Replacing it in the weight-update formula:

w[j,i](1) ← w[j,i](1) − η · δ[i](1) · x[j]

Again, the expressions for the biases are similar and are left as an exercise.

That’s it! If there were more layers, the procedure would continue, always following that same pattern of calculating the partial derivatives, backpropagating the errors from the last to the first layer of the network.
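The whole procedure fits in a few lines of NumPy. The sketch below trains a small 2-4-1 MLP on the XOR problem using the δ formulas derived above; the architecture, learning rate, iteration count and random seed are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: a classic problem that a single neuron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1, b1 = rng.normal(scale=0.5, size=(4, 2)), np.zeros(4)  # layer (1)
W2, b2 = rng.normal(scale=0.5, size=(1, 4)), np.zeros(1)  # layer (2)
eta = 1.0

def loss():
    pred = np.array([sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) for x in X])
    return 0.5 * np.sum((Y - pred) ** 2)

loss_before = loss()
for _ in range(5000):
    for x, y in zip(X, Y):
        # Forward pass
        y1 = sigmoid(W1 @ x + b1)
        y2 = sigmoid(W2 @ y1 + b2)
        # Backward pass: local gradients, from the last layer to the first
        d2 = (y2 - y) * y2 * (1 - y2)      # δ(2) = (ŷ − y)·σ'(z(2))
        d1 = (W2.T @ d2) * y1 * (1 - y1)   # δ(1): backpropagated error
        # Gradient-descent updates
        W2 -= eta * np.outer(d2, y1); b2 -= eta * d2
        W1 -= eta * np.outer(d1, x);  b1 -= eta * d1
loss_after = loss()
```

After training, the loss has dropped well below its initial value, and the network has learned a function no single neuron could represent.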

Modern deep learning models, like Convolutional Neural Networks, despite being more refined than the MLP, also use backpropagation internally; Recurrent Neural Networks, which have been applied to natural language processing, also rely on that algorithm. What is most incredible is that such models can find hidden patterns that are obscure and hard for us humans to notice, which reveals their full power.
