MLPs and Back-Propagation: A mathematical guide

Chamuditha Kekulawala
5 min read · Jul 6, 2024


In the previous article we gave an introduction to neural networks and perceptrons. Now let’s talk about MLPs.

An MLP is composed of one (passthrough) input layer, one or more layers of threshold logic units (TLUs) called hidden layers, and one final layer of TLUs called the output layer.

This architecture is an example of a feedforward neural network (FNN), because the signal flows in only one direction (from the inputs to the outputs).
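To make the one-directional flow concrete, here is a minimal sketch of a forward pass through layers of TLUs. The helper names (`step`, `layer_forward`, `mlp_forward`) and the hand-picked XOR weights are illustrative assumptions, not something taken from a library or from the previous article.

```python
def step(z):
    """Heaviside step activation used by a TLU."""
    return 1 if z >= 0 else 0

def layer_forward(inputs, weights, biases):
    """Compute the outputs of one layer of TLUs.

    weights[j] holds the connection weights going into neuron j.
    """
    return [step(sum(w * x for w, x in zip(w_j, inputs)) + b_j)
            for w_j, b_j in zip(weights, biases)]

def mlp_forward(inputs, layers):
    """Send the signal through each layer in turn, inputs to outputs."""
    signal = inputs  # the input layer is a passthrough
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal

# Hand-picked weights (an assumption for illustration):
# hidden neuron 0 computes AND, hidden neuron 1 computes OR,
# and the output neuron fires when OR is on but AND is off, i.e. XOR.
xor_layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [-1.5, -0.5]),  # hidden layer
    ([[-1.0, 1.0]], [-0.5]),                   # output layer
]

xor_outputs = [mlp_forward([a, b], xor_layers)[0]
               for a in (0, 1) for b in (0, 1)]
```

Each layer only ever reads the output of the layer before it, which is all that "feedforward" means here.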

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations. For many years researchers struggled to find a way to train MLPs, without success. But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper introducing the backpropagation training algorithm, which is still used today.

In short, it is simply Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network’s error with regard to every single model parameter.

In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.
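As a rough illustration of the "compute the gradients, take a step, repeat" loop, here is a sketch of Gradient Descent on the smallest possible model: one weight and one bias. The gradients are written out by hand here rather than obtained by backpropagation, and the data points, learning rate, and step count are assumptions chosen for the example.

```python
def gradient_descent(data, lr=0.05, steps=5000):
    """Fit y = w * x + b by repeatedly stepping against the error gradient."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradient of the mean squared error with regard to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
        # Tweak each parameter in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 3 (an assumption for illustration).
points = [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0)]
w, b = gradient_descent(points)
```

In a real MLP the only thing that changes is how the gradients are obtained: backpropagation supplies them for every weight and bias at once.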

Back-propagation

Let’s run through this algorithm in a bit more detail:

  • It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times.

Each pass is called an epoch.

  • Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer (the output layer).

This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.

  • Next, the algorithm measures the network’s output error.

It uses a loss function that compares the desired output and the actual output of the network.

  • Then it computes how much each output connection contributed to the error.

This is done analytically by simply applying the chain rule (a fundamental rule in calculus), which makes this step fast and precise.

  • The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until the algorithm reaches the input layer.

This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network.

  • Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.

This algorithm is so important, it’s worth summarizing it again: for each training instance, the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
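The steps above can be sketched in plain Python for a tiny network with one hidden layer and a single output. This is a minimal sketch under several assumptions: logistic activations, a squared-error loss, the OR function as training data, and made-up function names (`init_network`, `forward`, `backward`, `train`). It is not the original 1986 implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_in, n_hidden, rng):
    """Random connection weights; biases start at zero."""
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    """Forward pass for one instance; the hidden outputs are kept for the reverse pass."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(net["W2"], h)) + net["b2"])
    return h, y_hat

def backward(net, x, y, h, y_hat):
    """Reverse pass: chain rule from the error (y_hat - y)**2 back to every parameter."""
    delta_out = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)  # error signal at the output neuron
    # Contribution of each hidden neuron, through its connection to the output.
    delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(net["W2"], h)]
    return {
        "W1": [[d * xi for xi in x] for d in delta_hidden],
        "b1": delta_hidden,
        "W2": [delta_out * hi for hi in h],
        "b2": delta_out,
    }

def mean_loss(net, data):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in data) / len(data)

def train(net, data, epochs=2000, batch_size=2, lr=0.5, rng=None):
    if rng is None:
        rng = random.Random(0)
    data = list(data)
    for _ in range(epochs):  # one epoch = one full pass over the training set
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients for the whole mini-batch are computed before any weight changes.
            grads = [backward(net, x, y, *forward(net, x)) for x, y in batch]
            scale = lr / len(batch)  # step along the mini-batch average gradient
            for g in grads:
                for j, row in enumerate(g["W1"]):
                    for i, g_ji in enumerate(row):
                        net["W1"][j][i] -= scale * g_ji
                    net["b1"][j] -= scale * g["b1"][j]
                    net["W2"][j] -= scale * g["W2"][j]
                net["b2"] -= scale * g["b2"]
    return net

rng = random.Random(42)
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = init_network(2, 3, rng)
loss_before = mean_loss(net, or_data)
train(net, or_data, rng=rng)
loss_after = mean_loss(net, or_data)
```

Comparing `loss_before` with `loss_after` shows whether the repeated forward pass, reverse pass, and Gradient Descent step actually reduced the error on this toy problem.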

Initializing the hidden layer

It is important to initialize all the hidden layers’ connection weights randomly, or else training will fail. For example, if you initialize all weights and biases to zero, then all neurons in a given layer will be perfectly identical, and thus backpropagation will affect them in exactly the same way, so they will remain identical. In other words, despite having hundreds of neurons per layer, your model will act as if it had only one neuron per layer: it won’t be too smart. If instead you randomly initialize the weights, you break the symmetry and allow backpropagation to train a diverse team of neurons.
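Here is a small experiment that illustrates the symmetry problem. It is only a sketch: the 2-2-1 network, the XOR-style data, and the function name `train_2_2_1` are assumptions, and biases are left out for brevity (zero-initialized biases would stay identical in the same way).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_2_2_1(W1, W2, data, steps=200, lr=0.5):
    """Train a 2-2-1 logistic network (no biases) on squared error.

    Returns the trained hidden-layer weights, one row per hidden neuron.
    """
    W1 = [row[:] for row in W1]  # copy so the caller's lists are untouched
    W2 = W2[:]
    for _ in range(steps):
        for x, y in data:
            # Forward pass.
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
            y_hat = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
            # Reverse pass (chain rule), computed before any weight is changed.
            delta_out = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
            delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
            # Gradient Descent step.
            for j in range(2):
                W2[j] -= lr * delta_out * h[j]
                for i in range(2):
                    W1[j][i] -= lr * delta_hidden[j] * x[i]
    return W1

data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0), ([0.0, 0.0], 0.0)]

# All-zero initialization: both hidden neurons start out identical.
zero_W1 = train_2_2_1([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], data)

# Random initialization breaks the symmetry.
rng = random.Random(0)
rand_W1 = train_2_2_1(
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    [rng.uniform(-1, 1) for _ in range(2)],
    data,
)
```

With the zero start, the two rows of `zero_W1` move during training but stay equal to each other, because both neurons receive exactly the same update at every step. The rows of `rand_W1` start different and remain different.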

In order for this algorithm to work properly, the authors of the 1986 paper made a key change to the MLP’s architecture: they replaced the step function with the logistic function, σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step.
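A quick way to see the difference is to compare slopes at a few points. The sketch below uses the standard identity σ′(z) = σ(z)(1 – σ(z)); the sample points and the finite-difference helper `numerical_derivative` are assumptions made for illustration.

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of the slope of f at z."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

sample_points = [-4.0, -1.0, 0.5, 3.0]

# The step function is flat away from z = 0, so its slope is zero there.
step_slopes = [numerical_derivative(step, z) for z in sample_points]

# The logistic function always has some slope to follow.
sigmoid_slopes = [sigmoid_prime(z) for z in sample_points]
```

Every entry of `step_slopes` is zero, so Gradient Descent gets no signal from a step activation, while every entry of `sigmoid_slopes` is positive.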

In fact, the backpropagation algorithm works well with many other activation functions, not just the logistic function. Two other popular activation functions are:

  • The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1

Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer’s output more or less centered around 0 at the beginning of training. This often helps speed up convergence.
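The relationship between tanh and the logistic function is easy to check numerically. In this sketch, `tanh_via_sigmoid` is a made-up helper name, and the result is compared against Python's built-in `math.tanh` at a few arbitrary points.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    """tanh expressed through the logistic function: 2 * sigma(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

check_points = [-3.0, -0.5, 0.0, 0.5, 3.0]

# Largest disagreement with the standard library's tanh over the sample points.
max_gap = max(abs(tanh_via_sigmoid(z) - math.tanh(z)) for z in check_points)

# Outputs are centered on 0 and stay strictly between -1 and 1.
tanh_values = [tanh_via_sigmoid(z) for z in check_points]
```

So tanh is just a rescaled and shifted logistic function: the stretch by 2 and the shift by 1 move its output range from (0, 1) to (–1, 1).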

  • The Rectified Linear Unit function: ReLU(z) = max(0, z)

It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for z < 0. However, in practice it works very well and has the advantage of being fast to compute. Most importantly, the fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent, since its gradient does not shrink toward zero for large positive inputs the way the gradients of S-shaped functions do.
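Here is a short sketch of ReLU and its slope, next to the logistic slope for comparison. Treating the slope at exactly z = 0 as 0 is a convention assumed here (the true derivative is undefined at that point), and `relu_prime` is a made-up helper name.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    """Slope of ReLU: 1 for z > 0, 0 for z < 0.

    At z == 0 the derivative is undefined; returning 0 there is an
    assumed convention for this example.
    """
    return 1.0 if z > 0 else 0.0

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# For a large positive input, the logistic slope has almost vanished,
# while the ReLU slope is still 1.
large_input = 10.0
relu_slope = relu_prime(large_input)
sigmoid_slope = sigmoid_prime(large_input)
```

At z = 10 the logistic slope is below 0.0001 while the ReLU slope is still 1, which is one reason an unbounded activation can keep Gradient Descent moving.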

These activation functions and their derivatives are represented in the figure below:

Why do we need activation functions?

Wait! Why do we need activation functions in the first place? Well, if you chain several linear transformations, all you get is a linear transformation. For example, say f(x) = 2x + 3 and g(x) = 5x – 1, then chaining these two linear functions gives you another linear function: f(g(x)) = 2(5x – 1) + 3 = 10x + 1. So if you don’t have some non-linearity between layers, then even a deep stack of layers is equivalent to a single layer: you cannot solve very complex problems with that.
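The same arithmetic can be checked in a few lines. In the sketch below, `collapse` is a made-up helper that merges two such functions into a single weight and bias, and the ReLU placed between them is just one possible choice of non-linearity for the comparison.

```python
def f(x):
    return 2 * x + 3

def g(x):
    return 5 * x - 1

def collapse(w_outer, b_outer, w_inner, b_inner):
    """Merge outer(inner(x)) into one (weight, bias) pair.

    w_outer * (w_inner * x + b_inner) + b_outer
        = (w_outer * w_inner) * x + (w_outer * b_inner + b_outer)
    """
    return w_outer * w_inner, w_outer * b_inner + b_outer

merged_w, merged_b = collapse(2, 3, 5, -1)  # f is the outer function, g the inner one

def relu(z):
    return max(0, z)

def h(x):
    """Same two 'layers', but with a non-linearity in between."""
    return f(relu(g(x)))

# A straight line rises by the same amount over equal steps in x.
linear_rises = (f(g(1)) - f(g(0)), f(g(0)) - f(g(-1)))
nonlinear_rises = (h(1) - h(0), h(0) - h(-1))
```

Here `merged_w` and `merged_b` come out as 10 and 1, matching f(g(x)) = 10x + 1. The two entries of `linear_rises` are equal, while those of `nonlinear_rises` are not, so the version with ReLU in the middle can no longer be replaced by a single linear layer.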

In the next article let’s discuss the types of MLPs. Thanks for reading 🎉
