Introduction to how a Multilayer Perceptron works, but without complicated math

SangGyu An · Published in CodeX · Oct 9, 2021

What will you do when your coffee isn’t sweet enough? You would probably add more sugar. How about when it’s cold outside? You would wear more clothes. So oftentimes, adding more of something solves a problem, and this approach also works with the perceptron to solve complex problems.

The basic structure of the Multilayer Perceptron

What the multilayer perceptron (MLP) adds to the perceptron to solve complex problems is a hidden layer. A hidden layer sits between the input and output layers, and there can be more than one of them. When there is more than one, the layers closer to the input layer are called the lower layers, and the ones closer to the output layer are called the upper layers. And just like the input layer, each hidden layer contains a bias neuron.

Basic structure of the MLP

Another thing to notice from the structure is that every neuron is connected to every neuron in the previous and the next layer. A layer wired this way is called a fully connected layer, or dense layer.
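
To make the wiring concrete, here is a minimal sketch of one dense layer in NumPy. The layer sizes, the random weights, and the use of tanh are my own illustrative choices, not something fixed by the MLP itself:

```python
import numpy as np

# One fully connected (dense) layer: 3 inputs -> 4 hidden neurons.
# The sizes are arbitrary, chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # one weight per (hidden neuron, input) pair
b = np.zeros(4)                  # the bias neuron's contribution

x = np.array([0.5, -1.0, 2.0])   # a single training instance with 3 features
hidden = np.tanh(W @ x + b)      # every hidden neuron sees every input
print(hidden.shape)              # (4,)
```

The matrix multiplication W @ x is what makes the layer “fully connected”: each row of W links one hidden neuron to all of the inputs.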

How it is trained

The primary goal of the MLP is the same as the perceptron’s: it tries to minimize the error. But the MLP uses a somewhat different training process called backpropagation.

1. Select a number of training instances a network will process each time

2. Pass in training instances into the input layer → hidden layer → output layer

3. Compute an output error based on the output from the output layer

4. Go through the network in reverse order to measure how much each connection contributed to the output error

5. Update the weights to reduce the output error

6. Repeat steps 2 to 5 until it covers every training instance

7. Repeat step 6 for m epochs (the sketch below walks through this whole loop)
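
To make these steps concrete, here is a minimal sketch of the whole loop on the classic XOR problem. The toy dataset, the network size, the squared-error cost, and the learning rate are all my own choices for illustration; nothing in the article prescribes them:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output weights
lr = 0.5                                       # learning rate

for epoch in range(2000):                # step 7: repeat for m epochs
    # steps 1-2: forward pass (here, all 4 instances at once)
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))

    # step 3: output error (from a squared-error cost)
    err = out - y

    # step 4: reverse pass - chain rule through each layer
    d_out = err * out * (1 - out)        # through the logistic output
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # through the tanh hidden layer

    # step 5: update the weights to reduce the error
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # should end up close to [[0], [1], [1], [0]]
```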

So to sum up simply: training instances go through the network forward and backward iteratively, and the error is reduced by adjusting the weight values with gradient descent. Before explaining more about how the weights change during this process, let’s get the terms straight first.

Epoch: one loop over the entire data set
Forward pass: Going through the network from the input to the output layer
Reverse pass: Going through the network from the output to the input layer
Gradient descent: A topic for the next paragraph

Changing little by little - Gradient descent

The term gradient means a vector of partial derivatives of a target function with respect to its input values [1]. In simple words, it is a vector of slopes. Gradient descent, then, means repeatedly stepping down those slopes until we reach a minimum point. Perhaps a visual representation will help you understand the process.

A randomly initialized point on a graph / cost from the graph measures how well a neural network predicts by calculating the error between predicted and actual outcomes

Let’s say we start at the randomly initialized red point. At this state, its cost value is pretty high. To reduce the cost, we move the point in the direction opposite to its slope, so that it travels downhill along the curve.

One step toward a minimum point

After moving one step, we can see that the slope has flattened and the cost value has decreased. However, there is still room for improvement.

Moving until the point reaches a minimum point

When we move the point little by little, the point’s slope eventually becomes 0 and the point reaches a minimum point. This minimum point is where we obtain the lowest cost value; thus, it’s where we obtain appropriate weight values. But how much does a point change in each iteration?
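
Before answering that, here is a tiny numerical version of the walk we just made. The cost curve cost(w) = (w - 3)^2 and every number in it are made up purely for illustration:

```python
def cost_slope(w):
    return 2 * (w - 3)     # derivative of the made-up cost (w - 3)**2

w = -4.0                   # the randomly initialized "red point"
learning_rate = 0.1        # how much to change in each iteration
for step in range(50):
    w -= learning_rate * cost_slope(w)   # step against the slope

print(round(w, 4))         # ~3.0, the minimum point, where the slope is ~0
```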

Choosing how much to change in each iteration

This is defined by the learning rate, an important parameter of gradient descent. Let’s say we have a large learning rate. Then the steps of gradient descent would look like the figure below.

How a point moves with a large learning rate

As you can see, the point moves left and right inconsistently instead of steadily moving toward the minimum point. Thus, it could diverge away from the minimum point. On the other hand, when we use a small learning rate,

How a point moves with a small learning rate

the point consistently stays on one side and steadily descends its slope until it reaches a minimum point. The catch is that this can take far longer. Therefore, it’s crucial to find an appropriate learning rate, for instance through cross-validation. Besides the learning rate, another parameter you should choose is how many training instances to process in each iteration.
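
Before turning to that choice, here is the learning-rate effect in code, rerunning the made-up cost curve from the previous sketch with three different rates:

```python
def descend(learning_rate, steps=20, w=-4.0):
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)   # slope of the made-up (w - 3)**2
    return w

print(descend(1.1))    # large rate: each step overshoots, so w diverges
print(descend(0.01))   # small rate: steady, but still far from 3 after 20 steps
print(descend(0.1))    # moderate rate: close to the minimum point
```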

Choosing the type of the gradient descent

Generally speaking, you have three choices. First, there is the batch gradient descent, which processes the entire data set each time. So one epoch equals one iteration, and the weights are updated once per epoch. One advantage is that it can settle exactly at a minimum point. However, since it processes the entire data set every time, it becomes computationally expensive when a large data set is passed into the network.

Visual representation of how many instances are passed in the batch gradient descent

The second choice is the mini-batch gradient descent, which processes a subset of the data set each time. Therefore, in one epoch, there are

(number of training instances) / (mini-batch size)

iterations. So for instance, when we have 1000 training instances and a mini-batch of size 100, there will be 1000 / 100 = 10 iterations in one epoch. Thus, the weights are updated 10 times per epoch. Since this approach processes only a subset, unlike the batch gradient descent, it doesn’t suffer from the expensive computation. However, it doesn’t settle down to a minimum point the way the batch gradient descent does, because each subset of the data set varies depending on which instances are included.

Visual representation of how many instances are passed in the mini-batch gradient descent

Lastly, there is the stochastic gradient descent, which processes one training instance per iteration. Thus, the weights are updated n times in one epoch when there are n training instances in total. One advantage is that it is much faster than the other two. However, because a single training instance varies even more than a subset does, it’s much more unstable than the others. So one might wonder why we should use the stochastic gradient descent if it’s that unstable. The answer is related to local and global minima.

Visual representation of how many instances are passed in the stochastic gradient descent
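
The bookkeeping behind the three variants fits in a few lines. The dataset size and mini-batch size below are hypothetical, and only the iteration structure is shown, not the actual weight updates:

```python
import numpy as np

n, batch_size = 1000, 100                       # hypothetical dataset
indices = np.random.default_rng(0).permutation(n)

# updates per epoch: batch GD -> 1, mini-batch -> n / batch_size, SGD -> n
print(1, n // batch_size, n)                    # 1 10 1000

# one epoch of mini-batch gradient descent over shuffled instances
for start in range(0, n, batch_size):
    batch = indices[start:start + batch_size]   # instances for this update
    # ... forward pass, reverse pass, and weight update would happen here ...
```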

Whether it can escape a local minimum or not

In a real-world data set, you cannot guarantee that a minimum point you find is the global minimum. Thus, if you use the batch gradient descent, which decreases steadily, you could end up in a local minimum depending on where you start.

Demonstration that a network could end up in a local minimum depending on where it starts

But because the stochastic gradient descent isn’t stable, it can escape a local minimum and find the global minimum. So there are pros and cons to each method. A value obtained by the batch gradient descent is optimal for the slope it follows but may correspond to a local minimum. In contrast, the stochastic gradient descent has a better chance of finding the global minimum, but its value tends to vary a lot. So when we use the stochastic gradient descent, we generally start with a relatively high learning rate to find the approximate location of the global minimum and then use a relatively small learning rate to pin down a more exact location.
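
One common way to get “large steps first, small steps later” is a learning schedule that shrinks the learning rate over time. The decay formula and numbers below are just one illustrative choice, not something the article specifies:

```python
initial_rate, decay = 0.5, 0.01

for iteration in range(5):
    learning_rate = initial_rate / (1 + decay * iteration)
    print(round(learning_rate, 4))   # 0.5, 0.495, 0.4902, 0.4854, 0.4808
```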

Inevitable change for the gradient descent to work - Activation function

A problem with applying gradient descent to the perceptron is that the step function gives it nothing to descend: the function is flat everywhere, so its slope is zero (and undefined at the jump). Therefore, the MLP replaces the step function with an activation function whose slope actually varies.

S-shape - logistic & hyperbolic tangent (tanh)

Both of them have a similar S shape. The differences are their equations and output value ranges.

Logistic and tanh functions and graphs

As the plot shows, the logistic function ranges from 0 to 1, whereas tanh ranges from -1 to 1. It seems like a small difference, but it allows tanh to have larger derivatives. Thus, tanh tends to minimize a cost function faster than the logistic function does. However, a problem arises when an input value becomes very small or very large.
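
You can check the derivative sizes directly. The functions below are the standard logistic and tanh definitions; the probe point is my own:

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

def d_logistic(x):               # derivative of the logistic function
    return logistic(x) * (1 - logistic(x))

def d_tanh(x):                   # derivative of tanh
    return 1 - math.tanh(x) ** 2

print(d_logistic(0), d_tanh(0))  # 0.25 vs 1.0: tanh's slopes are larger
```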

For instance, when an input is 3, the tanh function’s derivative is close to zero. And because the chain rule multiplies these derivatives together during the reverse pass, the gradient shrinks further at every layer it passes through, so it is even smaller in the lower layers. There, the network might no longer know in which direction to improve its weights. This is called the vanishing gradient problem. To solve this issue, researchers came up with a new activation function.
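
A quick sketch shows how fast that multiplication shrinks the gradient. Assuming, artificially, that ten layers all see an input of 3, and using tanh’s real derivative there:

```python
import math

d = 1 - math.tanh(3) ** 2    # tanh's derivative at 3: ~0.0099
gradient = 1.0
for layer in range(10):      # the chain rule multiplies one factor per layer
    gradient *= d

print(gradient)              # ~1e-20: effectively zero in the lower layers
```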

Linear but better than the previous two - ReLU

Compared to the two functions above, ReLU outputs 0 for input values below 0 and is linear for input values above 0. So unlike the logistic and tanh functions, ReLU’s slope is always exactly 1 or 0, and for active neurons it never shrinks toward zero. These two properties help a neural network deal with the vanishing gradient problem.

ReLU graph

On top of that, because its equation is much simpler than the other two, it is generally faster to compute. Due to these benefits, it has become the most commonly used activation function.
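
For comparison with the tanh sketch above, here is ReLU and its slope; the probe points are again my own:

```python
def relu(x):
    return max(0.0, x)

def d_relu(x):                 # the slope is always exactly 1 or 0
    return 1.0 if x > 0 else 0.0

print(d_relu(3), d_relu(100))  # 1.0 1.0: no shrinking for active neurons
print(d_relu(-2))              # 0.0
```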

How do multiple layers solve a complex problem?

So now you know how the MLP is structured, how it is trained, and how it differs from the perceptron. But one question remains: why does adding hidden layers solve complex problems? The answer is related to the activation functions you’ve just read about.

Since every neuron in a hidden layer applies an activation function, inputs are transformed by a series of functions as they pass through the network. The exact transformation depends on the problem you are working on and the layer you are looking at, so I will explain it in a general way.

Let’s say we have an input called x, the first hidden layer’s activation function called f, and the second one called g. When x is passed into the first hidden layer, its output is f(x). This f(x) is then passed into the second hidden layer. Thus, we get g(f(x)) as the final output.

How x transforms as it goes through the layers
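
As a toy example, here are two hidden layers written as two functions. The particular weights (2, 1, -3, 0.5) and the choice of tanh are stand-ins of my own; a real layer applies many such functions in parallel:

```python
import math

def f(x):                      # first hidden layer's transformation
    return math.tanh(2 * x + 1)

def g(x):                      # second hidden layer's transformation
    return math.tanh(-3 * x + 0.5)

x = 0.7
print(g(f(x)))                 # the network's output is the composition
```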

So as you can see, the hidden layers let us compute a composition of multiple functions instead of just f(x). The benefit is that a composition of functions can compute complex things that no single function can do on its own. When this concept is applied to a circular boundary,

Tensorflow playground demonstration that shows how a neural network obtains a circular boundary / Tensorflow playground

you can see that the identified patterns become more complex as they progress toward the output layer: from linear to circular.

Another reason for nonlinear activation functions

When you play with different options from the TensorFlow Playground, you will notice something is different with a linear activation function.

Tensorflow playground demonstration of a limitation of linear activation functions

If a linear function is used instead of the tanh function, the resulting boundary in the output layer stays linear no matter how many hidden layers you add. Why? Because a composition of linear functions is still a linear function. For instance, when f(x) = 5x+4 and g(x) = 7x+1, f(g(x)) = 5(7x+1)+4 = 35x+9, which is still a linear equation.
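
You can verify the collapse numerically with the exact functions from the example:

```python
def f(x):
    return 5 * x + 4

def g(x):
    return 7 * x + 1

def combined(x):                    # the single linear function 35x + 9
    return 35 * x + 9

for x in (-2, 0, 3.5):
    assert f(g(x)) == combined(x)   # two linear layers act as one
print("f(g(x)) is always 35x + 9")
```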

The only way to change this is to introduce nonlinearity in the hidden layers, so that the composition of activation functions becomes nonlinear and can capture more complex relationships. This is why we use the functions above instead of the step function.

Reference

[1] Brownlee, J. (2020, December 16). What is a gradient in machine learning? Machine Learning Mastery. Retrieved October 14, 2021, from https://machinelearningmastery.com/gradient-in-machine-learning/.

[2] Brownlee, J. (2021, January 31). Difference between backpropagation and stochastic gradient descent. Machine Learning Mastery. Retrieved October 14, 2021, from https://machinelearningmastery.com/difference-between-backpropagation-and-stochastic-gradient-descent/.

[3] ekoulier. (2018, February 26). Why is tanh almost always better than sigmoid as an activation function? Cross Validated. Retrieved October 14, 2021, from https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function.

[4] Faroz, S. (2021, August 27). Chain rule for backpropagation. Medium. Retrieved October 14, 2021, from https://salmanfaroz.medium.com/chain-rule-for-backpropagation-55c334a743a2.

[5] Géron, A. (2020). Hands-on machine learning with Scikit-Learn, Keras, and Tensorflow: Concepts, tools, and techniques to build intelligent systems. O’Reilly.

[6] Harris, D. (2013, July 2). What does the hidden layer in a neural network compute? Cross Validated. Retrieved October 14, 2021, from https://stats.stackexchange.com/questions/63152/what-does-the-hidden-layer-in-a-neural-network-compute.
