Multi-layer perceptrons as non-linear classifiers — 03

Vishal Jain
Published in Analytics Vidhya · Apr 5, 2021

Recap

In the last post we discussed how a perceptron could ‘learn’ the best set of weights and biases to classify some data. We discussed what ‘best’ really means in the context of minimising some error function. Finally, we motivated the need to move to a continuous model of the perceptron. Now, we will motivate the need to connect several perceptrons together in order to build more complex, non-linear models.

Motivation

So far, all the models we’ve built using the perceptron have been linear. In reality, a data set can rarely be separated by a simple straight line. For example, our toy dataset of plants that grow, given a certain amount of water and time in the sun, would more realistically look like this:

So how do we build a model capable of separating the data now?

Neural networks as non linear classifiers

Enter the multi-layer perceptron, or the ‘vanilla’ neural network. The idea is to combine several linear models in order to create a non-linear one.

To see how we can do this, consider the following linear models:

What do you think will happen if we add those two models together? Remember that what we are really adding are two probability surfaces. The model on the left is a probability surface which starts at a value of 0 on the far left and tends towards 1 as you move right. The model on the right is a different probability surface, where the value tends towards 1 as you move up. What shape will the set of points that give y_hat = 0.5 take? In the source models, those sets of points take the shape of simple vertical and horizontal lines.

Here, we see that we get the shape of a curve! This example was just to illustrate the point that adding two models, each with a linear decision boundary, can lead to a model with a non-linear boundary.

Note that in this case, the summed model will predict values of up to ~2 in the top-right region, which are not valid probabilities. So we need to re-apply a sigmoid to the summed model to normalise its output back to the range between 0 and 1.
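To make this concrete, here is a minimal NumPy sketch of the summation. The weights and biases of the two linear models are made up purely for illustration; the point is simply that the summed model’s 0.5 contour is a curve, and that another sigmoid squashes the sum back into a valid probability range.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Two linear models with made-up weights and biases (purely illustrative):
# model_a's decision boundary is a vertical line, model_b's is a horizontal one.
def model_a(x1, x2):
    return sigmoid(5 * x1 - 2.5)   # probability grows as you move right

def model_b(x1, x2):
    return sigmoid(5 * x2 - 2.5)   # probability grows as you move up

xs = np.linspace(0, 1, 201)
x1, x2 = np.meshgrid(xs, xs)

raw_sum = model_a(x1, x2) + model_b(x1, x2)   # ranges from ~0 up to ~2
combined = sigmoid(raw_sum)                    # squashed back into (0, 1)

# The points where the summed model sits at 0.5 no longer form a straight
# line -- they trace a curve, which is the whole point of combining models.
boundary = np.isclose(raw_sum, 0.5, atol=0.01)
print(boundary.sum(), "grid points lie near the curved 0.5 contour")
```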

Representing the summation of linear models as a neural network

So how can we graphically represent the above summation? Well, we know that each of the linear models used can be represented by a perceptron, since all a perceptron does is run a weighted summation of its input features. Therefore, we can take the outputs of our first two perceptrons (the first two models shown above) and connect them as inputs to another perceptron, which will sum those models together!

So here we pass our original input features into two separate perceptrons, each with its own set of weights and biases. We then take the outputs of those linear models, sum them together in the second layer, and get our final output. This final model is non-linear. This example should give a better intuition for what the architecture of a multi-layer perceptron is actually doing: finding a highly non-linear function capable of accurately classifying the data points in our data set.

Let’s break down the above diagram. First, we take in a set of input features x_i at our ‘input layer’. Then we pass those input features to perceptrons in our ‘hidden layers’, which are simply all the layers of perceptrons between the input and output layers. The first hidden layer takes the input features and builds linear models. These linear models are then passed as input to the perceptrons in the next hidden layer, which sum them together and each output a non-linear model. The more hidden layers we have, the more complex the non-linear models we can find. These models are combined at the output layer to give a final model which should be capable of classifying our input data points.
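As a rough sketch of this wiring, here is the forward pass of a tiny network with 2 input features, one hidden layer of 4 perceptrons and a single output. The layer sizes and random weights are my own illustrative choices, not anything taken from the diagram; in practice the weights would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # weights: input layer -> hidden layer
b1 = np.zeros(4)               # hidden layer biases
W2 = rng.normal(size=(4, 1))   # weights: hidden layer -> output layer
b2 = np.zeros(1)               # output bias

def forward(X):
    # Each hidden perceptron builds its own linear model of the inputs,
    # then squashes it with a sigmoid.
    hidden = sigmoid(X @ W1 + b1)
    # The output perceptron sums a weighted combination of those models
    # and squashes again, giving the final non-linear model.
    return sigmoid(hidden @ W2 + b2)

X = np.array([[0.2, 0.7],      # e.g. (water, sunlight) for two plants
              [0.9, 0.1]])
y_hat = forward(X)             # predicted probability that each plant grows
print(y_hat)
```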

Training and optimisation

So how do we actually train these models? Well, it’s largely similar to how we trained our perceptron. Given a clean data set (we will talk later about what makes for a good data set and how to make sure training happens as smoothly as possible), we do the following:

1. Run forward prop on each data point, generating an initial prediction.

2. Calculate the loss for each data point, using a loss function like the binary cross-entropy discussed earlier.

3. Run gradient descent to calculate how each weight and bias in the network needs to be nudged to reduce the error for that specific data point.

4. Do the above three steps for all data points in the data set and average all the changes to the weights and biases.

5. Update the weights and biases using the averaged changes found by passing the data set through the model and running gradient descent.

The above is one epoch: one pass of the data set through the network for training. We repeat this over several epochs until the loss is acceptably low and we are getting accurate classifications.
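Putting the five steps together, a bare-bones full-batch training loop for the tiny network above might look like the sketch below. This is a minimal illustration, assuming a toy data set X of shape (N, 2) with binary labels y of shape (N, 1); the hidden size, learning rate and number of epochs are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, hidden=4, lr=1.0, epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(size=(X.shape[1], hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
    N = len(X)

    for epoch in range(epochs):
        # Steps 1-2: forward prop and binary cross-entropy loss over the whole data set.
        a1 = sigmoid(X @ W1 + b1)
        y_hat = sigmoid(a1 @ W2 + b2)
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

        # Steps 3-4: gradients of the loss w.r.t. every weight and bias,
        # already averaged over the data points (the 1/N factor).
        dz2 = (y_hat - y) / N
        dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
        dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

        # Step 5: nudge each parameter against its averaged gradient.
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

        if epoch % 100 == 0:
            print(f"epoch {epoch}: loss = {loss:.4f}")

    return W1, b1, W2, b2
```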

For an amazing visual explanation of the above training process, check out 3Blue1Brown’s video: https://www.youtube.com/watch?v=Ilg3gGewQ5U . If the previous posts all make sense, you should have the prerequisite knowledge to follow it :) .

Summary

In this post we ran through how connecting several perceptrons together allows us to find non-linear decision boundaries, and gave an overview of the MLP (multi-layer perceptron) neural network architecture. In the next post we will try to answer the question of what decides how well training goes: what qualities should we look for in our data set, in our results and in our network, and, more importantly, what additional tweaks can we make to improve training?

Questions

If we were using a mean squared error loss function instead of cross-entropy loss (i.e. trying to find a line of best fit rather than a decision boundary), can you convince yourself that summing several perceptrons will still lead to a linear model? (If it doesn’t, we are in trouble, since it would mean we can’t solve regression tasks with this approach.)

What change do we need to make to our MLP architecture in order to do multi-class prediction, i.e. predict that our data point lies in one of several classes (dog, cat, rabbit, etc.)?

Leave your answers in the comments :)
