The Linear and Nonlinear Nature of Feedforward

Part 2/4 of the Deep Learning Explained Visually series.

Tyron Jung
The Feynman Journal
Aug 23, 2021


The full animation for the feedforward process which we’ll build up to in this article.

Welcome to part 2 of the Deep Learning Explained Visually series! In this article, we’ll look at how the Multilayer Perceptron (MLP) combines linear algebra and nonlinear activations to make predictions based on the input data. While the MLP is much more complicated than a single Perceptron, the intuitions we built up in part 1 will continue to give us leverage in understanding deep neural networks.

Multilayer Perceptrons

Similar to a single Perceptron, the MLP maps multiple inputs to a single output (it is possible for a neural network to have multiple outputs but we’ll keep it simple for now). The main difference is everything that happens between the inputs and the output.

With the MLP, we now have intermediate representations of the input data. The MLP takes the original feature vector from the input layer, extracts patterns from the data, and uses the extracted patterns as the input for the next layer. It repeats this process until it reaches the output layer, at which point our multidimensional feature vector is boiled down to a single numerical value. This is the MLP’s feedforward process.

So what are the “atoms” of the MLP? Just as a feature vector is made up of individual features, each layer in a deep neural network is made up of individual neurons, also known as units. The in-between layers are called hidden layers, and the neurons in these layers are appropriately called hidden neurons.

At each layer of the MLP, each input neuron is connected to all of the output neurons. This results in dense connections between the input neurons and the output neurons, which in turn requires the MLP to learn many, many more weights than the Perceptron.

But since the sheer number of connections in the above diagram is rather intimidating, we’ll work with a simpler variation:

The key thing to note here is that while one network architecture is considerably simpler than the other, they are both governed by the same principles. It’s true that bigger architectures tend to be much more complex and harder to design — but grokking the mechanisms behind the simpler architecture will directly translate to a better understanding of bigger architectures.

What’s worth noting is that even the simpler architecture has a lot going on. Fortunately for us, there’s a familiar pattern in the subnetworks of the overall neural network:

The individual subnetworks of an MLP.

Each one of these subnetworks can be thought of as a Perceptron! Well, they don’t function exactly like a Perceptron, but they are undeniably similar. They both map multiple inputs to a single output using a dot product, and we’ll soon see that there are nice parallels between the learning procedures as well.

Based on this interpretation, our first insights into the MLP are as follows:

  • Though it looks complex as a whole, it’s essentially combining a bunch of simple subnetworks that resemble a Perceptron.
  • Each layer consists of one or more subnetworks, where the number of subnetworks equals the number of outputs at that layer.
  • Each subnetwork computes the dot product between the input vector and its weight vector, similar to a Perceptron.

So we can see that each layer of the MLP is computing a bunch of dot products. And as it turns out, the way to concisely express a bunch of dot products is through a matrix multiplication.

Matrix Multiplication

For many people, performing a matrix multiplication is about following a sequence of rules: Add up the element-wise multiplications between the first row of the first matrix and the first column of the second matrix, and so on.

Computing the first element of the output in a matrix multiplication.

But upon closer inspection, we see that the computation of each element in the output matrix is simply a dot product. In other words, a matrix multiplication is just a bunch of dot products:

Each element in the output of a matrix multiplication is the dot product between a row from the first matrix and a column from the second matrix.

In the above animation, the vectors w1, w2, and w3 are vertically stacked into a matrix — we’ll call this matrix W. Similarly, the vectors v1 and v2 are horizontally stacked into a separate matrix — we’ll call this matrix V. When we perform the matrix multiplication W · V, we’re computing the dot product between each row in W and each column in V.

Another way to say this is that the transformation matrix W is being applied to each of the vectors v1 and v2. When we perform W · v1, the result is v1'. Similarly, W · v2 = v2'. We can express the overall process more concisely as W · V = V'. Each 4D vector is being projected into 3D space, as indicated by the dimensions of W (3 x 4).
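To make this concrete, here’s a minimal NumPy sketch with made-up numbers: the weight vectors w1, w2, and w3 become the rows of W, the vectors v1 and v2 become the columns of V, and every element of W · V is one row-column dot product.

```python
import numpy as np

# Weight vectors (rows of W) and input vectors (columns of V); the values are illustrative.
w1 = np.array([ 0.2, -0.3,  0.5,  1.0])
w2 = np.array([-0.7,  0.1,  0.0,  0.4])
w3 = np.array([ 0.9,  0.9, -0.2, -0.5])
v1 = np.array([ 1.0,  2.0, -1.0,  0.5])
v2 = np.array([-0.5,  0.0,  3.0,  1.0])

W = np.vstack([w1, w2, w3])      # shape (3, 4): weight vectors stacked vertically
V = np.column_stack([v1, v2])    # shape (4, 2): input vectors stacked horizontally

V_prime = W @ V                  # shape (3, 2): each 4D column is projected into 3D

# Every element of the result is a dot product between a row of W and a column of V...
assert np.isclose(V_prime[0, 0], np.dot(w1, v1))

# ...and each column of V' is simply W applied to the corresponding input vector.
assert np.allclose(V_prime[:, 0], W @ v1)
assert np.allclose(V_prime[:, 1], W @ v2)
```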

Now, remember that each layer in the MLP is made up of simpler subnetworks. Each subnetwork computes the dot product between its weight vector and the input vector at that layer.

When you stack the weight vectors into a weight matrix and multiply it by the input vector, the output is a new vector that will feed into the next layer as input. Simply put, going from one layer to the next involves a single matrix multiplication.

And rather than multiplying the weight matrix by a single input vector, it generally makes more sense to stack the input vectors into an input matrix to get the most out of matrix multiplication. This way, the same transformation can be applied to multiple input vectors simultaneously for greater computational efficiency.
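Here’s a small sketch of what that looks like for a single layer, with illustrative sizes: a layer of 3 neurons, each holding 4 weights, applied to a batch of 5 input vectors at once.

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)    # one layer's weight matrix: 3 neurons, each with 4 weights
X = np.random.randn(4, 5)    # a batch of 5 input vectors stacked as columns

H = W @ X                    # shape (3, 5): every neuron's dot product with every input
print(H.shape)               # (3, 5) -- column i is the layer's output for input vector i
```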

So in the context of the MLP, matrix multiplication allows us to conveniently compute a bunch of dot products in one go. But the MLP requires another key ingredient to be successful.

Nonlinear Activation

Linear algebra is a powerful tool — but at the end of the day, it can only perform linear operations.

As it turns out, it doesn’t matter how many layers you add to your MLP if all you’re doing is applying matrix multiplications. A series of matrix multiplications mathematically boils down to a single matrix multiplication, since W2 · (W1 · x) = (W2 · W1) · x:

These two neural networks are equivalent because they only perform matrix multiplications.

In other words, the MLP collapses into a single Perceptron. A corollary is that we can only ever draw straight-line decision boundaries, but most real-world phenomena are simply not linear.
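We can check this collapse numerically, with random matrices standing in for two layers’ weights:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4)        # a 4D input vector
W1 = np.random.randn(3, 4)    # first "layer"
W2 = np.random.randn(2, 3)    # second "layer"

two_layers = W2 @ (W1 @ x)    # apply the layers one after another
one_layer = (W2 @ W1) @ x     # pre-multiply the weights into a single matrix

# Without nonlinearities, the two-layer network is exactly one matrix multiplication.
assert np.allclose(two_layers, one_layer)
```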

It’s sort of like trying to build a sandcastle without any water. If you keep pouring sand onto a heap of sand, you’re still going to end up with a heap of sand. In order to build interesting structures with sand, you need water to hold the sand together.

So how might we figuratively “add water” to an MLP that’s only using linear algebra? Well, in order to solve nonlinear problems, we need to introduce elements of nonlinearity to our neural network.

That’s where nonlinear activation comes in:

The sigmoid function.

The above function is called the sigmoid function and it’s denoted by the symbol σ. Its formula is σ(z) = 1 / (1 + exp(-z)), but what matters most is its distinct shape.

As you can see, the sigmoid function is highly nonlinear (i.e. not a straight line) and it has some interesting properties that will be useful to us:

  • It maps all inputs to a positive value since all outputs are above the horizontal axis.
  • The lower bound for the output is 0, while the upper bound is 1. That is to say, it squishes the input into the range (0, 1).
  • The slope flattens out at extreme values.
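A minimal implementation makes these properties easy to verify:

```python
import numpy as np

def sigmoid(z):
    """Squishes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))   # ~[0.00005, 0.269, 0.5, 0.731, 0.99995]
# Every output is strictly between 0 and 1, and the curve flattens out
# (the outputs barely change) once the inputs become very large or very small.
```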

And we get much more meaningful results when we combine sigmoid functions instead of combining straight lines:

Combining straight lines through subtraction results in another straight line. Combining sigmoid functions results in an interesting shape.
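To see this concretely, here’s a tiny sketch (the shift of ±3 is arbitrary): subtracting two straight lines just gives another straight line, while subtracting two shifted sigmoids gives a bump.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 9)

# Subtracting one straight line from another is still a straight line in z.
line_combo = (2 * z + 1) - (0.5 * z - 3)

# Subtracting two shifted sigmoids gives a bump centered between the shifts.
bump = sigmoid(z + 3) - sigmoid(z - 3)
print(np.round(bump, 3))   # small at both ends, largest (~0.9) near z = 0
```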

So we’ll go ahead and apply the sigmoid function to all of the intermediate matrix multiplication results, as well as the final output of the MLP:

An MLP using the sigmoid function for nonlinear activation.

Now, whenever a neuron receives the dot product between the input vector and its corresponding weight vector, it passes the result of the dot product into the sigmoid function. This process is called nonlinear activation.

Nonlinear activations provide several important benefits:

  • They allow us to model nonlinearities in our data, which a Perceptron is incapable of doing.
  • Placed between matrix multiplications, they prevent a stack of them from collapsing into a single matrix multiplication.
  • When the sigmoid function is used on the final output, the output value is squished into the range (0, 1). This allows us to perform binary classification, since values above 0.5 can be classified as positive, and values below 0.5 can be classified as negative.

Combining a dot product with nonlinear activation as shown above is precisely what it means to detect a feature. Remember that a dot product essentially measures the alignment between two vectors. For each neuron at a particular layer, feature detection involves measuring the alignment between the input vector and the neuron’s corresponding weight vector, with the nonlinear activation determining the magnitude of the feature detection signal that will be used as an input for the next layer.

Viewed from this angle, activation functions start to make a whole lot of sense. The alignment between the input vector and a hidden neuron’s weight vector is fed into the activation function, which ultimately determines the nonlinear response:

Left: the dot product response of the sigmoid function. Right: the dot product response of ReLU, which is another type of nonlinear activation.

For some activation functions, such as the sigmoid, the dot-product input has a diminishing effect on the neuron’s output as its magnitude grows. For others, such as ReLU, positive inputs pass through unchanged while negative inputs are clipped to zero.
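The contrast is easy to see side by side. In this small sketch, the dot-product values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

dots = np.array([-4.0, -1.0, 0.0, 1.0, 4.0, 40.0])   # hypothetical dot products

print(np.round(sigmoid(dots), 3))  # [0.018 0.269 0.5 0.731 0.982 1.0] -- saturates at large magnitudes
print(relu(dots))                  # [ 0.  0.  0.  1.  4. 40.] -- negatives zeroed, positives pass through
```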

It’s important to keep these properties in mind as you decide which nonlinear activations to use for your neural network. Different activation functions lend themselves better to different problems. For instance, ReLU is a very popular choice for computer vision tasks.

But in this article, we’ll continue using the sigmoid function.

Feedforward

At this point, we have done all the groundwork needed to understand the MLP’s feedforward process. Similar to what we covered in part 1, feedforward is about making a prediction about the input data.

In its most basic form, feedforward is simply a series of matrix multiplications, each one followed by nonlinear activation. This is the process by which the original input (feature vector) is evolved into the final output (prediction). We’ll look at a concrete example where the goal of our MLP is to take a 4D feature vector and boil it down to a prediction that tells us whether the input belongs to the positive class or the negative class.
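Before walking through that example step by step, here is the entire feedforward process as a short sketch. The weights are randomly initialized, the layer sizes are illustrative, and bias terms are omitted to match this article’s presentation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights):
    """Pass the feature vector x through each layer: matrix multiply, then activate."""
    h = x
    for W in weights:
        h = sigmoid(W @ h)
    return h

np.random.seed(42)
layer_sizes = [4, 3, 3, 1]                      # 4D input -> two hidden layers -> 1 output
weights = [np.random.randn(n_out, n_in)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

x = np.random.randn(4)                          # a hypothetical 4D feature vector
y = feedforward(x, weights)
print(y)                                        # a single value in (0, 1)
```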

Feedforward begins with the feature vector, which is the input for the first layer.

(And as a side note, all diagrams from here on out will highlight positive numbers in blue and negative numbers in red.)

Feedforward begins with the feature vector. This might represent a student (as in part 1), with features such as the student’s GPA and the number of internships completed. Then after the feature vector passes through the MLP, the final output would predict whether the student will successfully land a full-time job upon graduation (positive) or not (negative).

The weight vectors at the first layer begin extracting patterns from the feature vector. Each output neuron achieves this by performing a dot product between the feature vector and its corresponding weight vector:

The first matrix multiplication in the feedforward process, broken down into individual dot products.

The individual weight vectors are vertically stacked to form the weight matrix W1, which is multiplied by the feature vector X to produce the first linear transformation result, Z1.

What’s worth noting is that each output neuron is detecting a unique pattern from the same input vector. In the context of the student example, one neuron might be responsible for detecting when a student has a high GPA but only a few internships under their belt. Another neuron might be responsible for detecting when a student has poor academic performance but a lot of internships to make up for it.

Moving along, we take the result of the first matrix multiplication and pass it into the nonlinear activation:

The first nonlinear activation in the feedforward process.

As denoted by the blueness of the output vector H1, all of the values that we passed into the nonlinear activation have come out positive. This is because we’re using the sigmoid function (σ) as the nonlinear activation, which - as we've seen - can only produce positive outputs. H1 contains the features detected by the first set of hidden neurons and will serve as the input vector for the next layer.

The second layer processes its input in the same way as the first layer, starting off with a matrix multiplication. This time, we’re using the weight matrix W2, which sits between H1 and the second layer of hidden neurons.

The second matrix multiplication in the feedforward process, but this time the dot products are computed all at once.

When we multiply W2 by H1 (the output vector of the first layer), we obtain Z2. Then we once again pass the result of the matrix multiplication into the nonlinear activation:

The second nonlinear activation in the feedforward process.

The role of the hidden neurons in the second layer is to extract further patterns from the features detected by the hidden neurons in the first layer. While the features detected in the first layer might be fairly easy to understand (e.g. high GPA, few internships), it’s generally harder to describe the deeper, more sophisticated features that are extracted by the second layer. All we can say for sure is that these features will ultimately help the MLP make an informed prediction.

Finally, let’s move onto the last layer:

The final matrix multiplication in the feedforward process, which is just a dot product.

Here, we perform the final matrix multiplication (which happens to be a single dot product since there’s only one output neuron) to obtain Z3. Through this final linear transformation, we turn what was originally a 4D vector of features into a single scalar value!

And since the goal of our MLP is binary classification, we would like the final output to fall within the range (0, 1). To do this, we simply apply the sigmoid activation again:

The final nonlinear activation in the feedforward process, which gives us the MLP’s prediction.

We have arrived at the final output, σ(Z3) = Y! Through the feedforward process, we have evolved the initial feature vector containing four different numbers (features) into a single number that tells us whether the input data should belong to the positive class or the negative class. And since Y = 0.55 ≥ 0.5, our MLP predicts that the input data should belong to the positive class.
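To recap the whole walkthrough in code, here is the same sequence of steps using the variable names from the diagrams. The weights and the feature vector are random placeholders and the hidden-layer sizes are assumed, so the final number won’t be exactly 0.55, but the structure of the computation is identical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(1)
X  = np.random.randn(4)        # the 4D feature vector (placeholder values)
W1 = np.random.randn(3, 4)     # first layer weights (hidden size of 3 is assumed)
W2 = np.random.randn(3, 3)     # second layer weights
W3 = np.random.randn(1, 3)     # final layer weights (a single output neuron)

Z1 = W1 @ X        # first matrix multiplication
H1 = sigmoid(Z1)   # first nonlinear activation
Z2 = W2 @ H1       # second matrix multiplication
H2 = sigmoid(Z2)   # second nonlinear activation
Z3 = W3 @ H2       # final "matrix multiplication" -- really a single dot product
Y  = sigmoid(Z3)   # final activation: the MLP's prediction, a value in (0, 1)

print("positive" if Y[0] >= 0.5 else "negative")
```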

Summary

In this part of the series, we learned that:

  • The MLP combines many simple subnetworks that resemble the Perceptron, and it has the same goal of boiling down the input data into a useful prediction.
  • Each subnetwork at each layer is performing a dot product between the input vector at that layer and its weight vector. The dot products in a given layer can be concisely represented as a single matrix multiplication.
  • In order to model nonlinearities in the input data, we place nonlinear activations between matrix multiplications.
  • As we go deeper into the layers of the MLP, it’s able to extract more and more sophisticated features from the input data.

To further illustrate the power of the MLP, in the next part of the series we’ll peek into the black box and see what’s really going on inside deep neural networks.
