The Building Blocks of Deep Learning

Part 1/4 of the Deep Learning Explained Visually series.

Tyron Jung
The Feynman Journal
13 min read · Aug 23, 2021

--

The learning process for a single Perceptron, which we’ll build up to in this article.

Because of their complexity, many people choose to treat deep neural networks as a black box. After all, they tend to have millions of knobs and dials — it makes sense to just use them as tools rather than try to understand how they work.

But if we peek into the black box, we might gain a deeper appreciation of how it was constructed, and maybe even use some of those insights to build other useful systems. And that’s my hope for this 4-part series on deep neural networks: to illuminate what’s inside the black box.

To do that, I want to focus on one specific neural network architecture — and there are many great architectures to choose from. Convolutional Neural Networks have long dominated the arena of computer vision, and Transformers are the tool of choice for natural language processing. However, this series will focus on the one that started it all: the Multilayer Perceptron (MLP). Also known as the Fully Connected Network, the MLP is the quintessential neural network.

What’s fascinating about the MLP is that even though it’s been around since the ’80s, it remains foundational to so many state-of-the-art machine learning techniques. Hence, grokking the MLP leads to a solid basis for understanding other, more complex architectures.

But first I would like to discuss the typical way in which neural networks are explained.

Over the past decade, many articles have been written in an attempt to explain neural networks. But I find that nearly all of them get lost in the weeds of mathematical equations, rather than providing the reader with a strong intuition of how they work.

With that in mind, I would like to explore the math behind deep learning a bit differently, by shifting the focus from written equations to visualizations. Here’s an example that we’ll build up to in this series:

A neural network learning how to separate data into two distinct categories.

If you’re new to machine learning, this series will build deep neural networks from the ground up, explaining all the necessary technical jargon along the way.

If you’re familiar with machine learning but not 100% comfortable with the theory, this is a good opportunity to reinforce your understanding of the math behind neural networks.

And if you’re a machine learning expert, perhaps this series can serve as a meditative survey of the basic concepts. Or perhaps you just like visualizations. :)

Perceptrons

The basic building block of the MLP is a single Perceptron. Similar to its multilayer counterpart, the Perceptron is a supervised learning algorithm. That is, whenever the Perceptron makes an incorrect prediction about the input data, we provide it with the correct answer so that it can attempt to correct its mistake and produce better predictions in the future.

And while the MLP is pretty complex, a single Perceptron is much easier to understand.

A Perceptron that maps a set of five inputs to a single output.

Let’s begin by taking a look at a concrete example for motivation. Suppose we have some data from the real world and the distribution looks like this:

A data distribution where each data point has two components, x1 and x2, and a target label t.

Every data point has two descriptive features, x1 and x2, as well as a target label t. For example, let’s say that each data point represents a student who recently graduated from college, where the two features are:

  • The number of internships completed by the student (x1).
  • The student’s GPA (x2).

Our goal is to figure out whether the student successfully landed a full-time job upon graduation (t). Hence, the target label can take on a value of 1 (successful) or 0 (unsuccessful). The above distribution — the training data — already contains these answers to allow the Perceptron to learn the mapping between the features and the target outputs.

More broadly, we would like to find a decision boundary between the successful students (blue) and the unsuccessful students (red), so that when we’re given a new student, we can accurately predict whether the student will land a full-time job based on the student’s two features.

(By the way, failing to land a full-time job immediately after college is not the end of the world!)

In order to solve this problem, we’re going to use the following Perceptron:

A Perceptron with two inputs, x1 and x2, and a single output y.

Our Perceptron takes in two inputs, x1 and x2, which are the two features of a given data point. Each input has a corresponding decision weight, whose magnitude represents the importance of its feature. For instance, if the number of internships (x1) matters more than the GPA (x2) in determining the student’s success, w1 would be greater in magnitude than w2.

The way the Perceptron makes a prediction based on the two inputs is straightforward. It simply multiplies each feature by its corresponding weight and adds up the results: z = w1 * x1 + w2 * x2.

If the sum is zero or positive, the output is 1, which means the Perceptron thinks that the data point should belong to the blue class (i.e. successful). And if the sum is negative, the output is 0, and the Perceptron thinks that the data point should belong to the red class (i.e. unsuccessful).

This is known as the feedforward process, whereby a neural network transforms the input data into a useful prediction. When a Perceptron makes an incorrect prediction, it uses its mistake as an opportunity to learn and improve future predictions.
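The feedforward step above can be sketched in a few lines of Python. The weight values here are hypothetical, chosen purely for illustration:

```python
def predict(x1, x2, w1, w2):
    """Feedforward step: weighted sum of the features, thresholded at zero."""
    z = w1 * x1 + w2 * x2
    return 1 if z >= 0 else 0

# A hypothetical student with 3 internships and a 3.5 GPA,
# using illustrative weights w1 = 0.4 and w2 = -0.2:
print(predict(3, 3.5, 0.4, -0.2))  # 1, since z = 0.4*3 - 0.2*3.5 ≈ 0.5 >= 0
```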

With that, let’s see what our Perceptron does when we train it based on the given data:

A Perceptron adjusting its decision boundary until it cleanly separates the blue points from the red points.

I don’t know about you, but I think that’s pretty neat. The Perceptron starts off with a bad solution, then incrementally improves its decision boundary by learning from the training data. By the time it’s done learning, it cleanly separates the blue points from the red points. It’s certainly not the best line you can draw to separate the data, but it does the job.

So how does the Perceptron learn?

Vectors and Dot Products

If you’re anything like me, you probably took linear algebra in school and have nightmares about inverting matrices, computing determinants, and finding eigenvectors. But I learned that linear algebra can be quite enjoyable once you understand how it’s applied in practice. And as it turns out, it’s fundamental to the operation of deep neural networks.

Now, you might already have a solid understanding of linear algebra. But I want us to be on the same page in terms of how we think about vectors and dot products. These two topics will serve as the backbone for the rest of this explainer, so it’s important that we have a common understanding.

Let’s first talk about what vectors mean in the context of machine learning. A simple way to think about a vector is as a collection of numbers. But for our purposes, it’s more useful to think of them as having a direction. Consider the following vector:

A vector u with two components, u1 and u2.

Going back to our student example, the two features that describe a student — the number of internships and their GPA — would be represented as a vector of two features, or a feature vector. The first feature might represent the vector’s horizontal component, and the second feature its vertical component. Since the two features measure different things, we can treat them as orthogonal axes, and the feature vector therefore lives in 2D space.

Now, we can rewrite x1 and x2 as a single feature vector, x = [x1, x2].

Vectors operate very similarly to regular numbers. We can add two vectors or subtract one from another. To add two vectors, we place the tail of one at the tip of the other; the sum points from the start of the first to the tip of the second. Vector subtraction works in the same way, except the vector being subtracted is flipped before being placed at the tip of the other.

An illustration of vector addition (left) and subtraction (right). In the left pane, the orange vector is being added to the blue vector. In the right pane, the orange vector is being subtracted from the blue vector.

Mathematically, u + v works out to [u1 + v1, u2 + v2] and u - v works out to [u1 - v1, u2 - v2]. In other words, the operation applies to the corresponding entries of the two vectors.
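Since the operations are entrywise, they’re easy to write down. A minimal sketch:

```python
def vec_add(u, v):
    """Add two vectors entry by entry."""
    return [ui + vi for ui, vi in zip(u, v)]

def vec_sub(u, v):
    """Subtract v from u entry by entry."""
    return [ui - vi for ui, vi in zip(u, v)]

print(vec_add([1, 2], [3, 4]))  # [4, 6]
print(vec_sub([1, 2], [3, 4]))  # [-2, -2]
```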

At this point, it’s worth clarifying the terminology:

  • The length of a vector is its magnitude, or how far it points in a certain direction.
  • The dimensionality of a vector is the number of components it has.

When we take two vectors of the same dimensionality, u and v, and add up the products of their corresponding entries (i.e. u1 * v1 + u2 * v2), we get a single scalar value. This is known as the dot product (u · v).

The dot product can be thought of as the length of the projection of u onto v, multiplied by the length of v. Equivalently, it’s the length of the projection of v onto u, multiplied by the length of u. Simply put, to take the dot product means to multiply two vectors, similar to how we would multiply two numbers.

For the purposes of understanding neural networks, the most important characteristic of the dot product is that it takes two vectors of the same dimensionality and measures their alignment. This is illustrated in the following diagrams:

The three distinct scenarios of the dot product: aligned (left), orthogonal (middle), and unaligned (right).

Here, we notice that:

  • When two vectors generally point in the same direction, their dot product is positive.
  • When two vectors are orthogonal — meaning they don’t have anything to do with each other — their dot product is zero.
  • When two vectors generally point in opposite directions, their dot product is negative.

So the magnitude of the dot product depends on two things: the lengths of the vectors and how much they align with each other. In general, long vectors cause the dot product to be large (really positive or really negative), while nearly orthogonal vectors cause the dot product to be small (close to zero).
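We can check the three scenarios directly with a few hand-picked vectors:

```python
def dot(u, v):
    """Sum of the products of corresponding entries."""
    return sum(ui * vi for ui, vi in zip(u, v))

print(dot([1, 0], [1, 1]))   # 1:  roughly aligned -> positive
print(dot([1, 0], [0, 1]))   # 0:  orthogonal -> zero
print(dot([1, 0], [-1, 1]))  # -1: pointing apart -> negative
```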

The cool thing about the way that dot products measure the alignment between two vectors is that it allows neural networks to detect patterns in the data.

Perceptron Learning

Returning to the earlier question about how the Perceptron learns, we can now formulate an answer using our understanding of vectors and dot products. First, we can recast the feedforward process as a dot product:

The same Perceptron as before, but with the prediction expressed as a dot product.

Using this abstraction, we can now rewrite z = w1 * x1 + w2 * x2 concisely as z = w · x, i.e. the dot product between the weight vector w and the feature vector x. Then, the Perceptron’s output is simply determined by whether w · x ≥ 0. If the dot product is zero or positive, the output is 1. If it’s negative, the output is 0.
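In code, the dot-product formulation makes the feedforward step work for any number of features. The weights below are the same hypothetical values as before, now written as a vector:

```python
def dot(u, v):
    """Sum of the products of corresponding entries."""
    return sum(ui * vi for ui, vi in zip(u, v))

def predict(w, x):
    """Output 1 when w · x >= 0, and 0 otherwise."""
    return 1 if dot(w, x) >= 0 else 0

w = [0.4, -0.2]  # illustrative weight vector
x = [3.0, 3.5]   # feature vector: 3 internships, 3.5 GPA
print(predict(w, x))  # 1
```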

When we first present our data to the Perceptron, it starts off rather confused:

The orange vector is the weight vector, where points in the blue region are classified as blue, and points in the red region are classified as red by the Perceptron.

Our Perceptron’s weight vector is randomly initialized, and it hasn’t learned anything yet. As shown by the number of misclassified points (blue points in the red region and red points in the blue region), it has a classification accuracy of about 50%.

One thing to note here is that the Perceptron’s decision boundary is orthogonal to its weight vector. Everywhere along the decision boundary, the dot product with the weight vector is equal to zero, signifying that the Perceptron is unsure. The closer a data point is to the decision boundary, the more uncertain the Perceptron will be in making a prediction.

The first step our Perceptron takes in the learning process is to pick a point at random:

Left: The feature vector of a randomly selected point (blue) and the weight vector (orange). Right: The same vectors in the context of other data points and the Perceptron’s decision boundary.

Here, we notice that the point we picked belongs to the blue class, but it’s being misclassified into the red class. This is because the feature vector and the weight vector are unaligned, making the dot product between them negative. So the output of the Perceptron is 0 (red).

But we would really like our Perceptron to output 1 (blue) for this point. Hence, we need to increase the dot product between the weight vector and the selected point’s feature vector. How might we do that?

Let’s do a little thought experiment. Imagine that you have two vectors that are orthogonal to each other in 2D space (i.e. their dot product is zero). Now take one of the vectors and repeatedly add it to the other vector. What happens? They become more and more aligned, and their dot product increases.

As it turns out, that’s the most direct way to increase the dot product between two vectors: w · x grows fastest when w moves in the direction of x. But in our case, we can’t choose which vector to add to the other. We must add the feature vector to the weight vector, since the feature vector is the ground truth and cannot be changed.

But we won’t add the entire feature vector to the weight vector since that will result in a big update. We would instead like to adjust our weight vector more carefully — and to do that, we scale down the feature vector using a learning rate before adding it to the weight vector. We want to make sure that our Perceptron converges to a good solution by making careful updates.
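The additive update can be sketched in one line. The weight vector, feature vector, and learning rate below are illustrative values, not the ones in the figures:

```python
lr = 0.25        # learning rate (illustrative value)
w = [-1.0, 1.0]  # current weight vector (illustrative)
x = [2.0, 4.0]   # feature vector of a misclassified blue point (illustrative)

# Nudge the weight vector towards the feature vector:
w = [wi + lr * xi for wi, xi in zip(w, x)]
print(w)  # [-0.5, 2.0]
```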

So let’s see what happens when we add a scaled version of the feature vector to the weight vector:

Adding a scaled version of the feature vector (yellow) to the weight vector (orange).

By adding a small version of the feature vector to the weight vector, we have effectively increased their dot product! The update didn’t quite manage to knock the selected point into the correct region (blue), but we made meaningful progress towards it. Now, our decision boundary is looking noticeably better.

Next, we pick another random point. This time, it’s a red point that’s being misclassified as blue:

The second randomly selected point’s feature vector (red) and the weight vector (orange).

Our situation this time is the reverse of the previous one. This time, the dot product between the feature vector and the weight vector is positive, but we would really like for it to be negative so that the selected point is correctly classified as red.

So how might we decrease the dot product between two vectors? We do the opposite of what we did to increase the dot product: subtract one from the other. Once again, we’ll update the weight vector using a scaled version of the feature vector:

Subtracting a scaled version of the feature vector (yellow) from the weight vector (orange).

The update made another noticeable improvement to our decision boundary.

At this point, you might be wondering what happens when the Perceptron randomly picks a point that’s already classified correctly:

The third randomly selected point’s feature vector (blue) and the weight vector (orange).

In this case, we simply leave the weight vector alone and move on to another randomly selected point.

We’re getting pretty close to a working solution, so let’s keep going:

Another additive update to the weight vector.

This time, the weight vector update was big enough to knock the misclassified blue point into the correct region! In fact, we notice that most of the points are correctly classified at this stage.

Finally, we’re down to our last update:

The final update to the weight vector.

The last update results in a decision boundary that cleanly separates the blue points from the red points. Our Perceptron’s final solution looks like this:

The Perceptron’s final decision boundary.

As you can see, a simple set of learning rules has allowed the Perceptron to converge to a meaningful solution. At first, it wasn’t very intuitive how simply adding/subtracting the feature vectors to/from the weight vector could lead us here. But we’ll soon see that this strategy also lies at the heart of deep learning.

To wrap up, here’s a summary of the Perceptron learning procedure:

  1. Randomly pick a data point from the training data.
  2. Predict which class the point belongs to.
  3. If the prediction is 0 but the target is 1, add a scaled version of the feature vector to the weight vector. Else if the prediction is 1 but the target is 0, subtract a scaled version of the feature vector from the weight vector. Else (the prediction matches the target), leave the weight vector alone.
  4. If any of the points are misclassified, return to step 1.
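Putting the four steps together, here’s a compact sketch of the whole procedure in Python. The tiny training set is hand-made and linearly separable through the origin (since this Perceptron has no bias term); it’s not the distribution from the figures:

```python
import random

def predict(w, x):
    """Step 2: output 1 when w · x >= 0, and 0 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train(data, lr=0.1, max_steps=10_000):
    random.seed(0)  # fixed seed so the run is reproducible
    w = [random.uniform(-1, 1) for _ in range(len(data[0][0]))]
    for _ in range(max_steps):
        if all(predict(w, x) == t for x, t in data):
            return w                           # step 4: nothing misclassified
        x, t = random.choice(data)             # step 1: pick a random point
        y = predict(w, x)                      # step 2: predict its class
        if y == 0 and t == 1:                  # step 3: additive update...
            w = [wi + lr * xi for wi, xi in zip(w, x)]
        elif y == 1 and t == 0:                # ...or subtractive update
            w = [wi - lr * xi for wi, xi in zip(w, x)]
    return w

# Two blue points (t = 1) and two red points (t = 0):
data = [([1.0, 1.0], 1), ([2.0, 1.0], 1),
        ([-1.0, -1.0], 0), ([-2.0, -1.0], 0)]
w = train(data)
print(all(predict(w, x) == t for x, t in data))  # True
```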

Perceptron Limitations

Based on what we’ve seen so far, it seems plausible that the Perceptron would be useful in solving real-world problems. Unfortunately, this is not the case.

The Perceptron has two big limitations:

  • It can only handle binary classification (i.e. positive or negative).
  • It can only draw linear decision boundaries.

To illustrate its limitations, let’s make a slight change to our training data:

Two additional red points, each highlighted by an orange circle.

We’ve added two additional red points, one towards the top-left and the other towards the bottom-right. These two points make it impossible to draw a single straight line that cleanly separates the blue points from the red points.

Let’s see how the Perceptron handles this challenge:

The Perceptron breaks down when it’s unable to converge to a solution.

Not very well! Towards the end, it looks like it’s having a mental breakdown. The Perceptron seemed so promising, yet it was trivial to set it up for failure. Its main drawback is its linear nature, as most real-world phenomena are simply not linear.

This unfortunate limitation — paired with the exaggerated claims made by some proponents of the Perceptron — led to the disappointment of many researchers in the field of artificial intelligence. The result was what’s known as the “AI winter”: an extended drought of research around neural networks that lasted well over a decade.

But eventually, researchers were able to develop new machine learning methods that overcame these challenges, putting neural networks back in the game. One such method is the Multilayer Perceptron, and we’re now ready to deconstruct its complexity.
