Peeking Into the Black Box of Deep Learning

Part 3/4 of the Deep Learning Explained Visually series.

Tyron Jung
The Feynman Journal
Aug 23, 2021

--

Data evolving through the layers as a deep neural network transforms feature vectors into useful predictions.

This article is a part of the Deep Learning Explained Visually series. If you’ve just read part 2, welcome back! If you’re new here but familiar with machine learning, you can skip the first two parts and simply enjoy the visualizations. :)

The Complexity of Deep Neural Networks

While the inputs X and the desired outputs T are well-defined for a deep neural network, the means by which the network transforms the inputs into meaningful outputs can be puzzling. This is why it’s tempting to simply treat it as a black box. But our goal in this article is to peek inside the black box and gain a better understanding of what’s going on.

We can think of a deep neural network as a function f(X) that maps X to Y.

In a sense, what happens between the inputs and outputs can be thought of as a function f(X). This function performs both linear and nonlinear transformations on the input data X in order to massage it into the output Y, which should be as close to T as possible.

And although we know that throwing in more hidden layers and adding more neurons per layer allows the neural network to approximate more complex functions, designing the architecture of a neural network is still considered to be as much art as it is science.

This is especially true for high-dimensional data with thousands of features. While it is possible to apply proven techniques and empirically iterate on a neural network design, it can be challenging to wrap your head around all the moving parts of a deep neural network.

But by choosing a simple problem and defining the right architecture, we can actually visualize the steps that f(X) takes in order to transform X into Y.

Peeking Into the Black Box

To understand what’s going on inside the black box, let’s look at a concrete example:

A data distribution where each data point has two components, x1 and x2, and a target label t.

In the above distribution, each data point has two features, x1 and x2. If the signs of x1 and x2 are the same, the point has a target label of 1 (positive) and belongs to the blue class. If the signs are different, the point has a target label of 0 (negative) and belongs to the red class.

At first glance, it seems like a pretty simple data distribution. But you’ll notice that it’s impossible to separate the blue points from the red points with a single straight line. In other words, it’s not linearly separable.
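To make this concrete, here is a small NumPy sketch that samples a distribution like the one described above. The helper name `make_xor_data`, the sample count, and the uniform sampling are assumptions for illustration; only the labeling rule comes from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_xor_data(n=200):
    """Sample points in [-1, 1]^2; label 1 (blue) when x1 and x2
    share a sign, 0 (red) otherwise. (Hypothetical helper.)"""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    T = (np.sign(X[:, 0]) == np.sign(X[:, 1])).astype(float)
    return X, T

X, T = make_xor_data()
# No single straight line can split the two classes: this is
# the classic XOR-style pattern, so it is not linearly separable.
```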

The only way we can solve this binary classification problem is to draw a nonlinear decision boundary, and we’ve already seen that the Perceptron is incapable of overcoming this kind of challenge. The MLP, on the other hand, is much better suited to dealing with nonlinear data.

So we’ll use the following MLP to solve the problem:

Our architecture has two inputs (for x1 and x2), two hidden layers with three neurons each, and a single output that tells us whether the input data point belongs to the blue class or the red class. The hidden neurons and the output neuron all use the sigmoid function as their nonlinear activation.
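A forward pass through this 2 → 3 → 3 → 1 architecture can be sketched in a few lines of NumPy. The weight values here are illustrative random initializations, not the trained network from the figures:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small random weights for the 2 -> 3 -> 3 -> 1 architecture
# (illustrative initialization; the article's weights are not given).
rng = np.random.default_rng(42)
W1, b1 = rng.normal(0, 0.1, (3, 2)), np.zeros(3)  # first hidden layer
W2, b2 = rng.normal(0, 0.1, (3, 3)), np.zeros(3)  # second hidden layer
W3, b3 = rng.normal(0, 0.1, (1, 3)), np.zeros(1)  # output layer

def forward(x):
    h1 = sigmoid(W1 @ x + b1)   # 2D input  -> 3D activations
    h2 = sigmoid(W2 @ h1 + b2)  # 3D        -> 3D activations
    return sigmoid(W3 @ h2 + b3)  # 3D      -> scalar in (0, 1)

y = forward(np.array([0.5, -0.5]))
```

Because every layer ends in a sigmoid, the final output always lands strictly between 0 and 1, which is what lets the MLP express a continuum of confidence rather than a hard 0/1 decision.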

Although this problem is far from the kind that you would actually see in real life, it satisfies the two requirements that we care about:

  • The data is not linearly separable, which allows us to compare how the MLP performs against simpler models such as the Perceptron.
  • The data lives in 2D space, while the hidden layers of our MLP operate in 3D space. This allows us to visualize how the data is transformed by our neural network at each step.

Indeed, our choice of architecture is entirely based on being able to visualize the data as it evolves through the neural network. We could have chosen a different architecture — for instance by adding more hidden layers or varying the number of neurons per layer. So it’s important to realize that there are many different ways to solve the same problem in deep learning.

With that in mind, let’s dive into the problem and see what our MLP thinks when we randomly initialize its weights:

The MLP starts off with a very fuzzy idea of how to draw the decision boundary.

Similar to the Perceptron when it was randomly initialized, our MLP is confused about how to draw the decision boundary. It thinks that the data points towards the upper-left corner belong to the blue class and that those towards the bottom-right corner belong to the red class.

What’s worth noting is that unlike the Perceptron, the MLP has a continuous output. In other words, it can output any value in the range (0, 1) rather than just 1 or 0. This is why we see a continuum of colors in the above plot.

Now, let’s see how the MLP learns to draw the decision boundary:

The MLP learning to draw a nonlinear decision boundary between two classes.

Our neural network starts off in a pretty uncertain state, but it eventually learns how to cleanly separate the blue points from the red points! You can tell that its understanding is becoming more definite by the increase in contrast between the blue and red colors.

But how well did it approximate the true underlying distribution?

Left: the true decision boundary. Right: the decision boundary learned by the MLP.

For the learned decision boundary, we notice that the top-left quadrant and the bottom-right quadrant are connected. From this, we can glean a significant limitation of deep neural networks: They don’t actually figure out the underlying nature of the data — they simply fit the data given to them. A human observer might be able to infer the rules behind the patterns, but a neural network is unable to exhibit that level of intelligence by simply fitting the training data.

This also relates to another limitation of deep learning: It requires a lot of data to fit complex patterns. In other words, it’s incapable of doing what we call “one-shot” learning, where an algorithm learns to recognize different patterns based on just a few examples. For instance, the lack of training examples in the center region caused our MLP to incorrectly conclude that the top-left quadrant and the bottom-right quadrant are connected.

Our neural network overfitted the training data and failed to generalize in the center region.

If we had more training examples in that region, our MLP’s solution might have been closer to the ground truth. This is why simply providing more training data is often a good solution to overfitting.

Activation Landscapes

Now that we’ve seen a top-down view of how our MLP learns a decision boundary, let’s look at an isometric view of the learning process:

An isometric view of how the output activation landscape changes throughout the learning process.

Here, we’re plotting the output y of our neural network with respect to all of the possible values for the two inputs x1 and x2 in the range [-1, 1]. Since we’re mapping two variables (x1 and x2) to a single output (y), what we get is a 3D plot. We’ll call this the output activation landscape.
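One way to compute such a landscape is to evaluate the network on a grid covering [-1, 1] in both inputs. The weights below are illustrative random values standing in for the real (trained) parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in weights for the 2 -> 3 -> 3 -> 1 network (assumed values).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (3, 2)), np.zeros(3)
W2, b2 = rng.normal(0, 0.1, (3, 3)), np.zeros(3)
W3, b3 = rng.normal(0, 0.1, (1, 3)), np.zeros(1)

# Evaluate y over a 50 x 50 grid of (x1, x2) values in [-1, 1].
xs = np.linspace(-1, 1, 50)
grid = np.array([[x1, x2] for x2 in xs for x1 in xs])  # (2500, 2)
H1 = sigmoid(grid @ W1.T + b1)
H2 = sigmoid(H1 @ W2.T + b2)
Y = sigmoid(H2 @ W3.T + b3).reshape(50, 50)  # heights of the landscape
```

Plotting `Y` as a surface over the grid gives the 3D output activation landscape shown above.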

Towards the end of the learning process, we see that the output landscape is elevated in the top-right and bottom-left quadrants (where the blue points are) and depressed in the top-left and bottom-right quadrants (where the red points are). The positive blue points with a target value of 1 are pulling the landscape up, while the negative red points with a target value of 0 are pulling the landscape down.

Taking this further, we can map the output of each hidden neuron with respect to the original two inputs x1 and x2 throughout the learning process:

Animation showing how an MLP learns to combine simple functions into more complex ones.

The activation landscapes in the leftmost column of the above animation represent the three hidden neurons in the first hidden layer. As we can see, these landscapes are very simplistic — in fact, they’re just 2D sigmoid functions since that’s what we’re using for our nonlinear activation.

Things start to get more interesting in the middle column. The activation landscapes in this column represent the three hidden neurons in the second hidden layer, and we notice that these are more sophisticated than the ones in the first hidden layer. This is because each landscape in this layer is a unique combination of the simpler landscapes from the previous layer.

More broadly, the chosen nonlinear activation is what determines the shape of the activation landscapes that are combined by the neural network. This is another thing to keep in mind as you decide which function to use for nonlinear activation!

Moving onto the last layer, the output neuron combines the landscapes in the second hidden layer to form the final activation landscape, which is shown in the rightmost column of the above animation. The overall process reveals how an MLP learns to combine simpler functions into more sophisticated ones based on the given data, with the goal of faithfully approximating the underlying function.

You might have noticed that all of the landscapes begin as flat 2D planes before being molded into meaningful shapes by the learning process. This is because the MLP’s weights are initialized to small values, which is standard practice in deep learning. When the weights are small, the inputs into the nonlinear activations are small, resulting in the following activation range:

The linear region of the nonlinear sigmoid function highlighted in orange.

The above diagram makes it evident that for small input values, the sigmoid function has a linear response to its inputs. This is reflected in the 2D-plane activation landscapes we see at the beginning of the learning process. However, as the weights grow bigger, our MLP becomes more opinionated and starts exhibiting more curved activation landscapes.
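We can check this numerically: near zero, the first-order Taylor expansion of the sigmoid is σ(z) ≈ 0.5 + z/4, and for small inputs the true function barely deviates from that line:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare the sigmoid to its linear approximation 0.5 + z/4
# over the small-input range [-0.5, 0.5].
z = np.linspace(-0.5, 0.5, 101)
linear = 0.5 + z / 4.0
max_err = np.max(np.abs(sigmoid(z) - linear))
# max_err stays well under 1%, so small-weight neurons behave
# almost exactly like linear units.
```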

Now let’s do a little thought experiment. Imagine adding more and more neurons to the hidden layers of the MLP. Each new neuron gives the network another activation landscape to work with, allowing it to produce a more sophisticated output landscape and therefore approximate more complex functions.

From this we can conclude that if you give the MLP enough hidden neurons and set the weights just right, it can approximate any continuous function to arbitrary precision. This is formally known as the Universal Approximation Theorem, and it’s what makes deep neural networks such a powerful tool.

Transforming the Data

So far, we’ve visualized how the MLP learns to combine simple functions into a more complex one. Another interesting process that we can visualize is how the input data evolves through the different layers of a fully trained neural network. And in order to get a holistic view of the process, we’ll track all of the training data at once.

We begin with the original data distribution that’s impossible to separate linearly:

The initial shape of the training data before any transformations are applied.

Note that the data is flat when we visualize it in 3D space since it only has two dimensions (x1 and x2). But things start to get more interesting when we apply the first weight matrix multiplication:

The training data after the first linear projection.

Our first weight matrix is a 3 x 2 matrix since it transforms 2D vectors into 3D vectors (since we’re going from 2 neurons to 3 neurons). And as we can see, the transformation projected our flat 2D data into 3D space. However, because it’s a purely linear transformation, our data isn’t any more separable than it was before.
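We can see this flatness directly: after the 3 × 2 linear map, each row of the result is a combination of just two directions, so the data still lies on a 2D plane inside 3D space. A quick sketch (with assumed random data and weights):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))   # flat 2D data (assumed sample)
W1 = rng.normal(0, 0.5, (3, 2))    # 3 x 2 weight matrix (assumed values)
Z1 = X @ W1.T                      # each 2D point becomes a 3D point

# Every row of Z1 is a combination of W1's two columns,
# so the projected data still spans only a 2D plane.
rank = np.linalg.matrix_rank(Z1)
```

The rank stays at 2, confirming that a purely linear transformation cannot make the data any more separable; that job falls to the nonlinear activation.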

But what happens when we use nonlinear activation on the first hidden layer?

The training data after the first nonlinear activation.

The sigmoid activation has exploded our data! From this, we can begin to see how our MLP is nonlinearly massaging the data to make it more separable.

Next is another linear transformation, this time from 3D to 3D:

The training data after the second linear projection.

On the surface, it’s hard to tell whether this step got us anywhere. But as we’ve seen before, interesting things happen when we apply nonlinear activation to the data:

The training data after the second nonlinear activation.

Amazingly, we see that the last nonlinear transformation has made our data linearly separable! We can draw a single straight line to separate the blue data from the red data when it’s in this butterfly-like shape.

Now let’s see what happens when we apply the final linear transformation. This time, it projects our 3D data into 1D:

The training data after the final linear projection.

The neural network has learned to project 3D data onto a 1D line such that the blue data and the red data are nicely collected into two distinct groups without any overlap. This corresponds to a 100% training accuracy.

Finally, let’s see what happens when we apply the last nonlinear activation:

The training data after the final nonlinear activation.

Now, not only is our data organized in a straight line, but it’s also nicely contained in the range (0, 1)! At this point, it’s trivial for us to draw a line at 0.5 and say that any new data point above the line will be classified as blue, while any new data point below the line will be classified as red:

The decision boundary (orange) drawn by the MLP to separate the two classes.
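In code, this final thresholding step is a one-liner. The example activations below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation outputs from the final linear projection.
z_final = np.array([-3.2, -0.4, 0.9, 4.1])
y = sigmoid(z_final)             # squashed into (0, 1)
labels = (y > 0.5).astype(int)   # 1 = blue class, 0 = red class
```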

At each step of the feedforward process, the MLP molds the shape of the data until it becomes easy to distinguish one class from the other. The fact that it’s able to learn how to transform data that isn’t linearly separable into data that is linearly separable just based on the training data is nothing short of remarkable.

Summary

The goal of a deep neural network is to learn a function f(X) that closely approximates the underlying distribution from which the training data was collected. It achieves this by figuring out how to combine simple functions into more complex ones, extracting meaningful features as the data passes through the layers. By the end of the learning process, we saw how our MLP masterfully combines linear projections and nonlinear activations in order to perform binary classification.

In this article, we took a peek inside a deep neural network to develop a stronger intuition around its inner workings. But how exactly does an MLP learn? We’ll tackle this question in the final chapter of the series.
