Illuminating the Black Box of CNNs

A visual guide by a college data scientist

Michael Liu
The Startup
11 min read · Jun 29, 2020


Introduction

Most non-technical literature tends to brand machine learning algorithms as opaque and uninterpretable, a sort of input-output black box. All of the data science classes I’ve taken at school thus far have treated neural networks in a similar fashion; even as a statistics major, I’ve never been taught how a neural network (NN) actually works. With this piece I seek to make NNs more accessible and more understandable to the non-technical populace.

Below I have compiled a “visual guide” to NNs and CNNs. I’ve seen a few articles targeting this subject from a very high level, and a few targeting it from a pretty mathematically rigorous low level. In making this guide, my goal was to balance the two; that is, I want to be completely thorough at the high level, while still sprinkling in a few low-level but hand-wavey explanations.

For certain steps, I will provide concrete examples from a related project of mine — outputting a sex classification (male/female) given a dataset of faces. In steps where actual numerical data is easier to conceptualize (as opposed to image data converted to numerical data), I will turn to the case of COMPAS — outputting a likelihood of recidivism given a dataset of many personal and historical predictors (age, sex, previous offenses, etc.). Note that the actual COMPAS algorithm is undisclosed, and these examples only convey how COMPAS might work, in its simplest form.

The Neuron/Node

The most basic unit of a neural network is the neuron, or node. Referring to Fig. 1, a neuron takes in a numerical input vector of m features plus a leading constant, (1, x1, x2, … , xm); the leading 1 is there so that its weight can act as a bias term. This vector can represent, for instance, one observation of the dataset. In the case of COMPAS, each element of the input vector may convey one predictor of the dataset (e.g. x1 = age, x2 = sex, x3 = number of past arrests, etc.) for a particular individual. Next, the neuron assigns a weight, or coefficient, to each element of the input vector (w0, w1, w2, … , wm). The input vector and weight vector are then combined into a single scalar value via a net input function. This often takes the form of a simple dot product, i.e. multiplying each input element by its respective weight and then adding them all together. In many cases a constant bias value will also be added after the dot product. Going back to the example of COMPAS, a predictor with a larger weight will influence the final output more than a predictor with a smaller weight.

Figure 1: Neuron architecture
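
To make this concrete, here is a minimal NumPy sketch of the net input function; the feature values and weights below are made up for illustration:

    import numpy as np

    # One hypothetical COMPAS-style observation: (1, age, sex, past arrests).
    # The constant leading 1 pairs with the bias weight w0.
    x = np.array([1.0, 25.0, 0.0, 3.0])
    w = np.array([0.5, 0.04, -0.2, 0.3])  # (w0, w1, w2, w3), made up

    # Net input function: a simple dot product aggregates the
    # input vector and weight vector into a single scalar.
    z = np.dot(w, x)
    print(z)  # 2.4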

The aggregated scalar will then be passed through an activation function, which will transform the scalar to coerce it into our desired output form. For COMPAS, the activation function may simply normalize the aggregated scalar so that it rests between 0 and 1, representing the probability of recidivism. In my facial classification study, I will make use of two common activation functions: ReLU and Sigmoid (see Fig. 2).

Figure 2.1: ReLU activation function
Figure 2.2: Sigmoid activation function

The ReLU function, f(x) = max(0, x), takes in the aggregated scalar and forces it to zero if it is negative. If it is positive, it leaves it untouched. Effectively “discarding” our scalar may seem strange in the case of a single node, but once we begin to combine hundreds of nodes through thousands of connections, the ReLU function’s zero boundary gives us a way to “throw out” outputs from unimportant nodes and therefore greatly reduce computational demand. The Sigmoid function, f(x) = 1/(1+e^(-x)), is frequently used when we want the output to be a binary classification: 0 or 1. This is precisely what we want to do for sex classification, as we can encode 0 to mean class #1 (male) and 1 to mean class #2 (female). Examining the graph of the sigmoid function, we see that the range of f(x) indeed goes from 0 to 1, where f(x) represents the probability of belonging to class #2. We further see that the center of the function (around x = 0) is very steep, while the edges of the function get flatter and flatter. This pushes our output values closer to 0 or 1, which reduces the number of ambiguous predictions like a 50/50 classification.
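
Both activation functions are one-liners in NumPy; a minimal sketch:

    import numpy as np

    def relu(z):
        # f(z) = max(0, z): negative scalars are forced to zero
        return np.maximum(0.0, z)

    def sigmoid(z):
        # f(z) = 1 / (1 + e^(-z)): squashes any scalar into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    print(relu(-3.2), relu(2.4))       # 0.0 2.4
    print(sigmoid(0.0), sigmoid(2.4))  # 0.5 0.916...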

Single-layer Networks

A single node on its own can function as a model — you might notice that it effectively amounts to a regression model (with a sigmoid activation, it is essentially logistic regression). But by combining multiple nodes together into a layer, then aggregating their outputs, we can process a lot more information at once and gain a lot more predictive power. Referring to Fig. 3, let us now examine a simple, single-layer neural network.

Figure 3: A single-layer, feed-forward neural network

The input nodes in a neural network do not carry out any computation; they merely translate externally-provided information into the model. Each node in the hidden layer then behaves like the single node that we examined earlier — that is, it takes in a vector of inputs, assigns weights to them, aggregates the vector into a scalar, then transforms the scalar via some activation function. The transformed scalars from each of these nodes will then be aggregated once more to produce a single output. As with a single node, the network weights and biases each node, prioritizing nodes that “do very well” in providing accurate outputs and correcting nodes that systematically mispredict in some way. Fig. 3 presents a rather simplified representation of a standard feed-forward neural network, in that all of its arrows point right. This presents a huge issue — after training the network and getting some resultant prediction accuracy, it cannot use that accuracy information to go back and improve itself. In reality, neural networks rely heavily on a technique called back-propagation, which allows the network to use information from the most recent forward pass to adjust all of its weights and biases.

Figure 4: Back-propagation

Back-propagation begins by calculating the error on the training set after a pass, via some specified loss function — in the case of facial classification, the loss quantifies how poorly the network did in classifying faces as male/female. The network then tries to optimize the loss function (i.e. minimize the prediction error) via a step of some specified optimization algorithm. It adjusts the weights and biases of the network accordingly, and the process of a forward pass followed by back-propagation repeats with another step of the optimization algorithm. The simplest loss function could be something like root mean squared error (RMSE) on the training set, while the simplest optimization algorithm could be something like stochastic gradient descent (SGD).

Figure 5: A visualization of SGD

With each pass, we step back and forth toward the minimum of the loss function (in black), with a step magnitude proportional to the local steepness (the gradient) of the loss — so the steps naturally shrink as we approach the minimum.
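
In code, one (hand-wavey) SGD step is just a nudge against the gradient; the gradient values and learning rate here are made up for illustration:

    import numpy as np

    learning_rate = 0.01  # hypothetical step-size parameter

    def sgd_step(weights, grad_loss):
        # Move each weight a small step against its gradient so that
        # the loss decreases; a steeper gradient means a larger step.
        return weights - learning_rate * grad_loss

    w = np.array([0.5, -0.2, 0.3])
    g = np.array([1.2, -0.4, 0.0])  # pretend back-propagation gave us these
    print(sgd_step(w, g))           # [ 0.488 -0.196  0.3  ]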

For my facial classification model, I use Keras’s binary cross-entropy as the loss function and the Adam optimizer, which are standard for binary classification neural networks, albeit far more mathematically involved than RMSE and SGD.
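
For intuition, binary cross-entropy can be computed by hand in a few lines; this sketch only shows what the loss measures, not how Keras implements it:

    import numpy as np

    def binary_cross_entropy(y_true, y_pred):
        # Penalize predicted probabilities that drift from the 0/1 labels;
        # clip to avoid taking log(0).
        eps = 1e-7
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred)
                        + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1, 0, 1])        # true classes
    y_pred = np.array([0.9, 0.2, 0.6])  # predicted probabilities
    print(binary_cross_entropy(y_true, y_pred))  # ~0.28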

Deep Neural Networks

The same concepts of aggregation and back-propagation hold as we extend our network into multiple layers. A deep neural network is conceptually no different than a single-layer neural network, except for the fact that nodes in a previous layer will “send” their aggregated output to multiple nodes in the next layer rather than immediately making a final prediction.

Figure 6: A feed-forward, densely-connected, >2-layer, >4 node-per-layer neural network

Fig. 6 demonstrates some of the properties of a neural network that I will briefly cover. The network is feed-forward, meaning that with each pass through the network we make no cyclic returns to previous layers until the entire pass has been completed. It has a depth of at least three, which is the number of layers excluding the input layer but including the output layer. (You might note that the aforementioned “single-layer network” technically has a depth of two. We count the output layer as well because it is also parametrized.) It is densely connected, which means that each node in the previous layer is connected to every node in the next layer, and there are at least four nodes per layer.

In addition to what we can see here, we can also specify some parameters during the fitting process. The number of epochs, for example, is the number of full forward-and-backward passes over the training data that the model will make during training, and the batch size is the number of training observations we will use at one time. Dense connections, greater depth, more nodes per layer, more epochs, and larger batch sizes are all more computationally intensive. In my own facial classification model, I have relatively few training observations (up to 74) and a simple two-layer network. As such, I can afford to have as many as 128 nodes per layer, dense connections, a batch size of 32 (relatively large compared to our total observations), and twenty epochs. Raising all of these parameters is not necessarily a good thing, as a) the network will take longer to train, and b) the network may become overfit on the training data.
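
As a sketch of what this setup looks like in Keras (the input size and dummy data below are placeholders, not my actual dataset):

    import numpy as np
    from tensorflow import keras

    n_features = 64 * 64                        # hypothetical flattened image size
    X_train = np.random.rand(74, n_features)    # 74 observations, as above
    y_train = np.random.randint(0, 2, size=74)  # 0 = male, 1 = female

    # Two densely-connected hidden layers of 128 nodes each, with a
    # single sigmoid output node for the binary classification.
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=20, batch_size=32)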

Convolutional Neural Networks

Flattening

Now that we have a solid foundation in feed-forward neural networks, we can build a full model to predict on quantitative data. In order to use images, like we want to do for our facial classification, we first have to convert the pixel data into quantitative data. We accomplish this by flattening our images into a numerical vector (Fig. 7). First, we extract the “magnitude” of each pixel from our image. Thinking in a black-and-white spectrum, lighter pixels might have higher magnitudes, while darker pixels might have magnitudes closer to zero. We then stack each column of pixels on top of each other from left to right, producing a flat vector.

Figure 7: Flattening
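
In NumPy, flattening a tiny grayscale “image” column by column looks like this (the pixel values are made up):

    import numpy as np

    # A 3x3 grayscale "image": 0 = black, 1 = white.
    image = np.array([[0.0, 0.5, 1.0],
                      [0.2, 0.7, 0.9],
                      [0.1, 0.4, 0.8]])

    # Stack the columns from left to right (column-major order),
    # matching the stacking described above.
    flat = image.flatten(order="F")
    print(flat)  # [0.  0.2 0.1 0.5 0.7 0.4 1.  0.9 0.8]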

This process, while simple, cannot be directly applied to most images. There are simply too many pixels per image, and many images have more than just one black-and-white channel, which would make our neural network far too computationally expensive. Rather, we pass our image through a process of feature learning (a “convolutional neural network”) before flattening it and feeding it to a standard neural network. In essence, feature learning is the iterative process of reducing image resolution while retaining the “important” parts of an image.

Splitting

Figure 8: Channel splitting

The very first step we take is to split colored images into their respective channels. Most colored images have three channels: red, green, and blue. By splitting the combined RGB pixels into single channels, we can assign a single scalar magnitude to each pixel rather than having to think about some multidimensional magnitude. The next step is convolution, hence the name “convolutional neural network.”
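
Channel splitting is a simple slicing operation; a sketch with a hypothetical 2x2 RGB image:

    import numpy as np

    # A hypothetical 2x2 RGB image: height x width x 3 channels.
    rgb = np.random.randint(0, 256, size=(2, 2, 3))

    # Each slice holds a single scalar magnitude per pixel.
    red, green, blue = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
    print(red.shape, green.shape, blue.shape)  # (2, 2) (2, 2) (2, 2)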

Convolution

Figure 9.1: Convolution/Filtering (Step 1)
Figure 9.2: Convolution/Filtering (Step 2)

In Fig. 9, we see the first two steps of a convolution layer. Given an image’s pixel data in each of its channels (RGB in this case), we pass a kernel over each channel. The kernel, in this case, is of dimension 3x3, and each block of the kernel has a specified filter weight. The kernel is projected onto the image, taking a 3x3 square chunk of the channel’s pixels and aggregating the nine pixel magnitudes into a single scalar, once again via dot product. In such a fashion, the kernel iteratively filters the entire image, as shown from step 1 to step 2. Ultimately, after aggregating the channels, the output will be a “convolved feature” that is reduced in size, with each pixel of the convolved feature representing a weighted 3x3 pixel chunk of the original image.

Many parameters of the convolution are alterable, such as the dimensions of the kernel itself. In Fig. 9, we see that from step 1 to step 2 the kernel moves rightward by 1 pixel; this indicates a stride of 1. We can increase this stride length, but note that if we increase it to more than 3 for a 3x3 kernel, we will effectively “miss” some pixels of the original image. We also see that the edges have pixel values of zero. These are not present in the original image; rather, we attach meaningless zero padding to the image so that the kernel fits properly from edge to edge as it strides rightward and downward. The filter weights are adjusted as the network trains itself, but in general we will observe a trend: at lower layers of convolution, we will still be able to make out low-level features of the original image, but as we add more layers of convolution we will only retain higher and higher-level features.
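
A bare-bones version of this kernel pass can be written in a few lines of NumPy; the kernel weights here are fixed by hand (a classic vertical edge detector), whereas in a real CNN they are learned:

    import numpy as np

    def convolve2d(channel, kernel, stride=1):
        # Slide the kernel across the channel, taking the dot product
        # of the kernel with each kxk chunk of pixels.
        k = kernel.shape[0]
        out_h = (channel.shape[0] - k) // stride + 1
        out_w = (channel.shape[1] - k) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                chunk = channel[i*stride:i*stride+k, j*stride:j*stride+k]
                out[i, j] = np.sum(chunk * kernel)
        return out

    channel = np.pad(np.random.rand(5, 5), 1)  # zero padding around the edges
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]])            # vertical edge detector
    print(convolve2d(channel, kernel).shape)   # (5, 5)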

Figure 10.1: Convolved features at layer two
Figure 10.2: Convolved features at layer four

Fig. 10 demonstrates this phenomenon in my second facial classification model, which has many layers of convolution. Observe that at the second convolution, we’ve extracted most edges of the image, but by the fourth convolution, we’ve extracted more high-level “human” features like eyes, hair, and mouth.

Pooling

After convolving, we can further reduce our image and extract high-level features through pooling layers (Fig. 11). This technique is very similar to convolving, making use of an iterative kernel. Rather than aggregating each pixel block using trained filter weights, however, pooling simply takes the maximum (max-pooling) or average (average-pooling) pixel value in the block. I use max pooling in my facial classification models, which is common practice, as it functions not just as a reduction technique but also as a form of noise suppression.

Figure 11.1: Max pooling (Step 1)
Figure 11.2: Max pooling (Step 2)
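
Max pooling is even simpler than convolution, since there are no weights to learn; a minimal sketch with a made-up 4x4 feature map:

    import numpy as np

    def max_pool(feature, size=2):
        # Keep only the largest value in each size x size block,
        # shrinking the feature map by a factor of `size`.
        h, w = feature.shape[0] // size, feature.shape[1] // size
        out = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = feature[i*size:(i+1)*size,
                                    j*size:(j+1)*size].max()
        return out

    feature = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 2],
                        [2, 1, 0, 3]])
    print(max_pool(feature))  # [[4. 2.]
                              #  [2. 5.]]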

Putting It All Together

After applying a desired number of convolution and pooling layers, we have an image that represents the most important features of the original image, but is far reduced in size. We can now carry out flattening, then pass the flattened numerical vector into our standard neural network for processing. Fig. 12 demonstrates the full convolutional neural network, put together.

Figure 12: A full convolutional neural network
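
As a final sketch, here is what the whole pipeline might look like in Keras; the layer sizes and the 64x64 input are illustrative, not my exact architecture:

    from tensorflow import keras

    model = keras.Sequential([
        # Feature learning: convolution and pooling layers
        keras.layers.Conv2D(32, (3, 3), activation="relu",
                            input_shape=(64, 64, 3)),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(64, (3, 3), activation="relu"),
        keras.layers.MaxPooling2D((2, 2)),
        # Flattening: image data -> numerical vector
        keras.layers.Flatten(),
        # Standard neural network: dense layers -> sigmoid output
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # male/female output
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.summary()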

With the conclusion of this guide to CNNs, I encourage the reader to conceptualize neural networks not as an impenetrable black box, but rather as a step-by-step process that is still very much controllable. Although the adjustable parameters may be far more mathematical in nature than in, for instance, a simple linear regression, they are still explainable and human-determined. Machine learning models can “make their own decisions” in a sense, but they make those decisions according to strict methods that we specify.


Michael Liu
Co-founder of Finary (YC W21) | Harvard Data Science ’21 | Product, Design, Philosophy