A Visual Description of Convolutional Neural Networks Using Basic Squares.

Gary Pearson
15 min read · Oct 25, 2022


You Promised Cats! Convolutional Neural Networks

Convolutional Neural Networks are brilliant. They are responsible for some of the tremendous leaps forward in artificial intelligence, most famously classifying anything represented as an image, e.g., cat pictures. They perform operations called convolutions. However, typical explanations can be convoluted (pun intended). I hope to make them easier to understand.

Convolutional Neural Networks (CNNs or ConvNets) are responsible for transforming images into a feature set used by deep learning methods such as feed-forward neural nets. Before describing convolutions, let’s highlight the difficulties of using a feed-forward network without a CNN as the pre-processing method.

*If you are new to neural networks, check out my article “A Simple Explanation of Neural Networks for Business People.”

A neural network architecture can distinguish pictures of two-wheel bicycles with round wheels from one-wheel unicycles with square wheels, assuming there are enough training pictures of unicycles with square wheels. The network architecture is straightforward.

Simple Neural Network

The number of wheels is input X1. Because a description of the shape cannot be used directly as an input, the number of sides represents the shape (one for round, four for square) and is input X2.

The first training picture is a bike. The network will initialize with random weights and biases and hopefully produce a probability of one for a bicycle and zero for a unicycle. It won’t!

As in all neural networks, an error is calculated, and gradient descent and back-propagation adjust the parameters. After repeated epochs (full passes through the training data) with many more pictures of bicycles and unicycles and a balanced distribution of images, the result is a pretty good model.
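
To make that training loop concrete, here is a minimal NumPy sketch of a single output neuron trained with gradient descent on the two inputs described above. The data values, learning rate, and number of epochs are my own stand-ins for illustration, not values from the article.

```python
import numpy as np

# Toy training data: [number of wheels, number of wheel sides]
# Label 1 = bicycle (two round wheels), 0 = unicycle (one square wheel).
X = np.array([[2.0, 1.0],
              [1.0, 4.0],
              [2.0, 1.0],
              [1.0, 4.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # random starting weights
b = 0.0                  # bias
lr = 0.1                 # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(500):              # repeated passes over the data
    p = sigmoid(X @ w + b)            # forward pass: predicted probability of "bicycle"
    error = p - y                     # difference between prediction and truth
    grad_w = X.T @ error / len(y)     # gradients from back-propagation
    grad_b = error.mean()
    w -= lr * grad_w                  # gradient descent step
    b -= lr * grad_b

print(sigmoid(X @ w + b))  # probabilities move toward 1 for bicycles, 0 for unicycles
```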

Once again, life is not that simple. How do we know how many wheels our image has? What features would be used to separate a cat from a dog? The number of legs, ears, eyes? Take a moment and have a think. Can you name even one feature that reliably distinguishes one from the other? Great news! Convolutional neural networks will decide what features are essential, and it won’t be anything obvious.

Using this high-resolution image of my face after watching a 2021 end-of-year review show,

Figure 2 — Face

How would the picture be described in a manner that can be understood by a computer? Applying a value of one to black pixels and negative one to white pixels might work for a side-by-side comparison, but it won’t work for image classification. How about two sets of horizontal 2x1 black pixels and one 4x1 line of pixels below? Nope: what if I tilt my head to the side? My face could be offset in the image, my eyes could narrow, or there may be a small frown.

That description does not work either, and all these problems arise with a simple black-and-white picture. It is ludicrous to think rules can be defined for high-resolution, color, detailed images.

Every neural network takes numbers to start. It does not know what the numbers mean, but through guided trial and error it finds a way to elevate the critical numbers and produce the correct output. Every neuron is connected to every neuron in the layer ahead. In image classification, raw pixel values are not useful inputs for at least three reasons. First, a medium-resolution (1024px x 768px) color image would produce more than two million inputs. With one thousand neurons in the first hidden layer, we hit roughly 2.4 billion parameters. Of course, networks with many more parameters are commonplace, so this is not a deal killer.
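
As a quick sanity check on that arithmetic (the one thousand hidden neurons are the article's illustrative figure):

```python
width, height, channels = 1024, 768, 3
inputs = width * height * channels         # one input per colour value
hidden_neurons = 1_000

weights = inputs * hidden_neurons          # every input connects to every hidden neuron
biases = hidden_neurons
print(f"{inputs:,} inputs")                # 2,359,296 inputs
print(f"{weights + biases:,} parameters")  # 2,359,297,000 parameters in the first layer alone
```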

The second reason is that it makes no sense to connect every pixel. The relationship between most of the pixels is irrelevant to the outcome, e.g., a white pixel in the top left has no ties to a white pixel in the bottom right. The third reason is that the pixels have no context other than the brightness level. Two images with the same pixel values, when rearranged, result in a different picture.

It’s inconceivable an image recognition system would work satisfactorily if our input values were simply the values of each pixel.

A convolutional neural network is typically connected to and comes before a feed-forward neural network. A convolutional neural network aims to deliver input values to the feed-forward network that relate only to meaningful image features. If you are trying to identify a bike, do you care that the background is a snow scene? No, but you would want to recognize wheels.

In the following picture, there are clearly identifiable features: black horizontal and vertical lines. How can the network learn that they are essential? Undoubtedly you cannot read the tiny numbers in each square; they indicate a pixel’s brightness, negative one for white and one for black.

Black Lines

To recognize other similar images, maybe with various colors replacing white, the network is trained to learn only the features that matter, i.e., the black lines. As with all neural network inputs, it starts by multiplying each input value by a random weight. In a CNN, this is done using convolution.

What is a Convolution?

Convolutional neural networks get their name from a mathematical operation called convolution. The formal definition of convolution is — a function derived from two given functions by integration, which expresses how the shape of one is modified by the other. For us, it means changing the value of pixels by passing a smaller grid of values across the image.
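
In symbols, and hedging slightly because CNN libraries actually compute cross-correlation rather than a flipped-kernel convolution, the operation for a 3x3 filter K passed over image I can be written as:

```latex
(I \star K)(i, j) = \sum_{m=0}^{2} \sum_{n=0}^{2} I(i + m,\; j + n)\, K(m, n)
```

Each output value is simply the sum of nine pixel-times-weight products at one filter position.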

Figure 4 — Face Starting Values

The grid that is passed over the image is called a filter; another name is a kernel. In this example, the filter is 3x3 pixels, but filters can be larger. Filters are initialized with randomly generated weights, as in any neural network. For convenience, assume we have trained the network and arrived at a filter that uses only two numbers, 1 and -1.

Figure 5 — A 3x3 Filter

Let’s perform the convolution and get into what is happening at the end. Start by placing the grid over the image, beginning in the top-left corner. Each of the filter values multiplies the value of the pixel beneath, as shown here. Bonus material: the area of the image underneath the filter is called the “receptive field.”

Figure 6 — Dot-Product

The next step is to sum the nine products in the resulting 3x3 grid. Place the answer into a new grid called a feature map, in this case the same size as the original image.

Single Feature Output

Slide the filter to the right by one step (user-definable) and repeat the process. When the end of the row is reached, jump back to the grid’s left and drop down one line. Repeat the process until no further movements are possible.
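
Here is a minimal NumPy sketch of that sliding process. The function name convolve2d and the tiny test image are my own choices for illustration:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, taking the sum of element-wise
    products (the dot product) at every position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            receptive_field = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(receptive_field * kernel)
    return feature_map

# A tiny image with one horizontal black line (1 = black, -1 = white)
image = np.full((5, 5), -1.0)
image[2, :] = 1.0

# A filter that responds strongly to horizontal lines
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 1.0,  1.0,  1.0],
                   [-1.0, -1.0, -1.0]])

print(convolve2d(image, kernel))  # a row of 9s where the filter sits exactly on the line
```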

The convolution, unlike a feed-forward network, applies weights to areas of a picture. It maintains a relationship between a pixel and its surrounding pixels. As stated earlier, pixels with a large spatial separation have no meaningful relationship.

These values could be used as input features, but we won’t use them yet; a single pass does not capture enough image feature information.

Feature Maps Demonstrated

You now understand the convolution process; let’s look at an example. Going back to the image of the black lines, the new random filter values at initialization are:

Random Starting Weights

After completing the convolution on the picture of black horizontal and vertical lines, the following feature map is generated. Note that this is a small section. The output is repetitive, so I have omitted most of it for simplicity. The result does not tell us anything.

Feature Map One

The colors are arbitrary: white is used for values equal to or less than zero, green for values greater than zero and less than five, gray for values from five to less than nine, and black for nine or more. There is nothing in this feature map that helps. There is a repeating pattern but nothing that stands out as a feature we can use.

A training epoch begins. Gradient descent and back-propagation are applied, as in our fully connected network. We ultimately arrive at a new set of filter weights,

Final Weights

In practice, it would take many, many epochs to derive the filter weights.

After completing the convolution, wow! Look at the result.

Small Section After Filter One

The meaning of the feature map values: in this example, the value in the resulting feature map indicates how well the 3x3 filter matched the values in the receptive field. An output of nine means that when the 3x3 filter is centered on that pixel location, the underlying nine pixels all exactly match the filter values.

Although the filter was only 3x3, sliding across the image has resulted in finding horizontal lines no matter where they live in the picture. The filter weights find features that are useful in identifying a grid. However, horizontal lines are not enough. There is no rule against using multiple filters. The second filter ultimately ends up with this configuration where a few of the values have changed from positive to negative and vice versa.

Filter 2 Final Weights

Figure 12 — Filter Two

When applied, it produces this feature map, identifying the vertical lines.

Feature Map 2

Let’s add one more filter,

Final Weights Filter 3

The intersections have been discovered.

The first convolutional layer is almost complete. Weights in the form of a filter have been applied to the inputs to produce a feature map. Back-propagation and gradient descent were used to find weights that reduced the model error. Each value in the feature map is now the weighted sum of a pixel and its immediate neighbors.

A bias can also be applied to each of the values in the resulting feature maps. A single bias value is used for each feature map. It sounds complicated, but it is nothing more than calculating the neural network weighted sum, i.e., image pixel x filter weight + bias = feature map value. Importantly, the well-known neural network linear operation X1*W1 + B1 = weighted sum has been applied in a manner that maintains a spatial connection.

Finally, after the convolution, the ReLU activation function is often used to set all negative values to zero. It tidies up the math.
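
Putting the pieces of this layer together, here is a hedged sketch of a convolutional layer. It reuses the hypothetical convolve2d helper from the earlier sketch and applies one bias per feature map, followed by ReLU:

```python
import numpy as np

def relu(x):
    """ReLU: set every negative value to zero."""
    return np.maximum(x, 0.0)

def conv_layer(image, filters, biases):
    """Apply each filter (plus its single bias value) to the image, then ReLU,
    producing one feature map per filter.
    Assumes the convolve2d sketch defined earlier."""
    feature_maps = []
    for kernel, bias in zip(filters, biases):
        weighted_sum = convolve2d(image, kernel) + bias  # pixel x weight, summed, plus bias
        feature_maps.append(relu(weighted_sum))          # negatives tidied to zero
    return feature_maps
```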

It’s time to stop thinking about images and think about features (we have just created a feature map, after all). Additionally, the filter is nothing more than a grid of weights.

Filter Weights

Three filters of dimension 3x3 were used in the first convolution; typically, there would be many more with various filter sizes.

Max-Pooling, Reducing the Feature Set Size

The black lines image used in the convolution example was a tiny 57x55 pixels. The filter, at 3x3, was a barely visible speck on a monitor. Any features found, i.e., the horizontal and vertical lines and intersections, were minuscule. That is not good enough to solve the image classification problem. In fact, horizontal and vertical lines of these sizes will be shared across most images. Features that are more representative of a typical class of pictures are needed. I would expect to see wheels for a bicycle and ears for a horse; tiny horizontal and vertical lines do not help.

What is Max-Pooling

Disclosure: there is more than one pooling method. We will use max-pooling. Pooling is used to strip unnecessary information from the feature map. The result is fewer remaining input features for the feed-forward network while retaining the critical feature information. The mechanics of pooling are similar in many ways to convolution. Using a small pixel grid (we will select 2x2), slide across the feature map in user-defined steps called a stride. A stride of two means taking two steps every time you move across or down the feature map. For each position, copy the highest of the four pixel values (the 2x2 grid) into a new map.

Figure 17 — Max-Pooling
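
A minimal NumPy sketch of 2x2 max-pooling with a stride of two; the function name and the small example grid are my own:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 9, 5],
                 [1, 1, 3, 7]], dtype=float)
print(max_pool(fmap))
# [[6. 2.]
#  [2. 9.]]
```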

This example is not large enough for adequate visualization, so let’s go back and apply max pooling to the previous feature maps. The result is the image to the right side in the following figure.

Max Pooling Before and After

After removing 75% of the features (with a stride of two, one out of four feature values is retained), the image’s key elements remain in the feature map. Most of the empty white features have been discarded. The three feature maps are reduced in size while retaining critical information, but it is not good enough. There are still too many inputs, and the features are not specific enough. The feature maps become the input for the next convolutional layer.

At this point, if you have searched the internet for an explanation of a ConvNet, you have likely been told the next layer finds higher-level features. Let’s examine what that means and how it works.

Back to Bicycle Wheels

In classifying a bicycle with two wheels or a unicycle with one wheel, I clearly need to know if there are one or more wheels in the picture. Here is a small part of a beautifully captured scene with Loch Leven and Ben Nevis in the background (not shown, to keep the math simple). It is a single wheel from either a unicycle or a bike.

Bicycle Wheel

Bonus Material: the white border represents added pixels with a value of zero. They allow a 3x3 filter to capture information at the real picture boundary by letting the center of the filter sit over a pixel at the edge of the image.
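
In NumPy, that zero border is a one-liner; a small hedged example:

```python
import numpy as np

image = np.array([[ 1.0, -1.0],
                  [-1.0,  1.0]])

# Add a one-pixel border of zeros so a 3x3 filter can center on edge pixels.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0.0)
print(padded.shape)  # (4, 4)
```

Convolving the padded image with a 3x3 filter returns a feature map the same size as the original image.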

The first convolution layer uses five 3x3 filters; after multiple epochs, the learned filters produce the following results. Note that white cells without values are equal to or less than one.

Figure 20 — First Convolutional Layer Feature Maps

With the first convolution of five filters complete, information is highlighted that might be useful in identifying a wheel.

The convolution is followed by a max-pooling operation, using a 2x2 pooling with a stride of two (each step to the side and down is two pixels). As shown in the following figure, the feature maps have been significantly reduced, but the critical feature information is retained.

Figure 21 — After Max-Pooling

The five feature maps now become the inputs for the next convolutional layer in the network, which learns higher-level features. The feature maps above now have smaller dimensions (11x11) than the input image (23x23). The image has been compressed while still retaining the features we need. When applying a 3x3 filter to the reduced feature maps, it conceptually covers a larger piece of the original image, so the filter can group more detail. Even more of the feature map can be covered by switching to a 5x5 kernel.
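
The size bookkeeping follows the standard formula output = floor((n + 2p - k) / s) + 1, where n is the input size, p the padding, k the filter or pooling size, and s the stride. A quick check against the numbers above:

```python
def out_size(n, k, p=0, s=1):
    """Output size of a convolution or pooling step."""
    return (n + 2 * p - k) // s + 1

print(out_size(23, k=3, p=1))   # 23: a zero-padded 3x3 convolution keeps 23x23
print(out_size(23, k=2, s=2))   # 11: 2x2 max-pooling with stride 2 gives 11x11
```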

Starting with random filter weights, after many epochs this new filter is learned.

Figure 22 — New Filter

The filter is applied to all five feature maps, and here is the result.

Figure 23 — Convolution Two

That is not good! It appears that there is less detail, not more!

However, unlike the single-channel black-and-white picture used as the layer one input, the layer two input was five feature maps.

Figure 24 — Layer Two with 5 Inputs

When applying a filter to more than one feature map, the output is the sum of the matching pixels in each feature map. The result is the following combined feature map (remember, values of nine or more are in black).

Figure 25 — The Wheel Feature Map

Ladies and gentlemen, we have a wheel. Yes, it is a small square wheel, but this is not intended to be a mini version of the original picture. The critical feature knowledge was retained throughout the process. The final step is one more round of max-pooling,

Figure 26 — Final Convolutional Layer Feature Output

resulting in the inputs X1 to X9 for the feed-forward section of the convolutional neural network.
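
Here is a hedged sketch of that combination step, reusing the hypothetical convolve2d and max_pool helpers from earlier. One filter is applied to each of the five incoming 11x11 feature maps, the results are summed element-wise into a single combined map, and a final round of pooling flattens down to the values passed to the feed-forward network. I use a 5x5 filter here (one of the options mentioned above) purely so the arithmetic lands on nine values; the article's exact sizes may differ.

```python
import numpy as np

def conv_over_channels(feature_maps, kernel):
    """Apply one filter to every incoming feature map and sum the
    results element-wise into a single combined feature map."""
    combined = None
    for fmap in feature_maps:
        out = convolve2d(fmap, kernel)                    # sliding dot product, as before
        combined = out if combined is None else combined + out
    return combined

rng = np.random.default_rng(1)
maps = [rng.normal(size=(11, 11)) for _ in range(5)]      # stand-ins for the five 11x11 maps
kernel = rng.normal(size=(5, 5))                          # a single learned 5x5 filter

wheel_map = conv_over_channels(maps, kernel)              # 7x7 combined "wheel" map
inputs = max_pool(wheel_map).flatten()                    # final 2x2 max-pooling, then flatten
print(inputs.shape)                                       # (9,): the inputs X1 to X9
```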

A Real Convolutional Neural Network

The following figure shows an example of a CNN. It is considered a very small CNN, but it suits the purpose of summarizing the CNN’s steps.

Figure 27 — A Convolutional Neural Network
  1. The input image. In this case, it is a single layer. For an RGB image, there will be three input layers, one for each color channel.
  2. Three filters were applied to the input layer, resulting in three feature maps. If the image in step one were an RGB image, each of the three filters would be applied to each input channel (in this example there is only one). The per-channel results for a given filter are then added together, so the layer still produces three feature maps in total, one per filter.
  3. Max-Pooling reduces the size of the image and removes unimportant pixels.
  4. The ReLU activation function changes all negative pixel values to zero to tidy up the math.
  5. In this network, nine filters are applied to each of the three feature maps from step three. Each filter’s outputs across the three feature maps are added together, leaving nine new feature maps.
  6. A second max-pooling takes place.
  7. The ReLU is applied a second time to produce the final feature vectors.
  8. The feed-forward input layer is populated with the feature vectors from step seven.
  9. The last step is the feed-forward network with as many hidden layers, neurons, and output nodes as your heart desires (see the code sketch after this list).
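
One way to express those nine steps in code, sketched here with PyTorch; the 23x23 single-channel input and the 32 hidden neurons are my own stand-in sizes, not values taken from the figure:

```python
import torch
import torch.nn as nn

# Steps 1-9 from the list above, expressed as a small PyTorch model.
model = nn.Sequential(
    nn.Conv2d(1, 3, kernel_size=3, padding=1),   # step 2: three 3x3 filters -> 3 feature maps
    nn.MaxPool2d(2),                             # step 3: max-pooling
    nn.ReLU(),                                   # step 4: negatives set to zero
    nn.Conv2d(3, 9, kernel_size=3, padding=1),   # step 5: nine filters -> 9 feature maps
    nn.MaxPool2d(2),                             # step 6: second max-pooling
    nn.ReLU(),                                   # step 7: second ReLU, final feature vectors
    nn.Flatten(),                                # step 8: feature vectors for the feed-forward layers
    nn.Linear(9 * 5 * 5, 32),                    # step 9: feed-forward hidden layer
    nn.ReLU(),
    nn.Linear(32, 2),                            # two outputs: bicycle vs. unicycle
)

image = torch.randn(1, 1, 23, 23)  # step 1: one single-channel 23x23 input image
print(model(image).shape)          # torch.Size([1, 2])
```

Training this model with back-propagation, exactly as in the feed-forward case, is what learns the filter weights described above.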

A ConvNet Summary

The explanation of convolutional neural networks was deliberately simplified. The input values delivered to the feed-forward component were for one feature we might find in an image of a bicycle. It does not help classify the picture as a unicycle or bicycle. For that, more features are required, maybe handlebars, frame geometry, forks, or the number of wheels.

The process just demonstrated would find a second wheel if it was in the picture. The first layer filters search the entire image to locate low-level information. The second layer looked for higher-level details, and if the filter had encountered another wheel, it would have impacted the feature map.

The filters that were learned would not have worked well for handlebar detection. However, instead of just three filters in the first layer, many more can be applied. Additional filters would learn to identify different components of our bicycle. Instead of ending with nine input values for the wheel, additional feature maps for the other elements that distinguish a bicycle from a unicycle would be included.

Other than illustrating the workings of a convolutional network, the explanation is overly simple. The filters learn increasingly abstract information. A bird image classification may learn minute feather patterns. Images have color; the input image is now three layers of RGB information. When trying to classify dogs, cats, horses, mice, etc., the number of ears, eyes, and legs are not suitable differentiating features. These are all identical for each of the animals listed. A cat classifier may, over multiple layers, learn subtle cat whisker patterns, the shape of the claws and feet, or even the face’s outline.

Learning the subtle texture features is the type of abstraction necessary. Mechanical items are relatively straightforward, but in the organic world, there is significantly increased complexity.

Knowing what features have been learned is not easy and, in some cases, not possible. A word of caution: if the cat training images all contained something like the copyright symbol ©, it is likely that the model will classify every picture with the © as a cat. Recognize that the model might be focusing on features not related to the image class. A repeating snow scene behind the bike in every training image (or even just many images) may result in all bike images without snow being misclassified and all backgrounds with snow being classified as a bike.

