Convolutional Neural Networks Explained

Jeremi Nuer
8 min read · Jan 8, 2022


Computer Vision is a term that’s thrown around often, and it is nothing if not incredibly exciting. But really, how does it work? How can a computer actually see images and videos, and interpret what’s in them?

Well, one of the most common and powerful tools for interpreting image data is the convolutional neural network.

This article assumes that you know the basics of neural networks. If you’re brand new to the topic or want a little refresher, check out the article that I wrote about neural networks here.

This article is divided into 5 parts:

Introduction & High Level Explanation

Tensors & Image Data Representation

Kernels & Filters

Padding & Pooling Operations

Coding Example

Introduction & High Level Explanation

Convolutional neural networks, by definition, are neural networks that contain convolutional layers.

CNNs can contain fully connected layers (in fact, in most cases they do), and their overall architecture resembles that of a normal neural network. Input data is still passed into the network and feeds forward through each layer, becoming more and more abstract until finally a class is outputted.

The distinction appears when you look at how exactly the layers process the input data, find patterns, and make the output data more abstract.

Most commonly, convolutional neural networks (CNNs) process image data, and can efficiently extract “features” or patterns from the image data. The now processed images are passed to the fully connected layers, which are able to much more easily classify the images.

Features, or patterns, refer to abstractions that are apparent in the images. These are things like edges and corners (features you would see in the earlier convolutional layers), but also curves, and shapes (features that would become apparent in later convolutional layers).

To “extract” a feature is to make it more apparent. Let’s take the example of an edge. A convolutional layer would extract certain edges by making the parts of the edge brighter, and the space right next to the edge darker to exaggerate the existence of the edge.

[Figure: example pixel values before and after an edge is exaggerated. We’ll get into what these numbers mean exactly; this just shows an example.]
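To make this concrete, here’s a small sketch of the idea (my own toy example, not the original figure): a classic vertical-edge kernel applied to a tiny image where dark pixels meet bright ones.

```python
import torch
import torch.nn.functional as F

# A tiny 5x5 grayscale "image": dark on the left, bright on the right,
# so there is a vertical edge down the middle.
image = torch.tensor([[0., 0., 1., 1., 1.],
                      [0., 0., 1., 1., 1.],
                      [0., 0., 1., 1., 1.],
                      [0., 0., 1., 1., 1.],
                      [0., 0., 1., 1., 1.]])

# A classic vertical-edge kernel: it responds strongly wherever
# pixel values change from left to right.
kernel = torch.tensor([[-1., 0., 1.],
                       [-1., 0., 1.],
                       [-1., 0., 1.]])

# conv2d expects 4-D tensors: (batch, channels, height, width)
out = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))
print(out)  # strong responses (3s) along the edge, 0 where the image is flat
```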

Think of the process in a CNN as sending an image through layers that slightly change the pixel values of the image to make certain patterns more apparent, and make it easier for the fully connected layers later on in the network to classify the image.

We can infer that as an image goes through multiple convolutional layers, the features build on each other, and using the extracted features of a previous layer, a later convolutional layer could extract more abstract features.

Tensors & Image Data Representation

Before we dive into how exactly a convolutional layer would extract a feature, we need to briefly touch on how the image data is represented. Image data is represented by the pixel values that make up the image. One entire image is represented by a multidimensional tensor (or a collection) of pixel values. Let’s take a 2-dimensional tensor representation of an image.

[Figure: an image as a 2-D grid of pixel values. For our purposes, imagine this image is grayscale.]

This would be a grayscale image, where the x-axis represents the pixels along the width of the image, and the y-axis represents the pixels along the height of the image.

But most images are not grayscale; rather, they are RGB images. In this case we would have a 3-dimensional tensor. The first two dimensions would still be the height and width of the image, but the third dimension would represent the different channels of the image: specifically, the three channels of red, green, and blue.

Let’s take the example of a 28x28-pixel RGB image. The shape of the tensor that represents this image would be 3x28x28, where for each channel, there are 28x28 pixels representing that specific color. So there would be 28x28 red pixels, 28x28 green pixels, and 28x28 blue pixels.
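In PyTorch (the library we’ll use in the coding example later), you can see this shape directly. A quick sketch:

```python
import torch

# A single 28x28 RGB image: 3 channels, each one a 28x28 grid of pixels
image = torch.randn(3, 28, 28)

print(image.shape)     # torch.Size([3, 28, 28])
print(image[0].shape)  # just the red channel: torch.Size([28, 28])
```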

This is an incredibly important concept to grasp, as there are many different channels besides the simple red/blue/green channels of an image in its original input form.

In fact, each time an image passes through a convolutional layer, the output is considered a channel. If a convolutional layer has multiple filters that go over an image, then there will be multiple output channels that resulted from that image. Instead of these output channels representing red, blue, or green, they now represent different features that have been extracted.

To clarify some terminology, image data that is inputted into a layer is called an input channel. For example, an RGB image entering the first convolutional layer would be referred to as three input channels.

In later convolutional layers, there can be many more input channels (as many as you want).

In this sense, the shape of a tensor representing an image passing through any convolutional layer can be described as the number of channels x the height x the width of the image, where each point along the first axis represents a different feature being extracted: a different channel.

There is one more dimension to consider. The standard is for tensors to be 4-dimensional when they enter a convolutional layer, and this is because images come in batches. That’s right, multiple images are processed at once.

I do want to make this clear though. Although the images are processed at the same time, they do not affect each other, and there are no operations done between two images. Using batches just helps us pass more images through a CNN faster.

That puts the final shape of a tensor as the number of images x the number of channels x the height of the image x the width of the image.
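Here’s what that 4-dimensional shape looks like in PyTorch, with a batch passing through a convolutional layer (a minimal sketch; the batch size and channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# A batch of 16 RGB images, each 28x28:
# (number of images, channels, height, width)
batch = torch.randn(16, 3, 28, 28)

# A convolutional layer with 3 input channels and 8 filters,
# so it produces 8 output channels (feature maps) per image
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

out = conv(batch)
print(out.shape)  # torch.Size([16, 8, 26, 26]) -- still one entry per image
```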

The most important thing to grasp is how each image has multiple channels, and each channel has its own set of pixels that create an image that looks slightly different from the other channels of the exact same image.

Kernels & Filters

Now let’s talk about how different features are extracted. The process is actually pretty simple. A filter (represented by a tensor of a certain size) convolves each image, performing a dot product between the filter tensor and the set of pixels that the filter is currently on.

When I say convolve, I mean that the filter, which is usually much smaller than the image, performs an operation on a set of pixels the exact same size as the filter, then moves in the same progression as you would reading a book: moving right and down.

The filter would then slide by one pixel to the right (2 of the columns would be the same as what we just filtered), and perform the same dot product between the new corresponding elements of the filter tensor and the image. The filter will continue sliding right until it reaches the edge of the image.

Then, just like in a book, it would slide down by one pixel, and return to the far left. This is what I mean by “convolving.”
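Here’s that sliding motion as a naive sketch in plain Python (a real library does this far more efficiently, but the mechanics are the same):

```python
def convolve2d(image, kernel):
    """Slide `kernel` across `image` (both lists of lists), taking the
    dot product at each position, in reading order: right, then down."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):        # slide down one pixel at a time
        row = []
        for j in range(out_w):    # slide right until the edge
            # dot product of the kernel with the patch of pixels under it
            total = sum(kernel[m][n] * image[i + m][j + n]
                        for m in range(kh) for n in range(kw))
            row.append(total)
        output.append(row)
    return output

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, 1]]
print(convolve2d(image, kernel))  # [[6, 8, 4], [12, 14, 8]]
```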

It gets a little more complicated when you consider multiple input channels. Past the first layer, every convolutional neural network will have multiple input channels, even if the original image was grayscale. Remember that passing an image through a convolutional layer produces multiple output channels, determined by the number of filters.

Then how are multiple input channels processed into a different number of output channels? How could you have a situation where 6 input channels enter a convolutional layer, and 7 output channels leave it?

To describe this, I need to re-define the word filter as I have been using it. The actual tensor that performs the dot product operation on the image is called a kernel.

Each filter has a collection of kernels: one kernel for every input channel flowing into the layer. Each kernel is its own tensor, with its own completely unique values. Once each input channel has been transformed by its corresponding kernel, the resulting channels are summed element-wise to create one “output channel”.


When we use the word filter, we refer to the collection of kernels that creates one output channel. The number of filters (and therefore output channels) is arbitrarily chosen by the neural network programmer.
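You can see this filter/kernel structure in the shape of a convolutional layer’s weights. A quick sketch using the 6-in, 7-out example from above:

```python
import torch.nn as nn

# 6 input channels enter, 7 output channels leave
conv = nn.Conv2d(in_channels=6, out_channels=7, kernel_size=3)

# 7 filters, each a collection of 6 kernels (one per input channel),
# and each kernel is a 3x3 tensor with its own unique values
print(conv.weight.shape)  # torch.Size([7, 6, 3, 3])
```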

Padding & Pooling Operations

Some other processes of note are padding and pooling operations. I’ll briefly explain them.

The process of a kernel convolving an image inherently decreases the image size, as the pixels on the edge of the image never get the chance to sit in the middle of the kernel. The solution is to add a “padding” of pixel values around the image, often with a value of zero; this is called zero padding. It allows the pixels on the edge of the image to remain after the filter passes.

[Figure: with a padding of one, the shape of the tensor is maintained.]

The end result is that the overall size of the image is maintained while being passed through a convolutional layer.
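A quick PyTorch sketch of the difference (assuming a 3x3 kernel, for which a padding of one pixel is exactly enough to preserve the size):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)

no_pad = nn.Conv2d(1, 1, kernel_size=3)             # shrinks the image
padded = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # zero padding of one pixel

print(no_pad(x).shape)  # torch.Size([1, 1, 26, 26])
print(padded(x).shape)  # torch.Size([1, 1, 28, 28]) -- size maintained
```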

On the other hand, there are many cases where you want to intentionally decrease the size of your input image. For this, we use pooling operations.

Pooling operations are very similar to convolving kernels, in that a small window slides over the image and each patch of pixels condenses down to one pixel. But instead of calculating a dot product, a method like max pooling finds the maximum pixel value in the entire patch and sets that as the new singular pixel value of that area. You can also define the stride, which determines how far the pooling window steps after each operation. The larger the stride, the smaller the image will become after the max pooling operation is completed.

[Figure: the image shape drops from 5x5 to 2x2.]
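Here’s the 5x5-to-2x2 drop from the figure, sketched in PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)

# A 2x2 window stepping 2 pixels at a time: each patch of four
# pixels collapses down to its maximum value
out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out.shape)  # torch.Size([1, 1, 2, 2]) -- down from 5x5
```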

Coding Example

Now that we have a basic understanding of how CNNs work, let’s create one!

The only necessary prerequisite knowledge is having read this article, and having a basic skill set in Python. (The other requirement is giving this article a clap ;)

We’ll be using the Fashion MNIST dataset — a dataset of 70,000 images (60,000 training and 10,000 testing) of different pieces of clothing. There are 10 different types of clothing, and our goal is to create a CNN that can effectively classify an image as one of those ten types of clothing.

There are no downloads necessary for this tutorial! We’ll be using Google Colab. We’ll also be using PyTorch, a friendly deep learning library built on Python.

However, for this coding tutorial, we’ll be switching over to YouTube, where it is easier to see the code being written. Unfortunately, my voice and the video are not perfectly in sync, so listen to my voice first, and then write the code second!
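If you’d rather stay in text, here’s a minimal sketch of the kind of network we build in the video (my own outline; the exact architecture and hyperparameters in the video may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# Fashion MNIST: 28x28 grayscale images, 10 clothing classes
train_set = datasets.FashionMNIST(root="./data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)   # 1 input channel (grayscale)
        self.conv2 = nn.Conv2d(6, 12, kernel_size=5)
        self.fc1 = nn.Linear(12 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 60)
        self.out = nn.Linear(60, 10)                  # 10 clothing classes

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28 -> 24 -> 12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 12 -> 8 -> 4
        x = x.flatten(start_dim=1)                  # hand off to the FC layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.out(x)

network = Network()
optimizer = torch.optim.Adam(network.parameters(), lr=0.01)

for images, labels in train_loader:   # one pass over the training set
    preds = network(images)
    loss = F.cross_entropy(preds, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The two convolutional layers extract features (with max pooling shrinking the image in between), and the fully connected layers at the end do the actual classification, exactly as described above.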

Hope you like the thumbnail (;

Happy Learning! Check out my Newsletter, and Socials.
