A quick guide to Convolutional Neural Networks

Raj
9 min read · Oct 18, 2021



In the domain of computer vision, convolutional neural networks are among the most widely used algorithms. To understand what convolutional neural networks (CNNs) are and how they function, it is important to first understand why they are used and which problems of fully connected neural networks they resolve.

So then, what is the problem with regular fully connected networks?

Whenever we try to tackle a problem that involves a lot of data, such as computer vision problems, the input vectors can become very large. For an image of dimensions 1000px × 1000px × 3, the number of input features turns out to be 3 million, so if we have 1000 hidden units in the first layer of the neural network, the weight matrix of the first layer ends up with dimensions (1000, 3 million), i.e. 3 billion parameters. This is computationally very expensive and is one of the primary reasons why CNNs are preferred over networks comprised only of fully connected layers.
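
To make this concrete, here is that arithmetic as a quick Python sketch (the numbers are just the dimensions described above):

```python
# Fully connected layer on a 1000 x 1000 x 3 image
input_features = 1000 * 1000 * 3      # 3,000,000 input features
hidden_units = 1000

# Weight matrix of shape (hidden_units, input_features)
fc_weights = hidden_units * input_features
print(f"{fc_weights:,}")              # 3,000,000,000 weights in the first layer alone
```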

Convolutional neural networks are broadly comprised of three types of layers:

  1. Convolutional Layer
  2. Pooling Layer
  3. Fully Connected Layer

To start grasping the intuition behind how a deep convolutional neural network behaves, we need to understand how each of these layers functions.

Convolutional Layer

A convolutional layer is where the majority of the computations occur and is the fundamental building block of a convolutional neural network. It has several components that shape and define the computations of this layer, such as filter size, number of filters, stride, and padding.

Let’s consider an example to understand what goes on in this layer —

If we are presented with data of dimensions 5 x 5 x 1 (a grayscale image), then to find and extract the features of objects inside the image we carry out a process called “convolution”, i.e. blending one function with another. Here our input image is one function; what is the other function then? The other function is called a “filter” or a “kernel”, and it is usually of a smaller size than the input matrix.

Steps —

  1. The filter is placed on top of the 5x5 input image, and a dot product is computed between the filter and the pixels of the image it overlaps.
  2. This value becomes the first element of our result matrix, also known as the “feature map”.
  3. The filter is then moved one column to the right and the previous step is repeated to get the second element of the first row of the result matrix.
  4. This keeps recurring until we run out of columns, at which point we move the filter one step down and start over from the left column; these values become the second row of the feature map, and so on (see the code sketch after this list).
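
Below is a minimal NumPy sketch of this procedure (the 5×5 input and 3×3 filter values are made up for illustration, and the helper name is my own):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out = n - f + 1                          # output dimension: n - f + 1
    feature_map = np.zeros((out, out))
    for i in range(out):                     # move the filter down
        for j in range(out):                 # move the filter right
            window = image[i:i + f, j:j + f]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(25).reshape(5, 5)          # toy 5x5 "grayscale image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])              # simple vertical-edge detector
print(convolve2d(image, kernel).shape)       # (3, 3)
```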

FILTER

Refer to the image below to understand how a feature map (red) is created using a filter or feature detector (green):

source: https://rstudio-conf-2020.github.io/

An observation to be made here is that a 5×5 input when passed through a 3×3 filter yields a 3×3 dimensional result. A convolution operation reduces the dimensions of the input.

Therefore, to generalize: the feature map of an n×n input convolved with an f×f filter has dimensions

(n − f + 1) × (n − f + 1)

so in the above example, if we replace the values we get (5 − 3 + 1) × (5 − 3 + 1), i.e. the dimension of the feature map is 3x3.

Although there are use cases where we can use pre-determined filter values, such as the Sobel filter or the Scharr filter, in computer vision problems we usually treat the values of the filter as parameters that the model learns during backpropagation. This lets the filter capture features such as the edges of an image in any orientation (45°, 70°, 73°, etc.), rather than only the orientations that hand-designed filters encode.
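
For instance, a fixed Sobel filter (which responds to vertical edges) can be applied with SciPy; a quick sketch, with a random image standing in for real data:

```python
import numpy as np
from scipy.signal import convolve2d

sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]])            # classic hand-designed Sobel filter

image = np.random.rand(5, 5)                # stand-in grayscale image
# Note: SciPy performs a true (flipped) convolution,
# unlike most deep learning libraries (see the note on flipping below).
edges = convolve2d(image, sobel_x, mode="valid")
print(edges.shape)                          # (3, 3) -- same n - f + 1 rule
```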

What if we want to extract more than one feature? Let us assume we need to find and extract not one but two rudimentary features (horizontal and vertical edges) from an RGB input image. How do we go about implementing this?

THE NEED FOR MULTIPLE FILTERS

Multiple filters are generally used to extract several kinds of information from an input. With multiple filters, the convolution produces an output with multiple channels, one corresponding to each filter. Thus, we can have one filter responsible for horizontal edge detection and another responsible for vertical edge detection. The output we obtain then has two channels, each corresponding to the output of one of the filters with which the input image was convolved.

To demonstrate with an example: a 6x6x3 RGB input convolved with two 3x3x3 filters yields a 4x4x2 feature map.

Note — The ‘*’ denotes a convolution operation.

It is noteworthy that the number of layers (channels) of the filter must match the number of channels in the input image, as each channel of the input feature matrix is convolved with the corresponding channel of the filter matrix and the results are summed into a single output channel.

As one might have noticed from the example illustrated above, the number of channels of the output feature map depends on the number of filters being used. As there are two filters here, the output feature map has a final dimension of 2, i.e. the depth of the feature map equals the number of filters used to calculate it.

Therefore, we can generalize this as,

(n × n × nc) * (f × f × nc) → (n − f + 1) × (n − f + 1) × nc′

where n = input dimension, f = filter dimension, nc = number of channels, nc′ = number of filters, * = convolution
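
As a rough NumPy sketch of this shape rule (random values stand in for a real image and learned filters; the helper name is my own):

```python
import numpy as np

def conv_multi(image, filters):
    """image: (n, n, nc); filters: (num_filters, f, f, nc).
    Each filter spans all input channels and yields one output channel."""
    n, _, nc = image.shape
    num_filters, f = filters.shape[0], filters.shape[1]
    out = n - f + 1
    feature_map = np.zeros((out, out, num_filters))
    for k in range(num_filters):
        for i in range(out):
            for j in range(out):
                window = image[i:i + f, j:j + f, :]       # f x f x nc block
                feature_map[i, j, k] = np.sum(window * filters[k])
    return feature_map

image = np.random.rand(6, 6, 3)                 # n=6, nc=3 (RGB)
filters = np.random.rand(2, 3, 3, 3)            # nc' = 2 filters of size 3x3x3
print(conv_multi(image, filters).shape)         # (4, 4, 2)
```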

PADDING

The process of padding essentially adds a specified number of rows and columns around the input to any convolutional layer. It resolves two problems —

  1. As we saw, one of the side effects of performing a convolution operation is that it shrinks the dimensions of the input image. This makes it a challenge to create deep neural networks, as the representation keeps shrinking layer after layer.
  2. The pixels near the edges are used much less than the ones near the center, since the filter overlaps them in fewer positions per convolution step. This results in a lot of information from around the edges and corners getting thrown away. Padding reduces this effect.

By referring to the equation we used to find the output dimensions in the previous section, we can modify the output calculation as

(n + 2p − f + 1) × (n + 2p − f + 1)

so for a 6x6 input with a 3x3 filter, if we have padding (p) = 1 the output becomes (6 + 2 − 3 + 1) × (6 + 2 − 3 + 1) = 6x6. Hence, dimensionality was preserved as the output did not shrink.

Based on this knowledge we can define two types of convolutions —

  1. Valid convolutions: no padding is used, hence the output dimension (n − f + 1) is smaller than the input dimension; the output shrinks.

  2. Same convolutions: padding is used to keep the input and output dimensions the same.

Output dimension kept the same using the proper padding value (source: https://datahacker.rs/what-is-padding-cnn/)

Since we want n + 2p − f + 1 = n, we can solve for the value of padding (p) as p = (f − 1) / 2. This is also why filters are usually odd-sized (3x3, 5x5, etc.): it makes p a whole number.
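
A small sketch of this in NumPy, padding a 6×6 input so that a 3×3 filter preserves its size:

```python
import numpy as np

n, f = 6, 3
p = (f - 1) // 2                     # p = 1 for a 3x3 filter

image = np.random.rand(n, n)
padded = np.pad(image, p)            # add p rows/columns of zeros on every side
print(padded.shape)                  # (8, 8)
print(padded.shape[0] - f + 1)       # 6 -- output matches the original input size
```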

STRIDES

In a CNN, the stride is the distance that the filter moves over the input to generate the next element in the output feature map. By default the stride is set to 1. If we set the stride to 2, the filter moves two places instead of one to generate the next element of the output. Unsurprisingly, a larger stride leads to a smaller feature map, though a stride of 2 is typically the maximum used in practice.

As a consequence of applying a stride, the filter may move out of bounds of the last column or the padding; in that case the computation is done only for positions where the filter fully overlaps the input. The output-dimension formula can therefore yield fractional values, in which case the value is rounded down (floored).

Note — In mathematics, a convolution involves a step where the filter is flipped along both axes (rotated 180°) before being slid over the input. In machine learning this step is not carried out, so technically what we perform is a form of cross-correlation; by convention, however, we still call it convolution. The advantage of flipping the filter is that the resulting operation is associative, i.e. A * (B * C) = (A * B) * C, which helps in signal processing tasks but is of no consequence in deep learning.

The change in dimensionality due to strides can be generalized as

⌊(n + 2p − f) / s + 1⌋ × ⌊(n + 2p − f) / s + 1⌋

where s = stride, f = filter dimension, p = padding, n = input dimension, and ⌊·⌋ denotes rounding down.
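
This rule is easy to wrap in a helper function (a sketch; the name conv_output_size is my own):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output dimension of an n x n input convolved with an f x f filter,
    padding p and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(5, 3))          # 3  (valid convolution)
print(conv_output_size(6, 3, p=1))     # 6  (same convolution)
print(conv_output_size(7, 3, s=2))     # 3  (stride 2, fractional result floored)
```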

POOLING

Pooling layers in CNNs are used to reduce the spatial dimensions of the feature maps, and hence the amount of computation performed in subsequent layers; a pooling layer itself has no parameters to learn.

One way to think about a pooling layer is as a summarizing layer: it reduces the network’s dependency on precisely positioned features in favor of a summarized representation of those features. This also helps the network generalize better to inputs with positional variations.

Two categories of pooling are normally used:

Max Pooling

It selects the maximum value from the region of the feature map covered by the filter window. Therefore, we can say that only the most prominent features are retained.

Source: www.geeksforgeeks.org


Average Pooling

It computes the average of the values in the region of the feature map covered by the filter window at any given time. Therefore, we can say that a smoothed average of the features is retained.

Source: www.geeksforgeeks.org
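
Here is a minimal NumPy sketch covering both pooling types, using the common 2×2 window with stride 2 (the input values are arbitrary):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling over non-overlapping windows."""
    out = (x.shape[0] - size) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = x[i*stride:i*stride + size, j*stride:j*stride + size]
            result[i, j] = window.max() if mode == "max" else window.mean()
    return result

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 0.],
              [4., 8., 3., 5.]])
print(pool2d(x, mode="max"))   # [[6. 4.] [8. 9.]]
print(pool2d(x, mode="avg"))   # [[3.75 2.25] [5.25 4.25]]
```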

Now that we have an intuition about the main components required to make a CNN work, we can have a look at one layer of a CNN to see the change in dimensionality.

ONE LAYER OF A CNN

One layer of a simple CNN: a 6x6x3 input image is convolved (‘*’) with two 3x3x3 filters and passed through an activation, producing a 4x4x2 output.

The convolution is a linear operation (with a bias typically added per filter) followed by a non-linear activation (ReLU), analogous to computing z = Wx + b and a = g(z) in a fully connected layer.

Here, we can see how the output feature map is a 4x4x2 matrix due to having two filters.
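
A sketch of such a layer’s forward pass in NumPy (random values stand in for learned filters, and the biases are zeroed for illustration):

```python
import numpy as np

def conv_layer(image, filters, biases):
    """One CNN layer: convolve, add a bias per filter, apply ReLU."""
    n = image.shape[0]
    num_filters, f = filters.shape[0], filters.shape[1]
    out = n - f + 1
    z = np.zeros((out, out, num_filters))
    for k in range(num_filters):
        for i in range(out):
            for j in range(out):
                z[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k]) + biases[k]
    return np.maximum(z, 0)                      # ReLU non-linearity

image = np.random.rand(6, 6, 3)                  # 6x6x3 input
filters = np.random.randn(2, 3, 3, 3)            # two 3x3x3 "learned" filters
biases = np.zeros(2)                             # one bias per filter
print(conv_layer(image, filters, biases).shape)  # (4, 4, 2)
```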

Finally, we take a look at what a multilayer CNN would look like —

MULTILAYERED CNN

Third dimension of Conv1 is 6 as 6 filters of size 5x5 were used

The above image illustrates a CNN that takes an image of the digit ‘7’ as input and ends with a softmax output to perform multi-class classification, identifying the digit.

The architecture of such a network can be denoted in short as —

CONV — POOL — CONV — POOL — FC — FC — FC — SOFTMAX
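
As a rough sketch, this architecture could be written with the Keras API as follows; the filter counts and fully connected sizes here are LeNet-inspired assumptions, not specified above (apart from Conv1’s 6 filters of size 5x5):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),            # grayscale digit image
    layers.Conv2D(6, 5, activation="relu"),     # CONV: 6 filters of size 5x5
    layers.MaxPool2D(2),                        # POOL
    layers.Conv2D(16, 5, activation="relu"),    # CONV
    layers.MaxPool2D(2),                        # POOL
    layers.Flatten(),
    layers.Dense(120, activation="relu"),       # FC
    layers.Dense(84, activation="relu"),        # FC
    layers.Dense(10, activation="softmax"),     # FC + SOFTMAX over 10 digits
])
model.summary()
```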

Hopefully this article has helped in making convolutional networks and the change in dimensionality easier to understand!
