A Quick Intro To Convolutions In CNNs

Florian Abel
Published in CodeX · 5 min read · Mar 18, 2023

Disclaimer: This article requires a basic understanding of neural networks and how fully connected layers function.

Photo by Markus Winkler on Unsplash

Introduction

Convolutional neural networks (CNNs) are a type of network architecture commonly used in computer vision tasks. Their convolutional layers provide several advantages over the fully connected layers found in classical neural network architectures: they detect features inside images while requiring far less computational power. In fact, a convolutional layer can be viewed as a fully connected layer in which some of the weights are set to zero and others are shared between neurons.

Why do we need convolutions?

Computationally expensive
Image analysis problems often depend on comparably large inputs. For an image of 1,000×1,000 pixels with three color channels (RGB), the input layer already consists of 1,000 × 1,000 × 3 = 3,000,000 input features. If a fully connected layer with only 1,000 neurons follows, the weight matrix for that first layer alone already contains 3 billion values that need to be learned. A network architecture built from fully connected layers would therefore be computationally very expensive.
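The arithmetic above can be checked in a few lines (the image and layer sizes are the illustrative numbers from the text):

```python
# Rough parameter count for a single fully connected layer on a
# 1,000 x 1,000 RGB image (illustrative numbers, not a real network).
inputs = 1_000 * 1_000 * 3   # 3,000,000 input features
neurons = 1_000              # neurons in the first hidden layer
weights = inputs * neurons   # entries in the weight matrix
print(weights)               # 3,000,000,000 (3 billion)
```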

Overfitting
As outlined above, a network with fully connected layers would have a very large number of variables to learn. For the network to generalize well and avoid overfitting, a large amount of training data is needed. Besides the computational expense, it would therefore be very time-consuming and costly to collect the required amount of training data.

Feature detection
Fully connected layers also learn the features present in the input images; however, they learn them not solely based on their existence but also based on their location in the image. If such a network learns to detect a feature, it might not detect the same feature in another part of the frame. CNNs, on the other hand, can learn and detect features independently of their position in the frame.

Overall
Convolutional layers help to substantially reduce the required computational power and required training data in comparison to networks using fully connected layers. Furthermore, they help the network to learn about features independently of their position.

How do convolutions work?

In a convolution, a filter (or kernel) is placed on and shifted over an image in order to detect certain features. Depending on the filter and its position in the network, it will detect simpler or more complex features within the image.

What is a filter (or kernel)
A filter (or kernel) is a matrix of values, typically of size 3×3, 5×5, or 7×7. Depending on its values and its layer in the network, it helps detect different features. In neural networks, the filter values are variables that are learned during training. In earlier applications in other fields, filters were often designed manually; in machine learning, however, it proved more useful to let the network learn the best values instead of trying to design the filters by hand.

The convolution operation
When the filter is placed on an image, its values are multiplied element-wise with the underlying image values, and the products are summed up. The result indicates the presence (or absence) of the feature represented by this filter at this particular position.

Convolution for one position in the image
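As a minimal sketch of this single-position operation (the patch and filter values here are made up for illustration, using NumPy):

```python
import numpy as np

# Hypothetical 3x3 image patch containing the left side of an edge.
patch = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [1, 1, 0]])

# Hypothetical 3x3 filter.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Element-wise multiplication followed by a sum gives one value
# of the feature map.
value = np.sum(patch * kernel)
print(value)  # 3
```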

Shifting the filter along the image and calculating the convolution in each position essentially generates a feature map, indicating the presence of this feature in different positions of the image.

Convolutions in different positions of the image by shifting the filter

It is important to note that the results of the convolutions are not pixel/color values, but a map of the existence of a feature (the feature represented by the filter) at each specific position.
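The shifting process described above can be sketched as a small NumPy function (a plain loop for clarity; real frameworks use far faster implementations, and what deep learning libraries call "convolution" is technically cross-correlation, as here):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and compute the element-wise
    product-and-sum at every position where the kernel fully fits
    (a "valid" convolution, i.e. no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # output (feature map) size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
```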

Edge detection example for convolutions

Edges are one of the simplest features and are detected in the early layers of the network. A filter for vertical edge detection might look like this:

Filter for vertical edge detection

Let’s assume we have an image containing only black and white pixels (represented by 0 and 1), forming a vertical edge. To detect that edge, we use the above filter to calculate the convolutions. The resulting feature map clearly shows the position of the edge:

Vertical edge detection
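The edge detection example can be reproduced with a few lines of NumPy (the 6×6 image and the filter values follow the example above):

```python
import numpy as np

# 6x6 black-and-white image (1 = white, 0 = black) with a vertical
# edge between columns 2 and 3.
image = np.array([[1, 1, 1, 0, 0, 0]] * 6)

# Vertical edge detection filter.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# "Valid" convolution: the 3x3 filter fits in 4x4 unique positions.
feature_map = np.array([
    [np.sum(image[i:i+3, j:j+3] * kernel) for j in range(4)]
    for i in range(4)
])
print(feature_map)
# Every row is [0, 3, 3, 0]: the non-zero columns mark the edge.
```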

Padding

As you might have noticed, the feature map in the above example is a 4x4 matrix, whereas our input was a 6x6 matrix. Due to the filter size of 3x3, the filter can only occupy 4 positions horizontally and 4 vertically, resulting in 16 unique positions and a 4x4 output.

Unique filter positions

Valid convolutions
A convolution, in which the output is smaller than the input, is called a “valid convolution”.

Same convolutions
In some applications, however, it is important to maintain the dimensions through the convolution process. This type of convolution is called a “same convolution”, meaning that the output has the same size as the input.

How to obtain “same convolutions”
In order for the output to have the same size as the input, the filter must have as many unique positions as the original input has values. E.g. for a 6x6 input matrix, we need 6x6 = 36 unique positions for the filter to be placed.

In order to create the extra positions, the input matrix can be padded with extra values (in this case zeros):

Same convolution with zero padding of the original input
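A sketch of the same idea in NumPy: padding the 6×6 input with a one-pixel border of zeros gives the 3×3 filter 36 unique positions, so the feature map matches the input size (for an f×f filter, the required padding per side is p = (f − 1) / 2):

```python
import numpy as np

# Same 6x6 image and vertical edge filter as before.
image = np.array([[1, 1, 1, 0, 0, 0]] * 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Zero-pad with a border of width (3 - 1) / 2 = 1 on each side.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# A "valid" convolution over the padded 8x8 input now yields 6x6.
feature_map = np.array([
    [np.sum(padded[i:i+3, j:j+3] * kernel) for j in range(6)]
    for i in range(6)
])
print(feature_map.shape)  # (6, 6) -- same size as the input
```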

Summary

This should have given you a good overview of how convolutions, the defining operation in convolutional neural networks, work. As this article serves only as a quick introduction, there is still much more to learn in order to fully understand how CNNs work. However, it should give you a good starting point for reading more detailed material on the topic.
