Convolutional Neural Network (CNN)

Dhruv Pamneja
9 min read · May 5, 2024


The convolutional neural network (CNN), as it is widely known, is a type of neural network that sits at the heart of most tasks involving image and video data. This architecture has made many vision tasks possible and also plays a role in generative AI.

CNN and Human Brain

Now, before we head into this, I would like to briefly draw a parallel between the working of the human eye and brain and that of a CNN. Since CNNs deal with vision based tasks, a high level understanding of how our own visual system works will be beneficial.

Let us assume I present before you an image such as the one shown below:

At first glance, you would see an image of three people meeting in a park with their respective dogs. Now, if you were asked to detect whether a man is present, your eyes would focus on the man in the image and signal to your brain the presence of the man in the middle.

Over here, the same image is received as an input by our eyes and is passed through our brain over the course of multiple layers. As we know, visual input is sent to the cerebral cortex of our brain, from where it moves to the visual cortex, where it is processed via many layers, each of which tries to gauge certain features or aspects of the input, finally yielding an output in our consciousness.

Similarly, the structure of a CNN has been built to mimic the above flow so as to perform many vision based tasks. An overview of the flow in a CNN is as follows:

  • Convolution
  • Pooling
  • Flattening

Let us see what these are and how a CNN integrates them to build an architecture.

Reading Image Data

Now, whenever a computer tries to process and read an image, it does so by reading it as a matrix of pixels. An image is simply a grid of many pixels, and images fall primarily into two categories:

  • Black and White (B&W)
  • Coloured

Now, each pixel can take any integer value from 0 to 255, and an image is an amalgamation of many such pixels. For a B&W image, 0 implies black, while 255 implies white, and each number in between represents a shade of grey between the two.

To visualise the same for B&W images, we can see the image below:

As we can see above, a B&W image of n x n pixels is converted into a single channel or matrix of n x n dimensions, with each cell taking a value between 0 and 255.

Now, similarly, let us go ahead and visualise a colour image:

Over here, we can see how a colour image has three channels of n x n dimensions, where each channel represents one of the colours given below:

  • Red
  • Green
  • Blue

It is a well known fact that by combining these three colours, any known colour can be created.

Convolution

So basically, we first perform the convolution operation over the given image so as to extract key features from particular local regions of the image, which may be omitted or less prominent when looking at the image as a whole. Along with that, this operation also helps reduce the overall computational cost at the time of training, as the network focuses on local regions to pick up and learn certain features.

So our first step before this process is to scale the image so that each cell of the image matrix falls into the range [0, 1], after which further operations can be applied on it uniformly. We do so using min-max scaling, which for 8-bit images amounts to dividing each cell by 255 to bring it into the desired range.
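As a minimal sketch (assuming NumPy and an 8-bit greyscale image; the pixel values are illustrative), the scaling step simply divides every pixel by 255:

```python
import numpy as np

# A toy 6x6 greyscale image with 8-bit pixel values (0 = black, 255 = white).
image = np.array([
    [ 10,  10,  10, 200, 200, 200],
    [ 10,  10,  10, 200, 200, 200],
    [ 10,  10,  10, 200, 200, 200],
    [ 10,  10,  10, 200, 200, 200],
    [ 10,  10,  10, 200, 200, 200],
    [ 10,  10,  10, 200, 200, 200],
], dtype=np.float32)

# Min-max style scaling for 8-bit images: divide by 255 so every cell lies in [0, 1].
scaled = image / 255.0
print(scaled.min(), scaled.max())  # ~0.039, ~0.784
```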

Now let us look below to understand the components used while performing the convolution function.

We have now introduced a kernel or filter, as we can see in the above image. This is a matrix of smaller dimensions than the image, and as the name suggests, acts as a filter to extract and retain information from the source image. The above filter is also called the horizontal edge filter.

So here, we multiply each 3x3 sub-matrix of the scaled image element-wise with the kernel, and use the summation of these products as the value of the corresponding cell of the output matrix.

Once we have calculated the value for the ij-th cell of the output matrix, we take a stride of one towards the right and perform the same operation across all columns. Similarly, to cover the rows, we take a stride of one in the downwards direction. The stride can be defined by the user, and it determines the dimensions of the output.

In the above example, the first output cell is computed in this way, and the same operation is repeated for every position of the kernel.
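Below is a minimal NumPy sketch of this sliding-window operation. The image values and stride are illustrative assumptions; the kernel shown is a standard horizontal edge filter of the kind mentioned above:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position multiply the
    overlapping sub-matrix element-wise with the kernel and sum the result."""
    n, k = image.shape[0], kernel.shape[0]
    out_dim = (n - k) // stride + 1
    output = np.zeros((out_dim, out_dim))
    for i in range(out_dim):
        for j in range(out_dim):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            output[i, j] = np.sum(patch * kernel)
    return output

# A common horizontal edge filter (illustrative choice of kernel values).
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]])

# A 6x6 scaled image convolved with a 3x3 kernel gives a 4x4 output.
scaled = np.random.rand(6, 6)
print(convolve2d(scaled, kernel).shape)  # (4, 4)
```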

To put the above mathematical operation into a simple analogy, think of the kernel as a magnifying glass which focuses on a specific localised part of the image, revealing details that we might have missed at a normal glance. By moving the magnifying glass across the image, we can extract different features from various locations.

Now, the output is again scaled using min-max scaling, after which it signifies the presence of the feature the filter was trying to locate, as well as its location in the original image. We can have many filters, for example:

  • Vertical Edges Filter
  • Horizontal Edges Filter
  • Round Edges Filter

As far as the output is concerned, we can gauge its dimensions by using the formula: output = n - k + 1.

Here, output is the dimension of the output matrix, n is the input image's dimension, and k is the dimension of the kernel. However, there is a small issue that still persists. As we can see from the example above, a 6x6 image is reduced to a 4x4 image in the output, i.e. we have lost some information from the borders of the image, which may lead to a loss of information in deeper layers. So to prevent that, we create a padding around the input image.

This is nothing but the addition of dummy rows and columns around the image, so as to prevent the loss of information from the scaled input image in the final output. This can be seen in the image below, where we have added a padding of one across its dimensions.

Here, padding can be of many types, such as :

  • Zero padding
  • Nearest value padding
  • Causal padding
  • Constant padding
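As a small illustration (assuming NumPy), `np.pad` can produce some of these paddings; the two modes shown correspond to zero padding and nearest value padding:

```python
import numpy as np

scaled = np.random.rand(6, 6)  # the 6x6 scaled image from before

# Zero padding: surround the image with a border of zeros (p = 1 gives 8x8).
zero_padded = np.pad(scaled, pad_width=1, mode="constant", constant_values=0)

# Nearest value padding: repeat the closest edge pixel instead of using zeros.
edge_padded = np.pad(scaled, pad_width=1, mode="edge")

print(zero_padded.shape, edge_padded.shape)  # (8, 8) (8, 8)
```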

Now if we check, our formula states that the result of convolving an 8x8 matrix (the 6x6 image with a padding of one) with a 3x3 kernel has dimensions 6x6.

Hence, our final formula, accounting for padding and stride, will be as follows:

output = ((n + 2p - k) / s) + 1

Here, output is the dimension of the output matrix, n is the input image's dimension, p is the padding added, s is the stride defined by the user, and k is the dimension of the kernel.
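As a quick sanity check of this formula, here is a small helper (hypothetical, for illustration only) that computes the output dimension:

```python
def conv_output_size(n, k, p=0, s=1):
    # output = (n + 2p - k) / s + 1, floored to an integer
    return (n + 2 * p - k) // s + 1

print(conv_output_size(6, 3))       # 4 -> 6x6 image, 3x3 kernel, no padding
print(conv_output_size(6, 3, p=1))  # 6 -> a padding of 1 restores the 6x6 size
print(conv_output_size(8, 3))       # 6 -> the 8x8 padded matrix example above
```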

Another interesting part here is the initialisation and updating of the kernels. While the initial values of the kernels are typically chosen through initialisation techniques, they are not hardcoded. Instead, these values are updated through a process called back propagation, based on the training data. This allows the network to learn and adapt to the specific features present in the images it encounters, leading to improved performance.

To achieve back propagation, once we get the output matrix, we apply a ReLU function over it, which acts as an activation function. It can be described as f(x) = max(0, x), i.e. negative values are set to zero while positive values pass through unchanged.

We particularly use ReLU or PReLU (parametric ReLU), as for these functions we can easily find the derivative during back propagation. This is essential when the kernels need to be updated, and hence we use these activation functions.
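A minimal sketch of ReLU and its derivative (assuming NumPy; the feature map values are illustrative) might look like this:

```python
import numpy as np

def relu(x):
    # ReLU: keep positive values, zero out negatives.
    return np.maximum(0, x)

def relu_derivative(x):
    # Derivative used during back propagation: 1 where x > 0, else 0.
    return (x > 0).astype(float)

feature_map = np.array([[-1.2, 0.5],
                        [ 2.0, -0.3]])
print(relu(feature_map))             # [[0.  0.5] [2.  0. ]]
print(relu_derivative(feature_map))  # [[0. 1.] [1. 0.]]
```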

Pooling

After the convolution process has learnt the features and extracted the information, there is one important property which we want the model or network to retain while it trains over multiple images and various convolutions, and that is Location Invariance.

Pooling plays a crucial role in achieving location invariance. This property ensures that the network can recognise and extract features regardless of their specific location within the image. While convolution helps detect features, pooling summarises these features across a local region, making the network more robust to small shifts or translations in the object’s position. This allows the network to learn and identify the same feature even if it appears in different parts of the image, leading to more accurate object detection and recognition.

To achieve this, we use the concept of Pooling. To put it simply, pooling takes a subset of values from the overall matrix and then chooses a single value based on a particular logic, retaining a data point which encapsulates the entire pooled sub-region.

Now, techniques for pooling are:

  • Average Pooling
  • Min Pooling
  • Max Pooling

In the example given below, we have used max pooling with a 2x2 window, where we select the corresponding sub-matrix and retain its maximum value. Then, we take a stride of two units in both the rightwards and downwards directions to get the final pooled matrix.

Here, let us assume that the output after a convolution and ReLU activation looks like this, over which we apply max pooling as follows.
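Below is a minimal NumPy sketch of max pooling with a 2x2 window and a stride of two (the feature map values are illustrative assumptions, not the ones from the article's figure):

```python
import numpy as np

def max_pool(feature_map, pool_size=2, stride=2):
    """Take the maximum of each pool_size x pool_size window,
    moving stride cells right and down between windows."""
    n = feature_map.shape[0]
    out_dim = (n - pool_size) // stride + 1
    output = np.zeros((out_dim, out_dim))
    for i in range(out_dim):
        for j in range(out_dim):
            window = feature_map[i * stride:i * stride + pool_size,
                                 j * stride:j * stride + pool_size]
            output[i, j] = np.max(window)
    return output

# A 4x4 feature map (post convolution + ReLU) pooled with a 2x2 window, stride 2 -> 2x2.
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 8, 9, 4],
                        [3, 1, 2, 6]], dtype=float)
print(max_pool(feature_map))  # [[6. 5.] [8. 9.]]
```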

Now, an interesting thing here is that both the convolution and pooling layer can be used in a combination in the overall network i.e. we can stack them on top of one another and create multiple combinations.

Flattening

Let us now assume the below given matrices to be the output of the pooling layer:

In this process, we simply flatten all of these into a single layer, such that they become one vector comprising all of the above values. With this, they act as the input layer of a regular ANN, which is then connected to fully connected layers to learn and yield the final output. Note that the values of this flattened layer can be arranged in any consistent order before being fed to the final ANN; an example is given below.

All the steps above are used to create a CNN, which works as shown in the diagram below depicting the overall working of the CNN:
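To tie the steps together, here is a minimal sketch of such a network using the Keras API from TensorFlow. The choice of framework, the 28x28x1 greyscale input size and the 10 output classes are my own illustrative assumptions, not something prescribed by the article:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A minimal CNN mirroring the flow described above:
# convolution -> ReLU -> pooling (stacked twice) -> flatten -> dense layers.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),            # e.g. a 28x28 greyscale image
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(64, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),                           # the flattening step
    layers.Dense(64, activation="relu"),        # regular ANN (fully connected) layers
    layers.Dense(10, activation="softmax"),     # e.g. 10 output classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Stacking two convolution and pooling blocks here reflects the earlier point that these layers can be combined and repeated before flattening.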

In conclusion, CNNs are powerful neural networks specifically designed for processing image and video data. Inspired by the human visual system, they leverage convolution, pooling, and flattening operations to extract and retain features, reduce computational complexity, and achieve location invariance.

This allows them to excel in various vision tasks like image recognition, object detection, and image segmentation. By mimicking the hierarchical structure of the human brain, CNNs learn complex representations of visual information, leading to remarkable performance in a wide range of applications.

You can further read about CNNs in the research paper given here, which describes CNNs and their workings in depth.

Credits

I would like to take the opportunity to thank Krish Naik for his series on deep learning on his YouTube channel, which has allowed me to learn and present the above article. You can check out his YouTube channel here. Thanks for reading!
