Convolutional Neural Network: what it is and how it works

Published in Analytics Vidhya, May 14, 2021

In this article we are going to talk about the Convolutional Neural Network (CNN).

Are you just starting your journey in Deep Learning? Then I recommend reading this entire article, because I’m going to explain in general terms what CNNs are.

P.S. I will not be using mathematical concepts in this article.

Are you ready?

Let’s start with the basics!

What is a convolutional neural network?

A convolutional neural network is an artificial neural network architecture designed to work on images. It becomes especially useful as images grow (for example, beyond 64 x 64 pixels), because that is where an ordinary network starts to struggle.

Compared with a normal artificial neural network (ANN), a CNN usually gives better results on image tasks such as classification and detection (we will go into more detail later).

This type of neural network, also called a ConvNet, is often used in object recognition. CNNs are used in the medical field (for example in cancer detection), in self-driving cars (e.g. Tesla), in the recognition of agricultural fields and much more.

Convolutional neural networks, which are widely used in Computer Vision, have become very popular in recent years because new and increasingly powerful Deep Learning algorithms keep being developed as technology advances.

With that preamble done, I can finally start talking about the differences between a normal neural network and a ConvNet.

What distinguishes a CNN from an artificial neural network?

The structure of an artificial neural network differs in important ways from the structure of a convolutional neural network.

Theoretically you could use an ANN to perform image recognition, but on anything other than small images the results would usually be poor and the training expensive. Let me explain why.

Imagine a normal artificial neural network with thousands of neurons arranged in hidden layers, where every neuron is connected to every neuron in the next layer:

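Because every pixel of the image needs its own weight for every neuron, such a network would run into the following problems on images:

High computational cost, since the number of weights grows with the number of pixels
Unstable training, since gradients can explode (or vanish) in a very deep network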
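
To get a feeling for the computational cost, here is a minimal sketch that counts the weights and biases of a single fully connected layer fed with a flattened image. The function name `dense_params`, the image sizes and the 1000 hidden neurons are assumptions chosen only for illustration.

```python
def dense_params(height, width, channels, hidden_units):
    """Weights plus biases of one fully connected layer fed a flattened image."""
    inputs = height * width * channels          # one input per pixel value
    return inputs * hidden_units + hidden_units  # weights + one bias per neuron

# Hypothetical image sizes and layer width, for illustration only.
for side in (64, 256, 1024):
    count = dense_params(side, side, channels=3, hidden_units=1000)
    print(f"{side}x{side} RGB image -> {count:,} parameters")
```

Even the smallest of these images already needs more than 12 million parameters for one layer, and the count grows with every extra pixel.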

To avoid this kind of problem a CNN is used. Unlike normal artificial neural networks, CNNs are well suited to analyzing images because their neurons are arranged in 3 dimensions, Height x Width x Depth, where the depth corresponds to the channels (for example the red, green and blue channels of a color image):

In addition to achieving better results with less effort, convolutional neural networks have the following advantages:

Parameter sharing, i.e. the same filter (the same set of weights) is reused at every position of the image.
Sparsity of connections, i.e. in each layer each output value depends only on a small number of inputs.

I hope this is clear so far. Now that I have explained the main differences between the two types of neural network and why you should use CNNs for image recognition, let’s see how a convolutional neural network actually works.
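
The two advantages can be put into numbers with a small sketch. The helper names `conv_params` and `inputs_per_output`, and the choice of 32 filters of size 3 x 3 on a 3-channel image, are assumptions made for this example.

```python
def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus biases of a convolutional layer with square filters."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    return (weights_per_filter + 1) * num_filters  # +1 is the bias of each filter

def inputs_per_output(kernel_size, in_channels):
    """How many input values a single output value depends on."""
    return kernel_size * kernel_size * in_channels

# Hypothetical layer: 32 filters of size 3x3 applied to an RGB image.
print(conv_params(kernel_size=3, in_channels=3, num_filters=32))  # 896
print(inputs_per_output(kernel_size=3, in_channels=3))            # 27
```

Notice that the image size does not appear anywhere: because the filters are shared across all positions, the layer has the same 896 parameters whether the image is 64 x 64 or 1024 x 1024, and every output value looks at only 27 inputs.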

How does a convolutional neural network work?

The basic structure of a convolutional neural network consists of:

Convolutional layer
Non-linear activation function
Pooling
Fully Connected Network (i.e. our NN)

Let’s go in order and see why each of these parts is important for building our convolutional neural network.

Convolutional layer

The convolutional layer is the first part of our network. In this layer we perform the operation called “Convolution”, from which the CNN takes its name:

In the convolution we extract image features.

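Let’s take an image: it can be seen as a grid of small squares, the pixels (take the green square as an example of an image).

Next, a small matrix (called a ‘filter’, ‘kernel’ or ‘feature detector’, the yellow square) slides over the image. At each position the filter values are multiplied element by element with the pixels underneath, and the products are added up to give a single number. The grid of these numbers is the result of the convolution (the ‘Feature Map’, the pink square).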
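
Here is a minimal sketch of this sliding operation in plain Python, with no padding and a step of one square. The function name `convolve2d` and the values of the 5 x 5 image and 3 x 3 filter are assumptions chosen for illustration; a real project would use a library such as NumPy or a deep learning framework.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter with the patch underneath it and add everything up.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Illustrative values only: a tiny black-and-white image and one filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(convolve2d(image, kernel))  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

Strictly speaking this is a cross-correlation, since the filter is not flipped, but deep learning libraries commonly call this operation convolution anyway.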

To give you a better understanding of what is happening, see the images below:

One important thing to keep in mind is that the filters act as feature detectors on the original input image. Different filters produce different ‘Feature Maps’ from the same image (one may respond to edges, another may blur the image, and so on).

Two settings control how the filter is applied:

Stride: the number of squares the filter moves at each step of the windowing process. With a stride of 1, the filter moves to the very next square (as in the image above). With a stride of 2, it jumps two squares at a time, skipping one position each time, so the feature map gets smaller.
Padding: adds a border of zeros around the image. This avoids two negative aspects:
The shrinking of the image every time you apply a filter
The loss of information at the corners and edges, whose pixels are otherwise covered by the filter less often than the pixels in the middle.
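
The effect of stride and padding on the size of the feature map can be sketched as follows. The helper names `pad_with_zeros` and `output_size` are assumptions made for this example, and the size formula is the standard one for a square filter.

```python
def pad_with_zeros(image, padding):
    """Surround a 2D list with `padding` rows and columns of zeros."""
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    padded += [[0] * width for _ in range(padding)]
    return padded

def output_size(input_size, kernel_size, stride=1, padding=0):
    """Side length of the feature map along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(pad_with_zeros([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

print(output_size(5, 3))                       # 3: the image shrinks
print(output_size(5, 3, padding=1))            # 5: padding keeps the size
print(output_size(5, 3, stride=2, padding=1))  # 3: a larger stride shrinks it again
```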

A nonlinear activation function is then applied to the output of this convolutional layer.

Nonlinear activation functions

There are several types of activation functions, but the most common are: Sigmoid, ReLU (Rectified Linear Unit) and Tanh.

The activation function is applied after every convolutional layer because convolution is a linear operation, and a stack of linear operations is still linear. Since we are not fitting a simple linear relationship (as we might when predicting, e.g., the price of houses), we need the nonlinear function to let the network learn more complex patterns.

Each of the functions mentioned above is suitable for different situations. Let’s take the ReLU function as an example, which is the one most commonly used.

ReLU simply replaces every negative value with 0 and leaves positive values unchanged. Unlike Sigmoid and Tanh, whose gradient becomes almost zero when ‘Z’ is very large or very small, ReLU keeps a gradient of 1 for every positive ‘Z’, which usually makes learning faster.
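
To see the difference in the gradients, here is a small sketch using only Python's `math` module. The function names and the sample values of ‘Z’ are assumptions chosen for illustration.

```python
import math

def relu(z):
    """Negative values become 0, positive values pass through unchanged."""
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

# Illustrative values of Z, from very small to very large.
for z in (-10.0, -1.0, 0.5, 10.0):
    print(f"z={z:6.1f}  relu'={relu_grad(z):.1f}  "
          f"sigmoid'={sigmoid_grad(z):.6f}  tanh'={tanh_grad(z):.6f}")
```

At z = 10 the gradients of Sigmoid and Tanh are practically zero, while the gradient of ReLU is still 1. For negative values the gradient of ReLU is 0, which is the price it pays for being so simple.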

At the end of it all, after applying our nonlinear function we can move on to Pooling.

Pooling in CNN

Pooling has no parameters to learn, but it can be of different types: Max, Sum, Average etc. You only choose the size of the window and the stride.

Pooling reduces the size of each ‘Feature Map’ but at the same time preserves its most important information.

Let’s take as an example a feature map obtained from the first steps, apply to it a 2×2 filter and stride = 2 using MaxPooling and we will get the following result:
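
The same operation can be sketched in plain Python. The function name `max_pool` and the values of the 4 x 4 feature map are made up for illustration and are not taken from the example above.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value of each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical feature map, for illustration only.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(feature_map))  # [[6, 8], [3, 4]]
```

The 4 x 4 feature map becomes a 2 x 2 one, and each remaining value is the strongest response in its window. Replacing `max(window)` with `sum(window) / len(window)` would give Average Pooling instead.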

Through Pooling we derive the following benefits:

makes input representations smaller and more manageable
reduces the number of parameters and calculations in the following layers of the network, which helps to control overfitting
makes the network robust to small transformations, distortions, and translations in the input image (a small distortion in the input will usually not change the output of Pooling, since we take the maximum/average value of a local neighborhood).
helps us arrive at a representation of our image that is almost invariant to small shifts. This is very powerful since we can detect objects in an image, regardless of where exactly they are located. (The related term “equivariant” describes the convolution itself: if the input shifts, the feature map shifts by the same amount.)

These first 3 steps (convolution, activation, pooling) are the basic building block of every ConvNet and form its first level. You can stack more levels built from the same block and then add the fully connected level at the end.
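
To see what stacking these levels does to the data, here is a minimal sketch that only tracks the shape (height, width, depth) after each level. Everything in it is hypothetical: the 32 x 32 RGB input, the two levels with 6 and 16 filters of size 5 x 5, and the 2 x 2 pooling. It also assumes no padding and a convolution stride of 1.

```python
def conv_output(size, kernel, stride=1, padding=0):
    """Side length after sliding a square window over a square input."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(side, channels, blocks):
    """Return the (height, width, depth) of the data after each level.

    Each block is (num_filters, kernel_size, pool_size). The activation
    function is not listed because it does not change the shape.
    """
    shapes = [(side, side, channels)]
    for num_filters, kernel, pool in blocks:
        side = conv_output(side, kernel)             # convolution, no padding
        side = conv_output(side, pool, stride=pool)  # pooling, window = stride
        shapes.append((side, side, num_filters))
    return shapes

# Hypothetical network with two levels.
blocks = [(6, 5, 2), (16, 5, 2)]
for shape in trace_shapes(32, 3, blocks):
    print(shape)
# (32, 32, 3)
# (14, 14, 6)
# (5, 5, 16)
```

The height and width shrink at every level while the depth grows with the number of filters: the image is gradually turned into a small stack of feature maps.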

Fully connected layer

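In this layer each neuron is connected to every neuron of the previous layer. Before reaching it, the feature maps are flattened into a single list of numbers.

Combining the fully connected layer with the rest of the network described so far gives better prediction results: the convolutional part extracts the features, and the fully connected part uses them to make the decision.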
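
As a rough sketch of this step, the code below flattens two tiny feature maps and passes them through a fully connected layer with two neurons. The function names `flatten` and `dense`, and all the feature map values, weights and biases, are made up for illustration; in a real network the weights are learned during training.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    """Each output neuron has one weight per input value, plus a bias."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]

# Hypothetical values, for illustration only.
feature_maps = [[[1, 0], [2, 1]], [[0, 3], [1, 1]]]
inputs = flatten(feature_maps)  # [1, 0, 2, 1, 0, 3, 1, 1]
weights = [
    [0.5, 0, 0, 0, 0, 1, 0, 0],   # weights of the first neuron
    [0, 0, 1, 0, 0, 0, 0, -1],    # weights of the second neuron
]
biases = [0.0, 1.0]
print(dense(inputs, weights, biases))  # [3.5, 2.0]
```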

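Don’t forget that the activation function ‘Softmax’ (which is nonlinear) is applied to this last level and returns our output.

Softmax is a function that turns the scores of the last layer into probabilities, one per class, that add up to 1. The class with the highest probability is the prediction of the network for the image.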
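
As a minimal sketch, Softmax turns the raw scores of the last layer into probabilities that add up to 1. The three scores and the class names below are hypothetical; subtracting the largest score before taking the exponential is a common trick to avoid numerical overflow and does not change the result.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of the last layer for three classes.
classes = ["cat", "dog", "bird"]
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
print("prediction:", classes[probs.index(max(probs))])  # prediction: cat
```

The highest score gets the highest probability, so the network picks one class for the whole image.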

Here’s a ConvNet:

Conclusions

This article is intended to give a brief introduction to CNNs. As you can see, I did not go into detail, to avoid complicating what you have learned so far. If you want to study the topic further, I recommend following this free course.