Convolutional Neural Networks Explained

Nermeen Abd El-Hafeez
6 min read · Oct 10, 2023


What Is a Convolutional Neural Network?

Convolutional Neural Networks, or CNNs, are a powerful branch of deep learning designed for image recognition and processing tasks. These networks consist of multiple layers, including convolutional, pooling, and fully connected layers, each with a unique role in comprehending visual data.

Key Stages in a Convolutional Neural Network (CNN)

Input Layer:

The input layer represents the raw input data, typically an image or a matrix-like data structure. Its dimensions vary depending on the size and shape of the input data.
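For instance, a grayscale digit image and a color photograph enter the network with different shapes. A small NumPy sketch (the sizes here are just common examples, not requirements):

```python
import numpy as np

# A 28 x 28 grayscale image: height x width x 1 channel
gray_image = np.zeros((28, 28, 1))

# A 224 x 224 RGB image: height x width x 3 color channels
rgb_image = np.zeros((224, 224, 3))

print(gray_image.shape)  # (28, 28, 1)
print(rgb_image.shape)   # (224, 224, 3)
```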

Convolutional Layer:

In CNNs, filters, also known as kernels, are small, learnable matrices used to extract features from the input data. In convolutional layers, these filters are responsible for identifying specific patterns, edges, textures, or features. They are usually small, square matrices with dimensions like 3 x 3 or 5 x 5, though other sizes can be used.

The process involves sliding or convolving these kernels across the input data. At each position, they perform a convolution operation, which entails element-wise multiplication of the kernel values with the corresponding values in a local region of the input. The results are then summed up, generating feature maps that highlight various patterns and features in the input.

As a worked example, convolving a 5 x 5 image matrix with a 3 x 3 filter matrix produces a 3 x 3 output called a feature map. At each position, we take the dot product of the filter with the underlying image patch, yielding a single scalar value, and then move the filter by the stride (here, stride 1 with no padding) over the entire image. Using the following image and filter:

Image (5 x 5):

9 4 1 2 2
1 1 1 0 4
1 2 1 0 6
1 0 0 2 4
9 6 7 4 6

Filter (3 x 3):

0 2 1
4 1 0
1 0 1

the nine output values are computed as:

Output[0][0] = (9*0) + (4*2) + (1*1) + (1*4) + (1*1) + (1*0) + (1*1) + (2*0) + (1*1) = 0 + 8 + 1 + 4 + 1 + 0 + 1 + 0 + 1 = 16

Output[0][1] = (4*0) + (1*2) + (2*1) + (1*4) + (1*1) + (0*0) + (2*1) + (1*0) + (0*1) = 0 + 2 + 2 + 4 + 1 + 0 + 2 + 0 + 0 = 11

Output[0][2] = (1*0) + (2*2) + (2*1) + (1*4) + (0*1) + (4*0) + (1*1) + (0*0) + (6*1) = 0 + 4 + 2 + 4 + 0 + 0 + 1 + 0 + 6 = 17

Output[1][0] = (1*0) + (1*2) + (1*1) + (1*4) + (2*1) + (1*0) + (1*1) + (0*0) + (0*1) = 0 + 2 + 1 + 4 + 2 + 0 + 1 + 0 + 0 = 10

Output[1][1] = (1*0) + (1*2) + (0*1) + (2*4) + (1*1) + (0*0) + (0*1) + (0*0) + (2*1) = 0 + 2 + 0 + 8 + 1 + 0 + 0 + 0 + 2 = 13

Output[1][2] = (1*0) + (0*2) + (4*1) + (1*4) + (0*1) + (6*0) + (0*1) + (2*0) + (4*1) = 0 + 0 + 4 + 4 + 0 + 0 + 0 + 0 + 4 = 12

Output[2][0] = (1*0) + (2*2) + (1*1) + (1*4) + (0*1) + (0*0) + (9*1) + (6*0) + (7*1) = 0 + 4 + 1 + 4 + 0 + 0 + 9 + 0 + 7 = 25

Output[2][1] = (2*0) + (1*2) + (0*1) + (0*4) + (0*1) + (2*0) + (6*1) + (7*0) + (4*1) = 0 + 2 + 0 + 0 + 0 + 0 + 6 + 0 + 4 = 12

Output[2][2] = (1*0) + (0*2) + (6*1) + (0*4) + (2*1) + (4*0) + (7*1) + (4*0) + (6*1) = 0 + 0 + 6 + 0 + 2 + 0 + 7 + 0 + 6 = 21
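Collecting these nine values gives the full feature map: [16, 11, 17], [10, 13, 12], [25, 12, 21]. As a quick check, here is a minimal NumPy sketch of the same sliding-window computation (stride 1, no padding):

```python
import numpy as np

# The 5 x 5 input image and 3 x 3 filter from the worked example above
image = np.array([
    [9, 4, 1, 2, 2],
    [1, 1, 1, 0, 4],
    [1, 2, 1, 0, 6],
    [1, 0, 0, 2, 4],
    [9, 6, 7, 4, 6],
])
kernel = np.array([
    [0, 2, 1],
    [4, 1, 0],
    [1, 0, 1],
])

# Slide the kernel over the image; each output value is the element-wise
# product of the kernel with a 3 x 3 window of the image, summed to a scalar.
h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1), dtype=int)
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        window = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(window * kernel)

print(feature_map)
# [[16 11 17]
#  [10 13 12]
#  [25 12 21]]
```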

Pooling Layer

Pooling layers perform downsampling: they reduce the dimensionality of the data, effectively lowering the number of parameters in the network. Similar to convolutional layers, pooling layers employ a filter that traverses the input. However, pooling filters do not possess learnable weights. Instead, they perform an aggregation function within their receptive field, populating the output feature map. There are two primary types of pooling, illustrated with a short sketch after the list:

1. Max Pooling: During max pooling, the filter selects the pixel with the highest value within its receptive field and passes this value to the output feature map. Max pooling is commonly preferred for preserving important features.

2. Average Pooling: In contrast, during average pooling, the filter computes the average value of the pixels within its receptive field and conveys this average to the output feature map.
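Here is a minimal NumPy sketch of both pooling types, assuming a 2 x 2 window with stride 2 on a small made-up 4 x 4 input:

```python
import numpy as np

x = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 8, 1],
    [0, 4, 3, 5],
])

# Split the 4 x 4 input into non-overlapping 2 x 2 windows
# (window size 2, stride 2), then aggregate each window.
windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3)

max_pooled = windows.max(axis=(2, 3))   # keeps the largest pixel per window
avg_pooled = windows.mean(axis=(2, 3))  # averages the pixels per window

print(max_pooled)  # [[6 4]
                   #  [7 8]]
print(avg_pooled)  # [[3.75 2.25]
                   #  [3.25 4.25]]
```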

Non-linearity (Activation)

Following convolution and pooling, the next crucial step in the CNN architecture involves normalization and the application of the Rectified Linear Unit (ReLU) activation function. ReLU introduces non-linearity into the network’s computations, allowing it to capture complex patterns and relationships in real-world data. It is defined as ƒ(x) = max(0, x), where x represents the input: positive values pass through unchanged, while negative values are set to zero. Note that ReLU cannot simply be skipped when the input data lacks negative values; it is applied to a layer’s weighted sums, which can be negative regardless of the raw inputs, and without a non-linearity, stacked convolutional layers would collapse into a single linear transformation.
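In code, ReLU is a one-liner; here is a small NumPy sketch applied element-wise to made-up values:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negatives become 0, positives pass through
    return np.maximum(0, x)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))  # [0.  0.  0.  1.5 3. ]
```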

Fully Connected Layers

In the final layers of the CNN, fully connected (FC) layers combine the learned features to make predictions. These FC layers connect every neuron in one layer to every neuron in the next layer. The output of the FC layers is often passed through a softmax activation function for classification tasks, producing class probabilities.
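For illustration, here is a minimal NumPy softmax applied to hypothetical raw scores (logits) that a final FC layer might produce for three classes:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical FC-layer outputs
probs = softmax(logits)
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0
```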

Training Process

CNNs are trained using labeled data and optimization techniques, notably gradient descent. The training process involves iterative adjustments to the network’s weights, including the kernel values in convolutional layers. These adjustments aim to minimize a specific loss function, quantifying the disparity between the model’s predicted output and the actual labels associated with the training data.
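As a rough sketch of a single training step, assuming PyTorch; the model, layer sizes, and random data below are placeholders rather than a prescribed architecture:

```python
import torch
from torch import nn

# A small, hypothetical CNN for 28 x 28 grayscale images (10 classes)
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # learnable 3x3 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # fully connected output layer
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One gradient-descent step on a dummy batch (random tensors stand in
# for real labeled images)
images = torch.randn(4, 1, 28, 28)
labels = torch.randint(0, 10, (4,))

optimizer.zero_grad()
loss = loss_fn(model(images), labels)  # disparity between predictions and labels
loss.backward()                        # backpropagate gradients
optimizer.step()                       # adjust weights, including kernel values
```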

Prediction

After training, CNNs can be used to make predictions on new, unseen data. The input is passed through the trained network, and the output represents the model’s prediction for the given input, whether it’s image classification, object detection, or any other relevant task.
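Continuing the hypothetical PyTorch sketch above, prediction on a new image could look like this:

```python
import torch
from torch import nn

# Hypothetical trained model (same shape as the training sketch above);
# in practice you would load saved weights, e.g. via load_state_dict.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),
)

model.eval()                                  # switch to inference mode
with torch.no_grad():                         # no gradients needed at prediction time
    new_image = torch.randn(1, 1, 28, 28)     # stand-in for unseen data
    logits = model(new_image)
    probs = torch.softmax(logits, dim=1)      # convert scores to class probabilities
    predicted_class = probs.argmax(dim=1).item()

print(predicted_class)
```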

In summary, Convolutional Neural Networks (CNNs) are a powerful branch of deep learning designed for the complex task of image recognition and processing. These networks operate through an orchestrated series of distinct layers, each with a specific role in unraveling the intricacies of visual data.
