An Extensive Guide to Convolutional Neural Networks (2023)
In this post, we will build a simple, ELI5-style (Explain Like I’m 5) understanding of how a Convolutional Neural Network works.
Introduction
Convolutional Neural Networks (CNNs) are a type of artificial neural network designed specifically for processing data with a grid-like topology, such as an image. They are particularly effective at identifying patterns, features, and objects within images, which is why they are widely used in image and video recognition tasks.
Convolutional neural networks work by applying a series of filters to the input data, which allows them to learn and identify patterns within the data. They do this by sliding the filters across the input data and performing a mathematical operation called a convolution at each position. The convolution operation involves multiplying the values of the input data by the values in the filter and then summing the results. This process is repeated for every position in the input data, resulting in a new set of data that represents the features learned by the filter.
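To make the multiply-and-sum concrete, here is a minimal sketch of a single convolution step in NumPy (an assumption, since the post itself contains no code; the patch and filter values are made up for illustration):

```python
import numpy as np

patch = np.array([[1, 0, 2],
                  [3, 1, 0],
                  [0, 2, 1]])   # a 3x3 region of the input image

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]]) # a 3x3 filter (here, a vertical-edge detector)

# One convolution step: element-wise multiply, then sum.
value = np.sum(patch * kernel)
print(value)  # -> (1+3+0) - (2+0+1) = 4 - 3 = 1
```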
The filters in a CNN are typically arranged in layers, with each layer learning more complex features as the data passes through it. For example, the filters in the first layer might learn simple patterns like edges and corners, while the filters in a deeper layer might learn more complex patterns like textures or shapes. The output of the CNN is then fed into a classifier, which uses the learned features to make a prediction about the input data.
Input Image
In the figure, an RGB image is separated into its three color planes: Red, Green, and Blue. Images exist in many other forms as well, such as Grayscale, Indexed, and CMYK.
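In code, an RGB image is typically stored as a height x width x channels array, and the three color planes are just slices along the channel axis. A tiny NumPy sketch (the pixel values here are random, purely for illustration):

```python
import numpy as np

# A hypothetical 5x5 RGB image: height x width x channels.
img = np.random.randint(0, 256, size=(5, 5, 3), dtype=np.uint8)

red, green, blue = img[..., 0], img[..., 1], img[..., 2]
print(img.shape, red.shape)  # (5, 5, 3) (5, 5)
```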
Convolution Layer — The Kernel/Filter
Image dimensions = 5 (height) x 5 (width) x 1 (number of channels; an RGB image would have 3).
In a Convolutional Neural Network (CNN), the kernel (also called a filter) is a small matrix of weights that is used to detect patterns in the input data. The kernel is typically much smaller than the input data and is used to scan over the input data, performing a mathematical operation called a convolution at each position.
The convolution operation involves multiplying the values of the input data by the values in the kernel and then summing the results. This process is repeated for every position in the input data, resulting in a new set of data that represents the features learned by the kernel. The output of the convolution operation is called the feature map.
In the image above, a 3x3 filter (shown in yellow) is applied to our 5x5 input image, producing the 3x3 output feature map shown in green.
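Putting it together, here is a minimal sketch of the full sliding-window operation in NumPy (strictly speaking this computes cross-correlation, without flipping the kernel, which is what deep learning libraries do in practice; the input values are arbitrary):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution: slide the kernel over the image with stride 1."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply-and-sum at each position of the window.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # a 3x3 averaging filter
print(conv2d(image, kernel).shape)                # (3, 3), since 5 - 3 + 1 = 3
```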
Strides
In a Convolutional Neural Network (CNN), the stride is the number of pixels that the kernel (also called a filter) is moved at each step as it scans across the input data. When the kernel is moved one pixel at a time, the stride is 1. When the kernel is moved two pixels at a time, the stride is 2, and so on.
The stride is an important hyperparameter in a CNN because it determines the size of the output feature map. A larger stride will result in a smaller output feature map, while a smaller stride will result in a larger output feature map.
There are trade-offs to consider when setting the stride size. A larger stride can reduce the computational complexity of the CNN, but it can also cause the CNN to miss smaller patterns and features in the input data. A smaller stride can capture more detailed patterns and features, but it can also increase the computational complexity of the CNN.
In general, it is a good idea to start with a small stride and then increase the stride size if the computational complexity of the CNN becomes an issue. It is also common to use different stride sizes at different layers in the CNN, depending on the size of the input data and the desired size of the output feature map.
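The relationship between stride and output size follows a standard formula: output = floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride, and p the padding (which we cover next). A quick sketch in Python:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output size of a convolution along one dimension."""
    return (n + 2 * padding - k) // stride + 1

# A 7x7 input with a 3x3 kernel and no padding:
print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```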
Padding
In a Convolutional Neural Network (CNN), padding is the addition of extra pixels around the border of the input data. Padding is often used in CNNs to preserve the spatial size of the input data, which can be important for maintaining the spatial relationships between the pixels in the input data.
The first example in the picture above shows what we did in the previous section: a convolution with no padding, which is called ‘valid.’ Here the input image is 4x4 pixels and the filter is 3x3, so the result is 2x2 pixels (4 - 3 + 1 = 2). We can see that the output is downsized.
Now let’s look at the third example. There is one layer of padding, shown as blank pixels. The input image is 5x5 pixels and the filter is 3x3, so the result is 5x5 pixels (5 + 2*1 - 3 + 1 = 5), the same size as the input image. This is called ‘same.’ We could even make the output bigger than the input, but ‘valid’ and ‘same’ are the two cases used most often.
For example, suppose you have an input image that is 5x5 pixels and you want to apply a 3x3 kernel to it. Without padding, the kernel can only be centered on the interior pixels, so the output feature map will be smaller than the input image (3x3 in this case), and pixels near the border contribute to fewer kernel positions. This can cause the CNN to lose important information at the edges of the input data.
By adding padding to the input data, you allow the kernel to be centered on every pixel, including those at the border. For example, if you add 1 pixel of padding around the 5x5 image, the 3x3 kernel can slide across all 25 positions, and the output feature map will be the same size as the input image.
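Here is a small sketch of ‘valid’ versus ‘same’ using NumPy and SciPy (an assumed dependency; scipy.signal.correlate2d does the multiply-and-sum for us):

```python
import numpy as np
from scipy.signal import correlate2d

image = np.arange(25, dtype=float).reshape(5, 5)  # 5x5 input
kernel = np.ones((3, 3)) / 9.0                    # 3x3 filter

# 'valid': no padding -> output shrinks to 3x3 (5 - 3 + 1 = 3).
print(correlate2d(image, kernel, mode='valid').shape)   # (3, 3)

# 'same': one ring of zero padding -> output stays 5x5.
padded = np.pad(image, 1)  # add 1 pixel of zeros on every side
print(correlate2d(padded, kernel, mode='valid').shape)  # (5, 5)
```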
The amount of padding used in a CNN is a hyperparameter that can be adjusted to optimize the performance of the network. Too much padding adds uninformative border pixels and extra computation, while too little padding shrinks the feature maps and discards information near the edges of the input data.
Pooling Layer
A pooling layer in a Convolutional Neural Network (CNN) is a layer that is used to reduce the spatial size (i.e., width and height) of the input data. It does this by applying a pooling operation to the input data, which is a way of down-sampling the data by taking the maximum, average, or sum of a group of adjacent values.
There are two main types of pooling layers: max pooling and average pooling. In max pooling, the output of the pooling operation is the maximum value of the group of input values. In average pooling, the output of the pooling operation is the average of the group of input values.
The pooling layer has a kernel size and a stride, just like the convolutional layer. The kernel size determines the size of the group of input values that the pooling operation is applied to, and the stride determines the step size at which the kernel is moved across the input data.
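Here is a minimal NumPy sketch of max pooling with a 2x2 kernel and a stride of 2 (for average pooling, you would replace .max() with .mean()):

```python
import numpy as np

def max_pool2d(x, k=2, stride=2):
    """Max pooling with a k x k window."""
    oh = (x.shape[0] - k) // stride + 1
    ow = (x.shape[1] - k) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Keep only the largest value in each window.
            out[i, j] = x[i*stride:i*stride+k, j*stride:j*stride+k].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [4, 1, 3, 5]], dtype=float)
print(max_pool2d(x))  # [[6. 4.]
                      #  [7. 9.]]
```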
The primary purpose of the pooling layer is to reduce the computational complexity of the CNN and to introduce some degree of translation invariance (i.e., the ability to recognize an object regardless of its position in the input data).
Fully Connected Layer (FC Layer)
A fully connected layer (also called a dense layer) in a neural network is a layer in which every neuron in the layer is connected to every neuron in the previous layer. The term “fully connected” refers to the fact that each neuron in the layer receives input from every neuron in the previous layer, rather than just a subset of neurons.
In a Convolutional Neural Network (CNN), the fully connected layers are usually located at the end of the network, after the convolutional and pooling layers. Their purpose is to take the features extracted by those layers, flattened into a single vector, and map them to the final outputs of the network, such as class scores.
For example, suppose you have a CNN that is trained to classify images of dogs and cats. The output of the convolutional part might be a vector of 100 values representing the learned features. The fully connected layers transform this into a vector of 2 values, one for each class (dog and cat), which, after a softmax, represent the probability that the input image belongs to each class.
The weights in the fully connected layers are typically learned during training, using an optimization algorithm like stochastic gradient descent.
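Tying everything together, here is a minimal end-to-end sketch in PyTorch (an assumed framework; the post does not name one). The layer sizes and the two-class dog-vs-cat setup are made up for illustration:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 'same' padding
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)    # flatten for the fully connected layer
        return self.classifier(x)  # raw scores; softmax gives probabilities

model = SimpleCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
scores = model(torch.randn(1, 3, 32, 32))                 # one random 32x32 RGB image
probs = torch.softmax(scores, dim=1)
print(probs)  # e.g. tensor([[0.51, 0.49]]), probabilities for dog vs. cat
```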
Support me 👏
Hopefully this helped you. If you enjoyed it, you can follow me!