The aim of this blog is to explain the theory behind CNNs and to build an intuition for that theory through a practical implementation in Python.
CNNs are the state-of-the-art Deep Learning technique for tasks involving image data. Earlier approaches relied on manually extracting features from an image with Image Processing techniques and then using those features for further study. This is where CNNs are far more powerful: they automatically extract important features from an image and use those features for Classification, Segmentation, and Object Detection tasks.
Each layer of a Convolutional Neural Network consists of a linear operation followed by a non-linear operation. The linear operation is a simple convolution, as shown in the GIF above. An image is a matrix of pixels, so the 5x5 matrix can be treated as an image whose entries are the pixel values. The 3x3 kernel slides over the image; at each position it is multiplied element-wise with the 3x3 patch underneath it, and the nine products are summed to give the value of one output cell. The first position gives the first cell of the output. Repeating this for every 3x3 patch of the image is called convolution.
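The sliding-window operation described above can be sketched in plain Python. This is only a minimal illustration, assuming no padding and a stride of 1; the function name `convolve2d` and the pixel and kernel values are made up for the example and are not taken from the GIF.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the k x k patch element-wise with the kernel and sum up.
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 single-channel "image" and 3x3 kernel.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = convolve2d(image, kernel)
for row in result:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

A 5x5 input with a 3x3 kernel gives a 3x3 output, because the kernel fits in only three positions along each axis.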
In the above case, the image that we considered has a single channel. In reality, most images have 3 channels (RGB), so the kernel also needs to have 3 channels. Convolving two rank 3 tensors in this way results in a rank 2 tensor, because the products are summed across the channels as well as across the 3x3 window.
The kernel helps in extracting features from the image. In a single layer of a CNN, an image is convolved with multiple kernels, and each kernel extracts a different feature from the image. The resulting outputs are stacked to form a rank 3 tensor, which becomes the input to the next layer.
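Here is a small sketch of that rank 3 to rank 2 reduction, using PyTorch's `torch.nn.functional.conv2d`. The tensors are random placeholders, not real image data, and the variable names are only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.rand(3, 5, 5)   # (channels, height, width), random stand-in
kernel = torch.rand(3, 3, 3)  # one kernel with a 3x3 slice per channel

# F.conv2d expects a batch dimension on the input and an output-channel
# dimension on the weight, so both get a leading axis of size 1.
out = F.conv2d(image.unsqueeze(0), kernel.unsqueeze(0))  # shape (1, 1, 3, 3)
feature_map = out[0, 0]  # drop the two singleton axes: rank 2, shape (3, 3)

# The top-left output cell computed by hand: multiply the 3x3x3 patch
# with the kernel and sum over channels, rows and columns.
manual = (image[:, :3, :3] * kernel).sum()

print(feature_map.shape)
print(torch.allclose(feature_map[0, 0], manual, atol=1e-5))
```

The hand-computed value should match the first cell of the output, which shows that the sum runs over all three channels at once.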
I’ll try to show what I mean in practice, in Python. I grabbed an image of a cat.
Note: I am using PyTorch as the framework, so the image matrix is stored as a torch tensor. Notice the shape of the image: the 0th dimension of the tensor is the number of channels, followed by the size of the image, which is 352x352.
I generated 3 random kernels of shape 3x3, expanded each one to shape 3x3x3, and convolved the image with each of them. Each convolution produces a different image of shape (1,350,350), and each one represents a different set of features. The outputs of the convolutions in the first layer are stacked together, passed through a non-linear activation function (most commonly ReLU), and fed as input to the second layer. A similar operation is performed in each layer, and a CNN can have hundreds of layers.
Each of these layers learns to extract important features from the image. This is what makes deep learning so powerful: by the Universal Approximation Theorem, a large enough neural network can approximate a very wide class of functions. Whatever the initial values of the kernels are, training updates the kernels in each layer so that they extract the features that help with the task at hand (Classification, Segmentation, or Object Detection).
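The steps in this section can be sketched roughly as follows. This is not the exact code used for the cat image: a random tensor of shape (3,352,352) stands in for the image, and the kernels are drawn from a normal distribution as an assumption, since the post does not say how they were generated.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.rand(3, 352, 352)  # random stand-in for the cat image

# Three random 3x3 kernels, each repeated across the 3 input channels
# to give a 3x3x3 kernel. Final shape: (num_kernels, channels, 3, 3).
kernels_2d = torch.randn(3, 3, 3)
kernels = kernels_2d.unsqueeze(1).repeat(1, 3, 1, 1)

# Convolve the image with each kernel separately (no padding, stride 1).
outputs = [F.conv2d(image.unsqueeze(0), k.unsqueeze(0))[0] for k in kernels]
print([tuple(o.shape) for o in outputs])  # three maps of shape (1, 350, 350)

# Stack the feature maps and apply the non-linearity.
stacked = torch.cat(outputs, dim=0)  # shape (3, 350, 350)
activated = F.relu(stacked)          # negative values become 0
print(tuple(activated.shape))
```

In practice all kernels are applied in a single `F.conv2d` call by passing the full `(3, 3, 3, 3)` weight tensor; the loop is used here only to mirror the description above.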
As we go deeper into the network, we want deeper outputs, meaning convolution with more kernels, so that we can extract many features in a single layer. As the output becomes deeper, the model uses more memory. To keep memory in check, we use stride in our convolutions. Stride is the number of pixels by which the kernel shifts over the input matrix at each step. With a stride of 2, the height and the width of the output matrix are each roughly halved.
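The effect of stride on the output size can be checked with the standard output-size formula for a convolution, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride and p the padding. The helper name `conv_output_size` below is just for illustration.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one axis for input size n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(352, 3))                       # 350
print(conv_output_size(352, 3, stride=2))             # 175
print(conv_output_size(352, 3, stride=2, padding=1))  # 176
```

With a stride of 2 and one pixel of padding, the 352x352 input becomes exactly 176x176, which is half the size along each axis.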
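Pixels along the sides of the image are covered by fewer kernel positions than pixels in the middle, and the output shrinks with every convolution. To handle the side pixels, we add padding around the borders of the image matrix. Padding with zeros is called zero padding. Another type is reflection padding, also called mirror padding, which fills the border with mirrored copies of the pixels near the edge.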
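Both kinds of padding can be tried out with `torch.nn.functional.pad`. The 3x3 matrix below is a toy example, chosen so that the padded values are easy to read.

```python
import torch
import torch.nn.functional as F

# Toy 3x3 "image" with values 1..9, shaped (batch, channels, height, width).
x = torch.arange(1.0, 10.0).reshape(1, 1, 3, 3)

# One pixel of padding on the left, right, top and bottom.
zero_padded = F.pad(x, (1, 1, 1, 1), mode="constant", value=0.0)
reflect_padded = F.pad(x, (1, 1, 1, 1), mode="reflect")

print(zero_padded[0, 0])
print(reflect_padded[0, 0])
```

Both results have shape 5x5. The zero-padded version has a border of zeros, while the first row of the reflection-padded version is `[5, 4, 5, 6, 5]`, a mirror image of the second row of the original matrix.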
Assume that we are performing a Classification task with 10 classes. Also assume that the final convolutional layer outputs a tensor of size (512,11,11). The final output of the CNN should be a vector of size (10,1) containing the probability of each class. To get there, we first take the average of each of the 512 feature maps, which gives a vector of shape (512,1). This method is called average pooling (here the average is taken over each entire 11x11 feature map). The resulting vector of shape (512,1) then undergoes a linear operation: it is multiplied by a weight matrix of shape (512,10), transposed so that the result has shape (10,1). The output of the linear operation is passed through a non-linear function (in this case a softmax function), which keeps the shape (10,1) and turns the values into probabilities that sum to 1. The values in the output vector are the probabilities for each class, and the class with the highest probability is the predicted class.
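A rough sketch of this final step in PyTorch is shown below. The feature tensor and the weight matrix are random placeholders (in a real network the weights are learned), and the vectors are kept one-dimensional, so (512,1) appears as shape (512,) and (10,1) as (10,).

```python
import torch

torch.manual_seed(0)
features = torch.rand(512, 11, 11)  # stand-in for the final layer's output

# Average pooling over each whole 11x11 feature map: one number per map.
pooled = features.mean(dim=(1, 2))  # shape (512,)

# Linear operation with a random stand-in for the learned weight matrix.
weights = torch.randn(512, 10)
logits = pooled @ weights           # shape (10,)

# Softmax turns the 10 scores into probabilities that sum to 1.
probs = torch.softmax(logits, dim=0)
predicted_class = int(probs.argmax())

print(tuple(pooled.shape), tuple(probs.shape))
print(probs.sum().item())
print(predicted_class)
```

The predicted class here is meaningless because the weights are random; the point is only to follow the shapes from (512,11,11) down to 10 probabilities.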
In the next blog, I’ll write about the methods to prevent overfitting.
A special thanks to Jeremy and his course Practical Deep Learning for Coders. It helped me understand the concepts clearly.