The Ultimate Nano Book for Understanding Deep Learning Based Image Classifiers
A guide to understanding convolutional neural networks
In today’s world, we make extensive use of images, don’t we? Have you ever wondered how Facebook automatically detects faces in an image and then tells you who is in it, like it did in the figure above? When I move the pointer near one of the faces in the image, it automatically shows the name of the person, who is none other than Yann LeCun (who proposed the CNN architecture). Have you ever used Snapchat? It first detects your face and then applies the filter you selected. Something similar is done by Apple’s Face ID, which automatically unlocks your phone when it finds your face in front of the device.
There are tons of examples like these, and they all have one thing in common: they play around with images and make machines smart enough to automatically extract meaningful information from them. Amazing, isn’t it? When I first got to know about these technologies, I was very curious about how they even work. To be honest, it seemed like magic at first. How can a computer automatically tell that this is a picture of my brother and not me? If you find yourself on the same page as me, then grab a mug of coffee, because in this nano book you are going to learn a lot about each and every detail of the main idea behind these things, and in a very interesting way.
There are many computer vision tasks: image classification, object localization, object detection, object segmentation, and more. In this post, we will focus primarily on image classification.
This post is divided into three sections. In the first one, we will try to build this technique by following our own intuition, which is an interesting and rather unique way to learn this magic trick. In the second section, we will implement it in Python to see it in action. And in the third section, we will play around with it by asking several interesting questions to explore and learn more.
Understanding Semantic Gap
We, human beings, are visual creatures: we see various objects around us using our eyes. But have you ever wondered what an image means to a computer? There is a big difference between how we perceive images and how a computer does. We call this gap the Semantic Gap, which is well described in the image below.
In the above picture, it is clear that for a computer, an image is nothing but a bunch of numbers arranged in a grid-like structure, formally known as an array.
However, I have just shown a small part of it. If you want to see what the whole image looks like for a computer, you can use the following code.
import cv2
import numpy as np

np.set_printoptions(threshold=np.inf)

# read the image (update the path to point to your own file)
image = cv2.imread('/cat.jpeg')
print(type(image))

# resize to 28x28 so the printed array stays manageable
image = cv2.resize(image, (28, 28), interpolation=cv2.INTER_AREA)
print(image.shape)

# visualizing the separate channels of the image (OpenCV stores images as BGR)
print(image[:, :, 0])  # prints out Blue channel
print(image[:, :, 1])  # prints out Green channel
print(image[:, :, 2])  # prints out Red channel

# to visualize the whole image with all 3 channels
print(image[:, :, :])
Now you may be wondering whether there is a way for our beloved computer to know which object is present in the frame. You bet there is, and we are going to build it. I know it seems like a daunting task, but please bear with me till the end. We will make it happen for sure!
The first and most important step of our journey:
As I have said before, we are simply going to ask questions that will guide us to build an image classifier. For the sake of brevity, we will call the image classifier an IC.
Now we are ready to start our journey. So let us ask the first question: “Can you tell which category the following images belong to? Cat or Dog?”
I am sure you got it right! They belong to the category Cat. But wait, take a step back and think about how you reached this conclusion. How did your brain do it? How can you be so sure that these images are of cats and not dogs? You know what cats look like, and what you have probably just done is a similarity check between the rough image of a cat’s face that you have unconsciously stored in your brain and these images (more specifically, the middle portion of these images). Since it returned a high score, you came to the conclusion that these images are indeed cats.
Now, you may be saying “Can’t we incorporate the same mechanism into our own image classifier?” It seems like a great idea, but it comes with two main challenges:
- Which image will act as the image(rough image) we have stored in our mind?
- How are we going to measure the similarity between the two images?
Let us address the first question. Here we are primarily focusing on classifying an image as a cat or not a cat. So can’t we use the image of a cat’s face as our rough image? Why not! We can use the image below as the rough one and then do a similarity check with the middle portion of the above images, the part that shows the cat’s face. Or, equivalently, we can use the pixel values present in the middle of the array.
Now, our rough image will also be stored in the computer just like our main image (the one to be classified), in the form of numbers or pixel values.
For the second question, is there a way to do the similarity check between the two images, in other words, between the large matrix (the original image) and the small matrix (the rough image)? Yes! We have something which can do it for us. It is known as the Frobenius Inner Product in mathematics. Don’t get scared. I will explain what it is and what it does.
The Frobenius inner product is nothing but the element-wise multiplication of two arrays followed by an addition operation. Let’s say you have the two grids, or matrices, shown below, and let’s call them A and B. Then the Frobenius Inner Product (FIP) is calculated as follows.
The Frobenius inner product, in general, is one of the ways you can measure the similarity between two matrices. It is nothing but a generalized version of the dot product, with one difference: the dot product is defined for vectors, while this inner product is defined for matrices. How exactly it measures similarity is beyond the scope of this guide, but if you want to know more, feel free to check out this blog post and this one to understand the inner product.
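To make this concrete, here is a quick check with NumPy; A and B are just small example matrices, not the ones from the figure above.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# element-wise multiplication followed by a sum: the Frobenius inner product
fip = np.sum(A * B)
print(fip)  # 1*5 + 2*6 + 3*7 + 4*8 = 70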
Now we have both images (the rough one and the main one) and a way to measure the similarity between two matrices or images. Cool! So, let us look at our very simple pipeline to classify an image as a cat or not a cat, which includes two steps:
- Select the middle portion of an image which has to be classified
- Do the similarity check between the selected area and the rough image. If it returns a high value, then the main image is of a cat.
But will this pipeline or our Image Classifier work for the following images?
No, a big no! It will not work for these images because they don’t have the face of a cat in the middle. This leads to a low similarity between the two images, which means an incorrect classification. Can we fix this problem? Yes, we can. Instead of focusing only on the middle part of the image, we should do the similarity check between every part of the main image and the rough image. This makes our image classifier globally translation invariant: no matter where the face or pattern is located in the image, our image classifier will detect it.
The process is very simple. We start from the upper left corner of the main image, pick a small matrix of the same size as our rough image, and do the similarity check between the selected part of the main image and the rough image. The result is stored in an output matrix, and we repeat this process for each part of the main image. Please note that the direction of selecting the specific part of the main image is from left to right and top to bottom. The number of steps or units by which we move the rough image over the main image is known as the Stride. Now, let me make this point clearer with the following graphics and code.
import numpy as np

def classify(main_image, rough_image, threshold, stride=1):
    kh, kw = rough_image.shape
    category = None
    # slide the rough image (kernel) over every part of the main image,
    # left to right and top to bottom
    for i in range(0, main_image.shape[0] - kh + 1, stride):
        for j in range(0, main_image.shape[1] - kw + 1, stride):
            selected_part = main_image[i:i+kh, j:j+kw]
            # similarity = Frobenius inner product of the two matrices
            similarity = np.sum(selected_part * rough_image)
            if similarity >= threshold:
                category = "cat"
    if category == "cat":
        print("Cat is present in image")
    else:
        print("Cat is not present in image")
Seems pretty simple and promising. This particular process of convolving the rough image over the main image is known as the Convolution operation (a linear operation), which is very intuitive: it only checks for a specific pattern in the main image by calculating the Frobenius inner product between two matrices. The result of this process is another matrix storing the similarity scores between the rough image and each part of the main image.
And have you noticed something very interesting here? The size of our output matrix, or feature map, will always be smaller than the original size of the main image. If you are not convinced, I suggest you take a closer look at the above operation. But what if you want to preserve the size of your main image, so as to retain most of the information about it even after the operation? Is there any way to do so? Yes, there is. To preserve the dimensions of the output, we simply pad the main image with zeros around the border, and this process is known as Padding.
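As a small sketch of what padding looks like in code (the pad width of 1 here is just illustrative):
import numpy as np

feature = np.array([[1, 2],
                    [3, 4]])
# surround the array with a border of zeros so convolution can preserve its size
padded = np.pad(feature, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (2, 2) -> (4, 4)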
Now it is the right time to introduce you to the names of these matrices as they are known in the deep learning community. Formally, the rough image is known as the Kernel, Feature Detector or Filter, and the particular part of the main image with which we are doing the similarity check is known as the Receptive Field. The output matrix is known as the Activation map or Feature map.
So far so good. It’s time to give ourselves some credit: we have come this far and built an image classifier which can detect a specific pattern in the main image with the help of a single kernel. Let us take one more step and try to make our image classifier more robust to different kinds of images. Will our image classifier work for the following images?
Again, the answer is no, and the reason is simple. There is a difference between the faces of the cats in the above pictures and the face of the cat which we selected as a kernel. Even a small change in the color or shape of an object in an image changes the values of the array, and this difference is enough to lead our classifier to an incorrect classification. We have to fix this by making our classifier more robust. We simply can’t classify every image using just one specific pattern. Our classifier should also classify images in which the face of a cat is slightly rotated, deformed or posed in some other way. Thus, there is a real need for multiple kernels or feature detectors, because we can’t rely on one single kernel. We should have different kernels to identify different patterns in an image. This will improve the performance of our image classifier.
So, now we will have a modified version of our previous convolution operation, and it contains the following three components:
- Input Volume: It can be an image with a depth of 3 (RGB), or it can be the output of a previous convolution operation.
- Feature detector matrix: This consists of K kernels to detect various patterns.
- Feature map: This will be the output of the convolution operation, which simply stores the similarity values.
Formally, the combination of these three components is known as a single Convolution Layer. And if you are pondering how the convolution operation will be done this time, have a closer look below.
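To make the mechanics concrete, here is a minimal sketch of a convolution layer with K kernels, assuming a single-channel input for simplicity; the names and shapes are illustrative, not a full implementation.
import numpy as np

def conv_layer(input_volume, kernels, stride=1):
    K, kh, kw = kernels.shape              # K kernels, each of size kh x kw
    H, W = input_volume.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    feature_maps = np.zeros((out_h, out_w, K))
    for k in range(K):                     # one feature map per kernel
        for i in range(out_h):
            for j in range(out_w):
                patch = input_volume[i*stride:i*stride+kh, j*stride:j*stride+kw]
                feature_maps[i, j, k] = np.sum(patch * kernels[k])  # Frobenius inner product
    return feature_maps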
Now our image classifier can detect various features at any location in the image. It seems like we are doing a great job, but there is still room for improvement. Can’t we reduce our workload? Notice one thing here: we are manually defining our feature detectors, deciding that this particular kernel should detect this particular feature. Is this always possible? Can we, as humans, identify all of the patterns required to build our classifier? Is this even feasible? The answer is no, we can’t. But can we make our computer smart enough to learn the feature detectors by itself? Yes, we can. We will make use of the computing power of modern computers and Machine Learning to train our machine on tons of images with their respective labels. It will then identify the various feature detectors by itself to build a robust image classifier. If this does not make much sense to you yet, I suggest you go through this blog.
We agree on the fact that our machine will be smart enough to learn the feature detectors by itself and will be able to detect multiple features at any location in the image, but there is one more important and crucial thing to understand here: your machine will learn those feature detectors the way you do. What do I mean by this? Remember how you learned English, or any concept in general? First you learned the alphabet, then words, and from there how to make meaningful collections of words, also known as sentences. We as humans learn in a hierarchical way. The machine will also mimic our way of learning: it will not learn higher-level features like the face of a cat or the hands or legs of a human directly. First, it will learn very low-level features, like a feature that detects a horizontal edge, a vertical edge or a color in an image. From there, it will learn higher-level features like faces, hands, and shapes. This clearly signifies that our image classifier will not have a single convolution layer but multiple convolution layers.
Introducing non-linearity
Have you noticed, as we are progressing, we are coming up with a better version of our image classifier? In the last section, we learned that our classifier should have multiple convolution layers in order to learn in a hierarchical way. This means the structure of our classifier will be like this:
CONV LAYER--> CONV LAYER--> CONV LAYER --------> CONV LAYER
Recall! We have previously learned that convolution is a linear operation, which simply means that no matter how many layers you introduce into the classifier, it will only be able to learn a linear function. Even if you add 100 layers to your network, they will all act as a single convolution layer. Yes, 100 Layers == 1 Layer.
Don’t believe me? Fine. Let me elaborate with the help of simple linear equations.
The transformation from `u` to `y` is done by 3 linear equations, but as shown in the image above, it can also be done by a single linear equation, 60u + 31, which means Power(3 linear equations) == Power(1 linear equation).
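Here is a tiny illustration with made-up coefficients (one possible set that happens to compose to the same 60u + 31 as in the figure): stacking three linear equations is still just one linear equation.
w1, b1 = 2.0, 3.0
w2, b2 = 5.0, 1.0
w3, b3 = 6.0, -65.0

u = 4.0
y_stacked = w3 * (w2 * (w1 * u + b1) + b2) + b3                 # three linear layers
y_single = (w3 * w2 * w1) * u + (w3 * w2 * b1 + w3 * b2 + b3)   # one linear layer: 60*u + 31
print(y_stacked == y_single)  # True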
Thus, we have to introduce non-linearity between our convolution layers for the following reasons:
- To learn the complex, non-linear functions needed to classify images.
- To keep our network powerful by preserving 100 Layers == 100 Layers.
The non-linear function, or activation function, most common nowadays in the deep learning community is ReLU (Rectified Linear Unit).
Think of the ReLU function as a box which accepts any number and outputs either 0 or the number itself if it is positive. The following code will make it clearer.
def relu(x):
if x < 0:
return 0
else:
return x
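In practice, ReLU is applied element-wise to an entire feature map rather than to one number at a time; with NumPy that is simply np.maximum(0, feature_map).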
Thus, introducing an activation function will make our image classifier more robust and help it learn a non-linear decision boundary. However, it is not compulsory to use ReLU as the activation function; you are free to use any non-linear function. To read more about activation functions, feel free to read this blog.
Pooling: Make it more robust and efficient
So far in this guide, we have focused mostly on the robustness of the image classifier, but for now, let us also focus on its efficiency. The input to our classifier is an image, right? And its size can vary depending on the problem at hand: images can be of shape 28x28, 64x64, 128x128 or 256x256. The bigger the image, the more parameters our image classifier will have. It is a very well known fact in the deep learning community that the more parameters you have, the more your model will overfit and the longer it will take to train. So, is there any way to get rid of this problem? Yes, there is. We can reduce the shape of the input volume using an operation known as Pooling.
Pooling is a very simple operation which sub-samples or down-samples the input by replacing a very small sub-array with a single value, and that single value is usually a summary statistic of that small sub-array. The graphic below will make it clearer.
In the above picture, the red array is our sub-array and the small blue matrix is the output of the pooling operation. Reducing a 20x20 array with a 2x2 pooling kernel leads to a drastic change in the number of parameters in the following convolution operation. The small sub-array is known as the Kernel, and the concept of stride is also applicable here: it comes into play when deciding how many steps you want your kernel to move. We can use any summary statistic to replace the sub-array with a single number: taking the maximum value of the sub-array is known as Max-Pooling, and taking the average of its values is known as Average-Pooling.
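Here is a minimal sketch of max-pooling in NumPy; the pool size and stride of 2 are just the common defaults, not a requirement.
import numpy as np

def max_pool(feature_map, pool_size=2, stride=2):
    h, w = feature_map.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+pool_size, j*stride:j*stride+pool_size]
            pooled[i, j] = window.max()  # use window.mean() for average pooling
    return pooled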
We will make use of the pooling operation to downsample the size of the input volume, but there is one more interesting thing which pooling does. In pooling, we replace a small sub-array with a single value, which means we are not focusing on the exact location of the result of the feature detector operation. Instead, we are zooming out a little and asking for an approximate result over that sub-array. This makes our image classifier robust to small local translations, i.e. locally translation invariant. The following example will help for sure.
Above are three different pictures of a human face with slight differences. If you look carefully, the spacing between the two pairs of eyes and the spacing between the eyes and nose is not the same, but we still want our image classifier to classify all of these faces correctly. What more do we expect from our image classifier? We want our face detector kernel not to focus on the exact relative position of the eyes and nose, but instead to check for the presence of one eye on the left, one eye on the right, and a nose in the middle followed by lips; if all of these conditions are fulfilled, declare it a face. This is exactly what the pooling operation helps us achieve. So, pooling performs mainly two functions, and now you know both of them!
The last component of our image classifier
We have already covered the hardest part. Now, let us cover the easiest one and make our classifier fully working, so that it gives us a single number telling us whether the image is of a cat. But have you noticed one more thing? The output of all of the layers we have covered so far has shape (width, height, depth), yet we are interested in one single number which can tell us whether a cat is present in the image or not. So, how do we achieve this? How do we make use of all the neurons present in the output volume of the previous layers, given that those neurons tell us whether a particular feature is present in the image or not? How do we combine the information from those neurons intelligently to get a single number? First of all, we have to perform an operation called Flattening so that we can make use of all of those neurons to get that single number. Flattening is a very simple operation which converts the input volume of shape (width, height, depth) into a one-dimensional array of shape (K, 1), where K = width * height * depth.
Now we have a one-dimensional array as the output, and up to this point our classifier is acting as a Feature Extractor. It is extracting meaningful information from the image that will play a crucial role in determining whether a cat is present in the image or not. Now we have to make use of this meaningful information to get a single number. This is done by a layer called the Fully Connected Layer, which connects all of the neurons present in the previous layer to all the neurons in the next layer.
If you know about the simple neural network (without any fancy layers), then connecting multiple fully connected layers with different numbers of neurons is equivalent to stacking a simple neural network on top of a deep learning based feature extractor, isn’t it? The image below will clarify the concept a bit more, and it will also help you understand how the one-dimensional array is converted into a single number.
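As a rough sketch of these last two steps (flattening plus fully connected layers), with purely illustrative shapes such as a 7x7x30 output volume, 500 hidden neurons and 2 class scores:
import numpy as np

output_volume = np.random.rand(7, 7, 30)     # (width, height, depth) from the last conv/pool layer
flattened = output_volume.reshape(-1)        # shape (7*7*30,) = (1470,)

W1, b1 = np.random.rand(500, 1470), np.random.rand(500)
W2, b2 = np.random.rand(2, 500), np.random.rand(2)

hidden = np.maximum(0, W1 @ flattened + b1)  # fully connected layer + ReLU
scores = W2 @ hidden + b2                    # one score per class (cat / not cat)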
Congratulations! We have developed a full-fledged deep learning based image classifier with the help of our intuition. We have covered each and every layer of this classifier, and it is very clear that out of all of these layers, the most important one is the Convolution Layer, which is why this classifier is called a Convolutional Neural Network.
Section 2: Convolutional Neural Network in action
We have learned about the convolutional neural network in great detail. Now let us switch modes: enough theory and intuition. Let us implement all the things we have learned so far and build an end-to-end LeNet architecture, which was the first CNN architecture proposed by none other than Yann LeCun. There are many other CNN architectures, about which you can read here. The arrangement of the various layers in the LeNet architecture is as follows:
Input -> Conv -> Activation -> MaxPooling -> Conv -> Activation -> MaxPooling -> Flattening -> FC -> FC
We will use a CNN to classify hand-written digits into one of 10 categories. The dataset we will use is the MNIST database, which looks like this:
Now, let us implement all of the things we have learned so far in python using Keras.
Lines 1–10: We have imported required classes which will be needed further to implement LeNet architecture.
Line 13: It will load the MNIST dataset into four variables ((x_train,y_train), (x_test,y_test))
Lines 15–18: These lines will print out the shape of the train and test dataset.
Lines 21 and 22: They will normalize the input images by converting the range of every pixel of an image from 0–255 to 0–1.
Lines 24–25: Performs One hot encoding which is basically done to explicitly tell our model that there is no ordinal relationship between the classes.
Lines 27 and 28: They will reshape the data(both train and test) from (data.shape[0], 28, 28) to (data.shape[0], 28, 28, 1) which will be used to signify the number of channels of the input image for convolution layer.
Lines 38–49: These lines define the LeNet architecture. To implement it, we use the Sequential API of Keras (line 38). The first hidden layer of the network is a convolution layer with 30 filters of size (5,5), followed by a ReLU activation layer, and then a MaxPooling layer with pool size (2,2) is stacked on top of it. This combination of 3 layers is repeated once more according to the architecture. After these 2 blocks of three layers, the flattening operation is performed by a Flatten layer (line 47), and then two fully connected layers with 500 and 10 neurons are attached (lines 48 and 49).
Lines 52 and 53: These lines define the optimizer that will be used to train the CNN (line 52), and line 53 compiles the defined model with categorical crossentropy as the loss function and accuracy as the monitoring metric.
Line 56: It will be responsible for training our CNN (the real thing) with 10 as the number of epochs.
Lines 59–63: These lines will be used to evaluate the trained model which will help us to check how well our model will perform on the unseen data by calculating the log-loss value and accuracy on the test dataset.
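For reference, here is a minimal, self-contained sketch of the model described above in Keras. The line numbers in the walkthrough refer to the original code listing, not to this sketch, and the second convolution layer's filter count and the choice of optimizer are my assumptions.
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# normalize pixel values from 0-255 to 0-1 and one-hot encode the labels
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# add the channel dimension expected by the convolution layers
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)

model = Sequential([
    Conv2D(30, (5, 5), input_shape=(28, 28, 1)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(15, (5, 5)),                  # filter count here is an assumption
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(500, activation='relu'),
    Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',          # optimizer choice is an assumption
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)

loss, acc = model.evaluate(x_test, y_test)
print("Accuracy of our model is {:.3f}%".format(acc * 100))
print("Log loss value : {:.3f}".format(loss))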
If you execute the above code, you will get the following output.
Accuracy of our model is 97.008%
Log loss value : 0.079
The accuracy of our model is approximately 97% and the log loss value is 0.079, which is pretty good.
I hope this section gave you a taste of the implementation details of a convolutional neural network in Python. Now, let us move on to the most exciting and important section of this nano book, to understand convolutional neural networks better.
Section 3: Playing with ConvNets to understand them better
In the deep learning community, most folks treat convolutional neural networks as “black boxes”, without much focus on how they do what they do. In this section, we will play around with convolutional neural networks and focus mainly on the visualization aspect of ConvNets. We will get to know CNNs better by studying some of the most important and interesting papers, again by asking interesting questions. This section includes details which are not often discussed or known to most people in the deep learning community. I hope you will learn something new too. Are you excited? I know you are!
3.1: Visualizing intermediate outputs
We know very well that there are so many layers present in the convolutional neural network starting from the input layer to the output(softmax) layer. It will be a fun exercise if we can visualize the transformation done by each layer or the output of each layer when we feed forward an image through the network. This exercise will help us to see what exactly each layer is doing to the input image. This is exactly what was done by the authors of the paper: Understanding Neural Networks Through Deep Visualization. If you want to see this thing in action, you can use the tool they have developed which can be found here.
3.2: Reconstructing the image given the CNN code
In the previous sub-section, we tried to visualize the output of the intermediate layers as an image passes through the network; the output of an intermediate layer is also known as the encoding of that image at that particular layer. But along with visualizing the output of the intermediate layers, is it possible to reconstruct the original image using the encoding of the image? Yes, it is possible, and this is covered extensively in the paper Understanding Deep Image Representations by Inverting Them. The authors of this paper conducted a direct analysis of the visual information contained in intermediate representations by asking the question: given an encoding of an image, to what extent is it possible to reconstruct the image itself?
The method they used is pretty simple. They posed this problem as an optimization problem, which computes the approximate inverse of an image representation such that the difference, or loss, between the representation of the original image (Φ0) and the representation of the image we are trying to find (Φ(x)) is minimized.
Here, x represents the image we are trying to find, Φ0 is the output of the intermediate layer (the encoding of the original image), Φ(x) is the encoding of the image we are trying to find, loss(Φ(x), Φ0) is the difference between the two encodings, and λR(x) is a regularization term. The loss function they use to measure the difference between the two encodings is the Euclidean distance, which mathematically looks as follows:
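In symbols (my reconstruction, using the notation above), the objective being minimized is roughly:
x* = argmin over x of ||Φ(x) − Φ0||² + λR(x)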
Below are the inverted representations of a single image, reconstructed from the encodings of different layers.
From the above-shown results, it is very clear that the reconstruction of the image using the encoding of layers which are closer to the input layer contains much more visual information than the one constructed from the layers which are far away from the input layer.
3.3: Which portion of an image is responsible for classification?
So far, we have agreed on the fact that the presence of the object in an image is responsible for the output of the image classifier. For instance, it is the presence of the face of a cat in an image which is responsible for the output of the classifier. But how can we be sure about this? What if the classifier is classifying the image as Cat on the basis of the surrounding context and not on the basis of the presence of a cat’s face? There are mainly two approaches to be sure about it: the first one is the Occlusion Experiment and the second one uses a Saliency Map.
3.3.1: Occlusion Experiment
This approach was first introduced in Matthew Zeiler’s Visualizing and Understanding Convolutional Networks. In this approach, the authors occlude a particular section of the image with a grey square and then monitor the output of the classifier for that image. They repeat this step for each possible section of the image by sliding the grey box from top to bottom and left to right, in the same way we slide the small sub-array over the input image in the convolution operation.
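A minimal sketch of this sliding-occlusion procedure might look as follows, assuming a trained Keras `model`, an input `image` of shape (H, W, 3) with pixel values in [0, 1], and a patch size and stride that are purely illustrative:
import numpy as np

def occlusion_map(model, image, class_index, patch=32, stride=16):
    H, W, _ = image.shape
    heatmap = np.zeros(((H - patch) // stride + 1, (W - patch) // stride + 1))
    for i in range(heatmap.shape[0]):
        for j in range(heatmap.shape[1]):
            occluded = image.copy()
            # cover this region with a grey square
            occluded[i*stride:i*stride+patch, j*stride:j*stride+patch] = 0.5
            prob = model.predict(occluded[np.newaxis], verbose=0)[0, class_index]
            heatmap[i, j] = prob  # probability drops when the object itself is hidden
    return heatmap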
What to expect from this experiment?
The output (probability) of the image classifier should not change much when we occlude an unimportant section of the image, but the probability should drop significantly when we cover the main object with the grey square. Below are images of this experiment’s results.
Thus, we can be sure now that it is the presence of the object in an image which is responsible for the output of image classifier.
3.3.2: Saliency map
This approach was first introduced in Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. It is a very simple approach in which the authors of the above-mentioned paper tried to determine which set of pixels in the image is important for the classification output. They observed a very simple thing: if we change the pixel values of the image one by one, how much, and which set of pixels, will affect the class score of that image the most? The reasoning behind this is pretty straightforward: the set of pixels corresponding to the object will affect the class score much more than other pixel sets.
They do so by simply calculating the gradient of the class score of an image, S_c(I), with respect to the image itself:
w = ∂S_c / ∂I : the gradient of the class score with respect to the image.
The pixels with higher gradient values are highlighted in the above images, which signifies that these are the most important pixels of the image: a small nudge to the value of these pixels can change the class score a lot. So, it clearly tells us that the presence of the object in an image is responsible for the output of the classifier.
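For completeness, here is a minimal sketch of how such a saliency map can be computed, assuming a trained Keras `model` and a preprocessed input `image` of shape (1, H, W, C); the names are illustrative, not the authors' code.
import tensorflow as tf

def saliency_map(model, image, class_index):
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        score = model(image)[0, class_index]      # S_c: the class score for this image
    grads = tape.gradient(score, image)           # dS_c / dI
    return tf.reduce_max(tf.abs(grads), axis=-1)  # max gradient magnitude over channels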
3.4: Is our assumption right?
There is no doubt that the backbone of the convolutional neural network is the convolution layer, and it consists of many kernels or feature detectors which look for the presence of certain features in an image and, in turn, respond with a large value if that feature is present. This is something we have already covered in this nano book.
Our whole intuition and explanation are based on one most important assumption: that the sub-arrays present in the convolution layer act as feature detectors. But is it really the case? How can we be sure that these sub-arrays act as feature detectors? Is there any way to be sure about this? You bet there is. This is going to be a very important part of this nano book, as it checks the assumption about ConvNets on which this whole nano book is based. The approaches used to verify this are mainly divided into two camps: one is dataset-centric and the other is network-centric.
3.4.1: A network-centric approach to understand CNN
The network-centric approach only requires a trained convolutional neural network, unlike the dataset-centric approach, which also requires data. One of the main and most effective network-centric approaches was first introduced in the paper Understanding Neural Networks Through Deep Visualization.
In this approach, the authors of the paper do a very simple thing: they visualize the activations by posing an optimization problem which constructs an input image such that the value of a chosen activation at a chosen layer is maximized.
Here, x is the input image to be constructed, a_i(x) is the value of the i-th activation and R_θ(x) is the regularization term.
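In practice this can be done by simple gradient ascent on the input image; below is a minimal sketch assuming a trained Keras `model`, where the input size, layer and filter choices and the regularizer weight are illustrative rather than the paper's exact settings.
import tensorflow as tf

def maximize_activation(model, layer_name, filter_index, steps=100, lr=1.0):
    feature_model = tf.keras.Model(model.inputs, model.get_layer(layer_name).output)
    x = tf.Variable(tf.random.uniform((1, 224, 224, 3)))  # start from random noise
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_model(x)[..., filter_index]
            # maximize the mean activation, with a simple L2 penalty as the regularizer R(x)
            objective = tf.reduce_mean(activation) - 1e-4 * tf.reduce_sum(x ** 2)
        grads = tape.gradient(objective, x)
        x.assign_add(lr * grads / (tf.norm(grads) + 1e-8))
    return x.numpy()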
Maximizing the value of a particular activation results in a constructed image x that contains the visual information related to that activation. The images below show the inputs constructed for different activations at different layers using this approach.
3.4.2: A dataset-centric approach to understand CNN
The dataset-centric approach requires both a trained convolutional neural network and data (images) run through that network. One of the main dataset-centric approaches was first introduced in Matthew Zeiler’s Visualizing and Understanding Convolutional Networks. In this paper, they introduce a visualization technique that reveals the input stimuli (parts of the input image) that excite, or activate, individual feature maps at any layer of the model. To do this, they make use of a Deconvolutional network (deconvnet), which can be thought of as a convnet that uses the same components (pooling, non-linearity) but in reverse order. But you may ask what a deconvnet does and how it helps us. As the input image passes through the network, we obtain activation maps as the output of the intermediate layers; when we attach a deconvnet to an activation map, it maps these activities back to input pixel space. Mapping a particular activation back to input pixel space reveals the pattern in the input that the activation is looking for.
The following excerpt and diagram from the paper itself are self-explanatory and will help you understand more about the deconvnet and the approach used in general.
To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.
Now, let us have a look at the result of the above experiment and try to understand what activations of the different layers are looking for.
From the images shown above, it is very clear that the layers closer to the input layer look for lower-level features (like corners, edges, and colors), while the layers further away from the input layer look for higher-level features (like dog faces and keyboards), thus confirming one more of our assumptions: a convolutional neural network learns in a hierarchical way.
In this last and most important section of the nano book, we played around with convolutional neural networks and got to know more about them by reviewing four of the most important CNN papers, all of which consider the visualization aspect. I hope you enjoyed this section like the others.
I hope you have enjoyed each and every section of this nano book, and if you have learned anything new from it, you can show your love by sharing it with others. It takes a lot of time to write such a comprehensive blog post; I hope my hard work helps some of you understand convolutional neural networks.
And feel free to connect with me on LinkedIn, follow me on Twitter and Quora as well.