Convolutional Neural Network — An Informal Introduction (Part-1)
The convolutional neural network, aka CNN, has become the de facto standard for image classification and detection. Machine learning frameworks like PyTorch and TensorFlow provide convenient means to rapidly construct a CNN, so it is rarely necessary to start a CNN project entirely from scratch: published CNN architectures (AlexNet, VGG16, Darknet-19, etc.) can be readily adapted for image classification and detection projects. A good understanding of how a CNN works, however, is important when working on any image classification or detection task.
Fully-Connected Network
The objective of image classification is to assign an image, represented by a 2D (grayscale) or 3D (RGB) array of pixels, to a particular group based on its underlying features.
One can use an ordinary fully-connected network to perform the classification task by converting the pixel array into its corresponding vector. An example of a fully-connected network is shown in Figure 2. The first layer of nodes is the input (pixel vector) and the last layer is the output. The intermediate layers are known as ‘hidden layers’ and are meant to extract features.
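As a minimal sketch of this idea (all sizes are illustrative, and the weights here are random rather than learned), the forward pass of such a network is just flattening followed by matrix multiplications:

```python
import numpy as np

# Illustrative sizes: a 28x28 grayscale image, one hidden layer of 64
# units, 10 output classes. Weights are random stand-ins for learned ones.
rng = np.random.default_rng(0)

image = rng.random((28, 28))                 # 2D pixel array
x = image.flatten()                          # pixel vector, shape (784,)

W1 = rng.standard_normal((64, 784)) * 0.01   # hidden-layer weights
b1 = np.zeros(64)
W2 = rng.standard_normal((10, 64)) * 0.01    # output-layer weights
b2 = np.zeros(10)

hidden = np.maximum(0, W1 @ x + b1)          # ReLU activation
logits = W2 @ hidden + b2                    # one score per class

print(logits.shape)  # (10,)
```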
There are two main problems when using a fully-connected network to classify images:
1- No spatial information about the pixels is passed into the network, since the image is flattened into a vector. Usually, the location of pixels carries significant information (ex. eyes are close to the nose).
2- Excessively many trainable parameters. If all pixels are connected to all nodes in the hidden layer, the number of weights to be learned becomes prohibitively large.
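To see how quickly the parameter count grows, consider a 224×224 RGB image fully connected to a single hidden layer of 1000 units (sizes chosen purely for illustration):

```python
# Illustrative sizes: a 224x224 RGB input fully connected to 1000 hidden units.
pixels = 224 * 224 * 3            # length of the flattened pixel vector
hidden_units = 1000
weights = pixels * hidden_units   # one weight per (pixel, unit) pair

print(weights)  # 150528000 weights in the first layer alone
```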
As such, a filter-based approach is proposed to capture the spatial structure in the input.
Convolution
The idea is to have a sliding window (ex. 4 by 4 in Figure 3) to connect a patch of inputs to a single neuron in the subsequent layer. The sliding window is moved across the input layer by a given stride (ex. 2 pixels in Figure 3) to obtain the resulting neurons in the subsequent layer as shown in Figure 3.
This patch-wise operation (over an input patch by a sliding window) is called ‘convolution’: it multiplies the window and the patch element-wise and sums the results. The sliding window itself is called a ‘kernel’ or ‘filter’. An illustration is given in Figure 4.
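The multiply-and-sum can be sketched in a few lines of NumPy. This is a toy single-channel version with no padding; the input values and the all-ones 2×2 kernel are chosen only to make the arithmetic easy to follow:

```python
import numpy as np

def conv2d(inp, kernel, stride=1):
    """Slide `kernel` over `inp`, doing an element-wise multiply-and-sum
    at each position (single-channel sketch, no padding)."""
    k = kernel.shape[0]
    out_h = (inp.shape[0] - k) // stride + 1
    out_w = (inp.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = inp[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

inp = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))                # a toy 2x2 filter
print(conv2d(inp, kernel, stride=2))    # [[10. 18.] [42. 50.]]
```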
Another key aspect related to convolution is zero-padding. The idea here is to allow the convolution to operate properly on edge pixels by adding extra pixels of zero value around the border of the input.
By knowing the input size W, the filter (or kernel) size F, the stride S and the padding P, the size of an output feature map is given by (W − F + 2P)/S + 1.
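As a sanity check, the standard output-size formula (W − F + 2P)/S + 1 can be computed directly. The example numbers below are illustrative, matching the sliding-window description above (a 4×4 filter moving at stride 2) on an assumed 8×8 input:

```python
def output_size(w, f, s, p):
    """Output feature-map size for input size w, filter size f,
    stride s and zero-padding p."""
    return (w - f + 2 * p) // s + 1

# Illustrative: an 8x8 input, the 4x4 filter and stride 2 from the
# sliding-window description, no padding.
print(output_size(8, 4, 2, 0))  # 3
```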
Pooling — Max and Average
The size of the output feature maps can be quite big, so it is desirable to downscale them while preserving the spatial information. This can be achieved by pooling. The idea of pooling is quite simple, as shown in Figure 7. There are two types of pooling, namely average and max.
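Both kinds of pooling can be sketched with one small helper (a toy version with non-overlapping windows, i.e. the stride equals the window size; the input values are illustrative):

```python
import numpy as np

def pool2d(inp, size=2, mode="max"):
    """Downscale `inp` by taking the max (or average) of each
    non-overlapping `size` x `size` window."""
    out_h, out_w = inp.shape[0] // size, inp.shape[1] // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = inp[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 1.],
                 [0., 2., 5., 7.],
                 [1., 1., 8., 6.]])
print(pool2d(fmap, mode="max"))   # [[6. 2.] [2. 8.]]
print(pool2d(fmap, mode="avg"))   # [[3.5 1. ] [1.  6.5]]
```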
Multiple Channels — Input and Output
The description and illustration of the convolution operation above cover only 1 input map and 1 filter. In a CNN, however, the input map can have multiple channels (ex. a colour (RGB) image has 3 channels), and so can the output.
To deal with a multiple-channel input like an RGB image, the filter must be 3D (ex. 3×3×3) as shown in Figure 8. Essentially, the depth of the filter is equal to the number of input channels.
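In other words, a 3×3×3 filter covers a 3×3 patch across all 3 channels at once, and the whole element-wise product is summed into a single number. A tiny sketch with random illustrative values:

```python
import numpy as np

# One output value for a multi-channel input: the 3x3x3 filter is
# multiplied element-wise with a 3x3x3 input patch (all 3 channels
# at once) and everything is summed into one number.
rng = np.random.default_rng(0)
patch = rng.random((3, 3, 3))     # height x width x channels (RGB)
filt  = rng.random((3, 3, 3))     # filter depth == number of input channels

out_value = np.sum(patch * filt)  # one neuron in the output feature map
print(out_value)
```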
To produce multiple output feature maps, multiple filters are required. This is illustrated in Figure 9.
Below is a demo of a Conv layer with K = 2 filters, each with a spatial extent F = 3, moving at a stride S = 2, with input padding P = 1.
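The same K = 2, F = 3, S = 2, P = 1 setup can be sketched in NumPy to confirm the shapes (the 5×5×3 input size is assumed for illustration; by the formula above, (5 − 3 + 2·1)/2 + 1 = 3, giving a 3×3×2 output):

```python
import numpy as np

def conv_layer(inp, filters, stride, pad):
    """Sketch of a conv layer: `inp` is H x W x C, `filters` is
    K x F x F x C. Zero-pad the input, then slide each filter."""
    inp = np.pad(inp, ((pad, pad), (pad, pad), (0, 0)))
    K, F = filters.shape[0], filters.shape[1]
    out_h = (inp.shape[0] - F) // stride + 1
    out_w = (inp.shape[1] - F) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                patch = inp[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j, k] = np.sum(patch * filters[k])  # multiply-and-sum
    return out

rng = np.random.default_rng(0)
x = rng.random((5, 5, 3))         # assumed 5x5 RGB input
w = rng.random((2, 3, 3, 3))      # K=2 filters, F=3, depth 3
print(conv_layer(x, w, stride=2, pad=1).shape)  # (3, 3, 2)
```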
Convolutional Neural Network
So far, we have explained and illustrated the key components of CNN. Now we are ready to look at the complete picture of CNN.
The success of CNNs in computer vision lies in the use of filters (or kernels). Filters extract interesting features from an input image. Below is a list of some kernels which extract certain characteristics.
It is impractical to manually design filters that extract the features needed for image classification and detection tasks. The main idea of a CNN is to learn the feature representations (filters) from the dataset, so the filters are trainable weights. The use of filters in a CNN is summarised in Figure 12.
There are 3 levels of features: low, mid and high, as shown in Figure 13. The shallow layers of a CNN capture low-level features (ex. vertical/horizontal edges, simple structures). The intermediate layers extract mid-level features. And the deep layers specialise in high-level features, which are often hard to interpret with human eyes. The final feature maps are usually fed into a classifier (ex. SVM or MLP for a classification task) or a detector (ex. YOLO for an object detection task).
In a nutshell, a CNN learns to classify or detect through a series of feature-extracting convolutional and pooling layers, followed by a classifier/detector, as shown in Figure 14.
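Putting the pieces together, here is a shape-level sketch of that pipeline (convolution, non-linearity, pooling, then a fully-connected classifier). All sizes are illustrative and the weights are random; a real CNN stacks many such conv/pool stages and learns the weights by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((8, 8))                         # toy grayscale input
kernel = rng.random((3, 3))                    # one 3x3 filter (stand-in for a learned one)

# Convolution (stride 1, no padding): 8x8 input -> 6x6 feature map.
fmap = np.array([[np.sum(x[i:i+3, j:j+3] * kernel)
                  for j in range(6)] for i in range(6)])
fmap = np.maximum(0, fmap)                     # ReLU non-linearity

# 2x2 max pooling: 6x6 -> 3x3.
pooled = fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Flatten and classify with a fully-connected layer (10 classes).
features = pooled.flatten()                    # shape (9,)
W = rng.standard_normal((10, 9)) * 0.01
scores = W @ features
print(scores.shape)  # (10,)
```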
Wrap-up
In this post, we have covered the fundamental concepts of CNNs; this is Part-1 of the Convolutional Neural Network — An Informal Intro series. In the next post, we will explore modern CNN architectural ideas like residual networks, network-in-network and batch normalisation. We will also walk through a PyTorch implementation of Darknet-19.