Convolutional Neural Networks

Thomaskutty Reji
Published in Geek Culture · 7 min read · Sep 14, 2021

Contents

  • Introduction
  • General Architecture
  • Convolutional Block
  • The Convolution Operation
  • Pooling Operation
  • Padding operation
  • Striding Method
  • Implementing CNN from scratch
  • Calculating Number of Parameters per layer
  • Conclusion and further sources

Introduction

Convolutional neural networks (CNNs) are a type of neural network primarily used with image, speech and audio inputs. CNNs are well suited to processing images and extracting features from them using convolution operations. The major application areas include image classification, object detection and image segmentation (both semantic segmentation and instance segmentation).

General Architecture of CNN

fig1 : CNN architecture

A convolutional neural network has two parts, namely the feature learning part and the classification part. The feature learning part contains many convolutional blocks, and each such conv block generally consists of three operations: convolution, activation and pooling. A CNN model can have many convolutional blocks and fully connected layers, depending on the network design. For each convolutional layer we need to specify the number of filters. Filters detect patterns such as edges, shapes, textures, curves, objects, colors etc. The classification part is a feed-forward neural network. A CNN learns the kernel parameters during training, so we don't need to manually design kernels for specific feature extraction.

The Convolution Operation and the kernels

In the image processing domain, people use manually defined kernels or filters for image denoising, transformations etc. In the context of CNNs, however, the kernels are learned during training. Initially we take random weights for each parameter in a filter (kernel), and through backpropagation we obtain the optimized value for each parameter. What does convolution actually do? It is just a series of dot products between the kernel values and different regions of the image. To cover all parts of the input image we slide the kernel window over it. With a 3×3 kernel, for example, the filter slides over each 3×3 patch of pixels in the input, and the process repeats until every 3×3 block of pixels has been covered. This sliding is referred to as convolving. The results of the dot products are stored in the output, known as a feature map. Now consider the following figure, which shows how convolutions happen using kernels.

fig 2: convolution process

We can have many different kernels for feature extraction (vertical edges, curves, boundaries etc.). In the feature map, a high dot-product value indicates a region of the input image that matches the kernel well. So what do we do with the feature map?
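The sliding dot product described above can be sketched in a few lines of NumPy; the vertical-edge kernel and the tiny image below are illustrative choices, not from the article:

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'valid' convolution as used in CNNs (no kernel flipping):
    slide the kernel over the image and take a dot product at each step."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the patch
    return out

# Image with a dark left half and bright right half (a vertical edge)
image = np.array([[0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9],
                  [0, 0, 9, 9]], dtype=float)
# Classic vertical-edge detection kernel
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
print(convolve2d(image, kernel))  # every 2×2 output straddles the edge: all 27s
```

Note that a 4×4 input convolved with a 3×3 kernel yields a 2×2 feature map, which is why unpadded convolutions shrink the input.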

Pooling Operation

The feature map we get after the convolution operation is sensitive to the location of features in the input image. To address this sensitivity we downsample the feature maps using pooling, so the resulting downsampled features are more robust to changes in the position of a feature in the input image (local translation invariance). Pooling also reduces the spatial dimensions, so fewer multiplication operations are needed later in the network. There are various kinds of pooling; max pooling and average pooling are the common choices. In pooling we take a small region of the feature map and keep the maximum value (max pooling) or the average of the values in that region (average pooling). Let's consider the following figure to understand how max pooling works on a feature map.

fig3 : max pooling and average pooling
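Both pooling variants can be sketched directly in NumPy (the 4×4 feature map here is made up for illustration):

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: split the map into size×size blocks
    and reduce each block to a single value (max or mean)."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            block = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = block.max() if mode == "max" else block.mean()
    return out

fm = np.array([[1, 3, 2, 9],
               [5, 6, 1, 7],
               [4, 2, 8, 0],
               [3, 1, 6, 5]], dtype=float)
print(pool2d(fm, mode="max"))  # [[6. 9.] [4. 8.]]
print(pool2d(fm, mode="avg"))  # [[3.75 4.75] [2.5  4.75]]
```

Either way the 4×4 map shrinks to 2×2, which is the downsampling the text describes.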

Padding operation

Padding is the process of adding an extra set of pixels, usually zeros, around the input image to avoid information loss during the convolution operation. Padding helps improve performance by preserving the information at the borders. If we don't use padding, the volume size shrinks quickly and the border information is washed away too soon.

There are three kinds of padding in deep learning: valid padding, same padding and full padding. Valid padding means no padding at all. Same padding keeps the image size unchanged after the convolution. Full padding ensures that all pixels have the same influence on the output; in this case the output is larger than the input.
fig 4: padding (2 layers)
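Zero padding and its effect on the output size can be sketched with `numpy.pad`; the 4×4 image is illustrative:

```python
import numpy as np

# "Same" padding for a 3×3 kernel with stride 1: pad 1 pixel of zeros
# on every side so the convolution output keeps the input's spatial size.
image = np.arange(16, dtype=float).reshape(4, 4)
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(image.shape, "->", padded.shape)  # (4, 4) -> (6, 6)

# General output size for input n, kernel k, padding p, stride s:
#   out = (n + 2p - k) // s + 1
n, k, p, s = 4, 3, 1, 1
print((n + 2 * p - k) // s + 1)  # 4: same padding preserves the size
```

With p = 0 (valid padding) the same formula gives (4 - 3) + 1 = 2, showing the shrinkage that padding avoids.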

Striding Method

In strided convolution we shift the kernel window by more than one pixel at a time. If the stride is 2, the kernel slides over the image in steps of 2 pixels (along rows and columns).

fig 5: convolution with stride = 1
fig 6: convolution with stride = 2
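A small sketch of how the stride changes the output size (the all-ones image and kernel are placeholders; only the shapes matter here):

```python
import numpy as np

def strided_conv(image, kernel, stride):
    """Valid convolution that moves the kernel `stride` pixels at a time."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.ones((6, 6))
kernel = np.ones((3, 3))
print(strided_conv(image, kernel, stride=1).shape)  # (4, 4)
print(strided_conv(image, kernel, stride=2).shape)  # (2, 2)
```

Doubling the stride roughly halves each output dimension, so striding is another way (besides pooling) to downsample.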

Scaling, Batch Normalization and Dropout

Scaling or normalization is done by dividing each input pixel by 255, bringing all values to a uniform scale of 0 to 1. Features on a similar scale help gradient descent converge faster.
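In code this is a single division (the batch of random 8-bit pixels below is simulated input, not real data):

```python
import numpy as np

# Simulated batch of two 224×224 RGB images with 8-bit pixel values
batch = np.random.randint(0, 256, size=(2, 224, 224, 3)).astype("float32")
scaled = batch / 255.0  # every value now lies in [0, 1]
print(scaled.min() >= 0.0, scaled.max() <= 1.0)  # True True
```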

Batch normalization helps to improve training speed, performance and stability. It is applied only to the layers we choose. The first thing batch norm does is normalize the output of the activation function. It then multiplies this normalized output by a learnable scale parameter and adds a learnable shift parameter; this calculation sets a new standard deviation and mean for the data. Together with the batch mean and variance statistics, each channel has four parameters, of which only the scale and shift are trainable. This process keeps the weights within the network from becoming imbalanced with extremely high or low values, since the normalization is included in the gradient computation; with batch norm we normalize within the network as well, not only at the input. Batch normalization occurs on a per-batch basis.
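The normalize-then-rescale computation can be sketched directly (a simplified training-time version, ignoring the moving statistics used at inference; the toy activations are made up):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize activations across the batch axis, then rescale (gamma)
    and shift (beta). gamma and beta are the trainable parameters; the
    batch mean and variance are statistics, not trainable weights."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta             # new mean = beta, std ≈ gamma

x = np.array([[1.0, 200.0],
              [3.0, 400.0]])  # two features on wildly different scales
y = batch_norm(x, gamma=1.0, beta=0.0)
print(y.mean(axis=0))  # ≈ [0. 0.]: both features now centered
```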

Dropout helps to minimize overfitting. Neural networks trained on small datasets tend to overfit. During training, a fraction of each layer's outputs are randomly ignored, or "dropped out". Dropout is applied per layer in a neural network.
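A minimal sketch of (inverted) dropout, the variant most frameworks implement: surviving activations are rescaled so the expected sum is unchanged, and at inference the layer is a no-op.

```python
import numpy as np

def dropout(x, rate, training=True, rng=None):
    """Zero out a `rate` fraction of activations during training and
    rescale the survivors by 1/(1-rate); identity at inference time."""
    if not training:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate  # keep each unit with prob 1-rate
    return x * mask / (1.0 - rate)

x = np.ones(10)
print(dropout(x, rate=0.5))  # roughly half zeros, the rest rescaled to 2.0
```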

Fully connected layer

After the convolutional layers, the data still has a huge number of dimensions. We flatten the output of the final convolutional block and then design the feed-forward network, which can have any number of fully connected layers. In the final layer we use a softmax activation function for a multiclass classification problem and a sigmoid for a binary classification problem. In the following section let's create a convolutional neural network using Keras.

Implementing a Convolutional Neural Network using Keras

gist 1 : model class
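Since the gist itself is not reproduced here, the following is a sketch of a model consistent with the summary analyzed below: two Conv2D blocks of 32 filters with "same" padding (so 224 halves to 112 only at pooling), batch normalization and dropout after the first block. The dense head sizes (64 and 10) are illustrative assumptions, not from the original gist.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # 896 params
    layers.MaxPooling2D((2, 2)),      # 224 -> 112
    layers.BatchNormalization(),      # 128 params (half non-trainable)
    layers.Dropout(0.2),              # no params
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # 9248 params
    layers.MaxPooling2D((2, 2)),      # 112 -> 56
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # assumed 10-class problem
])
model.summary()
```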

Now let's get the summary of the model and analyze the parameters of each layer. The following image shows the model (network) summary.

fig 7 : model summary

We can see that the input image shape is (224, 224, 3), so we have a three-channel image. The input layer has no parameters. The first convolution layer uses 32 filters. Since our input image has three channels, each filter also has three channels. In the network design we set the filter size to (3, 3), so each filter has three channels of shape (3, 3), i.e. 9 parameters per channel. In total each filter has 27 parameters + 1 bias = 28. Since we have 32 filters in the first convolution, it has 32 × 28 = 896 parameters.

We use (2, 2) pooling, so the spatial dimension reduces to half of the original (from 224 to 112). As for batch normalization, the layer has 4 parameters per filter, half of which are non-trainable. Since the previous layer has 32 filters, the total is 32 × 4 = 128 parameters. The dropout layer has no parameters and has no effect on the dimensions.

Note that the second convolution layer has 9248 parameters. Let's look into this. The previous layer has 32 channels, so each filter must also have 32 channels. With 9 parameters per channel, each filter has (9 × 32) + 1 bias = 289 parameters. Since we have 32 such filters, the total number of parameters in the second convolution is 289 × 32 = 9248.
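The counting rule above generalizes to any Conv2D layer and can be checked with a few lines of arithmetic:

```python
def conv2d_params(in_channels, out_filters, kernel_h=3, kernel_w=3):
    """Parameters in a Conv2D layer: each filter spans all input
    channels (kernel_h * kernel_w * in_channels weights) plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * out_filters

print(conv2d_params(3, 32))   # first conv:  (9*3  + 1) * 32 = 896
print(conv2d_params(32, 32))  # second conv: (9*32 + 1) * 32 = 9248
print(32 * 4)                 # batch norm over 32 channels:    128
```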

Conclusion

We have seen how convolutional neural networks work and implemented a basic CNN using Keras. CNNs have use cases across many different domains. To train a CNN from scratch we need a sufficient number of images per category. Advanced CNN topics include transfer learning, various other kinds of convolutions, residual networks etc. The reader should now be comfortable with the fundamentals of convolutional neural networks.

References and further reading
