CONVOLUTIONAL
NEURAL NETWORK (CNN) — A BRIEF INTRODUCTION

Vinamra Shrivastava
Electronics Club IITK
Mar 20, 2021

INTRODUCTION

CNNs or ConvNets are very similar to regular Neural Networks; they are made up of neurons with learnable weights and biases. Each neuron takes an input, performs a dot product, and optionally follows it with a non-linearity. CNNs are in high demand in today’s world, widely used in image recognition, image classification, object detection, and face recognition. A CNN still expresses a single differentiable score function and still has a loss function (e.g. Softmax) on the last (fully-connected) layer. Most of the techniques and tricks from regular Neural Networks still apply.

So what changes? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode specific properties into the architecture. These assumptions make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

Many tech giants use it: Facebook for automatic photo tagging, Amazon for product recommendations, Google for photo search, and the list goes on.

Convolutional Neural Network

CNN image classification takes an input image, processes it, and classifies it into one of several categories. For example, say we need to build a cat-and-dog classifier that takes an image, assigns importance to various aspects of the image, and predicts the probability of each class. We then choose the class with the higher output probability. The pre-processing required by a CNN is much lower than for other classification algorithms.

Dog and Cat Classifier

A computer can’t perceive images the way we humans do; to a computer, an image is nothing but an array of pixels. For instance, a 6 x 6 RGB image is a 6 x 6 x 3 array (width = 6, height = 6, and 3 refers to the R, G and B channels).

Input ( Training Data) :

  • The input layer or input volume is an image with the dimensions [width x height x channels]. It is a matrix of pixel values.
  • Example: Input: [64x64x3] => (width = 64, height = 64, depth = 3). The depth here represents the R, G, B channels.
  • The input width and height should be divisible by 2 many times. Common sizes include 32, 64, 96, 224, 384, and 512.
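As a concrete sketch (using NumPy; the 64 x 64 size is just an illustration), such an input volume is simply a 3-D array:

```python
import numpy as np

# A 64x64 RGB input volume: height x width x channels
image = np.zeros((64, 64, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]  # set the top-left pixel to pure red

print(image.shape)  # (64, 64, 3)
```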

A ConvNet has two parts: feature learning (Conv, ReLU, Pooling) and classification (FC, Softmax), so let’s dive into each.

Convolutional Layer :

The convolutional layer's main goal is to extract features such as edges, corners, and colours from the image. As we go deeper into the network, it starts recognising more complex features such as car parts, shapes, and digits.

A demo of a Conv layer with K = 2 filters, each with spatial extent F = 3, moving at a stride S = 2, and input padding P = 1. (Reference: CS231n notes.)

Most of us have heard the word convolution at some point, and its meaning here doesn’t change: it is all about convolving one object with another. The convolutional layer carries the majority of the network’s computational load. We perform a dot product between a receptive field and a filter across all dimensions. The result is a single number in the output volume (the feature map). Then we slide the filter over the next receptive field of the same input image by a stride and compute the dot product between the new receptive field and the same filter again. We repeat this process until we have gone through the entire input image. The output becomes the input for the next layer.
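The sliding dot product described above can be sketched in plain NumPy. This is a toy single-channel, single-filter version with no padding; real conv layers batch this over many filters and channels:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid convolution (strictly, cross-correlation, as in most CNN libraries)."""
    H, W = image.shape
    F = kernel.shape[0]
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # receptive field: local region with the same shape as the kernel
            field = image[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(field * kernel)  # dot product -> one output value
    return out

img = np.arange(36, dtype=float).reshape(6, 6)  # a 6x6 "image"
k = np.array([[1., 0., -1.]] * 3)               # a simple vertical-edge filter
fmap = conv2d(img, k, stride=1)
print(fmap.shape)  # (4, 4)
```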

Featured matrix, filter, and kernel are names for the matrix used for feature detection. It decides the dimensions or shape of the next layer, so we have to choose its shape carefully.

The receptive fields mentioned above are the local regions of the input with the same shape as the kernel or filter. We perform a dot product between the receptive field and the kernel to obtain one entry of the feature map.

Feature map is the output formed by the dot products of the kernel with each receptive field as it slides over the image.

Stride is the number of pixels the filter shifts at each step; a larger stride produces a smaller output volume. For example, if stride = 2, the filter (kernel) shifts by two steps horizontally or vertically. Strides of 1 or 2 are the most common.
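The output size follows the standard formula (W − F + 2P)/S + 1, where W is the input width, F the filter size, S the stride, and P the padding. A small helper makes this concrete:

```python
def conv_output_size(W, F, S, P):
    """Output width of a conv layer: input width W, filter size F, stride S, padding P."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(6, 3, 1, 0))   # 4: a 3x3 filter over a 6x6 input, stride 1, no padding
print(conv_output_size(64, 3, 1, 1))  # 64: padding of 1 preserves the input size
print(conv_output_size(64, 3, 2, 1))  # 32: stride 2 halves it
```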

Zero padding adds zeroes around the input image so that the shape of the feature map remains the same as the input; if we don’t pad with zeros, much of the information and many of the features at the corners get lost. Zero padding is applied in the picture above.
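In NumPy, zero padding is a one-liner; this sketch shows how padding of 1 keeps a 6 x 6 input at 6 x 6 after a 3 x 3, stride-1 convolution:

```python
import numpy as np

img = np.ones((6, 6))
padded = np.pad(img, pad_width=1)  # adds one row/column of zeros on every side
print(padded.shape)  # (8, 8)

# A 3x3 filter with stride 1 over the 8x8 padded image yields (8 - 3)//1 + 1 = 6,
# so the feature map keeps the original 6x6 shape.
```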

Valid padding means no padding at all: we drop the part of the image where the filter doesn’t fit. Example below.

Convolution operation on image matrix

Non-linearity(Relu activation) :

ReLU stands for Rectified Linear Unit. It is a non-linear activation function that thresholds at zero. There are other non-linear activation functions such as tanh and sigmoid, but for now we proceed with ReLU, which usually performs best.

ƒ(x) = max(0,x)

f(x) operation on image matrix
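Applied to a feature map, ReLU simply zeroes out the negative entries (a NumPy sketch):

```python
import numpy as np

def relu(x):
    # Elementwise thresholding at zero: f(x) = max(0, x)
    return np.maximum(0, x)

fmap = np.array([[-3., 2.],
                 [ 5., -1.]])
print(relu(fmap))  # [[0. 2.]
                   #  [5. 0.]]
```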

Pooling Layer :

The pooling layer reduces the number of parameters in later layers when the images are too large; it has no learnable parameters itself. Spatial pooling, also called subsampling or downsampling, reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of different types:

  • Max Pooling — the most used type of pooling; takes the maximum value from each region of the feature map.
  • Average Pooling — takes the average of the elements in each region of the feature map.
  • Sum Pooling — takes the sum of the elements in each region of the feature map.
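A minimal max-pooling sketch in NumPy, using 2 x 2 windows with stride 2 (the most common setting):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Max pooling: keep the largest value in each size x size window."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
print(max_pool(fmap))  # [[6. 8.]
                       #  [3. 4.]]
```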

Fully Connected Layer :

Till now we haven’t done anything about classifying images; what we have done is highlight features in the image. The fully connected layer is a feed-forward neural network. Fully connected layers form the last few layers of the network. We flatten our matrix into a vector and feed it into a fully connected layer, just like in a regular neural network.

After Pooling

In the above diagram, the feature map matrix is converted into a vector (x1, x2, x3, …). With the fully connected layers, we combine these features to create a model. Finally, an activation function such as softmax or sigmoid classifies the outputs as cat, dog, car, truck, etc.
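A sketch of the flatten + fully connected + softmax step. The weights here are random placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical 4x4 pooled feature map, flattened into a vector
pooled = rng.random((4, 4))
x = pooled.flatten()               # (16,) vector: x1, x2, x3, ...

W = rng.standard_normal((2, 16))   # fully connected weights for 2 classes: cat, dog
b = np.zeros(2)
probs = softmax(W @ x + b)

print(probs.sum())  # probabilities sum to 1
```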

CNN network

Summary :

  • Provide the input image to the convolution layer
  • Perform convolution on the image with kernels/filters, choosing parameters such as the number of filters, stride, and padding, then apply the ReLU activation to the resulting matrix
  • Perform pooling to reduce the dimensionality
  • Add as many convolutional layers as needed
  • Flatten the output and feed it into a fully connected layer (FC layer)
  • Output the class using an activation function such as softmax and classify the image
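The steps above can be strung together into a toy forward pass (single channel, one untrained filter, random fully connected weights; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, k):
    # Valid convolution with stride 1
    F = k.shape[0]
    H, W = img.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+F, j:j+F] * k)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, s=2):
    # 2x2 max pooling via a reshape trick (crops any ragged edge first)
    H, W = x.shape
    return x[:H//s*s, :W//s*s].reshape(H//s, s, W//s, s).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

img = rng.random((8, 8))              # toy grayscale input
kernel = rng.standard_normal((3, 3))  # untrained filter (weights are random here)
fmap = relu(conv2d(img, kernel))      # convolution + ReLU -> (6, 6)
pooled = max_pool(fmap)               # pooling -> (3, 3)
x = pooled.flatten()                  # flatten -> (9,)
W_fc = rng.standard_normal((2, 9))    # fully connected layer, 2 classes
probs = softmax(W_fc @ x)
print(probs)                          # e.g. probabilities for [cat, dog]
```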
