Convolutional Neural Network

Harshitha Harshi
Published in Analytics Vidhya · Jun 28, 2021 · 13 min read

Overview

In this article, I want to share my knowledge of the basics of Convolutional Neural Networks. Understanding the basics of any concept makes learning it easy, so let’s get started and understand CNNs (or ConvNets) in a better way.

Table of Contents:

  1. Introduction
  2. How does a CNN learn from images?
  3. Filters and Filter Depth
  4. Parameter Sharing
  5. Padding
  6. Dimensionality
  7. Pooling
  8. CNN Basic Architecture
  9. Activation Function
  10. Training CNN using Backpropagation
  11. Existing Architectures

1. Introduction:

Convolutional Neural Networks (CNNs or ConvNets) are a category of neural networks built specifically for visual tasks, in areas such as image recognition, image classification, and object recognition.

ConvNets are an important tool for most machine learning practitioners today. However, understanding ConvNets and learning to use them for the first time can sometimes be an intimidating experience.

A Convolutional Neural Network (CNN) is biologically inspired by the visual cortex of the human brain. It comprises one or more convolutional layers followed by one or more fully connected layers.

2. How does a CNN learn from images?

Let’s develop a better intuition for how Convolutional Neural Networks (CNN) work. We’ll examine how humans classify images, and then see how CNNs use similar approaches.

Images are nothing but pixels, and each pixel holds RGB color values from which the image is formed. I used a pixel viewer to view the pixel values of the image below.

Original image of a cat
RGB pixel values of the above cat image

Let’s say we wanted to classify the following image of a cat as a particular breed:

As humans, how do we do this?

We identify certain parts of the cat, such as the nose, the eyes, and the mouth. We essentially break the image up into smaller pieces, recognize those pieces, and then combine them to get an idea of the overall cat.

In this case, we might break down the image into a combination of the following:

  • A nose
  • Eyes
  • A mouth

Going One Step Further

How do we determine what exactly a nose is?

A nose can be seen as an oval with two black holes inside it. Thus, one way of classifying a nose is to break it up into smaller pieces and look for the black holes (nostrils) and the curves that define an oval, as shown below:

A curve that we can use to determine a nose
A nostril that we can use to classify the nose of the cat

Broadly speaking, this is what a CNN learns to do. It learns to recognize basic lines and curves, then shapes and blobs, and then increasingly complex objects within the image. Finally, the CNN classifies the image by combining the larger, more complex objects.

In our case, the levels in the hierarchy are:

  • Simple shapes, like ovals and dark circles
  • Complex objects (combinations of simple shapes), like eyes, nose, and mouth
  • The cat is a combination of complex objects.

With deep learning, we don’t actually program the CNN to recognize these specific features. Rather, the CNN learns on its own to recognize such objects through forward propagation and backpropagation!

It’s amazing how well a CNN can learn to classify images, even though we never program it with information about specific features to look for.

Hierarchy of how a CNN splits an image, layer by layer

A CNN might have several layers, and each layer might capture a different level in the hierarchy of objects. The first layer is the lowest level in the hierarchy, where the CNN generally classifies small parts of the image into simple shapes like horizontal and vertical lines and simple blobs of colors.

The subsequent layers tend to be higher levels in the hierarchy and generally classify more complex ideas like shapes (combinations of lines), and eventually full objects like cats.

Once again, the CNN learns all of this on its own. We don’t even have to tell the CNN to go looking for lines, curves, noses, or mouths. The CNN just learns from the training set and discovers which characteristics of a cat are worth looking for.

3. Filters

The first step for a CNN is to break up the image into smaller pieces. We do this by selecting a width and height that defines a filter.

Filter/Kernel: Filters or kernels are m×n matrices that scan the incoming image matrix and, through element-wise multiplication and summation, produce results that capture various image features. The filter looks at small pieces, or patches, of the image. These patches are the same size as the filter.

Image Source: medium

In the above GIF, the green matrix is a 5×5 image and the yellow matrix is the 3×3 kernel/filter. By computing the kernel over the image matrix, we get the convolved feature matrix. The filter/kernel simply slides horizontally or vertically to focus on a different piece of the image.

Stride: The amount by which the filter slides is referred to as the ‘stride’. Increasing the stride reduces the size of your model by reducing the number of total patches each layer observes.
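To make the sliding-filter idea concrete, here is a minimal NumPy sketch (the image and kernel values are made up for illustration): a 3×3 kernel slides over a 5×5 image with a chosen stride, and each patch is multiplied element-wise with the kernel and summed.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the convolved feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

image = np.arange(25).reshape(5, 5)           # toy 5x5 "image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])               # a simple vertical-edge kernel
print(convolve2d(image, kernel, stride=1).shape)  # (3, 3) feature map
```

With stride 1 this produces the 3×3 feature map described above; with stride 2 it would produce a 2×2 map instead.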

Let’s look at an example. In this zoomed-in image of the cat, we first start with the patch outlined in red. The width and height of our filter define the size of this square.

We then move the square over to the right by a given stride (2 in this case) to get another patch.

We move our square to the right by two pixels to create another patch.

What’s important here is that we are grouping adjacent pixels and treating them as a collective.

In a normal, non-convolutional neural network, we would have ignored this adjacency. In a normal network, we would have connected every pixel in the input image to a neuron in the next layer. In doing so, we would not have taken advantage of the fact that pixels in an image are close together for a reason and have special meaning.

By taking advantage of this local structure, our CNN learns to classify local patterns, like shapes and objects, in an image.

Point to remember: in CNNs, filter values are not pre-defined. The value of each filter is learned during the training process.

Why are filters/kernels learnable, and how do they learn?

In the early approaches to convolutional neural networks (before 1988), hardcoded filters were used, meaning the filters were not learnable. The problem with hardcoded filters is that they are not helpful for extracting feature maps from different types of data.

For this reason, learnable kernels came into use. Filters/kernels learn and change their values during backpropagation, so they work similarly to the weights in a multi-layer perceptron.

Filter Depth

It’s common to have more than one filter. Different filters pick up different qualities of a patch. For example, one filter might look for a particular color, while another might look for an object of a particular shape. The number of filters in a convolutional layer is called the filter depth.

a patch is connected to a neuron in the next layer

How many neurons does each patch connect to?

Well, that depends on our filter depth. If we have a depth of n, we connect each patch of pixels to n neurons in the next layer, which gives the next layer a depth of n. In practice, n is a hyperparameter we tune, and most CNNs tend to pick similar starting values.
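As a quick sketch of this (using tf.keras; the 32×32 image size and the filter depth of 32 are arbitrary choices), a convolutional layer with filter depth n produces an output with n channels:

```python
import tensorflow as tf

# A batch of one 32x32 RGB image passed through a conv layer with
# filter depth n = 32: each patch connects to 32 neurons, so the
# output has 32 channels.
x = tf.random.normal((1, 32, 32, 3))
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1)
print(conv(x).shape)  # (1, 30, 30, 32)
```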

But why connect a single patch to multiple neurons in the next layer? Isn’t one neuron good enough?

Multiple neurons can be useful because a patch can have multiple interesting characteristics that we want to capture.

For example, one patch might include several interesting features, like in the image below:

This patch, or part of the cat, contains many interesting features, including eyes, a nose, whiskers, etc.

Having multiple neurons for a given patch ensures that our CNN can capture whatever characteristics it learns are important.

Remember that a CNN isn’t “programmed” to look for certain characteristics. Rather, it learns on its own which characteristics to notice.

4. Parameter Sharing

When we are trying to classify a picture of a cat, we don’t care where in the image a cat is. If it’s in the top left or the bottom right, it’s still a cat in our eyes. We would like our CNNs to also possess this ability known as translation invariance. How can we achieve this?

As we saw earlier, the classification of a given patch in an image is determined by the weights and biases corresponding to that patch.

If we want a cat that’s in the top left patch to be classified in the same way as a cat in the bottom right patch, we need the weights and biases corresponding to those patches to be the same, so that they are classified the same way.

This is exactly what we do in CNNs. The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren’t shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters helps us with translation invariance and gives us a smaller, more scalable model.
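The effect of parameter sharing shows up directly in the parameter count. In this sketch (tf.keras; the 128×128×3 image size and 32 output channels are arbitrary), the convolutional layer needs only (3·3·3 + 1)·32 = 896 parameters regardless of image size, while a fully connected layer over the same flattened image needs over 1.5 million:

```python
import tensorflow as tf

# Convolutional layer: weights are shared across all patches, so the
# parameter count is independent of the image size.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3)
conv.build(input_shape=(None, 128, 128, 3))
print(conv.count_params())   # (3*3*3 + 1) * 32 = 896

# Fully connected layer over the flattened image: one weight per
# pixel-neuron pair, which does not scale well.
dense = tf.keras.layers.Dense(32)
dense.build(input_shape=(None, 128 * 128 * 3))
print(dense.count_params())  # (128*128*3 + 1) * 32 = 1,572,896
```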

5. Padding

Padding is the process of adding extra pixels (usually zeros) around the borders of the input image.

Image Source: GeeksforGeeks

Types of padding:

  1. Valid padding: no padding is applied to the input image; the image is used as-is, so the convolved output is smaller than the input.
  2. Same padding: we add padding so that the output has the same dimensions as the input image (see the sketch below).
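A quick way to see the difference between the two padding types (a tf.keras sketch; the 5×5 input is an arbitrary example):

```python
import tensorflow as tf

x = tf.random.normal((1, 5, 5, 1))   # a single 5x5, 1-channel image

# Valid padding: no extra pixels are added, so the output shrinks.
valid = tf.keras.layers.Conv2D(1, 3, padding="valid")(x)
print(valid.shape)  # (1, 3, 3, 1)

# Same padding: zeros are added around the border so the output
# keeps the same spatial size as the input.
same = tf.keras.layers.Conv2D(1, 3, padding="same")(x)
print(same.shape)   # (1, 5, 5, 1)
```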

6. Dimensionality

From what we’ve learned so far, how can we calculate the number of neurons of each layer in our CNN?

Given an input of spatial size (width or height) W, a filter of size F, a stride of S, and padding of P,

the following formula gives us the spatial size of the next layer: (W - F + 2P)/S + 1. The depth of the next layer equals the number of filters.

Knowing the dimensionality of each additional layer helps us understand how large our model is and how our decisions around filter size and stride affect the size of our network
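As a sanity check, here is the same formula in a few lines of Python (the example numbers are arbitrary):

```python
def output_size(w, f, s=1, p=0):
    """Spatial size of a conv layer output: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# A 32x32 input, 5x5 filter, stride 1, no padding -> 28x28 output.
print(output_size(32, 5, s=1, p=0))  # 28

# The same input with padding 2 keeps the spatial size: 32x32.
print(output_size(32, 5, s=1, p=2))  # 32
```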

7. Pooling

The pooling layer operates on each feature map independently. Pooling reduces the resolution of the feature map by reducing its height and width, but retains the features required for classification. This is called downsampling.

Types of pooling:

  1. Max pooling: when max pooling with a 2×2 filter and stride = 2 is applied to a 4×4 feature map, it reduces the feature map to 2×2, and each value of that 2×2 map is the maximum of the region the filter covers as it slides over the feature map.

  2. Average pooling: when average pooling with a 2×2 filter and stride = 2 is applied to a 4×4 feature map, it reduces the feature map to 2×2, and each value of that 2×2 map is the average of the region the filter covers as it slides over the feature map.

  3. Global pooling: it reduces each channel in the feature map to a single value. It can be either global max pooling or global average pooling (all three types are sketched below).
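Here is a small sketch of the three pooling types applied to a 4×4 feature map (tf.keras; the feature-map values are made up):

```python
import tensorflow as tf

# A 4x4 feature map with a single channel, batch size 1.
x = tf.reshape(tf.constant([[1., 3., 2., 4.],
                            [5., 6., 7., 8.],
                            [3., 2., 1., 0.],
                            [1., 2., 3., 4.]]), (1, 4, 4, 1))

# Max pooling: keep the largest value in each 2x2 window -> 2x2 map.
print(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)[0, :, :, 0])

# Average pooling: keep the mean of each 2x2 window -> 2x2 map.
print(tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)(x)[0, :, :, 0])

# Global pooling: reduce the whole channel to a single value.
print(tf.keras.layers.GlobalMaxPooling2D()(x))      # shape (1, 1)
print(tf.keras.layers.GlobalAveragePooling2D()(x))  # shape (1, 1)
```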

9. Convolutional Layer in a Neural Network

Above, we learned about the important parts of a convolutional layer. Now let's build the structure of a convolutional layer.

Let's understand the architecture of a CNN by creating one simple model:

Structure of single Convolutional Layer without padding and with padding.

Here, let's say the image size is 5×5. When a kernel of size 3×3 with stride 1 is applied, the learnable kernel extracts features and outputs a 3×3 convolved matrix, or feature map. 2×2 pooling with stride = 1 is then applied on top of the convolved matrix, and the resulting values are flattened into a single vector by the flatten layer.

In the structure above, I also apply padding to the convolved image, or feature map, so that the feature map size is now 4×4; after padding, pooling with stride = 1 is applied and the values are flattened using a flatten layer.
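A sketch of the "without padding" path described above in tf.keras (the shapes follow the 5×5 example; the kernel values are learned during training, not fixed here):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 5, 1)),                        # 5x5 input image
    tf.keras.layers.Conv2D(1, kernel_size=3, strides=1),    # -> 3x3 feature map
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=1),   # -> 2x2 pooled map
    tf.keras.layers.Flatten(),                               # -> vector of length 4
])
model.summary()
```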

Why are pooling and padding needed?

As we took stride = 1, there is overlap in the extracted features, which results in more data that is highly correlated. By applying pooling, we can reduce the size of the data without losing important features. In our example, we reduce the 3×3 convolved matrix to 2×2 by applying pooling with stride = 1. If we increase the stride value, we can reduce the data even further.

In our example, we decompose a 5×5 image into a 3×3 convolved matrix by applying a 3×3 kernel/filter. In some situations we don't want to do this, because we may lose features at the edges of the image, or we need to keep matrices the same size for matrix multiplication; that is where padding comes in. In the example above, the convolved feature is padded to 4×4; when we apply pooling on top of that 4×4 convolved feature, it gives a 3×3 matrix, which means we keep the same size as the previous matrix.

Note:

Pooling can only be applied to feature maps/convolved images, but padding can be applied to input images as well as to feature maps.

Structure of A Convolutional Neural Network

The CNN layers learn what to extract from images using kernels, padding, and pooling. After the CNN layers, we add a fully connected layer, which learns the relationships between the patterns extracted by the CNN layers in terms of weights and gives an output.

Images are nothing but a matrix of RGB values (in the case of color images). When we apply kernels/filters on top of the images, the kernels start extracting edges or features of the images. If we apply multiple filters, we get multiple convolved images/matrices, or feature maps.

On top of those convolved matrices, we may apply some more convolution layers, max pooling, and batch normalization to extract further feature maps; finally, we flatten those feature maps and add fully connected neurons to get the final output, the classification of the image.
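Putting the pieces together, here is a minimal sketch of such a network in tf.keras (the input size, filter counts, and 10-class output are arbitrary choices for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                  # RGB input image
    tf.keras.layers.Conv2D(32, 3, activation="relu"),   # extract low-level features
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.MaxPooling2D(2),                    # downsample feature maps
    tf.keras.layers.Conv2D(64, 3, activation="relu"),   # extract higher-level features
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),                           # feature maps -> single vector
    tf.keras.layers.Dense(64, activation="relu"),        # learn relations between features
    tf.keras.layers.Dense(10, activation="softmax"),     # class probabilities
])
model.summary()
```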

I tried covering the basic terminologies and architecture of a single convolutional layer above.

10. Training CNN using Backpropagation

  • We initialize all the filters in the CNN layers and the parameters/weights in the fully connected network with random values.
  • The network takes a training image as input, goes through the forward propagation step (convolution, ReLU, and pooling operations, along with forward propagation in the fully connected layer), and finds the output probabilities for each class.
  • In a CNN architecture we use convolutional layers and max pooling, so we have to make sure that both the convolutional layer and max pooling are differentiable in order to do backpropagation.
  • A convolutional layer contains a convolution operator followed by an activation function. Both the convolution operator and the activation function are differentiable, and max pooling is also differentiable.
  • After we get the output, we can calculate the loss between the actual output (y) and the predicted output (ŷ); once we have the loss function, we can differentiate it.
  • Since the weights are randomly assigned for the first training example, the output probabilities are also random.
  • Calculate the total error at the output layer.
  • Use backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values/weights and parameter values to minimize the output error.

The weights are adjusted in proportion to their contribution to the total error.

Parameters like the number of filters, filter sizes, the architecture of the network, etc. have all been fixed before Step 1 and do not change during the training process — only the values of the filter matrix and connection weights get updated.

  • Repeat backpropagation with all images in the training set until the error rate becomes optimal.

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.
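In practice, the training loop described above (random initialization, forward propagation, loss calculation, backpropagation, and gradient descent) and the final forward pass on a new image can be sketched as follows; `model` is the network from the earlier sketch, and `x_train`, `y_train`, and `x_new` are placeholders for data you would supply:

```python
# `model`, `x_train`, `y_train`, and `x_new` are assumed to exist;
# they stand in for the network and data described above.
model.compile(
    optimizer="sgd",                          # gradient descent on filters and weights
    loss="sparse_categorical_crossentropy",   # error between y and y^
    metrics=["accuracy"],
)

# Forward propagation + backpropagation repeated over the training set.
model.fit(x_train, y_train, epochs=10, batch_size=32)

# For a new (unseen) image, a single forward pass gives class probabilities.
probabilities = model.predict(x_new)
```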

I explained training using backpropagation in simple English; if you are interested in understanding the training mathematically in detail, please go through this link.

11. CNN Architectures

Convolutional Neural Networks have been around since the early 1990s.

  • LeNet (the 1990s)
  • AlexNet (2012)
  • ZF Net (2013)
  • GoogLeNet (2014)
  • VGGNet (2014)
  • ResNets (2015)
  • DenseNet (August 2016)

Bonus references for those who completed this article:

If you're a beginner and want to understand CNNs visually, please visit this site: CNN Explainer.

Once you are familiar with the basic CNN architecture, you can visit the TensorSpace Playground.

Thank you for reading this article …

REFERENCES:

  1. https://github.com/nehal96/Deep-Learning-ND-Exercises/blob/master/Convolutional%20Neural%20Networks/convolutional-neural-networks-notes.md
  2. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
  3. https://www.tensorflow.org/tutorials/images/cnn
  4. https://developersbreach.com/convolution-neural-network-deep-learning/
  5. Backpropagation in CNN
  6. CNN explainer
