An Overview of Convolutional Neural Network (CNN)

Liang Han Sheng · Published in Analytics Vidhya · Aug 8, 2021 · 11 min read
Source: https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148

The convolutional neural network (CNN) is a class of deep learning neural networks. CNNs represent a huge breakthrough in image recognition. They’re most commonly used to analyze visual imagery and frequently work behind the scenes in image classification. They can be found at the core of everything from Facebook’s photo tagging to self-driving cars, and they’re hard at work in everything from healthcare to security.

Image classification is the process of taking an input (like a picture) and outputting a class (like “cat”) or a probability that the input is a particular class (“there’s a 90% probability that this input is a cat”). You can look at a picture and know that you’re looking at a terrible shot of your own face, but how can a computer learn to do that? With a convolutional neural network.

There are four main types of layers in a CNN:

  • Convolutional Layers
  • ReLU Layers
  • Pooling Layers
  • Fully Connected Layers

A classic CNN architecture would look something like this:

Input -> Convolution -> ReLU -> Convolution -> ReLU -> Pooling -> ReLU -> Convolution -> ReLU -> Pooling -> Fully Connected
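If you want to see what that sequence looks like in code, here is a minimal sketch using the Keras Sequential API. The input size, filter counts, kernel sizes, and number of output classes are illustrative assumptions, not fixed parts of the architecture.

```python
# A sketch of the classic layer sequence above in Keras.
# Input size, filter counts, kernel sizes, and class count are illustrative.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),               # Input (64 x 64 RGB image)
    layers.Conv2D(32, (3, 3), activation="relu"),  # Convolution -> ReLU
    layers.Conv2D(32, (3, 3), activation="relu"),  # Convolution -> ReLU
    layers.MaxPooling2D((2, 2)),                   # Pooling
    layers.Conv2D(64, (3, 3), activation="relu"),  # Convolution -> ReLU
    layers.MaxPooling2D((2, 2)),                   # Pooling
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),        # Fully Connected
])
model.summary()
```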

A CNN convolves learned features with input data and uses 2D convolutional layers, which makes this type of network ideal for processing 2D images. Compared to other image classification algorithms, CNNs use very little preprocessing: they learn the filters that have to be hand-engineered in other algorithms. CNNs are used in tons of applications, from image and video recognition, image classification, and recommender systems to natural language processing and medical image analysis.

CNNs are inspired by biological processes. They’re based on some cool research done by Hubel and Wiesel in the 60s regarding vision in cats and monkeys. The pattern of connectivity in a CNN comes from their research on the organization of the visual cortex. In a mammal’s visual cortex, individual neurons respond to visual stimuli only within their receptive field, a restricted region of the visual field. The receptive fields of different neurons partially overlap so that the entire field of vision is covered. This is the way a CNN works.

A CNN has an input layer, an output layer, and hidden layers. The hidden layers usually consist of convolutional layers, ReLU layers, pooling layers, and fully connected layers.

  • Convolutional layers apply a convolution operation to the input. This passes the information on to the next layer.
  • Pooling combines the outputs of clusters of neurons into a single neuron in the next layer.
  • Fully connected layers connect every neuron in one layer to every neuron in the next layer.

In a convolutional layer, neurons only receive input from a subarea of the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer.

A CNN works by extracting features from images, which eliminates the need for manual feature extraction. The features are not hand-engineered; they are learned while the network trains, and that is what makes CNNs so accurate for computer vision tasks. CNNs learn feature detection through tens or hundreds of hidden layers, and each layer increases the complexity of the learned features.

Step-by-Step Process in a CNN

  • Starts with an input image
  • Applies many different filters to it to create a feature map
  • Applies a ReLU function to increase non-linearity
  • Applies a pooling layer to each feature map
  • Flattens the pooled images into a long vector.
  • Inputs the vector into a fully connected artificial neural network.
  • Processes the features through the network. The final fully connected layer provides the “voting” of the classes that we’re after.
  • Trains through forward propagation and backpropagation for many epochs. This repeats until we have a well-defined neural network with trained weights and feature detectors.
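To make the last step concrete, here is a minimal training sketch that continues from the Keras model above. It assumes you already have labeled training images in x_train and y_train; those names are placeholders, not real data.

```python
# A minimal training sketch (assumes the `model` built earlier and a labeled
# image dataset in `x_train` / `y_train` — both names are placeholders).
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Each epoch runs forward propagation, measures the loss, and backpropagates
# the error to update the weights (feature detectors) of every layer.
model.fit(x_train, y_train, epochs=10, batch_size=32)
```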

In other words, at the very beginning of this process, an input image is broken down into pixels. For a black and white image, those pixels are interpreted as a 2D array (for example, 2 x 2 pixels). Every pixel has a value between 0 and 255. (Zero is completely black and 255 is completely white; the grayscale values lie between those numbers.) Based on that information, the computer can begin to work on the data.

Grayscale Image Representation as an Array

For a color image, this is a 3D array with a blue layer, a green layer, and a red layer. Each one of those colors has its own value between 0 and 255. The color can be found by combining the values in each of the three layers.

RGB Image Representation as an Array
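Here is a small sketch of what those arrays look like in practice, using NumPy and Pillow. The file name cat.png is just a placeholder for whatever image you load.

```python
# Load an image as a grayscale (2D) and an RGB (3D) array.
# "cat.png" is a placeholder file name.
import numpy as np
from PIL import Image

gray = np.array(Image.open("cat.png").convert("L"))    # shape: (height, width)
rgb  = np.array(Image.open("cat.png").convert("RGB"))  # shape: (height, width, 3)

print(gray.shape, gray.min(), gray.max())  # values fall between 0 and 255
print(rgb.shape)                           # one layer each for red, green, blue
```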

Basic Building Blocks in a CNN

Convolution

The main purpose of the convolution step is to extract features from the input image. The convolutional layer is always the first layer in a CNN.

You have an input image, a feature detector (filter), and a feature map. You take the filter and apply it to the input image, pixel block by pixel block. You do this through element-wise matrix multiplication.

Let’s say you have a flashlight and a sheet of bubble wrap. Your flashlight shines on a 5-bubble x 5-bubble area. To look at the entire sheet, you would slide your flashlight across each 5 x 5 square until you’d seen all the bubbles.

The light from the flashlight here is your filter and the region you are sliding over is the receptive field. The light sliding across the receptive fields is your flashlight convolving. Your filter is an array of numbers (also called weights). The distance the light from your flashlight moves between positions is called the stride. For example, a stride of one means that you’re moving your filter over one pixel at a time; a stride of two is also a common choice.

The depth of the filter has to be the same as the depth of the input, so if we were looking at a color image, the depth would be 3. That makes the dimensions of this filter 5 x 5 x 3. In each position, the values in the filter are multiplied element-wise with the original pixel values, and the products are summed up, creating a single number that represents that position (starting at the top-left corner). Now you move your filter to the next position and repeat the process all around the bubble wrap. The array you end up with is called a feature map or an activation map. You can use more than one filter, which will do a better job of preserving spatial relationships.
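A bare-bones version of that multiply-and-sum loop might look like the sketch below (strictly speaking, this is cross-correlation, which is what most CNN libraries actually compute). The image values and the edge-detecting kernel are made up purely for illustration.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over a single-channel image and build a feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return feature_map

image = np.random.randint(0, 256, (7, 7))   # a made-up 7 x 7 grayscale patch
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])             # a simple vertical-edge detector
print(convolve2d(image, kernel).shape)      # (5, 5) feature map with stride 1
```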

Visualization of High-Level Filters

You’ll specify parameters like the number of filters, the filter size, the architecture of the network, and so on. The CNN learns the values of the filters on its own during the training process. You have a lot of options that you can work with to make the best image classifier possible for your task. You can choose to pad the input matrix with zeros (zero padding) before applying the filter to control the size of the feature maps. Adding zero padding is called wide convolution; not adding zero padding is called narrow convolution.
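The effect of padding on the feature map size follows a simple formula. Here is a small sketch of it, assuming square inputs and filters; the sizes 32 and 5 are just example numbers.

```python
# Feature map size = (n + 2p - f) // s + 1, where n = input size,
# f = filter size, p = zero padding, s = stride.
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

print(conv_output_size(32, 5, p=0))  # narrow convolution: 28 (the map shrinks)
print(conv_output_size(32, 5, p=2))  # zero padding of 2: 32 (the size is preserved)
```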

This is basically how we detect images. We don’t look at every pixel of an image. We see features like a hat, a red dress, a tattoo, and so on. There’s so much information coming into our eyes at all times that we couldn’t possibly deal with every single pixel of it. We’re allowing our model to do the same thing.

The result of this is the convolved feature map. It’s smaller than the original input image, which makes it easier and faster to deal with. Do we lose information? Some, yes. But at the same time, the purpose of the feature detector is to detect features, which is exactly what this does.

We create many feature maps to get our first convolutional layer. This allows us to identify many different features that the program can use to learn.

Feature detectors can be set up with different values to get different results. For example, one filter can sharpen and focus an image, while another can blur it, giving equal importance to all the values. You can do edge enhancement, edge detection, and more. You would do that by applying different feature detectors to create different feature maps. The computer is able to determine which filters make the most sense and apply them.

The primary purpose here is to find features in the input image, put them into a feature map, and still preserve the spatial relationship between pixels. That’s important so that the pixels don’t get all jumbled up.

ReLU Layer

The ReLU (Rectified Linear Unit) layer is another step after our convolution layer. You’re applying an activation function to your feature maps to increase non-linearity in the network, because images themselves are highly non-linear. ReLU removes negative values from an activation map by setting them to zero.

Convolution is a linear operation, built from element-wise matrix multiplication and addition, while the real-world data we want our CNN to learn is non-linear. We can account for that with an operation like ReLU. You can use other activation functions like tanh or sigmoid, but ReLU is a popular choice because it trains the network faster without any major penalty to generalization accuracy.

The graph of ReLU Activation Function
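In code, ReLU is just an element-wise maximum with zero. Here is a tiny sketch on a made-up 2 x 2 feature map.

```python
import numpy as np

feature_map = np.array([[ 4.0, -2.0],
                        [-1.5,  3.0]])    # made-up activations

relu_output = np.maximum(0, feature_map)  # negatives become 0, positives pass through
print(relu_output)  # [[4. 0.]
                    #  [0. 3.]]
```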

Pooling

The last thing you want is for your network to look for one specific feature in an exact shade in an exact location. That’s useless for a good CNN. You want images that are flipped, rotated, squashed, and so on. You want lots of pictures of the same thing so that your network can recognize an object (say, a cat) in all the images, no matter what the size or location, no matter what the lighting or the number of spots, or whether that cat is fast asleep or chasing prey. You want spatial invariance. You want flexibility. That’s what pooling is all about.

Pooling progressively reduces the size of the input representation. It makes it possible to detect objects in an image no matter where they’re located. Pooling helps to reduce the number of required parameters and the amount of computation required. It also helps control overfitting.

Overfitting is kind of like memorizing super specific details before a test without understanding the information. When you memorize details, you can do a great job with your flashcards at home. You’ll fail a real test, though, if you’re presented with new information.

Pooling with Different Kernel Sizes and Different Strides
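Max pooling, the most common flavor, simply keeps the strongest activation in each window. Here is a sketch with a 2 x 2 window and a stride of 2 on a made-up 4 x 4 feature map.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum value in each (size x size) window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [1, 2, 8, 3],
               [4, 0, 1, 7]])
print(max_pool(fm))  # [[6. 4.]
                     #  [4. 8.]] — half the size, strongest features kept
```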

Fully Connected

At this step, we add an artificial neural network (ANN) to our convolutional neural network.

The main purpose of the ANN is to combine our features into more attributes that predict the classes with greater accuracy.

How do the output neurons work when there’s more than one?

First, we have to understand what weights to apply to the synapses that connect to the output. We want to know which of the previous neurons are important for the output.

If, for example, you have two output classes, one for a cat and one for a dog, a neuron that reads “0” is absolutely uncertain that the feature belongs to a cat, and a neuron that reads “1” is absolutely certain that the feature belongs to a cat. In the final fully connected layer, the neurons will read values between 0 and 1, signifying different levels of certainty. A value of 0.9 would signify a certainty of 90%.

The cat neurons that are certain when a feature is identified know that the image is a cat. They say the mathematical equivalent of, “These are my neurons! I should be triggered!” If this happens many times, the network learns that when certain features fire up, the image is a cat.

Through lots of iterations, the cat neuron learns that when certain features fire up, the image is a cat, and the dog neuron learns that when certain other features fire up, the image is a dog. The dog neuron learns, for example, that the “big wet nose” neuron and the “floppy ear” neuron contribute with a great deal of certainty to the dog class, so it gives greater weight to those neurons and learns to more or less ignore the “whiskers” neuron and the “cat-iris” neuron. The cat neuron, in turn, learns to give greater weight to neurons like “whiskers” and “cat-iris.”

Once the network has been trained, you can pass in an image and the neural network will be able to determine the image class probability for that image with a great deal of certainty.

The fully connected layer is a traditional Multi-Layer Perceptron. It uses a classifier in the output layer. The classifier is usually a softmax activation function. Fully connected means every neuron in the previous layer connects to every neuron in the next layer. What’s the purpose of this layer? To use the features from the output of the previous layer to classify the input image based on the training data.

Once your network is up and running, you can see, for example, that there is a 95% probability that your image is a dog and a 5% probability that your image is a cat. Why do these numbers add up to 1.0 (0.95 + 0.05)?

There isn’t anything that says that these two outputs are connected to each other. What is it that makes them relate to each other? Essentially, they wouldn’t, but they do when we introduce the Softmax function. This brings the values between 0 and 1 and makes them add up to 1 (100%). The Softmax function takes a vector of scores and squashes it to a vector of values between 0 and 1 that add up to 1.
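Here is a minimal sketch of Softmax; the raw scores are made-up numbers chosen so the output roughly matches the 95%/5% example above.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, -1.0])  # made-up raw scores for "dog" and "cat"
print(softmax(scores))          # ~[0.95, 0.05] — between 0 and 1, summing to 1
```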

After you apply a Softmax function, you can apply the loss function. Cross entropy often goes hand in hand with Softmax. We want to minimize the loss function so we can maximize the performance of our network.

At the beginning of backpropagation, your output values would be tiny, so the gradient would be very low and it would be hard for the neural network to start adjusting in the right direction. That’s why you might choose cross-entropy loss: it helps the network register even a tiny error and get to the optimal state faster.
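For a single example, cross-entropy simply takes the negative log of the probability the network assigned to the true class, so confident wrong answers are punished hard. A tiny sketch, with made-up probabilities:

```python
import numpy as np

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -np.log(predicted_probs[true_class])

# True class is "dog" (index 0) in both cases; the probabilities are made up.
print(cross_entropy(np.array([0.95, 0.05]), true_class=0))  # ~0.05 — good prediction
print(cross_entropy(np.array([0.10, 0.90]), true_class=0))  # ~2.30 — bad prediction
```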

Classification Result using CNN

Conclusion

In this article, I covered what a CNN is, how it works, and what happens in each of its layers. CNNs make deep learning better. Feel free to contact me if you have any questions. Cheers!

About Author:

This article is written by Han Sheng, Technical Lead at Arkmind, Malaysia. He has a passion for software design and architecture, computer vision, and edge devices. He has built several AI-based web and mobile applications to help clients solve real-world problems. Feel free to read more about him via his Github profile.
