Introduction to CNN

Learn the basics of the Convolutional Neural Network in 10 minutes.

Akshat Anand
DataX Journal
10 min read · Jul 21, 2020


Photo by Clarisse Croset on Unsplash

Image classification has long been a challenging task for AI developers across many fields, and the Convolutional Neural Network has emerged as the leading data-driven approach to this challenge. This post gives an overview of how images are represented and of the layers that make up a convolutional neural network.

Image Classification

For image classification, we need a neural network that takes an image as input and predicts the class of that image. For example, consider feeding a single image to the network and asking it to predict that image's class.

It looks simple: the network predicts the class of the image. But the question is, how?

Convolutional Neural Network

To approach image classification, we will use a CNN (Convolutional Neural Network), which can find and define spatial patterns in 3D image space. What sets a CNN apart from other neural networks is that it looks at groups of pixels in an image and finds spatial features, whereas many other networks treat each pixel as an individual, unrelated input.

A CNN is made up of multiple layers, but there are three main types: convolutional, pooling, and fully-connected layers.

Suppose we have four different convolutional kernels; they will produce four different filtered images as output. The idea is that each filter extracts a different feature from the input image, and these features will eventually help to classify that image: for example, one filter might detect the edges of objects in that image and another might detect unique patterns in color. These filters, stacked together, are what make up a convolutional layer.

High-pass Filter

Let’s have a closer look at one type of convolutional kernel: a high-pass image filter. High-pass filters are meant to detect abrupt changes in intensity over a small area. So, in a small patch of pixels, a high-pass filter will highlight areas that change from dark to light pixels (and vice versa). We will be looking at patterns of intensity and pixel values in a grayscale image.

For example, if we put an image of a car through a high-pass filter, we expect the edges of the car, where the pixel values change abruptly from light to dark, to be detected. The edges of objects are often areas of abrupt intensity change and, for this reason, high-pass filters are sometimes called “edge detection filters.”

Convolutional Kernels

The filters we will be talking about are in the form of matrices, so-called convolutional kernels, which are just grids of numbers that modify an image. Below is an example of a high-pass filter: a 3x3 kernel that performs edge detection.

You may notice that all the elements in this 3x3 grid sum to zero. For an edge detection filter, all of its elements must sum to zero because it’s computing the difference or change between neighboring pixels. If these pixel weights did not add up to zero, then the calculated difference would be either positively or negatively weighted, which has the effect of brightening or darkening the entire filtered image, respectively.
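As a concrete sketch (the exact values in the article's figure may differ), here is one common 3x3 high-pass kernel written out in NumPy, along with a check that its elements sum to zero:

```python
import numpy as np

# One common 3x3 high-pass (edge detection) kernel; other value choices
# are possible, but the zero-sum property described above always holds.
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]])

print(kernel.sum())  # 0 -- the weights cancel, so flat regions map to 0
```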

Convolution

To apply this filter to an image, an input image, F(x,y), is convolved with the kernel, K. Convolution is represented by an asterisk (not to be mistaken for multiplication). It involves taking a kernel, which is our small grid of numbers, and passing it over an image, pixel-by-pixel, creating another edge-detected output image whose appearance depends on the kernel values.

We’ll walk through a specific example, using the 3x3 edge detection filter. To better see the pixel-level operations, I’ll zoom in on the image of the car.

For every pixel in this grayscale image, we place our kernel over it, so that the selected pixel is at the center of the kernel, and perform convolution. In the below image, I’m selecting a center pixel with a value of 200, as an example.

The steps for a complete convolution are as follows:

  1. Multiply the values in the kernel with their corresponding pixel value. So, the value in the top left of the 3x3 kernel (0), will be multiplied by the pixel value in that same corner in our image area (150).
  2. Sum all these multiplied pairs of values to get a new value, in this case, 175. This value will be the new pixel value in the filtered output image, at the same (x,y) location as the selected center pixel.

This process repeats for every pixel in the input image, until we are left with a complete, filtered output.
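To make the procedure concrete, here is a minimal NumPy sketch of these two steps, sliding a 3x3 kernel over a tiny, made-up grid of pixel values (not the car image from the example above):

```python
import numpy as np

def convolve(image, kernel):
    """Apply a kernel to every interior pixel of a grayscale image.
    Border pixels are skipped here; padding is discussed in the next section."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)   # step 1: multiply pairs; step 2: sum
    return out

# A tiny illustrative patch of pixel values (purely hypothetical)
image = np.array([[150, 200, 100],
                  [150, 200, 100],
                  [150, 200, 100]], dtype=float)

kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]], dtype=float)

print(convolve(image, kernel))  # the filtered output value for the center pixel
```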

Image Borders

The only pixels for which convolution doesn’t work are the pixels at the borders of an image: a 3x3 kernel cannot be perfectly centered over a pixel at the edges or corners. In practice, there are a few ways of dealing with this. The most common are to either pad the image with a border of 0’s (or another grayscale value), so that a kernel can be perfectly overlaid on every pixel of the original image, or to ignore the pixel values at the borders and only look at pixels where the 3x3 convolutional kernel can be completely overlaid.

Often, there is not a lot of useful information at the border of an image, but if you do choose to ignore this information, every filtered image will be just a little bit smaller than the original input image. For a 3x3 kernel, we’ll lose a pixel from each side of an image, resulting in a filtered output that is two pixels smaller in width and height than the original image. You can also choose to make larger filters. 3x3 is one of the smallest sizes and it’s good for looking at small pixel areas, but if you are analyzing larger images you may want to increase the area and use kernels that are 5x5, 7x7, or larger. Usually, you want an odd number so that the kernel nicely overlays a center pixel.
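In PyTorch, both border strategies correspond to the padding argument of the convolution. A quick sketch, using a random image tensor purely for illustration:

```python
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 6, 6)           # a random 6x6 "grayscale image"
kernel = torch.tensor([[[[ 0., -1.,  0.],
                         [-1.,  4., -1.],
                         [ 0., -1.,  0.]]]])

# Zero-pad the border so the output keeps the 6x6 size...
same = F.conv2d(image, kernel, padding=1)
# ...or ignore the border pixels and accept a smaller 4x4 output.
valid = F.conv2d(image, kernel, padding=0)

print(same.shape, valid.shape)   # torch.Size([1, 1, 6, 6]) torch.Size([1, 1, 4, 4])
```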

Above is another type of edge detection filter; this filter is computing the difference around a center pixel but it’s only looking at the bottom row and top row of surrounding pixel values. The result is a horizontal edge detector. In this way, you can create oriented edge detectors!
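As an illustration of such an oriented kernel (the exact values in the figure may differ), a kernel of the following general form compares the rows above and below the center pixel and therefore responds to horizontal edges; transposing it gives a vertical edge detector instead:

```python
import numpy as np

# A horizontal edge detector: the top row and bottom row have opposite signs,
# so the filter responds where intensity changes from top to bottom.
horizontal_edge = np.array([[-1, -2, -1],
                            [ 0,  0,  0],
                            [ 1,  2,  1]])

vertical_edge = horizontal_edge.T   # the transpose detects vertical edges
```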

So, these convolutional kernels, when applied to an image, make a filtered image, and several filtered images make a convolutional layer.

Activation Function

Recall that grayscale images have pixel values that fall in a range from 0–255. However, neural networks work best with scaled “strength” values between 0 and 1 (we briefly mentioned this in the last post). So, in practice, the input image to a CNN is a grayscale image with pixel values between 0 (black) and 1 (white); a light grey may be a value like 0.78. Converting an image from a pixel value range of 0–255 to a range of 0–1 is called normalization. Then, this normalized input image is filtered and a convolutional layer is created. Every pixel value in a filtered image, created by a convolution operation, will fall in a different range; there may even be negative pixel values.
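Normalization itself is just a division by 255. A minimal sketch, using a few hypothetical pixel values:

```python
import numpy as np

# Scale 0-255 grayscale pixel values into the 0-1 range before feeding the CNN.
pixels = np.array([0, 64, 200, 255], dtype=np.float32)
normalized = pixels / 255.0
print(normalized)   # approximately [0.   0.25 0.78 1.  ]
```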

To account for this change in the range of pixel values after convolution, a CNN applies an activation function, following the convolutional layer, that transforms each pixel value.

In a CNN, you’ll often use a ReLU (Rectified Linear Unit) activation function; this function simply turns all negative pixel values into 0’s (black). For an input, x, the ReLU function returns x for all values of x > 0, and returns 0 for all values of x ≤ 0. An activation function also introduces nonlinearity into the model, which means the CNN will be able to find nonlinear thresholds/boundaries that effectively separate and classify the training data.
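A quick PyTorch sketch of ReLU acting on a few hypothetical filtered pixel values:

```python
import torch
import torch.nn.functional as F

# Filtered pixel values can be negative after convolution...
filtered = torch.tensor([-0.5, 0.0, 0.3, 1.2])

# ...ReLU zeroes out the negatives and keeps positive values unchanged.
print(F.relu(filtered))   # tensor([0.0000, 0.0000, 0.3000, 1.2000])
```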

Max pooling layer

After a convolutional layer comes a pooling layer; the most common type is a max-pooling layer. Each of these layers looks at the pixel values in an image, so, to describe max-pooling, I’ll focus on a small pixel area. First, a max-pooling operation breaks an image into smaller patches, often 2x2 pixel areas.

Zooming in even further, let’s look at four of these patches.

A max-pooling layer is defined by the patch size, 2x2, and stride. The patch can be thought of as a 2x2 window that the max-pooling layer looks at to select a maximum pixel value. It then moves this window by some stride across and down the image. For a patch of size 2x2 and a stride of 2, this window will perfectly cover the image. A smaller stride would see some overlap in patches and a larger stride would miss some pixels entirely. So, we usually see a patch size and a stride size that are the same.

For each 2x2 patch, a max-pooling layer looks at each value in a patch and selects only the maximum value. In the red patch, it selects 140, in the yellow, 90, and so on, until we are left with four values from the four patches.
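Here is a small PyTorch sketch of 2x2 max-pooling with a stride of 2, applied to a hypothetical 4x4 patch of pixel values (the figure’s exact values are not reproduced here):

```python
import torch
import torch.nn as nn

# A hypothetical 4x4 area of pixel values
x = torch.tensor([[[[100., 140.,  90.,  60.],
                    [ 50., 120.,  40.,  70.],
                    [ 30.,  25.,  80.,  10.],
                    [ 20.,  15.,   5.,  35.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 patches, stride of 2
print(pool(x))
# tensor([[[[140.,  90.],
#           [ 30.,  80.]]]]) -- the maximum of each 2x2 patch
```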

Now, we might be wondering why we would use a max-pooling layer in the first place, especially since this layer is discarding pixel information. We use a max-pooling layer for a few reasons.

First, dimensionality reduction; as an input image moves forward through a CNN, we are taking a fairly flat image in x-y space and expanding its depth dimension while decreasing its height and width. The network distills information about the content of an image and squishes it into a representation that will make up a reasonable number of inputs that can be seen by a fully-connected layer.

Second, max-pooling makes a network resistant to small pixel value changes in an input image. Imagine that some of the pixel values in a small patch are a little bit brighter or darker or that an object has moved to the right by a few pixels. For similar images, even if a patch has some slightly different pixel values, the maximum values extracted in successive pooling layers, should be similar.

Third, by reducing the width and height of image data as it moves forward through the CNN, the max-pooling layer mimics an increase in the field of view for later layers. For example, a 3x3 kernel placed over an original input image will see a 3x3 pixel area at once, but that same kernel, applied to a pooled version of the original input image (e.g., an image reduced in width and height by a factor of 2), will see the same number of pixels, but that 3x3 area corresponds to a 2x larger area in the original input image. This allows later convolutional layers to detect features in a larger region of the input image.

Fully-connected Layer

At the end of a convolutional neural network is a fully-connected layer (sometimes more than one). Fully-connected means that every output produced at the end of the last pooling layer is an input to each node in this fully-connected layer. For example, for a final pooling layer that produces a stack of outputs that are 20 pixels in height and width and 10 in depth (the number of filtered images), the fully-connected layer will see 20x20x10 = 4000 inputs. The role of the last fully-connected layer is to produce a list of class scores and perform classification based on the image features that have been extracted by the earlier convolutional and pooling layers; so, the last fully-connected layer will have as many nodes as there are classes.
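A short PyTorch sketch of this step, using the 20x20x10 example above and an assumed five output classes:

```python
import torch
import torch.nn as nn

# Suppose the final pooling layer outputs 10 filtered maps of size 20x20,
# and we are classifying among 5 hypothetical classes.
pooled = torch.rand(1, 10, 20, 20)            # (batch, depth, height, width)

flattened = pooled.view(pooled.size(0), -1)   # 1 x 4000 inputs
fc = nn.Linear(in_features=20 * 20 * 10, out_features=5)

class_scores = fc(flattened)
print(class_scores.shape)                     # torch.Size([1, 5])
```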

Summary

CNNs are made up of many layers: a series of convolutional layers plus activation and max-pooling layers, and at least one final fully-connected layer that produces a set of class scores for a given image. The convolutional layers of a CNN act as feature extractors; they extract shape and color patterns from the pixel values of training images. It’s important to note that the behavior of the convolutional layers, and the features they learn to extract, are defined entirely by the weights that make up the convolutional kernels in the network. A CNN learns to find the best weights during training using a process called backpropagation, which looks at any classification errors the CNN makes during training, finds which weights in the CNN are responsible for those errors, and changes those weights accordingly.
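Putting the pieces together, here is a minimal sketch of the layer pattern described in this post, assuming hypothetical 1x28x28 grayscale inputs and 10 classes (not the exact architecture from my tutorial repository):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Convolution + ReLU + max-pooling, repeated, then a fully-connected
    classifier that outputs one score per class."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 16 x 14 x 14
        x = self.pool(F.relu(self.conv2(x)))   # -> 32 x 7 x 7
        x = x.view(x.size(0), -1)              # flatten for the fully-connected layer
        return self.fc(x)                      # class scores

model = SimpleCNN()
scores = model(torch.rand(1, 1, 28, 28))
print(scores.shape)   # torch.Size([1, 10])
```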

PyTorch Code Example

If you’d like to see how to create these kinds of CNN layers using PyTorch, take a look at my GitHub tutorial repository. You may choose to skim the code and look at the output, or set up a local environment and run the code on your computer (instructions for setting up a local environment are documented in the repository readme).

If you liked the tutorial, do clap and share it with your friends. Follow me on LinkedIn and GitHub as well.
