Image Classification with Convolutional Neural Networks

Henok Tilaye and Haylemicheal Berihun
Published in Analytics Vidhya · Apr 22, 2019

Part 1

Introduction

Before we get into image classification, neural networks, and convolutional neural networks, let us first get familiar with a few basic concepts and terms. In this part we will first see what an image is. Next, we will define computer vision and the subfields it contains, and we will clearly define what is meant by image classification. Finally, we will finish this part by explaining the basic intuition behind neural networks. So let's get started.

What is an Image?

To us humans, images are just snapshots of the world we live in. They represent what a particular framed region looked like at a particular point in time. And if we look closely at an image, we can see that it is just a blend of different colors, arranged in a specific order, that represents the things we perceive. But this is not what images are to computers.

Computers see digital images as numbers. To understand this, let us see how an image is represented in a computer. There are many different representations; of these, we'll use the RGB representation. The basic building block of any image is the pixel. Think of pixels as small (the smallest possible) boxes, each representing a particular rectangular part of an image. An image with a resolution of 32 × 32 would be represented as a grid of pixels with 32 rows and 32 columns.

In the RGB representation of an image, each pixel holds three color intensity values (Red, Green, and Blue). These intensity values range from 0 to 255, with 0 being the darkest (black) and 255 being the brightest (white) for each color. A single pixel would look like one of the following:

  • Black = (0, 0, 0)
  • White = (255, 255, 255)
  • Red = (255, 0, 0)
  • Green = (0, 255, 0)
  • Blue = (0, 0, 255)
  • Yellow = (255, 255, 0)

The above is true for colored images. The same idea applies to gray-scale images as well, except that instead of three color intensities, a pixel is represented by a single gray-scale intensity.

  • Black = (0)
  • Dark areas = (closer to 0)
  • Lighter areas = (closer to 255)
  • White = (255)

For the later parts of this tutorial, it helps to think of an image as a 3-dimensional box. The height of the box is the number of rows, the width is the number of columns, and the depth is the 3 color channels (RGB); in other words, (rows, columns, channels).
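As a minimal sketch of this representation (using NumPy, whose arrays are one common way to store images; the random image is just illustrative):

```python
import numpy as np

# A 32 x 32 RGB image as a 3-dimensional array: (rows, columns, channels).
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(image.shape)   # (32, 32, 3)
print(image[0, 0])   # the (R, G, B) intensities of the top-left pixel
```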

This is all you will need to know about an image for now. Now let’s see what computer vision is all about.

What is Computer Vision?

Computer Vision, often abbreviated as CV, is defined as a field of study that seeks to develop techniques to help computers “see” and understand the content and context of digital images such as photographs and videos. Its applications are countless.

To mention a few:

  • it can be used to generate a sentence describing what's going on in an image,
  • it is used in self-driving cars to identify objects and help with navigation, and
  • it powers facial recognition, object tracking, image classification tasks and the like.

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information.

Image processing

Image processing is a subcategory of digital signal processing; it is the process of creating new images by enhancing or editing the content of an image in some way.

Examples of image processing:

  • Image transformations such as translation, resizing, cropping, rotation and flipping
  • Normalizing photo-metric properties of the image, such as brightness or color
  • Blurring and Sharpening of an image.

Computer Vision requires image processing (preprocessing) of its raw input.
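As a minimal preprocessing sketch (using torchvision, which we will also use in Part 4; the file name "photo.jpg" and the normalization statistics are illustrative placeholders):

```python
from PIL import Image
from torchvision import transforms

# Resize, flip, convert to a tensor, and normalize photometric properties.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # resizing
    transforms.RandomHorizontalFlip(),                 # flipping
    transforms.ToTensor(),                             # (channels, rows, cols) in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # per-channel normalization
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("photo.jpg")
tensor = preprocess(img)
```

Next, let's see what neural networks are all about.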

Neural Networks

We assume that you know how artificial neural networks work and have some experience working with them. This section simply brushes up on the high-level intuition behind the concept.

Neural networks are long-established machine learning algorithms that help a machine learn patterns in data. The basic intuition is that, given inputs and outputs, they can help the machine learn the transformation function that produces those output values from the inputs.

This function can then be applied to future values it has not seen before to predict what the outputs would be. Convolutional neural networks are a type of neural network that has gained much success in recent years.

In this part we have seen what an image is and what computer vision is. We saw its applications, and we saw that neural networks are one of the tools we use for computer vision projects.

With this, let us dive into convolutional neural networks.

Part 2

Convolutional Neural Networks

Convolutional Neural Networks, often abbreviated as CNNs, are a powerful artificial neural network technique. These networks achieve state-of-the-art results in a variety of application areas, including

  • voice-user interfaces,
  • natural language processing, and
  • computer vision.

In this section, we will discover CNN for image classification.

Take the grayscale MNIST dataset of handwritten digits, where each image is 28 × 28, yielding 28 × 28 × 1 = 784 inputs to our network. A traditional feed-forward neural network would require 784 weights for each neuron in its first hidden layer.

This is manageable, but consider if we were using 250 × 250 pixel images with red, green and blue channels. The total number of inputs would jump to 250 × 250 × 3 = 187,500, and it would be difficult for standard neural networks to perform well at this scale.

However, by applying convolutions, filters, nonlinear activation functions, pooling, and backpropagation, CNNs are able to learn filters that detect edges and blob-like structures in the lower-level layers of the network.

The network can then use these edges and structures as “building blocks”, eventually detecting high-level objects (e.g., faces, cats, dogs, cups, etc.) in its deeper layers.

Convolution is a mathematical operation that is an element-wise multiplication of two matrices followed by a sum.
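For example, convolving the 2 × 2 image patch

1 2
3 4

with the 2 × 2 kernel

0 1
1 0

gives (1 × 0) + (2 × 1) + (3 × 1) + (4 × 0) = 5.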

Building Blocks of CNN:

  • Convolutional Layer
  • Activation Functions
  • Pooling Layer
  • Fully-Connected Layer

The following illustrates the flow of our process:

INPUT => CONV => RELU => POOL => CONV => RELU => POOL => … => FC

1 — Convolutional Layer

The convolutional layer (or conv layer) is the core building block of a convolutional neural network. A conv layer applies a series of different image filters, also known as convolutional kernels, to an input image. The resulting filtered images have different appearances: the filters may have extracted features, such as the edges of objects or their colors, that distinguish different classes of images.

An image is a multidimensional matrix. It has a width (number of columns), a height (number of rows) and a depth (number of channels). We can think of the image as a big matrix and a kernel (filter) as a tiny matrix. The kernel starts at the top-left of the image and slides from left to right and top to bottom, pixel by pixel, applying a mathematical operation (convolution) at each (x, y) coordinate of the image.

  • For the inputs to the CNN, the depth is the number of channels in the image(3 for RGB images).
  • For the output of the convolutional layer, the number of channels is the number of kernels applied to the layer.
  • After applying k filters to the input volume, we now have k 2D feature maps.

The distance that the filter is moved across the input from the previous layer is referred to as the stride. If the size of the previous layer is not cleanly divisible by the size of the filter’s receptive field and the size of the stride, then it is possible for the receptive field to attempt to read off the edge of the input feature map. In this case, techniques like zero padding can be used to invent mock inputs with zero values for the receptive field to read.

Putting all these parameters together, we can compute the size of an output volume as a function of the input volume size:

Output = ((W − F + 2P)/S) + 1

where

W is the shape of image (assuming the input images are square),

F is the filter size,

S is the stride, and

P is the amount of zero-padding
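For example, convolving a 32 × 32 input (W = 32) with a 3 × 3 filter (F = 3), a stride of 1 (S = 1) and zero-padding of 1 (P = 1) gives an output of ((32 − 3 + 2 × 1)/1) + 1 = 32, so the spatial size is preserved.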

[Figure: splitting the original image into its red, green, and blue channels]

[Figure: the first 15 × 15 pixel values of the image, shown for the red, green, and blue channels]

Some examples of kernels (filters):

  • 3 × 3 Gaussian blur filter
  • 3 × 3 average blur filter
  • 3 × 3 edge detection filter

As an example, consider convolving a 5 × 5 image with an edge detection filter, using a stride of 1.
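Here is a minimal PyTorch sketch of that convolution (the 5 × 5 image and the particular 3 × 3 edge detection kernel are illustrative choices):

```python
import torch
import torch.nn.functional as F

# A 5 x 5 grayscale image with a vertical edge, shaped (batch, channels, H, W).
image = torch.tensor([[1., 1., 1., 0., 0.],
                      [1., 1., 1., 0., 0.],
                      [1., 1., 1., 0., 0.],
                      [1., 1., 1., 0., 0.],
                      [1., 1., 1., 0., 0.]]).view(1, 1, 5, 5)

# A common 3 x 3 edge detection kernel, shaped (out_channels, in_channels, F, F).
kernel = torch.tensor([[-1., -1., -1.],
                       [-1.,  8., -1.],
                       [-1., -1., -1.]]).view(1, 1, 3, 3)

# Stride 1, no padding: output size is ((5 - 3 + 0)/1) + 1 = 3.
output = F.conv2d(image, kernel, stride=1, padding=0)
print(output.squeeze())  # nonzero responses appear only around the vertical edge
```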

2 — Activation

After a convolutional layer, we apply a nonlinear activation function such as ReLU. ReLU is the abbreviation of rectified linear unit; it applies the non-saturating activation function f(x) = max(0, x). It effectively removes negative values from an activation map by setting them to zero. This increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolutional layer.
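In PyTorch this is a one-liner; for instance:

```python
import torch

activations = torch.tensor([[-2.0,  3.0],
                            [ 1.5, -0.5]])

# ReLU sets every negative value to zero and leaves the rest unchanged.
print(torch.relu(activations))  # tensor([[0.0, 3.0], [1.5, 0.0]])
```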

3 — Pooling

A complicated dataset with many different object categories requires a large number of filters, each responsible for finding a pattern in the image. More filters mean a bigger stack, which means the dimensionality of the convolutional layer's output becomes quite large. Higher dimensionality can lead to overfitting.

There are two methods to down-sample this dimensionality:

a. a convolutional layer with a stride > 1

b. pooling layers

POOL layers operate on each of the depth slices of an input independently using either the max or average function.

Max pooling takes a stack of feature maps as input and has a window size and a stride. The window starts at the top-left corner of each feature map and strides to the right and the bottom. The value of the corresponding node in the max pooling layer is calculated by taking the maximum of the pixels contained in the window.

For example, consider max pooling with a 2 × 2 window size and a stride of 2.
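A minimal PyTorch sketch of this operation (the 4 × 4 feature map is illustrative):

```python
import torch
import torch.nn as nn

# Max pooling with a 2 x 2 window and a stride of 2 halves each spatial dimension.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

feature_map = torch.tensor([[1., 3., 2., 4.],
                            [5., 6., 7., 8.],
                            [3., 2., 1., 0.],
                            [1., 2., 3., 4.]]).view(1, 1, 4, 4)

print(pool(feature_map).squeeze())
# tensor([[6., 8.],
#         [3., 4.]])
```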

4 — The fully connected layers

The definition on Wikipedia states:

Fully connected layers connect every neuron in one layer to every neuron in another layer. It is in principle the same as the traditional multi-layer perceptron neural network (MLP). The flattened matrix goes through a fully connected layer to classify the images.

The fully connected (FC) layers take as input the flattened output of the last convolution layer, perform computations in their hidden layers, and produce class probabilities in the output layer. This output must be flattened because, as we have seen, the convolution operation takes an input image of dimension (H, W, D), where 'H' and 'W' are the height and width and 'D' is the depth.

In the case of RGB images, this depth component is 3, one for each color, and the output of the convolution layers has the form (Hc, Wc, number of feature maps in the last layer). If you've seen an artificial neural network before, you know it takes a vector with a dimension like (n, 1). So when we give the output of the convolution layers to our FC layer, we have to flatten it into the form (Hc × Wc × number of feature maps in the last layer, 1).

Then, using this input, the FC layer learns patterns to classify the data into classes and provides us with class probabilities in the output layer.

Let’s break this down into 3 steps

  1. First, as we saw above, the output of the convolution part of the network gets flattened and is fed to the FC part of our network.
  2. Then, this input is propagated through the network, being operated on by each neuron in the network (we'll see below what these operations are).
  3. Finally, at the output layer of the FC part we have 'n' neurons, where 'n' is the number of classes we are classifying the data into.

Let's shine a light on the operations that each neuron is doing.

Since it is a fully connected network, each neuron in one layer is connected to each neuron in the next layer.

If we look at a single neuron's operation, it computes the following (ignoring the bias term for simplicity):

output = g(W · X) = g(w1·x1 + w2·x2 + … + wn·xn)

where:

X is the vector of inputs,

W is the vector of weights, and

g is the activation function

Each neuron in the network performs the above operation, taking as inputs the outputs of the neurons before it, and keeps passing its outputs on to the next layer until the last layer is reached. This last layer, in our case, is going to be either a sigmoid or a softmax layer (sigmoid for binary classification, softmax for multi-class classification). It outputs the probabilities of each class.
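A minimal sketch of such an FC head in PyTorch (the input size of 32 feature maps of 7 × 7, the hidden layer width, and the 10 classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

fc = nn.Sequential(
    nn.Flatten(),                # (N, 32, 7, 7) -> (N, 32 * 7 * 7)
    nn.Linear(32 * 7 * 7, 128),  # a hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),          # one output neuron per class
    nn.Softmax(dim=1),           # class probabilities
)

conv_output = torch.randn(1, 32, 7, 7)  # stand-in for the last conv layer's output
print(fc(conv_output).shape)            # torch.Size([1, 10])
```

(In practice, PyTorch's CrossEntropyLoss applies the softmax internally, so the explicit Softmax layer is often omitted during training.)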

We’ve been talking about the architecture of our CNN model. We haven’t yet seen how to use it or how we can make it learn from the data. Before we go to the training part, let us recap what we’ve seen so far.

We saw that, basically, a CNN has 4 very important parts.

1 — The convolution layers that convolve our input image with kernels (sometimes called weights) to identify different parts of the image at each layer.

2 — The non-linear functions that introduce non-linearity into the network.

3 — The pooling layers that help us reduce the size of our matrices and identify important features wherever they are in an image.

4 — And finally, the fully connected part of the architecture that performs the classification, taking the high-level features of the last convolution layer's output as its inputs.

In the next part, let’s see how the learning and finally classification happens.

Part 3

Training/Learning

In the learning process of any neural network there are 4 basic steps:

  1. Forward propagation
  2. Loss calculation
  3. Backward propagation
  4. Optimization (weight updates)

1. Forward propagation

First, an image with dimensions (H, W, D) is given to the convolution layer. Then, using filters (kernels) and following the convolution steps described above, we get a new matrix. This is passed through a non-linear activation layer (mostly ReLU), and then through the pooling layer. No learning happens during this step (not yet).

The above 3 steps (convolution, activation, pooling) are repeated for as many convolution, activation and pooling layers as the model has. Then the final output is passed to the fully connected layers (the multilayer perceptron).

Then as we saw above, the output from the previous layer would be propagated through the MLP and finally produce the class probabilities.

For now, all these steps are performing the sequential operations of all the hidden layers of our network on our image and finally producing the class membership probabilities.

One thing to note here: all of the filters in the convolution layers and all of the weights in the MLP are initialized to random values at first.

2. Loss

After the class prediction has been made, we compare it to the ground truth value and see how far off the model is. This is what the loss calculation is all about, i.e. calculating the error.

One very popular loss function is the Mean Squared Error (MSE). After knowing how much error there is in the model's prediction, we move on to the next steps to reduce this error.
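A minimal sketch of a loss calculation in PyTorch, using MSE on illustrative prediction and target vectors:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

predictions = torch.tensor([0.8, 0.1, 0.1])  # model output (illustrative)
targets = torch.tensor([1.0, 0.0, 0.0])      # ground truth (one-hot)

loss = criterion(predictions, targets)
print(loss)  # ((0.2)^2 + (0.1)^2 + (0.1)^2) / 3 = 0.02
```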

3. Backpropagation

This step is where we find out how much each weight and filter value contributed to the loss (error) we got in the previous step. We do that using differentiation, specifically partial derivatives, because we calculate each weight's contribution while everything else is held constant.

As the name suggests, we start at the last layer and go back through the network, calculating the partial derivative of the loss with respect to each weight (the MLP weights and the filter values). This is written as dL/dW, where 'L' and 'W' are the loss and a weight respectively. As we go backward through the network, the chain rule is used intensively, taking the previous layer's partial derivatives to calculate the next ones.

A similar operation is done for the activation layers.

For the pooling layers, it's a bit different. In a max pooling layer, the gradient is back-propagated through the maximum values only, since a slight change in the non-maximum values does not affect the output. In the process, we replace the maximum values before max pooling with 1 and set all the non-maximum values to zero, then use the chain rule to multiply the gradient by this mask.

Unlike the max pooling layer, in an average pooling layer the gradient passes through all of the inputs (before average pooling), maximum and non-maximum alike.

One thing to note here is that the backpropagation operation for the convolution is itself going to be a convolution operation.

After we get the gradients (partial derivatives of the loss with respect to the weights), we go to the next step.

4. Optimization (weight updating)

This step is where we take small steps towards the correct predictions by making small changes to our weight values using the gradients we calculated before. This procedure is called gradient descent. Since we used a squared-error loss, the loss as a function of a single weight (taking only one weight for simplicity) is a bowl-shaped curve with a single minimum.

By updating the weights in the right direction (opposite to the gradient), we take small incremental steps toward the weight values that produce the minimum cost. The length of these steps is known as the learning rate. Choosing a learning rate that is too big makes the updates overshoot the target, making it hard for the model to learn, so we have to be careful when choosing it. To update the weights, we use the following formula:

Wi+1 = Wi − α · dL/dWi

Where:

Wi+1 = the updated weight

Wi = the weight from before

α = the learning rate

dL/dWi = the gradient of the loss function with respect to the weight

These steps are cyclic (iterative). After creating the model's architecture, we feed the training data to the model and the 4 steps are performed repeatedly. Each time the weights are updated, the model's predictions get closer and closer to the true values in the training set. While training, we have to be careful not to overfit: the model should generalize well enough to work on both the training and validation datasets. To make this possible, we can introduce regularization techniques such as dropout into our model.
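Putting the four steps together, here is a minimal sketch of one training iteration in PyTorch (the tiny linear model and the random batch are placeholders, not a real MNIST setup):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # lr is the learning rate (alpha)

images = torch.randn(8, 1, 28, 28)   # a fake batch of 8 grayscale images
labels = torch.randint(0, 10, (8,))  # fake ground-truth digit labels

optimizer.zero_grad()              # clear gradients from the previous iteration
outputs = model(images)            # 1. forward propagation
loss = criterion(outputs, labels)  # 2. loss calculation
loss.backward()                    # 3. backpropagation
optimizer.step()                   # 4. optimization: W <- W - alpha * dL/dW
```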

Part 4

PyTorch implementation of Convolutional neural network

PyTorch is one of many deep learning platforms, and the implementation for this tutorial is done with PyTorch. You can head over to this link to look at the full implementation.
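As a minimal sketch of what such an implementation can look like (the layer sizes below are illustrative assumptions, not the authors' exact architecture), here is a small CNN for 28 × 28 grayscale MNIST digits that follows the INPUT => CONV => RELU => POOL => … => FC flow described above:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                           # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.25),           # regularization, as discussed in Part 3
            nn.Linear(32 * 7 * 7, 10),  # 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
print(model(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```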
