Convolutional Neural Network: Learn And Apply

Sanket Doshi
9 min read · Mar 31, 2019


After the advancements in neural networks, everyone tried to apply them to computer vision. Neural networks achieved great accuracy on computer vision tasks, but they also came with disadvantages, such as computational complexity.

Image recognition using CNN

A deep neural network requires a large number of weights and huge matrix multiplications, which demands a lot of computing power. To tackle this, a new type of algorithm was introduced: the Convolutional Neural Network, or CNN. CNNs turned out to be more efficient and faster than regular deep neural networks for problems related to computer vision.

Here, we’ll learn about CNNs, how they work, and a standard architecture. We’ll also implement a CNN for hand-written digit recognition.

Introduction

A Convolutional Neural Network (ConvNet/CNN) is very similar to an ordinary neural network. It is made up of neurons with learnable weights and biases, and all the tips and tricks used for ordinary neural networks also apply to CNNs.

The main difference is that a CNN assumes its input is an image. This assumption vastly reduces the number of parameters that need to be tuned in the network.

Why ConvNets over normal neural networks?

Feeding image to normal neural networks

An image is nothing but a matrix of pixel values. So if an image is of size 40*40*1, we can flatten it and feed it to the model, which requires 1600 neurons at the input layer. But not all images are black and white; a colored image has 3 channels, one each for red, green, and blue. So a colored image of size 40*40*3 needs 4800 input neurons. This amount still seems manageable, but this fully-connected structure clearly does not scale to larger images: an image of size 1024*1024*3 already needs about 3.1 million input neurons. The neuron count grows with the number of pixels, and with it the number of weights in the first hidden layer, which makes the model computationally expensive and infeasible.
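
This arithmetic is easy to verify in a couple of lines of Python (a throwaway sketch, not part of the model we build later):

def input_neurons(h, w, c):
    # a flattened image needs one input neuron per pixel value
    return h * w * c

print(input_neurons(40, 40, 1))      # 1600
print(input_neurons(40, 40, 3))      # 4800
print(input_neurons(1024, 1024, 3))  # 3145728, i.e. about 3.1 million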

In a CNN, by contrast, we assume the input is an image and exploit this structure to make the model computationally cheaper and faster.

Working of CNN

Suppose the input image is of size 10*10*1, the kernel is of size 3*3*1, and the output is an 8*8*1 image. In a CNN, rather than connecting to each pixel separately, we process a group of pixels at a time, and this group is defined by the kernel size. The CNN only tunes the values of the kernel, so only 9 variables need to be trained, which is far fewer than in a normal neural network. If the image has dimensions 10*10*3, the kernel size becomes 3*3*3 and we need 27 parameters, which is still very few. The output image would still be of size 8*8*1.
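
As a minimal sketch of this sliding-window computation, we can use scipy.signal.correlate2d (the "convolution" in CNNs is normally implemented as cross-correlation); the random arrays here just stand in for a real image and a learned kernel:

import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(10, 10)   # the 10*10*1 input image
kernel = np.random.rand(3, 3)    # the 3*3*1 kernel: only these 9 values are learned
output = correlate2d(image, kernel, mode='valid')  # slide the kernel over the image
print(output.shape)              # (8, 8)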

Clearly, a CNN is a favorable choice over a normal neural network when working with images.

Architecture Overview

Like a traditional neural network, a convolutional neural network is a sequence of layers. Each layer transforms an input volume to an output volume through some differentiable function that may or may not have parameters.

The ConvNet architecture consists of three types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.

Simple architecture

A simple ConvNet could have the architecture [INPUT-CONV-RELU-POOL-FC]. In more detail:

  1. INPUT layer would hold the input image as a 3-D array of pixel values.
  2. CONV layer will multiply the kernel element-wise with a sub-array of the input image of the same size as the kernel, then sum all the resulting values to produce a single pixel of the output image. This process is repeated until the whole input image is covered, and for every kernel.
  3. RELU layer will apply an activation function max(0,x) on all the pixel values of an output image.
  4. POOL layer will perform downsampling along the width and height of an image resulting in reducing the dimension of an image.
  5. FC (Fully-Connected) layer will compute the class score for each of the classification categories.

In this way, ConvNets transform the original image, layer by layer, from raw pixel values to final class scores. The RELU and POOL layers implement fixed functions and have no variables to train; the parameters of the CONV and FC layers are trained using a gradient descent optimizer.
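
In Keras (which we use for the implementation later), a minimal version of this [INPUT-CONV-RELU-POOL-FC] stack could look like the sketch below; the 28*28*1 input shape, 16 filters, and 10 classes are illustrative choices, not fixed requirements:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # CONV + RELU: 16 filters of size 3*3 applied to a 28*28*1 INPUT
    Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    # POOL: downsample height and width by a factor of 2
    MaxPooling2D(pool_size=(2, 2)),
    # FC: flatten, then compute a score for each of the 10 classes
    Flatten(),
    Dense(10, activation='softmax'),
])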

Convolutional Layer

This layer is the core of ConvNet. It does most of the computational heavy lifting.

Convolutional layer working

Hyperparameters used in this layer:

  1. The depth of an output volume represents the number of layers it has, which equals the number of filters used. In the above image, the depth is 1 as it’s a 2-D image.
  2. Filter size (f) represents the height and width of the filters used. The depth of a filter is not specified, as it is always the same as the depth of the input image. For the image above, the filter size is 3.
  3. Stride (s) represents the step size taken when traversing horizontally and vertically along the height and width of the input image. For the image above, the stride is 1.
  4. Padding (p) helps retain the height and width of the input image. Since each layer shrinks the image, we could not otherwise build deep networks with many such layers for small images. With padding, we add extra rows and columns around the image, filled with the value 0 by default, as shown below. Padding also helps us retain more information from the corner pixels, since they are now processed multiple times instead of once.
Using padding
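
In numpy, a one-pixel zero-padding like the one pictured can be reproduced with np.pad:

import numpy as np

image = np.arange(9).reshape(3, 3)  # a tiny 3*3 "image"
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)                 # (5, 5): a row/column of zeros on every side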

All of the above hyperparameters are fixed for a given layer; different convolutional layers may use different values. The filter matrix is a weight matrix whose values are trained using backpropagation.

Working

  1. A 3-D input volume is passed to this layer. Its dimension is H*W*C, where H, W, and C represent the height, width, and number of channels respectively.
  2. K filters are used, where K is the depth of the output volume. All K filters have the same dimension, f*f*C, where f is the filter size and C is the number of channels of the input image.
  3. If padding is configured, it is added to the input volume. With ‘same’ padding and a 3*3 filter, one row and one column of zeros are added on each side. Padding applies only along the height and width of the input and is applied to every channel.
  4. After padding, the computation begins. We slide the filter starting from the top-left corner; the corresponding values of the filter and the input volume are multiplied and then summed. The filter then slides horizontally, taking stride steps per slide. So if the stride is 2, we slide 2 columns at a time. The same process is repeated vertically until the whole image is covered.
  5. The values obtained from the filter computation are passed through the RELU activation max(0,x), which replaces all negative values with zero, as negative values have no significance as pixel intensities.
  6. Steps 4 & 5 generate just one layer of the output volume; that is, the 3-D input volume is transformed into a 2-D slice.
  7. Steps 4 & 5 are repeated for all K filters, and the output of each filter is stacked on top of the others, so the output volume has depth K. (A small numpy sketch of the full pass follows this list.)
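
Putting steps 1–7 together, the whole pass can be written as a small, unoptimized numpy sketch; conv_forward and its arguments are illustrative names, not a standard API:

import numpy as np

def conv_forward(volume, filters, stride=1, pad=0):
    # volume: H*W*C input, filters: K*f*f*C learned kernels
    if pad > 0:  # step 3: zero-pad height and width only
        volume = np.pad(volume, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    H, W, C = volume.shape
    K, f = filters.shape[0], filters.shape[1]
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):              # step 7: one 2-D slice per filter
        for i in range(out_h):      # step 4: slide vertically ...
            for j in range(out_w):  # ... and horizontally, stride cells at a time
                patch = volume[i*stride:i*stride+f, j*stride:j*stride+f, :]
                out[i, j, k] = np.sum(patch * filters[k])  # multiply and sum
    return np.maximum(out, 0)       # step 5: RELU, max(0, x)

x = np.random.rand(10, 10, 3)    # H*W*C input volume
w = np.random.rand(4, 3, 3, 3)   # K=4 filters of size 3*3*3
print(conv_forward(x, w, stride=1, pad=1).shape)  # (10, 10, 4)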

To calculate the dimensions of the output volume, we need all the hyperparameters of the convolutional layer. All the filters of this layer need to be trained and are initialized with small random numbers.

The height and width of the output volume are given by

out_height = floor( ( H + 2*P - F ) / S ) + 1
out_width = floor( ( W + 2*P - F ) / S ) + 1

depth = K (number of filters used)
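
As a quick sanity check, the formula can be wrapped in a small (hypothetical) helper:

def conv_output_size(w, f, p=0, s=1):
    # floor((W + 2*P - F) / S) + 1
    return (w + 2 * p - f) // s + 1

print(conv_output_size(10, 3))       # 8: matches the 10*10 -> 8*8 example above
print(conv_output_size(10, 3, p=1))  # 10: 'same' padding preserves the size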

This is how the convolutional layer works.

Computation at this layer

Pooling Layer

This layer reduces the height and width of the input volume without reducing its depth. Shrinking the spatial size reduces the computational power required to process the image.

Max Pooling

Pooling does not lose the important properties of an image: it extracts the most dominant information, which keeps the training of the model effective.

Two types of pooling are used: Max Pooling and Average Pooling.

In max pooling, the maximum value within the selected window is retained and all other values are discarded, while in average pooling the average of all values within the window is stored.

Both types of pooling

Pooling also acts as a noise suppressant. Max pooling generally performs better than average pooling and hence is used more frequently.
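
Both variants are easy to reproduce in numpy for a 4*4 input with a 2*2 window and stride 2 (a minimal sketch using a reshape trick):

import numpy as np

a = np.arange(16).reshape(4, 4)  # a 4*4 input
blocks = a.reshape(2, 2, 2, 2)   # split into non-overlapping 2*2 windows
print(blocks.max(axis=(1, 3)))   # max pooling:     [[ 5  7] [13 15]]
print(blocks.mean(axis=(1, 3)))  # average pooling: [[ 2.5  4.5] [10.5 12.5]]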

Hyperparameters used in this layer:

  1. Filter size (f) represents the size of the kernel to be used.
  2. Stride (s) represents the number of steps to take while sliding the kernel window.
  3. Padding (p) represents how much padding to apply to the input. Padding is usually not used at this layer.

The pooling operation has no weights to train, so backpropagation does not affect this layer. And once the hyperparameters are fixed, they never change.

The dimensions of the output volume are calculated the same way as for the convolutional layer. Here, the depth of the output volume equals the depth of the input volume.

Fully-Connected Layer (FC-Layer)

This layer classifies the complex features extracted by the previous layers. It works like an ordinary neural network, in which each neuron is connected to all the neurons of the next layer. The final output is computed with softmax, which gives the probability of each class for the given features.
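
For reference, softmax turns the raw class scores z_1, …, z_n into probabilities that are positive and sum to 1:

softmax(z_i) = exp(z_i) / ( exp(z_1) + exp(z_2) + … + exp(z_n) )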

The working of FC-layer is similar to the deep neural networks used for classification.

To pass an input volume to the FC-layer, we flatten it so that all values are arranged in one column. This flattened feature vector is then passed to the FC-layer. All the weights of the FC-layer are initialized with small random numbers and trained using the backpropagation algorithm.

Now that we’ve covered all the layer types used in ConvNets, we can apply this knowledge to recognizing hand-written digits from images. We’ll use Keras to implement the ConvNet.

Implementation

You can get the dataset by entering this practice contest.

%pylab inline
import os
import numpy as np
import pandas as pd
from scipy.misc import imread
from sklearn.metrics import accuracy_score
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.optimizers import Adam

Now we’ll create a model with 6 convolutional layers, 3 max-pooling layers, and fully-connected layers.

input_reshape = (28, 28, 1)
pool_size = (2, 2)  # pooling window; assumed 2*2, the usual choice
hidden_num_units = 2048
hidden_num_units1 = 1024
hidden_num_units2 = 128
output_num_units = 10
epochs = 10
batch_size = 16

model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=input_reshape, padding='same'),
    Conv2D(16, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=pool_size),
    Dropout(0.2),

    Conv2D(32, (3, 3), activation='relu', padding='same'),
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=pool_size),
    Dropout(0.2),

    Conv2D(64, (3, 3), activation='relu', padding='same'),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D(pool_size=pool_size),
    Dropout(0.2),

    Flatten(),
    Dense(units=hidden_num_units, activation='relu'),
    Dropout(0.3),
    Dense(units=hidden_num_units1, activation='relu'),
    Dropout(0.3),
    Dense(units=hidden_num_units2, activation='relu'),
    Dropout(0.3),
    Dense(units=output_num_units, activation='softmax'),
])

You can see we’ve used the Dropout layer. It randomly drops some neurons during training and hence helps prevent overfitting. Dropout(0.2) means 20% of the neurons are dropped at random in every training cycle. Dropout is also a regularization method.

Dropout
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])

trained_model_conv = model.fit(train_x_temp, train_y, epochs=epochs, batch_size=batch_size, validation_data=(val_x_temp, val_y))

We compile and fit the model. This model achieved 99.34% accuracy.
