Convolutional Neural Network

ROHITH RAMESH
15 min read · Nov 3, 2019


What is CNN?

Convolutional neural networks (CNNs) are a special architecture of artificial neural network, proposed by Yann LeCun in the late 1980s.

In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural network most commonly applied to analyzing visual imagery; it is specifically designed to process pixel data.

CNNs are powerful image-processing, artificial-intelligence (AI) models that use deep learning to perform both generative and descriptive tasks. They are often used in machine vision, which includes image and video recognition, as well as in recommender systems and natural language processing (NLP).

Let us take an example of emojis,

Here, as shown in the example below, the convolutional network helps the computer recognize the input as a happy emoji.

CNNs are regularized versions of multilayer perceptrons. A multilayer perceptron usually means a fully connected network, that is, one in which each neuron in one layer is connected to every neuron in the next layer.

One of the most popular uses of this architecture is image classification. For example, Facebook uses CNNs for its automatic tagging algorithms, Amazon uses them to generate product recommendations, and Google uses them to search through users’ photos.

How does it work?

We were taught at a young age how to classify different objects, animals and so on. Similarly, we need to train an algorithm on numerous images before it is ready to recognize, classify or predict objects on its own.

For example: How did we learn to differentiate between a Wolf and a Dog?

By learning to differentiate based on features or characteristics. A computer, however, sees these features in a very different way.

Every image can be represented as a 2-dimensional array of numbers, known as pixels (with one such array per color channel).

Suppose you are working with a black-and-white image dataset in which each image is 30 x 30. In this case, the size of the array will be 30 x 30 x 1, where the first 30 is the width, the second 30 is the height and 1 is the depth (a black-and-white image contains only one channel). If the image is instead a colored image of size 600 x 600, the size of the array will be 600 x 600 x 3, where 3 is the depth, corresponding to the RGB channels (Red, Green and Blue). The computer assigns each of these pixels a value from 0 to 255, which describes the intensity of the pixel at that point.
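To make this concrete, here is a minimal NumPy sketch (the arrays are random placeholders, not real images) showing how a 30 x 30 black-and-white image and a 600 x 600 color image would be represented as arrays:

import numpy as np

# A 30 x 30 black-and-white image: one channel, pixel values 0-255
gray_image = np.random.randint(0, 256, size=(30, 30, 1), dtype=np.uint8)
print(gray_image.shape)   # (30, 30, 1)

# A 600 x 600 color image: three channels (R, G, B)
color_image = np.random.randint(0, 256, size=(600, 600, 3), dtype=np.uint8)
print(color_image.shape)  # (600, 600, 3)
print(color_image[0, 0])  # the R, G and B intensities of the top-left pixel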

Image Representation

Images are encoded into color channels: the image data represents the color intensity in each channel at a given point, the most common scheme being RGB.

Black and White Image:

The contrast ranges from black at the weakest intensity (0) to white at the strongest (255), with the shades of grey in between.

Colored Image:

A colored image has three channels: RGB (Red, Green, Blue). So each pixel has three values assigned to it, each between 0 and 255. By combining these values, we can determine the color of the pixel.

Why do we need CNN?

Consider the example above: the total number of neurons in the input layer will be 30 x 30 = 900, which is manageable. But what if the size of the image is 1080 x 900? We would then need 972,000 neurons in the input layer. That is a huge number of neurons and is computationally challenging, and this is where the Convolutional Neural Network proves effective. In simple terms, what a CNN does is extract the features of an image and convert them into a lower dimension without losing their characteristics.

In the following example we can see that initially the size of the image is 224 x 224 x 3. If we proceed without convolution, we need 224 x 224 x 3 = 150,528 neurons in the input layer. But after applying convolution, the tensor is reduced to 1 x 1 x 1000, which means we only need 1,000 neurons in the first layer of the feed-forward neural network.

Since, we now have a basic understanding about CNN let’s look at the steps involved in CNN.

Layers in CNN:

  1. Convolution
  2. Pooling
  3. Flattening
  4. Full Connection

1) Convolution:

Convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. It is defined as the integral of the product of the two functions after one is reversed and shifted; as such, it is a kind of integral transform.

Now, let us consider the wolf example again.

In human understanding, the characteristics of a wolf are, for example, its size or eye color. For the computer, these characteristics are boundaries or curvatures, and through successive groups of convolutional layers the computer constructs more abstract concepts.

In the case of a CNN, the convolution is performed on the input data with the use of a Feature Detector or Filter or Kernel (these terms are used interchangeably) to then produce a Feature Map or Convolved Feature or Activation Map.

Filters were traditionally designed by computer-vision experts and then applied to an image to produce a feature map: an output that makes the analysis of the image easier in some way.

The convolution operation is signified by an ‘X’, as shown below.

We take the filter and place it on top of our image as shown above. We multiply each image value by the matching filter value and then add up the results. As we can see, we have reduced the size of the image.

We have used a stride of 1, and the image dimension shrank a little. With a stride of 2, the image shrinks more, so the feature map becomes even smaller. Stride denotes how many pixels the filter moves at each step of the convolution; by default, it is one. This is the main purpose of the feature detector: it makes the image smaller so that it is easier and faster to process.
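To illustrate the multiply-and-sum operation and the effect of the stride, here is a minimal sketch using a hypothetical convolve2d helper (no padding); the image and filter values are made up for demonstration:

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the filter over the image; at each position, multiply element-wise and sum
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.random.randint(0, 2, size=(7, 7))     # toy 7x7 binary image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                  # a simple vertical-edge detector

print(convolve2d(image, kernel, stride=1).shape)  # (5, 5)
print(convolve2d(image, kernel, stride=2).shape)  # (3, 3) -- larger stride, smaller feature map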

Feature detectors doesn’t have to be a 3x3 matrix. It can be 5x5 or 7x7. But 3x3 is predominantly used.

For example, AlexNet uses 11x11 filters in its first convolutional layer.

We create multiple feature maps because we use different filters; that way, we minimize the loss of information. The network decides through training which features are important for certain categories.

A few different types of filters:

The primary purpose of convolution is to find features in our image using feature detectors and put them into a feature map. The feature map still preserves the spatial relationships between pixels, which is very important: if the pixels were jumbled up, we would lose the pattern.

The Rectified Linear Unit (ReLU)

The Rectified Linear Unit, or ReLU, is not a separate component of the convolutional neural networks’ process. It’s a supplementary step to the convolution operation.

Why do we need a ReLU activation function here?

We apply the rectifier because we want to increase non-linearity in our image, or in our convolutional neural network; the rectifier acts as the function that breaks up linearity. We want to increase non-linearity because images themselves are highly non-linear, especially when we are recognizing different objects next to each other or against a background: the transitions between adjacent pixels are often non-linear because of borders, different colors and different elements in the background. When we apply mathematical operations such as convolution to create our feature maps, we risk creating something linear, so we need to break that linearity back up.

Let us take the example of a black-and-white image.

Feature detector

By putting the image through the convolution process, or in other words, by applying to it a feature detector, the result is what you see in the following image.

As you see in the above image, the entire image is now composed of pixels that vary from white to black with many shades of grey in between.

Rectification

What the rectifier function does to an image like this is remove all the black elements from it, keeping only those carrying a positive value (the grey and white colors).

The essential difference between the non-rectified version of the image and the rectified one is the progression of colors. If you look closely at the above image, you will find parts where a white streak is followed by a grey one and then a black one. And this is linear progression.

After we rectify the image, the colors change more abruptly. The gradual change is no longer there, which indicates that the linearity has been disposed of. So ReLU breaks this linear progression by removing anything negative; in this case, the black.
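As a small sketch of what the rectifier does numerically, applying ReLU to a feature map simply replaces every negative value (the “black”) with zero; the numbers below are made up:

import numpy as np

feature_map = np.array([[-3.0,  1.5, -0.5],
                        [ 2.0, -1.0,  4.0],
                        [ 0.0, -2.5,  3.5]])

# ReLU keeps positive values and sets everything negative to zero
rectified = np.maximum(feature_map, 0)
print(rectified)
# [[0.  1.5 0. ]
#  [2.  0.  4. ]
#  [0.  0.  3.5]]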

2) Pooling:

Let’s take the below image as an example to understand pooling.

As we can see, we have different wolves looking in different directions, positioned in different parts of the image: some faces on the right-hand side, some in the left corner, some in the middle, with different textures and lighting. We want the neural network to have a property called ‘spatial invariance’, meaning that it does not care exactly where the features are located in the image. The network must have some level of flexibility to still find the features, and this is what pooling provides.

There are different kinds: max, mean and sum pooling.

Below is an example of Max pooling.

Out of the four values in each 2x2 window, only the maximum is kept. The large numbers in our feature map represent where we found the closest match to our feature. By pooling, we discard roughly 75% of the values, most of which do not correspond to the feature.

Pooling is also called ‘Down Sampling’.
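Here is a minimal sketch of 2x2 max pooling with a stride of 2, using a hypothetical max_pool helper and made-up numbers; only the largest value in each window is kept, so three out of every four values are discarded:

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the largest value in each size x size window
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

feature_map = np.array([[1, 4, 2, 1],
                        [0, 6, 3, 5],
                        [7, 2, 9, 1],
                        [3, 1, 0, 8]])

print(max_pool(feature_map))
# [[6. 5.]
#  [7. 9.]]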

3) Flattening:

Flattening basically puts everything into one long column, which becomes the input layer of the artificial neural network, as shown below.
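As a quick sketch with made-up numbers, flattening the pooled 2x2 map from the previous step produces a single vector that can be fed to the fully connected layers:

import numpy as np

pooled = np.array([[6, 5],
                   [7, 9]])

# Flattening turns the 2x2 map into one long column of 4 values
flattened = pooled.flatten()
print(flattened)        # [6 5 7 9]
print(flattened.shape)  # (4,)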

4) Full Connection:

After a series of convolutional, non-linear and pooling layers, it is necessary to attach a fully connected layer.

This layer takes the output information from the convolutional layers.

Attaching a fully connected layer to the end of the network results in an N dimensional vector, where N is the number of classes from which the model selects the desired class.

So here we have the input layer, a fully connected layer and the output layer.

In artificial neural networks we used to call these hidden layers; here we call them fully connected layers because they are a more specific type of hidden layer, one in which every neuron is connected to every neuron in the next layer. In a general ANN, hidden layers do not have to be fully connected, whereas in convolutional neural networks they are, which is why they are called fully connected layers. Note that it is not only the weights of the ANN that are trained: the feature detectors are also trained and adjusted by the same gradient-descent process, which allows the network to come up with the best feature maps.

Softmax and Cross Entropy

In a convolutional neural network, the cost function (error) is called the loss function, and it is calculated with the help of the cross-entropy function.
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.

Softmax, or the normalized exponential function, squashes the output values to lie between 0 and 1 and makes sure that they add up to 1. This is needed because the final output neurons are not connected to each other and would not otherwise know each other’s values, yet the final class probabilities should add up to 100%; it would not make sense to output 70% for dog and 70% for cat. The softmax function also goes hand in hand with the cross-entropy (loss) function.

We will be using the cross-entropy formula, L = −Σᵢ yᵢ log(ŷᵢ), which helps us calculate the loss between the actual and the predicted values.
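Here is a minimal NumPy sketch of both functions with made-up scores; softmax turns raw scores into probabilities that sum to 1, and cross-entropy penalizes the model when the probability of the correct class is low:

import numpy as np

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1
    exp = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exp / exp.sum()

def cross_entropy(probabilities, true_class):
    # Negative log of the probability assigned to the correct class
    return -np.log(probabilities[true_class])

logits = np.array([2.0, 1.0, 0.1])           # made-up raw scores for three classes
probs = softmax(logits)
print(probs)                                  # approx. [0.659 0.242 0.099], sums to 1
print(cross_entropy(probs, true_class=0))     # small loss: the right class got a high probability
print(cross_entropy(probs, true_class=2))     # large loss: the right class got a low probability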

A few key things to consider

Normalization:

In image processing, normalization is a process that changes the range of pixel intensity values. Applications include photographs with poor contrast due to glare, for example.

What is normalized RGB?

At times, you want to get rid of distortions caused by lights and shadows in an image. Normalizing the RGB values of an image can at times be a simple and effective way of achieving this.

When normalizing the RGB values of an image, you divide each pixel’s value by the sum of the pixel’s values over all channels. So, if a pixel has intensities R, G and B in the respective channels, its normalized values will be R/S, G/S and B/S (where S = R + G + B).
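As a small sketch with made-up intensities, normalizing one pixel’s RGB values by their sum looks like this:

import numpy as np

pixel = np.array([200.0, 100.0, 50.0])   # R, G, B intensities of one pixel
s = pixel.sum()                           # S = R + G + B

normalized = pixel / s                    # R/S, G/S, B/S
print(normalized)                         # approx. [0.571 0.286 0.143], sums to 1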

Pixel values are often unsigned integers in the range 0 to 255. Although these pixel values can be presented directly to neural network models in their raw format, this can cause problems during modelling, such as slower-than-expected training of the model.

Instead, there can be great benefit in preparing the image pixel values prior to modelling, ranging from simply scaling pixel values to the range 0–1, to centering, and even standardizing the values.

Image Data Augmentation:

Image data augmentation is a technique that can be used to artificially expand the size of a training dataset by creating modified versions of images in the dataset.
Training deep learning neural network models on more data can result in more skillful models, and the augmentation techniques can create variations of the images that can improve the ability of the fit models to generalize what they have learned to new images.

What will it do?

It will create many batches of our images, and in each batch it will apply random transformations to a random selection of the images, such as rotating, flipping, shifting or even shearing them. Eventually we get many more diverse images during training, and since the transformations are random, our model will rarely see exactly the same transformation twice.

The Keras deep learning neural network library provides the capability to fit models using image data augmentation via the ImageDataGenerator class.

Download and prepare the CIFAR10 dataset

The CIFAR10 dataset contains 60,000 32x32 color images in 10 classes, with 6,000 images in each class. The dataset is divided into 50,000 training images and 10,000 testing images. The classes are mutually exclusive and there is no overlap between them.

Python Implementation : Using TensorFlow 2.0

"""Importing required libraries"""import numpy as np
import time
import matplotlib.pyplot as plt
% matplotlib inline
%tensorflow_version 2.ximport tensorflow as tf
tf.test.gpu_device_name()
print(tf.__version__)from tensorflow.keras import datasets, layers, models

Importing the data

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images / 255.0, test_images / 255.0
train_images.shape

Plot the first 25 images from the training set, display the class name below each image, and verify the data.

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    # The CIFAR labels happen to be arrays,
    # which is why you need the extra index
    plt.xlabel(class_names[train_labels[i][0]])
plt.show()
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(96, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(10, activation='softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Image data generator used for image augmentation

Here I am using the CIFAR10 dataset, which is a built-in dataset in the TensorFlow library, and hence I am using .flow(x, y). If you are loading images from a directory, kindly use the .flow_from_directory(directory) option instead.

# Image data generator
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(zoom_range=0.2, horizontal_flip=True)

# train the model
start = time.time()
# fits the model on batches with real-time data augmentation:
history = model.fit_generator(datagen.flow(train_images, train_labels, batch_size=128),
                              steps_per_epoch=1000, epochs=100,
                              validation_data=(test_images, test_labels))
end = time.time()
print("Model took %0.2f seconds to train" % (end - start))

# always save your weights after training or during training
model.save_weights('100_epochs.h5')

For a detailed explanation of the arguments, kindly refer to https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator

Visualizing training accuracy against validation accuracy for each epoch

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0.5, 1])
plt.legend(loc='lower right')
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)

So, we obtained a training accuracy of 81.6% and a validation accuracy of 79%, which is good. To improve the accuracy, add more convolutional layers, more fully connected layers, or both, and also experiment with the arguments of the ImageDataGenerator.

Testing on a new Image

image_test = tf.keras.preprocessing.image.load_img('drive/My Drive/DATA files/CNN model weights/Deer.jpg',
                                                   grayscale=False, color_mode='rgb')
New Image for Testing

This image has a different resolution, but the input shape the convolutional network was trained on was 32x32, and hence the target size of this test image should be changed accordingly.

img_test3 = tf.keras.preprocessing.image.load_img('drive/My Drive/DATA files/CNN model weights/Deer.jpg',
                                                  grayscale=False, color_mode='rgb',
                                                  target_size=(32, 32))
img1 = np.array(img_test3)
img1.shape
img1 = img1 / 255.0
img1 = img1.reshape(1, 32, 32, 3)

The reshape is done because the model expects a batch dimension: we have just 1 test image of size 32x32 with 3 channels (RGB).

# predict the class of the test image and map it back to a class name
result1 = np.argmax(model.predict(img1), axis=1)[0]

plt.figure(figsize=(10, 10))
plt.imshow(image_test)
plt.xlabel(class_names[result1])
plt.show()
Predicted Class

So far the model is doing a pretty good job in classifying the images.

The application of CNNs is not restricted to classification. A few other applications include:

  1. Decoding Facial Recognition:
    • Identifying every face in the picture
    • Focusing on each face despite external factors, such as light, angle, pose, etc.
    • Identifying unique features
    • Comparing all the collected data with already existing data in the database to match a face with a name.
  2. Analyzing Documents: Convolutional neural networks can also be used for document analysis. This is not just useful for handwriting analysis, but also plays a major role in document recognition.
  3. Advertising: CNNs have already made a world of difference in advertising with the introduction of programmatic buying and data-driven personalized advertising.

Resources:

For the implementation code and trained model weights, kindly refer to https://github.com/rohithramesh1991/CNN_Image-Classification

Also refer to https://github.com/rohithramesh1991/Face-Emotion-Detection, another example of how powerful the CNN architecture can be.

Thank you for reading…


ROHITH RAMESH

Keep Developing Your Skills and Encourage Data-Driven Culture