INTRODUCTION TO CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks, also called ConvNets, are one of the main reasons why deep learning is so popular today. They are a class of neural networks that is highly effective at classifying structured data where the order of arrangement matters, such as images, audio and video. Their uses span a wide range of domains; however, they are primarily used for image classification, object detection, generative models and image segmentation. Recently, their performance on image recognition tasks has surpassed human performance on standard datasets.

In this post, I will explain what convolutional neural networks are, how they work and how to build state-of-the-art image recognition systems using ConvNets. This topic is very broad, so I shall cover the basic foundations and a minimal implementation here; I will cover more details in the next posts.

HOW CONVNETS WORK

In my first post, I explained that neural networks have multiple layers, with each layer containing multiple neurons, and with each neuron in one layer connected to every single neuron in the next layer. Convolutional neural networks differ from this structure in many ways.

Each layer in a convolutional neural network is made up of a number of channels, and each channel can be regarded as a feature detector working to detect one specific feature.

For example, when trying to classify the picture of a man, we might use features such as the nose, eyes, ears and mouth. A convolutional layer with 4 channels would then have a channel searching for the presence of the nose, another for the eyes, one for the ears and the last for the mouth. The presence of each of these features would be encoded in the activation map of each channel. To make things clearer, I shall explain how each channel works.

Consider the image below

The set X1 to X16 is a 4 x 4 matrix of 16 pixels representing an image. The set of parameters W1 to W4 is called the kernel or filter; it represents the feature we are trying to detect in the image.

The convolution operation computes a dot product of the kernel with a local region of the image; the local region has exactly the same dimensions as the kernel. Hence, if you choose a 3 x 3 kernel instead of the 2 x 2 kernel depicted above, the local region of the image convolved with the kernel would also be 3 x 3 pixels.

As seen above, a dot product is computed between the kernel and the local region of the image, and the result is stored in the output activation map, also called the output feature map.

For those not familiar with linear algebra, the dot product is the sum of the element-wise products of two vectors. This can be clearly seen above.

Notice the B parameter above; this is the bias, which you are already familiar with from standard neural networks. If you do not fully understand this, I recommend reading my first post, or if you want to get your hands dirty with deeper details, Stanford's CS231n is an excellent place to go.

Note that each application of the convolution operation is a vector-to-scalar transformation, since the output at each position is a single scalar.
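To make this concrete, here is a minimal NumPy sketch of a single convolution step; the region, kernel and bias values are made up purely for illustration:

import numpy as np

# A hypothetical 2 x 2 local region of the image and a 2 x 2 kernel
region = np.array([[1, 2],
                   [3, 4]])
kernel = np.array([[0, 1],
                   [1, 0]])
b = 1  # the bias

# Dot product: the sum of the element-wise products, plus the bias
output = np.sum(region * kernel) + b
print(output)  # (1*0 + 2*1 + 3*1 + 4*0) + 1 = 6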

The first step above has only searched for the presence of the feature within the first 2 x 2 region; the operation eventually covers the entire image. I have drawn up the entire operation for your study.

The images above are a clear depiction of how convolutional neural networks work; I drew them for use in a book I am writing.

Here, the kernel shifts one pixel at a time; this single shift is called a stride of 1.

A stride of 2, where the kernel shifts two pixels at a time, is also entirely possible and is sometimes used.
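Putting the pieces together, here is a rough NumPy sketch of the whole operation: the kernel slides across the image with a configurable stride, and each dot product fills one entry of the output feature map. The function name and values are mine, for illustration only:

import numpy as np

def convolve2d(image, kernel, bias=0.0, stride=1):
    # Slide the kernel over the image, computing one dot product per position
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(region * kernel) + bias
    return feature_map

image = np.arange(16).reshape(4, 4)   # the 4 x 4 image X1..X16
kernel = np.array([[0, 1], [1, 0]])   # the 2 x 2 kernel W1..W4

print(convolve2d(image, kernel, stride=1).shape)  # (3, 3)
print(convolve2d(image, kernel, stride=2).shape)  # (2, 2)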

In a convolutional layer with 10 channels, this operation is repeated 10 times with ten different sets of parameters, each parameter set representing a unique feature detector.

Convolutional Neural Networks are locally connected, as seen above. They also share parameters: each channel maintains only a single set of parameters, which it applies at every position in the image. This greatly reduces the number of parameters.
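To get a feel for how large the savings are, here is a back-of-the-envelope comparison for a 28 x 28 grayscale image, assuming 128 neurons in the fully connected case and 128 channels with 3 x 3 kernels in the convolutional case:

# Fully connected: every one of the 784 pixels connects to each of 128 neurons
dense_params = 28 * 28 * 128 + 128       # weights + biases = 100,480

# Convolutional: each of the 128 channels shares one 3 x 3 kernel
# (the input has a single channel, hence 3 * 3 * 1 weights per kernel)
conv_params = 128 * (3 * 3 * 1) + 128    # weights + biases = 1,280

print(dense_params, conv_params)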

A big advantage of Convolutional Neural Networks is that they exploit the 2D structure of images. This is a sharp contrast to our earlier approach of flattening the entire image and feeding it into a standard neural network.

To get a feel for how this affects the performance of neural networks, take a look at the picture below:

Cristiano Ronaldo

You can see this is Cristiano Ronaldo.

So I decided to flatten this image to a width of 56; look at the result below.

The same picture flattened to a width of 56

Obviously, this no longer looks like Cristiano Ronaldo. Now consider if I flattened it to a width of just 1 pixel; the image would practically disappear from view.

This is the terrible problem neural networks face when we flatten images. Convolutional Neural Networks handle the two-dimensional structure properly, hence we never need to flatten the input images.
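If you want to reproduce this effect yourself, here is a minimal sketch, assuming any square grayscale image loaded as a NumPy array (I use random values as a stand-in):

import numpy as np

img = np.random.rand(224, 224)    # stand-in for a 224 x 224 grayscale image
squashed = img.reshape(-1, 56)    # the same pixels forced into a width of 56
flat = img.reshape(-1, 1)         # a width of 1: the image disappears from view
print(img.shape, squashed.shape, flat.shape)  # (224, 224) (896, 56) (50176, 1)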

Without further ado, I will now explain how to develop convolutional neural networks using Keras.

Here is the full code, which I shall fully explain shortly.

import keras
from keras.datasets import mnist
from keras.layers import Dense, Conv2D, Flatten
from keras.models import Sequential
from keras.optimizers import SGD
import keras.backend as K

# Load the MNIST dataset of handwritten digits
(train_x, train_y), (test_x, test_y) = mnist.load_data()

img_rows, img_cols = 28, 28

# Reshape the data to match the backend's expected image format
if K.image_data_format() == "channels_first":
    train_x = train_x.reshape(train_x.shape[0], 1, img_rows, img_cols)
    test_x = test_x.reshape(test_x.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    train_x = train_x.reshape(train_x.shape[0], img_rows, img_cols, 1)
    test_x = test_x.reshape(test_x.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

# One-hot encode the labels into vectors of length 10
train_y = keras.utils.to_categorical(train_y, 10)
test_y = keras.utils.to_categorical(test_y, 10)

# Five convolutional layers followed by a dense classification layer
model = Sequential()
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu", input_shape=input_shape))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Flatten())
model.add(Dense(units=10, activation="softmax"))

model.compile(optimizer=SGD(0.001), loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_x, train_y, batch_size=32, epochs=20, validation_data=(test_x, test_y), shuffle=True)

If you have read my previous tutorials, then a lot of this will be very familiar to you; if you haven't, I encourage you to do so.

A couple of things are new here, and I shall explain each.

if K.image_data_format() == "channels_first":
    train_x = train_x.reshape(train_x.shape[0], 1, img_rows, img_cols)
    test_x = test_x.reshape(test_x.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    train_x = train_x.reshape(train_x.shape[0], img_rows, img_cols, 1)
    test_x = test_x.reshape(test_x.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

This code detects the appropriate input shape for our data, because Keras supports different backends which order the image dimensions differently. You don't need to worry too much about this; think of it as a constant you need to include every time you work with MNIST in Keras using ConvNets.
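With the default TensorFlow backend, image_data_format() returns "channels_last", so the four print statements above should output the following (60,000 training images and 10,000 test images, each 28 x 28 with a single channel):

(60000, 28, 28, 1)
(60000,)
(10000, 28, 28, 1)
(10000,)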

model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu", input_shape=input_shape))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Conv2D(filters=128, kernel_size=[3, 3], activation="relu"))
model.add(Flatten())
model.add(Dense(units=10, activation="softmax"))

Here is the real part you should pay attention to. First, notice that we passed in our input shape as the value obtained from the previous code I told you to remember as a constant.

The Conv2D layer in Keras represents a single convolutional layer. The filters argument specifies the number of channels, or number of feature detectors, as I explained earlier. Next we specify the kernel size; notice in my diagrams that I used a kernel size of [2,2], however, in practice, [3,3] kernels are preferred as they capture local structure well and are computationally efficient. We also specify the activation function as "relu", which I explained in my first post.
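One detail worth knowing: with Keras's default "valid" padding, each convolution shrinks the spatial dimensions slightly. The output size follows output = (input - kernel) / stride + 1, so a 3 x 3 kernel with stride 1 turns a 28 x 28 input into 26 x 26, and after the five convolutional layers above, the feature maps have shrunk to 18 x 18.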

A new entrant into our simple architecture is the Flatten layer, and it is very important. Convolutions can handle two-dimensional structures, but our final classification layer has to be a Dense (linear) layer; hence, after all our convolutions, we need to flatten the output before feeding it into the final dense classification layer. Flattening at this stage is perfectly okay: by now the convolutions have already extracted the spatial features.
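Following the shape arithmetic above, the last convolutional layer outputs feature maps of shape (18, 18, 128), so Flatten produces a vector of 18 * 18 * 128 = 41,472 values, and the final Dense layer then holds 41,472 * 10 weights plus 10 biases. If you want to verify shapes like these yourself, Keras can print them per layer:

model.summary()  # prints each layer's output shape and parameter count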

You should run the code above on a GPU-enabled system. Convolutions are computationally expensive but run highly efficiently on GPUs; NVIDIA, whose CUDA platform was pioneered by Ian Buck, has built excellent GPUs and software for running deep learning on them.

If your laptop has an NVIDIA GPU, install cuDNN and the GPU version of TensorFlow.

To install the GPU version, run:

pip3 install --upgrade tensorflow-gpu

If you do not have an NVIDIA GPU on your laptop, then the cloud is the way to go.

You can use Google Colab; it is free and you get a high-performance GPU to use, though there are usage limits. It should suffice for this tutorial; just make sure you click on the Runtime menu and choose a GPU accelerator.

Run the code above; you should get an accuracy of about 99.10%.

Feel free to modify the network as you wish.

There are many more components of CNN architectures that I will discuss in upcoming tutorials.

For now, it's really cool to have built a handwritten digit recognizer that can classify digits with 99% accuracy.

There are other environments for running large-scale experiments. They include:

Microsoft Azure

Google Cloud Datalab

Amazon Web Services

LeaderGPU

and more.

However, unlike Colab, these are paid services.

My next post will be on a very interesting topic in AI. Stay tuned.

If you enjoyed this post, give it some claps and share it on Twitter.

You can reach me via @johnolafenwa