Deep Learning: Convolutional Neural Networks (CNNs)

Mohammed Terry-Jack
7 min read · May 16, 2019


A deep Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been behind numerous state-of-the-art models in image-related tasks. This is largely because CNNs are so good at capturing features across various hierarchical scales.

(left) low-level features like edges (right) high-level features like wheels

A CNN’s neurons are exactly the same as those of a normal feed-forward neural network (FFNN).

The only difference is that instead of every neuron in one layer connecting to every neuron in the next layer (densely connected) like a normal FFNN, a CNN layer is sparsely connected: a window slides over the neurons in a layer (the increments of the sliding window are known as the stride) and each window of neurons is multiplied by the same, shared set of weights (known as a kernel or filter).

Each filter (shared set of weights) corresponds to a learnt feature of the data (e.g. a steering wheel in an image), and so you can specify more than one filter, which adds more “depth” to the next convolutional layer (when viewed in 3D).

Calculating a new convolutional layer using a single 3x3 kernel/filter

To clarify, let’s use a concrete example. Shown below is a 3x3 image being fed into a CNN with just 2 layers. There are 9 inputs in the first layer (one input per pixel) and there are 4 different kernels connecting the input layer to the next, hidden convolutional layer (shown as blue, red, green and pink coloured lines). The 9 inputs from the input layer are multiplied by a kernel’s weights and map to 4 neurons in the next layer. Since there are 4 different kernels, the next layer has a total of 4 neurons x 4 kernels = 16 neurons.

A CNN with flattened 1D layers

The example above depicts the CNN with flattened, 1D layers to make it comparable with the layers of an FFNN. However, images of CNN layers are often shown in 2D (or 3D). In 1D the layers are 9 -> 16. In 2D the layers would be shown as: 9x1 -> 4x4. In 3D they would be: 3x3x1 -> 2x2x4

The same CNN depicted with 2D layers (left) and 3D layers (right)
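We can check these shapes with a minimal sketch in Keras (assuming Keras is installed; the toy model below is purely illustrative): a 3x3x1 input convolved with 4 kernels of size 2x2 and stride 1 gives a 2x2x4 output, i.e. 16 neurons in the next layer.

import numpy as np
import keras

# toy CNN: one convolutional layer with 4 kernels of size 2x2
toy_model = keras.models.Sequential()
toy_model.add(keras.layers.Conv2D(4, (2, 2), strides=(1, 1), input_shape=(3, 3, 1)))

image = np.random.rand(1, 3, 3, 1)    # one 3x3, single-channel "image"
feature_maps = toy_model.predict(image)
print(feature_maps.shape)             # (1, 2, 2, 4) -> 16 neurons in the next layer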

Fortunately, the trained weights of many famous CNN-based models (e.g. AlexNet, ResNet, VGG, Inception, Xception, etc) are readily available online, so it’s fairly straightforward to rebuild one of these SOTA architectures and test them out on some of our own images!

VGG-16

VGG-16 is a CNN with 16 weight layers: 13 convolutional layers (with max pooling layers in between) and 3 dense, fully-connected layers at the end.

VGG-16 architecture

Fortunately it uses a lot of repeated sub-structures, or blocks. Each block has 3 Conv2D layers followed by a MaxPooling2D layer (except for the first two blocks, which only have 2 Conv2D layers and a MaxPooling2D layer). The final layers are 2 Dense layers (each with dropout) and a final Dense output layer. All layers use the relu activation function except the final layer, which uses softmax.

import keras

# build the convolutional blocks: 2 Conv2D layers in blocks 1-2, 3 in blocks 3-5,
# each block ending with a MaxPooling2D layer
vgg_model = keras.models.Sequential()
for block, h_dim in enumerate((64, 128, 256, 512, 512), start=1):
    for layer in range(1, 4):
        if layer < 3 or block > 2:
            if block == 1 and layer == 1:
                vgg_model.add(keras.layers.convolutional.ZeroPadding2D((1, 1), input_shape=(224, 224, 3)))
            else:
                vgg_model.add(keras.layers.convolutional.ZeroPadding2D((1, 1)))
            vgg_model.add(keras.layers.convolutional.Conv2D(h_dim, (3, 3), activation='relu', name=f'conv{block}_{layer}'))
    vgg_model.add(keras.layers.convolutional.MaxPooling2D((2, 2), strides=(2, 2)))

# the fully-connected classifier head
vgg_model.add(keras.layers.core.Flatten())
for _ in range(2):
    vgg_model.add(keras.layers.core.Dense(4096, activation='relu'))
    vgg_model.add(keras.layers.core.Dropout(.5))
vgg_model.add(keras.layers.core.Dense(1000, activation='softmax'))
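As a quick sanity check, the model summary should list the five convolutional blocks, their pooling layers and the three dense layers described above

vgg_model.summary()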

Then we download and load in the trained weights

!wget https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5
vgg_model.load_weights('vgg16_weights_tf_dim_ordering_tf_kernels.h5')

We also download the output class labels (this model can distinguish between 1000 different classes!)

!wget https://raw.githubusercontent.com/machrisaa/tensorflow-vgg/master/synset.txt
import numpy as np
vgg_labels = np.loadtxt('synset.txt', str, delimiter='\t')

Finally we compile the model (even though we aren’t going to train it)

vgg_model.compile(optimizer=keras.optimizers.SGD(), loss='categorical_crossentropy')

Now we can test it out. Download some images and resize them to be 224x224 pixels (this is the size of image our network takes in)

import glob
import cv2
import numpy as np

# load every jpg in the current directory and resize to the network's input size
image_files = glob.glob("*.jpg")
batch = []
for image_file in image_files:
    img = cv2.imread(image_file)
    img = cv2.resize(img, (224, 224))
    batch.append(img)

batch = np.array(batch)

Feed the batch of images into the trained VGG-16 and get its predictions

predictions = vgg_model.predict(batch)
Look how well it performs on a wide range of images
It does mistake Johnny 5 for a projector, however!
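For each image, the predicted class is simply the index with the highest probability, which we can look up in the label list we downloaded (a small sketch; the print formatting is just illustrative):

for image_file, prediction in zip(image_files, predictions):
    best = np.argmax(prediction)                      # most probable class index
    print(image_file, vgg_labels[best], prediction[best])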

Transfer Learning

Now we need to fine-tune this model so that it can correctly categorise different robots as well as it does for classifying animals!

Four of the five robots our classifier shall learn to identify. From left to right: c3po, r2d2, k2so, bb8

We shall use the VGG-16 model to embed each image as a vector and then use those image vectors as training inputs to train a multi-class classifier (this can be a normal FFNN, or a non-neural classifier like Random Forests or XGBoost)

VGG-16: we can keep the CNN part as an image embedder and swap out the final dense layers with our own classifier model that we shall train

First we cut off the dense layers on the end of the VGG-16 model so that we are left with only the convolutional layers

# remove the 2 Dense+Dropout pairs and the final softmax layer (5 layers in total)
for _ in range(5):
    vgg_model.pop()
vgg_model.compile(optimizer=keras.optimizers.SGD(), loss='categorical_crossentropy')
VGG-16 before (left) and after cutting off the final dense layers (right)
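If the layers were removed correctly, the model now ends at the flattened convolutional feature map (a quick check; 7 x 7 x 512 = 25,088):

print(vgg_model.output_shape)   # expected: (None, 25088)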

Now when we run an image through our VGG-16 model, instead of predicting a probability distribution across 1000 classes, it outputs the activations of the final flattened convolutional layer, which serve as the feature vector representing that image (we now have a way to embed each image into a 25,088-dimensional vector)

an example of 3 images embedded as vectors using VGG-16’s convolutional layers
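Embedding is now just a forward pass through the truncated model (a sketch; train_batch and test_batch here stand for the robot images, loaded and resized to 224x224 in the same way as before):

image_vectors = vgg_model.predict(train_batch)        # shape: (n_train_images, 25088)
image_vectors_test = vgg_model.predict(test_batch)    # shape: (n_test_images, 25088)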

Using a very small data set (6 images per category), we train some simple classifiers to distinguish between 5 types of robot (r2d2, c3po, bb8, k2so and johnny 5). (We withhold 1 image per category for the test set.)

our tiny training set of robots (bb8, r2d2, c3po, k2so, johnny 5)

We pass the training labels and image vectors (output from our vgg16 model) into our first classifier: a one-vs-rest ensemble of Gradient Boosting Classifiers.

from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

# label_idx maps each label name to an integer class index
xgb_classifier = OneVsRestClassifier(XGBClassifier())
xgb_classifier.fit(image_vectors, [label_idx[label] for label in train_labels])
predictions = xgb_classifier.predict(image_vectors_test)

It confuses c3po with k2so, which is fairly forgivable, but then it mistakes r2d2 for bb8!?!

Test results of XGBoost classifier

Next we train a shallow neural network classifier and it does even worse, still confusing r2d2 with bb8!? So we make it deeper (15 hidden layers with 100 neurons per layer) and it does better (3/5 correct)

from sklearn.neural_network import MLPClassifier

# a deeper network: 15 hidden layers of 100 neurons each
ffnn = MLPClassifier(hidden_layer_sizes=[100]*15)
ffnn.fit(image_vectors, label_vectors)
predictions = ffnn.predict(image_vectors_test)
test results from our shallow (left) and deep (right) neural network classifiers

Finally we train an Extreme Learning Machine (ELM) classifier — a special kind of shallow neural network (with a single hidden layer of 10,000 neurons) that is able to perform one-shot learning using a unique learning algorithm. It does even better than the deep ffnn classifier, with 4/5 correct!

image_vector_dim = 25088
elm = ELM(image_vector_dim, 10000)
elm.learn(image_vectors,label_vectors)
predictions = elm(image_vectors_test)
Test results of an ELM classifier
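The ELM class used above is not part of scikit-learn or Keras; a minimal sketch of one (random, fixed hidden-layer weights plus a pseudo-inverse readout, which is what makes the learning step one-shot) might look like the following, though the actual implementation may differ:

import numpy as np

class ELM:
    def __init__(self, input_dim, hidden_dim):
        # random hidden-layer weights that are never trained
        self.weights = np.random.randn(input_dim, hidden_dim)
        self.beta = None

    def learn(self, X, Y):
        # one-shot "training": solve for the output weights with a pseudo-inverse
        H = np.tanh(np.asarray(X) @ self.weights)
        self.beta = np.linalg.pinv(H) @ np.asarray(Y)

    def __call__(self, X):
        # forward pass: hidden activations times the learnt output weights
        return np.tanh(np.asarray(X) @ self.weights) @ self.beta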

This is quite amazing considering we only had 5 training images per class and it learnt to classify in one-shot! Superb!

clustering of the image vectors

Analysing the image vectors reveals why it wasn’t perfect. Bb8 images form a nice cluster on their own, but the other robot types seem harder to distinguish. To improve the results, we could improve the training data (by adding more images and using image augmentation techniques like inverting, flipping, resizing, rotating, etc) or improve the quality of the image embeddings by using another pre-trained CNN model
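For example, Keras’s ImageDataGenerator can generate flipped, shifted and rotated variants of the few training images we do have (a sketch; the exact augmentation parameters are just illustrative):

from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=20,
                               horizontal_flip=True,
                               zoom_range=0.2,
                               width_shift_range=0.1,
                               height_shift_range=0.1)

# generate one batch of augmented copies of the (already resized) training images
augmented_batch = next(augmenter.flow(batch, batch_size=len(batch), shuffle=False))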
