A Journey into Convolutional Neural Networks

Christopher Harris
CUNY CSI MTH513
May 8, 2019 · 8 min read

Introduction

For years, machine learning was seen as a mystical black box. Terabytes of data went in; predictions came out. Only a select, chosen few had the secret knowledge of the inner workings of the algorithms at play. They lived as gods among mortals, and commanded exorbitant salaries on the market. Well, those days exist no longer. The black box has been opened up; from it has come math. Lots and lots of math. This is the story of a boy’s journey into that box and the wild lands inside of it.

The Challenge

Fashion MNIST Sample Data

The challenge assigned was simple: identify articles of clothing from B&W images. The Fashion MNIST set contains 60,000 images to train on and 10,000 to test on. Each image is 28x28 pixels in size, for a total of 784 pixels per image.
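
If you want to poke at the data yourself, Keras ships a copy of Fashion MNIST; here is a minimal sketch of loading it and confirming those shapes (this built-in loader is just one convenient way to get the data):

# Load Fashion MNIST through Keras's built-in dataset loader
from keras.datasets import fashion_mnist

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
print(X_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)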

The Model

When first given this challenge, my mind immediately went to using a CNN, a Convolutional Neural Network. A CNN is a specialized form of a neural network. Artificial neural networks are groups of artificial neurons modeled on the neurons found in the human brain. Each neuron takes an input, performs an evaluation on that input, and then produces an output. Often, the output of one neuron is used as the input of another neuron. Neural networks can be trained to optimize their performance on a specific task. CNNs are a type of deep neural network that specializes in image recognition. CNNs preserve the spatial structure of an image, which means that they learn from local patterns. Once a pattern is learned in one part of an image, the network can recognize it anywhere in the image.

Neural Network Topology

Before a CNN can be used to predict what object an image is depicting, it must first be trained. A neural network is trained by giving the network training data and telling it what class the data belongs to. The network will analyze the image and try to guess what class it belongs to. By comparing the predicted class to the true class, the network can learn and improve its performance. This allows the network to associate certain features in an image with membership in a certain class.

Using Python and the Keras library, neural networks are simple to create:

model = Sequential()
model.add(Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu'))  # 32 5x5 filters, channels-first input
model.add(MaxPooling2D(pool_size=(2, 2)))  # downsample each feature map by 2x
model.add(Dropout(0.2))                    # drop 20% of activations to reduce overfitting
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))  # one probability per class
# Compile model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
MNIST Digits

This simple model can achieve an accuracy of more than 90% on the basic MNIST dataset. The MNIST dataset contains 70,000 images of handwritten digits and is seen as the classic machine learning dataset. The rule of thumb is, “If it doesn’t work on MNIST, it’s not going to work at all.”
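
To actually run the model above on the digits, the images need a channel dimension, pixel scaling, and one-hot labels. Here is a minimal sketch, assuming Keras is configured for the channels-first layout implied by input_shape=(1, 28, 28) above (the epoch count and batch size are placeholders):

from keras.datasets import mnist
from keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Add a channel dimension (channels-first) and scale pixel values to [0, 1]
X_train = X_train.reshape(-1, 1, 28, 28).astype('float32') / 255.0
X_test = X_test.reshape(-1, 1, 28, 28).astype('float32') / 255.0

# One-hot encode the ten digit classes
num_classes = 10
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200)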

Every model had some layers that remained the same throughout the competition. The first two layers reshaped the data and normalized it, and the last three layers built the classifier that actually made the determination about which class an image belonged to. For the rest of this post, the layers that changed from model to model all sit between these two bookend groups.
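
Those bookend layers aren’t reproduced in the snippets below, but a plausible sketch of them looks something like this (treat it as illustrative, not the exact code used):

# Front bookend: reshape the flat 784-pixel rows into 28x28x1 images and normalize them
model.add(layers.Reshape((28, 28, 1), input_shape=(784,)))
model.add(layers.BatchNormalization())

# ...the layers that changed from model to model go here...

# Back bookend: flatten the feature maps and classify into the ten clothing classes
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))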


All the Stuff in Between

This initial model had 6 alternating Conv2D and MaxPooling layers. This model, using Adam as an optimizer and trained for 50 epochs, was able to achieve an accuracy of 89.2%. My next attempt was to add more layers; more layers means that more features can be extracted from an image. This second model consisted of 2 repeated sets. Each set had 2 Conv2D layers, followed by a normalization layer, a MaxPooling layer, and finally a Dropout layer with a 25% dropout rate. Dropout layers are used to reduce overfitting and can increase accuracy as a CNN gets deeper and deeper. By selectively “turning off” a certain percentage of neurons, the network must become more redundant; that is, the same information must propagate through the network with fewer neurons to carry it. This model had an accuracy of 93.03%.
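
In Keras terms, one of those repeated sets looks roughly like this (the filter counts are placeholders, since the exact values used aren’t listed above):

# One of the two repeated sets: two Conv2D layers, normalization, pooling, then 25% dropout
model.add(layers.Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(layers.Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.25))
# ...then the same set again, typically with more filters (e.g., 64)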

The next version is similar to the previous one, with a few differences:

  • The dropout rate was increased to 35%, with the idea being that more dropout would cause more redundancy and decrease overfitting.
  • The training time was increased to 100 epochs. The more time the network has to analyze the data, the more accurate it can be.
  • A train-test split was implemented, with 25% of the training data reserved for validation purposes; a validation set is used to estimate a model’s performance on unseen data. (This was kept for all models following this one; see the sketch after this list.)
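
A minimal sketch of that split, assuming the training images and labels are in X and y and using scikit-learn’s train_test_split (one common way to do it):

from sklearn.model_selection import train_test_split

# Hold out 25% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)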

Making these changes caused a very, very, very small increase in score to 93.06%.

The next model actually caused a decrease in score and helped to refine what NOT to do in making the model. For this model’s structure, well, it’s easier to show than explain.

model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)))  # Fashion MNIST images are 28x28, single channel
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(layers.Dropout(0.40))
model.add(layers.Conv2D(512, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))

My thinking with this model was that, given the depth, the model was overfitting, so I increased the dropout rate to 40% to try to reduce that. I especially wanted dropout when going from 256 filters to 512. This model was trained for 75 epochs, which I also hoped would reduce overfitting. In the end, this model achieved an accuracy of 92.90%.

In my attempt to do better, I actually made it worse. The next model was far deeper than any model before it:

model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.25))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(512, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.40))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))

Basically, this model can be summed up as having repeated sections throughout. The basic “unit” is a Conv2D layer, batch normalization, and a 25% Dropout. Three of those units in a row, then a MaxPooling layer and another Dropout (40%); then that whole sequence repeats with wider Conv2D layers. The score for this model was 90.96%. Again, the model is so deep that even the increased dropouts cannot compensate for overfitting.

This next model was slightly less complex in its topology.

model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.25))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(256, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(512, (3, 3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))

My thinking was that by normalizing after every Conv2D, the model would be better able to handle the data, thereby reducing overfitting. I was wrong; this model had a score of 90.06%.

At this point in the contest, one of my teammates beat my high score of 93.06% with a score of 93.50%, using this model:

model.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu', padding='same', input_shape=(28,28,1)))
model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(128,(3, 3), activation = 'relu'))
model.add(Conv2D(256,(3, 3), activation = 'relu'))
model.add(Dropout(0.25))
model.add(Conv2D(512,(3, 3), activation = 'relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.40))

From his model, I made some modifications and arrived at this one:

model.add(layers.Conv2D(filters=32, kernel_size=(3,3), activation='relu', padding='same',input_shape=(28,28,1)))
model.add(layers.Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
model.add(layers.Conv2D(filters=64, kernel_size=(3,3), activation='relu', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.30))
model.add(layers.Conv2D(128,(3, 3), activation = 'relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(256,(3, 3), activation = 'relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Conv2D(512,(3, 3), activation = 'relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.40))

I combined this model with image augmentation. Image augmentation is used to create new training data from existing training data. A picture can be flipped along the x or y axis. It can be shifted a certain number of pixels in any direction. It can be rotated any number of degrees. By applying these transformations to the training set images, new images are created. Feeding these new images to the CNN exposes it to new data and helps increase its accuracy. Using Keras, image augmentation is simple:

# Randomly shift images by up to 20% of their width/height and allow horizontal flips
datagen = ImageDataGenerator(
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)
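
Wiring the generator into training is just as short. Here is a minimal sketch, assuming the X_train/y_train arrays and the X_val/y_val validation split from earlier, and a batch size of 64 (an assumption, not a value from the competition):

# Train on augmented batches; the validation data is left unaugmented
model.fit_generator(datagen.flow(X_train, y_train, batch_size=64),
                    steps_per_epoch=len(X_train) // 64,
                    epochs=100,
                    validation_data=(X_val, y_val))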

After training this model for 100 epochs, the final accuracy was 94.40%.

Up until this point, all models used the Adam optimizer, but Adam is one of many possible optimizers. An optimizer determines how the network’s weights are updated in order to minimize the loss function. Using the same CNN topology and data augmentation setup, I tested various optimizers to find the best one for this setup.
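
A sweep like that is easy to script. Here is a rough sketch, assuming a hypothetical build_model() helper that rebuilds the CNN above (uncompiled), plus the datagen, training arrays, and validation split from earlier:

# Hypothetical sweep; build_model() is an assumed helper, and batch size 64 is a placeholder
results = {}
for opt in ['adam', 'nadam', 'adamax', 'adagrad', 'rmsprop', 'adadelta']:
    model = build_model()
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    model.fit_generator(datagen.flow(X_train, y_train, batch_size=64),
                        steps_per_epoch=len(X_train) // 64,
                        epochs=100, validation_data=(X_val, y_val), verbose=0)
    results[opt] = model.evaluate(X_val, y_val, verbose=0)[1]  # keep the validation accuracy
print(results)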

  • Optimizer = Adam | Accuracy = 94.40%
  • Optimizer = Nadam | Accuracy = 42.00%
  • Optimizer = Adamax | Accuracy = 93.40%
  • Optimizer = Adagrad | Accuracy = 93.70%
  • Optimizer = RMSprop | Accuracy = 79.17%
  • Optimizer = Adadelta | Accuracy = 92.53%

Because Adam performed the best, I wanted to see if increasing the training epochs would increase the overall accuracy. When trained for 200 epochs, the score increased to 94.90%.

I also tested different Activation functions with the same topology as above, with Adam used as the optimizer. The activation function is used to add nonlinearity to the network. Basically, the activation function determines if, and to what degree, a given neuron produces an output.
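
For reference, these are the standard definitions of the activations tested, sketched in NumPy (textbook formulas, not code from the competition):

import numpy as np

def relu(x):             # max(0, x)
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):   # x for x > 0, alpha * (e^x - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def selu(x):             # scaled ELU with fixed constants (alpha ~ 1.6733, scale ~ 1.0507)
    alpha, scale = 1.6732632423543772, 1.0507009873554805
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))

# tanh(x) = (e^x - e^-x) / (e^x + e^-x), available directly as np.tanh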

  • Activation = relu | Accuracy = 94.40%
  • Activation = elu | Accuracy = 93.90%
  • Activation = selu | Accuracy = 38.27%
  • Activation = tanh | Accuracy = 67.87%

Additionally, I tested various values for the data augmentation settings to try to find the optimal setup. Again, these tests kept the same topology and hyperparameters as above. In each trial I changed height_shift_range and width_shift_range, which set the maximum random shift as a fraction of the image’s total height or width (for example, a 25% range on a 28-pixel image allows shifts of up to about 7 pixels).

  • Shift Range = 15% | Accuracy = 94.40%
  • Shift Range = 20% | Accuracy = 94.40%
  • Shift Range = 25% | Accuracy = 94.30%
  • Shift Range = 30% | Accuracy = 93.27%
  • Shift Range = 40% | Accuracy = 93.60%

So, after all was said and done, my model was able to achieve a final accuracy score of 94.90%. I hope this post has helped to open up the black box that is machine learning, and I wish you luck dealing with all the math that has exploded out of it!
