Part 3: Model selection and implementation with convolutional neural networks (CNN)

Darya Dyachkova
Apr 13, 2020


Introduction

In this four-part series, we work through the Kaggle Aerial Cactus Identification challenge from 2019.

The aim of this series is to show an example implementation of the full data science pipeline for an image classification problem.

All the code for this series is available in this GitHub repository.

This post continues the discussion of classifying images of cacti. As a recap, in Part 1 we covered the pre-processing of the dataset and dimensionality reduction with an autoencoder. The images in the original dataset have dimensions of 32x32x3 (when separated into the RGB channels), and the encoded version has dimensions of 4x4x64. In Part 2, we compared the results of dimensionality reduction with the autoencoder against those obtained with classical PCA. The conclusion was that PCA captures less information per pixel, but takes less time to run.

Part 3 is concerned with convolutional neural networks: their structure and their complexity. We will also discuss the quality of the prediction as a function of the dimensionality of the input images. The main goal is to build a CNN that strikes an appropriate trade-off between training time and prediction quality, given the different options for representing the data. Without further ado, let's proceed!

Background: A note on history and anatomy of CNNs

Like many things in the data science field, convolutional neural networks (CNNs) are perceived as having been around for a while. This is true: this type of network has been gaining more and more popularity in the data science community over the past decade, but there were four important milestones before it became one of the most famous deep learning architectures.

It all began with the results of biological experiments in the 1950s. The original seed that was planted, i.e. the fundamental basis for CNNs, was the following concept: the human visual system achieves the perception of complex visual objects by summing up the responses of simple detectors. The simple detectors here are the 'simple cells', actual physical cells in the primary visual cortex, which is located in the occipital lobe of the human brain (see the figures below).

Figure 1. Simple Cells (Draelos, R., 2019)
Figure 2. The location of the Occipital Lobe (Brain Made Simple, 2019)

These cells respond to edges and bars of particular orientations (Draelos, R., 2019), and 'complex cells' then collect the information (here in the physical sense, by summing up the responses of the simple cells) from cells that respond to stimuli originating from different parts of the scene or image: top, bottom, and so on.

Following this biological discovery by Hubel and Wiesel, Dr. Kunihiko Fukushima published research in 1980 proposing a mathematical model consisting of S-cells (simple) and C-cells (complex) to describe this relationship. The concept was implemented in computational algorithms in the 1990s by Yann LeCun, who performed the analysis on the MNIST dataset of handwritten digits; the number of citations of that original paper has since exceeded 25,000!

Finally, the state-of-the-art algorithm that brought CNNs to the peak of their popularity was presented in 2012 as an entry in the ImageNet competition. It is called AlexNet (named after its author, Alex Krizhevsky), and the original paper has been cited almost 60,000 times. The idea behind the ImageNet dataset is a large variety of 'natural images', each requiring a proper label, such as 'furniture', 'door', or 'fish', in contrast to the labels 0 to 9 of the MNIST dataset.

To understand what has kept researchers' interest for so long, let's see what goes on behind the curtains in this type of network.

Structure

Two main kinds of operations, convolutions and pooling, take place in separate layers. A convolution, from the mathematical standpoint, merges two sets of information. The goal is to produce a feature map: a reduced representation of the image that presumably still contains the relevant information, but in a form better suited to the machine. Each cell of the feature map is the result of a dot product between a patch of the input and the filter (also called a kernel).

Figure 3. Convolutions: 10 5x5 filters go through each pixel of the 32x32x3 input to produce an output of dimensionality 32x32x10 (Dertat, A., 2017)

Don't let the term 'dot product' confuse you: in 2D this is simply the sum of the results of an element-wise multiplication, and since we usually work with multiple channels in the image, in 3D it becomes a multiplication of entire vectors.
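
To make this concrete, here is a minimal NumPy sketch (not code from the repository) of a single 5x5 filter sliding over one channel of a 32x32 image with stride 1 and no padding:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution of a 2D image with a 2D kernel, stride 1, no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            receptive_field = image[i:i + kh, j:j + kw]
            # the 'dot product': element-wise multiplication, then a sum
            feature_map[i, j] = np.sum(receptive_field * kernel)
    return feature_map

image = np.random.rand(32, 32)           # one channel of a 32x32 image
kernel = np.random.rand(5, 5)            # a 5x5 filter, as in Figure 3
print(convolve2d(image, kernel).shape)   # (28, 28): without padding the map shrinks
```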

Figure 4. The moving window creates a feature map. Striding allows controlling the dimensionality of the output (Dertat, A., 2017)

A kernel, or a filter (the terms are used interchangeably), is just a matrix of weights that are randomized at first and then adjusted as the algorithm learns. It behaves like a sliding window: by default we take one step at a time, but this parameter, called the stride, can be adjusted. With larger steps, the filter may skip some subsets of pixels; these subsets are also called receptive fields (when the mathematical operations are performed on them).

Figure 5. Max pooling. The filter selects the maximum value in the window and assigns it to the corresponding cell in the output (Dertat, A., 2017)

One may say that this necessarily implies a decrease in dimensionality, so that the feature map is smaller than the original image. For those who want them to stay the same size, there is the option of padding, where cells of zeros are added around the sides of the image. Additionally, we usually apply multiple filters and stack the results together, so that the 'depth' dimension of the feature map corresponds to the number of filters.
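
As a small illustrative helper (the standard formula, assumed rather than taken from the repository), the spatial size of the output can be computed from the input width W, filter size F, padding P, and stride S:

```python
def conv_output_size(W, F, P=0, S=1):
    """Output width/height of a convolution: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(32, 5, P=0, S=1))  # 28: no padding shrinks the map
print(conv_output_size(32, 5, P=2, S=1))  # 32: 'same' padding keeps the size
```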

Yet we still reduce the dimensionality in the next step, which takes place in the pooling layer. Here we also have a moving window, which selects a subset of cells, finds the maximum value, and replaces that subset with that single maximum value (in the max pooling case). In this way we reduce the width and the height while preserving the depth. The main goal of this layer is to reduce the number of parameters, keeping only the 'important' ones for training.
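
A minimal NumPy sketch of 2x2 max pooling with stride 2 (purely illustrative, not the article's code):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a single 2D feature map."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]       # crop to an even size
    windows = cropped.reshape(h // 2, 2, w // 2, 2)     # group cells into 2x2 windows
    return windows.max(axis=(1, 3))                     # keep only the max of each window

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))   # the 4x4 map is reduced to 2x2: width and height are halved
```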

To sum up, we have four hyperparameters of the network: the number of filters, their size, stride, and padding. These parameters are usually the ones adjusted during the optimization of the training.

There are two more layers that can be present in a CNN: a flattening layer and fully connected layers. The former simply turns its input into a one-dimensional vector. The computations directly relevant to the problem the network is solving happen in the fully connected layers, which usually come as a pair. The first applies weights to the output of the previous layer to produce, for a classification problem, a vector of probabilities that the image carries each particular label. The last fully connected layer then returns a single value, i.e. the consensus of all previous layers on the class that the image belongs to.
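
Putting the pieces together, a minimal Keras sketch of such a structure might look like the following (the layer counts and sizes here are illustrative assumptions, not the model trained later in this post):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # convolution: 10 filters of size 5x5, 'same' padding keeps the 32x32 spatial size
    Conv2D(10, (5, 5), padding='same', activation='relu', input_shape=(32, 32, 3)),
    # pooling: halves the width and height, keeps the depth of 10
    MaxPooling2D(pool_size=(2, 2)),
    # flatten to a one-dimensional vector for the fully connected layers
    Flatten(),
    # first fully connected layer
    Dense(128, activation='relu'),
    # final layer: one probability per class
    Dense(2, activation='softmax'),
])
model.summary()
```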

Complexity

For those familiar with big-O notation and used to making a quick count of the operations in an algorithm, a natural question arises: what can we say about the complexity of such a network? For those who aren't familiar with the concept of complexity and this notation, a quick explanation is the following. We use it as a measure of how the running time of an algorithm scales: we characterize the running time as a function of the parameters the algorithm works with, and bound that quantity from above (or from above and below) as the input size grows arbitrarily large. Further reading here.

So, following the results presented by He and Sun (2015), the total time complexity of a CNN depends on the following parameters: the depth of the network (d), the number of input channels (n), the size of the filter (s), and the size of the feature map (m). Because the filter and the feature map are square, the relationship is quadratic in s and m. Summing over the d convolutional layers, we get a total time complexity on the order of

O( Σ_{l=1..d} n_{l−1} · s_l² · n_l · m_l² ),

where, for layer l, n_{l−1} is the number of input channels, n_l the number of filters, s_l the filter size, and m_l the size of the output feature map.

Intuitively, the complexity of the calculations depends on the number of multiplications required, which is determined by those parameters, even for a network that has already been trained.

Implementation: how is this related to our problem?

We want to test the following hypothesis: if we reduce the dimensionality with an autoencoder or with PCA, we will preserve the information that is essential for training while saving space and time during training. What we would expect in this case is high classification accuracy. However, it is possible that the model would not have enough information to train on.

We will proceed as follows. Since we validated the autoencoder in terms of its reconstruction power, we can assume that the encoder part preserves the needed information, so we will reuse it for classification. But as mentioned above, to perform the actual classification we need to add fully connected layers, as shown below.

Figure 6. Representation of a simple autoencoder architecture with neural networks (modified to show the classification layer) (Jordan, 2018)

To begin with, we load the original dataset, the encoded one, and the one whose dimensionality was reduced with PCA. The encoded version is obtained by running the images through the encoder model.
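
A sketch of what this loading and encoding step might look like; the variable names (encoder, pca, X_train) and the file path are assumptions standing in for the objects trained in Parts 1 and 2:

```python
import numpy as np

# hypothetical path: the pre-processed 32x32x3 images from Part 1, scaled to [0, 1]
X_train = np.load('X_train.npy')

# encoded representation from the trained encoder (Part 1): shape (24000, 4, 4, 64)
X_train_encoded = encoder.predict(X_train)

# PCA representation (Part 2): flatten the images first, then project onto the components
X_train_flat = X_train.reshape(len(X_train), -1)
X_train_pca = pca.transform(X_train_flat)   # e.g. 25 or 10 components per image
```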

As a result, we get images of dimensions 4x4x64. For PCA, the dimensions go from 24000x96 to 24000x25 and 24000x10, respectively. As mentioned in the introduction, this type of dimensionality reduction was shown to be less effective than the autoencoder, but we can still test whether the amount of information in this representation is sufficient for training.

The model that was used for this part is the following:
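
A sketch of a small CNN along these lines, taking the 4x4x64 encoded representation as input (the exact architecture in the repository may differ):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

clf = Sequential([
    # convolve the already-encoded 4x4x64 representation
    Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(4, 4, 64)),
    MaxPooling2D(pool_size=(2, 2)),   # downsamples the already small 4x4 maps to 2x2
    Flatten(),
    Dense(128, activation='relu'),
    Dense(2, activation='softmax'),   # cactus vs. no cactus
])
clf.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```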

The result of the training was not optimal: the accuracy on both datasets was below 60%. The distribution of the predicted values looked like the following:

We can see that it resembles a normal distribution. What we can assume is happening is that the model is unable to learn properly and therefore averages out its predictions, with the mean at around 0.7, even though we would expect it to produce binary values.

The convolutional network above downsamples the data, so there is even less information than there was before, and we already started from reduced representations. Therefore, we can conclude that using the reduced representations will not lead to results of acceptable accuracy.

The next step is to take the previously trained model, namely the encoder part of the autoencoder, and add fully connected layers. Below is the function for that. We have one layer with 128 neurons, which computes the probabilities for each label, and a following one with 2 neurons, which returns a single label.
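
A sketch of such a function, written with the Keras functional API (the function name is an assumption):

```python
from keras.layers import Flatten, Dense

def add_fc_layers(encoder_output):
    """Stack the classification head on top of the encoder's output tensor."""
    x = Flatten()(encoder_output)
    x = Dense(128, activation='relu')(x)       # computes intermediate per-label evidence
    return Dense(2, activation='softmax')(x)   # one probability per class
```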

Here, we add the FC layers to the previous structures:
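
For instance, assuming autoencoder is the model trained in Part 1 and that its encoder half ends at a known layer index (the index below is an assumption):

```python
from keras.models import Model

# reuse the encoder half of the trained autoencoder
encoder_output = autoencoder.layers[6].output      # last layer of the encoder (assumed index)
full_model = Model(inputs=autoencoder.input, outputs=add_fc_layers(encoder_output))
```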

Next, we obtain the set of weights that the autoencoder converged to. Since we validated its results, we can assume that those weights are useful for the next part of the training as well. We then want to train only the fully connected layers (as we have already trained the rest!).
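
In the sketch above the encoder layers are shared with the trained autoencoder, so they already hold the weights it converged to (if the encoder were rebuilt from scratch, one would instead copy them over with get_weights()/set_weights()). Either way, the encoder is then frozen so that only the new fully connected head is trained:

```python
# freeze everything except the classification head (Flatten + the two Dense layers)
for layer in full_model.layers[:-3]:
    layer.trainable = False
```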

Also, the labels we have been working with are 0 and 1, but for the model to treat them as categorical we need to convert them explicitly with the to_categorical utility from keras.utils.
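
For example (with y_train and y_test standing in for whatever label arrays are used):

```python
from keras.utils import to_categorical

# 0/1 integer labels become one-hot vectors, e.g. 1 -> [0., 1.]
y_train_cat = to_categorical(y_train, num_classes=2)
y_test_cat = to_categorical(y_test, num_classes=2)
```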

Next, we compile our model. Since this is a categorical problem, we use categorical cross-entropy as the loss function and accuracy as the metric. The Adam optimizer was chosen as a default, but there is room to explore alternatives (such as SGD, stochastic gradient descent). Finally, we proceed to the training:
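
A sketch of the compilation and training calls; the number of epochs, the batch size, and the validation split are assumptions:

```python
full_model.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

history = full_model.fit(X_train, y_train_cat,
                         validation_split=0.2,
                         epochs=20,
                         batch_size=128)
```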

The final accuracy for this model was 0.749, with a loss of 0.578. This result is still sub-optimal, but there is a lot of room for improvement in the structure of the model and in the training itself. The next steps to get better results would be changing the optimizer, changing the objective function, and using cross-validation. The reason additional training alone will not improve the results is that the model converges after a couple of epochs, which means that at a certain point it stops learning.

The current best result on Kaggle has an accuracy of 1.

With our model, the limitation is in the data. Because of space constraints, we used smaller images for training. Even though we validated the result of the autoencoder, the final model cannot reach results as high as the ones above, primarily because of the input (the data in the competition is of size 148x148).

As another sanity check, it is worth comparing against the performance of a baseline model. I chose an SVM (support vector machine) trained on the grayscaled, flattened images. Grayscale encodes the intensity of each pixel, so instead of three RGB channels we have a single one. Flattening was done to satisfy the dimensionality requirement of the SVM (at most 2 dimensions). We can expect grayscaling to reduce the amount of information in the image as well.
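
A sketch of this baseline (with X_train/X_test and y_train/y_test as before); grayscaling here is approximated by averaging the RGB channels, and the SVM hyperparameters are left at scikit-learn's defaults, which may differ from the repository:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score

# grayscale: collapse the three RGB channels into one intensity per pixel, then flatten
X_train_gray = X_train.mean(axis=-1).reshape(len(X_train), -1)   # shape (n_samples, 32*32)
X_test_gray = X_test.mean(axis=-1).reshape(len(X_test), -1)

svm = SVC()
svm.fit(X_train_gray, y_train)
y_pred = svm.predict(X_test_gray)

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, pos_label=0))   # precision for the 'no cactus' class
```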

If we look at the precision values, we see 0, which means that the SVM did not predict any 0s at all, whereas the original classifier did. So even though the accuracy is almost the same, our classifier gave better results.

Conclusion

We worked on a classification problem addressing the presence or absence of cacti in images. We first tried using reduced representations of the images, obtained with an autoencoder and with PCA, to save time and space during training, but the results had low accuracy. We then reused the encoder part of the autoencoder and added fully connected layers to predict the labels, which produced much better results. It is worth noting that the amount of information in the images is already limited by their small size. Still, there is a lot of room for optimizing the model through the choice of the loss function, the optimizer, and so on.

Bibliography

Asiri, S. (2019). Building A Convolutional Neural Network For Image Classification With TensorFlow. Medium. Retrieved from: https://towardsdatascience.com/building-a-convolutional-neural-network-for-image-classification-with-tensorflow-f1f2f56bd83b

Dertat, A. (2017). Applied Deep Learning — Part 4: Convolutional Neural Networks. Towards Data Science. Retrieved from: https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2

Draelos, R. (2019). A Short History of Convolutional Neural Networks. Glass Box Medicine. Retrieved from: https://glassboxmedicine.com/2019/04/13/a-short-history-of-convolutional-neural-networks/

He, K., & Sun, J. (2015). Convolutional neural networks at a constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5353–5360).

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).

Saikia, S. (2019). Building a Convolutional Autoencoder with Keras Using ConvTranspose. Medium. Retrieved from: https://medium.com/analytics-vidhya/building-a-convolutional-autoencoder-using-keras-using-conv2dtranspose-ca403c8d144e

Sharma, A. (2018). AutoEncoder as A Classifier Using Fashion-MNIST Dataset. DataCamp. Retrieved from: https://www.datacamp.com/community/tutorials/autoencoder-classifier-python
