Computer Vision — Building a Convolutional Neural Network based on the AlexNet architecture
In this article, we will explore one of the classic CNN architectures, AlexNet, and apply a modified version of it to build a classifier that differentiates between cats and dogs.
Reasons it is a modified version:
- AlexNet classifies 1000 categories, so its final layer is a dense layer with 1000 units. Here we are building a binary classifier to distinguish just two categories, so our final layer has only 1 unit.
- We cannot match the number of filters learned in each convolutional layer, as doing so requires a large amount of GPU resources and a long training time.
Number of filters in the original AlexNet:
Convolutional Layer 1 — 96
Convolutional Layer 2 — 256
Convolutional Layer 3 — 384
Convolutional Layer 4 — 384
Convolutional Layer 5 — 256
AlexNet was introduced in 2012, and as such, we are actually using an “old” architecture to build our classifier model. But as we will see later, this architecture still gives relevant and accurate results compared to our last base model.
As seen in the figure above, AlexNet uses 5 convolutional layers and 3 fully-connected layers. In fact, it adds only a few more layers compared to LeNet-5, one of the simplest architectures, which uses 2 convolutional layers and 3 fully-connected layers.
- AlexNet is one of the earliest architectures to use the Rectified Linear Unit (ReLU) as its activation function
Alright, without further ado, let’s dive into the code. For this notebook, I am using Google Colab, as it has provided me the fastest and most convenient way to start learning computer vision. Hence, mounting my Google Drive is inevitable…
```python
from google.colab import drive
drive.mount('/content/drive')
```
A quick note before we continue: I am going to include how I prepared my datasets and show you my fixes for the issues I encountered, though they might not be the best recommendations. A word of caution: it will be a bit code-heavy up front!
I have gone ahead and downloaded the Cats and Dogs Image Dataset from Kaggle, and it has been sitting in my Drive ever since for various use cases. For the preparation, I will show only the code snippets for the cats dataset; you can do the same for the dogs dataset. Here we go.
After preparing the folders, it is time to segregate the images into respective folders.
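A sketch of how that segregation might look. The source and destination paths here are hypothetical placeholders for wherever the extracted Kaggle images live in your Drive, and I am assuming the images for one category sit in a single flat folder:

```python
import os
import shutil

def split_images(src_dir, dst_base, category, n_train=1000, n_val=500, n_test=500):
    """Copy the first n_train images into train/, the next n_val into
    validation/, and the next n_test into test/ for one category."""
    fnames = sorted(f for f in os.listdir(src_dir) if f.lower().endswith('.jpg'))
    splits = [('train', fnames[:n_train]),
              ('validation', fnames[n_train:n_train + n_val]),
              ('test', fnames[n_train + n_val:n_train + n_val + n_test])]
    for split_name, names in splits:
        out_dir = os.path.join(dst_base, split_name, category)
        os.makedirs(out_dir, exist_ok=True)
        for name in names:
            shutil.copyfile(os.path.join(src_dir, name),
                            os.path.join(out_dir, name))

# Hypothetical paths -- adjust to your own Drive layout:
# split_images('/content/drive/MyDrive/cats_raw',
#              '/content/drive/MyDrive/cats_and_dogs', 'cats')
```

Running the same call with the dogs source folder and `'dogs'` as the category fills in the other half of each split.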
As you can see, we have a total of 2000 images for each category. We are assigning 1000 cat images to training, 500 to validation and the remaining 500 to test; counting both categories, that totals 2000 images in the training set, 1000 in the validation set and another 1000 in the test set.
Now that we have prepared all datasets in their respective folders, let’s check if they contain the right number as we would expect.
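One way to sketch that sanity check, assuming the train/validation/test layout created above:

```python
import os

def count_images(base_dir):
    """Return {split: {category: number_of_files}} for a
    train/validation/test directory layout."""
    counts = {}
    for split in ('train', 'validation', 'test'):
        split_dir = os.path.join(base_dir, split)
        counts[split] = {cat: len(os.listdir(os.path.join(split_dir, cat)))
                         for cat in sorted(os.listdir(split_dir))}
    return counts

# print(count_images('/content/drive/MyDrive/cats_and_dogs'))
```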
If you remember the last convolutional neural network that we built, it was a basic model which takes in images of size (150, 150). However, AlexNet takes in images of size (224, 224), thus we would need to resize them. I ran into a format issue when resizing and the suggested fix below might not be ideal.
My solution: I ran through all the images to look for those with the incorrect format and removed them. To replace these images, I duplicated existing images in the original folder and made the replacements. I’m sure there are better ways to resolve this issue, but since the number of images with the wrong format is insignificant, I decided to take the shortcut. If anyone knows a better way to resolve this, I would really appreciate it if you could let me know in the comments.
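One minimal way to hunt for such files is to check the magic bytes: every real JPEG starts with the two-byte SOI marker `0xFFD8`, so anything named `.jpg` that starts differently was saved in another format. This is a sketch, not exactly the check I ran:

```python
import os

def find_non_jpegs(folder):
    """Return filenames whose bytes don't start with the JPEG SOI
    marker (0xFFD8), i.e. files with a .jpg name but another format."""
    bad = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), 'rb') as f:
            if f.read(2) != b'\xff\xd8':
                bad.append(name)
    return bad
```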
Once that is taken care of, it is time to resize the images.
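A sketch of the resize step with Pillow, assuming each split folder holds plain JPEG files and overwriting the images in place:

```python
import os
from PIL import Image

def resize_folder(folder, size=(224, 224)):
    """Resize every image in `folder` in place to AlexNet's
    expected 224x224 input size."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        with Image.open(path) as img:
            img.convert('RGB').resize(size).save(path, 'JPEG')
```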
Just remember to do it for the dogs dataset as well…
Build the model
Now that everything looks prepared, we are going to build the model. Note that the number of filters is not the same as the original AlexNet architecture as we are short of resources here.
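A sketch of such a scaled-down model in Keras: it keeps AlexNet's 5-convolutional / 3-dense layout, but the filter counts here are hypothetical reduced choices (not the original 96/256/384/384/256), and the head is a single sigmoid unit for the binary task:

```python
from tensorflow.keras import layers, models

def build_model():
    """AlexNet-style model with reduced filter counts and a
    single-unit sigmoid output for binary classification."""
    model = models.Sequential([
        layers.Conv2D(32, (11, 11), strides=4, activation='relu',
                      input_shape=(224, 224, 3)),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Conv2D(64, (5, 5), padding='same', activation='relu'),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
        layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
        layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(512, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(1, activation='sigmoid'),  # cat vs. dog
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```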
Let’s do a little data preprocessing before feeding our resource-hungry model with cute images of cats and dogs. What we are going to do here is a basic normalization of the pixel values. Rescaling them to values between 0 and 1 helps with convergence, but of course, there are many more normalization methods that can ensure a more consistent data distribution.
- Read the image files
- Decode the JPEG content to RGB grids of pixels
- Convert them to floating point tensors
- Rescale the pixel values (0–255) to (0, 1)
By using the ImageDataGenerator module, we can customize our own preprocessing pipeline before we fit the model to the images.
Using ImageDataGenerator, there is a lot more you can do than just rescaling: it accepts many more arguments, such as shear_range, zoom_range, horizontal_flip or vertical_flip, to generate augmented images with those properties. You can check out this page to make more tweaks before training.
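For instance, a generator that rescales and augments could look like this; the specific ranges are illustrative values, not tuned settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1. / 255,      # map pixel values from (0, 255) into (0, 1)
    shear_range=0.2,       # random shear transformations
    zoom_range=0.2,        # random zooms
    horizontal_flip=True,  # random horizontal flips
)
```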
Then, we generate batches of augmented images from our directory using flow_from_directory(), which also takes in a few parameters like class_mode and batch_size. The class_mode parameter determines how our label arrays are transformed. For example, ‘categorical’ would one-hot encode the arrays of labels, while ‘binary’ here returns a 1D array of binary labels.
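A sketch of that call, assuming the train directory holds one subfolder per category (cats/ and dogs/) as prepared earlier:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1. / 255)

def make_generator(train_dir):
    """Yield batches of (images, labels) from a directory with one
    subfolder per class."""
    return train_datagen.flow_from_directory(
        train_dir,
        target_size=(224, 224),  # resize on the fly to AlexNet's input
        batch_size=20,
        class_mode='binary',     # 1D array of 0/1 labels, one per image
    )
```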
Train the model
Then, we’re off to train the model by feeding it the images. This is followed by a look at the final accuracy and loss. We can also visualize them by plotting their trend over the epochs.
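The fitting call and the plotting could be sketched like this; the generator names are assumptions from the earlier steps, and only the plotting helper is fleshed out here:

```python
import matplotlib.pyplot as plt

# Assuming train_generator / validation_generator from flow_from_directory:
# history = model.fit(train_generator, epochs=30,
#                     validation_data=validation_generator)
# plot_history(history.history)

def plot_history(history_dict):
    """Plot training vs. validation accuracy and loss over epochs,
    given the dict stored on a Keras History object."""
    epochs = range(1, len(history_dict['accuracy']) + 1)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(epochs, history_dict['accuracy'], label='training acc')
    ax1.plot(epochs, history_dict['val_accuracy'], label='validation acc')
    ax1.set_xlabel('epoch')
    ax1.set_ylabel('accuracy')
    ax1.legend()
    ax2.plot(epochs, history_dict['loss'], label='training loss')
    ax2.plot(epochs, history_dict['val_loss'], label='validation loss')
    ax2.set_xlabel('epoch')
    ax2.set_ylabel('loss')
    ax2.legend()
    return fig
```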
If you actually run the code and look at the accuracy, you will see that this might not be the most accurate representation of what AlexNet is capable of. If you are interested in the results of each cell, my GitHub notebook is linked down below. This is merely a demonstration of how the AlexNet architecture is built and the types of issues I ran into (including the not-so-recommended fix).
The OG AlexNet
To give you an idea, below is the AlexNet model from Google TensorFlow:
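The original snippet is not reproduced here, but a Keras sketch of the full-size architecture, following the filter counts listed earlier (96/256/384/384/256, with two 4096-unit dense layers and a 1000-way softmax head), looks roughly like this. Note this is an approximation: the 2012 paper also included details like local response normalization that are omitted here.

```python
from tensorflow.keras import layers, models

def build_original_alexnet(num_classes=1000):
    """Full-size AlexNet-style model: 5 convolutional layers with the
    original filter counts, then 3 fully-connected layers."""
    return models.Sequential([
        layers.Conv2D(96, (11, 11), strides=4, activation='relu',
                      input_shape=(224, 224, 3)),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Conv2D(256, (5, 5), padding='same', activation='relu'),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Conv2D(384, (3, 3), padding='same', activation='relu'),
        layers.Conv2D(384, (3, 3), padding='same', activation='relu'),
        layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((3, 3), strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),  # 1000 classes
    ])
```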
In conclusion, AlexNet is one of the most influential CNN models. When it won the ILSVRC in 2012, it startled the entire computer vision community: the few additional layers and ideas that made all this possible were the product of many years of research.
Before we end, I’ll attach my GitHub notebook here for reference. We will try out another architecture in the next article before we start using Transfer Learning to do actual classification/detection/recognition. Thank you for reading.