Learning about Data Science — Building an Image Classifier
Hi! My name is Gabi, and I’m learning about Data Science. I plan to write about what I’m learning to help me organize my thoughts, and to keep track of what I’ve covered.
The challenge: building an image classifier that can correctly identify whether or not an invasive species of plant is present in an image (and submitting the results to Kaggle to see how it compares to others). I have 2295 sorted images to train my model with, which I’ll then use to classify 1531 images.
I’ll be programming in Python, and using Keras (with a Theano backend) to run my model. Theano is a Python library which allows me to “define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently”. Given that neural networks consist of lots of array multiplications, this is the first step to building one. Keras then takes advantage of Theano’s expressions to build a deep learning library.
I’ll use Jupyter Notebook for my programming. Let’s get set up:
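The setup cell isn’t reproduced here, so below is a minimal sketch of what it might contain. I’m writing it with current tf.keras import paths; the original Theano-backed Keras 1 spelled some of these differently (e.g. Convolution2D rather than Conv2D):

```python
# Minimal notebook setup (a sketch; the original used Keras 1 + Theano,
# so some import paths and layer names differed).
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Dropout, Flatten, Lambda,
                                     Conv2D, MaxPooling2D, ZeroPadding2D)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

np.random.seed(7)  # make the runs repeatable
```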
Creating the model:
Since the dataset is so small, I’ll be using a pre-trained model. This is actually fairly common in image classification problems, as good models exist which have been trained on millions of images; without the resources to do the same, it’s convenient to use a pre-trained model and tweak it for my needs. Specifically, I’ll be using the VGG model, which performed at the top of the ImageNet 2014 (ILSVRC) challenge and was trained on over a million images.
The VGG network is a convolutional neural network.
To understand how this is different from a ‘traditional’ neural network, I’ll first consider how I would expect to use a neural network for image recognition: I would assign each pixel a node, where some mathematical operation takes place. These nodes would then connect to further nodes through some weighted combination, eventually yielding an output. This is known as a fully connected network. A significant problem with this is that it lacks translational invariance: if I trained my network to identify flowers in the bottom right of an image, it would fail to recognize a flower in the top left of a new image (because each pixel is considered individually).
A convolutional neural network instead takes an area that is a fraction of the image, and weights each pixel in that area. It then slides this ‘area of consideration’ across the image, considering the new pixels with the same weights. This means the network gets good at identifying features (e.g. the petals of a flower), which it can then look for anywhere in an image.
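To see why sharing the weights buys translational invariance, here’s a toy example I’ve added (plain NumPy, not from the original post): a single 2×2 filter finds the same bright patch no matter which corner of the image it’s in.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image,
    computing a weighted sum of the pixels under it at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

pattern = np.array([[1., 1.], [1., 1.]])  # the 'feature' the filter detects

# Two images containing the same feature in different corners
top_left = np.zeros((6, 6)); top_left[0:2, 0:2] = 1.
bottom_right = np.zeros((6, 6)); bottom_right[4:6, 4:6] = 1.

# The same filter (same weights) finds the feature in both locations
r1 = convolve2d(top_left, pattern)
r2 = convolve2d(bottom_right, pattern)
print(np.unravel_index(r1.argmax(), r1.shape))  # (0, 0)
print(np.unravel_index(r2.argmax(), r2.shape))  # (4, 4)
```

A fully connected network, by contrast, would have learned separate weights for each corner.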
The VGG model consists of both convolutional layers and fully connected layers, with spatial pooling carried out by max pooling layers. These pooling layers combine the outputs of multiple neurons. Since these layers all have similar configurations, I can define some methods which will help me build the model:
The last method written above, the mean center method, allows me to add a layer to the model which automatically applies mean centering to the images, which is an important preprocessing step to take. I add this using a Lambda layer, which wraps an expression — in this case the mean center method — as a model layer. The means were provided by the VGG researchers, so that the pre-processing I do to my images is consistent with what the VGG model was trained on.
I can then use the methods I wrote to build the model:
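The helper methods and the model assembly aren’t shown here, so below is my sketch of how they might look, following the published VGG16 configuration (13 convolutional layers in five blocks, then two 4096-node fully connected layers). I’m using channels-last (224, 224, 3) images as in today’s tf.keras; the original Theano-backed code used channels-first (3, 224, 224), and the exact helpers in the post may differ:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, ZeroPadding2D,
                                     Dense, Dropout, Flatten, Lambda, Input)

# Per-channel means published by the VGG researchers, so my preprocessing
# matches what the model was originally trained on
vgg_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def mean_center(x):
    return x - vgg_mean

def conv_block(model, layers, filters):
    """Add `layers` zero-padded 3x3 conv layers, then a 2x2 max pool."""
    for _ in range(layers):
        model.add(ZeroPadding2D((1, 1)))
        model.add(Conv2D(filters, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

def fc_block(model):
    """Add a 4096-node fully connected layer with dropout."""
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))

def build_vgg16():
    model = Sequential()
    model.add(Input(shape=(224, 224, 3)))
    # Lambda wraps the mean-centering expression as a model layer
    model.add(Lambda(mean_center))
    conv_block(model, 2, 64)
    conv_block(model, 2, 128)
    conv_block(model, 3, 256)
    conv_block(model, 3, 512)
    conv_block(model, 3, 512)
    model.add(Flatten())
    fc_block(model)
    fc_block(model)
    model.add(Dense(1000, activation='softmax'))  # 1000 ImageNet classes
    return model

model = build_vgg16()
```

In practice I’d then load the pre-trained VGG weights into this architecture rather than train it from scratch.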
One final note as part of the setup: Keras requires that each category is in its own folder. I’ve also split the training data from Kaggle into a training set (2095 images) and a validation set (200 images, 100 for each class), which I’ll use to test the success of my model.
Now that I have a model ready, I want to apply it to the invasive species problem. Let’s take a look at two images from the data I’m going to categorize:
My goal is to optimize my model and the way I train it.
Before I start, there’s one thing I must do: currently, the last layer of the VGG model outputs 1000 categories (as this is how many it was trained to identify in the ImageNet competition); I need to make sure it only outputs 2 (the number of classes I want to identify: invasive or non-invasive). I do this by removing the last layer of the model and replacing it with my own 2-node layer.
Finally, to be able to train my model (which I plan on doing), I need to compile it. In this step, I can define the model’s optimizer and learning rate. For these, I’ll use an Adam optimizer (a variant of stochastic gradient descent) with a learning rate of 0.001.
I also define the loss function to be categorical cross entropy (motivation).
Initially, I’m going to train only the last layer of my model. What I’m doing is assuming that the features the model learned while being trained to identify the 1000 ImageNet categories are already good enough to recognize the invasive species; the new last layer just needs to learn how to map those features onto my two classes.
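Putting the last three steps together (replace the output layer, freeze everything else, compile), a standalone sketch might look like this. I’ve swapped in a tiny stand-in network for VGG so the snippet runs on its own, and `learning_rate` was spelled `lr` in the Keras of the time:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Small stand-in for the VGG model so this snippet runs by itself;
# in the notebook, `model` is the VGG16 built earlier.
model = Sequential([Input(shape=(8,)),
                    Dense(16, activation='relu'),
                    Dense(1000, activation='softmax')])  # 1000 ImageNet classes

model.pop()                                # drop the 1000-way output layer
model.add(Dense(2, activation='softmax'))  # invasive / non-invasive

# Train only the new last layer: freeze everything else
for layer in model.layers[:-1]:
    layer.trainable = False

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```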
I’m now nearly ready for a first fit. I want to do this quickly, so I have a benchmark of what I’ll be improving on.
For efficiency, I’m going to measure the success of the model over 5 epochs, and use this as a proxy for training the model for a much longer time. Since the validation data has an equal number of invasive and non-invasive datapoints (100 each), accuracy is a reasonable gauge of the model’s success.
In order to train my model on the images, I need to ‘call’ them. Keras can do this using an Image Data Generator:
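A sketch of that step. The folder names (`train`, `valid`) are my assumption, and since those folders don’t exist outside the notebook, I demonstrate the batching on random arrays with `.flow()` and leave the `flow_from_directory` calls in comments:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator()  # no augmentation yet -- just 'calls' the images

# In the notebook, the generators read from the class-per-folder layout:
# train_batches = gen.flow_from_directory('train', target_size=(224, 224),
#                                         class_mode='categorical', batch_size=64)
# valid_batches = gen.flow_from_directory('valid', target_size=(224, 224),
#                                         class_mode='categorical', batch_size=64)

# Stand-in demonstration on random image data:
x = np.random.rand(10, 224, 224, 3)
y = np.eye(2)[np.random.randint(0, 2, 10)]  # one-hot labels
batches = gen.flow(x, y, batch_size=4)
xb, yb = next(batches)
print(xb.shape, yb.shape)  # (4, 224, 224, 3) (4, 2)
```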
Okay! I’m ready to do my first fit. I’ll be using the history object to keep track of the performance of my model across each epoch:
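The fit itself was a `fit_generator()` call in the Keras of the time (modern Keras folds this into `fit()`). A standalone sketch, with a toy model and random arrays standing in for the real image batches:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Toy stand-in for the VGG model and the image batches, so the snippet runs alone
model = Sequential([Input(shape=(8,)), Dense(2, activation='softmax')])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])

x_train = np.random.rand(64, 8); y_train = np.eye(2)[np.random.randint(0, 2, 64)]
x_val = np.random.rand(16, 8);   y_val = np.eye(2)[np.random.randint(0, 2, 16)]

# `history` records loss and accuracy per epoch, which I'll plot next.
# In the notebook: history = model.fit_generator(train_batches, epochs=5,
#                                               validation_data=valid_batches)
history = model.fit(x_train, y_train, epochs=5,
                    validation_data=(x_val, y_val), verbose=0)
print(sorted(history.history.keys()))
```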
Plotting the success of my model:
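The plotting cell might look like this (the metric keys were 'acc'/'val_acc' in Keras 1 and are 'accuracy'/'val_accuracy' today; the history values below are made up so the snippet stands alone):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt

# Stand-in for history.history from the fit above (values are illustrative)
hist = {'accuracy':     [0.70, 0.80, 0.85, 0.88, 0.90],
        'val_accuracy': [0.72, 0.81, 0.84, 0.86, 0.83]}

plt.plot(hist['accuracy'], label='training accuracy')
plt.plot(hist['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch'); plt.ylabel('accuracy'); plt.legend()
plt.savefig('first_fit.png')
```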
Not bad, but there’s still clearly room for improvement. The decrease in validation accuracy at the end is indicative of overfitting; luckily, Keras provides me with ways to deal with this (more on that later).
The first thing I’m going to do is allow more of the model to be trained. The ImageNet dataset was very different from this dataset (for instance it didn’t include the invasive species I’m trying to identify). It therefore makes sense to train the model from an earlier layer (motivation). Since I don’t want to distort the weights too quickly, I’m also going to reduce the learning rate.
Changing how much of the model is trained:
I’m going to experiment with two approaches: only retraining the fully connected layers, or retraining both the fully connected layers and the first convolutional block:
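One way to express the two scenarios is a helper that freezes every layer before a given index. Again I use a small stand-in network so the snippet runs on its own; in the real model the indices would point at the first fully connected layer (scenario B) or at the start of the convolutional block being un-frozen as well (scenario C):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Input

def set_trainable_from(model, first_trainable_idx):
    """Freeze every layer before `first_trainable_idx`, unfreeze the rest."""
    for i, layer in enumerate(model.layers):
        layer.trainable = i >= first_trainable_idx

# Small stand-in for VGG: two 'conv blocks' then the fully connected layers
model = Sequential([Input(shape=(32, 32, 3)),
                    Conv2D(8, (3, 3), activation='relu'),
                    MaxPooling2D(),
                    Conv2D(16, (3, 3), activation='relu'),
                    MaxPooling2D(),
                    Flatten(),
                    Dense(32, activation='relu'),
                    Dense(2, activation='softmax')])

set_trainable_from(model, 5)   # scenario B: only the dense layers train
b_flags = [l.trainable for l in model.layers]

set_trainable_from(model, 2)   # scenario C: a conv block also trains
c_flags = [l.trainable for l in model.layers]
```

After changing `trainable` flags, the model has to be re-compiled for the change to take effect.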
Training the model, with a learning rate of 0.0001, yielded the following:
It looks like there may be overfitting occurring, so I think the best path forward is to assume both scenarios B and C will do okay, and aim to create more data from the dataset.
Tackling overfitting: using Image Data Generator to create more images
Keras has an Image Data Generator object, which can generate data from images. The really awesome thing is that it can also augment those images; this means it can slightly alter the images, so that from my roughly 2000 training images I effectively end up with many more, thanks to the slight changes the Image Data Generator makes each time an image is served.
This is especially effective for this dataset, because I’m ultimately training my model to recognize the Hydrangea flowers in an image. In different images, these flowers might appear at different angles or sizes, or be sheared differently (see shear mapping). I can artificially add these differences to images to create a larger dataset:
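The augmenting generator might be configured like this (the parameter values are illustrative, not necessarily the exact ones used, and the demonstration runs on random arrays since the image folders aren’t available here):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each epoch sees randomly rotated, shifted, sheared, zoomed, and flipped
# variants of the training images (parameter values are illustrative)
aug_gen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             shear_range=0.15,
                             zoom_range=0.1,
                             horizontal_flip=True)

# Validation images are never augmented
val_gen = ImageDataGenerator()

# Stand-in demonstration on random data (the notebook uses flow_from_directory)
x = np.random.rand(8, 224, 224, 3)
y = np.eye(2)[np.random.randint(0, 2, 8)]
xb, yb = next(aug_gen.flow(x, y, batch_size=4))
print(xb.shape)  # (4, 224, 224, 3)
```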
Using this augmented dataset to train the scenario B and C models was far more successful:
However, it definitely looks like Scenario B is yielding the more successful model; this is the one I will use. Now, I want to more aggressively augment my data; there are still quite a few random changes I can make to the data using Image Data Generator which I’m not taking advantage of. Using this first run as a benchmark, I’ll introduce some new augmentations to the data.
These yielded the following:
So in this instance, both the benchmark and the third image generator perform fairly well. For my Kaggle submission, I’m going to use the third image generator; I’m going to run many epochs, and having more data (which the third generator does) is going to be advantageous.
Putting it all together — final model creation:
Now that I have put all the pieces together, I can prepare my final submission to Kaggle. When doing this, I’m going to add a few things to my fit_generator() method:
I’m going to train the model for 50 epochs. However, I’ll be using the EarlyStopping callback. This callback monitors some metric (in this case, the validation accuracy) and will stop training the model when it stops improving. Given that I’ve seen some fluctuations in training, I’ve also introduced a ‘patience’ of 5 epochs; this means that if the model doesn’t improve its validation accuracy for 5 epochs, it will stop training and be considered ‘trained’.
I’ll also only be saving the best performing model, using ModelCheckpoint.
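The two callbacks might be set up as follows (the monitor key 'val_accuracy' is today’s spelling, 'val_acc' in the Keras of the time; the checkpoint filename is my choice):

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop when validation accuracy hasn't improved for 5 epochs...
early_stopping = EarlyStopping(monitor='val_accuracy', patience=5)

# ...and keep only the weights of the best epoch seen so far
checkpoint = ModelCheckpoint('best_model.weights.h5', monitor='val_accuracy',
                             save_best_only=True, save_weights_only=True)

callbacks = [early_stopping, checkpoint]
# Then: model.fit(..., epochs=50, callbacks=callbacks)
```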
Once I train the model, I can use it to find the values on the test data:
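Prediction and the submission file might look like this sketch. The column names ('name', 'invasive') follow the competition’s sample submission as I remember it, a stand-in model replaces the trained VGG so the snippet runs alone, and which probability column corresponds to 'invasive' depends on the alphabetical class ordering Keras assigned to the folders:

```python
import csv
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Toy stand-in model and test batch; in the notebook, predictions come
# from the trained model run over the (unlabeled) test images.
model = Sequential([Input(shape=(8,)), Dense(2, activation='softmax')])
x_test = np.random.rand(5, 8)

probs = model.predict(x_test, verbose=0)  # one row per image: [p_class0, p_class1]
invasive_prob = probs[:, 1]               # column for the 'invasive' class (assumed)

# Write a Kaggle-style submission file
with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'invasive'])
    for i, p in enumerate(invasive_prob, start=1):
        writer.writerow([i, float(p)])
```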
This gets me a 0.98475 AUC ROC on Kaggle (AUC ROC is a particular metric of a model’s success: closer to 1 is better, and 0.5 corresponds to random guessing). Success!
Overall, this was an awesome introduction to image recognition.
There were a few things I didn’t explore, such as different optimizers or learning rates, which would be cool to look at in the future.
I also don’t know that just running models with different parameters is the most efficient way to find out which Image Data Generator or model architecture is best; it felt a little random to just change things, so I’d like to find a better way to do that.
I’d also like to explore the ‘jumpiness’ of the validation accuracy — I couldn’t always understand its tendency to suddenly increase or decrease (although this may have had to do with the small validation set size).