Kaggle #1 Winning Approach for Image Classification Challenge
This post describes the approach I used for the Kaggle competition Plant Seedlings Classification. I was ranked #1 for a couple of months and finished #5 in the final evaluation. The approach is fairly generic and can be used for other image recognition tasks as well.
Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know beforehand which technique or analyst will be most effective.
Can you differentiate a weed from a crop seedling?
The ability to do so effectively can mean better crop yields and better stewardship of the environment.
The Aarhus University Signal Processing group, in collaboration with University of Southern Denmark, released a dataset containing images of approximately 960 unique plants belonging to 12 species at several growth stages.
A database of images of approximately 960 unique plants belonging to 12 species at several growth stages is made publicly available. It comprises annotated RGB images with a physical resolution of roughly 10 pixels per mm.
To standardise the evaluation of classification results obtained with the database, a benchmark based on F1 scores is proposed. The dataset is available at this URL 
The following image is a sample depicting all 12 classes in the dataset:
The task of classifying the images into their respective classes has been divided into 5 steps:
The first and the most important task in Machine Learning is to analyze the dataset before proceeding with any algorithms. This is important in order to understand the complexity of the dataset which will eventually help in designing the algorithm.
The distribution of images and the classes are as follows:
As mentioned above, there are 12 classes and a total of 4750 images. However, the distribution is uneven: class sizes range from a maximum of 654 images to a minimum of 221 images. The data is clearly imbalanced and needs to be balanced in order to get the best results. We will come to that in STEP 3.
Next, it is important to visualize the images in order to understand the data better. So, some sample images from each class are displayed to see how the images differ from each other.
Not much can be understood from the images above, as all of them look pretty much the same. So, I decided to look at the distribution of the images using a visualization technique called t-Distributed Stochastic Neighbor Embedding (t-SNE).
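The per-class counts behind such a distribution can be reproduced with a few lines of Python. This is a minimal sketch that assumes the labels are already available as a list of class names; the function name and the toy labels are illustrative and not taken from the competition data.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Toy labels for illustration only; the real dataset has 12 classes.
toy_labels = ["weed"] * 6 + ["crop"] * 2 + ["grass"] * 3
counts, ratio = class_distribution(toy_labels)
print(counts.most_common())
print(f"imbalance ratio: {ratio:.2f}")
```

With the figures quoted above (654 images in the largest class, 221 in the smallest), the same ratio comes out at roughly 3.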
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied on large real-world datasets.
Even after looking closely, we can hardly see the differences between the classes. So it is important to understand whether the data is difficult to distinguish only for humans or for a machine learning model too. We will run a basic benchmark to find out.
Training and Validation Set
Before starting with the model benchmark, we need to divide the data into a training set and a validation set. The validation set plays the role of a test set before the model is evaluated on the real test set. The model is trained on the training set and evaluated on the validation set, and it can then be improved iteratively against the validation results. Once we are satisfied with the results on the validation set, we can apply the model to the real test set. Comparing training and validation performance also shows whether the model is overfitting or underfitting, which helps us fit the model better.
So we split our dataset of 4750 images, keeping 80 percent of the images as the training set and 20 percent as the validation set.
Once we have the training and validation sets, we can start benchmarking the dataset. This is a classification problem: given a test image, we need to assign it to one of the 12 classes. So we will use a Convolutional Neural Network (CNN) for the task.
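One way to carry out such an 80/20 split is sketched below. It assumes a stratified split (each class keeps roughly the same proportion in both sets), which the post does not state explicitly; the function name, seed, and toy labels are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split sample indices into train/validation sets, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])    # first n_val shuffled samples go to validation
        train_idx.extend(indices[n_val:])  # the rest go to training
    return sorted(train_idx), sorted(val_idx)


# Toy labels: 50 samples of class "a" and 20 of class "b".
labels = ["a"] * 50 + ["b"] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 56 14
```

Stratifying matters here because the classes are imbalanced; a purely random split could leave a small class under-represented in the validation set.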
In case, you are a beginner and need to understand the deep learning terms better, visit the blog here:
There are several ways of creating a CNN model, but for the first benchmark we will use the Keras deep learning library. We will also use the pretrained models available in Keras, trained on the ImageNet dataset, and fine-tune them for our task.
Training a Convolutional Neural Network from scratch is rarely practical. Instead, we take the weights of a CNN pretrained on ImageNet (1000 classes) and fine-tune it by keeping some layers frozen, unfreezing others, and training only the unfrozen ones. This works because the early layers (those closest to the input) learn simple, generic features, so we do not need to retrain them and they can be applied directly to our task. One important thing to check is how similar our dataset is to ImageNet and how big our dataset is. These two factors decide how we should perform the fine-tuning. To learn more, read the blog from Andrej Karpathy:
Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition (cs231n.github.io)
In our case, the dataset is small but somewhat similar to ImageNet. So we can first use the ImageNet weights directly, adding only a final output layer with 12 classes, to get the first benchmarks. Then we will move on to unfreezing some of the last layers and training just those layers.
We will use Keras for initial benchmarks as Keras provides a number of pretrained models and we will use the ResNet50 and InceptionResNetV2 for our task. It is important to benchmark the dataset with one simple model and one very high end model to understand if we are overfitting/underfitting the dataset on the given model.
Also, we can check the performance of these models on ImageNet dataset and the number of parameters of each model here to choose our benchmarking model.
For the first benchmark, I removed the last output layer and added a new final output layer with 12 classes. The model summary, including the number of parameters, was also printed; the following is a screenshot of the final few layers.
The model was run for 10 epochs, and the results saturated after 6 epochs. The training accuracy reached 88 percent and the validation accuracy 87 percent.
To further improve the performance, some of the last layers (those closest to the output) were unfrozen and trained with an exponentially decaying learning rate. This led to a further improvement of 2 percent.
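A minimal sketch of such a benchmark model is shown below, assuming the `tensorflow.keras` API and ResNet50 as the backbone. The input size, optimizer settings, and the helper names `set_trainable_tail` and `build_benchmark_model` are assumptions made for illustration, not the exact configuration used in the competition.

```python
NUM_CLASSES = 12  # the dataset has 12 plant species


def set_trainable_tail(layers, n_trainable):
    """Freeze every layer except the last `n_trainable` (those nearest the output)."""
    cutoff = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff


def build_benchmark_model(input_shape=(224, 224, 3), n_trainable=0):
    """ImageNet-pretrained ResNet50 with a new 12-way softmax output layer."""
    # Imported lazily so this module can be loaded without TensorFlow installed.
    from tensorflow import keras

    # include_top=False drops the original 1000-class output layer;
    # pooling="avg" leaves a flat feature vector for the new output layer.
    base = keras.applications.ResNet50(
        weights="imagenet", include_top=False, pooling="avg", input_shape=input_shape
    )
    set_trainable_tail(base.layers, n_trainable)  # 0 keeps the whole backbone frozen

    outputs = keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # assumed value
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_benchmark_model().summary()` prints the layers and parameter counts; passing a positive `n_trainable` leaves that many of the last backbone layers trainable for a later fine-tuning pass.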
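An exponentially decaying learning rate can be written as a one-line function of the training step. The sketch below uses assumed values for the initial rate, decay rate, and decay interval, since the exact settings are not given here.

```python
def exponential_decay(step, initial_lr=1e-3, decay_rate=0.9, decay_steps=100):
    """Learning rate after `step` updates: multiplied by `decay_rate` every `decay_steps`."""
    return initial_lr * decay_rate ** (step / decay_steps)


# Print the learning rate at a few training steps.
for step in (0, 100, 500, 1000):
    print(step, exponential_decay(step))
```

A smaller learning rate is a common choice when unfreezing pretrained layers, so that the pretrained weights are adjusted gently rather than overwritten.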
Also, the following hyperparameters were used in the process:
Once we have the basic benchmarks ready, it is time to improve over it. We can start with augmenting more data to increase the number of the images in the dataset.
No data, no machine learning!
But first, the dataset is not balanced, and it needs to be balanced so that an even number of images from every class is used in each training batch.
Real-life datasets are rarely balanced, and a model usually performs worse on minority classes. In addition, the cost of misclassifying a minority class example as a normal example is often much higher than the cost of a normal class error.
So, we tried two approaches to balance the data:
1. Adaptive synthetic sampling approach for imbalanced learning (ADASYN): ADASYN generates synthetic data for classes with fewer samples, in such a way that more synthetic samples are generated for examples that are harder to learn than for those that are easier to learn.
The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples.
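2. Synthetic Minority Over-sampling Technique (SMOTE): SMOTE over-samples the minority class by creating synthetic examples along the line segments that join a minority sample to its nearest minority-class neighbours. It is typically combined with under-sampling of the majority class to get the best results.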
A combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class.
For this use case, SMOTE results proved better and hence SMOTE was preferred over ADASYN. Once the dataset is balanced, we can proceed with data augmentation.
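The core of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code used in the competition: it assumes each image has already been turned into a flat feature vector, and the function name and parameters are hypothetical.

```python
import numpy as np


def smote_sample(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])


# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_sample(X_min, n_new=3, k=2))
```

In practice, a maintained implementation such as the `SMOTE` and `ADASYN` classes in the imbalanced-learn package is usually preferable to hand-written code.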
There are several ways in which the data augmentation can be performed. Some of the most important ones are:
- Adding Noise
- Changing lighting conditions
- Advanced techniques like GAN
There are some very good blogs out there that explain all these techniques, so I will not explain them in detail. All the data augmentation techniques mentioned above were used except GANs.
Now, to further improve the results, we experimented with the learning rate schedule, including cyclical learning rates and learning rates with warm restarts. But before doing that, we need to find the best possible learning rate for the model. This is done by plotting the loss against the learning rate and checking where the loss starts decreasing.
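Two of the simpler augmentations from the list, adding noise and changing lighting, can be sketched with NumPy as follows. The sketch assumes images are float arrays scaled to the range [0, 1]; the function names and default values are illustrative.

```python
import numpy as np


def add_gaussian_noise(image, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise and clip back to the valid [0, 1] range."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def change_brightness(image, factor):
    """Scale pixel intensities: factor > 1 brightens, factor < 1 darkens."""
    return np.clip(image * factor, 0.0, 1.0)


# A dummy 4x4 RGB image with every pixel at mid-grey.
image = np.full((4, 4, 3), 0.5)
noisy = add_gaussian_noise(image, rng=np.random.default_rng(0))
brighter = change_brightness(image, 1.2)
print(noisy.shape, brighter.max())
```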
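The learning rate search can be sketched as follows: train for a short run while increasing the learning rate on a log scale, record the loss at each step, and look for the region where the loss falls fastest. The code below is a simplified illustration. The "steepest drop" heuristic is one common way to read such a plot and is an assumption here, as are the function names and the toy loss values.

```python
def lr_range(lr_min=1e-5, lr_max=1.0, n_steps=100):
    """Learning rates spaced evenly on a log scale, one per mini-batch."""
    ratio = (lr_max / lr_min) ** (1 / (n_steps - 1))
    return [lr_min * ratio ** i for i in range(n_steps)]


def steepest_descent_lr(lrs, losses):
    """Return the learning rate after which the loss dropped the most."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    best = min(range(len(drops)), key=drops.__getitem__)
    return lrs[best]


# Toy losses recorded while sweeping the learning rate (illustrative values).
lrs = lr_range(1e-4, 1.0, 5)
losses = [2.0, 1.9, 1.2, 1.1, 5.0]
print(steepest_descent_lr(lrs, losses))
```

In a real run, the losses come from training on successive mini-batches, and the curve is usually smoothed before it is inspected.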
This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations.
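The simplest cyclical policy, the triangular schedule, moves the learning rate linearly from a lower bound to an upper bound and back again. A minimal sketch follows; the boundary values and step size are assumed for illustration.

```python
import math


def triangular_clr(iteration, base_lr=1e-3, max_lr=1e-1, step_size=100):
    """Triangular cyclical learning rate; one full cycle spans 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)  # 1 at the cycle ends, 0 at the peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Start of a cycle, halfway up, the peak, and the start of the next cycle.
for it in (0, 50, 100, 200):
    print(it, triangular_clr(it))
```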
So in our case, 1e-1 looked like a good learning rate. But as we get closer to a minimum, we want to take shorter steps. One way to do so is learning rate annealing, but I used a learning rate with warm restarts, inspired by the paper. Also, the optimizer was changed from Adam to SGD, and SGDR (stochastic gradient descent with warm restarts) was implemented.
Another thing that can be done is to train several architectures using the above techniques and then merge their results. This is known as model ensembling, and it is a widely popular technique, but it is computationally very expensive.
So, I decided to use a technique called snapshot ensembling, which achieves the goal of ensembling by training a single neural network, letting it converge to several local minima along its optimization path, and saving the model parameters at each of them.
Once the learning rate methods were fixed, I experimented with the image size. I trained a model with an image size of 64*64 (fine-tuned from the ImageNet weights), unfroze some layers, and applied the cyclical learning rate and snapshot ensembling. I then took the weights of that model, changed the image size to 299*299, fine-tuned again starting from the 64*64 weights, and once more applied snapshot ensembling and the learning rate with warm restarts.
Whenever we change the image size, we need to rerun the learning rate vs. loss plot to find the best learning rate again.
The last step is to visualize the results in order to check which classes have the best and worst performance, so that the necessary steps can be taken to improve the results.
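With warm restarts, the learning rate follows a cosine curve from its maximum down to its minimum within each cycle and then jumps back to the maximum. A minimal sketch is given below; the maximum of 1e-1 matches the value found above, while the cycle length and multiplier are assumptions.

```python
import math


def sgdr_lr(epoch, eta_max=1e-1, eta_min=0.0, cycle_len=10, t_mult=1):
    """Cosine-annealed learning rate with warm restarts.

    `cycle_len` is the length of the first cycle in epochs, and each later
    cycle is `t_mult` times longer than the previous one.
    """
    t_cur, t_i = epoch, cycle_len
    while t_cur >= t_i:  # skip over the cycles that are already finished
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# The rate decays within a cycle and jumps back to eta_max at epoch 10.
for epoch in (0, 5, 9, 10, 15):
    print(epoch, round(sgdr_lr(epoch), 4))
```

`epoch` may be fractional, so the same function can be called once per mini-batch for a smoother schedule.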
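At prediction time, the saved snapshots are typically combined by averaging their class probabilities. The sketch below assumes each snapshot has already produced a softmax output of shape (samples, classes); the function name and toy numbers are illustrative.

```python
import numpy as np


def ensemble_predict(snapshot_probs):
    """Average the class probabilities of several snapshots and pick the top class."""
    mean_probs = np.mean(np.stack(snapshot_probs), axis=0)
    return mean_probs, mean_probs.argmax(axis=1)


# Two snapshots, two samples, three classes (toy probabilities).
snap_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
snap_b = np.array([[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
mean_probs, predicted = ensemble_predict([snap_a, snap_b])
print(predicted)  # [1 2]
```

In this toy example the second sample's prediction changes from class 1 (snapshot A alone) to class 2 once snapshot B is taken into account, which is the kind of correction an ensemble is meant to provide.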
One very good way to understand the results is to construct a confusion matrix.
In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
From the confusion matrix, we can see all the classes for which the model's predicted label differs from the true label, and we can take steps to improve them. For example, we can do more data augmentation for a poorly performing class to help the model learn it.
Finally, the validation set is merged into the training data, the model is trained one last time with the chosen hyperparameters, and the test dataset is evaluated on it before the final submission.
NOTE: The augmentation used in training needs to be present in the testing dataset to achieve the best possible results.
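A confusion matrix and the per-class recall derived from it can be computed as in the sketch below. It assumes integer class labels and uses the convention that rows are true classes and columns are predicted classes; the function names and toy labels are illustrative.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    totals = np.maximum(matrix.sum(axis=1), 1)  # avoid division by zero for empty classes
    return matrix.diagonal() / totals


# Toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
print(per_class_recall(cm))
```

Classes with low recall, or large off-diagonal entries, are the natural candidates for additional augmentation.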