Lessons learned from a Deep Learning Hackathon

Sayak Paul
Intel Software Innovators
10 min read · Jun 25, 2019


Hackathons are a great way to learn new skills and build your personal brand. They push you to put your creative hat on and apply your theoretical and practical knowledge in innovative ways. The problems given at Hackathons are rarely straightforward and require a good understanding of the subject. As a matter of fact, Hackathons demand your full attention and dedication, which makes it difficult for a working professional to manage time and switch between contexts.

I participated in Analytics Vidhya’s Game of Deep Learning Hackathon. The participants were provided with a dataset consisting of images of different ships along with their categories. The task was to categorize new ships with respect to the given data. Seems like a simple image classification problem, doesn’t it? Well, it actually isn’t.

On the last day of the Hackathon, I took some time (collectively not more than 10 hours) out of my schedule (which has been pretty busy) and attempted the problem. In this article, I am going to share my experience of this attempt: what steps I followed, what lessons I learned, what points of improvement I noted, and so on.

To make the most of this article, I assume readers have an intermediate familiarity with deep learning.

Lesson 1: Inspect the given data in every way possible

You may already know about the importance of data inspection, but I want to share my experience specifically for this Hackathon. Let me first give you a primer on the dataset:

Source: Hackathon homepage

There were 6252 images in the training data and 2680 images in the test data. The categories of ships and their corresponding codes in the dataset were provided as follows:

{'Cargo': 1, 
'Military': 2,
'Carrier': 3,
'Cruise': 4,
'Tankers': 5}

Images were 140x210 (with the depth dimension omitted). We were given two separate .csv files: one contained the filenames of the given images with their labels (encoded as shown above) and the other contained just the filenames of the images belonging to the test set. However, all of the images were given in one single folder.

A snapshot of the .csv files

Now, a few images with their respective labels —

A few images from the training set

Observations:

  • Consider the fourth and the sixth images above: there are two ships in one image. Moreover, in the fourth image, the two ships belong to different categories. Images like these are very different from most of the images in the data and can indeed create problems for a model.
  • As you can see in the above montage, a bit of data augmentation will definitely help the model generalize. For example, if the third image were zoomed in a bit, the ship would be easier to recognize, wouldn’t it?

I will come back to the augmentation part again in a moment.

The images were pretty good quality-wise, at least better than in many of the Computer Vision Hackathons I have taken part in so far. However, the distribution of the labels was a bit skewed:

Distribution of the labels in the training set

I decided to leave this skew as it was and moved on. Generally, I would use methods such as SMOTE to handle the skew, or even collect some data manually to balance the distribution. But for a humble baseline, I skipped this part. Aware of the possible consequences, I quickly preprocessed the dataset, prepared the batches and split the dataset in an 85:15 ratio (new_training_set:validation_set). Here are the label distributions of the newly created training and validation sets:

Label distribution in the newly created training and validation sets
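An 85:15 split of this kind, together with a quick check of the label distribution, can be sketched in plain Python. This is only an illustration and not the code used in the Hackathon: stratifying by label, the `stratified_split` and `label_distribution` helpers, the seed, and the toy rows are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

# Label codes as given in the Hackathon's dataset description.
CODE_TO_NAME = {1: "Cargo", 2: "Military", 3: "Carrier", 4: "Cruise", 5: "Tankers"}


def label_distribution(rows):
    """Return the fraction of samples per label code."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {code: counts[code] / total for code in sorted(counts)}


def stratified_split(rows, valid_fraction=0.15, seed=42):
    """Split (filename, label) rows so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, valid = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_valid = round(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Hypothetical rows standing in for the training .csv (not the real data).
rows = [(f"img_{i}.jpg", 1 + i % 5) for i in range(200)]
train_rows, valid_rows = stratified_split(rows)
print(len(train_rows), len(valid_rows))
for code, share in label_distribution(valid_rows).items():
    print(CODE_TO_NAME[code], round(share, 2))
```

Splitting per label keeps a skewed distribution equally skewed in both subsets, so the validation score is measured on the same class mix the model was trained on.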

While defining the transforms for data augmentation, I was careful —

  • I incorporated transformations for random warping, zooming, and a little bit of meaningful flipping.
  • I kept the degree of rotation really low because you would not want to heavily rotate a landscape image of a ship to decide its category.

After incorporating data augmentation, a small subset of the dataset looked like this —

A small subset of augmented images

Note that the above images also include the original images.

By this point, I was ready to use my favorite image classifier i.e. ResNet50 pre-trained on the ImageNet dataset. I trained it for five epochs and this is how the initial network performed —

Results from the baseline model

And here are the loss curves from the training:

Training and validation losses of the baseline model

This was somewhat satisfactory to me since the network was not overfitting and the scores were pretty reasonable to start with. I unfroze the layers of the network and trained it for another two epochs and the performance was still improving —

Performance after unfreezing the network

No sign of overfitting so far. I was now ready to make a submission (maximum five submissions/day). I ended up getting a 0.91 score, which placed me among the first 350 participants out of a total of 2083. The evaluation criterion was the weighted F1-score. This model acted as my baseline, and I serialized its weights as well so that I could use it from time to time.
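For readers who have not met the metric before, here is a minimal sketch of a weighted F1-score, assuming the usual definition: the F1-score is computed per class and then averaged with weights equal to each class's share of the true labels. The `weighted_f1` helper and the toy labels are illustrative only and are not the organizers' scoring code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score


# Toy labels using the dataset's codes (1 = Cargo, 2 = Military, 5 = Tankers).
y_true = [1, 1, 1, 5, 5, 2]
y_pred = [1, 1, 5, 5, 5, 2]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because of the weighting, mistakes on the frequent classes cost more than mistakes on the rare ones, which matters for a skewed dataset like this one.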

I will come back to the thorough inspection of the data later in the article, but before that let me proceed to the next section to share where the model was getting confused and what I did about it.

Lesson 2: Model’s confusions and assisting the model to get past them

After making the first submission, the natural question that comes to mind in a Hackathon is: how can you improve the results? To figure this out in a meaningful way, I like to inspect the model to understand why it is producing the current results and where it is failing to predict correctly. This feels more humane and works pretty well for me. As a first step, I plotted the confusion matrix of the model:

Confusion matrix of the baseline model
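The counts behind a confusion matrix like the one above can be tallied in a few lines. The sketch below is an assumption-laden illustration (the `confusion_counts` and `most_confused` helpers and the toy predictions are hypothetical), not the plotting code from the experiments.

```python
from collections import Counter

CODE_TO_NAME = {1: "Cargo", 2: "Military", 3: "Carrier", 4: "Cruise", 5: "Tankers"}


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def most_confused(y_true, y_pred, top=3):
    """Return the most frequent off-diagonal (true, predicted, count) entries."""
    errors = {pair: n for pair, n in confusion_counts(y_true, y_pred).items()
              if pair[0] != pair[1]}
    # Sort by count (descending), then by label pair for a stable order.
    ranked = sorted(errors.items(), key=lambda item: (-item[1], item[0]))
    return [(t, p, n) for (t, p), n in ranked[:top]]


# Toy predictions for illustration only.
y_true = [1, 1, 1, 5, 5, 5, 4, 2]
y_pred = [1, 5, 5, 5, 1, 5, 4, 2]
for t, p, n in most_confused(y_true, y_pred):
    print(f"{CODE_TO_NAME[t]} predicted as {CODE_TO_NAME[p]}: {n}")
```

Ranking only the off-diagonal cells gives a short list of class pairs to inspect by eye, which is usually quicker than reading the whole matrix.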

The model was most confused when identifying images belonging to categories 1 (Cargo) and 5 (Tankers). This brought me to the next plot, the top losses incurred by the model:

Some images that caused the top losses during the training
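The idea behind a top-losses view can be sketched as follows: compute a per-image loss and sort the images from worst to best. Cross-entropy as the loss, the `top_losses` helper, and the sample triples are assumptions for illustration and are not taken from the original experiments.

```python
import math


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[true_index], 1e-12))


def top_losses(samples, k=3):
    """Return the k samples with the highest loss, worst first."""
    scored = [(cross_entropy(probs, true_index), name)
              for name, probs, true_index in samples]
    return sorted(scored, reverse=True)[:k]


# Hypothetical (filename, predicted probabilities, true class index) triples.
samples = [
    ("a.jpg", [0.90, 0.05, 0.05], 0),
    ("b.jpg", [0.10, 0.80, 0.10], 0),
    ("c.jpg", [0.30, 0.30, 0.40], 2),
    ("d.jpg", [0.02, 0.97, 0.01], 2),
]
for loss, name in top_losses(samples, k=2):
    print(name, round(loss, 2))
```

Images at the top of such a list are either genuinely hard, mislabeled, or simply odd, which is why they are a good place to start a manual review.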

Observations:

  • It is better to discard the fourth and fifth images (and images similar to them) from the training set because they contain too little information for a model to conclude anything.
  • There were some black and white images (with no third dimension) in the dataset, which I decided to remove from further training loops. Mixing B/W images with the colored ones did not seem like a good idea, particularly when those images were causing relatively large losses.
  • There were a few duplicate images as well in the dataset, and there is no point in making the network see the exact same image multiple times.
  • Another point to look out for is label noise. Are the images labeled correctly? I am no expert in identifying different classes of ships, nor could I contact any domain experts for this task. So, I used my Google (Image) search skills to study the different classes of ships. From this study, I could infer that many images in the dataset were not labeled correctly (a situation you would typically expect in a real-life project setting). This is where you get the opportunity to combine domain knowledge and the art of human labeling.

I decided to do another reality check, this time on the data:

The fast.ai widget enabling you to format your image dataset

Above is a snapshot of the interface that fast.ai (the deep learning library I used for the experiments) provides you to —

  • Relabel the images of the dataset wherein you can specify the kind of dataset you want to relabel — training or validation.
  • Delete an image if found not suitable for further training loops

There were many images in which the ships were actually cruise ships but were labeled as cargo. This issue was consistent across all the labels. I ended up relabeling many images in the validation set and also deleted a few.

I also discarded the duplicate images —

The fast.ai widget enabling you to remove the duplicate images from the dataset
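As a rough, library-free complement to a visual tool, exact duplicates can also be found by hashing file contents. The sketch below is an assumption for illustration: the `find_exact_duplicates` helper is hypothetical, and it only catches byte-identical files, so resized or re-encoded copies would still need a similarity-based check.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def find_exact_duplicates(folder):
    """Group files in `folder` whose bytes are identical."""
    by_digest = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_digest[digest].append(name)
    return [names for names in by_digest.values() if len(names) > 1]


# Demo on a throwaway folder with fake "images" (plain bytes, not real photos).
with tempfile.TemporaryDirectory() as folder:
    for name, payload in [("1.jpg", b"ship-a"), ("2.jpg", b"ship-b"), ("3.jpg", b"ship-a")]:
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(payload)
    duplicate_groups = find_exact_duplicates(folder)

print(duplicate_groups)
```

Keeping one file from each group and dropping the rest is enough to stop the network from seeing the exact same image multiple times.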

After relabeling the images, removing wrongly labeled images and duplicate images, I constructed separate train and validation sets (in the same 85:15 ratio). I then repeated the entire modeling process as discussed above. The scores were much better with this new dataset. Here’s the performance of the network after 2 epochs of training —

Performance of the 2nd model after 2 epochs

Note that the above performance was obtained after unfreezing the layers of the network. Before that, the network was trained for 5 epochs with its lower layers frozen. Now was the perfect time for making another submission. As I had expected, I jumped up the leaderboard and got a pretty good performance boost from 0.90 to 0.925.

I had limited time, so I allowed myself to skip some steps from the standard workflow that I follow during Hackathons like this. Normally, after making the second submission and moving up the leaderboard, I verify whether the train and test sets come from the same distribution by projecting them onto a 2D space. I also try a whole lot of different hyperparameter combinations to look for any improvement in the model, primarily different numbers of epochs and batch sizes. Whether I tune hyperparameters like learning rate, momentum, weight decay and so on depends on the problem statement. For this Hackathon, I had to skip these steps due to the shortage of time.

Lesson 3: The power of human involvement

By now, you should have a sense of the power that lies in human involvement in deep learning. It takes quite a while to understand this involvement properly, but once you have run a decent number of experiments, you will be able to convince yourself of it.

I was now onto the final set of steps as below —

  • Combine the new training and validation sets, retrain the model and make a submission (third submission)
  • Try a deeper model (ResNet101) on the same dataset and make another submission (the fourth one)
  • Inspect the images in the training set (relabeling, noise removal, duplicate removal), retrain the deeper model and get the predictions
  • Combine the predictions from two models and make the final submission

I will briefly discuss the results I got at each of the above steps.

The first step did not yield any significant improvement. This may be because the separate validation set was not that informative. But ResNet101 did not fail me: another bump in the score, this time from 0.925 to 0.932, and a move up the leaderboard.

After this, I spent some more time with the actual training set on relabeling, noise removal and duplicate removal. I retrained the ResNet101 model on this new dataset but this time with no splits. I had two variants of this model — one with frozen layers and another with a bit of fine-tuning. After I was done training both these variants, I used them to get the predictions.

Model stacking has worked pretty well for me. It’s like having a little assembly of consultants where you listen to each one of them and adopt a voting scheme to reach a decision. In this final step, I combined the predictions I got from the first step mentioned at the beginning of this section with the ones I got in the pre-final step. The time had come for me to hit the submit button on the Hackathon’s submission page one last time.
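A voting scheme of the kind described above can be sketched as a simple majority vote over the labels predicted by each model. This is an assumption about one reasonable way to combine predictions, not a record of the exact scheme used for the final submission; the `majority_vote` helper, the tie-breaking rule, and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-image label predictions; ties go to the earliest model."""
    combined = []
    for votes in zip(*model_predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # Walk the votes in model order so ties resolve deterministically.
        combined.append(next(v for v in votes if counts[v] == best))
    return combined


# Hypothetical predictions from three models for four test images.
model_a = [1, 5, 4, 2]
model_b = [1, 1, 4, 3]
model_c = [5, 1, 4, 2]
print(majority_vote(model_a, model_b, model_c))
```

Listing the strongest model first means it decides every tie, which is a sensible default when only two models are being combined.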

I was elated to see that it actually worked —

My position in the competition’s leaderboard

As you can see, I finished in the 143rd position out of a total of 2083 participants. I know I was not up there in the first one hundred, but I was still happy considering the amount of time I devoted to this Hackathon.

What’s next?

If I were to take things forward from here, I would definitely consider the following points:

  • Train the networks for a longer period of time and use model checkpointing
  • Try different hyperparameter settings for the network I used
  • Spend a lot more time in the human labeling process
  • Visualize the features of the images on a 2D plane
  • Try ResNet152 (optional as this will most likely give me a CUDAError)

I hope this article gave you some pointers with which you can confidently take part in Computer Vision Hackathons that deal with image classification problems. You should be able to add your own elements to these points and apply them in your own projects. I intentionally did not mention techniques like leaderboard probing, as I feel that is cheating and you would not use it for production purposes anyway.

A note on the technologies and hardware I used:

I used the fast.ai library for all the things I discussed in the article. Along with it, I used Intel’s distribution of Python to take advantage of the speed it offers natively.

For hardware (pretty low-cost at $0.38 per hour), I used a virtual instance (of type n1-highmem-8) on GCP which consists of:

  • A Tesla P4 GPU
  • 8 vCPUs (Intel Xeon Scalable Processor (Skylake)) with 52 GB of memory

This is where you can find the codebase.
