A community of doers: learnings from winning a solo silver medal within 2 weeks in my 2nd Kaggle competition
I provide context on how I got started in the competition, my struggles, the journey to my final solution and everything I learned along the way
I recently took part in the SIIM-ISIC Melanoma Classification challenge hosted on Kaggle during the last 2 weeks of the competition and was able to secure my first silver 🏅 on the platform. This was only my 2nd Kaggle competition, and the first one in 4 years.
The point of mentioning this is merely to indicate what is possible even within such a short time frame. It was a tremendous learning opportunity for me, and I regretted not being a part of it earlier. The goal of this post is to walk through my journey of joining the competition in its final phase — my initial struggles, starting off step by step and arriving at my final submission — and to share everything I learned in the process, specifically about the Kaggle community. This is how the blog post is structured:
- Participating in the competition
- Arriving at the final solution
- Some of the things I avoided
I hope this blog post can convince you that, irrespective of what you might think of yourself, you are always welcome in the Kaggle community, and the best way to start learning is to just dive right in and do something.
Participating in the competition
The main reason I hadn’t seriously participated in a Kaggle competition before was that I wasn’t sure what I could learn from it that I wouldn’t learn from my everyday work as an AI researcher, so most of my time went entirely into work. Over time, however, I realized that I had not been able to stay up to date with the latest practical tips for improving model performance, since the majority of my time at work is dedicated to data, compounded by the data sizes being relatively small. I needed a playground to test out my wild ideas and learn which techniques worked in practice. This was a major reason for me to explore Kaggle, and the SIIM-ISIC Melanoma Classification challenge felt like the perfect starting point: binary image classification is simple enough to get a feel for the platform, and having worked on computer vision for many years, I also had a realistic chance of performing well.
Initial struggles 😔
Contrary to what I had expected, I felt extremely overwhelmed by the volume of discussion and knowledge sharing (in the form of data and code) that was taking place, and by the current best leaderboard scores. I wanted to keep track of everything and understand it all in one day so that I could be at the same place as everyone else. I ended up putting a lot of pressure on myself, and this made me lose my initial motivation.
Slowly making progress 🤞
Thankfully, I quickly realized that it wasn’t possible for me to get a hold of everything and that there would be things I was missing out on. I made peace with the fact that I wouldn’t know everything there is to know. I started off slowly by creating the right cross-validation splits for validating model performance. Creating the right split was crucial for this competition and, as I’ll mention later, the cross-validation score was very important in determining the final outcome. Then, I created a baseline model using ResNet-18 with a basic input-processing pipeline. This got me an AUC of 0.84 on the public leaderboard (LB) and allowed me to complete the pipeline of submitting a result.
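To make the idea of a leakage-free split concrete, here is a minimal sketch using scikit-learn's GroupKFold. The data is invented, and the patient-level grouping is an assumption based on the melanoma dataset containing multiple images per patient:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical stand-in data: one row per image, a binary target and a
# patient id. Grouping by patient keeps all images of one patient in the
# same fold, which prevents leakage across folds.
rng = np.random.default_rng(0)
n = 1000
patient_id = rng.integers(0, 200, size=n)  # ~5 images per patient
target = rng.binomial(1, 0.02, size=n)     # heavy imbalance, as in melanoma data

folds = np.empty(n, dtype=int)
for fold, (_, val_idx) in enumerate(
    GroupKFold(n_splits=5).split(np.zeros(n), target, groups=patient_id)
):
    folds[val_idx] = fold

# No patient should ever appear in more than one fold.
for pid in np.unique(patient_id):
    assert len(np.unique(folds[patient_id == pid])) == 1
```

The key design choice is splitting on patients rather than images; a plain random split would put near-identical images of the same patient in both train and validation sets and inflate the CV score.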
I usually breathe a sigh of relief once the end-to-end pipeline of anything I’m working on is complete. It gives me the flexibility to tune individual components, knowing there are no unknown unknowns down the road that would need my attention. So I was happy after doing this, and from there I kept adding new components one at a time. Below, I’ll describe what finally landed me in the top 5% out of 3000+ competitors.
Before describing the final solution, I want to share three philosophies that helped me iterate faster:
- The first is a piece of advice from Jeremy Howard of fast.ai: in practice, you often don’t need all of the data, or even the full input, to reach decent performance. This could mean using a fraction of the entire dataset, or training on smaller-resolution images (say, 224 x 224) instead of the full images, which could be 1024 x 1024 and take a lot of time to load and process.
- Another philosophy is widely known to truly matter for improving model performance, but I’ll still state it explicitly: feed the data correctly, more data helps, use the right data augmentations, find the right optimization recipe and identify the right model class. Just focusing on these things should land you a pretty decent performance; after that, it’s all about finding the right mix. Once I found the right recipe on smaller-resolution images, I reused the same recipe at a larger image resolution, which is expected to improve performance further.
- Ensembling almost always helps. There is a theoretical justification for why ensembling improves over the current set of base models; this blog post is an excellent reference. Simple techniques like combining diverse models (models with low correlation in their predictions) or combining the same model trained on different input sizes can provide a significant lift.
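As a toy illustration of the diversity point above, here is a small NumPy sketch (all numbers invented) that checks prediction correlation before blending:

```python
import numpy as np

# Invented predictions from three models on 8 test instances.
rng = np.random.default_rng(42)
preds_a = rng.random(8)
preds_b = np.clip(preds_a + rng.normal(0, 0.05, 8), 0, 1)  # near-duplicate of A
preds_c = rng.random(8)                                     # a diverse model

# Pairwise correlation shows which model actually adds diversity.
corr_ab = np.corrcoef(preds_a, preds_b)[0, 1]
corr_ac = np.corrcoef(preds_a, preds_c)[0, 1]

# A simple mean blend of the two less correlated models.
blend = (preds_a + preds_c) / 2
```

Averaging two near-duplicate models buys very little; averaging models that disagree is where the lift comes from.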
Arriving at the final solution 🗻
Once I had a solid base, I played around with the different aspects of the pipeline that I listed towards the end of the last section: model, optimization, image sizes, using external data (data that is not part of the competition), data augmentation and handling class imbalance.
My best model was an ensemble of 4 models (taking the mean of the individual models’ predictions): 3 models were trained on 512x512 images and 1 model on 384x384.
2 models use the following augmentations:
- name: Rescale
- name: RandomAffine
- name: RandomHorizontalFlip
- name: ColorJitter
- name: Normalize
- name: RandomErasing
To iterate faster and handle class imbalance, instead of upsampling the minority class, I downsampled the majority class each epoch. To avoid wasting data, I sampled different instances from the majority class every epoch.
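A minimal NumPy sketch of this sampling scheme, with an invented 95/5 imbalance rather than my actual data loader:

```python
import numpy as np

labels = np.array([0] * 95 + [1] * 5)  # invented 95/5 class imbalance
pos_idx = np.flatnonzero(labels == 1)
neg_idx = np.flatnonzero(labels == 0)

def epoch_indices(epoch, ratio=1):
    """Downsample the majority class to `ratio` x the minority size,
    drawing a different majority subset every epoch so that, over many
    epochs, all of the majority-class data is eventually seen."""
    rng = np.random.default_rng(epoch)  # epoch-dependent seed
    neg_sample = rng.choice(neg_idx, size=ratio * len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_sample])
    rng.shuffle(idx)
    return idx

epoch0 = epoch_indices(0)
epoch1 = epoch_indices(1)
```

Each epoch is small and balanced, so iterations are fast, but because the majority subset rotates, no data is permanently thrown away.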
The same network architecture is used for all 4 models: EfficientNet-B5 features followed by a Linear layer. I tried a wide range of models, but EfficientNet-B5 was the best single model.
Optimization — SuperConvergence
One of the decisions I made for iterating faster was to restrict training to 20 epochs. To achieve this, I used the OneCycle learning rate scheduler introduced in this paper. The scheduler requires you to specify the minimum and maximum learning rates, which I found to be 5e-6 and 2e-4 respectively using the LR range test. To address overfitting, I also apply a weight decay of 0.1.
Adam usually obtains the best results in a very short time. However, this paper shows that weight decay is not applied correctly in Adam and proposes a corrected variant, AdamW, which is the optimizer I ended up using.
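Putting the optimizer and scheduler together, here is a minimal PyTorch sketch using the values from this post; the model and step counts are placeholders:

```python
import torch

# A toy model stands in for the real network; the learning-rate endpoints
# (5e-6 to 2e-4), weight decay (0.1) and 20-epoch budget come from the post,
# while the model and steps_per_epoch are placeholders.
model = torch.nn.Linear(10, 1)
steps_per_epoch, epochs = 100, 20
max_lr, min_lr = 2e-4, 5e-6

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    total_steps=steps_per_epoch * epochs,
    div_factor=max_lr / min_lr,  # initial lr = max_lr / div_factor = min_lr
)
# Inside the training loop, call optimizer.step() and then scheduler.step()
# once per batch.
```

OneCycleLR takes the maximum learning rate directly and derives the starting (minimum) rate from `div_factor`, which is why the ratio of the two LR-range-test values is passed in.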
Test Time Augmentation (TTA)
TTA was popularized by fast.ai as a way to improve model performance during inference. Contrary to the general wisdom of turning data augmentation off during inference, TTA keeps it on. Since there is randomness in the augmentations, relying on a single inference run can lead to wrong conclusions and might even give worse performance. Hence, in TTA, we run inference N_TTA times and, to obtain the prediction for each instance, combine the predictions across the N_TTA runs. One simple way of combining is to take the mean of the predictions. For this competition, I used N_TTA = 15. This is computationally very expensive, but it led to a clear performance improvement on this task.
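The mechanics of TTA can be sketched with a stand-in "model" in plain NumPy; everything here is invented except N_TTA = 15:

```python
import numpy as np

N_TTA = 15

def predict(image, rng):
    """Stand-in for running the model on a randomly augmented view of the
    image: here, just the mean pixel value plus augmentation noise."""
    return float(np.clip(image.mean() + rng.normal(0, 0.05), 0, 1))

rng = np.random.default_rng(0)
image = np.full((4, 4), 0.5)

single_run = predict(image, rng)                            # noisy
tta = np.mean([predict(image, rng) for _ in range(N_TTA)])  # more stable
```

Averaging over the runs shrinks the augmentation-induced noise, which is exactly why a single run can mislead while the TTA mean is reliable.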
Things I wanted to try but couldn’t
That’s it. These components helped me land in the top 5% of the competition in just 2 weeks. If they seem too simple to you, then you are right! :)
However, there were still many ideas that I wanted to try but couldn’t, given the shortage of time. I’m listing them here in case there is something to learn from them:
- Train better across image sizes: I would have liked to diligently track the best experiments at lower resolutions, re-run ALL of them at higher resolutions and then ensemble all of them. I didn’t stay consistent with this framework.
- Combine metadata with images: The competition also provides a lot of metadata for each image that could be used to improve performance. The winning team used the metadata by stacking it with the features extracted from the CNN before feeding it to the classifier.
- Spend time improving my ensembles: As I mention in the learnings section, ensembling was key and I, unfortunately, didn’t spend time learning about the best practices here and just stuck to averaging the predictions of my base models. More on this later.
- Trying more augmentations: A variety of augmentations were discussed throughout the competition, including augmentations specific to this type of data (like randomly adding hair to images, because a lot of the images contained hairs).
- Custom head: My network consisted of a single linear layer after the features extracted from the convolutional network. However, it is often better to add more than one linear layer.
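For illustration, a hypothetical two-layer head in PyTorch; the 2048 input dimension matches EfficientNet-B5's feature size, while the hidden size and dropout rate are invented:

```python
import torch
from torch import nn

# 2048 is the feature dimension of an EfficientNet-B5 backbone; the hidden
# size (512) and dropout rate (0.3) are invented for illustration.
feat_dim = 2048
head = nn.Sequential(
    nn.Linear(feat_dim, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.3),
    nn.Linear(512, 1),  # a single logit for the binary melanoma target
)
```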
Some of the things I avoided
There were many overly complicated ideas mentioned in the discussions, like generating synthetic images from the data and using them as additional training data. These might be good ideas, but I tend to prefer simple solutions and wanted to see how far they could take me. I still feel that I could have scored higher simply by using a better ensembling technique.
1 - the power of ensembling
Ensembling is the technique of combining the predictions of several independent models/learners. I won’t go into the technical details here, as there are many excellent articles that already do that (this one, for example). I noticed people reporting a cross-validation (CV) score much higher than what I was getting with a similar setup. It was only much later that I realized that their best single-model CV scores were still very close to mine. The winners used advanced ensembling techniques like stacking, ranking (this one is specific to the metric being optimized here) and model blending to dramatically improve their final performance. One of the top submissions was actually just a weighted average of 20 public submissions. Thus, one should focus on getting the ensembling recipe right.
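One of the simplest of these techniques is rank averaging, sketched below with toy numbers; it suits an AUC metric because only the ordering of predictions matters:

```python
import numpy as np

def rank_normalize(p):
    """Map predictions to (0, 1) by rank. For an AUC metric only the
    ordering matters, so this puts differently calibrated models on a
    common scale before blending."""
    ranks = p.argsort().argsort().astype(float)
    return (ranks + 1) / (len(p) + 1)

# Invented predictions from two models with very different scales but the
# same ordering information.
m1 = np.array([0.01, 0.90, 0.30, 0.70])
m2 = np.array([0.40, 0.48, 0.41, 0.46])

blend = 0.5 * rank_normalize(m1) + 0.5 * rank_normalize(m2)
# blend == [0.2, 0.8, 0.4, 0.6]
```

A plain mean of raw probabilities would let the wider-ranged model dominate; rank-normalizing first gives each model an equal vote.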
2 - trust your CV
One of the dramatic moments of the competition came when the private leaderboard was revealed. There was a massive shake-up across the entire leaderboard, with many top submissions dropping significantly and many teams jumping up more than 1000 spots (I myself climbed around 800). This left many people understandably disappointed, as they had overfitted to the public leaderboard. A lot of the solution overviews posted after the competition ended strongly emphasized focusing on the CV score: the public leaderboard can often lie, whereas they found a strong correlation between their CV scores and the private leaderboard ranking. So the mantra became: “In CV, we trust”. I found this out very late: contrary to the usual method of reporting cross-validation performance as the mean and standard deviation across all folds, there is a better way typically used in Kaggle competitions, illustrated very nicely in this notebook. For each fold, save your Out-Of-Fold (OOF) predictions (the predictions for the instances forming that fold’s validation set). At the end of 5 folds, you will have a prediction for every sample in the train set, and you compute the metric once on this full list of predictions to get your final CV score.
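The OOF procedure can be sketched as follows, with a stand-in "model" that just predicts label-plus-noise:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Invented labels and a stand-in "model" whose prediction is the label plus
# noise, purely to show the OOF mechanics.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=200)
X = y[:, None] + rng.normal(0, 1, size=(200, 1))

oof = np.zeros(len(y))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Train on train_idx (omitted here) and predict the held-out fold.
    oof[val_idx] = X[val_idx, 0]

# A single AUC over all OOF predictions -- not a mean of per-fold AUCs.
cv_score = roc_auc_score(y, oof)
```

Because the folds partition the training set, every sample gets exactly one held-out prediction, and one metric computed over all of them is the CV score to trust.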
3 - code sharing through kernels
Many people shared their submission notebooks publicly, which truly helped clarify several doubts that often linger even after people have tried their best (or not) to explain their methodology. It also helps you learn about minor implementation details that often get left out when talking about the big picture. Additionally, reading other people’s code is a great way of improving your own coding skills; I personally learned quite a bit as well. Finally, Kaggle Kernels offer both GPU and TPU within a limited quota. This is awesome, as it removes the need to depend on on-prem infrastructure or to spend a lot of money on cloud VMs.
4 - discussions and knowledge sharing 💡
I was truly surprised to see how willing people were to engage with each other, often sharing their code and data to provide a very good starting point for other people. People also shared their approaches while the competition was still ongoing while also discussing what worked for them and what didn’t. Considering the fact that this is a competition with a monetary prize associated with it, I was truly taken aback by how collaborative the nature of discussions was and how kind everyone was to any newcomer asking silly questions as well.
5 - community of doers 👨💻👩💻
Finally, I just want to say that it was amazing to have found a community of doers — people who focus on actually getting the job done, with whom I could have deep, meaningful technical discussions (and jokes). I have struggled to find the right online community for myself and although I am yet to find my footing there, I definitely know that I am here to stay! :)
I’ll keep adding more to this list.
- Collections of solutions from the competition
- Ten Techniques Learned From fast.ai
- Kornia: an Open Source Differentiable Computer Vision Library for PyTorch
- Intuition to LR Range Test, Cyclical LR and the One-Cycle Policy
- A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
- AdamW — Decoupled Weight Decay Regularization
I hope this blog post can serve as a good starting point for anyone looking to jump into Kaggle. I have tried to cover all the topics that were completely new to me and that I had to dig through the discussions to understand. In the near future, I plan to release my experimentation framework along with the code for this competition to foster reproducibility and help others build on top of my work — built with ❤️ using PyTorch.