How to win a Kaggle classification competition?

OutisCJH · Published in Analytics Vidhya · 6 min read · Jul 5, 2020

Disclaimer: This post serves as my learning journal.


Kaggle is one of the world's largest platforms for machine learning enthusiasts to compete and fight for glory. It is also a good place for machine learning beginners to learn and to prove their capabilities when applying for data science roles in industry.

This post contains six easy tips and tricks to help you improve your performance in your next Kaggle competition (classification task).

1. Use different optimisers

There are many different types of optimisers available in both TensorFlow and PyTorch. Among them, the Adam optimiser and its variant, NAdam, are often a good option to begin training with. Adam has long been touted as an efficient and fast optimiser with adaptive learning rates. NAdam is simply Adam with Nesterov momentum. Nesterov Accelerated Gradient (NAG) was proposed by Yurii Nesterov in 1983. Instead of measuring the gradient of the loss function at the current position, NAG measures it at a position slightly ahead in the direction of the momentum. This often results in faster and more accurate convergence. For more information about NAG, please check out this article: https://dominikschmidt.xyz/nesterov-momentum/
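
As a rough sketch of the idea (the notation here is mine, not from the linked article: weights θ, velocity v, momentum coefficient μ and learning rate lr), NAG evaluates the gradient at the look-ahead point θ + μ·v rather than at the current weights θ:

    v = μ·v − lr·∇L(θ + μ·v)
    θ = θ + v

Classical momentum uses ∇L(θ) instead, which is exactly what makes NAG "look ahead".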

However, Adam and NAdam are not magic optimisers that can guarantee the best performance. SGD+momentum or SGD+momentum+Nesterov typically performs better than Adam and NAdam in terms of validation accuracy. Thus, I would advise that you start with Adam/NAdam in the early stage of training and switch to SGD+momentum / SGD+momentum+Nesterov in the later part of training. Please check out this article for a more detailed comparison between different optimisers: https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/
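
Here is a minimal Keras sketch of this two-stage strategy. It assumes that model, train_ds and val_ds already exist, and the epoch counts and learning rates are placeholders rather than tuned values:

    import tensorflow as tf

    # Stage 1: start with NAdam for fast early progress.
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=20)

    # Stage 2: re-compile with SGD + Nesterov momentum for the final epochs.
    # Re-compiling resets the optimiser state but keeps the trained weights.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3,
                                                    momentum=0.9, nesterov=True),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=10)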

Figure 1: Validation accuracies of different optimisers on the Cats vs. Dogs classification task. Image adopted from: https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/

2. Use different augmentation techniques

It is very important to employ different augmentation techniques in your training, so that the model learns different variations of the input data and does not overfit. There are many fancy augmentation techniques available online; I will briefly mention a few of them that are often very useful for image classification tasks. Warning: using too many, or the wrong types of, augmentation techniques could harm your performance.

2.1 Random Cutout / Cutmix

Random cutout is a technique which randomly blocks out part of your input image. This easy-to-implement technique can effectively prevent your model from overfitting the input data. It is very useful if your input data contains a lot of noisy information such as price labels, product titles, etc.
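
A minimal NumPy sketch of random cutout (the patch size range is an arbitrary choice of mine, and the image is assumed to be larger than the patch):

    import numpy as np

    def random_cutout(image, max_size=60):
        # Zero out one randomly placed square patch of an HxWxC image.
        h, w = image.shape[:2]
        size = np.random.randint(10, max_size)
        y = np.random.randint(0, h - size + 1)
        x = np.random.randint(0, w - size + 1)
        out = image.copy()
        out[y:y + size, x:x + size, :] = 0  # block out the patch
        return out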

Cutmix is a technique which cuts out a randomly placed region from two randomly picked images (A and B) and swaps the patches to generate two mixed-up images. In the original CutMix formulation, the labels are also mixed in proportion to the area of the swapped patch.

Figure 2: Cutmix technique. Image adopted from: https://forums.fast.ai/uploads/default/original/3X/d/0/d0c959c1d585438e9095e1ba7ee3deb608edec9a.jpeg

2.2 Channel shift

Channel shift is a technique which shifts the channel values by a random amount chosen from a range specified by the user. It is useful if the subjects in your classification task can appear in many different colours. For instance, dogs, cats, plastic cups and cars come in many different colours. It is not very useful for subjects with a universal colour, such as crows.
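
In Keras, channel shift is available directly through ImageDataGenerator. A small sketch (the shift range of 50 intensity units and the other augmentations are arbitrary examples, and the data directory is hypothetical):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Shift each channel by a random value in [-50, 50], plus a couple of
    # basic augmentations.
    datagen = ImageDataGenerator(
        channel_shift_range=50.0,
        rotation_range=15,
        horizontal_flip=True,
    )
    # train_gen = datagen.flow_from_directory("data/train", target_size=(224, 224))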

To keep the post short, I will not cover basic but useful augmentation techniques such as rotation, shear, zoom and many more. Feel free to check out this amazing post to learn more: https://towardsdatascience.com/exploring-image-data-augmentation-with-keras-and-tensorflow-a8162d89b844

These are some of the image augmentation resources that I use very often:

  1. https://github.com/albumentations-team/albumentations
  2. https://github.com/yu4u/cutout-random-erasing

3. Swish activation function

There are many different kinds of activation functions for different machine learning tasks. Generally, people go for the ReLU activation function or its variants such as leaky ReLU. However, when it comes to classification or machine translation tasks with very deep networks, the Swish activation could be a better choice than ReLU.

Swish activation function: y = x * sigmoid(x)

Figure 3: Swish activation function

On closer inspection, the Swish activation function does not have the abrupt change that ReLU has near zero. Instead, it has a very smooth, curvy transition. The Swish activation function was proposed by Ramachandran et al. in this paper. The authors claimed that, through extensive experiments, they managed to show that Swish consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation. Thus, in your next Kaggle competition, it might be a good idea to give Swish a try.
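
Swish is easy to try in Keras: you can define it manually as shown below, or use the built-in tf.keras.activations.swish in recent TensorFlow versions. The layer size here is just an example:

    import tensorflow as tf

    def swish(x):
        return x * tf.math.sigmoid(x)

    # Example: a hidden layer using Swish instead of ReLU.
    layer = tf.keras.layers.Dense(256, activation=swish)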

4. Monte Carlo Dropout

Dropout is a commonly used regularisation technique in deep learning. It randomly deactivates, or "drops out", a portion of the neurons during the training stage. This effectively reduces the co-dependency between neighbouring neurons and forces them to learn more effectively. Generally, dropout is only active during the training stage and is deactivated during the validation and testing stages.

Monte Carlo dropout is a technique that keeps the dropout layers active during the testing stage and ensembles the results of N predictions. This idea might sound strange: why would I want to remove part of the neurons during the validation and testing stages? Well, you can think of it this way: randomly removing some of the neurons in a model results in a different network structure. During the testing stage, N predictions are made for each sample. Because dropout is active, these N predictions are not made by the same model. In fact, the prediction for a sample is the aggregated result of N slightly different models.

The beauty of Monte Carlo dropout is that it can easily be applied to any trained model that contains dropout layers. This means that no re-training or modifications are required. Cool, right?
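
A minimal sketch of Monte Carlo dropout in Keras, assuming model is a trained model that contains dropout layers and x is a batch of test images; calling the model with training=True keeps dropout active at inference time:

    import numpy as np

    def mc_dropout_predict(model, x, n=30):
        # Average the softmax outputs of n stochastic forward passes.
        preds = np.stack([model(x, training=True).numpy() for _ in range(n)])
        return preds.mean(axis=0)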

5. Test Time Augmentation (TTA)

Similar to the data augmentation mentioned above, TTA performs random modifications on the test images. Each sample in the test dataset is augmented N times and fed to the model to yield N different predictions per sample. These N predictions (in this case softmax probabilities) are then averaged to yield the final guess. The concept is very similar to the idea of ensembling: combining several predictions to produce a strong one.
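
A minimal TTA sketch, assuming model is a trained classifier and image is a single test image; the augment helper here is a hypothetical example using two simple random transforms:

    import tensorflow as tf

    def augment(image):
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=0.1)
        return image

    def tta_predict(model, image, n=10):
        # Average the softmax outputs over n randomly augmented copies.
        batch = tf.stack([augment(image) for _ in range(n)])  # (n, H, W, C)
        probs = model.predict(batch, verbose=0)               # (n, num_classes)
        return probs.mean(axis=0)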

6. Focal Loss and Label Smoothing

Generally, the categorical cross entropy (CCE) loss function is used for multi-class classification problems. CCE works well when the dataset is balanced, meaning that every class has roughly the same number of samples. However, this is usually not true in real-world scenarios, where the collected dataset is usually imbalanced. An imbalanced dataset causes trouble during training because the classes with fewer samples are largely ignored, while more focus is placed on the classes with more samples.

Focal loss can overcome this problem by putting more weight on the samples that are hard to classify and decreasing the impact of easy, correctly classified ones. A modulating factor is added to the CCE function to achieve this; its value decreases as the confidence of the prediction increases, so easy examples contribute less to the loss. This is akin to the human eye shifting attention from the background to the foreground subject, hence the name focal loss.
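
A minimal sketch of a categorical focal loss in Keras, assuming softmax outputs y_pred and one-hot labels y_true (gamma = 2.0 is the commonly used default from the focal loss paper, not a value from this post):

    import tensorflow as tf

    def focal_loss(gamma=2.0):
        def loss_fn(y_true, y_pred):
            # Clip to avoid log(0).
            y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
            ce = -y_true * tf.math.log(y_pred)       # per-class cross entropy
            weight = tf.pow(1.0 - y_pred, gamma)     # modulating factor (1 - p_t)^gamma
            return tf.reduce_sum(weight * ce, axis=-1)
        return loss_fn

    # model.compile(optimizer="adam", loss=focal_loss(gamma=2.0))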

Label smoothing is useful when the model is overfitting and overconfident. It is applicable when a cross entropy loss function is used together with a softmax activation in the final layer. Instead of feeding a one-hot vector with a single "1" per sample and "0" for the rest, label smoothing smooths the vector by reducing the value of the "1" and giving the rest a small non-zero value. The formula below illustrates label smoothing:

y_s = (1 - α) * y_hot + α / K

α is a parameter that controls the degree of smoothing and K is the number of classes. The higher the value of α, the greater the smoothing and the less overconfident the model will be.

An overconfident model is not calibrated and its predicted probabilities are consistently higher than the accuracy. For example, it may predict 0.9 for inputs where the accuracy is only 0.6. Notice that models with small test errors can still be overconfident, and therefore can benefit from label smoothing. (Quoted from source)
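
Label smoothing is built into Keras' cross-entropy losses, so trying it is a one-line change; α = 0.1 here is just a common starting point:

    import tensorflow as tf

    loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
    # model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])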

That's the end of my first Medium post on deep learning. Machine learning is a vast and fast-progressing field. Feel free to let me know if there is any confusion or mistake in this post. If you like my content, please follow me and give me a clap.
