My First Kaggle Challenge: How I Ranked in the Top 1%

I recently entered my first Kaggle competition and got a fairly good result. I would like to share my approaches and what I learned from the experience.

The competition I chose was the Statoil/C-CORE Iceberg Classifier Challenge. In this competition, competitors classify ships and icebergs using radar images taken from a satellite. The incidence angle (inc_angle) is provided as additional information. More details are available on the competition page.

Here is a summary of my approaches and takeaways.

Successful Approaches Taken

  • Stacking, Blending
  • Bagging
  • Handmade feature extraction
  • Experimenting with a base model
  • Diversity
    - Multiple DNNs with different architectures
    - Different input scales
    - Training with and without augmentation
  • Pseudo labeling

Successful Approaches Not Taken

  • Effective leak utilization

Time I spent

  • Learning Stacking and Blending
  • Writing code for Stacking and Blending
  • Training many DNNs
  • Trying several classifiers at the 2nd stage of stacking

What I should have done more

  • Exploratory Data Analysis
  • Probabilistic approach
  • Some unsupervised techniques

The main difference between me and the top rankers above me was effective leak utilization. In other words, I should have looked at the data more carefully, but I missed it. Though this kind of technique sounds a bit tricky, strong Exploratory Data Analysis skills make it possible, and EDA is one of the most important skills for a competitor.

As this was my first challenge, I spent a relatively large amount of time writing code. I have open-sourced my code on GitHub for future use:

My starter kit for ML projects

Let’s take a look at more details.

Successful Approaches Taken

Before this competition, I was not very familiar with techniques such as Stacking and Blending. Luckily, early in the competition I received an email from Coursera inviting me to the course How to Win a Data Science Competition: Learn from Top Kagglers. It helped me a lot. The Ensembling lectures in Week 4, given by Marios, were especially helpful; I watched them many times.

Fig1. Final stacking architecture

Diversity for Stacking

The key to Stacking is diversity. My base model was a simple CNN with 4 layers. Using this CNN, I made 6 patterns of predictions:

  • CNN-4L trained with no augmentation, no aux Inputs.
  • CNN-4L trained with no augmentation, MinMax Scaled aux Inputs.
  • CNN-4L trained with no augmentation, Standard Scaled aux Inputs.
  • CNN-4L trained with augmentation, no aux Inputs.
  • CNN-4L trained with augmentation, MinMax Scaled aux Inputs.
  • CNN-4L trained with augmentation, Standard Scaled aux Inputs.

The aux inputs I used here come from the kernel below:

Ensembling GBMs (LB .203~

This kernel extracts statistics from the images. I fed them to the CNN as aux inputs with different kinds of scaling. I also extracted more statistics from the linearly scaled image (the original image is in decibels, so it is on a log scale).
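As a rough sketch of this kind of feature extraction (the function and the exact statistics are my own illustration, not the kernel's code), assuming each band is an array of decibel values:

```python
import numpy as np

def extract_stats(band_db):
    """Illustrative aux-input extraction: simple statistics from one
    radar band, computed in both decibel and linear scale."""
    band_db = np.asarray(band_db, dtype=np.float64)
    band_lin = 10.0 ** (band_db / 10.0)  # dB -> linear power
    stats = []
    for band in (band_db, band_lin):
        stats += [band.min(), band.max(), band.mean(),
                  band.std(), np.median(band)]
    return np.array(stats)  # 5 stats per scale -> 10 features
```

Features like these can then be concatenated and passed through MinMax or Standard scaling before being fed to the network as the aux input branch.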

After verifying these effects, I trained several more kinds of architectures:

  • CNN-4L
  • WRN (Wide Residual Net)
  • VGG16
  • Inception V3
  • MobileNets
  • VGG16 + MobileNets

This gave me 36 (= 6 × 6) patterns of models. Some of them did not converge well without data augmentation, so I kept 24 patterns in the end.

Base Model

I realized that it's important to settle on a base model early. CNN-4L was the base model in this case.

The base model must train quickly. With it, I could rapidly run several kinds of experiments, such as pseudo labeling, linearly scaled images, augmentation, and aux inputs.
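For illustration, a generic pseudo-labeling loop might look like the sketch below (this shows the general technique with an sklearn-style model exposing fit/predict_proba, not my exact competition code):

```python
import numpy as np

def pseudo_label(model, X_train, y_train, X_test, threshold=0.95):
    """Generic pseudo-labeling sketch: train, pick confidently
    predicted test samples, add them to the training set with their
    predicted labels, and retrain."""
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # keep only test samples the model is confident about
    confident = (proba > threshold) | (proba < 1 - threshold)
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate(
        [y_train, (proba[confident] > 0.5).astype(int)])
    model.fit(X_aug, y_aug)  # retrain on the enlarged set
    return model
```

With a fast base model, a single experiment like this takes minutes rather than hours, which is exactly why picking the base model early pays off.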

Bagging

It turns out that bagging is quite effective at improving the generalization of DNN models. Here is the bagging result for one of the models.

Fig2. Number of bags and log loss

Though I was sure that more bags would provide further gains, I did not have enough computing resources for it.
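A minimal bagging sketch, assuming a hypothetical `fit_fn(X, y)` that trains one model (e.g. one freshly compiled CNN) and returns an object with a `.predict()` method:

```python
import numpy as np

def bag_predict(fit_fn, X_train, y_train, X_test, n_bags=10, seed=0):
    """Bagging sketch: each bag trains on a bootstrap resample of the
    training data; test predictions are averaged across bags."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        model = fit_fn(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)  # averaged predictions
```

Averaging over bags smooths out the variance of individual DNN runs, which is where the log-loss gain in Fig2 comes from.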

2nd stage of Stacking

Having obtained 24 patterns of predictions at the 1st stage, I fed them to several classifiers at the 2nd stage. I settled on two classifiers in the end:

  • NN-1L (100 bags)
  • ExtraTreesClassifier (100 bags)

The log loss of NN-1L was 0.1321 / 0.1292 on the public and private leaderboards respectively; ExtraTreesClassifier scored 0.1316 / 0.1280. The public leaderboard does not look very reliable because of the small dataset.

For submission, I simply averaged them with equal weights, 0.5 and 0.5. The final log loss of this averaged submission was 0.1306 / 0.1271.
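A sketch of such a 2nd stage with ExtraTreesClassifier (shapes, parameter values, and the function name are my assumptions, not the exact setup):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def second_stage(oof_preds, y, test_preds, seed=0):
    """2nd-stage sketch: `oof_preds` is an (n_train, 24) matrix of
    out-of-fold 1st-stage predictions, `test_preds` the matching
    (n_test, 24) matrix. Fit a meta-classifier on the stacked
    predictions; return test probabilities plus a CV estimate."""
    clf = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    # out-of-fold probabilities give an honest CV score at this stage
    cv_proba = cross_val_predict(clf, oof_preds, y, cv=5,
                                 method="predict_proba")[:, 1]
    clf.fit(oof_preds, y)
    return clf.predict_proba(test_preds)[:, 1], cv_proba
```

The final submission would then be an equal-weight blend of the two 2nd-stage outputs, e.g. `0.5 * p_nn + 0.5 * p_et`.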

I also tried gradient-boosted tree classifiers such as XGBoost and LightGBM, but they overfit heavily to the training data. Though they showed good CV scores, they did poorly on the public LB. My guess is that they read the inc_angle leakage in the wrong way.

Successful Approaches Not Taken

Now, let's take a look at other top rankers' approaches. I was really impressed by their solutions.

3rd-Place Solution Overview

As you can see, his early steps are quite similar to mine, but he trains his models without using inc_angle. In a separate path, he computes another prediction from inc_angle alone. This gives him two independent probabilities.

Then he mixes them with the following formula. Cool indeed.

p = p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
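This is Bayes' rule for combining two independent probability estimates of the same event under a uniform prior. In code:

```python
def mix(p1, p2):
    """Combine two independent probability estimates of the same
    event (the 3rd-place solution's formula)."""
    return p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
```

Note that mixing with an uninformative estimate (p2 = 0.5) leaves p1 unchanged, while two confident agreeing estimates reinforce each other: `mix(0.9, 0.9)` is roughly 0.988.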

1st Place Solution overview

As is common among top rankers, they looked carefully at the data. The visualization in the 1st place solution is really clear.

Fig3. 1st place solution — Visualization of leakage

Their intuition is also easy to follow:

>> Why do you re-train 100+ models on the group2 training sample only instead of group1 or both?

> Because we believe the distribution of those group2 samples are different from group1. Training on group2 only will remove the ‘bias’ induced by group1 which are 100% icebergs, thus improve the accuracy for group2 predictions, even at the small cost of having less images (around 400 less)

2nd Place Solution Overview

He adjusted his CV folds to prevent his model from exploiting the inc_angle leakage. I never imagined such an approach…

There are a lot of things to learn here. Congrats winners!

Summary

This competition was really fruitful for me.

  • I built a code base for ML projects.
  • I learned what I need to study more. I will watch How to Win a Data Science Competition: Learn from Top Kagglers again.
  • I learned how top rankers think. They have patience, backed by confidence in their own skills.