My 1st challenge on Kaggle: how to rank in the top 1%

Akira Sosa
Jan 29, 2018 · 5 min read

Recently, I entered a competition on Kaggle. Though it was my first challenge, I got a fairly good result. I would like to share my approaches and what I learned from the experience.


Statoil/C-CORE Iceberg Classifier Challenge is the competition I chose. In this competition, competitors classify ships and icebergs using radar images taken from a satellite. As additional information, inc_angle is given. More details are available on the competition page.

Here is a summary of my approaches and more:

- Successful Approaches Taken
- Successful Approaches, but not taken
- Time I spent
- What I should have done more

What made the difference between me and the top rankers above me was mainly effective leak utilization. In other words, I should have looked at the data more carefully, but I missed it. Though this kind of technique sounds a bit tricky, the skill of Exploratory Data Analysis makes it possible, and it is one of the most important skills for us.

As this was my first challenge, I spent a relatively large amount of time writing code. I have open-sourced my code on GitHub for later use.

My starter kit for ml project

Let’s take a look at more details.

Successful Approaches Taken

Before this competition, I was not familiar with techniques such as Stacking and Blending. Luckily, I received an email from Coursera in the early days of the competition. It was an invitation to the course titled How to Win a Data Science Competition: Learn from Top Kagglers. It helped me a lot. Especially the Ensembling lectures in Week 4, given by Marios, were quite helpful. I watched them many times.

Fig1. Final stacking architecture

Diversity for Stacking

The important point of Stacking is diversity. The base model I used was a simple CNN with 4 layers. Using this CNN, I made 6 patterns of predictions.

The aux inputs I used here come from the kernel below.

Ensembling GBMs (LB .203~

This kernel extracts statistics from the image. I fed them to the CNN as aux inputs with different kinds of scaling. I also extracted more statistics from the linearly scaled image (the original image is in decibels, so it is on a log scale).
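To make the dB-to-linear idea concrete, here is a minimal sketch of extracting statistics from one radar band on both scales. The specific feature set (max, mean, median, std) is my own illustrative choice, not the exact features from the kernel.

```python
import numpy as np

def band_stats(band_db):
    """Extract simple statistics from one radar band.

    band_db: 2D array of backscatter values in decibels, as in the
    competition data. Stats are computed on both the original dB
    (log) scale and the linear power scale.
    """
    band_lin = np.power(10.0, band_db / 10.0)  # dB -> linear power
    stats = {}
    for name, band in [("db", band_db), ("lin", band_lin)]:
        stats[f"{name}_max"] = float(band.max())
        stats[f"{name}_mean"] = float(band.mean())
        stats[f"{name}_median"] = float(np.median(band))
        stats[f"{name}_std"] = float(band.std())
    return stats
```

These per-image numbers can then be concatenated into the aux input vector for the CNN.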

After verifying their effects, I trained several other kinds of architectures.

Thus, I got 36 (= 6 × 6) patterns of models. Some of them could not converge well without data augmentation, so in the end I picked 24 patterns from them.

Base Model

I have realized that it is important to find a base model early on. CNN 4L was the base model in this case.

The base model must train quickly. Using it, I could rapidly run several kinds of experiments, such as pseudo-labeling, using the linearly scaled image, augmentation, aux inputs, etc.
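For illustration, here is a minimal sketch of such a base model in Keras: a 4-layer CNN over the two radar bands, with the image statistics concatenated in as an auxiliary input. The filter counts and the aux-input width (16 features) are my own assumptions, not the architecture actually used.

```python
from tensorflow.keras import layers, Model

def build_base_model(img_shape=(75, 75, 2), n_aux=16):
    """A small 4-conv-layer CNN with an auxiliary feature input."""
    img_in = layers.Input(shape=img_shape, name="img")
    x = img_in
    for filters in (16, 32, 64, 128):  # 4 conv blocks
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)

    aux_in = layers.Input(shape=(n_aux,), name="aux")  # image statistics
    x = layers.concatenate([x, aux_in])
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # P(iceberg)

    model = Model(inputs=[img_in, aux_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

A model this small trains in minutes, which is what makes rapid experimentation with augmentation, scaling, and aux inputs feasible.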


It turns out that bagging is quite effective at giving DNN models generalization ability. Here is the result of bagging one of the models.

Fig2. Number of bags and log loss

Though I was sure that more bagging would provide more gains, I did not have enough computing resources for it.
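Bagging here means training the same model several times on bootstrap resamples and averaging the predicted probabilities, which reduces the variance of a single DNN run. A generic sketch (the training function is left abstract; in the competition it would fit the CNN base model):

```python
import numpy as np

def bag_predictions(train_fn, X, y, X_test, n_bags=8, seed=0):
    """Train n_bags models on bootstrap resamples and average their
    predicted probabilities on X_test.

    train_fn(X, y) must return a fitted model exposing
    predict(X) -> array of probabilities.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        model = train_fn(X[idx], y[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)  # averaging reduces variance
```

Each extra bag costs one full training run, which is why compute, not the method, was the limiting factor.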

2nd stage of Stacking

Having successfully got 24 patterns of predictions at the 1st stage, I fed them to several classifiers at the 2nd stage. In the end I picked two classifiers.

The log loss of NN-1L was 0.1321 / 0.1292 on the public and private leaderboards respectively. The log loss of ExtraTreesClassifier was 0.1316 / 0.1280. It seems the public leaderboard is not so reliable because of the small amount of data.

I simply averaged them with equal weights, 0.5 and 0.5, for submission. The final log loss of this averaged submission was 0.1306 / 0.1271.
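The 2nd stage can be sketched as follows: the 24 first-stage prediction columns become the feature matrix for two simple classifiers, whose probabilities are then blended with equal weights. I use sklearn's LogisticRegression as a stand-in for the 1-layer NN; the real NN-1L was presumably a small neural network.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

def second_stage(stage1_train, y_train, stage1_test, seed=0):
    """2nd-stage stacking over first-stage predictions.

    stage1_train / stage1_test: arrays of shape (n_samples, 24) holding
    out-of-fold and test predictions from the 24 first-stage models.
    """
    nn = LogisticRegression(max_iter=1000).fit(stage1_train, y_train)
    et = ExtraTreesClassifier(n_estimators=500, random_state=seed).fit(
        stage1_train, y_train)
    p_nn = nn.predict_proba(stage1_test)[:, 1]
    p_et = et.predict_proba(stage1_test)[:, 1]
    return 0.5 * p_nn + 0.5 * p_et  # equal-weight blend for submission
```

Averaging two decorrelated 2nd-stage models is what pushed the blended score below either model alone.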

I also tried some gradient-boosted tree classifiers such as XGBoost and LightGBM. They overfit the training data heavily: though they showed good CV scores, they were not so good on the public LB. I guess they read the leakage of inc_angle in the wrong way.

Successful Approaches, but not taken

Now, let’s take a look at the other top rankers’ approaches. I was really impressed by their solutions.

3rd-Place Solution Overview

As you can see, his earlier steps are almost the same as mine. But he trains his models without using inc_angle. In a separate path, he calculates another kind of prediction using inc_angle. Thus, he gets two independent probabilities.

Then, he mixes them with the following formula. Cool indeed.

p = p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
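This formula follows from Bayes’ rule under a uniform prior, assuming the two predictors are independent. A tiny helper makes its behavior easy to check:

```python
def combine_independent(p1, p2):
    """Combine two independent probability estimates of the same event.

    Derived from Bayes' rule with a 50/50 prior: the odds of the two
    independent predictors multiply. An uninformative p2 = 0.5 leaves
    p1 unchanged; two confident, agreeing predictions reinforce.
    """
    return p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
```

For example, combining 0.9 and 0.9 yields roughly 0.988, while combining anything with 0.5 is a no-op, which matches the intuition of mixing an image-based and an angle-based prediction.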

1st Place Solution overview

As is common among top rankers, they looked at the data carefully. The visualization in the 1st place solution is really clear.

Fig3. 1st place solution — Visualization of leakage

And their intuition is easy to follow.

>> Why do you re-train 100+ models on the group2 training sample only instead of group1 or both?

> Because we believe the distribution of those group2 samples are different from group1. Training on group2 only will remove the ‘bias’ induced by group1 which are 100% icebergs, thus improve the accuracy for group2 predictions, even at the small cost of having less images (around 400 less)

2nd Place Solution Overview

He adjusts his CV folds to prevent his model from exploiting the leakage of inc_angle. I never imagined such an approach…
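One way to build such leakage-aware folds, and this is my reading of the idea rather than the 2nd-place code, is to group samples by inc_angle so that all images sharing an angle land on the same side of every split. sklearn’s GroupKFold does exactly this:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def angle_folds(inc_angle, n_splits=5):
    """Yield (train_idx, val_idx) pairs where samples sharing the same
    inc_angle never appear in both train and validation, so the model
    cannot exploit the angle leakage during CV."""
    # treat each distinct (rounded) angle as one group
    groups = np.unique(np.round(inc_angle, 4), return_inverse=True)[1]
    gkf = GroupKFold(n_splits=n_splits)
    X = np.zeros((len(inc_angle), 1))  # features are irrelevant to the split
    yield from gkf.split(X, groups=groups)
```

With folds like these, a CV score reflects genuine generalization instead of memorized angle groups, which is why his local validation stayed trustworthy.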

There are a lot of things to learn here. Congrats winners!


This competition was really fruitful for me.

Vitalify Asia Co., Ltd.
