6 Tricks I Learned From The OTTO Kaggle Challenge

May 24, 2015

Here are a few things I learned from the OTTO Group Kaggle competition. I had the chance to team up with great Kaggle Master Xavier Conort, and the French community as a whole was very active.

Hacking The Otto Group Challenge

1 — Stacking, blending and averaging

Teaming up with Xavier was an opportunity to practice some ensembling techniques.

We relied heavily on stacking. To the initial set of 93 features we added N new ones: the predictions of N different classifiers (Random Forest, GBM, Neural Networks, …). We then retrained P classifiers on the 93 + N features, and finally took a weighted average of the P outputs.
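The stacking step can be sketched roughly as follows. This is a minimal illustration and not our competition code: the synthetic dataset, the two base models and the use of out-of-fold predictions via cross_val_predict are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the competition data (assumption: 3 classes, 20 features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Level 1: the base classifiers whose predictions become new features.
base_models = [
    RandomForestClassifier(n_estimators=30, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold probabilities, so no row is predicted by a model that saw it.
oof_blocks = [
    cross_val_predict(model, X, y, cv=3, method="predict_proba")
    for model in base_models
]

# Original features plus one block of class probabilities per base model.
X_stacked = np.hstack([X] + oof_blocks)

# Level 2: a classifier retrained on the enlarged feature set.
meta_model = RandomForestClassifier(n_estimators=50, random_state=0)
meta_model.fit(X_stacked, y)
print(X_stacked.shape)
```

With several level 2 models, their predict_proba outputs would then be averaged with weights chosen on a validation set.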

We tested two tricks:

for averaging, use the harmonic mean (instead of the geometric mean): it improved our score a bit
when adding the N features, add the logit of the prediction instead of the prediction itself (it didn’t improve things in our case)
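Both tricks fit in a few lines of NumPy. The sketch below is only illustrative: the helper names, the clipping constant eps and the toy probabilities are made up for the example, and the means are unweighted for simplicity.

```python
import numpy as np

def geometric_mean(probas, eps=1e-15):
    """Geometric mean across models, renormalized so each row sums to 1."""
    stacked = np.clip(np.asarray(probas, dtype=float), eps, 1.0)
    avg = np.exp(np.log(stacked).mean(axis=0))
    return avg / avg.sum(axis=1, keepdims=True)

def harmonic_mean(probas, eps=1e-15):
    """Harmonic mean across models, renormalized so each row sums to 1."""
    stacked = np.clip(np.asarray(probas, dtype=float), eps, 1.0)
    avg = stacked.shape[0] / (1.0 / stacked).sum(axis=0)
    return avg / avg.sum(axis=1, keepdims=True)

def logit(p, eps=1e-15):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Toy predictions from two hypothetical models: 2 samples, 3 classes.
model_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
model_b = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])

blend_h = harmonic_mean([model_a, model_b])
blend_g = geometric_mean([model_a, model_b])

# Logit-transformed predictions, usable as stacking features.
logit_features = logit(model_a)
```

Both means pull the blend toward the more cautious model when the two disagree, the harmonic mean more strongly than the geometric one.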

2 — Calibration

This is one of the great features of the latest scikit-learn version (0.16). It rescales the classifier's predictions by taking the observations predicted within a segment (e.g. 0.3–0.4) and comparing them to the actual fraction of positives among those observations (e.g. 0.23, which means that a rescaling is needed).
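As a rough sketch of what this looks like in code, the snippet below wraps a random forest in CalibratedClassifierCV and compares log loss with and without calibration. The synthetic data, the choice of isotonic calibration and the small forest are assumptions made for the example; the imports use the current sklearn.model_selection module.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Uncalibrated baseline.
raw = RandomForestClassifier(n_estimators=30, random_state=0)
raw.fit(X_train, y_train)

# Same model, with probabilities rescaled on held-out folds of the training set.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    method="isotonic",
    cv=3,
)
calibrated.fit(X_train, y_train)

raw_proba = raw.predict_proba(X_test)
calibrated_proba = calibrated.predict_proba(X_test)
raw_loss = log_loss(y_test, raw_proba)
calibrated_loss = log_loss(y_test, calibrated_proba)
print("raw log loss:        %.4f" % raw_loss)
print("calibrated log loss: %.4f" % calibrated_loss)
```

method="sigmoid" is the other built-in option and tends to be the safer choice when there is little data to calibrate on.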

Here is a mini notebook explaining how to use calibration, and demonstrating how well it worked on the OTTO challenge data.

Using Scikit-Learn calibration

3 — GridSearchCV and RandomizedSearchCV

At the beginning of the competition, it quickly became apparent that, once again, Gradient Boosting Trees was one of the best performing algorithms, provided that you found the right hyperparameters.

In the scikit-learn implementation, the most important hyperparameters are learning_rate (the shrinkage parameter), n_estimators (the number of boosting stages), and max_depth (which limits the number of nodes in each tree; the best value depends on the interaction of the input variables). min_samples_split and min_samples_leaf can also be used to control the depth of the trees.

I also discovered that two other parameters, which I must admit I had never paid attention to before this challenge, were crucial for this competition: subsample (the fraction of samples used for fitting the individual base learners) and max_features (the number of features to consider when looking for the best split).

The problem was to quickly find the best combination of hyperparameters. I first discovered GridSearchCV, which performs an exhaustive search over specified parameter ranges. As always with scikit-learn, it has a convenient programming interface that smoothly handles, for example, cross-validation and parallel distribution of the search.
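A minimal grid search over a few of the GBM parameters mentioned above might look like this. The grid values, the tiny synthetic dataset and the log loss scoring are illustrative choices, not the settings used in the competition. Note that in scikit-learn 0.16 the class lived in sklearn.grid_search; the import below uses the current sklearn.model_selection module.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the example runs in seconds (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# 2 x 2 x 2 = 8 combinations, each evaluated with 3-fold cross-validation.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid,
    scoring="neg_log_loss",  # higher is better, hence the negated log loss
    cv=3,
    n_jobs=1,  # set to -1 to spread the fits over all available cores
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

The cost grows multiplicatively with every parameter added to the grid, which is where this approach runs out of steam.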

However, the number of parameters to tune, and their ranges, were too large to find the best ones in the time frame I had in mind (typically while sleeping, i.e. 7 to 10 hours). I had to fall back on another option: RandomizedSearchCV, which appeared in version 0.14. With this method, a fixed number of parameter settings is sampled at random from the specified ranges or distributions. It generally gives very good results, as described in this paper, and I was able to find a suitable parameter set within a few hours.
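Here is a sketch of the randomized variant, assuming scipy.stats distributions for the continuous parameters. The ranges, the value of n_iter and the synthetic data are placeholders chosen for illustration only.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the example runs in seconds (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low to high - 1.
param_distributions = {
    "learning_rate": uniform(0.01, 0.2),
    "max_depth": randint(2, 6),
    "subsample": uniform(0.5, 0.5),
    "max_features": uniform(0.3, 0.6),  # a float is read as a fraction of features
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions,
    n_iter=8,  # the budget: number of sampled settings, whatever the space size
    scoring="neg_log_loss",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

The running time is controlled directly by n_iter, so the search can be sized to fit a night of computation regardless of how many parameters are explored.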

Note that some competitors, like French Kaggler Amine, used Hyperopt for hyperparameter optimization.

About Grid Search and Calibration in scikit-learn

4 — XGBoost

XGBoost is a Gradient Boosting implementation heavily used by Kagglers, and I now understand why. I had never used it before, but it was a hot topic on the forum. I decided to have a look at it, even though its main interface is in R (there is also a Python API, which I haven’t used yet). XGBoost is much faster than the scikit-learn implementation, and gave better predictions. It will surely remain part of my toolbox.

A gentle XGBoost tutorial

5 — Neural Networks

Someone posted on the forum:

“guess this is really the time to try out neural nets”.

He was right. For me, it was the opportunity to play with neural networks for the first time.

Competitors used several implementations: H2O, Keras, cxxnet, … I personally used Lasagne. The main challenge was to fine-tune the number of layers, the number of neurons, dropout and the learning rate. Here is a notebook on what I learned.

6 — Bagging Classifier

One of the secrets of the competition was to run the same algorithm several times, each time with a random selection of observations and features, and to average the outputs.

To do that easily, I discovered the scikit-learn BaggingClassifier meta-estimator. It hides the tedious work of looping over model fits, selecting random subsets and averaging, and exposes easy fit() / predict_proba() entry points.
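A minimal sketch of the idea is shown below. The decision tree is only a stand-in for whatever model is being bagged, and the subset fractions and synthetic data are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

bagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),  # the model to repeat, passed positionally
    n_estimators=10,   # number of independent fits
    max_samples=0.8,   # each fit sees a random 80% of the observations
    max_features=0.8,  # and a random 80% of the features
    random_state=0,
)
bagged.fit(X, y)

# predict_proba returns the average of the 10 individual probability outputs.
proba = bagged.predict_proba(X)
print(proba.shape)
```

Any estimator that follows the scikit-learn fit() / predict_proba() convention can be dropped in as the first argument.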