How I won the Data Science Olympics 2019 (with code)

Romain AYRES
ManoMano Tech team
Jun 5, 2019

TL;DR

I won the Data Science Olympics 2019, a 2-hour, real-life data science challenge.

My solution is a single LightGBM model with strong feature engineering on categorical variables and dates. You can find the code here.

We’re recruiting tech talents at ManoMano (data scientists/engineers, developers, etc.), don’t hesitate to apply here!

Competition concept

The Data Science Olympics is a real-life data science challenge. Previously called The Best French Data Scientist (“Le Meilleur Data Scientist de France”), the 2019 edition was held simultaneously in Berlin and Paris.

The concept is pretty simple: every participant has 2 hours to build the best predictive model and to compute predictions on a test dataset that doesn’t contain the variable to predict. The competition’s platform computes a score for each participant in real time. In a nutshell, it’s like a Kaggle competition but rather than lasting 3 months, it lasts… 2 hours.

This year, the contest gathered over 1,000 data scientists working simultaneously on the same challenge, during the same evening, in Berlin and Paris!

Competition description

The DGCS (French acronym for Direction Générale de la Cohésion Sociale, in English the General Directorate for Social Cohesion) is attached to the French ministry in charge of solidarity. Its mission is to design, steer and evaluate public policies regarding solidarity, social development and the promotion of equality, so as to foster social cohesion and support people’s self-reliance.

Recently, the DGCS renovated and standardized the information system of the centers responsible for allocating emergency housing. The challenge dataset comes from this renovation project. Families in emergency situations are living on the streets or have no home; the problem is comparable to accommodating refugees, who must be assigned to care centers.

The goal is to predict the number of nights an emergency service can provide to an individual or a group (couple, family, …).

The data contains about 480k samples (with data related to requests, groups and individuals). The number of nights is only known for a subset of requests: the train dataset.

The objective is to predict the number of nights for the requests of the test dataset.

The target to predict represents a categorized number of nights that the person or group will stay in an emergency structure:

  • 0: the person or group won’t be granted a solution or has refused it
  • 1: 1 or 2 nights
  • 2: between 3 nights (included) and 1 month
  • 3: more than 1 month

My solution

My solution is a single LightGBM model with strong feature engineering on categorical variables and dates:

My log-loss evolution during the competition

The 2 keys to the challenge were as follows:

Key n°1: correctly optimize the loss function. The loss function was a bit tricky this year:
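
Up to normalization details, it boils down to a weighted multi-class log-loss in which each sample is weighted by 10 to the power of its true category:

$$\mathcal{L} = -\frac{1}{\sum_i w_i} \sum_i w_i \, \log p_{i,\,y_i}, \qquad w_i = 10^{\,y_i}$$

where $p_{i,\,y_i}$ is the predicted probability of the true category $y_i \in \{0, 1, 2, 3\}$ for request $i$.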

In plain terms, errors on category 3 are penalized much more heavily (by a factor of 10³) than errors on category 0. Ignoring this loss function and optimizing the plain log-loss was therefore doomed to failure, no matter how good your feature engineering was.

The simplest way to optimize the loss function is to optimize the log loss while weighting your samples. LightGBM allows you to assign a weight to each of your samples in the fit function (with lightgbm imported as lgb):

model = lgb.LGBMClassifier()
model.fit(X, y, sample_weight=10**y)

Another way to optimize this loss function was to write a custom objective and feed it to LightGBM, but it would have taken longer for the exact same result.
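
To keep an eye on this metric locally during the two hours (instead of relying only on the leaderboard), a small helper is enough. This is a minimal sketch, not the competition’s official scoring code; the helper name and the hold-out split are assumptions of mine:

import numpy as np
from sklearn.metrics import log_loss

def weighted_log_loss(y_true, proba, base=10):
    """Multi-class log-loss where sample i is weighted by base**y_i,
    so errors on category 3 cost 10^3 times more than errors on category 0."""
    weights = np.power(float(base), np.asarray(y_true))
    return log_loss(y_true, proba, sample_weight=weights, labels=[0, 1, 2, 3])

# Hypothetical usage, with `model` the LGBMClassifier fitted above
# and (X_val, y_val) a hold-out split:
# score = weighted_log_loss(y_val, model.predict_proba(X_val))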

Key n°2: manage categorical variables well. Most of the predictive signal was present in the categorical variables, especially those with many modalities (group_id, group_main_requester_id, housing_situation_id, etc.). I used 3 tricks to create useful variables derived from these categorical variables:

a) Label encoding: encode each modality as a single number. It seems counter-intuitive, but tree-based models (like LightGBM) manage to pull predictive signal out of this kind of variable.

from sklearn.preprocessing import LabelEncoder

for var in categorical_features:
    encoder = LabelEncoder()
    # Encode each modality as an integer (cast to str so missing values don't break the encoder)
    df['le_{}'.format(var)] = encoder.fit_transform(df[var].astype(str))
    # Keep missing values as their own modality
    df.loc[df[var].isnull(), 'le_{}'.format(var)] = -1

b) Value count: another way to extract information from a categorical variable is to simply count the number of times each modality appears.

for var in categorical_features:
    # Count how many times each modality appears and map the count back to each row
    mapping_vc = df[var].value_counts()
    df['vc_{}'.format(var)] = df[var].map(mapping_vc)

c) Target encoding: compute the mean of the target variable for each modality of the categorical variable. Check the code on GitHub if you want more details on the implementation.
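
For reference, here is a simplified sketch of the idea; it is not the exact code from the notebook, and the smoothing constant, column naming and target column name are illustrative assumptions:

def add_target_encoding(train, test, var, target_col, smoothing=10):
    """Encode each modality of `var` by a smoothed mean of the target,
    computed on the train set only so the test set never sees its own labels."""
    global_mean = train[target_col].mean()
    stats = train.groupby(var)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    train["te_{}".format(var)] = train[var].map(smoothed).fillna(global_mean)
    test["te_{}".format(var)] = test[var].map(smoothed).fillna(global_mean)
    return train, test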

You can find the code here. The provided notebook will give you a leaderboard score of 0.44057 if you run it as it is. For clarity, I removed the tuning part and modelling attempts that didn’t work.

What didn’t work

  • Random forest: does not directly optimize the log loss and is too weak compared to LightGBM
  • Neural nets: too slow, no time to tune the net architecture, no GPU :(
  • Blend of several LightGBM models (with seed variation): surprisingly, it didn’t work on this dataset

Final leaderboard

“Did you use the individuals dataset?”

That’s the first question another participant asked me once the challenge was over. I had no idea what he was talking about.

In fact, halfway through the competition, the organizers released another dataset with individual features (features on the people in the group making the request: age, gender, marital status, etc.), but I didn’t hear the announcement since I was listening to music with my headphones :-)

Fortunately, the participants who tried to use it told me that they failed to create discriminating features from it within the time available.

Data science competitions vs real life

Data science competitions are great: they allow every data scientist to improve their data manipulation, feature engineering and modeling skills. They’re also an excellent way to learn data science quickly alongside a talented and passionate community.

But data science competitions are only a small part of a data project. In real life, you will need much more if you want to put machine learning models into production (non-exhaustive list):

  • Business problem definition and understanding
  • Translation of business problem into a data science problem
  • Target definition
  • Metric selection
  • Training dataset extraction
  • Feature engineering (data science competition scope)
  • Modelling and tuning (data science competition scope)
  • Industrialization into your infrastructure
  • Production dataset extraction
  • Leakage detection & data distribution tests
  • Model maintenance once in production
  • Metric monitoring & alerting
  • Feedback loop management
  • Communication

At ManoMano, in order to build end-to-end data products in an efficient way and to tackle all these tasks, data scientists are integrated into feature teams that are composed of various profiles (developers, product managers, QA engineers, data engineers, etc.).

We’re recruiting tech talents in Paris, Bordeaux and Barcelona (data scientists/engineers, developers, etc.) so don’t hesitate to apply!

Many thanks to the ManoMano data team, from whom I learn a lot every day. We prepared this competition as a team, meeting each other once a week to share code & tips on competitions. Special mention to Jacques Peeters, data scientist at ManoMano, who finished 7th.

My brilliant data colleagues and me, after the competition

Last but not least, I won this huge, monstrous electric bike. If someone is interested, I’m selling it!
