What I Learned From The Kaggle Criteo Data Science Odyssey
Each Kaggle challenge is like an odyssey: you start with nothing, you don't know how it will end, and when it's finished, it leaves you with fond memories.
The goal of the Criteo challenge was to predict whether display ads will be clicked, based on traffic logs. Criteo is one of the French companies with the most challenging real-time data problems to solve. This is crucial for the firm, as it lies at the basis of its whole business model.
The dataset was pretty large: 13 GB (train + test), representing around 52 million rows. It was the second time I had to process data of this size (the first was during the Yandex challenge), but this time I had only a 16 GB RAM laptop at hand, and had no choice but to find tricks.
A bit of theory
This particular Criteo problem has been discussed in the academic literature. The challenge is to scale to billions of samples and hundreds of millions of features (because of the dimensionality of the categorical features), a class of problems where logistic regression is king. Here are two key papers on this subject, written by a Criteo scientist:
- A Reliable Effective Terascale Linear Learning System
- Simple and scalable response prediction for display advertising
I first heard about Online Learning in the "Large Scale Machine Learning" chapter of Andrew Ng's Coursera MOOC. This kind of approach is suited to a continuous stream of incoming data, which is typically the case with logs.
For that, as suggested at the beginning of the adventure on the Kaggle forum, I used Vowpal Wabbit (VW), a program originally developed by Yahoo! Research.
Even though it was new to me, it was straightforward to use, and I liked several aspects of the software:
- it was able to process the 10+ GB of data as a line-by-line stream, avoiding loading the full dataset into memory
- it was capable of handling the categorical features, with their huge numbers of distinct values, by using the "hashing trick"
- its helpful debug messages allowed me to follow the processing progress and monitor the model's convergence
Unfortunately, I ran out of time to fine-tune VW (it seems a good 0.452 score was doable). I hope there will be new opportunities to use this tool in the future.
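To make the stream ingestion concrete, here is a sketch of a converter from Criteo-style log rows to VW's input format. The column layout (a 0/1 click label, numeric features, categorical hex values) follows the challenge data, but the namespace names (`n`, `c`) and feature naming are my own convention, not anything prescribed by VW.

```python
# Hypothetical converter from one Criteo-style log row to a VW input line.
# VW's logistic loss expects labels in {-1, 1}; missing values are dropped.

def to_vw(label, numeric, categorical):
    """label: 0/1 click; numeric: ints or None; categorical: hex strings or ''."""
    vw_label = "1" if label == 1 else "-1"
    num = " ".join("i%d:%s" % (i, v) for i, v in enumerate(numeric) if v is not None)
    cat = " ".join("c%d_%s" % (i, v) for i, v in enumerate(categorical) if v)
    return "%s |n %s |c %s" % (vw_label, num, cat)

line = to_vw(1, [5, None, 12], ["68fd1e64", "", "287130e0"])
print(line)  # 1 |n i0:5 i2:12 |c c0_68fd1e64 c2_287130e0
```

Writing each categorical value as `c<column>_<value>` keeps identical hex values from different columns distinct before VW hashes them.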
Small is beautiful
What I love about Data Science is that it's paved with surprises.
Close to the end of the competition, a Kaggle hero posted the Python code of a mind-blowing, dead-simple solution. It was a very sober logistic regression implementation that, in this case, performed very well. This piece of code should be shown in every computer science school in the world.
My winning model started from this code (Logarithmic Loss public score = 0.46902) and improved on it (public score = 0.46314) with the following tricks:
- increased the dimensionality from D = 2**20 to 2**29 (the maximum size possible with my laptop's RAM)
- tuned the learning rate from 0.10 up to 0.15
- changed the hash function to MurmurHash, which has good distribution properties suitable for machine learning use cases such as feature hashing
- slightly changed the parameters passed to the hash function, to avoid some collisions
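The core of that approach can be sketched as follows: an online logistic regression over hashed features with a per-feature adaptive learning rate. This is a minimal sketch in the spirit of the public script, not its actual code; Python's built-in `hash()` stands in for MurmurHash (e.g. the `mmh3` package) to stay dependency-free, and D is kept small here for illustration.

```python
import math

D = 2 ** 20   # hashed feature space (I pushed this to 2**29 in the competition)
alpha = 0.15  # learning rate, tuned up from 0.10

w = [0.0] * D  # weights
n = [0.0] * D  # accumulated squared gradients, for the adaptive rate

def hashed(features):
    # stand-in for MurmurHash; built-in hash() keeps the sketch self-contained
    return [abs(hash(f)) % D for f in features]

def predict(idx):
    wTx = sum(w[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0)))

def update(idx, p, y):
    g = p - y  # gradient of the log loss w.r.t. the linear score
    for i in idx:
        n[i] += g * g
        w[i] -= alpha / (1.0 + math.sqrt(n[i])) * g

# one online pass over a toy stream of (features, click) pairs
stream = [(["c0_a", "c1_x"], 1), (["c0_b", "c1_x"], 0)] * 200
for feats, y in stream:
    idx = hashed(feats)
    p = predict(idx)
    update(idx, p, y)

print(predict(hashed(["c0_a", "c1_x"])) > 0.5)  # True: this pattern clicks
```

Because only the indices of the non-zero hashed features are touched per row, memory stays bounded by D regardless of how many rows stream through.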
Wait a minute… what about scikit-learn?
I'm a great enthusiast of the Python machine learning library scikit-learn, but this time, I must admit, I didn't get anything out of it.
In my quest, I first tried logistic regression. My main discovery was that linear_model.SGDClassifier(loss='log') is quicker than linear_model.LogisticRegression() (this has been discussed here by Olivier Grisel and Mathieu Blondel), and what's more, it supports the .partial_fit() method. This makes it possible to process large files chunk by chunk, avoiding loading the full dataset into memory.
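A chunk-by-chunk training loop with .partial_fit() might look like the sketch below. The data here is synthetic, standing in for chunks read from the CSV; note that scikit-learn renamed loss="log" to "log_loss" in version 1.1, so the sketch probes for whichever name the installed version accepts.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def log_sgd():
    # loss="log" was renamed "log_loss" in scikit-learn 1.1; try both names
    for loss in ("log_loss", "log"):
        try:
            c = SGDClassifier(loss=loss, random_state=0)
            c.partial_fit(np.zeros((2, 2)), [0, 1], classes=[0, 1])
            return SGDClassifier(loss=loss, random_state=0)
        except Exception:
            continue
    raise RuntimeError("no logistic loss variant found")

clf = log_sgd()
classes = np.array([0, 1])  # must be declared on the first partial_fit call
rng = np.random.RandomState(0)

for _ in range(20):          # 20 synthetic "chunks" of 500 rows each
    X = rng.randn(500, 5)
    y = (X[:, 0] > 0).astype(int)   # clicks driven by the first feature
    clf.partial_fit(X, y, classes=classes)

p = clf.predict_proba(np.array([[3.0, 0, 0, 0, 0]]))[0, 1]
print(p > 0.8)  # True: the model learned the pattern across chunks
```

The key constraint is that `classes` must list every label up front, since no single chunk is guaranteed to contain them all.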
I also discovered the absolute necessity not only of standardizing numerical features by removing the mean and scaling to unit variance (using preprocessing.StandardScaler()), but also of making them as Gaussian as possible (here, I took the log() for that). The mistake to avoid was applying a different scaling to each chunk (using preprocessing.scale(chunk)); the right approach was to apply the same coherent scaling rules over the entire dataset.
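In plain NumPy, the distinction looks like this. The chunks are synthetic log-normal data standing in for a skewed numerical feature; the point is that the mean and standard deviation are estimated once over all the data and then reused on every chunk, instead of being re-estimated per chunk.

```python
import numpy as np

rng = np.random.RandomState(0)
# three synthetic chunks of one log-normal (heavily skewed) feature
chunks = [np.exp(rng.randn(1000, 1) + 3) for _ in range(3)]

# take the log to make the feature roughly Gaussian, then estimate the
# scaling parameters ONCE over the whole dataset...
logged = np.vstack([np.log(c) for c in chunks])
mu, sigma = logged.mean(axis=0), logged.std(axis=0)

# ...and reuse the SAME mu/sigma when transforming every chunk
scaled = [(np.log(c) - mu) / sigma for c in chunks]

# the mistake: preprocessing.scale(chunk) per chunk would map identical
# raw values in different chunks to different scaled values
full = np.vstack(scaled)
print(abs(full.mean()) < 1e-9, abs(full.std() - 1.0) < 1e-9)  # True True
```

With scikit-learn, the same idea is StandardScaler fitted once (or incrementally via .partial_fit()) and then applied with .transform() on each chunk.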
Regarding the categorical features, the challenge was to one-hot encode them in a memory-friendly manner. I struggled, because the standard built-in functions like preprocessing.OneHotEncoder(), feature_extraction.DictVectorizer(), or pandas.get_dummies() are not suitable: they don't support chunk-by-chunk ingestion, and often produce dense matrices that don't fit in memory.
To solve this issue, I used a hand-made encoding function, inspired by this code, producing sparse matrices and supporting chunk-by-chunk (partial_fit) ingestion. For the .transform() method, I used COO sparse matrices (COOrdinate format), which in this case performed much better than LIL matrices (row-based linked-list sparse matrices): you can find my code here. This is pointed out in the scipy documentation for the LIL matrix: "consider using the COO format when constructing large matrices".
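The shape of such an encoder can be sketched as below. This is an illustration, not my competition code: D is kept tiny, and Python's built-in `hash()` (column-salted) stands in for a proper hash like MurmurHash.

```python
import numpy as np
from scipy import sparse

D = 2 ** 12  # hashed one-hot feature space (much larger in practice)

def one_hot_chunk(rows):
    """rows: list of lists of categorical string values (one inner list per sample).
    Returns a COO sparse matrix with one 1.0 per (sample, hashed value)."""
    data, i_idx, j_idx = [], [], []
    for i, row in enumerate(rows):
        for col, val in enumerate(row):
            j = hash("%d_%s" % (col, val)) % D  # column-salted hash
            data.append(1.0)
            i_idx.append(i)
            j_idx.append(j)
    return sparse.coo_matrix((data, (i_idx, j_idx)), shape=(len(rows), D))

chunk = [["a", "x"], ["b", "x"]]
m = one_hot_chunk(chunk)
print(m.shape, m.nnz)  # (2, 4096) 4
```

Building the three COO coordinate lists in plain Python and constructing the matrix once per chunk avoids the per-element insertion cost that makes incremental construction of LIL matrices slow.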
Finally, I also tried Random Forests. As partial_fit() is not supported (read here, for example), I was only able to use the part of the data that fit in memory. I also included likelihood engineered features (replacing each categorical value by its target likelihood, as suggested here), which gave good performance improvements, but unfortunately not enough to reach the Holy Grail and get a decent score.
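The likelihood feature can be sketched as follows: each categorical value is replaced by the click rate observed for that value in the training data. The smoothing toward a global prior is my own addition here (it protects rare categories), not necessarily what the referenced post uses.

```python
from collections import defaultdict

def likelihood_encoding(values, targets, prior=0.25, smoothing=10.0):
    """Map each categorical value to its smoothed target likelihood.
    prior/smoothing are hypothetical knobs: rare values shrink toward prior."""
    clicks = defaultdict(float)
    counts = defaultdict(float)
    for v, y in zip(values, targets):
        clicks[v] += y
        counts[v] += 1.0
    return {v: (clicks[v] + smoothing * prior) / (counts[v] + smoothing)
            for v in counts}

enc = likelihood_encoding(["a", "a", "b", "a"], [1, 1, 0, 0])
print(enc["a"] > enc["b"])  # True: 'a' clicks more often than 'b'
```

One caveat with this kind of feature is target leakage: the likelihoods should be computed on a fold that excludes the row being encoded, or the forest will overfit them.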
What else didn’t work
We also learn from our failures. Here is what I tried during this journey that didn't work either.
- One model per time slice
There was clearly an intra-day seasonality in the data. I tried building 24 models (one per hour), or even 2 models (splitting each day in two), but it was not better. I guess the loss from a smaller training set was not compensated by the gain from a specialized model.
I also tried to engineer features embedding the seasonal behaviour, to make the model aware of it, but that didn't work.
- Ensemble of several models
Ensembling models often gives good results in other competitions (e.g. simply averaging the outputs of two classifiers). Here, it failed: the different models, all derived from logistic regression with almost the same feature set, were not different enough to bring diversity.
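For reference, the averaging itself is trivial; the numbers below are made up, with the two models' predictions deliberately close to show why correlated models add nothing.

```python
# p1/p2: two models' predicted click probabilities on three hypothetical ads
p1 = [0.10, 0.80, 0.55]  # model A
p2 = [0.12, 0.78, 0.53]  # model B, highly correlated with A

# simple probability averaging; rounding only to keep the output readable
avg = [round((a + b) / 2.0, 4) for a, b in zip(p1, p2)]
print(avg)  # [0.11, 0.79, 0.54]
```

When the two prediction vectors track each other this closely, the average barely moves from either model, so the ensemble cannot fix errors they share.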