Our Avazu Kaggle Challenge Logbook
The goal of this data science competition was to predict clicks on ad banners, based on anonymized information such as the IP address of the user, the site category, the device type, etc.
To train the model, we had at hand 10 successive days of click-through data, representing around 40 million rows and 6 GB of uncompressed data. Participants had to predict clicks on the 11th day.
Our team finished the competition in 22nd position (out of 1600), and we want to congratulate another French team that finished 3 ranks above us, and of course the winning team, composed of 3 Kagglers who never lost any competition they entered!
AdPredictor is an algorithm that turned out to be well suited to this challenge. Originating in the Microsoft research labs, AdPredictor replaced older algorithms for predicting CTRs for Bing — thus it looked like a nice candidate for the Avazu challenge.
AdPredictor is a Bayesian algorithm, basically an online Bayesian probit regression. In its simplest form, it assumes that the probability of a click is a function of a weighted linear combination of the features (the function being the cumulative distribution function of the Gaussian distribution, replacing the more standard sigmoid). As an online algorithm, it updates the weight coefficients for each sample, in a way that minimizes the “distance” between the updated weight distribution and the one implied by the new sample. As a Bayesian algorithm, it assumes the weights themselves follow Gaussian distributions, and it keeps track of their means and standard deviations.
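To make this concrete, here is a minimal sketch of an AdPredictor-style learner in Python. This is an illustration based on the published algorithm, not our competition code; the `beta` and prior-variance defaults are hypothetical placeholders:

```python
import math

def pdf(t):
    """Standard normal probability density."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def cdf(t):
    """Standard normal cumulative distribution (the 'probit' link)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

class AdPredictor:
    """Sketch of online Bayesian probit regression.

    Each weight i is a Gaussian N(mu[i], sigma2[i]); beta is the single
    tunable parameter mentioned in the post."""

    def __init__(self, beta=0.05, prior_sigma2=1.0):
        self.beta = beta
        self.prior_sigma2 = prior_sigma2
        self.mu = {}       # feature -> posterior mean
        self.sigma2 = {}   # feature -> posterior variance

    def _totals(self, features):
        mean = sum(self.mu.get(f, 0.0) for f in features)
        var = self.beta ** 2 + sum(
            self.sigma2.get(f, self.prior_sigma2) for f in features)
        return mean, var

    def predict(self, features):
        mean, var = self._totals(features)
        return cdf(mean / math.sqrt(var))

    def update(self, features, clicked):
        y = 1.0 if clicked else -1.0
        mean, var = self._totals(features)
        s = math.sqrt(var)
        t = y * mean / s
        # Correction factors from the truncated Gaussian posterior
        v = pdf(t) / cdf(t)
        w = v * (v + t)
        for f in features:
            s2 = self.sigma2.get(f, self.prior_sigma2)
            self.mu[f] = self.mu.get(f, 0.0) + y * (s2 / s) * v
            self.sigma2[f] = s2 * (1.0 - (s2 / var) * w)
```

Before any update the prediction is exactly 0.5 (zero-mean prior pushed through the probit link); after repeated clicks on the same features, the means rise and the variances shrink, which is the per-feature "trust score" discussed below.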
It looked attractive because we felt that, with such a mass of data, we were learning in different ways (“regimes”): sometimes with lots of samples (for instance when learning the impact of the device type, or the time of day if relevant), sometimes with very few samples (for device_id, or combinations such as device_id and banner position). AdPredictor is able to learn at different rates for different features (as FTRL Proximal does), and also keeps track of how much trust to put in a prediction (thanks to the standard deviation), allowing it to track several regimes in parallel, as opposed to FTRL Proximal, which has no such “trust score”.
How did it perform? Pretty well, indeed: it is the most efficient single predictor in our ensemble. And this was with minimal tuning: we have only one β parameter. We tried to introduce and vary other parameters, but to no avail! We also found that optimal performance was obtained in one pass (your mileage may vary, as some papers report still better performance with multiple passes), and the run time (for our Python / Pypy implementation) is similar to FTRL. And it is different enough from FTRL Proximal and Vowpal to improve things when merging!
Fun with algorithms — but is that time well spent?
I (Yannick) spent a lot of time in the competition finding new algorithms, and sometimes coding them in Python / Pypy. Was it fun? Yes, for sure, as reading research papers, coding them in Python, or finding and testing nice existing implementations can be. Frustrating? Yes, sometimes: in most cases I spent time coding / compiling / running / tuning, and got worse performance than what I already had…
At the end of the day, the final model was a combination of FTRL Proximal, AdPredictor and Vowpal Wabbit. FTRL and AdPredictor were hand-coded in Python / Pypy, and tuned with many, many parameters (20? 30?). Vowpal Wabbit was fed by a Python / Pypy script. Did I get some improvement by combining FTRL, AdPredictor and Vowpal? Yes. Did I get improvements by testing neural networks, xgboost, random forests, extra trees? No. Am I ashamed of not testing libFM? Yes, for sure, as it seems most of the winning combinations use it.
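The post does not say exactly how the three models were merged, so here is just one common way to blend click probabilities from several models — averaging in logit space — as a hypothetical sketch:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def blend(preds, weights=None):
    """Blend per-model click probabilities by weighted averaging in
    logit space, then mapping back to a probability.

    This is an illustrative choice, not necessarily the combination
    rule used in the competition."""
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)
    z = sum(w * logit(p) for w, p in zip(weights, preds))
    return sigmoid(z)
```

With equal weights, identical inputs pass through unchanged, and the blend of two different predictions always lands between them; the weights themselves can be tuned on a held-out day.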
Was all this time well spent? I keep wondering. Sure, I learned a lot and improved things, but on the other hand all this time could have been spent engineering more features, tuning parameters on one or two algorithms, or testing other combinations. I must take another look at the winning combinations. In the meantime, I will keep wondering whether I should have stuck with tuning Vowpal (and maybe libFM…).
We all love Vowpal Wabbit
Our first try with Vowpal was on the previous release of the challenge, before the organizers found the anonymity problem and changed the data set. Things went fairly well, but only up to a point. When we ran it again on the new data set, we could not get much below 0.40 — not something to be very proud of. This is why we turned to FTRL Proximal (not yet coded in Vowpal) and then AdPredictor. Much later, we tried Vowpal again, but with no better result, whatever the parameters: neural network, straight regression or FTRL…
Then, one day, someone kindly suggested that I (Yannick) lower the learning rate, and lower it a lot — from the standard 0.5 down to 0.04. With some small L2 regularization (1e-8) I got down to 0.394 on the first run! Much, much better! Now it was worth exploring.
I found later that 0.04 was indeed my best learning rate, and I managed to lower the 0.394 a bit by engineering features and tuning L2. Vowpal was a bit weaker than my hand-coded FTRL Proximal and AdPredictor, but not by much. Ensembling it with them improved things a lot!
A few things I love about Vowpal: it is fast, easy to use from the command line, easy to provide data to thanks to a simple input format, and easy to engineer features for, with a few (indeed: many) parameters to tune, which can really improve things. One regret: I could have spent more time optimizing it, but its comeback came only two weeks before the end of the challenge…
To feed Vowpal, I used a Python / Pypy script. As I wanted to experiment with different combinations, I preferred to generate the Vowpal file on the fly, from the csv data. As my PC was more memory-bound than CPU-bound (an i7 desktop with 16GB RAM), it was no problem to run a few more processes in parallel.
A note to myself: spend more time with Vowpal.
Forget good old K-Fold cross validation
Every data scientist learns to evaluate model performance through K-Fold cross validation. However, in our case it was not applicable, because of the size of the data and its transactional nature. We had to rethink the way we cross-validate.
The first strategy was to train on the first 9 days of the training set, and validate on the 10th (the held-out data set). Splitting the available training data on a day boundary, and keeping the sequence of consecutive days, was natural — it was the best way to mimic the initial problem we had to solve.
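Such a day-boundary split can be sketched in a few lines. This assumes the competition's `hour` field in `YYMMDDHH` format, and loads rows into memory for simplicity (a real pipeline would stream them):

```python
def day_of(hour_field):
    """Extract the YYMMDD day prefix from an hour field like '14102123'."""
    return hour_field[:6]

def split_by_day(rows, holdout_day):
    """Train on every day strictly before holdout_day, validate on it.

    rows: iterable of dicts with an 'hour' key in YYMMDDHH format.
    holdout_day: a YYMMDD string."""
    train = [r for r in rows if day_of(r["hour"]) < holdout_day]
    valid = [r for r in rows if day_of(r["hour"]) == holdout_day]
    return train, valid
```

The string comparison works because `YYMMDD` sorts chronologically; any day after the hold-out is simply dropped, which matches the "train on the past, validate on the next day" setup above.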
One good point with a held-out set is that you can use it to understand in depth where your model is making errors, and also to find out how to ensemble different models.
However, the problem with this validation strategy is that it takes quite long to get a result: typically 2 to 3 hours. It was not very convenient for quickly tuning parameters. Another technique I used is progressive validation (sometimes called online loss). The idea is to regularly evaluate the model error — for example on every 10th sample — before updating the model with its ground truth. As stated in this article, this strategy has several advantages:
- Because computing a gradient for learning requires computing a prediction anyway, there is no processing overhead
- If we want, we can use 100% of our data for both training and testing, which is not the case with held-out validation
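The loop itself is tiny. Here is a sketch of progressive validation under the assumption that the online learner exposes `predict(features)` and `update(features, clicked)` methods (the interface names are hypothetical):

```python
import math

def progressive_logloss(model, stream, every=10):
    """Run an online learner over a stream of (features, clicked) pairs,
    scoring every `every`-th sample *before* its label is used for
    learning, and return the average log loss over the scored samples."""
    loss, n = 0.0, 0
    for i, (features, clicked) in enumerate(stream):
        if i % every == 0:
            # Clamp to avoid log(0) on overconfident predictions
            p = min(max(model.predict(features), 1e-15), 1.0 - 1e-15)
            loss += -math.log(p) if clicked else -math.log(1.0 - p)
            n += 1
        model.update(features, clicked)   # learn from every sample
    return loss / max(n, 1)
```

Since the prediction is computed before the update, each scored sample is effectively unseen at scoring time, which is what makes the running loss a usable stand-in for held-out error.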
Finally, we used both validation strategies: thanks to progressive validation, we were able to quickly find the best FTRL algorithm parameters, and thanks to the held-out strategy, we were able to ensemble models and pinpoint model errors.
The bottom line
That’s all for now. This was part I of our Logbook — stay tuned for part II!