Results from the Criteo-AdKDD-2021 Challenge

Alexandre Gilotte
Published in Criteo R&D Blog
Sep 27, 2021

As an industry leader on privacy and advertising, Criteo ran this summer an open advertising prediction ML competition relying on aggregated and noisy data. The aggregation and noise-addition methods, based on Differential Privacy, were inspired by the reporting API proposed in the Google Privacy Sandbox. Full details on the challenge can be found here: https://competitions.codalab.org/competitions/31485

We present in this blog post the winning results from this challenge. While encouraging, these results come with an important caveat: all winners made extensive use of the sampled granular, event-level data that were made available alongside the challenge's dataset, and reaching the reported level of performance without any granular data remains a wide-open field of investigation.

This, combined with the complexity of the standard bidding systems in which those models would be used, means that the results of this competition alone are not sufficient to set business expectations. In a separate blog post, we discuss the potential business impacts of the Privacy Sandbox.

The challenge

In this challenge, a dataset with 19 categorical features was split into a large training set of 100M examples and a test set of 1M lines.
The competitors had to predict the probability of the label “the banner is clicked” as a function of those 19 features.

A raw sample of this dataset could look like this:

Partner Id   Url              Ad size   … (*)   Clicked?
------------ ---------------- --------- ------- ----------
42           someurl.com      large     ...     0
43           someurl.com      large     ...     1
42           anotherurl.com   small     ...     0
...          ...              ...       ...     ...

(*) Here we show only 3 features; the dataset contains 16 more. Also note that the features have been anonymized and their modalities hashed.

Aggregated data

If the data had been provided to the competitors that way, the challenge would have fallen into the category of “classical” ML challenges. This is where we spiced things up a bit: we did not provide the full training set; instead, we provided aggregation tables reporting the counts of examples and the counts of clicks, grouped in different ways.
Those tables look like this:

┌───────────┬────────────────┬───────────────────┬─────────────────┐
│ PartnerId │ Url            │ Count of examples │ Count of clicks │
├───────────┼────────────────┼───────────────────┼─────────────────┤
│ 42        │ someurl.com    │ 10000             │ 600             │
│ 42        │ anotherurl.com │ 55000             │ 1000            │
│ 43        │ anotherurl.com │ 20000             │ 500             │
│ ...       │ ...            │ ...               │ ...             │
└───────────┴────────────────┴───────────────────┴─────────────────┘
┌───────────┬────────┬───────────────────┬─────────────────┐
│ PartnerId │ AdSize │ Count of examples │ Count of clicks │
├───────────┼────────┼───────────────────┼─────────────────┤
│ 42        │ large  │ 700000            │ 8000            │
│ 42        │ small  │ 100000            │ 1000            │
│ 43        │ large  │ 900000            │ 10000           │
│ ...       │ ...    │ ...               │ ...             │
└───────────┴────────┴───────────────────┴─────────────────┘
┌────────────────┬────────┬───────────────────┬─────────────────┐
│ Url            │ AdSize │ Count of examples │ Count of clicks │
├────────────────┼────────┼───────────────────┼─────────────────┤
│ someurl.com    │ large  │ 300000            │ 2000            │
│ someurl.com    │ small  │ 40000             │ 500             │
│ anotherurl.com │ large  │ 200000            │ 4000            │
│ ...            │ ...    │ ...               │ ...             │
└────────────────┴────────┴───────────────────┴─────────────────┘

The “counts” in those tables refer to the number of events in the large 100M-example training set.
Given the 19 features, we computed and provided one aggregation table grouping the data by each pair of features, for a total of (19 x 18) / 2 = 171 tables.
On top of that, some Gaussian noise was added to those counts, so that the provided data abide by differential privacy requirements.
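To make this concrete, here is a rough sketch of how one such noisy table could be built from a granular dataframe. The tiny `train_df`, the column names and the noise scale are illustrative placeholders, not the exact ones used to generate the challenge's tables:

```python
import numpy as np
import pandas as pd

# Toy granular data standing in for the real 100M-example training set.
rng = np.random.default_rng(0)
features = [f"f{i}" for i in range(19)]
train_df = pd.DataFrame({f: rng.integers(0, 5, 1000) for f in features})
train_df["clicked"] = rng.integers(0, 2, 1000)

def noisy_pair_table(granular_df, f1, f2, sigma):
    """Group by the feature pair (f1, f2), count examples and clicks,
    and add Gaussian noise to both counts."""
    table = (
        granular_df.groupby([f1, f2])
        .agg(count_examples=("clicked", "size"), count_clicks=("clicked", "sum"))
        .reset_index()
    )
    table["count_examples"] += rng.normal(0.0, sigma, len(table))
    table["count_clicks"] += rng.normal(0.0, sigma, len(table))
    return table

# One table per pair of features: 19 * 18 / 2 = 171 tables in total.
tables = {
    (a, b): noisy_pair_table(train_df, a, b, sigma=17.0)  # sigma: placeholder noise scale
    for i, a in enumerate(features)
    for b in features[i + 1:]
}
```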

We would like to re-emphasize here that learning from those aggregation tables alone is a difficult problem, to which none of the “classical” machine learning methods can be directly applied.

Additional granular datasets

However, we also provided two smaller “granular” (i.e., non-aggregated) datasets:

  • a 1M-line unlabeled test set, so that competitors could apply their model and send us their predictions.
  • a small 100K-line labeled training set, which we included to keep the challenge easily accessible to many competitors. It was thus possible to participate by training a model with some “classical” ML on this small training set.

Our hope when designing the challenge was that those two granular sets would be small enough to be of limited use for training, thus encouraging new solutions making creative use of the large aggregated dataset.

Two sub-challenges: predicting clicks or sales

At Criteo, the label we are usually most interested in predicting is a “matched sale” on a banner, which is roughly described as a click on a banner followed by a conversion from the user. Those labels are however very scarce, leading to rather noisy metrics. We thus decided to have two leaderboards: one where the goal was to predict clicks, and one where it was to predict sales.

(We thus provided both the “matched sale” and “click” labels in the small training set; and the counts of both matched sales and clicks in the aggregation tables.)

Challenge results

Skyline

We compared the performances of the challengers with that of a “skyline” model trained on the full 100M-example dataset. This “skyline” was a rather standard logistic regression with a quadratic kernel, a.k.a. “feature crosses”. While not state of the art, this kind of model is known to perform reasonably well on this type of dataset.
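For illustration, a minimal sketch of such a baseline is given below, hashing every feature and every feature cross before fitting a logistic regression. It reuses the toy `train_df` and `features` from the sketch above; the hashing space, solver and hyper-parameters are placeholders rather than the actual skyline configuration:

```python
from itertools import combinations

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

def crossed_tokens(row, features):
    """String tokens for every single feature and every pair of features of one example."""
    singles = [f"{f}={row[f]}" for f in features]
    crosses = [f"{f1}={row[f1]}|{f2}={row[f2]}" for f1, f2 in combinations(features, 2)]
    return singles + crosses

# Hash the 19 features and their 171 crosses into a large sparse space,
# then fit a plain logistic regression on the clicked label.
hasher = FeatureHasher(n_features=2**22, input_type="string")
X = hasher.transform(crossed_tokens(row, features) for _, row in train_df.iterrows())
skyline = LogisticRegression(max_iter=200).fit(X, train_df["clicked"])
```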

Results

We certainly did not expect the challengers to beat this “skyline”. We were actually pleasantly surprised that many teams managed to reach performances not far below it.

╔═══════════════════════╦═════════════════════════════╗
║ Model                 ║ Click prediction metric (*) ║
╠═══════════════════════╬═════════════════════════════╣
║ Skyline               ║ 0.31                        ║
║ Winner                ║ 0.295                       ║
║ Second                ║ 0.292                       ║
║ Third                 ║ 0.288                       ║
║ Logistic, small train ║ 0.24                        ║
║ Aggregated data only  ║ 0.265                       ║
╚═══════════════════════╩═════════════════════════════╝

(*) The reported metric is a score between 0 and 1, where higher is better. More precisely, it is defined as “1 - LLH / Entropy”, where LLH is the average log-loss (negative log-likelihood) of the predictions and Entropy is the entropy of the observed labels. This formula may be interpreted as the “proportion of the labels’ entropy explained by the model”, also called “Improvement vs Naive” in the competition leaderboards. (Also, please note that this is a technical metric; the impact on business metrics, if we were to use such models, is not known.)

╔═════════╦═════════════════════════╗
║ Model   ║ Sales prediction metric ║
╠═════════╬═════════════════════════╣
║ Skyline ║ 0.34                    ║
║ Winner  ║ 0.32                    ║
║ Second  ║ 0.306                   ║
║ Third   ║ 0.303                   ║
╚═════════╩═════════════════════════╝
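As a side note, the “Improvement vs Naive” metric described above can be computed as in the short sketch below; the exact leaderboard implementation may differ in details, but the ratio is the same whether sums or averages are used:

```python
import numpy as np

def improvement_vs_naive(y, p, eps=1e-12):
    """1 - LLH / Entropy: proportion of the labels' entropy explained by the model.

    y: 0/1 labels, p: predicted probabilities. The naive predictor always outputs
    the average label, so its average log-loss is the entropy of the labels.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    llh = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # average log-loss of the model
    base = np.clip(y.mean(), eps, 1 - eps)
    entropy = -(base * np.log(base) + (1 - base) * np.log(1 - base))
    return 1.0 - llh / entropy  # 0 means no better than always predicting the average CTR
```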

Methods used by the winning teams

Two main families of solutions emerged from the challenge:

  • The most popular consisted in enriching the small training set with additional columns computed from the aggregated data, and using classical ML on this enriched training set.
  • The winning method consisted in noticing that, to train a logistic model, we only need the aggregated labels and some granular unlabeled samples, which were available in the test set.

Enriching the small training set

The idea here is to use the large aggregated dataset to add additional columns to the small training set. To define one such column, we can choose a pair of features; let us call them F1 and F2. On a line of the small training set, we can look at the modalities of those features (say, modalities “42” and “A”), and we can then query the aggregation tables with those modalities. The aggregation tables give us the count of examples where F1 and F2 take those modalities, and we can report this count as a new column of the small training set. We can similarly add a column with the count of clicks, or the CTR, for those modalities of F1 and F2. And we can of course compute such additional columns for every pair of features, giving us hundreds of additional columns.
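A minimal sketch of this enrichment for one pair of features is shown below, reusing the toy `tables` dict from the aggregation sketch above; `small_train` stands for the 100K labeled granular set, and all column names are placeholders:

```python
def add_pair_counts(granular, agg_table, f1, f2):
    """Append the noisy count and CTR of the (f1, f2) pair as new columns of a granular set."""
    merged = granular.merge(agg_table, on=[f1, f2], how="left")
    counts = merged["count_examples"].fillna(0).clip(lower=1).to_numpy()
    clicks = merged["count_clicks"].fillna(0).clip(lower=0).to_numpy()
    return granular.assign(**{
        f"count_{f1}_{f2}": counts,         # how frequent this (f1, f2) combination is in the 100M set
        f"ctr_{f1}_{f2}": clicks / counts,  # noisy click-through rate of the combination
    })

# Repeating this over all 171 feature pairs yields hundreds of new numerical columns.
enriched_train = small_train
for (f1, f2), table in tables.items():
    enriched_train = add_pair_counts(enriched_train, table, f1, f2)
```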

It turns out that the columns defined this way are much easier for classical ML models to handle than the “raw” categorical features of the original data, and the CTRs computed this way are much more precise than what we could get from the small training set alone, because they are computed on the whole 100M-example set.

We can thus get good performances by training a classical ML model on the enriched dataset. In particular, we observed, both from competitors’ solutions and from internal experiments, that GBDTs performed very well when trained on those enriched columns.
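For instance, once the count and CTR columns are in place, a gradient-boosted trees model can be trained directly on them. The sketch below uses scikit-learn's GBDT with arbitrary hyper-parameters (the winning teams used their own GBDT setups), and assumes the test set has been enriched with the same columns into `enriched_test`:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Keep only the numerical columns derived from the aggregation tables.
enriched_cols = [c for c in enriched_train.columns if c.startswith(("count_", "ctr_"))]

gbdt = HistGradientBoostingClassifier(max_iter=500, learning_rate=0.05)
gbdt.fit(enriched_train[enriched_cols], enriched_train["clicked"])
click_probabilities = gbdt.predict_proba(enriched_test[enriched_cols])[:, 1]
```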

Note that this method critically requires a set of labeled granular data.

Learning a logistic model with aggregated labels and granular samples

This method requires a bit more mathematics to explain. When we train a logistic regression, we typically search for the parameters minimizing the log-loss by gradient descent, and we thus need to compute the derivative of the log-loss with respect to each coefficient of the model.

The key insight is to look closely at the formula for this derivative. Let us assume that w is the weight associated with “the partner is 42”. Then the derivative of the loss with respect to this weight is exactly:

∂Loss/∂w = − Σᵢ yᵢ + Σᵢ ŷᵢ

where yᵢ is the label on example i, ŷᵢ is the prediction of the model on example i, and both sums run over the examples where the weight w is active, i.e. where the partner is “42”.

The first term is directly available in the aggregated data.

The second term cannot be computed on the large training set, because computing the model’s predictions requires granular samples. It may however be estimated on the granular samples available in the challenge; in particular, it is possible to compute it on the 1M samples of the test set!

Once we have noticed this fact, we can optimize a logistic model by gradient descent almost as usual. But let us stress that this method critically requires a set of (unlabeled) granular data.
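To make the trick concrete, here is a toy sketch of this gradient estimation for a logistic model on one-hot (indicator) features. The “label” part of the gradient is read from aggregated click counts, and the “prediction” part is estimated on unlabeled granular samples; the variable names, the absence of cross features and the simple rescaling between the two sums are all simplifications, not the winner's actual solution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aggregated_gradient(w, X_unlabeled, clicks_per_weight, n_total):
    """Gradient of the average log-loss of a logistic model, estimated from
    aggregated labels and unlabeled granular samples.

    w                 : one weight per one-hot encoded modality, shape (d,)
    X_unlabeled       : 0/1 matrix of unlabeled granular samples, shape (n, d)
    clicks_per_weight : aggregated (noisy) click count for each modality, shape (d,)
    n_total           : number of examples behind the aggregated counts (100M in the challenge)
    """
    y_hat = sigmoid(X_unlabeled @ w)  # predictions: computable without any label
    # d(loss)/dw = - sum_i y_i x_i  +  sum_i y_hat_i x_i
    label_term = -clicks_per_weight                               # read from the aggregation tables
    pred_term = (n_total / len(y_hat)) * (X_unlabeled.T @ y_hat)  # estimated on the samples, rescaled
    return (label_term + pred_term) / n_total

# Toy usage: 3 one-hot modalities, 5 unlabeled samples, aggregated clicks per modality.
X_unlabeled = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
clicks = np.array([30.0, 10.0, 5.0])
w = np.zeros(3)
for _ in range(200):
    w -= 0.5 * aggregated_gradient(w, X_unlabeled, clicks, n_total=100.0)
```

In the challenge, per-modality click counts can be read off the aggregation tables (summing a pair table over its second feature gives single-feature counts), and the unlabeled samples are simply the 1M test lines.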

Learning with aggregated data only?

We would like to commend the winners for the creative ideas they brought to this topic. However, we would also like to point out that the methods presented above critically rely on the availability of some granular data.

At Criteo, we started to work on methods using aggregated data only. Our method may be found here: https://github.com/criteo-research/ad_click_prediction_from_aggregated_data. To describe it very succinctly, we can think of it as generating fake granular unlabeled samples which match the counts of examples observed in the aggregation tables, and then applying the winner’s method with those fake samples instead of the test samples. It is, so far, the only method we are aware of that gets reasonable results with aggregated data only, and its results are reported in the table above (“Aggregated data only”). But it still lags well behind methods having access to even a small granular dataset.
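As a very rough illustration of the idea (and only of the idea: the repository above does something more careful), the toy sketch below draws each feature independently from the marginal distribution implied by the aggregation tables, whereas the actual method also tries to match the pairwise counts. It reuses the toy `tables` dict from the aggregation sketch:

```python
import numpy as np
import pandas as pd

def sample_fake_rows(tables, features, n_rows, rng):
    """Draw synthetic unlabeled rows, each feature sampled independently from its
    marginal distribution as implied by (one of) the aggregation tables."""
    columns = {}
    for i, f in enumerate(features):
        # Any pair table containing f gives its marginal; pick one arbitrarily.
        other = features[(i + 1) % len(features)]
        key = (f, other) if (f, other) in tables else (other, f)
        marginal = tables[key].groupby(f)["count_examples"].sum().clip(lower=0)
        probs = (marginal / marginal.sum()).to_numpy()
        columns[f] = rng.choice(marginal.index.to_numpy(), size=n_rows, p=probs)
    return pd.DataFrame(columns)

# These fake rows can then play the role of the unlabeled granular samples
# in the gradient estimation sketch above.
fake_samples = sample_fake_rows(tables, [f"f{i}" for i in range(19)], n_rows=100_000,
                                rng=np.random.default_rng(0))
```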

To conclude, we believe that this problem will still require much more work before a fully satisfying solution eventually emerges, and we are happy that this challenge successfully brought attention to it. We are also preparing to publicly release the whole dataset we used in this challenge, to allow researchers to experiment with it, and maybe with different ways of aggregating the data.

Videos of the challengers’ solutions

You can learn more about the details of the top challengers’ submissions on our YouTube channel.
