Row IDs leaking?! Detect it using Nearest Neighbors!

Laurae
Data Science & Design
4 min read · Sep 11, 2016

Laurae: This post is about a row ID leak that struck (openly) a competition 3 days before the end. The post was originally published at Kaggle; the context of the leak, described by fakeplastictrees, can be found there. Had the leak not been made available in the last days, two of the competitors would have been ensured a 100% win. A minor edit was made to the post (“uniform distribution” was added).

Summary: to detect a row ID leak, either use:

  • A cross-validated, supervised tree-based machine learning model (Random Forests, Extreme Gradient Boosting (gbtree), Gradient Boosted Trees…) with an appropriate performance metric (AUC, logloss…) => better-than-random performance reveals row ID leakage (“fastest” way; see the sketch after this list)
  • An unsupervised machine learning model based on distances (Nearest Neighbors, k-means Clustering…), after which you use your own knowledge to reach a conclusion => leakage or no leakage? (“safest” way)
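
As an illustration of the first (“fastest”) approach, here is a minimal sketch, assuming a hypothetical train.csv with a row_id column and a label column (file and column names are mine, not the competition’s): a cross-validated tree-based model trained on nothing but the row ID. A cross-validated score clearly better than a constant-prediction baseline means the row ID alone carries information, i.e. leakage.

```python
# Minimal sketch of the supervised check.
# Assumptions: a train.csv with `row_id` and `label` columns (hypothetical names).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")

# Use ONLY the row ID as a feature: any predictive power here is pure leakage.
X = train[["row_id"]]
y = train["label"]

model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
print("Cross-validated logloss:", -scores.mean())
# Compare against the logloss of always predicting the overall label frequencies;
# a clearly better score reveals a row ID leak.
```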

Snow Dog wrote:

I understand the leak, but why these duplicates are there in the first place? Leaks tend to be more ‘complex’ than that.

Leaks do not need to be complex to be leaks. Even a relative row ID is enough, given sufficient information:

  • There is a complete, unshuffled dataset in which the same occurrence is very likely to appear several times in a row
  • The train set is an extract of the complete data
  • The test set is an extract of the complete data

Suppose the following in the picture:

  • Rows with color (not black) are specific people
  • There are so many devices (let’s say 1,000+) that a run of identical devices is unlikely by chance
  • There are many different labels (let’s say 12, as in this competition)

Just by looking at these rows, you will notice:

  • There are many sequences of identical devices
  • These devices have identical labels

Remapping the categories by a “normalized” row ID, you get the numbers in parentheses. For instance:

  • Person 1 (ID 1 to ID 3) has Device 1 and is likely to have Label 1 (0–0.2 train -> 0–0.1 test)
  • Person 2 (ID 4 to ID 6) has Device 2 and is likely to have Label 2 (0.2–0.4 train -> 0.1–0.2 test)

As you go through the remapping of devices, you may end up with devices that have no matches, such as ID 9 or IDs 18–19–20 in the picture: simply ignore them for the leak, as they provide no information.

When the number of samples gets large enough, the relative difference (the “normalized row ID” difference) between the train set row ID and the test set row ID gets close to 0 (the difference tends to 0 as n tends to infinity).
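
As a toy illustration of that limit (all numbers below are made up), you can simulate an ordered “complete” dataset, split it at random into train and test extracts while preserving order, and check how close the normalized row IDs of neighboring rows end up:

```python
# Toy simulation: order-preserving random split of an ordered complete dataset,
# then compare normalized row IDs between the train and test extracts.
import numpy as np

rng = np.random.default_rng(0)
n_complete = 100_000
complete_ids = np.arange(n_complete)

# Random order-preserving split into train/test extracts (60% / 40%).
in_train = rng.random(n_complete) < 0.6
train_ids = complete_ids[in_train]
test_ids = complete_ids[~in_train]

# "Normalized" row ID = position within the extract divided by the extract size.
train_norm = np.arange(len(train_ids)) / len(train_ids)
test_norm = np.arange(len(test_ids)) / len(test_ids)

# For each test row, the normalized row ID of its closest train row (in the
# original ordering) is almost identical to its own normalized row ID.
nearest_in_train = np.searchsorted(train_ids, test_ids).clip(0, len(train_ids) - 1)
diff = np.abs(test_norm - train_norm[nearest_in_train])
print("mean normalized row ID difference:", diff.mean())  # tiny, shrinks as n grows
```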

See the picture of a nearest neighbor on Manhattan distance, trained on the train set and predicted on the test set, using (normalized row ID, brand, device). I changed the y axis to show the leak: it is nearly impossible to get such a thing if the complete data was initially shuffled. The distance of rows to each other (including brand and device) is only a relative 0.01482, i.e. a 1.482% row distance even when the brand/device distance is included (if there is no match, the distance literally explodes). Supposing 100,000 rows and 2,000 devices, the row distance alone is supposed to be around 0.01 (0.02/2), which makes our 0.01482 look stupidly high (and very unlikely!).
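
For concreteness, here is a hedged sketch of that unsupervised check, assuming hypothetical train.csv / test.csv files whose row_id, brand and device columns are already numeric codes (file names, column names and preprocessing are illustrative, not the competition’s):

```python
# Sketch of the unsupervised check: 1-nearest-neighbor on Manhattan distance
# over (normalized row ID, brand, device). File and column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

def features(df):
    # Normalized row ID in [0, 1], plus the (numeric) brand and device codes.
    return np.column_stack([
        df["row_id"].to_numpy() / df["row_id"].max(),
        df["brand"].to_numpy(),
        df["device"].to_numpy(),
    ])

nn = NearestNeighbors(n_neighbors=1, metric="manhattan").fit(features(train))
distances, indices = nn.kneighbors(features(test))

# With shuffled data, mismatching brand/device makes the distance explode;
# a tiny mean distance means every test row sits right next to a train row
# in (row ID, brand, device) space.
print("mean 1-NN Manhattan distance:", distances.mean())
```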

Why is 0.01482 unlikely? See this:

Let’s see how the train set and the test set correlate in distance by rowId, by predicting the unsupervised nearest neighbor against the test set. Are we able to correlate the reconstructed rowIds? We can try the 8 nearest neighbors.

What about the 128th column (the 128th nearest neighbor)? Clearly, the leak seems to disappear, as the RMSE is lower. If I didn’t forget my maths: assuming n = the number of columns (uniform distribution) and k = the nearest neighbor considered, the theoretical RMSE = 0.5kn => 0.5*2000*1 = 1000 RMSE for k=1, and 128,000 for k=128. An RMSE near the theoretical value is normal, a high RMSE indicates something wrong / non-random, and a low RMSE indicates you have missing information (= the closest is not the real closest anymore).
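
Continuing the sketch above (it reuses train, test and features() from the previous block, and the “theoretical” value is the rough 0.5*k*n formula with n = 2000 taken from the text), you can compute the RMSE between the test row IDs and the row IDs reconstructed from the k-th nearest neighbor:

```python
# Sketch of the RMSE check; reuses `train`, `test` and `features()` defined in
# the previous block. n = 2000 is the rough figure from the text above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=128, metric="manhattan").fit(features(train))
_, indices = nn.kneighbors(features(test))

train_row_id = train["row_id"].to_numpy()
test_row_id = test["row_id"].to_numpy()

for k in (1, 8, 128):
    reconstructed = train_row_id[indices[:, k - 1]]  # row ID of the k-th neighbor
    rmse = np.sqrt(np.mean((reconstructed - test_row_id) ** 2))
    print(f"k={k:>3}: RMSE = {rmse:.0f} (rough theoretical 0.5*k*2000 = {0.5 * k * 2000:.0f})")
# Read the result with the rule of thumb above: near-theoretical is normal,
# much higher or much lower deviations are the interesting signals.
```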

Also, if there were no specific ordering in the training set, it is fairly possible to see the following:

Re-mapped to the test set, it shows this… what do you get? It is unlikely to have the labels matching like this, very unlikely.
