Leakage fun: Statistical point of view of rows leaking

Laurae
Data Science & Design
Sep 11, 2016

Laurae: This post is about a row ID leak that (openly) struck a competition 3 days before its end. The post was originally published on Kaggle. The context of the leak can be found there, by fakeplastictrees. Had the leak not been made public in the last days, two of the competitors would have been guaranteed a win. The detailed explanation can be found here.

Summary: if the row ID is leaking information, you can detect it statistically by checking how often a value is identical to the value one row earlier (the same sequence lagged by 1 row), then comparing that count with what chance alone would produce, using the statistical test of your choice (Chi-Square, Fisher’s Exact Test, Wilks’ G²…).

happycube wrote:

Herra Huu wrote:

467, 5861119094982137442, M31, M29-31
468, -4377371438971174786, M31, M29-31
469, -4374356101743562672, M31, M29-31

Yup, that’s a different leak from the row/n_rows one. There’s something fundamentally wrong with the data generator here…

Some fun with statistics about that specific leak:

> library(data.table)  # provides fread()
> train <- as.data.frame(fread("gender_age_train.csv", header = TRUE, sep = ",", colClasses = c("character", "character", "numeric", "character")))
> # Compare the group of each row with the group of the row just before it
> summary(train$group[2:74645] == train$group[1:74644])
   Mode   FALSE    TRUE    NA's
logical   58128   16516       0

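Let’s say… “that’s 3.5 to 1 when you should get 11 to 1”… yep, we have 16516 rows whose group is identical to that of the row just after. With 12 groups assumed uniformly distributed and no leak, only about 1 pair of neighbouring rows in 12 should match by chance, i.e. roughly 74644 / 12 ≈ 6220 pairs: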

            FALSE    TRUE
Actual      58128   16516
Predicted   68424    6220
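
The "Predicted" row can be rebuilt in a few lines of R. This is only a sketch: it assumes, as the 11-to-1 figure does, that the 12 groups are uniformly distributed and that neighbouring rows are independent. All variable names are illustrative.

```r
# Number of neighbouring-row comparisons in a 74645-row train set
n_pairs <- 74645 - 1

# Assumption: 12 equally likely groups and independent rows, so two
# neighbouring rows share a group with probability 1/12
p_match <- 1 / 12

expected_true  <- round(n_pairs * p_match)  # neighbours expected to match
expected_false <- n_pairs - expected_true   # neighbours expected to differ

observed  <- c("FALSE" = 58128, "TRUE" = 16516)
predicted <- c("FALSE" = expected_false, "TRUE" = expected_true)

# Odds of "different" versus "identical" neighbours
observed_odds  <- unname(observed["FALSE"] / observed["TRUE"])
predicted_odds <- unname(predicted["FALSE"] / predicted["TRUE"])

print(rbind(Actual = observed, Predicted = predicted))
print(round(c(observed = observed_odds, predicted = predicted_odds), 1))
```

If the groups are not uniform, the chance of a match is the sum of the squared group frequencies, which can only be higher than 1/12. With the data loaded, that could be estimated as `sum(prop.table(table(train$group))^2)`.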

Let’s do contingency matrix magic.

Test of independence between the rows and the columns (Chi-square):

Chi-square (Observed value) 5500.2052
Chi-square (Critical value) 3.8415
DF 1
p-value < 0.0001

Test of independence between the rows and the columns (Chi-square with Yates’ continuity correction):

Chi-square (Observed value) 5499.1369
Chi-square (Critical value) 3.8415
DF 1
p-value < 0.0001

Fisher’s exact test:

p-value (Two-tailed)        < 0.0001

Test of independence between the rows and the columns (Wilks’ G²):

Wilks' G² (Observed value)  5675.2140
Wilks' G² (Critical value) 3.8415
DF 1
p-value < 0.0001

Test of independence between the rows and the columns (Monte Carlo method / Number of simulations = 5000):

Chi-square (Observed value) 5500.2052
Chi-square (Critical value) 3.8335
DF 1
p-value < 0.0001
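
The figures above can be approximately reproduced in base R from the 2x2 table alone. This is a hedged sketch, not necessarily the tool used for the original output; the object names are illustrative, and Wilks' G² is computed by hand because base R has no dedicated function for it.

```r
# 2x2 table from the post: observed counts vs counts predicted under 1/12
counts <- matrix(c(58128, 16516,
                   68424,  6220),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Actual", "Predicted"), c("FALSE", "TRUE")))

# Pearson chi-square, without and with Yates' continuity correction
chi_plain <- chisq.test(counts, correct = FALSE)
chi_yates <- chisq.test(counts, correct = TRUE)

# Fisher's exact test (confidence interval skipped, only the p-value is used)
fisher <- fisher.test(counts, conf.int = FALSE)

# Wilks' G^2, computed by hand from the chi-square expected counts
expected <- chi_plain$expected
g2   <- 2 * sum(counts * log(counts / expected))
g2_p <- pchisq(g2, df = 1, lower.tail = FALSE)

# Critical value at the 5% level with 1 degree of freedom
critical <- qchisq(0.95, df = 1)

# Monte Carlo p-value from 5000 simulated tables. R reports
# (1 + number of extreme tables) / (B + 1), so it cannot go below 1/5001.
set.seed(1)
chi_mc <- chisq.test(counts, simulate.p.value = TRUE, B = 5000)

print(round(c(chi_square = unname(chi_plain$statistic),
              yates      = unname(chi_yates$statistic),
              g2         = g2,
              critical   = critical), 4))
print(c(fisher_p = fisher$p.value, g2_p = g2_p, mc_p = chi_mc$p.value))
```

An alternative reading, not the one used in the post, is to treat the predicted counts as a fixed hypothesis and run a goodness-of-fit test, e.g. `chisq.test(c(58128, 16516), p = c(11/12, 1/12))`.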

Even more fun with the length of identical sequences using run length encoding:

> summary(rle(train$group[2:74645] == train$group[1:74644])$lengths)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.284   1.000  17.000
> summary(rle(train$age)$lengths)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.204   1.000  25.000
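
To make the two `rle()` calls easier to read, here is a toy illustration on made-up labels (not the competition data). It shows the difference between run lengths taken on the values themselves, as for the age, and on the lagged comparison, as for the group.

```r
# Toy labels standing in for train$group (made-up values)
labels <- c("a", "a", "a", "b", "c", "c")

# Runs of identical labels: a a a | b | c c
label_runs <- rle(labels)

# Same lagged comparison as in the post, applied to the toy vector
same_as_previous <- labels[-1] == labels[-length(labels)]

# Runs in the comparison vector: TRUE TRUE | FALSE FALSE | TRUE
match_runs <- rle(same_as_previous)

print(label_runs$lengths)  # 3 1 2
print(match_runs$lengths)  # 2 2 1
print(match_runs$values)   # TRUE FALSE TRUE
```

On the comparison vector, a run of k `TRUE` values corresponds to k + 1 identical labels in a row, and runs of `FALSE` are counted too. To look only at streaks of matches, the lengths can be filtered with `match_runs$lengths[match_runs$values]`.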

For the age, I don’t know what to say: having 25 identical values in a row is odd but possible (some ages repeat far more often than others). However, 17 times the same label in a row… OK, (1/12)^17 ≈ 4.5e-19 is possible, but even multiplied by the 74645 observations that is still a probability of only about 3.4e-14, assuming a uniform 1/12 distribution…
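
The two figures quoted above can be checked with a short calculation. This sketch assumes a uniform 1/12 distribution over the groups and independent rows; the variable names are illustrative.

```r
n_groups   <- 12     # assumption: labels uniform over 12 groups
run_length <- 17     # longest run reported by rle()
n_rows     <- 74645  # rows in the train set

# Probability that 17 given rows all carry one specific label
p_run <- (1 / n_groups)^run_length

# Rough bound over every possible starting row (a union bound, so it
# overstates rather than understates the chance for that label)
p_anywhere <- p_run * n_rows

print(signif(c(p_run = p_run, p_anywhere = p_anywhere), 2))
```

Allowing the run to be of any label rather than one specific label multiplies both numbers by 12, which leaves them vanishingly small.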

Herra Huu wrote:

And if you multiply that by the probability of having exactly the same phone brand/model not just group, it would make those numbers much, much smaller.

Device leakage for dummies, in picture form: an example of an imaginary original set (train + test set), the train set itself, and the test set itself.
