Data Science vs Competitions’ Public LB beginner: Bias & Variance

Laurae
Data Science & Design
6 min read · Oct 13, 2016

Laurae: This post is about the pitfall of assuming that "good on the Public LB = good on the Private LB". It is a particularly good way to show why you should not follow your Public LB improvements alone when your local Cross-Validation does not rightfully move in the same direction (e.g. Santander). The post was originally published at Kaggle.

This post is divided into four sections:

  • Detecting that someone else has overfitted is nearly impossible
  • Checking the facts you state publicly before publishing them
  • Subsampling due to leakage does not matter when looking at absolute rank, as the leaked samples move in the same direction for "everyone"
  • Bias & Variance effect control on your models

Section 1: Detecting that someone else has overfitted is nearly impossible

Mohd Shadab Alam wrote:

So if only a bunch of favorable observations (based on selected training set) has fallen on the public side, then we are just over-fitting by using preferential training sets. I hope my point will become more clear after 30 hours…

The point you have been trying to make for hours is the following: "everyone in the public LB is overfitting" (either at the top of the public LB, or everyone).

Now, suppose the following scenarios in a competition mindset:

Scenario 1

  • Leak/Non-leak public: 100,000/100,000
  • Leak/Non-leak private: 100,000/100,000
  • AUC of the leak alone on the public LB = 0.90
  • AUC of the model with the leak on the public LB = 0.95
  • Total model uplift = 50%, because the score range is not [0, 1] but [0.90, 1.00]
  • Private LB leak: 0.80, but who cares? (unless you used a different leak model)
  • Private LB non-leak: 0.90, which is what we care about (we are there to get the highest uplift on the Private LB)
  • Final private LB score: 0.95 (75% uplift)

Scenario 2

  • Leak/Non-leak public: 80,000/20,000
  • Leak/Non-leak private: 300,000/50,000
  • AUC of the leak alone on the public LB = 0.90
  • AUC of the model with the leak on the public LB = 0.95
  • Total model uplift = 50%, because the score range is not [0, 1] but [0.90, 1.00]
  • Private LB leak: 0.80, who cares? (unless you used a different leak model)
  • Private LB non-leak: 0.90, which is what we care about (we are there to get the highest uplift on the Private LB)
  • Final private LB score: 0.92 (60% uplift; see the sketch after this list)
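A minimal sketch of how these uplift percentages can be reproduced. The formula, uplift = (score − baseline) / (max − baseline), is inferred from the numbers above rather than stated explicitly in the original:

```python
# Hedged sketch: "uplift" is read here as the share of the remaining headroom
# that the model recovers, i.e. (score - baseline) / (max_score - baseline).
def uplift(score: float, baseline: float, max_score: float = 1.0) -> float:
    return (score - baseline) / (max_score - baseline)

# Public LB, both scenarios: leak alone 0.90, model 0.95 -> ~50% uplift
print(round(uplift(0.95, 0.90), 2))  # 0.5
# Scenario 1, private LB: leak 0.80, final 0.95 -> ~75% uplift
print(round(uplift(0.95, 0.80), 2))  # 0.75
# Scenario 2, private LB: leak 0.80, final 0.92 -> ~60% uplift
print(round(uplift(0.92, 0.80), 2))  # 0.6
```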

The question:

Can you tell us which scenario is overfitting more than the other? No one can tell: both models give the exact same Public LB score. You can only say that one potentially has higher variance than the other, given the same scenario conditions. But when the higher variance hits, it hits everyone (cf. Santander). That is where local validation comes into play: the AUC can change, but it changes for everyone (and you must minimize that variance). How badly you are affected depends on your validation, on how you built your model (and on how you used the leak, obviously).

Section 2: Checking the facts you state publicly before publishing them

Context: the people_ids with "hundreds of records" were supposed to far outnumber those with fewer than a hundred records.

Mohd Shadab Alam wrote:

There are many cases where hundreds of records belong to a particular people_id.

Train:

  • people_ids with more than 100 records: 2,307 (580,366 records)
  • people_ids with 100 records or fewer: 148,988 (1,616,925 records)

Test:

  • people_ids with more than 100 records: 563 (94,764 records)
  • people_ids with 100 records or fewer: 37,260 (403,923 records)
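These counts are easy to verify locally. A hedged sketch with pandas, assuming the competition's activity files expose a people_id column (the file names below are assumptions):

```python
import pandas as pd

# Assumed file and column names; adjust to the actual competition files.
train = pd.read_csv("act_train.csv")
test = pd.read_csv("act_test.csv")

def records_by_people_frequency(df: pd.DataFrame, threshold: int = 100) -> dict:
    # Number of records for each people_id
    counts = df["people_id"].value_counts()
    heavy = counts[counts > threshold]    # people_ids appearing more than `threshold` times
    light = counts[counts <= threshold]   # people_ids appearing at most `threshold` times
    return {
        "people_ids_over_threshold": len(heavy),
        "records_over_threshold": int(heavy.sum()),
        "people_ids_under_threshold": len(light),
        "records_under_threshold": int(light.sum()),
    }

print(records_by_people_frequency(train))
print(records_by_people_frequency(test))
```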

??? The numbers say the exact opposite: people_ids with 100 records or fewer dominate by far, both in number of people and in number of records.

Section 3: Subsampling due to leakage does not matter when looking at absolute rank, as the leaked samples move in the same direction for "everyone"

Mohd Shadab Alam wrote:

Then we selectively pick few groups and assign them labels. This selective picking and extending has resulted in 69k unlabeled observations. (…) So if only a bunch of favorable observations (based on selected training set) has fallen on the public side, then we are just over-fitting by using preferential training sets.

There is zero selective picking/extending (also: what is your definition of a "preferential training set"?). It relies purely on the following assumptions:

  • The leak is always right and we do not correct it (400k samples or so)
  • When the leak does not know (NA in the first raw output from Loiso's script), we suppose it is wrong and we must correct those samples (70k or so)

If we loosen the first assumption as much as possible, it becomes:

  • The leak is right on perfect predictions and we do not correct it (350k samples or so)
  • We suppose the leak is wrong when it cannot make perfect predictions, and we must correct those samples (80k or so?)
  • When the leak does not know (NA in the first raw output from Loiso's script), we suppose it is wrong and we must correct those samples (70k or so)

So what we did:

  1. A simple linear interpolation for the leak (what we suppose is right)
  2. A machine learning model for the non-leak part (what we suppose is wrong)
  3. Combine 1 and 2 (to optimize predictions for the best performance)

Steps 1, 2, and 3 are each yours to use the way you want (a sketch of the combination follows below). But saying that because we use 1+2+3 we are all overfitting (either at the top of the LB or in general) is severely wrong, as we have no clue about which observations fall in the public LB and which fall in the private LB.
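A hedged sketch of steps 1 to 3, not the authors' exact code. The column names (group, date, outcome_leak, model_pred) are assumptions, and the machine learning predictions are taken as already computed:

```python
import numpy as np
import pandas as pd

# Hedged sketch. Assumed columns: `group` is the leak key, `date` orders the
# rows, `outcome_leak` holds the leaked outcomes (NaN where the leak does not
# know), and `model_pred` holds the ML model's probabilities for every row.
def combine_leak_and_model(test: pd.DataFrame) -> np.ndarray:
    # Step 1: simple linear interpolation of the leaked outcome within each group.
    leak = (
        test.sort_values("date")
            .groupby("group")["outcome_leak"]
            .transform(lambda s: s.interpolate(method="linear"))
            .reindex(test.index)
    )
    # Step 2 is the machine learning model; here we simply reuse its
    # precomputed predictions stored in `model_pred`.
    model = test["model_pred"].to_numpy()
    # Step 3: combine -- trust the leak where it gives a value, the model elsewhere.
    return np.where(leak.notna(), leak, model).astype(float)
```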

Mohd Shadab Alam wrote:

So if only a bunch of favorable observations (based on selected training set) has fallen on the public side, then we are just over-fitting by using preferential training sets.

Do you never validate your models locally to check whether what you see locally is what you get externally?

Which brings us to this quote…

Mohd Shadab Alam wrote:

I can see that models are build on a lot of assumptions . I am expecting very huge shakeup (even if scores from different models look similar, AUC will change drastically when about 70% predictions from 69k will be included for calculating the private LB)

Cf. the three quotes above and their answers, plus all the answers you have already received. "Drastic" changes (AUC bias) are possible; a shake-up (AUC variance) happens if you decide to overfit without validating locally. Cf. Santander for an example where both bias (which no one cares about) and variance (which we do care about) happened.

Section 4: Bias & Variance effect control on your models

Mohd Shadab Alam wrote:

The presence and utilization of leak has unbalanced the private/public split. Hence neither private set nor public set is the true representation of population. [2]

Moreover the split on people_id contributes to this issue. [1]

A few are of the view that since everyone has utilized leak there will be equal rise/fall. But as we will be effectively judged only on very limited sample people with same score on public LB can have a large score difference on private LB which will result in a shakeup. [3]

  1. The split on people_id potentially creates that "issue" of a small sample (the public set or the private set) being unrepresentative of the population (the train set), but this issue can be alleviated through validation. The Public/Private LB is split either randomly or by people_id; in both cases, an appropriate validation scheme fixes the issue (see the sketch after this list).
  2. The presence of the leak clearly has no bearing on the balance of the public/private LB split. Samples do not magically change because there is a leak, do they?
  3. A large score difference (bias) != a large standard deviation (variance). If you overfit the public LB, you will run into that large variance issue on the private LB (cf. Santander, where only about 10% of the top 100 remained in the top 100 while everyone had about the same AUC bias reduction).
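A minimal sketch of what "appropriate validation" can look like, assuming scikit-learn and a people_id grouping (nothing here is from the original post):

```python
from sklearn.model_selection import GroupKFold, KFold

# Hedged sketch: mirror the Public/Private split in your local validation.
# If the LB is split by people_id, validate with GroupKFold on people_id so
# that no person appears in both the training and the validation folds;
# if the split is random, a plain shuffled KFold is enough.
def make_cv(split_by_people: bool, n_splits: int = 5):
    if split_by_people:
        return GroupKFold(n_splits=n_splits)
    return KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Hypothetical usage, with X, y and a `groups` array of people_id values:
# cv = make_cv(split_by_people=True)
# for train_idx, valid_idx in cv.split(X, y, groups=groups):
#     ...fit on train_idx, compute AUC on valid_idx...
```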

I think the following makes our reasoning clearer:

  • A difference in population potentially brings bias (primary effect) and variance (secondary effect)
  • You cannot control the bias (the "mean", or whatever you use to describe the population)
  • You can control the variance (the "standard deviation", or whatever you use to describe the population)
  • Bias has no effect on the ranking order (if you move [0.98, 0.96, 0.959] down by 0.20, they keep the same order)
  • Variance has an effect on the ranking order (if you compute [0.98+N(0, 0.01), 0.96+N(0, 0.02), 0.959+N(0, 0.03)], with N(mean, std) the normal distribution, you may not get the same order; see the simulation below)
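A small simulation makes the last two bullets concrete. The scores and noise levels come from the bullets above; everything else is an assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = np.array([0.98, 0.96, 0.959])

# Bias: shifting every score by the same constant never changes the ranking.
shifted = scores - 0.20
assert np.array_equal(np.argsort(-scores), np.argsort(-shifted))

# Variance: independent noise of different magnitudes can reorder the scores.
trials, flips = 10_000, 0
for _ in range(trials):
    noisy = scores + rng.normal(0.0, [0.01, 0.02, 0.03])
    if not np.array_equal(np.argsort(-noisy), np.argsort(-scores)):
        flips += 1
print(f"Ranking changed in {flips / trials:.1%} of simulations")
```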
