Testing 99%+ certainty Association Rules vs xgboost on noisy data: why a 3500:1 ratio loses

Laurae
Data Science & Design
5 min read · Aug 23, 2016

Laurae: This post is about what to do with association rules that seem to have over 99% certainty when you already know you are working with a very noisy training set. Not only should you not apply them manually, you should let your usual models do the work for you. Otherwise, you end up poor. Even with a 3500:1 ratio, backed by 14546+ samples, you still end up the loser by far. The post was originally published on Kaggle.

Why poor? Here is an example of what overfitting will do to you:

AdmiralWen wrote:

In all honesty, even if there is a shakeup as a result of these hard-coded rules, the impact will be small. If you count the number of samples these rules apply to in the test set, it is between 2000–4000 out of 70000+. The only risk factor is whether the Kaggle admins made an intentional effort to skew the 50/50 split used in calculating the leaderboard in some way. My feeling is that the split should be completely random, and if it is, the univariate statistical distributions of the features should be roughly equal in both halves; as a result, rules that work well in one half should work well in the other.

Edit: Also, I don’t think the Kaggle admins would have had the insight into the data at the start of the competition to skew it in a meaningful way. It has taken months of collective eyeballs and analysis to develop these scripts. This assumes the 50/50 split is determined at the start and never changes.

No, the impact is huge.

Take 5 or 6 false negatives at 0.00 probability (the lowest rank) and they will cost you a lot of ROC. The higher your ROC, the bigger the impact of a false negative at an extreme rank. This is because you are optimizing the area under the ROC curve, which is based on ranking probabilities.
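To make this concrete, here is a minimal sketch of the effect. The class sizes and score distributions below are invented for illustration (this is not the competition data); only the mechanism matters: a few positives hard-coded to the bottom rank eat straight into a ~0.85 AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n_pos, n_neg = 3000, 67000  # assumed sizes, loosely mimicking a 70000+ sample test set

y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
# Synthetic model scores: positives score higher on average (AUC lands near 0.85)
y_score = np.concatenate([rng.normal(1.5, 1.0, n_pos),
                          rng.normal(0.0, 1.0, n_neg)])
print(f"Baseline AUC: {roc_auc_score(y_true, y_score):.7f}")

# Hard-code a "rule": force k positives (false negatives) to the lowest rank,
# as if we had asserted probability 0.00 for them.
for k in (1, 2, 4, 8, 16):
    broken = y_score.copy()
    broken[:k] = y_score.min() - 100.0  # below every negative in the ranking
    print(f"{k:2d} false negatives at the bottom -> AUC = {roc_auc_score(y_true, broken):.7f}")
```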

Imagine you found an area containing 14509 negatives and only 37 positives (yes, it exists in this data set). Let’s kick away the 37 positives, and imagine your resulting ROC is 0.8470657 (yes, that would be really good if it were true).

For these 14509 negatives you were sure about at 99.74%+ certainty, let’s look at how the ROC fares as N, the number of false negatives, goes from 0 to 37, each setting simulated 1000 times against the provided baseline (each row shows the mean ROC ± standard deviation, the 95% interval in brackets, and the raw and adjusted false-negative rates; a sketch of this kind of simulation follows the table):

Current ROC: 0.8470657 (+) | from Simulation: 0.8470953±0.0031873 [0.8376366, 0.8585345] | Falses in rule (Current | Boot):
00(++) => ROC = 0.8480724±0.0031332 [0.8388510, 0.8594306] (0.000% | Adj: 0.000%)
01(++) => ROC = 0.8478175±0.0031281 [0.8383498, 0.8589155] (0.007% | Adj: 0.169%)
02(++) => ROC = 0.8475608±0.0031465 [0.8378493, 0.8584011] (0.014% | Adj: 0.339%)
03(++) => ROC = 0.8473098±0.0031603 [0.8373494, 0.8578873] (0.021% | Adj: 0.508%)
04(--) => ROC = 0.8470565±0.0031614 [0.8368502, 0.8578873] (0.028% | Adj: 0.678%)
05(--) => ROC = 0.8468185±0.0031644 [0.8368502, 0.8573743] (0.034% | Adj: 0.847%)
06(--) => ROC = 0.8465624±0.0031702 [0.8363517, 0.8568618] (0.041% | Adj: 1.017%)
07(--) => ROC = 0.8463014±0.0031780 [0.8363517, 0.8563501] (0.048% | Adj: 1.186%)
08(--) => ROC = 0.8460340±0.0031806 [0.8358538, 0.8560259] (0.055% | Adj: 1.356%)
09(--) => ROC = 0.8457689±0.0031969 [0.8354073, 0.8560259] (0.062% | Adj: 1.525%)
10(--) => ROC = 0.8454995±0.0032071 [0.8349107, 0.8555139] (0.069% | Adj: 1.695%)
11(--) => ROC = 0.8452594±0.0032142 [0.8349107, 0.8550026] (0.076% | Adj: 1.864%)
12(--) => ROC = 0.8450205±0.0032221 [0.8349107, 0.8550026] (0.083% | Adj: 2.034%)
13(--) => ROC = 0.8447645±0.0032139 [0.8344147, 0.8550026] (0.090% | Adj: 2.203%)
14(--) => ROC = 0.8445093±0.0032218 [0.8344147, 0.8547615] (0.096% | Adj: 2.372%)
15(--) => ROC = 0.8442498±0.0032416 [0.8344147, 0.8544920] (0.103% | Adj: 2.542%)
16(--) => ROC = 0.8439994±0.0032480 [0.8343640, 0.8539820] (0.110% | Adj: 2.711%)
17(--) => ROC = 0.8437473±0.0032672 [0.8338687, 0.8537420] (0.117% | Adj: 2.881%)
18(--) => ROC = 0.8434952±0.0032584 [0.8333741, 0.8537420] (0.124% | Adj: 3.050%)
19(--) => ROC = 0.8432343±0.0032685 [0.8328801, 0.8532332] (0.131% | Adj: 3.220%)
20(--) => ROC = 0.8429691±0.0032798 [0.8328801, 0.8527251] (0.138% | Adj: 3.389%)
21(--) => ROC = 0.8427066±0.0032981 [0.8324372, 0.8527251] (0.145% | Adj: 3.559%)
22(--) => ROC = 0.8424513±0.0033064 [0.8323868, 0.8522177] (0.152% | Adj: 3.728%)
23(--) => ROC = 0.8422086±0.0033187 [0.8318940, 0.8517110] (0.159% | Adj: 3.898%)
24(--) => ROC = 0.8419510±0.0033109 [0.8318940, 0.8512694] (0.165% | Adj: 4.067%)
25(--) => ROC = 0.8416907±0.0033121 [0.8314523, 0.8509851] (0.172% | Adj: 4.237%)
26(--) => ROC = 0.8414400±0.0033238 [0.8314020, 0.8507650] (0.179% | Adj: 4.406%)
27(--) => ROC = 0.8411875±0.0033277 [0.8314020, 0.8507650] (0.186% | Adj: 4.575%)
28(--) => ROC = 0.8409297±0.0033426 [0.8309608, 0.8506994] (0.193% | Adj: 4.745%)
29(--) => ROC = 0.8406780±0.0033493 [0.8304700, 0.8501946] (0.200% | Adj: 4.914%)
30(--) => ROC = 0.8404285±0.0033600 [0.8304700, 0.8501946] (0.207% | Adj: 5.084%)
31(--) => ROC = 0.8401842±0.0033739 [0.8299798, 0.8501946] (0.214% | Adj: 5.253%)
32(--) => ROC = 0.8399355±0.0033878 [0.8294902, 0.8501946] (0.221% | Adj: 5.423%)
33(--) => ROC = 0.8396845±0.0033940 [0.8294902, 0.8500457] (0.227% | Adj: 5.592%)
34(--) => ROC = 0.8394416±0.0034063 [0.8290013, 0.8500457] (0.234% | Adj: 5.762%)
35(--) => ROC = 0.8391875±0.0034158 [0.8289512, 0.8500457] (0.241% | Adj: 5.931%)
36(--) => ROC = 0.8389394±0.0034127 [0.8289512, 0.8495387] (0.248% | Adj: 6.101%)
37(--) => ROC = 0.8386791±0.0034275 [0.8284630, 0.8491870] (0.255% | Adj: 6.270%)
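For reference, here is a hedged reconstruction of the kind of simulation that produces a table like this one. The base scores are synthetic (the real ones came from an actual model on the competition data), but the mechanics follow the description above: a rule region of 14509 negatives plus N false negatives is hard-coded to the bottom rank, and the ROC is bootstrapped (the post used 1000 resamples; 200 are used here to keep the runtime down).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2016)
n_pos, n_neg = 3000, 55000           # assumed sizes for the rest of the test set
rule_negatives = 14509               # the rule region from the post

# Synthetic base model scores outside the rule region
y_base = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
s_base = np.concatenate([rng.normal(1.5, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
floor = s_base.min() - 100.0         # "probability 0.00": below every other score

for n_false in (0, 1, 2, 3, 4, 37):  # number of positives wrongly caught by the rule
    y = np.concatenate([y_base, np.zeros(rule_negatives), np.ones(n_false)])
    s = np.concatenate([s_base, np.full(rule_negatives + n_false, floor)])
    aucs = []
    for _ in range(200):             # bootstrap the test set with replacement
        idx = rng.integers(0, len(y), len(y))
        aucs.append(roc_auc_score(y[idx], s[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    print(f"{n_false:02d} => ROC = {np.mean(aucs):.7f}±{np.std(aucs):.7f} "
          f"[{lo:.7f}, {hi:.7f}]")
```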

How do you lose your entire ROC gain? In just 4 false negatives. And those 4 errors were offset by 14505 true negatives you were sure about (a ratio of 3626.25:1)! And once you hit the bottom of the table, at all 37 false negatives, you have lost about 0.01 ROC.
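To see why so few errors wipe out the gain, it helps to write the AUC in its pairwise (Mann-Whitney) form, with P positives and N negatives in the test set; the per-error cost below is read off the table above, not recomputed from the real data:

```latex
\mathrm{AUC} \;=\; \frac{1}{P\,N}\sum_{i=1}^{P}\sum_{j=1}^{N}\mathbf{1}\!\left[s_i > s_j\right]
```

A positive hard-coded to probability 0.00 loses every pair it used to win, so if it previously outranked a fraction f of the negatives, the AUC drops by f/P. In the table, each false negative costs about 0.8480724 - 0.8478175 ≈ 0.00025 AUC, so the fourth one already erases the 0.8480724 - 0.8470657 ≈ 0.0010 that the rule gained over the baseline.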

Now imagine you were dealing with only 2000 samples (with 2000, your baseline would be a bit higher than 0.843). You would lose everything (if not more) at a false-negative rate of just 0.055%, which is roughly a single false negative among 2000 samples, measured against the base model before you set any hard rules. Hence, at the very first false negative, your ROC tanks immediately (if your base model was at 0.847; if it was at 0.842, it would still tank, just a bit less).

Just as a reminder:

Current ROC: 0.8470657 (+) | from Simulation: 0.8470953±0.0031873 [0.8376366, 0.8585345] | Falses in rule (Current | Boot):
00(++) => ROC = 0.8480724±0.0031332 [0.8388510, 0.8594306] (0.000% | Adj: 0.000%)
01(++) => ROC = 0.8478175±0.0031281 [0.8383498, 0.8589155] (0.007% | Adj: 0.169%)
02(++) => ROC = 0.8475608±0.0031465 [0.8378493, 0.8584011] (0.014% | Adj: 0.339%)
03(++) => ROC = 0.8473098±0.0031603 [0.8373494, 0.8578873] (0.021% | Adj: 0.508%)
04(--) => ROC = 0.8470565±0.0031614 [0.8368502, 0.8578873] (0.028% | Adj: 0.678%)

All your rules must be right; if they are right only 99.322% of the time on the adjusted scale (at finding non-positives, i.e., ~99.972% raw accuracy), you have a 50/50 chance to drop below your baseline!
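That 50/50 reading can be checked with a back-of-the-envelope calculation, assuming the bootstrap distribution at 4 false negatives is roughly normal:

```python
from scipy.stats import norm

# At 4 false negatives: simulated ROC = 0.8470565 ± 0.0031614, current ROC = 0.8470657.
# Probability a bootstrap draw lands below the current ROC, i.e. the rule hurts you:
print(norm.cdf((0.8470657 - 0.8470565) / 0.0031614))  # ≈ 0.501, a coin flip
```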

N.B.: the confidence intervals are 95%, and since the public leaderboard is only a 50% sample, that bodes nothing good for the private LB :)
