Why under-predicting or over-predicting might not be an issue?

Laurae
Data Science & Design
Nov 13, 2016

Laurae: This post is about the relationship between over-predicting/under-predicting and the performance metric you are optimizing. It takes the example of the Matthews correlation coefficient (MCC). The post originally appeared on Kaggle.

DavidGbodiOdaibo wrote:

(…)

We have an almost identical number of train and test records, so one can assume that if the train/test split was random, your models should predict ~6000 failures in test as well, and not the 1800 that give the optimal LB rank. Why isn't anyone concerned about this? This is just more evidence for a massive shakeup.

Because that is how MCC works: since we have so many true negatives, the only quantities left to optimize are the two diagonal products of the confusion matrix. If FP or FN is very small, then the other is larger, but their product (which is subtracted in the MCC numerator) is smaller than if you balanced FP and FN (think of 0 x 1000 < 500 x 500). For most of us (if not predicting a lot of positives), the score depends on how many true positives you have versus how many positives you predict. It is a chase for maximum true-positive accuracy within your positive predictions, while also maximizing the number of positive predictions (the two objectives to chase and to balance appropriately).
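
As a reminder (a sketch of the reasoning, not part of the original thread): MCC is computed from the four confusion-matrix cells, and when TN dwarfs TP, FP, and FN, as it does here, it approximately reduces to the geometric mean of precision and recall:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \approx \frac{TP}{\sqrt{(TP+FP)(TP+FN)}} = \sqrt{\mathrm{precision} \times \mathrm{recall}}$$

This is why the chase reduces to precision on your positive calls versus how many of the actual positives you manage to catch.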

If your features are tuned for a large number of positives, you may be able to do as well as when under-predicting the positive class. For instance, I have one submission with 5,500+ positives, and yet it scores 0.4+ on the LB (for MCC, it is about the same as just swapping FP and FN, which mathematically changes nearly nothing: just bigger numbers, making improvement more difficult). You can do as well by predicting only 2,000 positives and get 0.4+ on the LB too. Obviously, if you predict 5,500+ positives, you lower the risk you take, but reaching that many positives with a high score is not trivial, as you need to catch most of them.
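
A minimal sketch of that effect, using made-up confusion-matrix counts (the 1,000,000-row test set and the exact splits below are illustrative assumptions, not the competition's numbers; only the ~6,000 actual positives figure comes from the quoted question):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from raw confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical test set: 1,000,000 rows with ~6,000 actual positives (assumed, for illustration).
actual_pos, total = 6_000, 1_000_000

# Under-predicting: 2,000 predicted positives, 75% of them true positives.
tp_a, fp_a = 1_500, 500
fn_a, tn_a = actual_pos - tp_a, total - actual_pos - fp_a

# Over-predicting: 5,500 predicted positives, a lower share of them true positives.
tp_b, fp_b = 2_500, 3_000
fn_b, tn_b = actual_pos - tp_b, total - actual_pos - fp_b

print(f"2,000 predicted positives: MCC = {mcc(tp_a, fp_a, fn_a, tn_a):.3f}")  # ~0.43
print(f"5,500 predicted positives: MCC = {mcc(tp_b, fp_b, fn_b, tn_b):.3f}")  # ~0.43
```

Both hypothetical submissions land near 0.43 even though one predicts almost three times as many positives as the other: what matters to MCC is roughly the trade-off between precision and recall, not matching the true number of positives.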

The difference post-bias (i.e. after subtracting the bias between the public and private LB) should be around 0.005–0.006, as that seems to be the standard deviation for 20% of the data at around 0.5 MCC (so for 30% it is even smaller).

This one will never be like the Santander one, unless the private LB is really very different (still a possibility, but a very low one). Then again, this competition does not seem like the Seizure one with only 3 patients.

Remember also that while you may picture MCC improvements as linear, the improvement you need is quadratic (because the true negative count is so large). Going from 0.5 to 0.6 is significantly harder than going from 0.2 to 0.3 (approximately 2.2x more difficult). Therefore, the higher your MCC, the harder it becomes to see linear improvements in the MCC value compared to when you were at a low MCC. Add to that how hard it becomes to improve once so many good features have already been engineered, and you hit a wall that makes the curve look like a real sigmoid (from a human perspective of performance improvement against time).
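
One way to arrive at the 2.2x figure, assuming the effort required scales with the precision x recall product (i.e. with MCC squared, per the approximation above):

$$\frac{0.6^2 - 0.5^2}{0.3^2 - 0.2^2} = \frac{0.36 - 0.25}{0.09 - 0.04} = \frac{0.11}{0.05} = 2.2$$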

Therefore, if you have 75% true positives among your 2,000 positive predictions, it is clearly not trivial to go to 7,500 positive predictions with 93% true positives (about the same score), let alone to improve beyond 93%: it should be easier to improve from 75% than from 93%.
