Extremely Imbalanced Data: A Data Scientist's Worst Nightmare
And the Accuracy Trap
We say that data is imbalanced when one of the target variable's classes occurs much less often than the other(s). A common example is cancer detection data. If we have 10,000 lab results and only 1% of them are positive for cancer, our data is extremely imbalanced.
The accuracy trap
If we train a model on this extremely imbalanced data, it can reach 99% accuracy without learning anything useful. Why?
Because even the simplest possible model, one that classifies every entry as “not cancer”, is still correct 99% of the time, since the relative frequency of “cancer” is only 1%. That same model never detects a single cancer case, so its high accuracy tells us nothing about how well it does the job we care about.
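To make the trap concrete, here is a minimal sketch in plain Python. It assumes the 10,000 lab results and 1% positive rate from the example above; the labels are synthetic and the "model" is just a constant prediction, not a real classifier. Recall (the share of actual cancer cases the model catches) is added here only as a contrast to accuracy.

```python
# Synthetic labels matching the example: 10,000 results, 1% positive.
# 1 = "cancer", 0 = "not cancer". These are illustrative, not real data.
n_total = 10_000
n_positive = 100
y_true = [1] * n_positive + [0] * (n_total - n_positive)

# The simplest possible "model": always predict the majority class.
y_pred = [0] * n_total

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall: share of actual "cancer" cases that were predicted as "cancer".
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_positive

print(f"accuracy = {accuracy:.2%}")  # accuracy = 99.00%
print(f"recall   = {recall:.2%}")    # recall   = 0.00%
```

The model scores 99% accuracy while missing every one of the 100 positive cases, which is why accuracy alone is a misleading metric on data like this.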
We need to apply data balancing techniques
→ By balancing the data, we give our model the opportunity to learn from all types of records, not only those belonging to the most frequent class.
One solution is to balance the training data set so that the relative frequency of the “cancer” observations increases. To increase it, we can apply two methods:
1. Resample a number of “cancer” records