Class Imbalance Explained: Machine Learning Data Science Basics

Uniqtech
Data Science Bootcamp


This free article provides a quick, intuitive explanation of why class imbalance is a problem in data analysis and why accuracy is not a preferred metric for measuring model performance. On an imbalanced dataset, the shortfall of accuracy as a metric is a real problem.

Let’s use the real-world example of orphan diseases. The United States FDA defines an orphan disease as one that affects fewer than 200,000 people nationwide. The US population today is 327.2 million, so such a disease affects at most about 0.061% of the population.
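To check the arithmetic behind that percentage, here is a minimal Python sketch. It only uses the two figures quoted above; the variable names are ours, chosen for illustration.

```python
# Figures quoted in the text above.
ORPHAN_THRESHOLD = 200_000     # FDA cutoff: fewer than this many people affected
US_POPULATION = 327_200_000    # US population figure used in this article

# Upper bound on the share of the population affected by one orphan disease.
max_prevalence = ORPHAN_THRESHOLD / US_POPULATION

print(f"{max_prevalence:.3%}")  # prints 0.061%
```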

Imagine we have a model that never learns. Maybe it guesses at a binary classification task randomly, just like a coin flip: 50% class 0, 50% class 1. Many patients will receive unnecessary treatment or tests. Valuable hospital time and resources will be wasted. That’s not good. Not useful. Even worse, suppose we devise a model that guesses 100% of the time that patients don’t have the disease. The model sounds amazing because it will only be wrong 0.061% of the time, even though it never identifies a single patient who actually has the disease!
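The two "models" above can be compared with simple expected counts. This is a rough sketch, not a simulation: the cohort size of one million patients is an assumption made for illustration, and the coin-flip numbers are averages rather than the result of any particular run.

```python
# Hypothetical cohort; the size is an illustrative assumption.
N_PATIENTS = 1_000_000
PREVALENCE = 200_000 / 327_200_000   # about 0.061%, from the figures in the text

sick = round(N_PATIENTS * PREVALENCE)   # patients who have the disease
healthy = N_PATIENTS - sick             # patients who do not

# Coin-flip model: on average it flags half of each group as class 1.
coin_false_alarms = healthy // 2        # healthy patients sent for needless tests
coin_accuracy = (sick / 2 + healthy / 2) / N_PATIENTS

# Always-negative model: it predicts class 0 for everyone.
neg_missed = sick                       # every sick patient is missed
neg_error = sick / N_PATIENTS           # share of wrong predictions

print(f"sick patients in cohort:      {sick}")               # 611
print(f"coin flip, false alarms:      about {coin_false_alarms:,}")
print(f"coin flip, expected accuracy: {coin_accuracy:.1%}")  # 50.0%
print(f"always negative, missed:      {neg_missed}")         # 611
print(f"always negative, error rate:  {neg_error:.3%}")      # 0.061%
```

The always-negative model is wrong far less often than the coin flip, yet it misses every one of the sick patients in the cohort.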

The simple metric, accuracy = (true positives + true negatives) / (number of patients tested), does not work in this case because there’s a severe class imbalance. Class 0 means no disease; class 1 means yes to an orphan disease. The proportion of patients in class 0 is significantly higher than in class 1.
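Here is the accuracy formula applied to the always-negative model, as a small sketch in plain Python. The cohort of 10,000 patients with 6 sick (roughly 0.06%) is made up for illustration, and the `accuracy` helper is our own, not taken from any library.

```python
def accuracy(y_true, y_pred):
    """accuracy = (true positives + true negatives) / number of patients tested"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)

# Hypothetical cohort: 10,000 patients, 6 of them with the disease.
y_true = [1] * 6 + [0] * 9_994
y_pred = [0] * len(y_true)   # the model that always answers "no disease"

acc = accuracy(y_true, y_pred)
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {acc:.4f}")                                   # 0.9994
print(f"sick patients detected: {detected} of {sum(y_true)}")   # 0 of 6
```

The true negatives alone push accuracy to 99.94%, while the true positive count, the number that matters to the sick patients, is zero.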

Class balance is important for machine learning. Algorithms learn from data: examples, negative examples, noisy examples, variations (cats, dogs, cat photos in low-light settings, different breeds of…
