Extreme Imbalanced Data — The Worst Data Scientist Nightmare

And the Accuracy Trap

Carla Martins
CodeX
3 min read · Jun 17, 2022

Photo by Luke Chesser on Unsplash

We say that data is imbalanced when one class of the target variable occurs much less often than the other(s). A common example is cancer detection. If we have 10,000 lab results and only 1% of them are positive for cancer (100 positives against 9,900 negatives), our data is extremely imbalanced.

The accuracy trap

If we run a model on this extremely imbalanced data, we can easily reach 99% accuracy without the model learning anything useful. Why?

Because even the simplest model possible, one that classifies every entry as “not cancer”, will still be correct 99% of the time, since the relative frequency of “cancer” is only 1%. That same model misses every single cancer case.
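To make the trap concrete, here is a minimal sketch in plain Python. The labels are synthetic and only mirror the 10,000-record, 1% positive example above. The "model" is the trivial always-negative classifier, and recall (the share of actual cancer cases that get detected) is added here as an illustration of what accuracy hides.

```python
# Synthetic labels mirroring the example: 10,000 lab results, 1% positive.
labels = [1] * 100 + [0] * 9_900  # 1 = cancer, 0 = not cancer

# The simplest model possible: predict "not cancer" for every record.
predictions = [0] * len(labels)

# Accuracy: share of records where the prediction matches the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of the actual cancer cases that the model detects.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2%}")  # accuracy = 99.00%
print(f"recall   = {recall:.2%}")    # recall   = 0.00%
```

The accuracy looks excellent, yet none of the 100 cancer cases is found, which is why accuracy alone is misleading on data like this.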

We need to apply data balancing techniques

By balancing data, we give our model the opportunity to learn about all types of records, not only those belonging to the most frequent class.

The solution is to balance the training data set so that the frequency of the “cancer” observations is increased. To increase relative frequency, we can apply two methods:

1. Resample a number of “cancer” records
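One way to read "resample a number of cancer records" is random oversampling with replacement: drawing extra copies of the minority records until the classes are the same size. The sketch below makes that assumption. The function name `oversample_minority`, the 50/50 target, and the integer stand-in records are all hypothetical choices made for illustration, and the function assumes a binary target.

```python
import random


def oversample_minority(records, labels, minority_label=1, seed=0):
    """Duplicate random minority records until both classes have equal size.

    Hypothetical helper for illustration; assumes a binary target.
    """
    rng = random.Random(seed)
    pairs = list(zip(records, labels))
    minority = [pair for pair in pairs if pair[1] == minority_label]
    n_majority = len(pairs) - len(minority)

    # Number of extra minority copies needed to match the majority class.
    n_extra = max(0, n_majority - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement

    balanced = pairs + extra
    rng.shuffle(balanced)  # avoid leaving all the copies at the end
    new_records = [record for record, _ in balanced]
    new_labels = [label for _, label in balanced]
    return new_records, new_labels


# Stand-in training set: record ids 0..9999, the first 100 are "cancer".
train_x = list(range(10_000))
train_y = [1] * 100 + [0] * 9_900

bal_x, bal_y = oversample_minority(train_x, train_y)
print(sum(bal_y), len(bal_y))  # 9900 19800
```

As the text says, this is applied to the training data set only. The records used to evaluate the model should keep their original 1% frequency so the results reflect what the model will see in practice.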
