Imbalanced Data: is it all bad?

Indrani Banerjee
Published in CodeX
Feb 3, 2023 · 4 min read

Working with data often means having to make do with what you’re given. More often than not, that means unclean data, imbalanced data, and sometimes even unreliable data. Data can be messy for a myriad of reasons: people accidentally ticking incorrect boxes, accidental input during digitization or data entry, and typos, to name a few. Bad data will ultimately lead to bad conclusions, but what if we could mitigate some of these problems? I want to specifically address how to handle imbalanced datasets for classification models.

In the real world we expect to see imbalances in groups. This is normal. The number of passengers on a flight who have ordered special meals versus those who haven’t, the number of kids who are fussy eaters versus those who aren’t, the number of people who don’t like the prequel Star Wars movies versus those who do. So why is imbalance in datasets problematic? The honest answer is that it isn’t necessarily a problem, but it can be one.

I was recently working on a binary classification project where the ratio of the two classes was 93:7. The goal of the project was to explore why only 7% of the company’s clients subscribed to a product while the others didn’t. I decided to first look at the outcomes of machine learning algorithms to see what the base models could yield, making sure the training and test sets had roughly the same proportional representation of the two groups. I was surprised that, even prior to hyperparameter tuning, I got 90–95% accuracy across pretty much all my models. I then applied SMOTE, and the accuracy dipped a little for most of the models (have a look at the table below).
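A minimal sketch of that stratified split, using synthetic data in place of the real client dataset (the 40,000-client and roughly 7% figures come from the article; the features and random seed are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40,000 rows, 2,800 of which are the minority class.
rng = np.random.default_rng(42)
X = rng.normal(size=(40_000, 5))
y = np.zeros(40_000, dtype=int)
y[:2_800] = 1  # ~7% minority class, as in the project

# stratify=y keeps the 93:7 ratio identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # 0.07 0.07
```

Without `stratify`, a random split of a heavily imbalanced set can leave the test set with a noticeably different minority proportion, which skews every evaluation metric downstream.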

This confused me even more! Which one was ‘better’? Which one should I trust? I did some digging, had the chance to talk to a mentor from Apziva, and concluded that, for this classification problem, the imbalanced dataset wasn’t really a problem.

The primary reason I deemed this class imbalance not a problem was the size of my dataset. Of the 40,000 clients, I had only 2,800 samples of those who did subscribe to the product. Would I have preferred 28,000 samples? Absolutely! However, 2,800 is not too bad. I found this article very helpful: it discusses how to decide whether you have enough data, and how to conclude whether there are enough datapoints for ML models to learn from, even in smaller classes.
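One practical way to judge whether a minority class is large enough, not from the article above but a common approach, is a learning curve: if the validation score plateaus well before all the training data is used, collecting more samples is unlikely to help much. A sketch on synthetic data (the class weights mirror the article’s 93:7 split; everything else is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic imbalanced dataset standing in for the real one.
X, y = make_classification(
    n_samples=4_000, weights=[0.93, 0.07], random_state=0
)

# Fit on growing fractions of the training data and score each size.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="f1_weighted",
)

print(sizes)                     # training-set sizes actually used
print(val_scores.mean(axis=1))   # mean validation score per size
```

A flat tail on the validation curve suggests the model has already extracted what it can from the minority class at the current sample size.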

The real problem for this model was that the client had set the success criterion as ‘an accuracy score of 90% or higher’. Take a look at this post if you want a quick refresher on evaluation metrics. I should have been using a different evaluation metric: the weighted F1 score. Below is an output from scikit-learn’s classification report for one of my binary classification models.

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

The support column shows the distribution of the classes: 92.7% of the training set are clients who didn’t sign up for the product, and the remaining 7.3% are clients who did. We can see the F1 score for the two classes in the first two rows. I’m more interested in the final row: the ‘weighted average’ row shows the average precision, recall and F1 scores, where each class’s score is weighted by its proportion of the dataset. So the F1 score for class 0, ‘clients who didn’t sign up for the subscription’, carries more weight than that of class 1.

As a result, for my model selection, I prioritised the weighted F1 score rather than the accuracy score to select the optimal model.
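A toy example of why accuracy alone misleads here, using made-up labels rather than the article’s actual results: a model that always predicts the majority class scores 93% accuracy on a 93:7 split, while the weighted F1 score penalises its total failure on the minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# 93 majority-class labels, 7 minority-class labels.
y_true = [0] * 93 + [1] * 7
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                  # 0.93
print(f1_score(y_true, y_pred, average="weighted"))    # ~0.896
```

The weighted F1 drops because class 1’s F1 is 0, and even its small 7% weight pulls the average below the headline accuracy; with a less extreme model the gap is smaller, but the metric still rewards minority-class performance in a way accuracy never can.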

So, was the imbalanced dataset actually problematic, given that the imbalance reflected the real-life situation?

The truth is that most situations in the real world do not feature conveniently balanced data. As a result, any approach to data science that cannot handle imbalanced data will be severely limited, and ultimately not of much use for many real applications. Therefore, as data scientists, we must expect handling imbalanced data to be a constant feature of data analysis. It is important to understand when the imbalance is a problem and when strategies such as SMOTE, oversampling and undersampling are required. But I think it’s also important to identify when doing nothing about the imbalance is the best strategy.
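For reference, the simplest of those strategies, naive random oversampling, can be sketched with NumPy alone (SMOTE, from the third-party imbalanced-learn package, goes further by interpolating new synthetic minority points instead of duplicating existing ones; the data below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 93 + [1] * 7)  # the article's 93:7 ratio

# Duplicate random minority-class rows until the classes match.
minority_idx = np.where(y == 1)[0]
n_needed = int((y == 0).sum() - (y == 1).sum())  # 86 extra rows
extra = rng.choice(minority_idx, size=n_needed, replace=True)

X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))  # [93 93]
```

The trade-off is exactly the one the article circles around: duplicated (or synthesised) minority rows can help a model learn the minority class, but they also distort the class prior, which is why the post-SMOTE accuracy dipped.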

Handling imbalanced data confidently, with an appropriate technique, is not just an important skill but a fundamental requirement for a data scientist, so good knowledge of the tools for doing it is essential.
