How to Deal with Imbalanced Data
A Step-by-Step Guide to handling imbalanced datasets in Python
A dataset with imbalanced classes is a common data science problem as well as a common interview question. In this article, I provide a step-by-step guide to improving your model and handling imbalanced data well. The most common areas where you see imbalanced data are classification problems such as spam filtering, fraud detection and medical diagnosis.
What makes Imbalanced Data a problem?
Almost every dataset has an unequal representation of classes. This isn’t a problem as long as the difference is small. However, when one or more classes are very rare, many models don’t work too well at identifying the minority classes. In this article, I’ll assume a two-class problem (one majority class and one minority class) for simplicity, but most of these techniques work for multi-class problems as well.
Usually, we look at accuracy on the validation split to determine whether our model is performing well. However, when the data is imbalanced, accuracy can be misleading. For example, say you have a dataset in which 92% of the data is labelled as ‘Not Fraud’ and the remaining 8% are cases of ‘Fraud’. The data is clearly imbalanced. Now say our model ends up classifying everything it sees as ‘Not Fraud’. If we look at the accuracy, however, it’s a magnificent 92%. But the bank still cares about those ‘Fraud’ cases. That’s what it loses money on. So how do we improve our model?
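To see the accuracy trap in action, here is a minimal sketch using scikit-learn’s DummyClassifier on a made-up dataset with the same 92/8 split (the toy data and names are just an illustration):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# A toy dataset with the same 92/8 class split as the fraud example
X, y = make_classification(n_samples=5000, weights=[0.92], flip_y=0,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# A baseline that always predicts the majority class ('Not Fraud' = 0)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print(accuracy_score(y_test, y_pred))  # ~0.92, despite learning nothing
print(recall_score(y_test, y_pred))    # 0.0 -- it never catches a single fraud
```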
The following are a series of steps and decisions you can carry out in order to overcome the issues with an imbalanced dataset.
1. Can you collect more data?
You might say, “Well random guy on the internet, if I could collect more data, I wouldn’t be reading this, now would I? ColLECt MorE dAta”.
But hear me out. Take a step back and consider things such as:
- Was the data you received filtered on conditions that might have resulted in the minority class observations being dropped?
- How many years of history do you have? If you go back a year or two, do you get more instances of the minority class while the data still remains reasonable?
- If it is customer data, can we use customers who are no longer subscribers of our service, without it affecting the problem we are trying to solve? Maybe using these previous customer records, along with a “current customer? (Y/N)” boolean variable, will help.
- Can you combine your minority classes into a single class? For example, if you are classifying different types of anomalies in a person’s heartbeat, instead of trying to classify each type of anomaly, will the model still be useful if we combine them all into a single ‘Abnormal Heartbeat’ class?
Just some ideas and a reminder to always think about different possibilities.
2. Change the Performance Metric
Since, as explained above, accuracy isn’t a good measure when working with imbalanced datasets, let’s consider more appropriate measures.
Based on the confusion matrix we can measure the following:
- Precision: True Positives / All Predicted Positives = TP / (TP+FP). Precision is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
- Recall: True Positives / All Actual Positives = TP / (TP + FN). Recall is a measure of a classifier’s completeness. It is also known as Sensitivity or the True Positive Rate. Low recall indicates a high number of false negatives.
- F1 score: 2TP / (2TP + FP + FN). The harmonic mean of precision and recall. If we want a balance between precision and recall, the F1 score is the metric to look at.
Note: Here I’m assuming the minority class is labelled as the Positive class in the confusion matrix.
See this wikipedia page for a list of the performance metric formulas.
We can also take a look at:
- Sensitivity/Specificity from the ROC Curve: Sensitivity is the same as recall (the True Positive Rate), while Specificity is the True Negative Rate.
- Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
In these sorts of scenarios we want to look at recall/sensitivity and F1 score instead of accuracy to see how well our models do at predicting the minority class.
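All of these metrics are available in scikit-learn. A minimal sketch, assuming hypothetical arrays y_test (true labels) and y_pred (model predictions) with the minority ‘Fraud’ class encoded as 1:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# y_test and y_pred are hypothetical; minority class is encoded as 1
print(confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
```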
Jason Brownlee has more information on selecting different performance measures here.
3. Try Different Algorithms
As with most Data science problems, it’s always good practice to try a few different suitable algorithms on the data.
There are two main types of algorithms that seem to be effective with imbalanced dataset problems.
Decision Trees
Decision trees seem to perform pretty well with imbalanced datasets. Since they work by coming up with conditions/rules at each stage of splitting, they end up taking both classes into consideration.
We can try a few different decision tree algorithms like Random Forest, CART, and C4.5.
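A minimal sketch of fitting one such tree-based model in scikit-learn, assuming hypothetical X_train/X_test and y_train/y_test splits, and judging it on the metrics from step 2 rather than accuracy:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hypothetical train/test splits from earlier steps
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Look at per-class precision/recall/F1, not just overall accuracy
print(classification_report(y_test, rf.predict(X_test)))
```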
Penalized Models
Penalized learning models (Cost-sensitive training) impose an additional cost on the model for making classification mistakes on the minority class during training. This forces the model to pay more attention to the minority class observations.
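In scikit-learn, one common way to get this behaviour is the class_weight parameter. A minimal sketch (the 10:1 cost ratio below is purely an illustrative assumption):

```python
from sklearn.linear_model import LogisticRegression

# 'balanced' re-weights errors inversely to class frequency, so mistakes
# on the rare class cost more during training.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)

# Or set an explicit cost ratio, e.g. make minority-class (label 1)
# mistakes ten times as expensive as majority-class mistakes.
clf_custom = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)

clf.fit(X_train, y_train)  # hypothetical training data from earlier
```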
Anomaly Detection Models
If there are just the two classes, the majority class and the minority, then you could consider using an anomaly detection model instead of a classification model. These models try to build a profile of the majority class. Any observation that does not fit this profile is considered an anomaly or outlier, in our case an observation from the minority class. These sorts of models are used in situations such as fraud detection.
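One possible sketch uses scikit-learn’s IsolationForest, assuming hypothetical X_train/X_test arrays and the roughly 8% fraud rate from the earlier example:

```python
from sklearn.ensemble import IsolationForest

# contamination is our rough guess of the outlier (fraud) fraction
iso = IsolationForest(contamination=0.08, random_state=42)
iso.fit(X_train)

# predict() returns +1 for inliers ('Not Fraud') and -1 for outliers,
# which we treat as the minority ('Fraud') class.
flags = iso.predict(X_test)
y_pred = (flags == -1).astype(int)
```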
4. Resample the Dataset
This step can be done while trying the different models approach mentioned above. The different types of resampling are as follows:
- Under-sampling the majority class
- Oversampling the minority class
Under-sampling (Downsampling) the majority class
Under-sampling randomly removes observations of the majority class. This reduces the number of majority class observations used in the training set and, as a result, better balances the number of observations of the two classes. It is suitable when you have a lot of observations in your dataset (>10K observations). The risk is that you are throwing away information, which may lead to underfitting.
Scikit-learn provides a ‘resample’ utility which we can use for undersampling; the imbalanced-learn package also provides more advanced functionality.
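A minimal sketch of the undersampling step, assuming a pandas DataFrame df with a binary ‘label’ column (hypothetical names, adapt them to your data):

```python
import pandas as pd
from sklearn.utils import resample

# Split the DataFrame by class (hypothetical label values)
df_majority = df[df['label'] == 'Not Fraud']
df_minority = df[df['label'] == 'Fraud']

# Randomly drop majority rows until both classes are the same size
df_majority_downsampled = resample(df_majority,
                                   replace=False,               # no replacement
                                   n_samples=len(df_minority),  # match minority count
                                   random_state=42)

df_downsampled = pd.concat([df_majority_downsampled, df_minority])
print(df_downsampled['label'].value_counts())
```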
Since many of the observations of the majority class have been dropped, the resulting dataset is now much smaller. The ratio between the two classes is now 1:1.
Note: we don’t necessarily have to use a 1:1 ratio; we can reduce the number of majority class observations to any reasonable ratio using this method.
Oversampling (Upsampling) the minority class
Oversampling randomly duplicates observations from the minority class in order to make its signal stronger. The simplest form of oversampling is sampling with replacement. It is suitable when you don’t have a lot of observations in your dataset (<10K observations). The risk is that if you duplicate too many observations, you end up overfitting.
We can use the same scikit-learn ‘resample’ utility, but with different parameters.
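A sketch using the same hypothetical DataFrame and column names as before:

```python
import pandas as pd
from sklearn.utils import resample

df_majority = df[df['label'] == 'Not Fraud']
df_minority = df[df['label'] == 'Fraud']

# Duplicate minority rows (sampling with replacement) until the classes balance
df_minority_upsampled = resample(df_minority,
                                 replace=True,                # with replacement
                                 n_samples=len(df_majority),  # match majority count
                                 random_state=42)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print(df_upsampled['label'].value_counts())
```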
This time we sample with replacement to have more representation in the final training set. But as I mentioned this could lead to overfitting. So, how can we do things better to avoid overfitting?
5. Generate Synthetic Samples
In order to reduce overfitting during upsampling, we can try creating synthetic samples.
SMOTE
A popular algorithm is SMOTE (Synthetic Minority Oversampling Technique). Instead of duplicating existing observations, SMOTE creates new synthetic samples by interpolating between a minority-class observation and its nearest minority-class neighbours.
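A minimal sketch using the imbalanced-learn package, assuming a hypothetical feature matrix X and label vector y. Note that the train/test split happens before resampling, a point revisited in the conclusion:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first so synthetic samples never leak into the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```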
You can find an example of SMOTE here.
Augmentation
In a similar spirit to SMOTE, if your data consists of things like audio or images, you can apply transformations (flips, rotations, pitch shifts, and so on) to the original files to create new samples as well.
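For images, one possible sketch uses torchvision; any library with random transforms works, and the specific transforms below are illustrative assumptions:

```python
from torchvision import transforms

# Each pass through this pipeline yields a slightly different version
# of the original image, which can be added as a new minority-class sample.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# augmented_image = augment(original_pil_image)
```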
6. Conclusion
As with most things in data science and machine learning, there is no definitive right approach that works every time. Depending on the nature of your dataset, the distribution of classes, the predictors and the model, some of the above-mentioned methods will work better than others. It is up to you to figure out the best combination.
A few pointers to keep in mind are:
- Always do the train/test split before creating synthetic/augmented samples. You want to validate and test your model on original data observations.
- Use the same metrics for comparison. Every time you try something new, remember to compare models on the same metric; don’t look at only accuracy for one model and sensitivity for another.
In addition to these steps, don’t forget that you still have to do things such as data cleaning, feature selection and hyper-parameter tuning.
Hope this helped!
if_you_like(this_article):
    please(CLAPS)          # Do this :)
else:
    initiate_sad_author()  # Don't do this :(

# Thanks :)
References and Further Reading
Boyle, Tara. “Methods for Dealing with Imbalanced Data.” Medium. February 04, 2019. Accessed June 21, 2020. https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18.
Brownlee, Jason. “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset.” Machine Learning Mastery. January 14, 2020. Accessed June 21, 2020. https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.
“How to Handle Imbalanced Classes in Machine Learning.” EliteDataScience. May 23, 2020. https://elitedatascience.com/imbalanced-classes.