Strategies for Handling Imbalanced Classes in Machine Learning

Importance of Addressing Class Imbalance and Different Techniques to Handle It Using scikit-learn and imbalanced-learn.

Nagraj Desai
8 min read · Apr 2, 2024
https://datascience.aero/wp-content/uploads/2017/12/imbalancedata.png

In this blog we will learn:

What is class imbalance?

Why is it important to treat class imbalance?

How do we handle class imbalance?

  1. Random undersampling
  2. Tomek Links
  3. Random oversampling
  4. Synthetic minority oversampling technique (SMOTE)
  5. Adaptive synthetic sampling method (ADASYN)
  6. SMOTETomek (Oversampling followed by undersampling)

Summary

What is class imbalance?

Class imbalance refers to the situation in a classification problem where the distribution of target classes in the dataset is skewed or uneven, meaning that one class (the minority class) is significantly underrepresented compared to the other class (the majority class).

For example, consider a binary classification problem where the task is to predict whether an email is spam or not spam. If only 5% of the emails in the dataset are spam (positive class), while the remaining 95% are not spam (negative class), then the dataset suffers from class imbalance.

Why is it important to treat class imbalance?

An imbalanced dataset leads to biased predictions and misleading accuracy. When the majority class dominates, the machine learning algorithm learns the majority class well but becomes incapable of identifying the minority class while making predictions. Because the model predicts the majority class correctly most of the time, the accuracy score will naturally be high, and it will not capture how poorly the model predicts the minority class.

Metrics calculated on a class-imbalanced dataset using a logistic regression model

For example, suppose you have a dataset that contains 96% majority class (0) and 4% minority class (1). Even if you don't build a model at all and simply predict the majority class (0) for every instance, your accuracy will still be really high, i.e. 96%. The sketch below illustrates this; after that, let's see how to deal with the problem.
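
For illustration, here is a minimal sketch of that situation. The synthetic dataset and the DummyClassifier baseline are assumptions chosen just for this example:

# A minimal sketch: misleading accuracy on a ~96/4 imbalanced dataset
# (synthetic data and a DummyClassifier baseline, used only for illustration)
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# roughly 96% class 0 (majority) and 4% class 1 (minority)
X, y = make_classification(n_samples=10000, weights=[0.96, 0.04], random_state=5)

# a "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))        # ~0.96, looks great
print("Minority recall:", recall_score(y, y_pred))   # 0.0, not a single minority case caught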

How do we handle class imbalance?

Now that we understand that class imbalance has to be taken care of, the next question is: HOW?

We can handle imbalanced classes by rebalancing the class distribution, either by increasing the minority class or by decreasing the majority class.

We can do that using the following techniques:

  1. Random undersampling (undersampling technique)
  2. Tomek Links (undersampling technique)
  3. Random oversampling (oversampling technique)
  4. Synthetic minority oversampling technique (SMOTE) (oversampling technique)
  5. Adaptive synthetic sampling method (ADASYN) (oversampling technique)
  6. SMOTETomek (Oversampling followed by undersampling) (Hybrid technique)

NOTE: Use resampling methods on your training set, never on your test set!

Applying them to the test set (or before the split) leads to data leakage: the model effectively gets to see copies or synthetic variants of the data it will later be evaluated on, so there is no genuinely unseen data left to test model performance on. The test set should also keep the real-world class distribution. So, always remember to do the train-test split before applying class imbalance techniques.
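
As a minimal sketch of the correct order (the names X and y stand for the full feature matrix and labels and are assumptions here), the resampler is fitted on the training split only:

# Minimal sketch of the correct workflow: split first, then resample only the training set
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# split before any resampling; stratify keeps the original class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5
)

# resample the training data only; the test set keeps its real-world distribution
X_train_res, y_train_res = SMOTE(random_state=5).fit_resample(X_train, y_train)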

1. Random Under-Sampling

In Random Under-Sampling, instances from the majority class are randomly removed or downsampled to balance the class distribution. The aim is to reduce the dominance of the majority class while retaining all instances of the minority class. This approach helps prevent the model from being biased towards the majority class and allows it to learn from a more balanced representation of the data.


In this method, you select fewer data points from the majority class for your model-building process. If you have only 500 data points in the minority class, you also take just 500 data points from the majority class, which makes the classes balanced. In practice, however, this method can be wasteful: you may discard a large share of the original data, and with it important information, which can lead to a poorer model.

Make sure you have the latest imbalanced-learn package installed.

# install 
pip install imbalanced-learn
# upgrade
pip install --upgrade imbalanced-learn
# Random Undersampling
from imblearn.under_sampling import RandomUnderSampler
under_sample = RandomUnderSampler(random_state = 5)
X_resampled_us, y_resampled_us = under_sample.fit_resample(X_train, y_train)

# X_resampled_us: undersampled X_train
# y_resampled_us: undersampled y_train
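
A quick way to verify the effect is to compare the class counts before and after resampling (Counter comes from the Python standard library; the counts shown are illustrative):

# compare class counts before and after random undersampling
from collections import Counter

print("Before:", Counter(y_train))         # e.g. Counter({0: 7680, 1: 320})
print("After: ", Counter(y_resampled_us))  # e.g. Counter({0: 320, 1: 320})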

2. Tomek Links

Tomek Links is a method used in data preprocessing for handling imbalanced datasets. A Tomek Link is a pair of instances belonging to different classes that are close to each other in feature space; specifically, two instances from different classes form a Tomek Link if each one is the other's nearest neighbour.

https://images.app.goo.gl/xBGS96F6WZvkVj5V9

The main idea behind Tomek Links is to identify and remove instances that are in close proximity to instances of the opposite class, which are often considered noisy or ambiguous. By removing these instances, Tomek Links aims to improve the decision boundary between the classes, making classification more robust.

Tomek Links is an undersampling technique based on distance measures. It removes unwanted overlap between classes: the majority-class members of Tomek Links are removed until all minimally distanced nearest-neighbour pairs belong to the same class. In effect, it finds majority-class instances that sit right next to minority-class instances and removes them, which cleans up the boundary between the classes.

# Tomek Links
from imblearn.under_sampling import TomekLinks
tomek_sample = TomekLinks(sampling_strategy='majority')
X_resampled_tomek, y_resampled_tomek = tomek_sample.fit_resample(X_train, y_train)

3. Random Over-Sampling

In random oversampling, instances from the minority class are randomly duplicated or replicated until a more balanced distribution between the classes is achieved. Random oversampling is a simple technique to address class imbalance and improve the performance of machine learning models, particularly those that are sensitive to imbalanced class distributions.


Using this method, you add more observations to the minority class by replication. Although this does not add any new information, there is no information loss either. However, it can exaggerate the existing information to a certain extent, leading to overfitting.

# Random OverSampling
from imblearn.over_sampling import RandomOverSampler
over_sample = RandomOverSampler(sampling_strategy = 1)
X_resampled_os, y_resampled_os = over_sample.fit_resample(X_train, y_train)
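
Here sampling_strategy=1 asks for a 1:1 minority-to-majority ratio after resampling. A float below 1 gives a partial rebalance instead, as in this small variation (the 0.5 ratio is an arbitrary choice for illustration):

# variation: oversample the minority class only up to half the majority count
# (for binary problems, a float sampling_strategy is the desired minority/majority ratio)
from imblearn.over_sampling import RandomOverSampler

partial_over = RandomOverSampler(sampling_strategy=0.5, random_state=5)
X_partial_os, y_partial_os = partial_over.fit_resample(X_train, y_train)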

4. SMOTE — Synthetic Minority Oversampling Technique

By creating synthetic samples rather than simply duplicating existing instances, SMOTE helps to avoid overfitting and introduces diversity into the minority class, which can improve the generalization performance of machine learning models. It effectively increases the amount of minority-class data available for training without introducing noise or redundancy.

Using this technique, you generate new data points that lie on the line segment between two existing minority-class data points. The existing points are selected at random, and the new points are assigned to the minority class. This method uses the k-nearest neighbours of each minority sample to create random synthetic samples.

The steps involved in this method are as follows:

  1. Identifying a minority-class feature vector and one of its nearest neighbours
  2. Taking the difference between the two
  3. Multiplying the difference by a random number between 0 and 1
  4. Identifying a new point on the line segment by adding this scaled difference to the feature vector
  5. Repeating the process for other minority-class feature vectors
# SMOTE
from imblearn.over_sampling import SMOTE
smt = SMOTE(random_state=45, k_neighbors=5)
X_resampled_smt, y_resampled_smt = smt.fit_resample(X_train, y_train)
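
The interpolation in steps 1-4 boils down to x_new = x_i + λ * (x_neighbour − x_i) with λ drawn from [0, 1]. Here is a minimal NumPy sketch of a single synthetic point (the two points are made-up values for illustration):

# minimal sketch of how SMOTE interpolates one synthetic point between a
# minority sample and one of its nearest minority-class neighbours
import numpy as np

rng = np.random.default_rng(45)
x_i = np.array([2.0, 3.0])           # a minority-class sample (made-up values)
x_nn = np.array([4.0, 5.0])          # one of its nearest minority neighbours

lam = rng.uniform(0, 1)              # random factor between 0 and 1
x_new = x_i + lam * (x_nn - x_i)     # new point on the segment between the two
print(x_new)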

5. ADASYN — Adaptive Synthetic Sampling Method

ADASYN is an extension of the Synthetic Minority Over-sampling Technique (SMOTE). Like SMOTE, ADASYN focuses on generating synthetic samples for the minority class to balance the class distribution. However, ADASYN introduces a more adaptive mechanism to create synthetic samples, which can be particularly beneficial in scenarios where the class imbalance is severe or the data distribution is complex.

ADASYN adapts to the local data distribution by generating more synthetic samples for minority instances that lie in regions dominated by the majority class, where the imbalance is hardest to learn. This adaptive mechanism helps to address the overfitting that can occur with SMOTE, which generates synthetic samples uniformly across the minority class regardless of the local data distribution. In other words, the number of synthetic samples created around a given minority point follows the local density of majority-class neighbours, whereas with SMOTE the distribution is uniform. The aim is to create synthetic data for minority samples that are harder to learn, rather than for the easier ones.

By prioritizing the generation of synthetic samples based on the local data distribution, ADASYN can produce more representative and diverse synthetic samples, leading to better generalization performance of machine learning models.

To sum it up, the ADASYN method offers the following advantages:

  • It lowers the bias introduced by the class imbalance.
  • It adaptively shifts the classification decision boundary towards difficult samples.
# ADASYN
from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=45, n_neighbors=5)
X_resampled_ada, y_resampled_ada = ada.fit_resample(X_train, y_train)
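
Because ADASYN decides per point how many synthetic samples to generate, the resulting minority count is usually close to, but not exactly, the majority count, unlike SMOTE's exact balance. A quick comparison (variable names follow the snippets above):

# compare the class counts produced by SMOTE and ADASYN
from collections import Counter

print("SMOTE :", Counter(y_resampled_smt))  # exactly balanced
print("ADASYN:", Counter(y_resampled_ada))  # approximately balanced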

6. SMOTETomek — Over-sampling followed by under-sampling

SMOTETomek is a hybrid technique used to address class imbalance. It combines the Synthetic Minority Over-sampling Technique (SMOTE) with Tomek Links to simultaneously oversample the minority class and undersample the majority class.

By combining SMOTE with Tomek Links, SMOTETomek aims to improve the effectiveness of class imbalance mitigation. Removing Tomek Links helps to clean up the boundary between classes, making it easier for SMOTE to generate meaningful synthetic samples.

Tomek Links can be used as an undersampling method or as a data-cleaning method. When applied in conjunction with SMOTE, Tomek Links act as a data-cleaning mechanism within the oversampled training set: instead of removing only majority-class examples, as in the traditional usage, instances from both classes that form a link are dropped. This cleans the boundary between the oversampled minority class and the majority class and promotes a better balance within the dataset.

# SMOTE+TOMEK
from imblearn.combine import SMOTETomek
smt_tmk = SMOTETomek(random_state=45)
X_resampled_smt_tmk, y_resampled_smt_tmk = smt_tmk.fit_resample(X_train, y_train)
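
In practice, a convenient way to keep resampling confined to the training data during cross-validation is imbalanced-learn's own Pipeline, which applies the sampler only while fitting. A minimal sketch follows (the logistic regression estimator and the F1 scoring choice are assumptions for the example):

# minimal sketch: an imblearn Pipeline applies the sampler only during fit,
# so cross-validation never resamples the validation folds
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("resample", SMOTETomek(random_state=45)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# F1 on the minority class is more informative than accuracy here
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print(scores.mean())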

Summary

It is important to understand that there is no single methodology that suits all problems. Let's summarize what you have learnt, along with some key considerations to keep in mind when managing class imbalance.

  1. There are cases where undersampling methods perform better than oversampling methods, and vice versa. It is therefore important to try multiple methods on your prepared data and compare the results to select the best possible one.
  2. As undersampling methods remove instances of the majority class from the dataset, they can be an issue in many cases because you might lose significant information.
  3. Oversampling methods generate new samples from the existing data, which can create irrelevant observations and also result in overfitting.
  4. Interpreting and understanding the evaluation metrics and how they can be used to help solve a real-world business problem is extremely important.
  5. Do not rely on the accuracy score alone as a metric for model evaluation. In a dataset with 96% majority observations, you will likely make correct predictions 96% of the time just by always predicting the majority class. A confusion matrix and precision/recall scores are better metrics for evaluating model performance (see the sketch below).
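
As a minimal sketch of that last point (assuming a classifier named model fitted on the resampled training data and the untouched test split from earlier), the confusion matrix and classification report expose what a single accuracy number hides:

# evaluate on the untouched test set with metrics that show per-class performance
# (`model` is assumed to be a classifier fitted on the resampled training data)
from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred))  # per-class precision, recall and F1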
