Handling Class Imbalance in Classification: Algorithms Explained

Oversampling and Undersampling algorithms explained

Mehul Gupta
Data Science in your pocket


Taking a break from Generative AI for a while, I recently got the chance to work on a binary classification problem with a twist: the data was highly imbalanced and required some preprocessing before moving ahead. So, in this post, I will explain the different imbalance-handling algorithms I tried out.

We will explore the following topics in this post:

What is class imbalance & why is it a problem?

Metrics to Consider & Avoid

Techniques to handle imbalance

Oversampling algorithms (SMOTE, ADASYN)

Undersampling algorithms (Tomek’s Link, Nearest Neighbors)

Oversampling+Undersampling (SMOTE Tomek)

Sample code

Before jumping onto the algorithms,

What is class imbalance?

Class imbalance happens when some categories in a dataset have way more examples than others. Basically, one group has loads of entries, while another has just a few.

Take a fraud detection system. If you’re trying to spot fraudulent transactions (the positive class) among regular ones (the negative class), you might find way more non-fraudulent transactions than fraudulent ones. That’s class imbalance in action.

Why is class imbalance a problem?

Class imbalance is problematic because most algorithms/ML models expect the classes to be evenly distributed in the dataset. When one class massively outweighs the other, such algorithms usually favor the bigger class and ignore the smaller one. This is a big deal when the minority class is the one you really care about, like detecting rare diseases or spotting fraud.

So, at times, the model might end up always predicting the majority class.

Metrics to Avoid

When dealing with class imbalance, certain metrics can be misleading because they don't account for the skewed distribution:

Accuracy: In an imbalanced dataset, accuracy can be high simply because the model is good at predicting the majority class. For example, if 95% of transactions are non-fraudulent, a model that always predicts non-fraudulent will be 95% accurate but useless for detecting fraud (see the short code sketch after this list).

Error Rate: Like accuracy, error rate (the percentage of incorrect predictions) can also be misleading in imbalanced scenarios because it doesn’t reflect the model’s performance on the minority class.

Mean Squared Error (MSE): For regression problems with imbalanced data, MSE can be dominated by the majority class errors, thus not representing how well the model performs on the minority class.
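To make the accuracy pitfall concrete, here is a minimal sketch; the dummy dataset and the DummyClassifier baseline are my own illustration, not part of the original example:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Roughly 95% non-fraudulent (class 0) vs 5% fraudulent (class 1)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# A "model" that always predicts the most frequent class
majority_clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority_clf.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))  # ~0.95, looks impressive
print("Recall  :", recall_score(y, y_pred))    # 0.0, catches no fraud at all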

Recommended Metrics

Instead, you should focus on metrics that give more insight into the performance on the minority class:

Precision: The ratio of true positive predictions to the total predicted positives.

Recall: The ratio of true positive predictions to the actual positives.

F1 Score: The harmonic mean of precision and recall, providing a balance between them.

Area Under the Precision-Recall Curve (AUC-PR): A better indicator than ROC-AUC in the case of imbalanced datasets.
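All of these are available in scikit-learn. A minimal sketch with made-up labels and scores (y_true, y_pred, and y_scores are placeholders for your own model's output):

from sklearn.metrics import precision_score, recall_score, f1_score, average_precision_score

# Ground-truth labels, hard predictions, and predicted probabilities for the positive class
y_true   = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred   = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_scores = [0.1, 0.2, 0.1, 0.6, 0.9, 0.4, 0.8, 0.3, 0.2, 0.7]

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
# average_precision_score summarizes the precision-recall curve (AUC-PR)
print("AUC-PR   :", average_precision_score(y_true, y_scores))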

Now that we know enough about class imbalance, let's jump into how it can be solved. There are three major techniques to resolve this problem.

Techniques

Undersampling

Undersampling reduces the number of instances in the majority class to balance the dataset. Example:

  • Original: 950 non-fraudulent, 50 fraudulent.
  • After undersampling: 50 non-fraudulent, 50 fraudulent.
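As a quick sketch of the idea, imbalanced-learn (the library used in the Sample Codes section below) provides RandomUnderSampler; the dummy dataset here roughly mirrors the 950/50 example:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Roughly 950 non-fraudulent (class 0) vs 50 fraudulent (class 1)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# Randomly drop majority-class rows until both classes have the same count
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("After :", Counter(y_res))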

Oversampling

Oversampling increases the number of instances in the minority class by duplicating or generating new instances. Example:

  • Original: 950 non-fraudulent, 50 fraudulent.
  • After oversampling: 950 non-fraudulent, 950 fraudulent.
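Similarly, a minimal sketch of plain random oversampling with imbalanced-learn's RandomOverSampler, which simply duplicates minority rows (SMOTE, covered later, synthesizes new ones instead):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# Randomly duplicate minority-class rows until both classes have the same count
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
print("After :", Counter(y_res))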

Undersampling+Oversampling

Combining both techniques balances the dataset by reducing the majority class and increasing the minority class. Example:

  • Original: 950 non-fraudulent, 50 fraudulent.
  • After mixing: 300 non-fraudulent (undersampled), 300 fraudulent (oversampled).
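A minimal sketch of this mix, chaining the two random samplers with explicit 300/300 targets via sampling_strategy (class 0 is the majority in this dummy dataset):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Original    :", Counter(y))

# Step 1: undersample the majority class (label 0) down to 300 rows
X_mid, y_mid = RandomUnderSampler(sampling_strategy={0: 300}, random_state=42).fit_resample(X, y)

# Step 2: oversample the minority class (label 1) up to 300 rows
X_res, y_res = RandomOverSampler(sampling_strategy={1: 300}, random_state=42).fit_resample(X_mid, y_mid)
print("After mixing:", Counter(y_res))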

Now we will dive deeper into some of the most important algorithms in each of these categories.

OVERSAMPLING

We will be discussing SMOTE and ADASYN in this section.

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a method used to address class imbalance by generating synthetic samples for the minority class. Instead of simply duplicating existing minority samples, SMOTE creates new instances by interpolating between existing ones.

How does SMOTE work?

Choose a Minority Instance: Randomly select a minority class instance.

Find Nearest Neighbors: Identify its k nearest minority class neighbors.

Generate Synthetic Sample: Randomly select one of these neighbors and create a synthetic sample by interpolating between the original instance and the neighbor.

Imagine you have a dataset for detecting fraudulent transactions:

Non-fraudulent transactions: 950

Fraudulent transactions: 50

To balance this using SMOTE:

Choose a fraudulent transaction, e.g., transaction A.

Find its nearest neighbors among the fraudulent transactions, say transactions B, C, and D.

Generate a synthetic sample by interpolating between A and one of its neighbors (say B).

Suppose transaction A has values (x1, y1) and transaction B has (x2, y2).

A synthetic transaction could be created with values (x1 + α*(x2 - x1), y1 + α*(y2 - y1)), where α is a random number between 0 and 1.

After applying SMOTE to generate enough synthetic samples to match the majority class, your dataset might look like this:

Non-fraudulent: 950

Fraudulent: 950 (50 original + 900 synthetic)
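The interpolation step itself is easy to sketch by hand. The two transactions below are made-up two-feature points, purely for illustration:

import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical fraudulent transactions described by two features
A = np.array([2.0, 3.0])
B = np.array([4.0, 5.0])

# alpha is a random number between 0 and 1
alpha = rng.random()

# The synthetic sample lies on the line segment between A and B
synthetic = A + alpha * (B - A)
print("alpha =", alpha, "-> synthetic sample:", synthetic)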

ADASYN

ADASYN can be seen as an extension of SMOTE that follows the same ideology with a minor difference. Instead of choosing a random minority sample for generating interpolated samples, it assigns weights to the minority samples and, based on these weights, prioritizes which minority samples to use while generating synthetic samples. Hence, a sample with a higher weight will be used more frequently to generate synthetic samples.

How are these weights assigned?

Calculate Density Distribution: ADASYN first calculates the density distribution of the minority class samples. This is done by finding the ratio of majority class samples to minority class samples among the k nearest neighbors of each minority sample.

Assume the below neighborhood distribution for two minority points:

Point A: 2 minority, 5 majority. Ratio = 5/2 = 2.5

Point B: 1 minority, 6 majority. Ratio = 6/1 = 6

Then the weights can be calculated by normalizing these ratio values:

A = 2.5/(2.5 + 6) ≈ 0.3

B = 6/(2.5 + 6) ≈ 0.7

Assign Weights: Based on the density distribution, ADASYN assigns weights to the minority class samples. Samples with a higher ratio of majority neighbors (i.e., located in sparser regions) are given higher weights, as they are considered harder to learn.

The rest of the process remains similar to SMOTE: the k nearest minority samples around the chosen sample are considered and new samples are interpolated between them.
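A minimal sketch using imbalanced-learn's ADASYN on a dummy dataset (the final counts may not be exactly equal, since ADASYN decides per sample how many synthetic points to generate):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# n_neighbors is the k used to estimate each minority sample's neighborhood ratio
adasyn = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = adasyn.fit_resample(X, y)
print("After :", Counter(y_res))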

UNDERSAMPLING

Tomek’s Link

It identifies pairs of examples that are very close to each other but belong to different classes, and removes the majority-class example of each such pair to help balance the dataset.

How it works in simple terms:

Find Pairs: Look for pairs of examples (data points) that are close to each other in terms of their features (like height, weight, etc.). By close, I mean the distance between them (be it Euclidean, Manhattan, or any other metric) is small.

Check Classes: See if these pairs belong to different classes (e.g., one is a “yes” and the other is a “no”).

Remove Majority Class Example: If they do belong to different classes, remove the example from the majority class (the class with more examples).
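A minimal sketch using imbalanced-learn's TomekLinks on a dummy dataset. Since only the majority-class member of each such pair is dropped, this cleans the class boundary rather than fully balancing the counts:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# By default, only the majority-class sample of each Tomek link is removed
X_res, y_res = TomekLinks().fit_resample(X, y)
print("After :", Counter(y_res))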

Nearest Neighbors

In the case of Tomek's Link, we identified similar samples with different labels. Here, we identify similar samples within the majority class only and remove the redundant ones.

The idea is simple:

Identify Nearest Neighbors: Find the most similar data points for each majority-class example in the dataset.
Remove Less Informative Examples: Remove the redundant examples from the majority class.
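The steps above describe the general idea rather than one named algorithm; imbalanced-learn offers several neighbor-based undersamplers (EditedNearestNeighbours, CondensedNearestNeighbour, NearMiss). As one example, here is a minimal sketch with EditedNearestNeighbours, which checks each majority sample against its k nearest neighbors and drops those that disagree with their neighborhood:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# n_neighbors is the k used when comparing each sample to its neighborhood
enn = EditedNearestNeighbours(n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)
print("After :", Counter(y_res))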

Oversampling + Undersampling (Mix)

SMOTE Tomek

As the name suggests, it is a combination of SMOTE (oversampling) and Tomek's Links (undersampling), where we:

  • Apply SMOTE to the minority class
  • Apply Tomek's Links to the majority class
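imbalanced-learn ships this combination directly as SMOTETomek. A minimal sketch on a dummy dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# SMOTE first synthesizes minority samples, then Tomek's Links cleans boundary pairs
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
print("After :", Counter(y_res))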

Similarly, we can have other combinations as well, like ADASYN + Tomek's Links, etc.

Sample Codes

Beyond the algorithms mentioned above, there are many others one can explore. There is also a specialized Python library for implementing these algorithms called imbalanced-learn. To install it:

pip install imbalanced-learn

Now, we will create a dummy classification dataset and use SMOTE to oversample the minority class (use scikit-learn==1.4.0):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

# Generate an imbalanced binary classification dataset
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=42)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the class distribution before oversampling
print("Before oversampling:", Counter(y_train))
# Before oversampling: Counter({0: 7192, 1: 808})

# Create an instance of SMOTE
smote = SMOTE()

# Apply SMOTE to the training data
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)

# Print the class distribution after oversampling
print("After oversampling:", Counter(y_train_oversampled))
# After oversampling: Counter({0: 7192, 1: 7192})

The code is as easy as it gets: you just need to call fit_resample() on a SMOTE() object. Any other algorithm from the library can be used in the same way.

With this, I will wrap up this long blog post. Hope this is useful!
