Oversampling — Handling Imbalanced Data

Abdallah Ashraf
8 min read · Dec 23, 2023


Data is the lifeblood of machine learning, but traditional models struggle when the data they learn from is heavily skewed. Imbalanced datasets, in which class instances are unevenly distributed, are common in fields like fraud detection, disease diagnosis, and anomaly identification. Data augmentation techniques aim to remedy these discrepancies.

Classification algorithms trained on imbalanced data often deliver poor predictive quality. Models bias heavily toward the majority class, overlooking the minority examples that are critical to many use cases. This renders them impractical for real-world problems involving rare but high-priority events.

Oversampling provides a method to rebalance classes before model training commences. By replicating minority class data points, oversampling balances the playing field and prevents algorithms from disregarding significant yet sparse classes.

Common oversampling techniques include random oversampling, SMOTE (Synthetic Minority Oversampling Technique), and ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning). Random oversampling naively duplicates minority examples, while SMOTE and ADASYN strategically generate synthetic new data to augment real instances.

In Python, the imbalanced-learn and scikit-learn libraries provide oversampling utilities. Researchers can evaluate performance differences when training classifiers on original versus oversampled data. This sheds light on how these techniques improve evaluation metrics such as recall, precision, and F1-score for minority classes.

While overfitting risks exist, judicious oversampling counteracts the downsides of imbalanced learning. With data-driven balances restored, machine learning models gain the ability to address critical use cases requiring exceptional sensitivity to rare events and outliers. For many real-world problems, oversampling proves a difference-making approach.

What is oversampling?

Oversampling is a data augmentation technique utilized to address class imbalance problems in which one class significantly outnumbers the others. It aims to rebalance training data distribution by amplifying the volume of instances that belong to the under-represented class.

Specifically, oversampling increases the minority class samples through replication of existing examples or generation of synthetic new data points. This is done by duplicating real minority observations or creating artificial additions modeled after real-world patterns in an attempt to even out class frequencies.

By amplifying underrepresented classes through oversampling prior to model training, the learned patterns more holistically represent all categories versus heavily favoring the dominant one. This improves various evaluation metrics for addressing needs involving detection of important but infrequent events.

Why Do We Need Oversampling?

When working with imbalanced datasets, we are usually interested in classifying the minority classes correctly. Hence, the cost of false negatives (i.e., failing to detect the minority class) is much higher than that of false positives (i.e., wrongly identifying a sample as belonging to the minority class).

However, conventional machine learning algorithms, like logistic regression and random forests, optimize overall performance metrics that implicitly assume balanced class distributions. As a result, models trained on skewed data tend to heavily favor the prevalent class and fail to learn the patterns that indicate the sparse but important classes.

By oversampling the minority class examples, the dataset is rebalanced to reflect more equal misclassification costs across all outcomes. This ensures classifiers can appropriately identify the underrepresented categories with greater accuracy and reduce costly false negatives.

Oversampling vs. Undersampling

Oversampling and undersampling are both techniques used to address class imbalance by balancing training data distributions. However, they achieve this balance in opposing ways.

Oversampling resolves imbalance by increasing the minority class through replication or generation of new examples. In contrast, undersampling balances classes by reducing the number of samples in the overrepresented majority class.

Undersampling can be effective when the majority class has many redundant or similar samples or when dealing with huge datasets. But it can also lead to a loss of information, resulting in biased models.

On the other hand, oversampling can be effective when datasets are small and only limited samples of the minority class are available. However, it can also lead to overfitting, either through data duplication or through synthetic data that are not representative of the real data.

In this article, we will explore different types of oversampling approaches.

1. Random oversampling

Random oversampling is a straightforward technique that duplicates minority class examples at random to balance class distributions.

It selects existing instances from the under-represented class at random and replicates them without alteration. This efficiently increases the number of minority observations when datasets are small, without requiring the collection of additional real-world data.

To implement random oversampling, we can utilize the RandomOverSampler tool within the popular imbalanced-learn library. It handles the process of arbitrarily replicating minority class examples to parity with majority cases before model training proceeds.

This provides a simple baseline approach for comparison against more complex oversampling methods. While duplication may risk eventual overfitting, random oversampling remains a conceptually simple initial balance option, especially for small datasets where synthesizing new data is premature.

# NOTE: create_dataset and plot_decision_function are plotting helpers, and clf
# is a scikit-learn classifier (e.g. LogisticRegression); all three are assumed
# to be defined earlier, as in the imbalanced-learn example this snippet is
# adapted from.
import matplotlib.pyplot as plt

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline

# A small three-class dataset in which two classes are under-represented.
X, y = create_dataset(n_samples=100, weights=(0.05, 0.25, 0.7))

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

# Fit the classifier on the raw, imbalanced data.
clf.fit(X, y)
plot_decision_function(X, y, clf, axs[0], title="Without resampling")

# Resample the minority classes to parity, then fit the same classifier in a pipeline.
sampler = RandomOverSampler(random_state=0)
model = make_pipeline(sampler, clf).fit(X, y)
plot_decision_function(X, y, model, axs[1], title=f"Using {model[0].__class__.__name__}")

fig.suptitle(f"Decision function of {clf.__class__.__name__}")
fig.tight_layout()
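
For readers who want to see the rebalancing itself rather than the decision boundary, here is a minimal, self-contained sketch (a toy two-class dataset generated with scikit-learn; the class weights are made up for illustration) that prints the class counts before and after resampling:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=0)
print("Before:", Counter(y))

# Randomly duplicate minority samples until both classes have equal counts.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print("After:", Counter(y_res))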

2. Smoothed bootstrap oversampling

Random oversampling with noise is a modified version of simple random oversampling that aims to address its overfitting limitations. Rather than duplicating minority class examples precisely, this method synthesizes new data points by introducing randomness or noise into the features of existing underrepresented observations.

By default, random over-sampling generates a plain bootstrap, i.e. exact duplicates. The shrinkage parameter adds a small perturbation to the generated samples, producing a smoothed bootstrap instead. The plot below shows the difference between the two data-generation strategies.

# `sampler` is the RandomOverSampler created in the previous snippet;
# plot_resampling is another assumed plotting helper.
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

# shrinkage=None (the default): exact duplicates, i.e. a plain bootstrap.
sampler.set_params(shrinkage=None)
plot_resampling(X, y, sampler, ax=axs[0], title="Normal bootstrap")

# shrinkage=0.3: add a small perturbation to obtain a smoothed bootstrap.
sampler.set_params(shrinkage=0.3)
plot_resampling(X, y, sampler, ax=axs[1], title="Smoothed bootstrap")

fig.suptitle(f"Resampling with {sampler.__class__.__name__}")
fig.tight_layout()

Smoothed bootstrap oversampling generates additional new minority class samples beyond simply duplicating existing instances like random oversampling. This is because the synthesized samples are not direct replications of original observations.

Rather than arbitrarily repeating minority observations, the smoothed bootstrap draws new data points from a small neighborhood around real samples by adding random noise to them. This expands the available minority data beyond the original records through data augmentation instead of direct replication.

The perturbed data points are “smoothed” copies that occupy the feature space around real instances rather than lying exactly on top of them. Therefore, smoothed bootstrap oversampling produces more novel synthetic minority examples than random oversampling alone, which helps address concerns about overfitting from pure duplication while still balancing class distributions.
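
A quick way to check this claim is to compare how many unique rows the two strategies produce. The sketch below reuses the toy two-class dataset from the earlier sketch and is only meant as an illustration of the shrinkage parameter:

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=0)

# Plain bootstrap (shrinkage=None): every added point is an exact duplicate.
X_dup, _ = RandomOverSampler(random_state=0).fit_resample(X, y)
# Smoothed bootstrap: added points are perturbed copies, so they are new rows.
X_smooth, _ = RandomOverSampler(random_state=0, shrinkage=0.3).fit_resample(X, y)

print("unique rows, plain bootstrap:   ", len(np.unique(X_dup, axis=0)))
print("unique rows, smoothed bootstrap:", len(np.unique(X_smooth, axis=0)))

With the plain bootstrap the number of unique rows stays at the original dataset size, while the smoothed bootstrap yields roughly as many unique rows as resampled rows.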

Advantages

Random oversampling has the benefit of being a very straightforward and simple technique to implement. It does not require complex algorithms or assumptions about the data’s underlying distribution. As such, it can be readily applied to any imbalanced dataset without special prior knowledge.

Drawbacks

However, random oversampling is also limited by the potential for overfitting. Since it merely duplicates existing minority class examples rather than generating truly novel samples, the new observations do not provide additional informative details about under-represented classes. Instead of properly characterizing sparse classes more comprehensively, duplication may just amplify noise in the training data.

As a result, models can become overly tailored to the specific nuances of the initial dataset rather than capturing the true underlying patterns. This limits their ability to generalize well when exposed to new unseen data.

More advanced over-sampling using ADASYN and SMOTE

Instead of repeating the same samples when over-sampling, or perturbing the generated bootstrap samples, we can use more targeted heuristics to create new data. SMOTE and ADASYN are two such approaches.

3. SMOTE

SMOTE, which stands for Synthetic Minority Oversampling Technique, is a widely used oversampling method for mitigating class imbalance in machine learning.

The key concept underlying SMOTE is that it artificially generates new synthetic data points for underrepresented classes through interpolation rather than duplicating. Specifically, it randomly selects a minority observation and determines its k nearest neighboring minority examples based on feature space distance.

A new synthetic sample is then generated by interpolating between the initial instance and one of those k neighbors, chosen at random. This interpolation strategy synthesizes novel data points that populate the region between real observations, effectively expanding the available minority examples without duplicating original records.

By synthesizing examples that fall within the feature space bounded by real instances, SMOTE aims to portray minority class distribution more comprehensively than simpler duplication techniques. The resulting oversampled training sets leverage neighborhood information to better characterize underrepresented categories for machine learning models.
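
The interpolation step itself is simple to write down. The following sketch illustrates the idea behind a single SMOTE-style synthetic point with NumPy; it is a conceptual illustration with made-up feature values, not the library implementation (in practice you would call imblearn.over_sampling.SMOTE).

import numpy as np

rng = np.random.default_rng(0)

# A minority sample and one of its k nearest minority-class neighbors
# (feature values are made up for illustration).
x_i = np.array([2.0, 3.0])
x_neighbor = np.array([4.0, 1.0])

# Pick a random point on the line segment between the two real samples.
lam = rng.uniform(0.0, 1.0)
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(x_synthetic)  # lies between x_i and x_neighbor in feature space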

Advantages of SMOTE

SMOTE can generate new samples based on existing ones, which helps to add more information to the dataset to improve model performance.

Limitations of SMOTE

One of the main drawbacks of SMOTE is that it may introduce noise with the synthetic instances, especially when the number of nearest neighbors is set too high. Additionally, SMOTE may not work well on tightly clustered minority class instances or when there are few instances in the minority class.

4. Adaptive Synthetic Sampling (ADASYN)

ADASYN is an alternative oversampling technique that concentrates synthetic sample generation in the regions of feature space close to the decision boundary. It generates more synthetic samples for minority class instances that are harder to learn, i.e., those closer to the decision boundary.

ADASYN uses minority class samples as templates for synthetic data when some of their closest neighbors belong to the opposite class; the more opposite-class neighbors a sample has, the more likely it is to be chosen as a template. After selecting the templates, it generates new examples by interpolating between each template and its closest neighbors from the same class.
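
One visible consequence of this difficulty weighting is that ADASYN does not have to produce perfectly balanced counts: the number of synthetic points generated per minority sample depends on how many majority-class neighbors it has. The following sketch (the same toy two-class setup as before, used purely for illustration) compares the class counts produced by the two methods:

from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, SMOTE

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=0)

# SMOTE balances the classes exactly; ADASYN's counts can deviate slightly
# because it allocates more synthetic points to harder minority samples.
_, y_smote = SMOTE(random_state=0).fit_resample(X, y)
_, y_adasyn = ADASYN(random_state=0).fit_resample(X, y)

print("SMOTE: ", Counter(y_smote))
print("ADASYN:", Counter(y_adasyn))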

SMOTE vs. ADASYN

from imblearn import FunctionSampler  # identity sampler that leaves the data unchanged
from imblearn.over_sampling import ADASYN, SMOTE

X, y = create_dataset(n_samples=150, weights=(0.1, 0.2, 0.7))

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))

samplers = [
    FunctionSampler(),  # shows the original dataset
    RandomOverSampler(random_state=0),
    SMOTE(random_state=0),
    ADASYN(random_state=0),
]

for ax, sampler in zip(axs.ravel(), samplers):
    title = "Original dataset" if isinstance(sampler, FunctionSampler) else None
    plot_resampling(X, y, sampler, ax, title=title)

fig.tight_layout()

The following plot illustrates the difference between ADASYN and SMOTE: ADASYN focuses on the samples that are difficult to classify with a nearest-neighbors rule, while regular SMOTE makes no such distinction. As a result, the decision function differs depending on which algorithm is used.

X, y = create_dataset(n_samples=150, weights=(0.05, 0.25, 0.7))

fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))

# Compare the same classifier trained without resampling, with ADASYN, and with SMOTE.
models = {
    "Without sampler": clf,
    "ADASYN sampler": make_pipeline(ADASYN(random_state=0), clf),
    "SMOTE sampler": make_pipeline(SMOTE(random_state=0), clf),
}

for ax, (title, model) in zip(axs, models.items()):
    model.fit(X, y)
    plot_decision_function(X, y, model, ax=ax, title=title)

fig.suptitle(f"Decision function using a {clf.__class__.__name__}")
fig.tight_layout()
