Faster Resampling with Imbalanced-learn and cuML

Nick Becker
RAPIDS AI
Feb 22, 2023

Authors (alphabetical): Nick Becker (NVIDIA), Dante Gama Dessavre (NVIDIA), and Corey Nolet (NVIDIA)

Introduction

Imbalanced-learn is the most popular open source library for resampling datasets with class imbalance. With more than 3 million downloads in the past month alone, it’s a critical piece of the data analysis and machine learning ecosystem.

RAPIDS cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. By leveraging the massive parallelism and high-bandwidth memory of NVIDIA GPUs, cuML can provide significant speedups compared to CPU-based machine learning libraries. With an API that mirrors scikit-learn, cuML should feel familiar to users.
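
As a small illustration of that familiarity, cuML's NearestNeighbors is constructed and queried just like its scikit-learn counterpart (the snippet below is illustrative and not part of our benchmarks):

import numpy as np
from cuml.neighbors import NearestNeighbors

# Illustrative only: same constructor arguments and methods as
# sklearn.neighbors.NearestNeighbors, but the work runs on the GPU
X = np.random.rand(10000, 20).astype(np.float32)

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X)
distances, indices = nn.kneighbors(X)  # same call signature as scikit-learn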

In this post, we walk through how you can now use cuML estimators with imbalanced-learn to run many resampling techniques significantly faster on large datasets. Based on our results, cuML often provides 5–15x speedups for techniques like SMOTE, ADASYN, and EditedNearestNeighbours and 100–300x speedups for techniques like SVMSMOTE and CondensedNearestNeighbour.

Why GPUs?

Resampling techniques, which often rely on nearest neighbors algorithms in addition to algorithm-specific logic, can be extraordinarily computationally expensive. Among many performance improvements over time, the recent enhancements to pairwise distance primitives in scikit-learn have substantially sped up imbalanced-learn’s out-of-the-box performance.

While these enhancements have improved performance, on larger datasets many resampling techniques remain fundamentally bottlenecked by the core computational primitives underpinning them. As an example, the following code took about 45 minutes on a system with dual Intel Xeon E5-2698 20-core CPUs (80 logical cores).

from imblearn.over_sampling import SVMSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100000,
    n_features=100,
    n_redundant=0,
    n_informative=100,
    n_classes=5,
    n_clusters_per_class=1,
    weights=[0.8, 0.05, 0.05, 0.05, 0.05]
)

X_resampled, y_resampled = SVMSMOTE().fit_resample(X, y)

If we wanted to include this resampling step in a broader cross-validation machine learning pipeline, we’d be waiting quite a long time, since the resampling runs once per fold (see the sketch below). And because scaling is often non-linear, larger datasets pose even larger problems. To run these kinds of techniques at scale, we need to do something differently.
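
To make that cost concrete, here is a minimal sketch of how a resampler typically slots into a cross-validation pipeline (the classifier, fold count, and pipeline layout are illustrative and not the configuration used in our benchmarks). With imbalanced-learn's Pipeline, the expensive fit_resample step runs on every training fold:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SVMSMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative pipeline: resampling is applied to each training fold,
# so a 45-minute fit_resample becomes hours of end-to-end runtime
pipe = Pipeline([
    ("resample", SVMSMOTE()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# X, y from the make_classification call above
scores = cross_val_score(pipe, X, y, cv=5)  # resampling runs 5 times here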

Using Imbalanced-learn with cuML

cuML makes it possible to use these techniques at scale. You can install cuML and imbalanced-learn in the same environment using conda or pip.

# conda/mamba
mamba create -n rapids-imblearn -c rapidsai -c conda-forge -c nvidia rapids=23.02 python=3.8 cudatoolkit=11.5 imbalanced-learn
mamba activate rapids-imblearn

# pip
python3.8 -m venv rapids-imblearn
source rapids-imblearn/bin/activate
pip install --upgrade pip
pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
pip install imbalanced-learn
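
After installing, a quick sanity check (illustrative) confirms that both libraries import cleanly and reports their versions:

# Illustrative sanity check: both libraries should import without errors
import cuml
import imblearn

print(cuml.__version__)      # e.g., 23.02
print(imblearn.__version__)  # e.g., 0.10.x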

Using cuML with imbalanced-learn is easy: just pass the cuML estimator as an argument when instantiating your resampler of choice. In the following sections, we share several examples that highlight the ease of use and benchmark the impact of using cuML on datasets of different sizes and shapes.

For the benchmark datasets, we use scikit-learn’s dataset generators and vary the number of rows, number of features, and number of classes. Class weights are set to [0.9, 0.1] in the two-class scenario and [0.8, 0.05, 0.05, 0.05, 0.05] in the five-class scenario, to represent a plausible imbalance. CPU benchmarks were run on a system with dual Intel Xeon E5-2698 20-core CPUs (80 logical cores) and GPU benchmarks were run on a Google Cloud a2-highgpu-1g instance with an A100 40 GB GPU and 12 vCPUs.
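
For readers who want to reproduce these comparisons, a minimal wall-clock timing sketch follows (the dataset size is illustrative, and this is not necessarily the exact harness behind the published numbers). Swapping the k_neighbors estimator between scikit-learn's and cuML's NearestNeighbors compares the CPU and GPU backends on the same data:

import time

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from cuml.neighbors import NearestNeighbors

# Illustrative timing pattern, not the exact benchmark harness
X, y = make_classification(
    n_samples=1000000,
    n_features=20,
    n_informative=20,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1]
)

start = time.perf_counter()
X_res, y_res = SMOTE(k_neighbors=NearestNeighbors(n_neighbors=6)).fit_resample(X, y)
print(f"fit_resample took {time.perf_counter() - start:.1f} seconds")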

Oversampling

cuML can accelerate a wide set of oversampling methods in imbalanced-learn. Below, we highlight three popular techniques.

SMOTE

SMOTE is an oversampling technique that creates synthetic data by interpolating between records in the underrepresented classes and their nearest neighbors. This means that computational performance scales with the number of samples in each underrepresented class, not the overall dataset size.

On a 10M x 20 dataset with [0.9, 0.1] class weights, standard SMOTE took about 150 seconds. Using cuML’s NearestNeighbors brings this down to about 18 seconds, an 8x speedup.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from cuml.neighbors import NearestNeighbors

X, y = make_classification(
    n_samples=10000000,
    n_features=20,
    n_redundant=0,
    n_informative=20,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1]
)

# Pass the cuML estimator through SMOTE's k_neighbors parameter
nn = NearestNeighbors(n_neighbors=6)
X_resampled, y_resampled = SMOTE(k_neighbors=nn).fit_resample(X, y)

On the 10M row datasets, the benchmark results suggest that cuML can provide speedups ranging from 2x to 8x.

SVMSMOTE

SVMSMOTE is similar to SMOTE, but it uses a Support Vector Machine classifier to select the candidates from which synthetic samples are generated. Using an SVM makes it much more computationally expensive, which explains the 45-minute runtime we saw in the original example above.

By using cuML’s NearestNeighbors and SVC estimators, we can run the original example in 9 seconds rather than 45 minutes (a 285x speedup).

from imblearn.over_sampling import SVMSMOTE
from sklearn.datasets import make_classification
from cuml.neighbors import NearestNeighbors
from cuml.svm import SVC

X, y = make_classification(
    n_samples=100000,
    n_features=100,
    n_redundant=0,
    n_informative=100,
    n_classes=5,
    n_clusters_per_class=1,
    weights=[0.8, 0.05, 0.05, 0.05, 0.05]
)

nn = NearestNeighbors(n_neighbors=6)
svm = SVC()
X_resampled, y_resampled = SVMSMOTE(
    k_neighbors=nn,
    m_neighbors=nn,
    svm_estimator=svm
).fit_resample(X, y)

Performance gains are significant across a varied number of classes and features, ranging from 50x to 340x.

ADASYN

ADASYN is also similar to regular SMOTE, but it uses a two-step nearest neighbors process to generate synthetic samples (first using the full dataset to find the nearest neighbors of records in the underrepresented class, and then doing a second pass within just the underrepresented class).

Using the full dataset and an additional step makes it significantly more computationally expensive than regular SMOTE, though still less than SVMSMOTE.

from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from cuml.neighbors import NearestNeighbors

X, y = make_classification(
    n_samples=10000000,
    n_features=20,
    n_redundant=0,
    n_informative=20,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1]
)

# Pass the cuML estimator through ADASYN's n_neighbors parameter
nn = NearestNeighbors(n_neighbors=6)
X_resampled, y_resampled = ADASYN(n_neighbors=nn).fit_resample(X, y)

On the 10M x 20 datasets, the benchmark results suggest that cuML can provide up to 13x speedups.

Undersampling

cuML can also accelerate undersampling methods. Below, we highlight two canonical techniques.

CondensedNearestNeighbour

CondensedNearestNeighbour is an iterative undersampling technique that uses a 1-nearest-neighbor rule to decide, record by record, whether each sample in the overrepresented class should be kept. As a result, it’s very slow, so we use much smaller datasets for this benchmark.

As usual, cuML can be dropped right in (with one extra step for this resampler):

from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.datasets import make_classification
from cuml.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=50000,
    n_features=20,
    n_redundant=0,
    n_informative=20,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1]
)

knn = KNeighborsClassifier(n_neighbors=1)
cnn = CondensedNearestNeighbour(n_neighbors=knn)
cnn.estimator_ = knn # extra step for this resampler
X_res, y_res = cnn.fit_resample(X, y)

On only 50,000 rows, the benchmark results suggest that cuML can provide 50–100x speedups, turning about an hour of waiting into just seconds in some cases.

EditedNearestNeighbours

EditedNearestNeighbours is an undersampling technique that identifies the nearest neighbors of each record in the overrepresented class and removes samples which do not agree “enough” with their neighborhood (quoting the imbalanced-learn documentation). As with the other resamplers, cuML’s NearestNeighbors drops right in:

from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification
from cuml.neighbors import NearestNeighbors

X, y = make_classification(
    n_samples=50000,
    n_features=20,
    n_redundant=0,
    n_informative=20,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1]
)

nn = NearestNeighbors(n_neighbors=4)
enn = EditedNearestNeighbours(n_neighbors=nn)
X_res, y_res = enn.fit_resample(X, y)

Depending on dataset characteristics, cuML speedups range from 3x to 13x.

Conclusion

Imbalanced-learn is the canonical tool for resampling imbalanced datasets in Python. With support for using RAPIDS cuML estimators, it’s now possible to use imbalanced-learn on larger datasets than ever before. In the benchmarks above, we saw more than 100x speedups using cuML on an NVIDIA GPU.

To learn more, visit the imbalanced-learn and RAPIDS cuML documentation.

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU-acceleration to your project, please reach out on GitHub or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.

Acknowledgements

During his NVIDIA internship, James Thomson (Stanford) contributed to RAPIDS cuML and scikit-learn-contrib/imbalanced-learn to help make this integration possible.
