

Faster Resampling with Imbalanced-learn and cuML

Authors (alphabetical): Nick Becker (NVIDIA), Dante Gama Dessavre (NVIDIA), and Corey Nolet (NVIDIA)


Imbalanced-learn is the most popular open source library for resampling datasets with class imbalance. With more than 3 million downloads in the past month alone, it’s a critical piece of the data analysis and machine learning ecosystem.

RAPIDS cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. By leveraging the massive parallelism and high-bandwidth memory of NVIDIA GPUs, cuML can provide significant speedups compared to CPU-based machine learning libraries. With an API that mirrors scikit-learn, cuML should feel familiar to users.

In this post, we walk through how you can now use cuML estimators with imbalanced-learn to run many resampling techniques significantly faster on large datasets. Based on our results, cuML often provides 5–15x speedups for techniques like SMOTE, ADASYN, and EditedNearestNeighbours and 100–300x speedups for techniques like SVMSMOTE and CondensedNearestNeighbour.

Why GPUs?

Resampling techniques, which often rely on nearest neighbors algorithms in addition to algorithm-specific logic, can be extraordinarily computationally expensive. Among many performance improvements over time, recent enhancements to the pairwise distance primitives in scikit-learn have substantially sped up imbalanced-learn’s out-of-the-box results.

While these enhancements have improved performance, many resampling techniques remain fundamentally bottlenecked on larger datasets by the core computational primitives underpinning them. As an example, a single SVMSMOTE resampling run took about 45 minutes on a system with dual Intel Xeon E5-2698 20-core CPUs (80 logical cores).

If we wanted to include this resampling step in a broader cross-validation machine learning pipeline, we’d be waiting quite a long time. As scaling is often non-linear, larger datasets pose even larger problems. To run these kinds of techniques at scale, we need to do something different.

Using Imbalanced-learn with cuML

cuML makes it possible to use these techniques at scale. You can install cuML and imbalanced-learn in the same environment using conda or pip.

Using cuML with imbalanced-learn is easy: just pass the cuML estimator as an argument when instantiating your resampler of choice. In the following sections, we share several examples that highlight the ease of use and benchmark the impact of cuML on datasets of different sizes and shapes.

For the benchmark datasets, we use scikit-learn’s dataset generators and vary the number of rows, number of features, and number of classes. Class weights are set to [0.9, 0.1] in the two-class scenario and [0.8, 0.05, 0.05, 0.05, 0.05] in the five-class scenario, to represent a plausible imbalance. CPU benchmarks were run on a system with dual Intel Xeon E5-2698 20-core CPUs (80 logical cores) and GPU benchmarks were run on a Google Cloud a2-highgpu-1g instance with an A100 40 GB GPU and 12 vCPUs.
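A dataset of this kind could be generated roughly as below. The helper name `make_imbalanced` is hypothetical, and the choice of generator (`make_classification`), the number of informative features, and the single cluster per class are assumptions; only the class weights come from the description above.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_imbalanced(n_samples, n_features, weights, seed=0):
    """Hypothetical helper: build an imbalanced dataset with the given class weights."""
    return make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_features // 2,  # assumption, not stated in the text
        n_redundant=0,
        n_classes=len(weights),
        n_clusters_per_class=1,         # assumption, not stated in the text
        weights=weights,
        random_state=seed,
    )


# Two-class and five-class scenarios, at a small size for illustration.
X2, y2 = make_imbalanced(10_000, 20, [0.9, 0.1])
X5, y5 = make_imbalanced(10_000, 20, [0.8, 0.05, 0.05, 0.05, 0.05])

# Observed class proportions; a small amount of label noise shifts them slightly.
print(np.bincount(y2) / len(y2))
print(np.bincount(y5) / len(y5))
```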


cuML can accelerate a wide set of oversampling methods in imbalanced-learn. Below, we highlight three popular techniques.

SMOTE is an oversampling technique that uses a nearest neighbor algorithm to oversample and create synthetic data based on the nearest neighbors of records within the underrepresented classes. This means that computational performance scales with the number of samples per underrepresented class in the dataset, not the overall dataset size.
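The core idea can be illustrated with a simplified interpolation sketch. This is not imbalanced-learn's implementation, and the function name `smote_like_samples` is hypothetical. It shows why the neighbor search only involves the minority class: each synthetic point is placed on the segment between a minority sample and one of its minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style interpolation (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: each point comes back as its own nearest neighbor
    # (assuming no exact duplicates), so the first column is dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    # Pick a random base point and one of its k neighbors for every new sample.
    base = rng.integers(0, len(X_minority), size=n_new)
    partner = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[partner] - X_minority[base])


data_rng = np.random.default_rng(42)
X_minority = data_rng.normal(size=(200, 20))  # only the minority class is searched
X_new = smote_like_samples(X_minority, n_new=1_000)
print(X_new.shape)  # (1000, 20)
```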

On a 10M x 20 dataset with [0.9, 0.1] class weights, standard SMOTE took about 150 seconds. Using cuML’s NearestNeighbors brings this down to about 18 seconds, an 8x speedup.

On the 10M row datasets, the benchmark results suggest that cuML can provide speedups ranging from 2x to 8x.

SVMSMOTE is similar to SMOTE, but uses a Support Vector Machine classifier to select the candidates from which to generate synthetic samples. Using an SVM makes it much more computationally expensive, explaining the 45-minute runtime we saw in the original example above.

By using cuML’s NearestNeighbors and SVC estimators, we can run the original example in 9 seconds rather than 45 minutes (a 285x speedup).

Performance gains are significant across a varied number of classes and features, ranging from 50x to 340x.

ADASYN is also similar to regular SMOTE, but it uses a two-step nearest neighbors process to generate synthetic samples: first using the full dataset to find nearest neighbors for the underrepresented class, and then doing a second pass within just the underrepresented class.

Using the full dataset and an additional step makes it significantly more computationally expensive than regular SMOTE, though still less than SVMSMOTE.

On the 10M x 20 datasets, the benchmark results suggest that cuML can provide up to 13x speedups.


cuML can also accelerate undersampling methods. Below, we highlight two canonical techniques.

CondensedNearestNeighbour is an iterative undersampling technique that applies a nearest neighbor classifier to each record in the overrepresented class to help decide whether to keep the record. Because it works through records one at a time, it’s very slow, so we use much smaller datasets for this benchmark.

As usual, cuML can be dropped right in, with one extra step for this resampler.

On only 50,000 rows, the benchmark results suggest that cuML can provide 50–100x speedups, turning about an hour of waiting into just seconds in some cases.

EditedNearestNeighbours is an undersampling technique that identifies the nearest neighbors of each record in the overrepresented class and removes “samples which do not agree ‘enough’ with their neighborhood”.

Depending on dataset characteristics, cuML speedups range from 3x to 13x.


Imbalanced-learn is the canonical tool for resampling imbalanced datasets in Python. With support for using RAPIDS cuML estimators, it’s now possible to use imbalanced-learn on larger datasets than ever before. In the benchmarks above, we saw more than 100x speedups using cuML on an NVIDIA GPU.

To learn more, visit the imbalanced-learn and RAPIDS cuML documentation.

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU acceleration to your project, please reach out on GitHub or Twitter. The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.


During his NVIDIA internship, James Thomson (Stanford) contributed to RAPIDS cuML and scikit-learn-contrib/imbalanced-learn to help make this integration possible.



