Never Leave the GPU: End-to-end Machine Learning Pipelines with RAPIDS Preprocessing

William Hicks
RAPIDS AI
Nov 23, 2020

(Photo by Andy Beales on Unsplash)

Since its inception, RAPIDS cuML has offered dramatically faster training and inference of machine learning (ML) models through GPU acceleration. However, as any data scientist can tell you, the model itself is only one part of a successful machine learning solution. Frequently, the best solutions to a machine learning problem involve extensive preprocessing of the input data to speed up convergence or improve model performance, and finding the right preprocessing approach requires the ability to iterate quickly through possibilities and test their performance impact.

With the v0.16 release of RAPIDS, cuML now includes a whole slate of new tools to speed up common preprocessing tasks, allowing you to try more preprocessing pipelines faster and home in on that optimal solution. Importantly, these new tools mean that for most ML pipelines, your data never has to leave the GPU, eliminating the overhead of device-to-host copies. On an NVIDIA V100 with a local GPU memory bandwidth of about 900 GB/s but a PCIe bandwidth of 32 GB/s, these device-to-host copies can be about 30 times slower than an equivalent on-device copy. With the new A100, GPU memory bandwidth stands at 2 TB/s, making for even faster on-device copies. The increased total memory of the A100 also means that you can have up to 640 GB of data loaded in GPU memory on a single server, allowing even relatively large datasets to stay on-device through an entire ML workflow.
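If you want to see this gap on your own hardware, a quick CuPy microbenchmark makes it concrete (a minimal sketch; the 1 GiB array size is arbitrary, and the exact numbers will depend on your GPU and PCIe generation):

    import time
    import cupy as cp

    x = cp.ones(1 << 28, dtype=cp.float32)   # a 1 GiB array resident on the GPU

    def timed(fn, reps=10):
        cp.cuda.Device().synchronize()       # wait for any pending GPU work
        start = time.perf_counter()
        for _ in range(reps):
            fn()
        cp.cuda.Device().synchronize()       # wait for the copies to finish
        return (time.perf_counter() - start) / reps

    d2d = timed(lambda: x.copy())            # on-device copy: GPU memory bandwidth
    d2h = timed(lambda: cp.asnumpy(x))       # device-to-host copy: limited by PCIe
    print(f"on-device: {d2d * 1e3:.1f} ms, device-to-host: {d2h * 1e3:.1f} ms")

On a V100 attached over PCIe 3.0, the device-to-host copy should come out far slower, in line with the bandwidth ratio above.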

In the following Jupyter notebook, we’ll walk through several of cuML’s new preprocessing algorithms and demonstrate the significant speedups from keeping your data exclusively on the GPU.

Standing on the Shoulders of Giants

Medieval picture of dwarf (cuML) on the shoulder of a giant (Scikit-Learn)

Before digging into too much technical detail, it is worth acknowledging the tremendous debt of gratitude that cuML owes to Scikit-Learn. Thanks to its gentle learning curve, thoughtful design, and enthusiastic community, the Scikit-Learn API has become the de facto standard for machine learning implementations. It is no wonder that cuML decided early on to retain as much API compatibility with Scikit-Learn as possible, aiming to bring GPU acceleration to an already well-developed ecosystem of Scikit-Learn-based tools.

That decision allowed the RAPIDS team to benefit even more directly from Scikit-Learn than we have in the past while developing the new preprocessing tools demonstrated here. For the new “cuml.experimental.preprocessing” module, we were able to directly incorporate Scikit-Learn code (still distributed under the Scikit-Learn license, of course) into cuML with only a small amount of adapter logic to gain the benefits of end-to-end GPU acceleration. So if you use these features in your work, remember that they are available thanks to the dedicated work of Scikit-Learn developers, and don’t forget to cite them.

The Preprocessing Menu

(Photo by Ulysse Pointcheval on Unsplash)

Preprocessing algorithms available in cuML can be divided into five broad categories (a short code sketch follows the list): imputers, scalers, discretizers, encoders, and feature generators.

  • Imputers are responsible for filling in missing data based on the data that is available.
  • Scalers can be used to recenter a feature by subtracting its average (typically the mean or median) and/or rescale it by its spread (typically the standard deviation or interquartile range) or total range.
  • Discretizers take quantitative features and assign them to discrete categories or bins.
  • Encoders generally do something of the opposite — taking categorical features and assigning some more useful numeric representation to them.
  • Feature generators create new features by combining existing ones.
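As a quick taste of how several of these categories look in code, here is a minimal sketch on toy data (the module path follows the v0.16 layout; encoders such as cuML’s OneHotEncoder follow the same fit/transform pattern):

    import cudf
    from cuml.experimental.preprocessing import (
        KBinsDiscretizer, PolynomialFeatures, SimpleImputer, StandardScaler
    )

    # A toy GPU DataFrame with one missing value
    df = cudf.DataFrame({"a": [1.0, 2.0, None, 4.0],
                         "b": [10.0, 20.0, 30.0, 40.0]})

    # Imputer: fill the gap with the column mean
    X = SimpleImputer(strategy="mean").fit_transform(df)

    # Scaler: recenter to zero mean and rescale to unit variance
    X = StandardScaler().fit_transform(X)

    # Discretizer: bin each feature into three ordinal buckets
    binned = KBinsDiscretizer(n_bins=3, encode="ordinal").fit_transform(X)

    # Feature generator: add degree-2 polynomial combinations
    expanded = PolynomialFeatures(degree=2).fit_transform(X)

Everything here runs on-device, and because the API mirrors Scikit-Learn, the same lines work with Scikit-Learn estimators and a Pandas DataFrame.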

The entire array of preprocessing algorithms available in cuML is shown below:

[Table: names of the cuML preprocessors of each type; the full list appears in the cuML preprocessing docs]
Currently-available cuML preprocessors

Details on all of these can be found in the cuML docs. Algorithms marked with an asterisk are currently classified as experimental. The RAPIDS team is looking for feedback and will continue to improve this code in upcoming releases. If you have questions about these new features or run into issues, please do reach out to us through our issue tracker.
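To check which preprocessors your installed version actually exposes, you can list the module contents directly (a quick sanity check, assuming RAPIDS v0.16+):

    import cuml.experimental.preprocessing as prep

    # Print the public names exported by the experimental preprocessing module
    print(sorted(name for name in dir(prep) if not name.startswith("_")))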

An Example

To showcase a few of these newly-accelerated algorithms, we have created a Jupyter notebook, which walks through their application to the BNP Paribas Cardif Claims Management dataset. This dataset contains a little over 100,000 samples, each with about 130 features, both categorical and quantitative. We chose this dataset both because it is somewhat challenging (many missing values and no information on the actual meaning of features) and because it is a good size to show off the runtime improvements of cuML without making it tedious to run the Scikit-Learn equivalent. The goal of this demo is primarily to show off the process of creating an end-to-end GPU-accelerated pipeline; for a detailed look at maximizing classification performance on this dataset, check out the leading solutions for the associated Kaggle challenge.

Jupyter notebook walkthrough of cuML preprocessing tools

If you would like to try running this notebook locally, you can do so by first installing RAPIDS v0.16 and then downloading the notebook from here.
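If you just want a sense of the overall shape of such a pipeline before opening the notebook, it looks roughly like this (a minimal sketch: the file path and label column name are hypothetical, this version skips the categorical columns, and the real notebook does considerably more):

    import cudf
    from cuml.experimental.preprocessing import SimpleImputer, StandardScaler
    from cuml.linear_model import LogisticRegression

    # Read the training data straight into GPU memory (hypothetical path)
    df = cudf.read_csv("train.csv")
    y = df.pop("target").astype("float32")   # hypothetical label column

    # Keep only the quantitative columns for this simplified sketch
    X = df.select_dtypes(exclude="object")

    # Impute missing values and standardize, entirely on-device
    X = SimpleImputer(strategy="mean").fit_transform(X)
    X = StandardScaler().fit_transform(X)

    # Fit a GPU classifier; the data never crossed the PCIe bus
    model = LogisticRegression().fit(X, y)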

Results

The final pipeline in our demo notebook included examples of all five major categories of preprocessing algorithms. Below are average timing results for the pipeline with both cuML and Scikit-Learn on an NVIDIA DGX-1, using a single V100 GPU.

[Table: 4.3x runtime improvement for preprocessing and 2.5x for the overall pipeline when using cuML]

Note that the times for the individual preprocessing steps do not add up to the total preprocessing time because the breakdown omits some smaller, less interesting parts of the pipeline.

Conclusions

With v0.16 of RAPIDS, it is now practical to perform every step of a complex ML pipeline on the GPU, resulting in significant speedups for feature engineering tasks. End-to-end GPU acceleration with cuML lets us iterate on preprocessing approaches more quickly and home in on exactly the right features to maximize performance or speed up convergence.

For a deeper dive into preprocessing with cuML for natural-language processing, check out our two previous posts on the subject. For a closer look at how one Kaggler has taken advantage of end-to-end GPU processing with RAPIDS, check out Louise Ferbach’s recent article on using RAPIDS for the ongoing Mechanisms of Action competition.

If you like what you’ve seen here, you can easily install RAPIDS via conda or Docker and try it out on your own data. If you run into any issues while these preprocessing features make their way out of experimental, please do let us know via the cuML issue tracker. You can also get in touch with us via Slack, Google Groups, or Twitter. We always appreciate user feedback.
