Input and Output Configurability in RAPIDS cuML

An example optimized cuML workflow

The RAPIDS machine learning library, cuML, supports several types of input data formats while attempting to return results in the output format that fits best into users’ workflows. The RAPIDS team has added functionality to cuML to support diverse types of users:

Maximize Compatibility: For users with existing workflows based on NumPy, Scikit-learn, and other traditional PyData libraries, cuML’s default behavior of accepting as many formats as possible, together with its Scikit-learn based API design, allows parts of these workflows to be ported with very little effort and no disruption. You can use NumPy arrays for input and get back NumPy arrays as output, exactly as you expect, just much faster.

Maximize Performance: For users who want ultimate performance by keeping everything in the GPU’s memory, cuML’s use of open-source standards and its configurable behavior make it possible to achieve maximum performance with low effort.

This blog will go into the details of how users can leverage this work to get the most benefits from cuML and GPUs.

Note: cuML release 0.13 includes all the needed components, and the following models have the input and output configurability functionality enabled: KMeans, DBSCAN, OLS, Ridge and Lasso Regression. In the upcoming cuML 0.14 release, all models will have this functionality enabled. Multinode, multiGPU models have some differences that are discussed briefly in the last section of this blog.

Compatible Input Formats: The Wonders of the CUDA Array Interface

Thanks in great part to the __cuda_array_interface__, referred to as CAI, cuML accepts a multitude of data formats:

This list is constantly expanding based on user demand. For example, the cuML team is working on direct support for the dlpack array standard, which coincides nicely with TensorFlow’s new support for it. In the meantime, dlpack data can be passed to cuML by going through either cuDF or CuPy, both of which already have dlpack support. If you have a specific data format that is not currently supported, please submit an issue or pull request on GitHub.
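To make the CAI a little less abstract, here is a minimal sketch of the metadata a consumer such as cuML reads from a CAI-compliant object. The `__cuda_array_interface__` attribute is a plain dictionary that describes a block of device memory. `FakeDeviceArray` and `describe_cai` are hypothetical names used only for illustration, and the pointer is a placeholder rather than real GPU memory, so the snippet runs without a GPU.

```python
import numpy as np


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array (CuPy, Numba, PyTorch, ...)."""

    def __init__(self, shape, typestr):
        self._shape = tuple(shape)
        self._typestr = typestr

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self._shape,      # dimensions of the array
            "typestr": self._typestr,  # NumPy-style dtype string, e.g. "<f4"
            "data": (0, False),        # (device pointer, read_only); 0 is a placeholder
            "version": 2,              # version of the interface being exported
        }


def describe_cai(obj):
    """Summarize the metadata a CAI consumer needs, without touching the data."""
    cai = obj.__cuda_array_interface__
    dtype = np.dtype(cai["typestr"])
    n_items = int(np.prod(cai["shape"]))
    return {
        "shape": cai["shape"],
        "dtype": dtype.name,
        "nbytes": n_items * dtype.itemsize,
    }


summary = describe_cai(FakeDeviceArray(shape=(1000, 4), typestr="<f4"))
print(summary)
```

Because every CAI producer exports the same small dictionary, a consumer does not need to know which library created the array.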

Default Behavior: How Does cuML Work Out of the Box?

cuML’s default behavior is designed to mirror the input as much as possible. For example, if you are doing your ETL in cuDF, which is very typical for RAPIDS users, you would see something like:
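The original code embed is not reproduced here, so the following is only a sketch of the idea, assuming a working RAPIDS install. The data values and the `eps` and `min_samples` settings are made up for illustration, and the import is guarded so the snippet is skipped on machines without cuML.

```python
try:
    import cudf
    from cuml import DBSCAN
except ImportError:  # RAPIDS is not installed (it needs an NVIDIA GPU)
    cudf = DBSCAN = None

result_type = None
if DBSCAN is not None:
    # Two tight groups of points plus one outlier; the values are made up.
    X = cudf.DataFrame(
        {
            "x": [1.0, 2.0, 2.0, 8.0, 8.0, 25.0],
            "y": [2.0, 2.0, 3.0, 7.0, 8.0, 80.0],
        }
    ).astype("float32")

    dbscan = DBSCAN(eps=3.0, min_samples=2)
    dbscan.fit(X)

    # With cuDF input and default settings, the labels mirror the input library.
    result_type = type(dbscan.labels_).__name__
    print(result_type)
```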

Default input format type mirroring behavior of cuML

When you use cuDF DataFrames, cuML gives you back cuDF objects (in this case a Series) as a result. But, as mentioned above, cuML also allows you to use NumPy arrays without changing the cuML call:
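As a sketch under the same assumptions (made-up data and parameters, guarded import), the only change from the cuDF case is the container that holds the data:

```python
import numpy as np

try:
    from cuml import DBSCAN
except ImportError:  # RAPIDS is not installed (it needs an NVIDIA GPU)
    DBSCAN = None

# Same made-up points as before, this time in host memory as a NumPy array.
X = np.array(
    [[1.0, 2.0], [2.0, 2.0], [2.0, 3.0], [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]],
    dtype=np.float32,
)

result_type = None
if DBSCAN is not None:
    dbscan = DBSCAN(eps=3.0, min_samples=2)
    dbscan.fit(X)  # the cuML call itself is unchanged

    # With NumPy input and default settings, the labels come back as NumPy.
    result_type = type(dbscan.labels_).__name__
    print(result_type)
```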

Default input format type mirroring behavior of cuML with NumPy arrays

In this case, cuML returns the results as NumPy arrays. Mirroring the input data type format is the default behavior of cuML, and in general, the behavior is:

Figure 1: List of acceptable input formats and default output behavior

This list is constantly growing, so expect to see things like dlpack compatible libraries in that table in the near future.

Configurability: How Do I Make cuML Behave My Way?

cuML allows users to configure output types globally. For example, if your ETL and Machine Learning workflow is GPU-based, but you rely on a NumPy based visualization framework, try this:
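A minimal sketch of that scenario, assuming a working RAPIDS install with CuPy available; the data and model parameters are made up, and the import is guarded so the snippet is skipped without cuML:

```python
import numpy as np

try:
    import cupy as cp
    import cuml
except ImportError:  # RAPIDS is not installed (it needs an NVIDIA GPU)
    cp = cuml = None

X_host = np.array(
    [[1.0, 2.0], [2.0, 2.0], [2.0, 3.0], [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]],
    dtype=np.float32,
)

result_type = None
if cuml is not None:
    # From here on, cuML models return NumPy arrays regardless of input type.
    cuml.set_global_output_type("numpy")

    X_gpu = cp.asarray(X_host)  # stand-in for data produced by GPU-based ETL
    labels = cuml.DBSCAN(eps=3.0, min_samples=2).fit_predict(X_gpu)

    # The input lived on the GPU, but the labels are ready for a NumPy-based plot.
    result_type = type(labels).__name__
    print(result_type)
```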

Usage of cuML’s `set_global_output_type`

Calling `set_global_output_type` affects all subsequent calls to cuML.

In case users want finer-grained control (for example, your models are processed by GPU libraries, but only one model needs to return NumPy arrays for your specialized visualization), the following mechanisms are available:

1. cuML’s context manager `using_output_type`:

Usage of cuML’s context manager `using_output_type`

2. Setting the output type of individual models:
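Both mechanisms can be sketched as follows, again assuming a working RAPIDS install with CuPy available. The data and model parameters are made up, and the import is guarded so the snippet is skipped without cuML.

```python
import numpy as np

try:
    import cupy as cp
    import cuml
except ImportError:  # RAPIDS is not installed (it needs an NVIDIA GPU)
    cp = cuml = None

X_host = np.array(
    [[1.0, 2.0], [2.0, 2.0], [2.0, 3.0], [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]],
    dtype=np.float32,
)

output_types = {}
if cuml is not None:
    X_gpu = cp.asarray(X_host)

    # 1. Context manager: only models built and used inside the block are affected.
    with cuml.using_output_type("numpy"):
        labels = cuml.DBSCAN(eps=3.0, min_samples=2).fit_predict(X_gpu)
    output_types["context manager"] = type(labels).__name__

    # 2. Per-model setting: only this one model returns NumPy arrays.
    kmeans = cuml.KMeans(n_clusters=2, output_type="numpy")
    kmeans.fit(X_gpu)
    output_types["per model"] = type(kmeans.cluster_centers_).__name__

    print(output_types)
```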

This new functionality automatically converts data into convenient formats without the need for manual data conversion from multiple types.

Here are the rules that the models follow to understand what to return:
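The rule list itself is not reproduced here. As a hedged sketch of one reading of the precedence, assuming that a model's own `output_type` wins, then a global or context-manager setting, and that cuML otherwise mirrors the input: `resolve_output_type` is a hypothetical helper written for this post, not a cuML function.

```python
def resolve_output_type(model_output_type, global_output_type, input_type):
    """Pick the output format under the assumed precedence.

    Assumption: model setting > global/context setting > mirror the input.
    Each argument is a format name such as "numpy", "cupy" or "cudf",
    or None when that level was not configured.
    """
    if model_output_type is not None:
        return model_output_type
    if global_output_type is not None:
        return global_output_type
    return input_type


# Nothing configured: the output mirrors the input.
print(resolve_output_type(None, None, "cudf"))
# A global (or context manager) setting overrides mirroring.
print(resolve_output_type(None, "numpy", "cudf"))
# A per-model setting overrides everything else.
print(resolve_output_type("cupy", "numpy", "cudf"))
```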

Efficiency: What Formats Should I Use?

Now that you know how to use cuML’s input and output configurability, the question is, what are the best formats to use? It will depend on your needs and priorities since all formats have trade-offs. Let’s consider a simple workflow:

Figure 2: Simple Data Science workflow using ML

Using NumPy Based Objects:

In Figure 3 below, the transfers (pink boxes) limit the amount of speedup that cuML can give you, since the data lives in the slower system memory and has to travel over the PCI Express bus. Every time you use a NumPy array as input to a model, or ask a model to give you back NumPy arrays, there is at least one memory transfer between main system memory and the GPU.

At first glance, one might imagine that this has little impact. Yet keeping data on the GPU as much as possible is one of the biggest reasons, if not the biggest, that RAPIDS achieves its lightning speed.
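A back-of-the-envelope calculation gives a feel for the gap. The bandwidth figures below are assumptions: nominal peak values for a PCIe 3.0 x16 link and for the device memory of a Tesla V100, not measurements, and real transfers are slower than peak. The dataset size is also made up.

```python
# Assumed nominal peak bandwidths, in bytes per second (not measured values).
PCIE3_X16_BYTES_PER_S = 15.75e9  # PCIe 3.0 x16, one direction
HBM2_V100_BYTES_PER_S = 900e9    # Tesla V100 device memory


def seconds_to_move(n_bytes, bytes_per_second):
    """Lower bound on the time needed to move n_bytes at a given bandwidth."""
    return n_bytes / bytes_per_second


# Made-up dataset: 10 million rows x 100 float32 features = 4 GB.
dataset_bytes = 10_000_000 * 100 * 4

over_pcie = seconds_to_move(dataset_bytes, PCIE3_X16_BYTES_PER_S)
in_device = seconds_to_move(dataset_bytes, HBM2_V100_BYTES_PER_S)
ratio = over_pcie / in_device

print(f"{over_pcie:.3f} s over PCIe vs {in_device:.4f} s within GPU memory")
print(f"the host transfer is roughly {ratio:.0f}x slower")
```

Under these assumptions a single host-to-device copy costs more than fifty passes over the same data inside GPU memory, which is why a round trip through NumPy can dominate a fast algorithm.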

Figure 3: Workflow to illustrate what happens when using NumPy arrays for input or output

Using cuDF Objects:

Using GPU objects as opposed to NumPy arrays has significant implications. For example, the use of cuDF objects is illustrated in Figure 4 below, where the orange boxes represent conversions that happen entirely in the fast GPU memory. Unfortunately, this means an extra copy of the data is made during the cuML algorithm processing, which can limit the size of the dataset that can be processed on a particular GPU.
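A CPU-side analogy helps show where the extra copy comes from. The sketch below uses only NumPy and assumes a simplified picture: a dataframe keeps each column in its own buffer, while an algorithm wants one contiguous 2D block. The dictionary of arrays is a stand-in for a dataframe, not how cuDF is implemented.

```python
import numpy as np

# Stand-in for a dataframe: every column lives in its own separate buffer.
columns = {
    "x": np.arange(4, dtype=np.float32),
    "y": np.arange(4, 8, dtype=np.float32),
}

# Building the single 2D block an algorithm expects allocates new memory.
matrix = np.column_stack(list(columns.values()))
copy_was_made = not any(np.shares_memory(matrix, col) for col in columns.values())

# During the conversion both representations exist at once.
column_bytes = sum(col.nbytes for col in columns.values())
peak_bytes = column_bytes + matrix.nbytes

# By contrast, taking a view of an existing array block costs no extra memory.
view = matrix[:, 0]
view_is_free = np.shares_memory(matrix, view)

print(copy_was_made, view_is_free, column_bytes, peak_bytes)
```

The same reasoning applies on the GPU: while the converted copy exists, the dataset occupies roughly twice its size in device memory.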

Figure 4: Workflow illustrating conversions occurring in GPU memory

DataFrames (and Series) are very powerful objects that allow users to do ETL in an approachable and familiar manner. But to offer this functionality, they have to be significantly complex structures.

A few examples of this are:

However, these constraints present some difficulties for some analytics workflows:

Using Device Arrays

Figure 5 below illustrates how using CAI arrays for input or output has the lowest overhead for processing data in cuML. With the CAI, neither memory transfers nor conversions occur. cuML uses the attributes of the CAI directly to access the data and then returns a CAI array. There is virtually no overhead for these formats. Device arrays, such as those from CuPy or Numba, are significantly simpler structures than the DataFrame/Series equivalents. Similar to NumPy, they are designed to be contiguous blocks of memory that are described by metadata. This design decision is why NumPy was revolutionary for the original Python ecosystem. Given all of this, it shouldn’t be a surprise that device arrays are the most efficient way of using cuML!

As mentioned above, all CAI arrays are essentially the same from cuML’s perspective, so your workflows can combine functions from Numba, CuPy, cuML, etc. without needing to do expensive memory copying operations.

Figure 5: Workflow illustrating how using CAI arrays for input or output has the lowest overhead for processing data in cuML

Tips for Selecting Data Types:

So what data type should you use? As mentioned before, it depends on the scenario, but here are a few suggestions:

With all of these tips, you can configure cuML to suit your needs and better estimate the impacts and bottlenecks of your workflows. Your new workflow may now look something like:

Figure 6: Optimized workflow in cuML by the user

What’s Next?

Here are some active areas we are excited to share in upcoming posts:

MultiNode MultiGPU (MNMG) cuML: There is much additional work being done. Many engineers on the RAPIDS cuML team are currently building MultiNode MultiGPU (MNMG) implementations of leading algorithms to enable distributed machine learning at scale. Distributed data is an entire topic by itself, with more posts coming soon. But as of version 0.13, MNMG cuML accepts Dask-cuDF objects (the distributed equivalent of cuDF using Dask) and CuPy backed Dask Arrays. In MNMG algorithms, cuML produces results that mirror the input you use, similar to the default behavior of cuML for a single GPU. We are working on adding more configurability to the MNMG cuML algorithms. We will talk about how the way your data is distributed, and the formats you use, impact cuML.

Lower-level details about your data and its implications: Many details, like datatypes or the ordering of the data in memory, can affect cuML. We will talk about how those details affect cuML, and how it compares with and differs from traditional PyData libraries.

Abstractions and design: Recently introduced abstractions and mechanisms in the RAPIDS software stack, like the CumlArray, allow cuML to provide this functionality while reducing code complexity and the number of tests needed to guarantee results. We will talk about how this, alongside the CAI, gives users the ability to use multiple libraries like CuPy, cuDF, cuML together with very little effort.


This blog discussed the input and output configurability capabilities of cuML, the different data formats supported, and the advantages and disadvantages of each format in cuML. The blog shows how easy it is to adopt cuML into existing workflows. cuML’s scikit-learn API and output mirroring of formats allow you to use it as a drop-in replacement for existing libraries. To extract the maximum performance, users should use GPU-specific formats as much as possible, ideally CAI arrays like those from CuPy or Numba. The RAPIDS team is working on improving cuML’s capabilities and supported data formats. If you have an interest in some particular format or some functionality that would improve cuML for your use-cases, raise an issue in the cuML GitHub repository, or come chat with the team in the RAPIDS Slack channel.

About the Author


Dante Gama Dessavre is a Senior Data Scientist in the RAPIDS team at NVIDIA. His focus has been on full-stack and interoperability engineering of RAPIDS tools and their interactions with the Python Data Science ecosystem. Prior to joining NVIDIA, Dante pursued his Ph.D. at Stevens Institute of Technology, developing mathematical and visualization tools for extracting narratives from text. He also has prior experience in cybersecurity from an internship at the AT&T Security Research Lab, and holds an MSc in High-Performance Computing from the University of Edinburgh, where he researched GPU acceleration of texture-based image segmentation algorithms for cancer tumor detection.
