The RAPIDS machine learning library, cuML, supports several types of input data formats while attempting to return results in the output format that fits best into users’ workflows. The RAPIDS team has added functionality to cuML to support diverse types of users:
- Maximize compatibility: For users with existing workflows built on NumPy, Scikit-learn, and other traditional PyData libraries, cuML's default behavior of accepting as many formats as possible, together with its Scikit-learn based API design, allows porting parts of these workflows with very little effort and no disruption. You can use NumPy arrays for input and get back NumPy arrays as output, exactly as you expect, just much faster.
- Maximize performance: For users who want ultimate performance by keeping everything in GPU memory, cuML's use of open-source standards and its configurable behavior allow maximum performance with low effort.
This blog will go into the details of how users can leverage this work to get the most benefits from cuML and GPUs.
Note: cuML release 0.13 includes all the needed components, and the following models have the input and output configurability functionality enabled: KMeans, DBSCAN, OLS, Ridge, and Lasso Regression. In the upcoming cuML 0.14 release, all models will have this functionality enabled. Multi-node, multi-GPU models have some differences that are discussed briefly in the last section of this blog.
Compatible Input Formats: The Wonders of the CUDA Array Interface
Thanks in large part to `__cuda_array_interface__`, referred to as CAI, cuML accepts a multitude of data formats:
- cuDF objects (DataFrames and Series)
- NumPy arrays
- CuPy and Numba device arrays
- As of cuML 0.14: any CAI-compliant object, such as PyTorch tensors and CuPy arrays. This group is referred to as CAI arrays.
This list is constantly expanding based on user demand. For example, the cuML team is working on direct support for the dlpack array standard, which coincides nicely with TensorFlow's new support for it. This can also be done today by going through either cuDF or CuPy, which both have dlpack support. If you have a specific data format that is not currently supported, please submit an issue or pull request on GitHub.
Default Behavior: How Does cuML Work Out of the Box?
cuML’s default behavior is designed to mirror the input as much as possible. For example, if you are doing your ETL in cuDF, which is very typical for RAPIDS users, you would see something like:
When you use cuDF DataFrames, cuML gives you back cuDF objects (in this case a Series) as a result. But, as mentioned above, cuML also allows you to use NumPy arrays without changing the cuML call:
In this case, cuML now gives back the results as NumPy arrays. Mirroring the format of the input data is the default behavior of cuML: cuDF objects give cuDF results, NumPy arrays give NumPy results, and CuPy, Numba, and other CAI arrays give CuPy-backed results. The list of supported formats is constantly growing, so expect to see dlpack compatible libraries among them in the near future.
Configurability: How Do I Make cuML Behave My Way?
cuML allows users to configure output types globally. For example, if your ETL and Machine Learning workflow is GPU-based, but you rely on a NumPy based visualization framework, try this:
The `set_global_output_type` instruction affects all subsequent calls to cuML.
If you want finer-grained control (for example, most of your models feed other GPU libraries, but a single model's results need to be NumPy arrays for a specialized visualization), the following mechanisms are available:
1. cuML’s context manager
2. Setting the output type of individual models
This new functionality automatically converts data into convenient formats without the need for manual data conversion from multiple types.
Here are the rules that the models follow to understand what to return:
- If `output_type` was specified when building the model, for example `cuml.KMeans(n_clusters=2, output_type='numpy')`, then it will give results in that type.
- If the model was built inside a `cuml.using_output_type` context manager, then the model uses the output_type of that context.
- If the output_type was set using `set_global_output_type`, then it will give back results of that type.
- If none of the above are specified, then the model will mirror the type of the objects used for input, as described in the default behavior section.
Efficiency: What Formats Should I Use?
Now that you know how to use cuML's input and output configurability, the question is: what are the best formats to use? It depends on your needs and priorities, since all formats have trade-offs. Let's consider a simple workflow:
Using NumPy Based Objects:
In Figure 3 below, the transfers (pink boxes) limit the amount of speedup that cuML can give you, since the communications use the slower system memory and you have to go through the PCI express bus. Every time you use a NumPy array as input to a model, or ask a model to give you back NumPy arrays, there is at least one memory transfer between main system memory and the GPU.
At first glance, one might imagine that this doesn't have much impact. Yet keeping data in the GPU as much as possible is one of the biggest reasons, if not the biggest reason, that RAPIDS achieves its lightning speed.
Using cuDF Objects:
Using GPU objects as opposed to NumPy arrays has significant implications. For example, when using cuDF objects, as illustrated in Figure 4 below, the orange boxes represent conversions that happen entirely in fast GPU memory. Unfortunately, this means an extra copy of the data is made during cuML algorithm processing, which can limit the size of the dataset that can be processed on a particular GPU.
DataFrames (and Series) are very powerful objects that allow users to do ETL in an approachable and familiar manner. But to offer this, they have to be complex structures, carrying significant machinery to enable that functionality.
A few examples of this are:
- Every column can have, besides its data, a bitmask array (essentially an added array of zeros and ones) that allows users to have missing entries in their data.
- Due to the flexibility that DataFrames need to provide for adding/removing rows and columns, columns might be far away from each other in memory.
- And of course, there are added structures for things like indexes and column names.
However, these characteristics present difficulties for some analytics workflows:
- First, many algorithms work significantly better when all your data is contiguous, i.e., all the bytes are grouped together in the same memory region, since accessing memory efficiently is a huge component of processing data fast (particularly for GPUs!).
- Memory is a limited resource (in general, but even more so for GPUs and accelerators), so the added overheads can have a very significant impact.
Using Device Arrays:
Figure 5 below illustrates how using CAI arrays for input or output has the lowest overhead for processing data in cuML. By using the CAI, no memory transfers or conversions occur: cuML uses the attributes of the CAI directly to access the data, and then returns a CAI array. There is virtually no overhead for these formats. Device arrays, such as those from CuPy or Numba, are significantly simpler structures than their DataFrame/Series equivalents. Like NumPy arrays, they are designed to be contiguous blocks of memory described by metadata; this design decision is why NumPy was revolutionary for the original Python ecosystem. Given all of this, it shouldn't be a surprise that device arrays are the most efficient way of using cuML!
As mentioned above, all CAI arrays are essentially the same from cuML's perspective, so your workflows can combine functions from Numba, CuPy, cuML, etc., without needing expensive memory copying operations.
Tips for Selecting Data Types:
So what data type should you use? As mentioned before, it depends on the scenario, but here are a few suggestions:
- If you have an existing PyData workflow, take advantage of cuML’s NumPy functionality to try different models piece by piece. Start by accelerating the slowest parts of your workflows. DBSCAN and UMAP are great examples of models in cuML that even when used by themselves, without full RAPIDS acceleration, provide huge speedups and improvements.
  - Potential pitfall: This could create a communication bottleneck between the main system memory and the GPU memory.
- If your workflow is very ETL-heavy with lots and lots of cuDF work, where the bulk of the processing and development time is in data loading or transformation, keep things as cuDF objects and let cuML manage conversions.
  - Potential pitfall: This might limit how much data you can fit for a single model in a GPU.
- If ultimate speed of training or inference is key, then adapt your workflow to use CUDA Array Interface libraries as much as possible.
With all of these tips, you can configure cuML to optimize your needs as well as better estimate the impacts and bottlenecks of workflows. Your new workflow may now look something like:
Here are some active areas we are excited to share in upcoming posts:
MultiNode MultiGPU (MNMG) cuML: There is much additional work being done. Many engineers on the RAPIDS cuML team are currently building MultiNode MultiGPU (MNMG) implementations of leading algorithms to enable distributed machine learning at scale. Distributed data is an entire topic by itself, with more posts coming soon. But as of version 0.13, MNMG cuML accepts Dask-cuDF objects (the distributed equivalent of cuDF using Dask) and CuPy-backed Dask arrays. In MNMG algorithms, cuML produces results that mirror the input you use, similar to cuML's single-GPU default behavior. We are working on adding more configurability to the MNMG cuML algorithms, and we will talk about how the way your data is distributed, and the formats you use, impact cuML.
Lower-level details about your data and their implications: Many details, like datatypes or the ordering of data in memory, can affect cuML. We will talk about how those details affect cuML, and how this compares to and differs from traditional PyData libraries.
Abstractions and design: Recently introduced abstractions and mechanisms in the RAPIDS software stack, like the CumlArray, allow cuML to provide this functionality while reducing code complexity and the number of tests needed to guarantee results. We will talk about how this, alongside the CAI, gives users the ability to use multiple libraries like CuPy, cuDF, cuML together with very little effort.
This blog discussed the input and output configurability capabilities of cuML, the different data formats supported, and the advantages and disadvantages of each format in cuML. The blog shows how easy it is to adopt cuML into existing workflows: cuML's scikit-learn based API and output format mirroring allow you to use it as a drop-in replacement for existing libraries. To extract maximum performance, users should use GPU-specific formats as much as possible, particularly CAI arrays such as CuPy or Numba device arrays. The RAPIDS team is working on improving cuML's capabilities and supported data formats. If you are interested in a particular format, or in functionality that would improve cuML for your use cases, raise an issue in the cuML GitHub repository, or come chat with the team in the RAPIDS Slack channel.
About the Author
Dante Gama Dessavre is a Senior Data Scientist on the RAPIDS team at NVIDIA. His focus has been on full-stack and interoperability engineering of the RAPIDS tools and their interactions with the Python data science ecosystem. Prior to joining NVIDIA, Dante pursued his Ph.D. at Stevens Institute of Technology, developing mathematical and visualization tools for extracting narratives from text. He also has prior experience in cybersecurity from an internship at the AT&T Security Research Lab, and holds an MSc in High Performance Computing from the University of Edinburgh, where he researched GPU acceleration of texture-based image segmentation algorithms for cancer tumor detection.