Efficient Data Sharing between CuPy and RAPIDS

John Kirkham
Published in RAPIDS AI · 4 min read · May 8, 2020

Coauthored by John Kirkham and Mads R. B. Kristensen

In Python workflows, it’s common to use multiple libraries based on their strengths. For example, you might start with the RAPIDS GPU Dataframe library, cuDF, to load data from disk and do some preprocessing before handing it off to a machine learning library like RAPIDS cuML to train a classifier or apply an existing one. While it is natural to work with Dataframes and Series in cuDF, many machine learning algorithms think about data in vectors and matrices, or more generally, arrays. Just as it is common to move back and forth between Pandas DataFrame and NumPy objects on CPUs, especially when handing data to scikit-learn, in the GPU world, we also want to convert between cuDF objects and on-device arrays when passing data to cuML.

Benefits of CUDA Array Interface

A lot of engineering work has gone on under the hood to make this process as seamless as possible. In particular, the __cuda_array_interface__ protocol lets libraries share device memory directly, without copying data between them.
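
For example, every CuPy array carries this attribute. It is a plain dictionary describing the underlying device buffer, which any conforming library can consume without a copy:

import cupy

x = cupy.arange(5)

# The interface describes the device memory: a raw pointer, a shape,
# and a dtype string, among other fields.
iface = x.__cuda_array_interface__
print(iface["shape"], iface["typestr"], iface["data"])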

To handle on-device arrays in Python, we use CuPy, a popular NumPy-compatible CUDA library that supports __cuda_array_interface__ and plays well with cuDF. Data scientists can now move between cuDF and CuPy without paying the price of a cudaMemcpy, which avoids doubling the memory footprint and also improves performance.
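
For instance, a cuDF column can be handed to CuPy, and back, without a device-to-device copy. A minimal sketch, assuming recent versions of both libraries:

import cudf
import cupy

s = cudf.Series([1, 2, 3, 4, 5])

# cupy.asarray consumes the Series’ __cuda_array_interface__, so the
# resulting array views the same device memory rather than copying it.
arr = cupy.asarray(s)

# The reverse direction works the same way: cuDF reads the interface
# that the CuPy array exposes.
s2 = cudf.Series(arr)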

Benefits of RAPIDS Memory Manager

However, a challenge emerges when users want to allocate new GPU memory across multiple libraries. Because device memory allocations are a common bottleneck in GPU-accelerated code, most libraries introduce a custom allocator to reuse allocations. Two libraries can end up competing for memory on the GPU as each one’s allocator consumes more of the available GPU memory. RAPIDS solved this problem by creating a library called Rapids Memory Manager (RMM), which dynamically manages memory allocations. This way, when using cuDF and cuML, users can rely on RMM to efficiently manage the memory used by both libraries under the hood.
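
As a concrete sketch, assuming RMM’s reinitialize API, a workflow can set up a shared memory pool once and let all RMM-backed allocations draw from it:

import rmm

# Replace the default allocator with a pool that reserves GPU memory up
# front; later allocations are carved out of this pool instead of each
# paying for a separate cudaMalloc.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)  # 1 GiB

# DeviceBuffer allocations now come out of the shared pool.
buf = rmm.DeviceBuffer(size=1024)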

How to configure CuPy to use RMM

CuPy supplies its own allocator, and we want to ensure that applications using both CuPy and cuDF can share memory effectively without the two allocators competing. How do we make sure they don’t conflict? It turns out CuPy can use an external allocator instead. So we wrote a small amount of code that lets users plug RMM into CuPy, extending RMM beyond RAPIDS and improving ecosystem interoperability.

At the core, we provide a function, rmm_cupy_allocator, which allocates a DeviceBuffer (roughly a bytearray on the GPU), wraps it in a CuPy UnownedMemory object, and returns a MemoryPointer to that memory. The code is shown below:

from rmm import _lib as librmm  # RMM’s internal module providing DeviceBuffer

try:
    import cupy
except Exception:
    cupy = None


def rmm_cupy_allocator(nbytes):
    """
    A CuPy allocator that makes use of RMM.

    Examples
    --------
    >>> import rmm
    >>> import cupy
    >>> cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
    """
    if cupy is None:
        raise ModuleNotFoundError("No module named 'cupy'")

    # Allocate the raw device memory through RMM.
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
    # CuPy can derive the device from a non-null pointer, so -1 suffices;
    # only query the current device id for an empty (null) allocation.
    dev_id = -1 if buf.ptr else cupy.cuda.device.get_device_id()
    # Passing owner=buf keeps the DeviceBuffer alive for as long as CuPy
    # holds a reference to this memory.
    mem = cupy.cuda.UnownedMemory(
        ptr=buf.ptr, size=buf.size, owner=buf, device_id=dev_id
    )
    ptr = cupy.cuda.memory.MemoryPointer(mem, 0)
    return ptr

This makes it easy to tell CuPy to use the RMM allocator for all device allocations, just by calling cupy.cuda.set_allocator(rmm.rmm_cupy_allocator) in your code. Alternatively, if you prefer to do this within a limited scope, you can use the context manager cupy.cuda.using_allocator.

Examples

Here’s how to use cupy.cuda.set_allocator:

import cupy
import rmm
# Using CuPy’s default allocator
a1 = cupy.arange(5)
# Switch to RMM allocator
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
# Using RMM
a2 = cupy.arange(5)

Here’s how to use cupy.cuda.using_allocator:

import cupy
import rmm
# Using CuPy’s default allocator
a1 = cupy.arange(5)
# Use RMM allocator in this block
with cupy.cuda.using_allocator(rmm.rmm_cupy_allocator):
    # Using RMM
    a2 = cupy.arange(5)
a3 = cupy.arange(5)  # Use CuPy’s default allocator

Implementation Suggestions

Those examples are pretty straightforward. So, how do you decide which to use? If you have a global allocator that you want to apply to a whole workflow or an opinionated library, you may choose set_allocator to make that choice once at the global level. Alternatively, if you need a specific allocator in a function or block of code, you may choose using_allocator to ensure that function or block uses the specified allocator without disrupting global settings.
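
For instance, a library that wants RMM for its own internals without touching the caller’s global settings might wrap its work in the context manager. A hypothetical sketch (compute_on_gpu is illustrative, not a real API):

import cupy
import rmm

def compute_on_gpu(data):
    # Hypothetical library function: all CuPy allocations inside this
    # block go through RMM, while the caller’s global allocator is
    # left untouched.
    with cupy.cuda.using_allocator(rmm.rmm_cupy_allocator):
        gpu_data = cupy.asarray(data)  # allocated via RMM
        return cupy.sqrt(gpu_data)     # intermediates also use RMM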

Conclusion

Tying this back to RAPIDS, we now have a way to ensure CuPy allocations made by cuDF or cuML also use RMM under the hood. In the recent RAPIDS 0.13 release, cuDF calls set_allocator on import to ensure RMM is automatically used for any CuPy allocations. This enables smoother interoperability between CuPy and RAPIDS. Hopefully, it also gives other libraries thinking about how to simplify memory management some ideas about how that may be accomplished.

About the Authors

John Kirkham is a longtime Python enthusiast with a background in Physics. He contributes to open-source projects like conda-forge, Dask, and RAPIDS.


Mads R. B. Kristensen is a senior software engineer in NVIDIA’s RAPIDS group with a focus on distributed GPU computing. Before joining NVIDIA, he was an Assistant Professor at the University of Copenhagen.

