RAPIDS Memory Manager Pool: Speed up your memory allocations

Vibhu Jawa
Published in RAPIDS AI
3 min read · Dec 12, 2022

By: Vibhu Jawa and Randy Gelhausen

Go Faster with RMM

What is the RAPIDS Memory Manager (RMM)?

When you’re processing large amounts of data, memory management is crucial. That’s true for CPUs and GPUs alike. In the RAPIDS ecosystem, we created the open source “RAPIDS Memory Manager” (RMM) package to manage memory on the GPU throughout an end-to-end workflow.

Using the RMM Pool Allocator

In RAPIDS and PyData libraries (cuDF, cuML, cuGraph, CuPy, Numba), we often use RMM to create and share a single memory pool per GPU. Device memory allocations are relatively slow, so RMM performs a single, large allocation of memory “up front,” avoiding the latency of per-computation memory allocations mid-workflow.
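To see why pooling helps, here’s a small CPU-side sketch, pure Python, with no GPU or RMM required. `ToyPool` is a hypothetical illustration of the idea, not RMM’s actual implementation: one big allocation happens up front, and subsequent “allocations” just hand out slices of it.

```python
class ToyPool:
    """Toy pool allocator: one large up-front allocation,
    then cheap sub-allocations carved out of it."""

    def __init__(self, size):
        # The single, expensive "up front" allocation
        self.buffer = bytearray(size)
        self.offset = 0

    def allocate(self, nbytes):
        # No new system allocation here: just hand out a view
        # into the pre-allocated buffer and bump an offset.
        view = memoryview(self.buffer)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view


pool = ToyPool(100 * 1024 * 1024)  # allocate 100 MB once
chunk = pool.allocate(1024)        # later allocations are near-free
```

In the real RMM pool, the expensive step being amortized is `cudaMalloc`, which synchronizes the device; carving sub-allocations out of a pre-allocated pool avoids paying that cost on every column or intermediate result.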

Check out the single-GPU example below, where we see a 4.6x speedup just by switching on the RMM pool.

Without RMM POOL

import cudf

%%timeit
def create_dummy_frame(n_rows, n_val_columns):
    df = cudf.DataFrame({'group': ['A', 'B',
                                   'C', 'D',
                                   'E', 'F',
                                   'G', 'H']})
    # Here we create new columns;
    # each operation makes one or more
    # cudaMalloc calls
    for n in range(n_val_columns):
        df[f'val_{n}'] = [1, 2, 3, 4,
                          5, 6, 7, 8]
    df = df.repeat(n_rows // 8)
    return df

n_rows = 10_000_000
n_val_columns = 100
df = create_dummy_frame(n_rows, n_val_columns)
# Use the `groupby()` method to find the mean value for each group
mean_values = df.groupby('group').mean()

759 ms ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With RMM POOL

# Setting up the RMM pool
import rmm
rmm.reinitialize(pool_allocator=True,
                 initial_pool_size=10e9)
import cudf

# No code below this line has changed
%%timeit
def create_dummy_frame(n_rows, n_val_columns):
    df = cudf.DataFrame({'group': ['A', 'B',
                                   'C', 'D',
                                   'E', 'F',
                                   'G', 'H']})
    # Here we create new columns;
    # RMM has pre-allocated memory, so
    # cuDF uses memory that is readily available
    for n in range(n_val_columns):
        df[f'val_{n}'] = [1, 2, 3, 4,
                          5, 6, 7, 8]
    df = df.repeat(n_rows // 8)
    return df

n_rows = 10_000_000
n_val_columns = 100
df = create_dummy_frame(n_rows, n_val_columns)
# Use the `groupby()` method to find the mean value for each group
mean_values = df.groupby('group').mean()

168 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Want to use RMM in a cluster of GPUs? Dask-CUDA makes that easy. Here’s an example.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# RMM pool size to initialize each worker with.
# Can be an integer (bytes), a float (fraction of total device memory),
# a string (like "5GB" or "5000M"),
# or None to disable RMM pools.
cluster = LocalCUDACluster(rmm_pool_size='5GB')
client = Client(cluster)

In short, RMM helps RAPIDS manage GPU memory as efficiently as possible, but there’s much more to RMM than “it’s a pool of memory.” RMM lets you allocate device memory in a highly configurable way. (Check out rmm#available-resources for information about all the memory resources RMM can use.) Next week, we’ll walk you through how to manage spilling and use RMM on a cluster of GPUs. In the meantime, reach out to us on Twitter @RAPIDSAI, check the RMM API docs and our GitHub page, or kick the tires on RAPIDS over at Paperspace or SageMaker Studio.
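One source of that configurability is that RMM builds allocators out of composable “memory resources” that wrap one another (the pool itself is one such resource layered on a base CUDA resource). The pure-Python sketch below mimics that layering idea with hypothetical `BaseResource` and `LoggingResource` classes; it is an illustration of the wrapping pattern, not RMM’s actual API:

```python
class BaseResource:
    """Toy stand-in for a base allocator (think: raw device memory)."""

    def allocate(self, nbytes):
        return bytearray(nbytes)


class LoggingResource:
    """Toy wrapper resource: delegates every allocation to an
    upstream resource and records its size, mirroring the idea of
    stacking adaptors on top of a base allocator."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.log = []

    def allocate(self, nbytes):
        self.log.append(nbytes)          # observe the allocation
        return self.upstream.allocate(nbytes)  # then delegate upstream


# Compose resources by wrapping: logging layered over the base.
mr = LoggingResource(BaseResource())
buf = mr.allocate(256)
```

Because each layer only talks to its upstream resource, behaviors like pooling, logging, or limiting can be mixed and matched without changing the code that asks for memory.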

We recommend this excellent blog post by Mark Harris for a deep technical dive into the internals of the RMM allocator.
