PyTorch + Rapids RMM: Maximize the Memory Efficiency of your Workflows

Ashwin Srinath · RAPIDS AI · Jan 12, 2023

You can make beautiful things with an efficient torch.

GPU-accelerated machine learning is producing fascinating results across a wide range of fields on a seemingly daily basis. While neural networks are usually the focus of attention, data pre-processing and preparation are just as important and often a bottleneck in machine learning pipelines. RAPIDS libraries like cuDF and cuML can accelerate these stages using the GPU.

However, a major pain point of using RAPIDS with deep learning frameworks has been the need to tune GPU memory usage separately for each library. In this post, we’ll show you how to use RAPIDS’ new PyTorch memory allocator and how to share GPU memory effectively between RAPIDS and PyTorch, making your machine learning pipelines much faster and more memory efficient!

Motivation

Today, using RAPIDS libraries such as cuDF and PyTorch together on the GPU can lead to unexpected out-of-memory errors. This is because cuDF and PyTorch allocate memory in separate “memory pools”: cuDF uses a memory pool via the RAPIDS Memory Manager (RMM), while PyTorch uses an internal caching memory allocator. While it is possible to move data back and forth between cuDF and PyTorch without copying, you still need to configure each library to use half of the available GPU memory. This leaves a lot of GPU memory unused as machine learning workflows are staged: ETL steps first, feature engineering next, and finally training and inference. Each of these stages can only use half of the GPU’s memory.
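
For reference, the interchange itself is straightforward. Here is a minimal sketch of moving data between the two libraries via DLPack (the exact calls are illustrative and assume recent cuDF and PyTorch releases):

>>> import cudf, torch
>>> s = cudf.Series([1.0, 2.0, 3.0])                        # data lives on the GPU, owned by cuDF
>>> t = torch.utils.dlpack.from_dlpack(s.to_dlpack())       # hand the device buffer to PyTorch
>>> s2 = cudf.from_dlpack(torch.utils.dlpack.to_dlpack(t))  # and back to cuDF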

Beginning with RAPIDS 23.02, you can configure PyTorch to use RMM for GPU memory allocation, via the RMM PyTorch Allocator. This means that you can use RAPIDS libraries and PyTorch in the same pipeline, without worrying about configuring memory separately for each library.

How To Use It

Note: To run the code examples in this blog post, you need at least RMM version 23.02. At the time of writing, that is the nightly version of RMM. Please visit our Getting Started page for installation instructions.

Instructing PyTorch to use RMM is easy:

>>> import torch
>>> from rmm.allocators.torch import rmm_torch_allocator
>>> torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

To see the allocator in action, let’s start by creating a pool of memory on the GPU with RMM:

>>> import rmm, pprint
>>> import cudf
>>> rmm.reinitialize(
...     pool_allocator=True,
...     initial_pool_size=int(10e9),
...     maximum_pool_size=int(11e9),
... )
>>>
>>> mr = rmm.mr.get_current_device_resource()
>>> stats_pool_memory_resource = rmm.mr.StatisticsResourceAdaptor(mr)
>>> rmm.mr.set_current_device_resource(stats_pool_memory_resource)

The above code creates a pool with an initial size of ~10 GB and a maximum size of 11 GB. By default, allocation tracking is off; here we enable it by wrapping the pool in a StatisticsResourceAdaptor, because we want to demonstrate that each allocation we make is owned by RMM. For example, let’s first create a cuDF Series occupying 16 bytes:

>>> s = cudf.Series([1, 2])
>>> pprint.pprint(stats_pool_memory_resource.allocation_counts)
{'current_bytes': 16,
 'current_count': 1,
 'peak_bytes': 16,
 'peak_count': 1,
 'total_bytes': 16,
 'total_count': 1}

allocation_counts confirms that we have made one allocation of 16 bytes. We can now use the same RMM pool with PyTorch by using the torch.cuda.memory.change_current_allocator function:

>>> import torch
>>> from rmm.allocators.torch import rmm_torch_allocator
>>> torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

Now, any time PyTorch allocates CUDA memory, it will use RMM:

>>> w = torch.tensor([1, 2, 3]).cuda()  # 24 bytes
>>> pprint.pprint(stats_pool_memory_resource.allocation_counts)
{'current_bytes': 40,
 'current_count': 2,
 'peak_bytes': 40,
 'peak_count': 2,
 'total_bytes': 40,
 'total_count': 2}

Querying stats_pool_memory_resource, we can see that there are now two allocations totalling 40 bytes (16 + 24) of memory. If we delete the cuDF Series we created before, RMM will reclaim the unused memory and we will be left with only 24 bytes of used CUDA memory:

>>> del s
>>> pprint.pprint(stats_pool_memory_resource.allocation_counts)
{'current_bytes': 24,
 'current_count': 1,
 'peak_bytes': 40,
 'peak_count': 2,
 'total_bytes': 40,
 'total_count': 2}
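
Deallocations flow through RMM as well: deleting the PyTorch tensor returns its 24 bytes to the pool, so current_bytes should drop back to 0 while the peak and total counters keep their history (a sketch, assuming PyTorch hands the block back to RMM as soon as the tensor is garbage collected):

>>> del w
>>> pprint.pprint(stats_pool_memory_resource.allocation_counts)  # expect current_bytes == 0, peak_bytes == 40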

Conclusion

In addition to PyTorch, you can already use RMM with several Python GPU libraries, including Numba, CuPy, XGBoost, and of course all of RAPIDS. The GPU ecosystem is both diverse and specialized. This is just one example of how we are working to make the GPU data science ecosystem work seamlessly across all of its components.
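
For instance, pointing CuPy and Numba at the same RMM resource takes only a couple of lines (a minimal sketch using the allocator modules that ship with RMM 23.02; consult the RMM documentation for your version):

>>> import cupy
>>> from rmm.allocators.cupy import rmm_cupy_allocator
>>> cupy.cuda.set_allocator(rmm_cupy_allocator)  # CuPy arrays now come from the RMM pool
>>>
>>> from numba import cuda
>>> from rmm.allocators.numba import RMMNumbaManager
>>> cuda.set_memory_manager(RMMNumbaManager)     # Numba device arrays use RMM as well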

The RAPIDS team consistently works with the open-source community to understand and address emerging needs. If you’re an open-source maintainer interested in bringing GPU acceleration to your project, please reach out on GitHub or Twitter @rapidsai! The RAPIDS team would love to learn how potential new algorithms or toolkits would impact your work.
