Tackling Large Graphs with RAPIDS cuGraph and CUDA Unified Memory on GPUs
By: Alex Fender & Brad Rees
Out of Memory, @#$!!
The Out of Memory (OOM) error has to be one of the most frustrating errors to encounter. Unfortunately, when analyzing large graphs on GPUs using RAPIDS cuGraph, it’s one of the errors most often encountered. Migrating to a GPU with more memory or onto a multi-GPU environment is not always possible. So, what can be done?
The answer — leverage unified memory to transparently oversubscribe GPU memory.
In this blog, we discuss how to use unified memory and oversubscription to scale cuGraph’s single-GPU mode to datasets four times larger than what fits in available GPU memory. You might be worried about performance at this point, but fear not: we are going to show that you can run on bigger datasets while still achieving high throughput.
30-second Intro to Unified Memory and Oversubscription
Unified memory is a single memory address space accessible from any processor in a system. Data movement between host and device memory is handled automatically.
Pascal and later GPU architectures include hardware to accelerate unified memory data movement (paging) and allow unified memory to scale to the size of system memory rather than being limited to the amount of GPU memory. Oversubscription is simply the ability to allocate more GPU memory than is physically available on the device and have the GPU automatically page data in as needed. For example, with oversubscription, you could allocate 64 GB of space on a 16 GB GPU. How cool is that!
cuGraph accesses unified memory through the RAPIDS Memory Manager (RMM), which is the central place for all device memory allocations in RAPIDS libraries. Unified memory lifts the device memory size limit and expands the potential memory to that of the whole system (host + device memory).
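Turning this on from Python is essentially a one-liner through RMM. The sketch below is a minimal example; the exact keyword names (managed_memory, pool_allocator) may vary slightly between RMM releases, so check the version you have installed.

```python
import rmm

# Re-initialize RMM so that subsequent RAPIDS device allocations are backed
# by CUDA managed (unified) memory and can exceed physical GPU memory.
rmm.reinitialize(
    managed_memory=True,   # use cudaMallocManaged under the hood
    pool_allocator=False,  # a pool allocator can be layered on top if desired
)
```

Call this before creating any cuDF or cuGraph objects, since memory that was already allocated under the previous settings is not migrated retroactively.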
Now, unified memory is not a solution to every out-of-memory problem. If your dataset is larger than host plus GPU memory, then nothing here will help. Additionally, unified memory does impose a small performance penalty as data is paged in and out of the GPU over the PCIe bus. However, the benefit of not crashing with an OOM error far exceeds the minor slowdown in performance.
An NVIDIA GV100 GPU uses HBM2 memory with an internal memory bandwidth of 900 GB/s. The newest NVIDIA A100 GPU uses a faster HBM2 implementation that boosts bandwidth to 1,555 GB/s. That bandwidth allows graph algorithms to analyze millions of relationships in near real time. It also enables cuGraph to rapidly switch the data structure to best suit the needs of the analytic, rather than being restricted to a single data structure. While the GPU architecture provides high-speed memory access, it still has limited capacity (up to 32 GB on a GV100 and up to 48 GB on a Quadro RTX 8000). A 32 GB GPU typically constrains cuGraph to graphs of up to roughly 500 million edges. For many users, this just is not big enough.
Try It Out
To try this out at home, we have released a Jupyter notebook to get you started on a real-world example in a few minutes, so let’s jump right in and see how unified memory solves many of these out-of-memory errors. In this notebook, we analyze a Twitter dataset comprising:
- 1.47 billion social relations (edges)
- 41.7 million users (vertices)
- 26 GB on-disk file size
This Twitter dataset is well past the point of causing trouble on a single GPU, and therefore, a good test for oversubscription. We will load the data and then run the PageRank algorithm to identify the most influential profiles.
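The heart of the notebook boils down to a handful of calls. The sketch below captures the idea; the file name, separator, and column names are placeholders rather than the exact ones used in the notebook, and the Graph constructor arguments may differ slightly between cuGraph versions.

```python
import cudf
import cugraph

# Read the edge list straight into GPU memory (placeholder path and format).
edges = cudf.read_csv(
    "twitter_edges.csv",
    sep=" ",
    names=["src", "dst"],
    dtype=["int32", "int32"],
)

# Build a directed graph; cuGraph renumbers the vertex IDs internally.
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# PageRank returns a cuDF DataFrame with one score per vertex.
ranks = cugraph.pagerank(G, alpha=0.85, max_iter=100, tol=1e-5)
print(ranks.sort_values("pagerank", ascending=False).head(10))
```

With the RMM call from the previous section in place, the same code runs unchanged whether or not the graph fits in device memory.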
We ran the notebook on several GPUs of different memory sizes. The following table shows the output of the four timed code blocks across the various GPUs.
In this example, cuGraph’s PageRank takes 24 iterations and traverses the graph at a speed of over 8.7 billion traversed edges per second (8.7 GTEPS) on a workstation with a single V100, which is equivalent to 0.17 seconds per iteration. The P100, with half the memory and fewer cores, requires a lot more paging of data in and out but still achieves a respectable 0.25 seconds per iteration.
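As a quick sanity check, traversed edges per second is just the edge count divided by the time per iteration, so the throughput and per-iteration figures above are consistent:

```python
edges = 1.47e9                   # edges touched per PageRank iteration
v100_sec, p100_sec = 0.17, 0.25  # seconds per iteration reported above

print(edges / v100_sec / 1e9)  # ~8.6 GTEPS on the V100
print(edges / p100_sec / 1e9)  # ~5.9 GTEPS on the oversubscribed P100
```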
Unified memory and oversubscription can be applied to all features in cuGraph, and to RAPIDS in general, and we encourage users to try them out with other graph analytics.
A Closer Look at Memory Usage
Let’s take a look under the hood to better understand how memory is utilized. Remember that everything you do with data takes up memory. For this discussion, we assume the data is a simple graph dataset (three columns: source ID, destination ID, edge weight).
Within a typical RAPIDS workflow, the first step is to load the data into a DataFrame. That process, on average, requires about twice the data size in memory: half for the raw data and half for the extracted columns. If we had a more complex dataset, say cyber NetFlow records, we would then need to run several ETL steps, such as converting IP addresses to integers. Those steps would produce the additional columns needed to run graph analytics, but would also use up more memory.
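Applied to the Twitter dataset above, that rule of thumb already explains why ingestion alone pushes past a single GPU:

```python
# Rough arithmetic only: the ~2x rule of thumb applied to the 26 GB edge list.
on_disk_gb = 26
peak_load_gb = 2 * on_disk_gb   # raw bytes plus extracted columns

print(peak_load_gb)  # ~52 GB during load, well beyond a 32 GB V100
```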
cuGraph’s first operation after accepting an edge list is to transparently map the vertex IDs to a contiguous range of integers. This step is called renumbering. The resulting data, now in sparse matrix coordinate format (COO), can then be passed to parallel GPU math primitives. However, the conversion to COO does consume more memory.
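To make the idea concrete, here is a small, self-contained sketch of what renumbering produces, using cuDF’s pandas-style factorize (the values are made up). cuGraph performs the equivalent step internally on the GPU and translates results back to the original IDs.

```python
import cudf

# Tiny, made-up edge list with large, sparse user IDs (hypothetical values).
src = cudf.Series([99182734, 12, 99182734, 48211])
dst = cudf.Series([48211, 99182734, 12, 12])

# Factorize both endpoint columns together so they share one dense ID space.
codes, uniques = cudf.concat([src, dst], ignore_index=True).factorize()
n = len(src)
src_id, dst_id = codes[:n], codes[n:]

print(src_id, dst_id)  # the same edges, now over vertex IDs 0..len(uniques)-1
print(uniques)         # lookup table to translate back to the original IDs
```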
Next, the COO data is converted into Compressed Sparse Row (CSR) format to compress redundant information. On GPUs, CSR leads to a more predictable memory access pattern with fewer data loads per element and provides room for vectorization. This conversion uses additional memory, about two-thirds of what the COO representation required.
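A quick back-of-the-envelope calculation for the Twitter graph shows where that two-thirds figure comes from, assuming 32-bit indices and 32-bit weights:

```python
# COO stores three arrays of length E (src, dst, weight); CSR stores V+1 row
# offsets, E column indices, and E values.
E, V = 1_470_000_000, 41_700_000
itemsize = 4  # bytes per 32-bit value

coo_bytes = 3 * E * itemsize
csr_bytes = (V + 1 + E + E) * itemsize

print(f"COO: {coo_bytes / 1e9:.1f} GB")  # ~17.6 GB
print(f"CSR: {csr_bytes / 1e9:.1f} GB")  # ~11.9 GB, roughly 2/3 of COO
```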
Graph algorithms also need temporary space to store intermediate results. The amount of memory needed varies greatly from one algorithm to another but rarely exceeds ETL’s memory cost.
Finally, storing the results can be a significant part of the memory footprint for some analytics.
While most of the memory is used for ETL, the most time-consuming part is the graph algorithm (“solver”) part. Fortunately, the solver only operates on the CSR structure. Leveraging the Least Recently Used (LRU) eviction policy, all other data structures can be transparently evicted from the GPU without a performance impact. In practice, this allows freeing up over half of the memory for analytics.
What’s Next
The cuGraph team will further improve the oversubscription pipelines. Currently, beyond 2 billion edges per GPU, cuGraph hits the 32-bit integer limit and overflows occur. The team is working on adding 64-bit indexing to address this. Beyond 2 billion edges, the CSR matrix will also not fit entirely into GPU memory, so costly transfers may occur inside the solver. Fortunately, some of this cost can be overlapped thanks to prefetching with cudaMemPrefetchAsync and usage hints with cudaMemAdvise.
In parallel, cuGraph continues its mission to provide multi-GPU graph analytics, allowing our users to scale to billion- and even trillion-scale graphs. We are actively working on a “one process per GPU” paradigm, allowing cuGraph to scale across multiple GPUs and across multiple nodes.
Please feel free to reach out and let us know your thoughts on this approach or any other suggestions. We are available on Google Group and Slack. Alternatively, you can file a GitHub issue with suggested enhancements. We would also appreciate a star on GitHub.
About the Authors:
Alex Fender is a Senior Engineer at NVIDIA (Data Analytics, AI, HPC) and the co-lead of RAPIDS cuGraph. Alex holds a Ph.D. in Computer Science.
LinkedIn | GitHub | Medium
Brad Rees is a Senior Manager at NVIDIA and lead of the RAPIDS cuGraph team. Brad has been focusing on Data Analytics, AI, and HPC for over 30 years. Brad holds a Ph.D. in CS.
LinkedIn | GitHub | Medium