RAPIDS cuGraph is on a mission to provide multi-GPU graph analytics to allow our customers to scale to billion and even trillion scale graphs. The first step along that path is the release of a single-node multi-GPU version of PageRank.
Experimental results show that an end-to-end pipeline involving the new multi-GPU PageRank is on average 80x faster than Apache Spark when comparing one DGX-2 vs 100 Spark nodes on a 300GB dataset. At the CUDA level, a graph of this size is traversed at a speed of 38 billion edges per second on a single node.
PageRank measures the relative importance of elements in a graph by creating a score based on the propagation of influence between nodes. The underlying assumption is that important nodes are linked to from other important nodes. A node that is linked to by many other nodes with high PageRank scores receives a high rank itself.
One interesting aspect of PageRank is its ability to account for the probability of not following links. For example, on the graph that represents the web, it is estimated that a user has an 85% chance of clicking a link on the current webpage (continuing to surf) and a 15% chance of simply opening a new window and jumping to a random web page. In the algorithm, this is the alpha parameter, which is also referred to as the teleport or damping factor parameter.
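The role of the damping factor can be made concrete with a minimal power-iteration PageRank in pure Python with NumPy. This is a sketch of the textbook algorithm, not cuGraph's CUDA implementation, and the tiny graph below is purely illustrative:

```python
import numpy as np

def pagerank(edges, n, alpha=0.85, iters=20):
    """Textbook power-iteration PageRank on a directed edge list.

    alpha is the damping factor: the probability of following a link,
    versus (1 - alpha) of teleporting to a random node.
    """
    out_deg = np.zeros(n)
    for src, _ in edges:
        out_deg[src] += 1
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - alpha) / n)   # teleport contribution
        for src, dst in edges:
            new[dst] += alpha * rank[src] / out_deg[src]
        # mass from dangling nodes (no outgoing links) is spread uniformly
        dangling = rank[out_deg == 0].sum()
        new += alpha * dangling / n
        rank = new
    return rank

# Tiny 4-node example: node 3 is linked to by every other node.
edges = [(0, 3), (1, 3), (2, 3), (3, 0)]
r = pagerank(edges, 4)
print(int(r.argmax()))  # node 3 ends up with the highest score
```

With 20 iterations, as used throughout this post, the scores of a small graph like this have long since converged; the scores always sum to 1 because the teleport and link contributions together redistribute all of the rank mass each iteration.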
PageRank is most famously used in search engines; however, it has applications in social network analysis, recommendation systems, and novel uses such as prioritizing vaccination during epidemics. PageRank is related to eigenvector centrality and Katz centrality.
cuGraph’s Multi-GPU software stack
At a high level, cuGraph exposes the new multi-GPU PageRank feature through a Python API that leverages Dask cuDF distributed DataFrames. Dask is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.
Additionally, cuGraph uses other Dask-based projects from RAPIDS such as dask-cuda. The backend is a multi-GPU power iteration eigensolver implemented in CUDA/C++. For this initial single-server release, we leverage NVLink for fast GPU-to-GPU communication. PCIe is also supported, with a small performance penalty.
The current solution is able to scale across multiple GPUs on a single machine. By distributing the graph and computation, users are able to analyze datasets far larger than a single GPU’s memory. With cuGraph and Dask, whether you’re using a single NVIDIA GPU or using all 16 NVIDIA V100 GPUs on a DGX-2, your RAPIDS workflow will run smoothly, intelligently distributing the workload across the available resources.
The primary motivation for the multi-GPU implementation is to handle problems that cannot fit into the memory of a single GPU. In order to bypass this limitation, the graph is split into several partitions. If your graph comfortably fits in memory on a single GPU, you would want to use the single-GPU version of cuGraph’s PageRank instead.
Try it out!
To illustrate how to use the new multi-GPU PageRank analytics, we have released two new notebooks. The first notebook is intended to get you started in a few minutes on a real-world example. In this notebook, we analyze a Twitter dataset (26GB on disk) with 41.7 million users with 1.47 billion social relations (edges) to find out the most influential profiles. The example uses two V100 32GB GPUs, but we encourage users to experiment with different configurations, as long as the sum of GPU memory is 64GB or more. The notebook walks through reading the CSV file using cuDF, then computing the PageRank score, and then sorting the PageRank scores using cuDF. Finally, we use a map to convert Vertex IDs into Twitter’s numeric IDs and retrieve the Twitter username.
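The post-processing at the end of that notebook, sorting by score and mapping vertex IDs back to usernames, can be sketched in plain Python. The notebook itself does this with cuDF DataFrames on the GPU; the vertex IDs, scores, and usernames below are hypothetical stand-ins:

```python
# Sketch of the notebook's post-processing step, in plain Python for
# illustration (the notebook uses cuDF DataFrames on the GPU).
# All IDs, scores, and usernames below are hypothetical.
scores = {0: 0.12, 1: 0.55, 2: 0.08, 3: 0.25}             # vertex -> PageRank
vertex_to_twitter = {0: 1001, 1: 1002, 2: 1003, 3: 1004}  # vertex -> Twitter ID
twitter_to_name = {1001: "alice", 1002: "bob", 1003: "carol", 1004: "dave"}

# Sort by PageRank score, highest first, then map IDs to usernames.
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
influencers = [twitter_to_name[vertex_to_twitter[v]] for v, _ in top]
print(influencers[0])  # the most influential profile in this toy data
```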
The second notebook requires a DGX-2 or comparable server with 16 fully-connected 32GB V100s. The reason is that the notebook explores processing a 300GB (on disk) HiBench dataset (Zipfian distribution). A good rule of thumb is to have twice the amount of GPU memory as the file size, so the CSV parser has room for the raw data and extracted columns. This notebook shows how to load a large input split across multiple files (32 partitions) and across multiple GPUs so that the overhead per file is small enough to process 300GB of raw data in only 512GB of GPU memory. Once the data is loaded, then the Multi-GPU PageRank analytic is called as in the first notebook.
Based on our experimental results, the new multi-GPU PageRank feature can analyze over 16 billion links to rank half a billion nodes in just a few seconds on a DGX-2.
Profile and architecture
The profile of the pipeline for processing 146GB of data and running 20 PageRank iterations is shown in the following pie chart. Notice that most of the time is spent performing I/O and ETL. Only the purple parts represent the steps behind cuGraph’s PageRank API. Using the GPU-accelerated CSV reader from cuDF greatly accelerates the process, but I/O still consumes 72% of the time. Dask and Python binding overheads have been reduced to a minimum of 7% of the total time. Once data is ready, the actual PageRank computation takes only 9.8% of the end-to-end time.
The conversion from edge list (COO format) to adjacency list (CSR format) is a one-time operation performed at the CUDA level. This enables better vectorized memory accesses and thus faster iterations later on. Notice that some workloads run more than 20 iterations; in that case, only the “PageRank CSR” part would grow.
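The COO-to-CSR conversion can be sketched in a few lines of NumPy. This is illustrative only; cuGraph performs the equivalent step in CUDA:

```python
import numpy as np

def coo_to_csr(src, dst, n):
    """Convert an edge list (COO) to an adjacency list (CSR).

    Returns row offsets and column indices: the neighbors of vertex v
    are col_idx[offsets[v]:offsets[v + 1]], stored contiguously so
    per-vertex accesses vectorize well.
    """
    order = np.argsort(src, kind="stable")   # group edges by source vertex
    col_idx = np.asarray(dst)[order]
    counts = np.bincount(src, minlength=n)   # out-degree of each vertex
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return offsets, col_idx

src = np.array([0, 0, 1, 2, 2, 2])
dst = np.array([1, 2, 2, 0, 1, 3])
offsets, col_idx = coo_to_csr(src, dst, 4)
# offsets -> [0, 2, 3, 6, 6]; vertex 2's neighbors are col_idx[3:6]
```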
In the current implementation, the graph is split so that each GPU carries a chunk of the adjacency list (where each one has about the same number of edges) and a copy of the vertex sets as shown below.
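One way to picture this split is to cut the CSR row offsets into contiguous vertex ranges carrying roughly equal edge counts. This is a simplified sketch under that assumption; cuGraph's actual partitioning logic may differ:

```python
import numpy as np

def partition_rows(offsets, num_gpus):
    """Split CSR rows into contiguous chunks with ~equal edge counts.

    Each chunk is a (first_vertex, last_vertex) range; chunk i would be
    assigned to GPU i, alongside a full copy of the vertex data.
    """
    total_edges = offsets[-1]
    bounds = [0]
    for i in range(1, num_gpus):
        target = i * total_edges / num_gpus
        # first vertex whose cumulative edge count reaches the target
        bounds.append(int(np.searchsorted(offsets, target)))
    bounds.append(len(offsets) - 1)
    return [(bounds[i], bounds[i + 1]) for i in range(num_gpus)]

# 6 vertices, 12 edges; each GPU ends up with 6 edges.
offsets = np.array([0, 4, 6, 8, 10, 11, 12])
print(partition_rows(offsets, 2))
```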
We compared cuGraph against Spark on artificial graphs from the WebSearch PageRank benchmark in Intel-bigdata/HiBench. HiBench is a Big Data benchmark suite that helps evaluate different big data frameworks. In our experimental setting, sizes vary from 3GB (2 billion edges) to 300GB (16 billion edges). In the figure below, the column on the right shows the number of GPUs needed to run the multi-GPU PageRank feature. cuGraph results were obtained on a DGX-2 (1 node, 16 GPUs) and Spark results were obtained on GCP dataproc, n1-standard-8 instance with up to 100 nodes.
The figure below compares the performance of the new multi-GPU PageRank algorithm to Spark by looking at the end-to-end pipeline. In this pipeline, I/O times (CSV read/write) are included. We set the number of PageRank iterations to be 20 to reflect real use cases.
On the largest dataset (16 billion edges) using 100 nodes, it takes Spark 3.5 hours to complete this pipeline while cuGraph takes only 2.5 minutes on a single DGX-2 server. This equates to an 80x improvement in performance.
Speedup increases as the dataset size and the number of iterations increase. This is because I/O times account for the bulk of the end-to-end runtime and remain constant as the number of iterations changes. For instance, the speedup would be 19x, instead of 80x, if we were to run 3 iterations. On the largest dataset, one PageRank iteration takes 1 second when looking only at the Python call and excluding I/O time.
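The dependence of speedup on iteration count follows from a simple end-to-end time model: total time = fixed I/O cost + iterations × per-iteration cost. The constants below are illustrative assumptions, not measured values, chosen only to show the shape of the curve:

```python
# Toy model: end-to-end time = fixed I/O cost + per-iteration solver cost.
# All constants here are hypothetical, for illustration only.
def end_to_end(io_seconds, per_iter_seconds, iters):
    return io_seconds + per_iter_seconds * iters

def speedup(iters):
    spark = end_to_end(io_seconds=600.0, per_iter_seconds=600.0, iters=iters)
    cugraph = end_to_end(io_seconds=120.0, per_iter_seconds=1.0, iters=iters)
    return spark / cugraph

# I/O cost is fixed while iteration cost grows much faster on the slower
# system, so the speedup ratio rises with the number of iterations.
print(round(speedup(3), 1), round(speedup(20), 1))
```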
The image below shows cuGraph’s multi-GPU PageRank Solver runtime (excluding I/O) for various sizes and iterations.
As shown in the previous section, the PageRank solver time includes constant overheads and transformations when looking at it from the Python perspective. One iteration of the CUDA solver is actually only 0.4s on the largest dataset (16 billion edges) which corresponds to 38 billion traversed edges per second.
What comes next
The roadmap calls for additional analytics and continued improvement in performance and user experience, with the goal of bringing multi-GPU support to more cuGraph algorithms. The cuGraph team is eager to hear feedback and impressions of this first multi-GPU feature. A series of blogs on graph analytics is planned, so look for more posts on cuGraph to come.