The Data Scientist has a collection of techniques within their proverbial toolbox. Data engineering, statistical analysis, and machine learning are among the most commonly known. However, there are numerous cases where the focus of the analysis is on the relationship between data elements. In those cases, the data is best represented as a graph. Graph analysis, also called network analysis, is a collection of algorithms for answering questions posed against graph data. Graph analysis is not new.
The first graph problem was posed by Euler in 1736, the Seven Bridges of Konigsberg, and laid the foundation for the mathematical field of graph theory. The application of graph analysis covers a wide variety of fields, including marketing, biology, physics, computer science, sociology, and cyber to name a few. RAPIDS version 0.6 includes the first official release of cuGraph.
RAPIDS cuGraph is a library of graph algorithms that seamlessly integrates into the RAPIDS data science ecosystem and allows the data scientist to easily call graph algorithms using data stored in a GPU DataFrame. For the initial release, graph data needs to be in Coordinate Format (COO), also known as an edge list, and is represented by two columns, Source and Destination. Consider the following example of NetFlow data and looking at just the first five columns. The “id.orig_h” and “id.resp_h” columns identify an event between two IP addresses. Those two columns represent an edge list that can be used for graph processing.
Once the two edge columns are identified the data scientist can then leverage any of the cuGraph analytics. To run the PageRank algorithm the user would perform something like:
G = cugraph.Graph()
df = cugraph.pagerank(G)
One of the design goals of cuGraph is to have an API familiar to the data scientist used to performing graph analytics. Therefore, the data scientist with experience with NetworkX will recognize the process of creating a graph object and then running an analytic against that graph object. A more in-depth example is given at the bottom of this blog.
RAPIDS cuGraph 0.6 release
RAPIDS version 0.6 represents the first official release of cuGraph and the first step towards a complete graph analysis package. This initial release focuses on providing a foundation and includes several algorithms optimized for single-GPU analytics.
- Jaccard Similarity — a measure of neighborhood similarity between connected vertices. Within recommendations systems, this is very useful for finding customers with similar behavior.
- Weighted Jaccard — this is similar to Jaccard except that the algorithm sums the vertex weights.
- Page Rank — this is a measure of relative importance, most famously used in search engines, however it has applications in social network analysis, recommendation systems, and for novel uses in natural science when studying the relationship between proteins and in ecological networks.
- Single Source Shortest Path (SSSP) — is used to identify the shortest path between a pair of vertices. Within a road network it can be used to find the fastest path from A to B. Moreover, SSSP can be used for optimizing a wide range of logistics problems.
- Breadth-First Search (BFS) — this is a classic search algorithm that iteratively explores the graph. Starting at a seed point, the algorithm steps out one hops per iteration. As shown in the image to the left.
- Spectral Clustering — graph clustering consists of grouping vertices based on their characteristics such that there is high intra‐cluster similarity and low inter‐cluster similarity. There are many ways of determining these groups. The spectral clustering scheme constructs a matrix, solves an associated eigenvalue problem, and extracts splitting information from the calculated eigenvectors. Both a Modularity Maximization and Min-Cut version are included.
- Louvain Clustering — is another graph clustering technique. Louvain uses Modularity as the metric for iteratively combining vertices into clusters. Louvain starts with each vertex in its own cluster and iteratively mergers clusters based on modularity.
The above listed algorithms are designed for execution on a single-GPU with data sets around 500 million edges or less. When compared against a single-node NetworkX analytic in Python, the data scientist can expect performance improvement of 50–500x on average.
As a means of measuring performance, the PageRank websearch benchmark from the HiBench Suite was used. The HiBench website describe the suite as “… a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations”. One feature of HiBench is that it measures the end-to-end performance that a user will experience. That includes, data loading, data prep, and running of an analytic. The PageRank benchmark consist of processing a graph based on webpages (vertices) and the connections between the webpages (edges).
The biggest challenge faced when executing the benchmark was with the data readers. The CSV reader within cuDF is significantly faster than the one found in Pandas, however we encountered scalability issues with large datasets. To ensure that large data sets were processed, the CSV reader from cuDF was used along with the CSV reader found in Pandas. Additionally, the benchamrk was run on a 10-node Spark cluster and on a high power workstation running NetworkX . Since NetworkX is the most popular graph framework used by data scientists, those results will be used as the baseline for performance evaluation.
RAPIDS cuDF + cuGraph is 300x faster than NetworkX for the huge dataset. If data loading times are removed and just the PageRank performance numbers compared, then cuGraph is 3400x faster than NetworkX. As dataset sizes increase, data reading start to become the limiting factor with data reading contributing to over 92% of the runtime.
Finding external comparable benchmarks for Spark is non-trival since the number of nodes impacts performance. A benchmark result from Mellanox using 6 dual-Xeon nodes and RDMA run the gigantic dataset in 1,087 seconds . Lu Liu reported gigantic results of 9,953 seconds for Spark and 4,078 seconds for Hadoop on a four node cluster. Lastly, Hameeza Ahmed et al, measured just the PageRank portion on 4 nodes against the gigantic dataset with a runtime of 346.85 seconds. In comparison cuGraph PageRank on the gigantic dataset took 3.7 seconds, a speedup of 103x.
The cuGraph GitHub repository has detailed instructions on how to access the cuGraph library. See https://github.com/rapidsai/cugraph
Additionally, a collection of sample cuGraph notebooks can be found within the RAPIDS notebook repository under cugraph: https://github.com/rapidsai/notebooks/tree/branch-0.6/cugraph.
The following example runs through how to execute Weighted Jaccard on a graph where the weights are the PageRank scores.
What comes next
RAPIDS is a set of open source libraries for GPU accelerating data preparation, machine learning, and now graph analytics. The roadmap is for additional analytic and constant improvement in performance, with the goal being that every cuGraph algorithms supports multi-GPU. The cuGraph team is eager to hear feedback and impressions of this initial release. A series of blogs on graph analytic is planned, so look for more blogs on cuGraph to come.