Status of RAPIDS cuGraph — Refactoring Code And Rethinking Graphs
By: Brad Rees and Alex Fender
TL;DR: cuGraph has refactored code to fix error handling, improve low-level API, add multi-column renumbering, auto-renumbering, and new graph classes. Additionally, there is a discussion on moving beyond the NetworkX style API.
The RAPIDS cuGraph library has been quiet over the past few months. But do not worry, we have not gone away. A lot has been going on behind the scenes with cuGraph over the RAPIDS 0.11 and RAPIDS 0.12 releases. We are finishing up a major refactoring which has touched every piece of code, and which the final few items will be completed in the 0.13 release.
A lot of the refactoring work might go unnoticed unless you are an algorithm developer or like looking at the underlying C/CUDA code. However, a lot of the refactoring work will be directly seen, and hopefully appreciated by the Python user. We are rolling out new Graph classes, improving error handling, and providing a more consistent API. Additionally, we aim to improve the Python user experience through more automation, better DataFrame support, and richer API. These are the first steps toward converting the former low-level C-like APIs into C++ APIs. Developers are encouraged to review the transition guide.
Moreover, when we started refactoring, we spent time thinking about graphs in general and how data scientists access and use graphs. We asked ourselves if our current APIs and access patterns were the most intuitive and best that we could do. This article includes a description of what we have accomplished and a discussion of where we are going. We are eager to hear your opinion on if we are going in the right direction.
cuGraph C++ API Changes
We have been progressively dropping the former C API and replacing it with a more modern C++ API. Most important upgrades are:
- All C++ API functions are encapsulated in the
cugraph
namespace - All functions now process errors through exceptions (see below), hence functions now return
void
instead ofgdf_error
gdf_graph
is nowGraph
(see below).- Pagerank, BFS, and SSSP have dropped the
gdf_column
dependency in favor of basic types and templates. The other analytics will be updated in the 0.13 release
Additionally, all automatic conversions and decision making were removed from the C++ layer and push into the Python layer. The C++ API provides functions that efficiently convert between data formats and access to efficient CUDA algorithms. Example of C++ API before/after refactoring:
// 0.10 C API
gdf_error err = gdf_pagerank(gdf_graph, ...)// 0.12 C++ API
cugraph::pagerank<VT,WT>(cugraph::Graph, ...)
Error Handling
Let’s face it; before release 0.12, we did a poor job of error handling. The C code simply returned an error code that was very generic and, in most cases, meaningless — like code 18 for ‘GDF_C_ERROR’, which provided no indication wherein the C code something went wrong. That type of error handling is not helpful for users or developers. We have switched over to having the C code throw an exception with an error string, thereby allowing us to write better, more meaningful messages. The 0.12 release replaced all the C code that returned an error with code to throw an exception. Over the next few releases, we will be going through all the error messages and updating the text to be as explicit and meaningful as possible.
Algorithm Updates
Refactoring has been the primary focus of the past two releases, but there was some time to add new analytics and features to cuGraph. We are happy to announce:
Ensemble Clustering for Graphs, or ECG -> This is a variant of the Louvain community detection algorithm to address the resolution limit and stability shortcomings of Louvain through the use of co-association consensus. Louvain is a modularity-based algorithm, and modularity has a known tendency towards producing large clusters. By executing several truncated runs of Louvain on an ensemble of permutations of the graph, and taking the consensus, new affinity weights are generated. The final result is found by running a full Louvain using the determined weights, resulting in the detection of higher quality communities.
Louvain Max Iteration (max_iter) -> this simple modification allows a user to specify how many iterations of Louvain to run. The Louvain algorithm is a multi-level algorithm, meaning that it finds a locally optimal clustering of the graph and then contracts the graph and repeats the process on the contracted graph. Some algorithms, such as Ensemble Clustering for Graph, need the intermediate result of Louvain at a specific level. In CuGraph we expose this capability with the ‘max_iter’ parameter to Louvain. When specified Louvain will terminate early returning the result from the specified level.
Auto-Renumbering and Multi-Column Vertex IDs
There is a long legacy of graph data needing to be expressed as integers. That requirement stems from graphs being internally represented as a matrix and using the vertex IDs as indexes into the matrix that is leveraged later on for sparse matrix compression and vectorized accesses on the GPU. The representation dictates that vertex IDs start at zero and are contiguous. Missing vertex IDs are still captured in the data, so compressing down to the smallest contiguous space is beneficial.
In my years of analyzing data, I rarely came across a real-world dataset where the vertex IDs started at zero or were even integers. In social networking data, vertices are commonly people’s names. In Financial data, vertices can be account numbers. Recommendation systems use product numbers. That is just a quick sampling of the many situations where the data does not fit what our low-level graph algorithms expect. In a lot of cases, the vertex keys are not a single field. For example, looking at cyber data, IP addresses are typically used as vertex ID. While it is easy to convert an IP from dotted-decimal notation to a 64-bit integer, the problem is that even small IP addresses such as “1.0.0.0”, equates to 16,777,216. That is a long way from being zero and would fill out our matrix with a lot of wasted space. Additionally, IP values in cyber datasets are typically not contiguous.
Renumbering is relatively fast, and in a lot of cases, mandatory. The question that we pondered was since renumbering is mandatory, why not just automatically renumber and auto un-renumber, hiding that step from the user? The good news is that we have added an auto-renumbering function to help users get their data into the required format. The auto-renumbering feature converts the data into the proper data type required by the underlying implementation. It will then transparently un-renumber results, basically convert the internal IDs back to the specified columns.
The next question that we looked at was; why are vertices specified as a single column? In many situations, including those from the data fields listed above, a vertex can be defined by two or more columns. Consider cyber data; it is sometimes of interest to look at the data as `IP + port` pairs rather than just IP. An IP defines a server, and the port defines a service that the server provides. Rather than merging all the services (e.g. just using the IP), the graph can be composed of individual services (IP + port), as shown in the figure below. In other cases, a vertex can be geolocation represented by Latitude + Longitude. These are simple two-column examples, but that doesn’t mean that there are not cases of three or more columns representing vertices.
Release 0.12 includes an expanded renumbering API to support multi-column vertices.
Graph Class — Graph Types
Up to this point, cuGraph only had one graph type, which put the burden on the user to know what type of graph was created and which analytics could be applied against that data without causing errors or producing erroneous results. As part of the refactoring and in release 0.12, we added several graph type classes, which will allow analytics to quickly validate that the algorithm will run correctly. These new graph types will help with NetworkX compatibility.
Graph -> an Undirected graph where edge data is symmetrized. As part of the refactoring, cuGraph will automatically “symmetrize” undirected inputs. That is to say that each undirected edge (u,v) is stored as two directed edges (u,v) and (v,u). This allows extracting more independent tasks when running parallel graph algorithms while having a minimal performance impact because of the high memory bandwidth of GPUs.
When viewing the graph or requesting the number of edges, cuGraph will return this symmetrized view.
DiGraph -> a directed graph. In the past, directed graphs were stored using the same `Graph` type as undirected graphs. We now have a `DiGraph` class that should be used for directed graphs instead. There is also an option to go from DiGraph to undirected `Graph.` However, there is currently no way to go from Undirected to Direct.
Future Graph Types Coming Over the Next Few Releases:
Those two graph types are just the first step. Over the next few releases, we will be rolling out more graph type classes.
MultiGraph and MultiDiGraph -> a graph where there can be multiple edges between any vertex pair. The `Multi(Di)Graph` types will be added to let users load and operate on multiple edge dimensions. Regular `Graph` and `DiGraph` can efficiently be built out of these new structures to take advantage of existing graph analytics to analyze any dimension.
BipartiteGraph and BipartiteDiGraph -> a graph where there are two, or more, vertices types and where there are no edges between vertices of the same type. Bipartite graphs can currently be loaded as either a Graph or DiGraph, however many algorithms will give erroneous results, like PageRank.
DualGraph also called a line graph or interchange graph, and DualDiGraph -> See https://en.wikipedia.org/wiki/Line_graph. A Dual graph is one where edges are converted to vertices and vertices into edges. That translation allows for a shift in the focus on many analytics. Additionally, it makes some hidden structures easier to find.
HyperGraph and HyperDiGraph — See https://en.wikipedia.org/wiki/Hypergraph. A hypergraph allows an edge to connect multiple vertices rather than a vertex pair.
Adhering and Diverging from the NetworkX API
NetworkX is a well known and popular Python-based graph analytic package that has been around for over 14 years. Its popularity is the reason that we try and emulate the NetworkX API. Providing a similar API will give NetworkX users an easy transition path to cuGraph, with the goal being a near drop-in replacement. Over the next few releases, we will focus on fleshing out our API so that we meet that drop-in replacement goal.
However, not all of the anticipated users will be coming from a NetworkX background. And while NetworkX does have a very nice API and workflow, it is not a seamless match with where we would like to be. We envision the use of Property Graphs, which are a natural fit with DataFrame, and application of non-graph analytics against the same data. Users can do all their ETL against the property graph using cuDF and then call graph analytics via cuGraph. (More on Property Graph and cuGraph in an upcoming BLOG). Therefore, we are also going to provide non-NetworkX APIs that are more in-line with the overall cross-functionality scope of RAPIDS.
Consider the execution of PageRank as an example. Using the NetworkX API, we need first to create a Graph, then load the data into the graph, call the PageRank analytic, and finally remember to clean up. The non-NetworkX API performs all of those steps but hides them from the user.
We will also be adding convenience functions to help the transition away from NetworkX API. For example, the `from_cudf_edgelist` function will be enhanced to accept multiple vertex columns, allow auto-renumbering, and for the graph type to be specified. A draft proposal of a new function could be, which does not require a graph object to be initially created, is something like:
cugraph::from_cudf_edgelist(
df = a dataframe,
source = single column or list of columns,
destination = single column or list of columns,
edge_attr = None, single column, or list of columns,
create_using = {Graph, DiGraph, …},
auto_renumber=True )
Final Graph Musing
The goal of cuGraph is always to provide fast graph algorithms within the RAPIDS ecosystem, all wrapped by user-friendly APIs. We encourage and appreciate feedback and suggestions. Please feel free to reach and let us know your thoughts on this proposal or any other suggestions. We are available on Google Group and Slack. Alternatively, you can file a GitHub issue with suggested enhancements. We would also appreciate a star on GitHub.
About the Authors:
Brad Rees is a Senior Manager at NVIDIA and lead of the RAPIDS cuGraph team. Brad has been focusing on Data Analytics, Artificial Intelligence, and HPC for over 30 years. Brad holds a Ph.D. in Computer Science.
Alex Fender is a senior engineer at NVIDIA and the co-lead of RAPIDS cuGraph. He has been developing GPU accelerated solutions at NVIDIA for more than 6 years, focusing on Data Analytics, Artificial Intelligence, and HPC. Alex holds a Ph.D. in Computer Science.