Introduction to Graph Analysis using cuGraph
What is Graph Analysis?
When analyzing data, sometimes the relationships between the data elements are important along with the scope of the question being asked. In those cases, it is better to represent the data as a graph and apply graph analytics.
An example coming out of the COVID pandemic is that everyone became aware of the concept of contact tracing. That process involves collecting information about who someone was in contact with and who those folks were in contact with. Basically, building a social network where the nodes represent people and edges represent contact. This is a case where the relationship, who contacts whom, is the focus of the analysis more than the characteristics of the person in question. Graph analysis is the application of graph techniques and algorithms to answer questions related to the relationship between data objects.
This blog represents a brief introduction to graph analysis and is the kickoff to a series of blogs covering graph analysis using cuGraph. Later blogs will cover cuGraph in depth, with code, scale, and performance details.
The series will cover topics such as:
- Node Similarity
- Node and Edge Centrality
- Path Finding
- Community detection
- Sampling
- Graphs and types of graphs
- … and many more
Why should I care, and what questions can graphs answer?
A full discussion of graph analytics can easily morph into a complete book, see the great work by Stanley Wasserman and Katherine Faust for example, or a PhD level paper on a graph algorithm or performance. However, it is enlightening to first look at the questions graph analytics can solve and ignore the details of how algorithms work.
Graph algorithms help answer questions as diverse as:
- How are diseases spread in a community?
- How do international trade routes develop?
- What are the most important papers in a citation list?
- Which members are redundant or indispensable in an organization?
- Based on information flow, how might a department function if specific people(s) leave?
- How to get from point A to point B with the least cost — the classic traveling salesman problem
Basics of a graph
The next blog will dive into this in greater detail, but for now here are a few key graph parts:
- Nodes are actors in the data, also called vertices in graph theory.
- Edges indicate the interactions between nodes, also called links.
- Attributes are characteristics of nodes and edges, sometimes called features or properties.
Let us consider the following very simple graph that consists of two nodes, A and B, and a single edge that has a weight attribute of 3.
Introduction to RAPIDS cuGraph in 3 simple steps
RAPIDS cuGraph is part of NVIDIA’s suite of accelerated data science libraries that aims to accelerate most aspects of Data Science via GPU’s. RAPIDS cuGraph focuses on providing acceleration to graph algorithms and workflows. You do not need to be a Data Scientist or a GPU expert to benefit quickly from cuGraph. The easiest platform to use cuGraph is a Linux computer with an NVIDIA GPU (Pascal or later generation). RAPIDS cuGraph provides an intuitive and easy to use Python interface as well as a collection of Jupyter notebook examples.
Only three steps are needed to use cuGraph: (1) Load the graph data; (2) Create a Graph; and (3) Call an algorithm.
Okay, really four steps since the packages must be imported.
Let’s dive into a problem
Given a social dataset, the question we want to ask, and understand, is: who is the most popular and least popular person?
Consider a simple dataset which comes from work by JL Moreno in the 1940s looking at the dining-table partners in a dormitory at a New York State Training School. He collected data by asking all twenty-six girls in a dormitory their first and second choice of dining partners. The raw data structure is very simple. Each resident’s answer is a row in a comma-separated file. The ‘source’ is the actor answering the question, the ‘target’ is the actor they would like to dine with, and the weight is either a 1 for first choice or a 2 for second choice. From the sample data shown below, you can see that Ada would prefer to sit with Cora as her 1st choice and with Louise as her 2nd option. In the top left corner of figure 1 below , we can see edges from Ada to Cora and Louise with the corresponding edge attributes.
For determining who is the most important, we are going to look at five different centrality measurements:
In figure 4 above it is visually apparent that Eva, Marion, and Edna are the most popular dining partners. We can discern this by counting the number of edges pointing to them. It is not a surprise that the graph calculations in figure 5 below, validate our observations. In a larger dataset, visualization is more difficult or impossible.
As expected, both Eva and Marion topped all the measures here. The next question to ask is: How do their roles differ?
Betweenness and PageRank scores suggest that Eva is most important to lines of communication and social structure of the dormitory. Eigenvector and Katz scores indicate Marion wields the most influence over both the popular students and the entire population.
Conversely, the lowest scorers in these measures are also of interest.
Many residents have low connectedness as shown by the in-degree distribution chart, Figure 6 below. A total of 10 out of the 26 students were only chosen as a preferred partner once or not at all. The pool of potentially isolated residents is over one third of the population.
Let us sort the graph algorithm results, Figure 7 below, in ascending order (smallest at top) so that we can evaluate least popular actors.
Note: due to the small dataset size, degree centrality does not completely correlate to lowest degree.
We can ask a range of additional question beyond just popularity, like:
- Who in the dorm is the most isolated?
- From the data, many students share the lowest degree centrality identifying Alice, Laura, Ella, and Cora as being among the most isolated, - Who is vulnerable to bullying?
- Those like Betty and Ruth who had the lowest PageRank and Eigenvector centrality scores lack influence and connections to approach others if they are bullied. - Who would be at risk if the population changes?
- Those with low degree to others with low degree like Ada and Cora could suddenly become very isolated if either one left the dorm. - Whose voices are underrepresented in the group?
- Laura has minimum links to only highly connected students. She seems to have no peer group and could be a bystander to the mutually connected groups. Her degree centrality is minimum but scores in the middle of connectedness measures.
While the questions above are answered by observing these measures, further analysis of this dataset could address answer questions like:
- Whose removal would most disrupt the social structure in the dorm?
- Who brokers social status within the dorm?
- Who would ascend in prominence if a particular person left the dorm?
What happens when the data set gets larger?
Let us look at a larger dataset, one that contains over 40 thousand edges, and is illustrated in figure 6 below. At that scale it starts becoming nearly impossible to visually discern the characteristics of individual nodes. Imagine what a graph with over a million edges would look like.
The figure depicts an email communication network from the 1990’s infamous Enron corporate scandal. For our analysis, we will duplicate the same measurements from the above simple set. However, we must rely on some additional external enrichment data to see if the answers make sense.
This dataset contains email metadata publicized during the Enron scandal trial.
Analysis of email traffic (a type of graph analysis) helped convict many of the conspirators by establishing hidden relationships, determining key actors, and isolating small groups within the full email corpus.
- The scandal was widely impactful:
- led to the bankruptcy of Enron in 2001
- Resulted in billions of dollars in lawsuits including a $40 billion suit from shareholders.
- Arthur Anderson, a top five accounting firm, was dissolved.
- The Sarbanes-Oxley Act, among other reforms, was passed to regulate financial record keeping.
- Several Enron officers were sentenced to prison terms - The data contains 40777 unique email connections (the edges) between 6187 different email addresses (the nodes).
- Weights containing the number of emails between addresses are included in each connection data but are not used in this analysis. (Later blogs will use the weights)
Note: Katz Centrality has been omitted from these calculations since it does not properly converge. More on that in a later blog.
Within a matter of seconds (depending on your GPUs), cuGraph can identify important actors in this graph and give clues to their role in the community it represents. In this example…
- Tana.jones@enron.com is connected to the most actors in the graph.
- Jeff.dasovich@enron.com has the highest Betweenness and PageRank. That email is the most influential in the core network and the most important to the graph’s structure. It facilitates the shortest paths between other nodes in the graph.
- Pete.davis@enron.com has the highest eigenvector centrality. That node had the most influence over the entirety of the graph rather than just the other important nodes.
What makes cuGraph special?
There are many platforms and tools to do graph analytics but cuGraph has a variety of significant benefits:
- There are no licensing fees needed to use cuGraph.
- RAPIDS cuGraph integrates with widely used open-source packages like DASK, PyTorch and scikit-learn with more on the way.
- It is part of the RAPIDS open-source GPU accelerated ecosystem allowing start to finish GPU acceleration.
- Abundant Support in the RAPIDS community and from NVIDIA via GitHub.
- It is easy to set up. This article is not focused on setup/installation, but links below contain instructions for running in several environments.
- Scalability and performance are peerless.
- A single 32GB GPU can process graphs of up to 250 million edges.
- NVIDIA engineers have scale tested cuGraph on multi-node, multi-GPU systems on graphs of over one trillion edges. - User-friendly interface: There is a Python API similar to NetworkX, and for integrators there are both C and C++ APIs.
- RAPIDS cuGraph is available in Amazon AWS, Google CoLab, and Paperspace.
How to get started with cuGraph.
- Setup and installation instructions. https://docs.rapids.ai/install
- Latest stable version of cuGraph docs — https://docs.rapids.ai/api/cugraph/stable/
- The base notebook used to do these calculations https://github.com/rapidsai/cugraph/blob/main/notebooks/algorithms/centrality/Centrality.ipynb
- Using containers and Google Colab are the easiest paths to get started with cuGraph.
https://hub.docker.com/r/rapidsai/rapidsai/ contains the available containers to run cugraph
https://colab.research.google.com/drive/1cb40U3gdXZ7ASQsWZypzBFrrFOKpvnbB?usp=drive_open presents a tutorial for using cuGraph in in Google Colab
Where to find more cuGraph information.
RAPIDS cuGraph provides a full suite of documentation and examples including:
- Past blogs, presentations and white papers related to cuGraph here:https://docs.rapids.ai/api/cugraph/nightly/tutorials/cugraph_blogs/
- The full RAPIDS documentation collection: https://docs.rapids.ai/
- Thoroughly tested and maintained cuGraph Jupyter notebooks: https://github.com/rapidsai/cugraph/tree/main/notebooks
- The centrality notebook used for the calculations presented here: https://github.com/rapidsai/cugraph/blob/main/notebooks/algorithms/centrality/Centrality.ipynb
- Notebooks containing examples of each supported algorithm.
- Some include DASK integration scaling to multiple GPUs.
- All are updated with the latest cuGraph additions and fixes.
- Demonstrate cuGraph algorithm visualization, integration, and benchmarking.
In conclusion
As mentioned in the introduction, this is the first in a series of blogs covering graph analysis using cuGraph. The next blog will explore the concept of similarity, followed by blogs on centrality, transversal and other classes of graph algorithms. The series will also dive into the notion of graph ETL, and the steps needed to process the data before creating a graph. Lastly, the series will include new relevant topics and requests accepted into the cuGraph GitHub issues list at https://github.com/rapidsai/cugraph/issues.
Additional References:
For additional read on the New York State Training S study by JL Moreno, his book “Who Shall Survive” is online at: https://archive.org/details/whoshallsurviven00jlmo/mode/2up
“Social Network Analysis: Methods and Applications” by Stanley Wasserman and Katherine Faust is a wonderful book.
Lastly, “Disrupting Dark Networks” by Sean Everton is an informative read on the application of network analysis.
Licensed Material:
The Enron image is from https://lucidmanager.org/data-science/analyse-enron-corpus, and is under a Creative Commons Attribution-ShareAlike 4.0 International License.