Introduction to Graph Analysis using cuGraph

Don Acosta
RAPIDS AI
Published in
10 min readJul 11, 2023

--

cuGraph Logo

What is Graph Analysis?

When analyzing data, sometimes the relationships between the data elements are important along with the scope of the question being asked. In those cases, it is better to represent the data as a graph and apply graph analytics.

An example coming out of the COVID pandemic is that everyone became aware of the concept of contact tracing. That process involves collecting information about who someone was in contact with and who those folks were in contact with. Basically, building a social network where the nodes represent people and edges represent contact. This is a case where the relationship, who contacts whom, is the focus of the analysis more than the characteristics of the person in question. Graph analysis is the application of graph techniques and algorithms to answer questions related to the relationship between data objects.

This blog represents a brief introduction to graph analysis and is the kickoff to a series of blogs covering graph analysis using cuGraph. Later blogs will cover cuGraph in depth, with code, scale, and performance details.

The series will cover topics such as:

  • Node Similarity
  • Node and Edge Centrality
  • Path Finding
  • Community detection
  • Sampling
  • Graphs and types of graphs
  • … and many more

Why should I care, and what questions can graphs answer?

A full discussion of graph analytics can easily morph into a complete book, see the great work by Stanley Wasserman and Katherine Faust for example, or a PhD level paper on a graph algorithm or performance. However, it is enlightening to first look at the questions graph analytics can solve and ignore the details of how algorithms work.

Graph algorithms help answer questions as diverse as:

  • How are diseases spread in a community?
  • How do international trade routes develop?
  • What are the most important papers in a citation list?
  • Which members are redundant or indispensable in an organization?
  • Based on information flow, how might a department function if specific people(s) leave?
  • How to get from point A to point B with the least cost — the classic traveling salesman problem

Basics of a graph

The next blog will dive into this in greater detail, but for now here are a few key graph parts:

  • Nodes are actors in the data, also called vertices in graph theory.
  • Edges indicate the interactions between nodes, also called links.
  • Attributes are characteristics of nodes and edges, sometimes called features or properties.

Let us consider the following very simple graph that consists of two nodes, A and B, and a single edge that has a weight attribute of 3.

Simple Graph Example of two nodes with a relationship between them containing a weight of 3
Figure 1: simple graph example

Introduction to RAPIDS cuGraph in 3 simple steps

RAPIDS cuGraph is part of NVIDIA’s suite of accelerated data science libraries that aims to accelerate most aspects of Data Science via GPU’s. RAPIDS cuGraph focuses on providing acceleration to graph algorithms and workflows. You do not need to be a Data Scientist or a GPU expert to benefit quickly from cuGraph. The easiest platform to use cuGraph is a Linux computer with an NVIDIA GPU (Pascal or later generation). RAPIDS cuGraph provides an intuitive and easy to use Python interface as well as a collection of Jupyter notebook examples.

Only three steps are needed to use cuGraph: (1) Load the graph data; (2) Create a Graph; and (3) Call an algorithm.

Jupyter notebook example of running PageRank algorithm with cuGraph
Figure 2: Jupyter notebook based example of using cuGraph

Okay, really four steps since the packages must be imported.

Let’s dive into a problem

Given a social dataset, the question we want to ask, and understand, is: who is the most popular and least popular person?

Consider a simple dataset which comes from work by JL Moreno in the 1940s looking at the dining-table partners in a dormitory at a New York State Training School. He collected data by asking all twenty-six girls in a dormitory their first and second choice of dining partners. The raw data structure is very simple. Each resident’s answer is a row in a comma-separated file. The ‘source’ is the actor answering the question, the ‘target’ is the actor they would like to dine with, and the weight is either a 1 for first choice or a 2 for second choice. From the sample data shown below, you can see that Ada would prefer to sit with Cora as her 1st choice and with Louise as her 2nd option. In the top left corner of figure 1 below , we can see edges from Ada to Cora and Louise with the corresponding edge attributes.

Example of rows in dorm preference data set
Figure 3: Sample data rows
Visualization of complete dining partner preference dataset
Figure 4: New York State Training School dormitory

For determining who is the most important, we are going to look at five different centrality measurements:

Table of centrality measures with their definitions and uses
Centrality Measures and Uses

In figure 4 above it is visually apparent that Eva, Marion, and Edna are the most popular dining partners. We can discern this by counting the number of edges pointing to them. It is not a surprise that the graph calculations in figure 5 below, validate our observations. In a larger dataset, visualization is more difficult or impossible.

Results of highest centrality calculations for dining preference data set
Figure 5: Sorted Results calculated by running cuGraph’s centrality notebook

As expected, both Eva and Marion topped all the measures here. The next question to ask is: How do their roles differ?

Betweenness and PageRank scores suggest that Eva is most important to lines of communication and social structure of the dormitory. Eigenvector and Katz scores indicate Marion wields the most influence over both the popular students and the entire population.

Conversely, the lowest scorers in these measures are also of interest.

Many residents have low connectedness as shown by the in-degree distribution chart, Figure 6 below. A total of 10 out of the 26 students were only chosen as a preferred partner once or not at all. The pool of potentially isolated residents is over one third of the population.

Histogram illustrating the distribution of residents chosen as preferred dining partners
Figure 6: In-Degree Histogram

Let us sort the graph algorithm results, Figure 7 below, in ascending order (smallest at top) so that we can evaluate least popular actors.

Results of actors with the lowest centrality calculations in the dining preference data set
Figure 7: Results calculated by running cuGraph’s centrality notebook and sorting by lowest popularity

Note: due to the small dataset size, degree centrality does not completely correlate to lowest degree.

We can ask a range of additional question beyond just popularity, like:

  • Who in the dorm is the most isolated?
    - From the data, many students share the lowest degree centrality identifying Alice, Laura, Ella, and Cora as being among the most isolated,
  • Who is vulnerable to bullying?
    - Those like Betty and Ruth who had the lowest PageRank and Eigenvector centrality scores lack influence and connections to approach others if they are bullied.
  • Who would be at risk if the population changes?
    - Those with low degree to others with low degree like Ada and Cora could suddenly become very isolated if either one left the dorm.
  • Whose voices are underrepresented in the group?
    - Laura has minimum links to only highly connected students. She seems to have no peer group and could be a bystander to the mutually connected groups. Her degree centrality is minimum but scores in the middle of connectedness measures.

While the questions above are answered by observing these measures, further analysis of this dataset could address answer questions like:

  • Whose removal would most disrupt the social structure in the dorm?
  • Who brokers social status within the dorm?
  • Who would ascend in prominence if a particular person left the dorm?

What happens when the data set gets larger?

Let us look at a larger dataset, one that contains over 40 thousand edges, and is illustrated in figure 6 below. At that scale it starts becoming nearly impossible to visually discern the characteristics of individual nodes. Imagine what a graph with over a million edges would look like.

The figure depicts an email communication network from the 1990’s infamous Enron corporate scandal. For our analysis, we will duplicate the same measurements from the above simple set. However, we must rely on some additional external enrichment data to see if the answers make sense.

visualization of Enron email communication graph
Figure 6: Visualization of Enron email contacts
Image: Peter Prevos, Lucid Manager, May 3, 2023, lucidmanager.org/data-science/analyse-enron-corpus

This dataset contains email metadata publicized during the Enron scandal trial.

Analysis of email traffic (a type of graph analysis) helped convict many of the conspirators by establishing hidden relationships, determining key actors, and isolating small groups within the full email corpus.

  • The scandal was widely impactful:
    - led to the bankruptcy of Enron in 2001
    - Resulted in billions of dollars in lawsuits including a $40 billion suit from shareholders.
    - Arthur Anderson, a top five accounting firm, was dissolved.
    - The Sarbanes-Oxley Act, among other reforms, was passed to regulate financial record keeping.
    - Several Enron officers were sentenced to prison terms
  • The data contains 40777 unique email connections (the edges) between 6187 different email addresses (the nodes).
  • Weights containing the number of emails between addresses are included in each connection data but are not used in this analysis. (Later blogs will use the weights)
Table of Nodes (vertices) with highest centrality measures in the Enron email data
Figure 6: Sorted Results of Centrality Algorithms for Enron data calculated by running cuGraph’s centrality notebook

Note: Katz Centrality has been omitted from these calculations since it does not properly converge. More on that in a later blog.

Highest centrality addresses in Enron email data with owner roles
Figure 7: Important Enron vertices with addresses and roles

Within a matter of seconds (depending on your GPUs), cuGraph can identify important actors in this graph and give clues to their role in the community it represents. In this example…

  • Tana.jones@enron.com is connected to the most actors in the graph.
  • Jeff.dasovich@enron.com has the highest Betweenness and PageRank. That email is the most influential in the core network and the most important to the graph’s structure. It facilitates the shortest paths between other nodes in the graph.
  • Pete.davis@enron.com has the highest eigenvector centrality. That node had the most influence over the entirety of the graph rather than just the other important nodes.

What makes cuGraph special?

There are many platforms and tools to do graph analytics but cuGraph has a variety of significant benefits:

  • There are no licensing fees needed to use cuGraph.
  • RAPIDS cuGraph integrates with widely used open-source packages like DASK, PyTorch and scikit-learn with more on the way.
  • It is part of the RAPIDS open-source GPU accelerated ecosystem allowing start to finish GPU acceleration.
  • Abundant Support in the RAPIDS community and from NVIDIA via GitHub.
  • It is easy to set up. This article is not focused on setup/installation, but links below contain instructions for running in several environments.
  • Scalability and performance are peerless.
    - A single 32GB GPU can process graphs of up to 250 million edges.
    - NVIDIA engineers have scale tested cuGraph on multi-node, multi-GPU systems on graphs of over one trillion edges.
  • User-friendly interface: There is a Python API similar to NetworkX, and for integrators there are both C and C++ APIs.
  • RAPIDS cuGraph is available in Amazon AWS, Google CoLab, and Paperspace.

How to get started with cuGraph.

Where to find more cuGraph information.

RAPIDS cuGraph provides a full suite of documentation and examples including:

In conclusion

As mentioned in the introduction, this is the first in a series of blogs covering graph analysis using cuGraph. The next blog will explore the concept of similarity, followed by blogs on centrality, transversal and other classes of graph algorithms. The series will also dive into the notion of graph ETL, and the steps needed to process the data before creating a graph. Lastly, the series will include new relevant topics and requests accepted into the cuGraph GitHub issues list at https://github.com/rapidsai/cugraph/issues.

Additional References:

For additional read on the New York State Training S study by JL Moreno, his book “Who Shall Survive” is online at: https://archive.org/details/whoshallsurviven00jlmo/mode/2up

Social Network Analysis: Methods and Applications” by Stanley Wasserman and Katherine Faust is a wonderful book.

Lastly, “Disrupting Dark Networks” by Sean Everton is an informative read on the application of network analysis.

Licensed Material:

The Enron image is from https://lucidmanager.org/data-science/analyse-enron-corpus, and is under a Creative Commons Attribution-ShareAlike 4.0 International License.

--

--