Intro to Graph Analysis using cuGraph: Similarity Algorithms

Don Acosta
Published in RAPIDS AI
Oct 24, 2023

RAPIDS cuGraph Logo

This is the second chapter in our blog series on GPU-accelerated graph analysis using RAPIDS cuGraph. In the first chapter, we introduced the basics of graphs and analyzed them using a few of the many algorithms in cuGraph. That blog answered some practical questions using a simple dataset and visually verified those results. In this blog, we look at a category of graph algorithms broadly called similarity. If you have not read the first blog, we encourage you to start with the introduction chapter.

Application

Let’s start by answering why similarity is important and what types of applications can use it. Note that we are focused on vertex similarity, not graph similarity. Vertex similarity is a metric of how equivalent one vertex is to another. The score is typically expressed as a value from 0.0 (0%, no similarity) to 1.0 (100%, an exact match).

Graph similarity, on the other hand, is a measure of how closely one graph (vertices and edges) matches another. This is also known as graph edit distance.

Note: Graph similarity is a topic for a future blog once we have graph similarity algorithms in cuGraph.

Similarity algorithms give insight into:

Missing relationships — these algorithms are often called “link prediction”. Incomplete datasets lead to incorrect conclusions. Running similarity algorithms on a graph can reveal likely links that can be either corroborated through other means or used in follow-up calculations to test hypotheses.

Predictions in dynamic data — adding or removing nodes or edges will change how a community behaves. By finding similar nodes, we can identify actors who might emerge or decline as the graph changes.

Making recommendations — while recommender systems can become hugely complex, a simplified model could be to find users (vertices) with “similar” behavior and then recommend to User A the items that User B has and User A lacks. Consider the bipartite graph in figure 1, where we compare User 1 (U1) to User 2 (U2). The system would then recommend product P3 to User 1 (a small code sketch follows the figure).

Depiction of a bi-partite graph showing users and products
Figure 1: Bipartite graph in recommendation example
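
To make the figure 1 example concrete, here is a minimal sketch using plain Python sets. The product sets below are hypothetical stand-ins for the bipartite edges in the figure; only the P3 recommendation is taken from the text.

# Hypothetical purchase sets for the two users in figure 1
u1_products = {"P1", "P2"}
u2_products = {"P1", "P2", "P3"}

# Since User 2 is "similar" to User 1, recommend to User 1
# the items User 2 has that User 1 lacks
recommendations = u2_products - u1_products
print(recommendations)  # {'P3'}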

Similarity Algorithms

RAPIDS cuGraph implements several algorithms that have their origins in set comparison from mathematics. When used for vertex similarity in a graph, these algorithms compare sets of nodes, where each vertex’s set generally comprises its neighbors: Set A = Neighbors(A).

cuGraph currently supports three similarity algorithms: Jaccard, Overlap, and Sørensen, which we briefly compare in table 1 below. All three algorithms are normalized to values between zero and one. However, since they normalize by different terms, their exact values do not equate.

Table 1: Similarity algorithm summaries and complexities.

Notes: |A| and |B| are the sizes of each set; see figure 2 below for how this pertains to graph comparisons in a one-hop example. Jaccard is marginally slower than the other two algorithms.

Consider this simple graph to illustrate the three similarity algorithms implemented in cuGraph. The calculations are all based on this graph:

Simple graph highlighting the first hop neighborhood of two connected nodes.
Figure 2: Simple graph to demonstrate similarity calculations.

Jaccard Similarity

The Jaccard similarity measure was developed by the botanist Paul Jaccard, who used it to compare plant species. His work popularized the measure’s use in other fields as well. It sees great use in recommender systems, particularly in collaborative filtering, which identifies users with similar preferences.

Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Figure 3: Jaccard Similarity formula — the size of the intersection of two node neighborhoods divided by the size of their union

Jaccard similarity always results in a value between zero and one. This set comparison algorithm is used in a variety of fields, including:

  • Plagiarism analysis
  • Social network analysis
  • Machine learning
  • Recommendation systems

Using the graph in figure 2 above, the similarity between nodes 1 and 2 would be:

|A ∩ B| = 5

|A ∪ B| = 8

Jaccard Similarity of (A, B) = 5/8 = 0.625
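
As a quick hand-check, here is a minimal sketch in plain Python. The neighbor sets are hypothetical (figure 2’s actual node labels are not reproduced here) but are sized to match the example above.

# Hypothetical neighbor sets sized to match figure 2:
# |A| = 7, |B| = 6, |A ∩ B| = 5, |A ∪ B| = 8
A = {2, 3, 4, 5, 6, 7, 8}   # Neighbors(1), which include node 2
B = {1, 3, 4, 5, 6, 7}      # Neighbors(2), which include node 1

jaccard = len(A & B) / len(A | B)
print(jaccard)  # 0.625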

Overlap Similarity

This measure is also known as the Szymkiewicz-Simpson coefficient.

Overlap(A, B) = |A ∩ B| / min(|A|, |B|)
Figure 4: Overlap Coefficient formula — the size of the intersection of the two neighborhoods divided by the size of the smaller one

The main difference from Jaccard is that Overlap normalizes by the size of the smaller of the two compared sets rather than by their union.
Overlap finds frequent use in:

  • Graphs with skewed degree distributions
  • Topic Modeling
  • Binary classification applications

Again, calculating based on figure 2:

|A ∩ B| = 5

min(|A|, |B|) = 6

Overlap Coefficient of (A, B) = 5/6 = 0.83
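
Reusing the hypothetical sets A and B from the Jaccard sketch above:

overlap = len(A & B) / min(len(A), len(B))
print(overlap)  # 0.8333...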

Figure 5 illustrates that as the size of set B (node 2’s community) grows, the Jaccard coefficient stays the same while the overlap coefficient changes. Therefore, Jaccard can miss community changes in dynamic datasets. This example comes from an excellent blog discussing the differences and advantages of Jaccard versus Overlap.

Depiction of how Jaccard and overlap change as the neighborhood sizes vary
Figure 5: Comparing Jaccard to Overlap as set sizes change.

Sørensen-Dice Coefficient

Like Jaccard, Sørensen considers both set sizes; however, it doubles the intersection size and does not use the size of the union. While Sørensen yields a zero-to-one score like the other two algorithms, even small intersections skew toward higher values.

While Sørensen is easily applied to graphs and neighbors, it is better known for use in semantic comparisons like:

  • N-grams
  • Fingerprint segments
  • Genome distances

Sørensen(A, B) = 2|A ∩ B| / (|A| + |B|)
Figure 6: Sørensen–Dice Coefficient formula — two times the size of the neighborhood intersection divided by the sum of the sizes of the neighborhoods

Again, going through the calculation based on figure 2:

|A ∩ B| = 5

|A| + |B| = 13

Sørensen Coefficient of (A, B) = (2 × 5)/13 = 10/13 = 0.77

Unlike Overlap, the Sørensen value will be reduced by growth of either node’s neighborhood instead of just the smaller one.
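
And the same hand-check for Sørensen, again reusing the hypothetical sets A and B from the Jaccard sketch:

sorensen = 2 * len(A & B) / (len(A) + len(B))
print(sorensen)  # 0.7692...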

Calculating the similarities using RAPIDS cuGraph

The following code sample shows how easy it is in cuGraph to compute the similarity scores.

import cugraph
from cugraph.datasets import dining_prefs

# Load the built-in dining-preferences edge list as a cuDF DataFrame
gdf = dining_prefs.get_edgelist(download=True)
# Build a graph from the 'src'/'dst' edge columns
G = cugraph.from_cudf_edgelist(gdf, source='src', destination='dst')
# Run all three similarity algorithms over every connected vertex pair
jdf = cugraph.jaccard(G)
odf = cugraph.overlap(G)
sdf = cugraph.sorensen(G)
# Show the highest-scoring pairs for each measure
print(jdf.sort_values(by='jaccard_coeff', ascending=False).head())
print(odf.sort_values(by='overlap_coeff', ascending=False).head())
print(sdf.sort_values(by='sorensen_coeff', ascending=False).head())

By default, these algorithms run on every connected pair of nodes, referred to as 1-hop neighbors. While this gives a representative answer, it can produce more results than needed.

Doing the calculation on all adjacent pairs might be impossible due to the memory requirements of processing extremely large graphs. Additionally, considering only adjacent nodes leaves out a valuable source of candidates: two-hop connections. In figure 2, nodes one and two are adjacent; had they not been directly connected, their shared neighbors would have been missed, as seen in figure 8. These two-hop relationships are often a great source of similarity between nodes that are not directly connected. Consider a communication graph where a person has two phones: the two numbers could share contacts, but logically would not call each other. Finding such aliases in a graph is a common use of similarity.

Simple graph with overlapping neighborhoods but the nodes themselves are not connected.
Figure 8: Importance of 2-hops

The cuGraph implementations of these algorithms can compare 2-hop connections, or any custom set of node pairs, via an optional second argument specifying the node pairs to examine. This is especially useful in bipartite graphs, which contain two node types with edges running only between the types. A classic bipartite example is a sales graph containing customers and products, with edges describing purchases. The code examples in figures 9 and 11 below extract all node pairs within two hops of each other and look for similarities among only those pairs.

Jupyter notebook function from cuGraph that calls similarity algorithms with the node pair parameter supplied.
Figure 9: sample method to run similarity methods with a pair list using RAPIDS cuGraph.
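
The notebook screenshot in figure 9 is not reproduced here, but a minimal sketch of such a helper might look like the following. The function name and structure are assumptions; the cuGraph calls mirror the earlier code sample, with the optional vertex-pair argument added.

import cugraph

def run_similarity(G, pairs):
    # Score only the supplied vertex pairs (a cudf.DataFrame with two
    # vertex columns) instead of every connected pair in the graph
    jdf = cugraph.jaccard(G, pairs)
    odf = cugraph.overlap(G, pairs)
    sdf = cugraph.sorensen(G, pairs)
    return jdf, odf, sdf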

The cuGraph get_two_hop_neighbors function returns all pairs within two hops of each other. An example of calling this on the dormitory data used later is shown in figure 10.

Sample results from calling the cuGraph get_two_hop_neighbors method.
Figure 10: Sample of two-hop neighbors from the dormitory data set

Jupyter notebook code in cuGraph that prints the results of calling similarity methods on the set of two-hop neighbors.
Figure 11: Extracting two-hop pairs from a cuGraph graph and passing them into the caller function using RAPIDS cuGraph.
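
Along the lines of figure 11, the two-hop pairs can then be fed straight into a helper like the one sketched above (a sketch, reusing the graph G from the earlier code sample):

# All vertex pairs within two hops of each other, as a cudf.DataFrame
pairs = G.get_two_hop_neighbors()
jdf, odf, sdf = run_similarity(G, pairs)
print(jdf.sort_values(by='jaccard_coeff', ascending=False).head())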

We will now return to the familiar New York dormitory dining-preference data (from the intro blog) and use similarity algorithms to answer several questions beyond each resident’s two favorite dining partners. Figure 12 highlights some of the similarity calculations from table 2 as they appear in the graph visualization.

NY Dormitory data visualized with similar residents highlighted
Figure 12: Dormitory data highlighting similar groups

Using the RAPIDS cuGraph similarity notebook, we calculate the similarity of each pair; the highest values are shown in the table below.

Table showing the most similar pairs in the NY Dormitory data set with all three similarity scores
Table 2: Similarity calculations from NY Dormitory data set

In this small dataset, the rankings from all three algorithms are nearly identical as shown in table 2. This will not normally be the case, but it does allow a general demonstration of what similarity reveals here.

How do the results reflect what we can infer from the diagram above?

Missing relationships — There are five residents in the graph who have transitive similarities (Louise -> Lena -> Marion -> Frances -> Adele). This cluster is revealed by the similarity scores. It is not as obvious in the graph since each resident is connection-limited by the sampling technique. The similarity chain points to a possible hierarchy or clique. In a larger graph, a set of similar nodes often points to a cohesive community within the graph.

Predictions in dynamic data — What happens if someone leaves the dorm? In this case, if Lena or Louise left, it is logical that the other would inherit her dining preferences. In a larger graph, we could re-run the algorithms after manipulating the data to see whether new similarities reveal changes in the groupings.

Making recommendations — Frances and Lena are similar to Marion but are not connected to each other. If Marion were to leave the dorm, either could be recommended as a partner in her place. A larger graph with higher edge density would provide abundant alternatives to recommend.

Wrap-up

Similarity algorithms are simple to understand but immensely useful in fields as disparate as community detection, recommender systems, and natural language processing (NLP). As shown, RAPIDS cuGraph enables running them in just a few lines of Python code, while leveraging the parallelism of GPU computing to scale across multiple GPUs, or even multiple nodes each with multiple GPUs.

The next blog in this series will focus on the centrality functionality in cuGraph. It will explore various measures of node and edge importance, focusing on why they are central to graph analytics. There will be more code examples to quickly run any of those measures on even the largest datasets.

How to get started with cuGraph
