Intro to Graph Analysis using cuGraph: Similarity Algorithms
This is the second chapter in our blog series on GPU-accelerated graph analysis using RAPIDS cuGraph. In the first chapter, we introduced the basics of graphs and analyzed them using a few of the many algorithms in cuGraph. That blog answered some practical questions using a simple dataset and visually verified those results. In this blog, we look at a category of graph algorithms broadly called similarity. If you have not read the first blog, we encourage you to start with the introduction chapter.
Application
Let’s start by answering why and how similarity is important, and what types of applications can use it. Note that we are focused on vertex similarity, not graph similarity. Vertex similarity is a metric of how equivalent one vertex is to another vertex. The score is typically expressed as a fraction, where 0.0 (0%) means no similarity and 1.0 (100%) means an exact match.
Graph similarity, on the other hand, is a measure of how closely one graph (vertices and edges) matches another graph; a common such measure is graph edit distance.
Note: Graph similarity is a topic for a future blog once we have graph similarity algorithms in cuGraph.
Similarity algorithms give insight into:
Missing relationships — these algorithms are often called “link prediction”. Incomplete datasets lead to incorrect conclusions. Running similarity algorithms on a graph can reveal likely links that can be either corroborated through other means or used in follow-up calculations to test hypotheses.
Predictions in dynamic data — adding or removing nodes or edges will change how a community behaves. By finding similar nodes, we can identify actors who might emerge or decline as the graph changes.
Making recommendations — while recommender systems can become hugely complex, a simplified model is to find users (vertices) with “similar” behavior and then recommend to User A the items User B has that User A does not. Consider the bipartite graph in figure 1, where we compare User 1 (U1) to User 2 (U2). The system would then recommend product P3 to User 1.
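This simplified recommendation model can be sketched with plain Python sets; the user and product names below are hypothetical stand-ins for the figure-1 example:

```python
# Hypothetical purchase sets for the users in figure 1 (illustrative only)
purchases = {
    "U1": {"P1", "P2"},
    "U2": {"P1", "P2", "P3"},
}

def recommend(target, other):
    """Recommend items the similar user has that the target does not."""
    return purchases[other] - purchases[target]

print(recommend("U1", "U2"))  # -> {'P3'}, i.e. suggest P3 to User 1
```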
Similarity Algorithms
RAPIDS cuGraph implements several algorithms that have their origin in set comparisons from mathematics. When used for vertex similarity in a graph, these algorithms compare sets of nodes, generally the neighborhoods of the vertices being compared, where Set A = Neighbors(A).
cuGraph currently supports three similarity algorithms: Jaccard, Overlap, and Sørensen, which we briefly compare in table 1 below. All three algorithms are normalized to values between zero and one. However, since they normalize by different terms, the exact values are not directly comparable.
Notes: |A| and |B| are the sizes of each set; see figure 2 below for how this pertains to graph comparisons in a one-hop example. Jaccard is marginally slower than the other two algorithms.
Consider this simple graph to illustrate the three similarity algorithms implemented in cuGraph. The calculations are all based on this graph:
Jaccard Similarity
The Jaccard similarity measure was developed by the botanist Paul Jaccard, who used it to compare plant species. His work popularized the measure’s use in other fields as well. It sees great use in recommender systems, particularly in collaborative filters that identify users with similar preferences.
Jaccard similarity always results in a value of zero to one. This set comparison algorithm is used in a variety of fields including:
- Plagiarism analysis
- Social network analysis
- Machine learning
- Recommendation systems
Using the graph in figure 2 above, the similarity between nodes 1 and 2 would be:
|A ∩ B| = 5
|A ∪ B| = 8
Jaccard(A, B) = |A ∩ B| / |A ∪ B| = 5/8 = 0.625
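The same arithmetic can be checked with plain Python sets. The element labels below are illustrative, chosen only so the set sizes match the figure-2 counts:

```python
# Illustrative neighbor sets matching the figure-2 counts (|A| = 7, |B| = 6)
A = {2, 3, 4, 5, 6, 7, 8}   # Neighbors(node 1)
B = {1, 3, 4, 5, 6, 7}      # Neighbors(node 2)

jaccard = len(A & B) / len(A | B)   # |A ∩ B| / |A ∪ B|
print(jaccard)  # -> 0.625
```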
Overlap Similarity
This measure is also known as the Szymkiewicz-Simpson coefficient.
The main difference from Jaccard is that Overlap considers only the size of the smaller of the two compared sets.
Overlap finds frequent use in:
- Graphs with a larger degree distribution
- Topic Modeling
- Binary classification applications
Again, calculating based on figure 2:
|A ∩ B| = 5
min(|A|, |B|) = 6
Overlap(A, B) = |A ∩ B| / min(|A|, |B|) = 5/6 ≈ 0.83
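Using the same illustrative sets as before (sized to match the figure-2 counts), the overlap coefficient works out the same way:

```python
# Illustrative neighbor sets matching the figure-2 counts
A = {2, 3, 4, 5, 6, 7, 8}   # Neighbors(node 1)
B = {1, 3, 4, 5, 6, 7}      # Neighbors(node 2)

overlap = len(A & B) / min(len(A), len(B))   # |A ∩ B| / min(|A|, |B|)
print(round(overlap, 2))  # -> 0.83
```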
Figure 5 illustrates that as the size of set B (node 2’s community) grows, the Jaccard coefficient stays the same while the overlap coefficient changes. Therefore, Jaccard can miss community changes in dynamic datasets. This example comes from an excellent blog discussing the differences and advantages of Jaccard versus Overlap.
Sørensen-Dice Coefficient
Like Jaccard, Sørensen considers both set sizes; however, it doubles the intersection size and does not use the size of the union. While Sørensen produces a zero-to-one score like the other two algorithms, even small intersections skew toward higher values.
While Sørensen is easily applied to graphs and neighbors, it is better known for use in semantic comparisons like:
- N-grams
- Fingerprint segments
- Genome distances
Again, going through the calculation based on figure 2:
|A ∩ B| = 5
|A| + |B| = 13
Sørensen(A, B) = 2|A ∩ B| / (|A| + |B|) = 10/13 ≈ 0.77
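And once more with the same illustrative sets, noting the doubled intersection in the numerator:

```python
# Illustrative neighbor sets matching the figure-2 counts
A = {2, 3, 4, 5, 6, 7, 8}   # Neighbors(node 1)
B = {1, 3, 4, 5, 6, 7}      # Neighbors(node 2)

sorensen = 2 * len(A & B) / (len(A) + len(B))   # 2|A ∩ B| / (|A| + |B|)
print(round(sorensen, 2))  # -> 0.77
```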
Unlike Overlap, the Sørensen value will be reduced by growth of either node’s neighborhood instead of just the smaller one.
Calculating the similarities using RAPIDS cuGraph
The following code sample shows how easy it is in cuGraph to compute the similarity scores.
import cugraph
from cugraph.datasets import dining_prefs

# Load the edge list and build the graph
gdf = dining_prefs.get_edgelist(download=True)
G = cugraph.from_cudf_edgelist(gdf, source='src', destination='dst')

# Compute all three pairwise similarity scores
jdf = cugraph.jaccard(G)
odf = cugraph.overlap(G)
sdf = cugraph.sorensen(G)

# Show the most similar pairs under each measure
print(jdf.sort_values(by='jaccard_coeff', ascending=False).head())
print(odf.sort_values(by='overlap_coeff', ascending=False).head())
print(sdf.sort_values(by='sorensen_coeff', ascending=False).head())
By default, running these algorithms compares every connected pair of nodes, which are referred to as 1-hop neighbors. While this gives a representative answer, it can produce more results than needed.
Doing the calculation on all adjacent pairs might be impossible due to the memory required to process extremely large graphs. Additionally, considering only adjacent nodes leaves out a valuable source of candidates: two-hop connections. In figure 2, nodes one and two are adjacent; however, their shared neighbors would have been missed had the two not been directly connected, as seen in figure 8. The highlighted two-hop relationships are often a great source of similarity between nodes that are not directly connected. Consider a communication graph where a person has two phones: the two numbers could share contacts, but logically wouldn’t call each other. Finding aliases in a graph is a common use of similarity.
The cuGraph implementations of these algorithms can compare two-hop connections, or any custom set of node pairs, via an optional second argument specifying the vertex pairs to examine. This is especially useful in graphs with two unconnected node types, called bipartite graphs. A classic bipartite example is a sales graph containing customers and products, with edges describing purchases. The code examples in figures 9 and 10 below extract all node pairs within two hops of each other and look for similarities among only those pairs.
The cuGraph get_two_hop_neighbors function returns all pairs within two hops of each other. An example of calling this on the dormitory data used later is shown in figure 10.
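To show the idea without a GPU, here is a plain-Python sketch of what a two-hop pair extraction computes: all unordered pairs of distinct nodes that share at least one neighbor. The toy adjacency dict is hypothetical; in cuGraph you would instead call G.get_two_hop_neighbors() and pass the resulting pairs as the vertex_pair argument of cugraph.jaccard.

```python
from itertools import combinations

# Hypothetical toy graph as an undirected adjacency dict
adj = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4},
    4: {2, 3, 5},
    5: {4},
}

def two_hop_pairs(adj):
    """All unordered pairs of distinct nodes joined by a length-two path,
    i.e. pairs that share at least one neighbor."""
    pairs = set()
    for nbrs in adj.values():
        # any two neighbors of the same node are two hops apart
        pairs.update(combinations(sorted(nbrs), 2))
    return pairs

print(sorted(two_hop_pairs(adj)))  # -> [(1, 4), (2, 3), (2, 5), (3, 5)]
```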
We will now use the familiar New York dormitory dining-preference data (from the intro blog) and apply similarity algorithms to answer several questions beyond each resident’s two favorite dining partners. Figure 12 highlights some of the similarity calculations from table 2 as they appear in the graph visualization.
Using the RAPIDS cuGraph similarity notebook, we calculate the similarity of each pair with the highest values shown in the table below.
In this small dataset, the rankings from all three algorithms are nearly identical, as shown in table 2. This will not normally be the case, but it does allow a general demonstration of what similarity reveals here.
How do the results reflect what we can infer from the diagram above?
Missing relationships — five residents in the graph have transitive similarities (Louise -> Lena -> Marion -> Frances -> Adele). This cluster is revealed by the similarity scores; it is not as obvious in the graph, since each resident’s connections are limited by the sampling technique. The chain of similarities points to a possible hierarchy or clique. In a larger graph, a set of similar nodes often points to a cohesive community within the graph.
Predictions in dynamic data — What happens if someone leaves the dorm? In this case, if Lena or Louise left the dorm, it is logical the other would inherit their preferences. In a larger graph, we could run the algorithms after manipulating the data to see if new similarities reveal changes in the groupings.
Making recommendations — Frances and Lena are similar to Marion but are not connected to each other. If Marion were to leave the dorm, either could be recommended as a partner in her place. A larger graph with higher edge density would provide abundant alternatives to recommend.
Wrap-up
Similarity algorithms are simple to understand but immensely useful in fields as disparate as community detection, recommender systems, and natural language processing (NLP). As shown, RAPIDS cuGraph lets you run them in just a few lines of Python code while leveraging the parallelism of GPU computing to scale across multiple GPUs, or even multiple nodes with multiple GPUs.
The next blog in this series will focus on centrality functionality in cuGraph. It will explore various measures of node and edge importance focusing on why they are central to graph analytics. There will be more code examples to quickly run any of those measures on even the largest datasets.
How to get started with cuGraph
- Setup and installation instructions. https://docs.rapids.ai/install
- Latest stable version of cuGraph docs — https://docs.rapids.ai/api/cugraph/stable/
- Using containers or Google Colab is the easiest path to get started with cuGraph. https://hub.docker.com/r/rapidsai/notebooks — the available container for running cuGraph notebooks.
References:
- The first blog in this graph analytics series: https://medium.com/rapids-ai/introduction-to-graph-analysis-using-cugraph-a9dc2fbc3c5e
- Calculations in this blog were done using this notebook: https://github.com/rapidsai/cugraph/blob/main/notebooks/algorithms/link_prediction/similarity_combined.ipynb
- Earlier NVIDIA blog comparing Jaccard and Overlap: https://medium.com/p/610e083b877d