Understanding Graph based similarity: Simrank, Simrank++

Kushal Dave
4 min readNov 23, 2022

--

Courtesy: Unsplash Gordon Mak (@gcmac)

Simrank and its family of algorithms can be used to obtain node similarity between pairs of nodes in the graph.

The problem of finding similarity is ubiquitously present in many web scale applications like recommender systems (finding user-user and item-item similarity), information retrieval system (query-query and document-document similarity) or expanding your training data for your favourite machine learning models.

Most of these applications have an inherent graph structure. Some cases like recommender systems, information retrieval have bipartite graphs as a part of logs/training data available (user-item data can be seen as a bipartite graph).

Simrank is not restricted to work only on bipartite graphs. Simrank can be used to find node similarity from different types of graphs.

Graph based similarity is an unsupervised learning algorithm (one which does not require labelled data) which utilizes the connections in the graph to find node similarity for all nodes in the graph.

Simrank works on a simple idea:

Two nodes in a graph are similar if they have are related (connected) to similar nodes in the graph.

In the context of recommender systems:

  • People are similar if they purchase similar items.
  • Items are similar if they are purchased by similar people.

Simrank: For any two nodes a and b in the graph, let I(a) define the set of neighbors of node a and I(b) define the set of neighbors of node b. Based on the idea mentioned above: Similarity of two nodes in the graph is computed by how similar their neighbors/connections are. The below formula gets similarity between all pairs of neighbors of a and b.

The similarity is normalized by the total number of pair combinations of neighbors of node a and node b. C is a constant between 0 and 1 and signifies the confidence on this similarity score. Typically C is set to 0.8 for all algorithms.

It should be noted that Simrank is an iterative algorithm, where the similarity between the neighbors is initialized to 1 if a connection exists between a and b. Else it is set to 0.

For iteration 1, we take the initial values and compute new similarity values. For iteration 2, we take the similarity values computed in iteration 1 and so on. A typical stopping criteria is to check the average change in similarity value and put a threshold on it.

The extension to Simrank is to apply it on a weighted graph, where the similarity is initialized based on the weights on the edges.

Simrank++: Simrank does not take into account the number of common neighbors does two nodes have. Intuitively, higher the number of common neighbors, more is the similarity between the nodes.

Let us consider the below sample graph:

Image courtesy: Simrank++ Paper

Based on Simrank’s iterative algorithm:

  • Actual: Simrank(camera, digital camera) less than Simrank(pc, camera)
  • Expected: Simrank(camera, digital camera) greater than Simrank(pc, camera)

To fix this, Simrank++ incorporates another term in the Simrank calculation called evidence. The idea is to come up with a function which is an increasing function based on the number of common neighbors.

Based on the evidence, we can see that the new Simrank values for our above example is as expected:

Simrank++ also incorporates weights on the edges better than Simrank but I will skip that part in this post for now.

For curious readers: Here is the paper link to explore more

  • SimRank: A Measure of Structural-Context Similarity. Paper
  • Simrank++: Query Rewriting through Link Analysis of the Click Graph. Paper

--

--

Kushal Dave

Principal Applied Scientist @ Microsoft, working on Information Retrieval, Recommender Systems using DL techniques.