Let’s learn A-Z of Knowledge Graphs: one step at a time | Part 4 - PageRank & more about graph traversal

Preeti Singh Chauhan
10 min read · Nov 17, 2022


Hello again! I hope you are having fun learning about Knowledge Graphs, bit by bit, in this series. If you have not gone through the previous stories, about the introduction to KGs and exploratory graph analytics, feel free to explore them first. Here is the link to the previous story -> sum total of exploratory graph analysis.

So, in this part, we are going to go through the following topics:

  1. What is Graph Traversal
  2. What is DFS
  3. What is BFS
  4. What is Random Walk
  5. What are the different Graph algorithms & how graph traversal is used in them
  6. What is PageRank
  7. A detailed explanation of how PageRank works

Spoiler alert: this story is going to be “theory-heavy”, a deep dive into the workings of the algorithms. In the next story, I’ll show the practical implementation of the algorithms.

Enough buildup, let’s get started…

What is Graph Traversal: In computer science, graph traversal (also known as graph search) refers to the process of visiting (checking and/or updating) each vertex in a graph. Such traversals are classified by the order in which the vertices are visited. Graph traversal is a basic operation used in almost every graph algorithm. For example, if we want to detect communities in a graph, we’ll take a walk around the graph (visit each node and edge), i.e. we’ll do a graph traversal once or multiple times to identify which nodes belong to which community.

There are 2 basic graph traversal algorithms that are used universally, even internally by many other algorithms (like PageRank):

  1. Depth-First Search
  2. Breadth-First Search

What is Breadth-First Search (BFS):

BFS is an algorithm for searching a tree or graph data structure for a node that satisfies a given property. It starts at the root and explores all nodes at the present depth before moving on to the nodes at the next depth level. Extra memory, usually a queue, is needed to keep track of the child nodes that were encountered but not yet explored.

Let’s see how it works:

source - Wikipedia

Consider the above graph with nodes (a, b, c, d, e, f, g, h). BFS, starting at node ‘a’, first covers the immediate neighbours of ‘a’, i.e. level 1: ‘b’ and ‘c’. It then visits all immediate neighbours of ‘b’ and, staying on level 2, all immediate neighbours of ‘c’. Once level 2 is finished, the only node on level 3, ‘h’, which is an immediate neighbour of ‘e’, is visited. Here our traversal ends. So the visit order will be:

a->b->c->d->e->f->g->h

A fixed-length BFS starting from a node, thus, explores the local neighbourhood/features first.

Just for fun, let’s implement BFS from scratch in Python. See the code below:
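Here is a minimal sketch, assuming the graph is stored as a plain adjacency-list dictionary (the `bfs` helper and the example dictionary are illustrative):

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal of `graph` (an adjacency-list dict),
    returning nodes in the order they are visited."""
    visited = [start]
    queue = deque([start])          # FIFO queue of discovered-but-unexplored nodes
    while queue:
        node = queue.popleft()      # explore the oldest discovered node first
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.append(neighbour)
                queue.append(neighbour)
    return visited

# The example graph from the figure above
graph = {
    'a': ['b', 'c'],
    'b': ['d', 'e'],
    'c': ['f', 'g'],
    'e': ['h'],
}
print(bfs(graph, 'a'))   # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```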

What is Depth-First Search (DFS):

Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root in the case of a graph) and explores as far as possible along each branch before backtracking. Extra memory, usually a stack, is needed to keep track of the nodes discovered so far along the current branch, which enables the backtracking.
Note: DFS explores the global features of a graph with respect to a node.

Let’s see how it works:

source - Wikipedia

Consider the above graph with nodes numbered 1 to 10. DFS, starting at node 1, first covers the whole depth of the graph in one direction, so the nodes covered will be 2, 3 and 4. Then, starting from the next neighbour of the root, i.e. 5, the whole depth in that direction is covered again, i.e. 6 and 7. Before moving to the next neighbour of the root, the depth under all of 5’s children is covered, i.e. 8 is visited. Then we move on to the next neighbour of the root, i.e. 9, and cover its entire depth, i.e. 10. Here our traversal ends. So the visit order will be:

1->2->3->4->5->6->7->8->9->10

A fixed-length DFS starting from a node, thus, explores the global neighbourhood/features first.

Again, for fun, let’s implement DFS from scratch in Python. See the code below:
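Here is a minimal recursive sketch under the same adjacency-list assumption; the recursion’s call stack plays the role of the explicit stack mentioned above:

```python
def dfs(graph, start, visited=None):
    """Recursive depth-first traversal of `graph` (an adjacency-list dict),
    returning nodes in the order they are visited."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbour in graph.get(start, []):
        if neighbour not in visited:
            dfs(graph, neighbour, visited)   # go as deep as possible before backtracking
    return visited

# The example graph from the figure above (nodes numbered 1 to 10)
graph = {
    1: [2, 5, 9],
    2: [3],
    3: [4],
    5: [6, 8],
    6: [7],
    9: [10],
}
print(dfs(graph, 1))   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```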

Graph walk:

Before we move ahead, let’s go over one of the important concepts that will be used in some of the algorithms: the graph walk. A walk can be defined as a sequence of edges and vertices of a graph. When we have a graph and traverse it, that traversal is known as a walk. In a walk, edges and vertices can be repeated. The number of edges covered in a walk is known as the length of the walk.

Random Walk:

More importantly, let’s focus on a type of walk called the Random Walk, and on where and how we use it in graphs.

Note: We’ll be using random walk in numerous graph algorithms and also in graph embeddings. Watch out for my upcoming stories for the same.

a random walk around the neighbourhood from your house

Let’s imagine I ask you to take a random walk of ‘n’ steps (you may take any route and pass by anyone’s house), starting from your house (u). You end up at some house n steps away from your own. All the houses visited and the roads taken along the way are considered the ‘neighbourhood’ of u, Nr(u). Also, if Vivian’s house (v) lies on your random walk (i.e. in your neighbourhood Nr(u)), there is a high probability that Vivian’s house is similar to your house (u ~ v), in terms of build, status or any other measure.

In graph language,

image by Jure Leskovec, Stanford
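(Roughly, in the notation of Leskovec’s slides: Nr(u) is the multiset of nodes visited on random walks, under some walking strategy R, starting from u; and two nodes u and v are considered similar if v ∈ Nr(u), i.e. if they co-occur on short random walks.)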

There are other variations of random walk, like:

  • Short-step random walks
  • Biased random walks

Keep this thought with you. We’ll be utilising this concept of random walks in upcoming stories.

What are Graph Algorithms: Graph algorithms are sets of instructions that traverse a graph (visit its nodes), with some specific end goal in mind. Some algorithms are used to find a specific node or the path between two given nodes. Others are used to find the communities in the graph. Below are some of the major categories of graph algorithms, which we are going to deep-dive into, in this story and upcoming ones:

  1. Link Analysis Algorithms — Pagerank etc.
  2. Community Detection Algorithms — Girvan-Newman etc.
  3. Shortest Path Algorithms — Dijkstra’s algorithm etc.
  4. Components algorithms — Weakly connected components(Union Find) etc.

PAGERANK:

Let’s pick our first algorithm, PageRank, and try to understand how it works.

Do you know how Google manages to show the most relevant search results at the top? Yes, you are right! It uses PAGERANK.

According to Google, “PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites”.

So, PageRank, as used by Google, treats the www as a very large network, with all the webpages as nodes and the hyperlinks between them as edges. A webpage’s relevance is measured by the number of incoming links that lead to it. Furthermore, the relevance increases if an incoming link comes from a well-known page like Wikipedia.

How can the web be represented? As a directed graph: the webpages are the nodes and the hyperlinks between them are the directed edges.

PAGERANK as “flow-model”:

So, if we consider PageRank as a flow model, the flow equation we’ll have to solve is:

r = M . r

where r is the rank vector and, in the matrix representation of the graph, M can be considered a stochastic adjacency matrix.
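Written per node, this is the standard flow formulation (a sketch): each page i splits its rank r_i evenly over its out-links, so

r_j = Σ(i → j) r_i / d_i

where the sum runs over all pages i that link to j, and d_i is the out-degree of page i.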

So, the PageRank matrix formulation will be something like this:
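A sketch of the standard construction: if page j has d_j out-links, set M_ij = 1/d_j whenever j links to i, and M_ij = 0 otherwise. Every column of M then sums to 1 (M is “column stochastic”), and the per-node flow equations above collapse into the single matrix equation r = M . r.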

Also, PageRank can be formulated as a random walk problem. Let’s see how:
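Sketch: imagine a random surfer who, at time t, sits on some page and, at time t+1, follows one of that page’s out-links uniformly at random. If p(t) is the probability distribution over pages at time t, then p(t+1) = M . p(t). When the walk reaches its stationary distribution, p = M . p, which is exactly the flow equation above, so the PageRank vector r is the stationary distribution of this random walk.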

PageRank can also be formulated as an eigenvector problem. Let’s have a look:

Do you remember the eigenvector equation for the adjacency matrix of a graph? For a matrix A with eigenvector x and eigenvalue λ, it reads λ · x = A · x.

Now this can be used in our flow equation as below:
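Since r = M . r can be read as 1 · r = M . r, the rank vector r is an eigenvector of the stochastic adjacency matrix M with eigenvalue 1; in fact it is the principal eigenvector, because for a column-stochastic matrix the largest eigenvalue is 1.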

Now we have seen that PageRank can have:

  • Matrix Formulation
  • Random Walk Distribution Formulation
  • Eigenvector Formulation

So, now, it’s easy to solve the equation:

r = M . r

by a method called “Power Iteration”.

Now, let’s summarise what we have learnt so far about PageRank:

  • PageRank uses the link structure of the graph to measure the importance of the nodes.
  • PageRank models a random web surfer using the stochastic adjacency matrix M.
  • PageRank solves the equation r = M . r, where r can be viewed both as the principal eigenvector of M and as the stationary distribution of a random walk over the graph.
  • PageRank solves the equation r = M . r with a method called power iteration.

What is the Power Iteration method:

I am not going into great detail; here is the idea of the method:
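A sketch of the standard scheme: initialise r(0) = [1/N, …, 1/N]; repeatedly apply r(t+1) = M . r(t); stop when |r(t+1) − r(t)| < ε for some small tolerance ε (using, say, the L1 norm).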

Generally, about 50 iterations are enough to reach the stationary distribution or limiting solution.

Just to make things clearer, let’s see this calculation with an example. Consider the 3-node graph shown below. To solve the power equation, as a first step initialise all the values to 1/3. Iterate the power calculation a few times and you’ll see that r converges to [6/15, 6/15, 3/15]. So, after n iterations, the equation converges and you get your final rank vector r.
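To make it concrete, here is a minimal power-iteration sketch in Python. The function name and the matrix M are my own illustration; M is assumed to match a 3-node example with this limit (as in Leskovec’s slides: node y links to y and a, a links to y and m, m links to a):

```python
import numpy as np

def power_iteration(M, eps=1e-9, max_iter=100):
    """Iterate r <- M r from the uniform vector until r stops changing."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                  # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r                       # one power-iteration step
        if np.abs(r_next - r).sum() < eps:   # L1 convergence check
            return r_next
        r = r_next
    return r

# Column-stochastic M for the assumed 3-node example (columns: y, a, m)
M = np.array([
    [0.5, 0.5, 0.0],   # y receives from y (1/2) and a (1/2)
    [0.5, 0.0, 1.0],   # a receives from y (1/2) and m (1/1)
    [0.0, 0.5, 0.0],   # m receives from a (1/2)
])
print(power_iteration(M))   # ≈ [0.4, 0.4, 0.2] = [6/15, 6/15, 3/15]
```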

I hope this is clear now!

So, for PageRank to work perfectly, the equation (r = M . r) should:

  • converge
  • converge to what we want
  • give reasonable results

In reality, these conditions are not all met in many cases. There are basically 2 problems with PageRank:

  1. Dead-ends: Some pages are dead ends (they have no out-links), which causes importance to ‘leak out’. Dead ends are a mathematical problem, because they make the matrix non-column-stochastic, so the initial assumption is not met.
  2. Spider-traps: All out-links stay within a group of pages. Eventually, spider traps absorb all the importance. This is not a mathematical problem, though; the eigenvector still converges, but the PageRank scores are not what we want.

How do we solve these problems? The answer is: by a method called ‘Random Teleports’.

What are random teleports: Whenever the random surfer gets stuck in a spider-trap or a dead-end, it moves (teleports) randomly to another node. Here is how random teleports get us out of the above-mentioned problems:

  • Spider-traps: the surfer never gets stuck in a spider trap, because it teleports out of it within a finite number of steps
  • Dead-ends: the matrix is made column stochastic by always teleporting when there is nowhere to go

So, how does Google’s PageRank equation change by introducing teleportation?
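A sketch of the standard formulation: with probability β the surfer follows a random out-link, and with probability 1 − β it teleports to a page chosen uniformly at random, giving

r_j = Σ(i → j) β · r_i / d_i + (1 − β) · 1/N

or, in matrix form, r = A . r with A = β · M + (1 − β) · [1/N](N×N). (This form assumes M has no dead ends; from a dead end, the surfer teleports with probability 1.)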

This is still solvable by Power Iteration.

Note: What is beta (β)? In practice, β = 0.8 to 0.9 (i.e. the surfer makes about 5 steps on average, then jumps).

Let’s see an example of solving the PageRank equation using random teleports.

Here is the initial graph with 3 nodes and its stochastic matrix M. You can notice the dead-ends and spider-traps.

We add extra edges (in green) for teleportation. Here beta = 0.8 and the initial r values are [1/3, 1/3, 1/3]. The modified equation can be seen below:
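Assuming this is the 3-node example from Leskovec’s slides (y links to y and a, a links to y and m, and m links only to itself, forming a spider trap), the modified matrix is

A = 0.8 · M + 0.2 · [1/3](3×3)

and iterating r ← A . r from [1/3, 1/3, 1/3] indeed converges to [7/33, 5/33, 21/33].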

So, finally, after a few iterations, the equation converges and we get our rank vector r = [7/33, 5/33, 21/33]

Lastly, here is one final example of PageRank calculated on a network, and how to interpret it.

image - Wikipedia

So, now that you know all about PageRank, it is easy to understand the PageRank scores of the nodes in the above image. B (in-degree 6) has the highest PageRank, since it has a high in-degree and is also linked from important nodes. In comparison, E has a high in-degree as well, but since it is not linked from important nodes, its PageRank is not as high as B’s. Also, if you look at C, its in-degree is not very high, yet since it is linked from the most important node in the network, its PageRank score is very high.

PageRank, along with being very interesting mathematically, is also very easy and efficient to use on large networks.

There are other variations of PageRank, like Personalized PageRank and PageRank using random walks with restarts. For now, I am not elaborating on these; in future, I might write a detailed story on them.

Conclusion:

I hope you enjoyed the deep dive here. If you want me to elaborate on any of the topics mentioned here, please say so in the comments. In the upcoming stories, we’ll take a few real-world use cases and try to implement these concepts using Python. See you in the next story. Till then, cheers to learning Knowledge Graphs and related concepts!

Note: many of the images are taken from the lectures by Jure Leskovec, Stanford.
