
Demystifying PageRank & Variants


Understanding different edge cases of PageRank & personalization techniques

PageRank (PR) is an algorithm used by Google to rank websites in their search engine results. It was named after Larry Page, one of the founders of Google. The underlying assumption is that more important websites are likely to receive more links from other websites. It is one of the most important applications of Linear Algebra.

In this post, we will first understand the maths behind the PageRank algorithm, then go through the edge cases of PageRank and their possible solutions. In the end, I will also introduce the concept of Personalized PageRank.

Understanding the Maths Behind PageRank

To model the activity of the random Web surfer, the PageRank algorithm represents the link structure of the Web as a directed graph. Webpages are nodes of the graph, and links from webpages to other webpages are edges that show the direction of movement. The idea is simply that a page’s importance score is measured by how many links reference it and by how important those referencing pages are.

The graph is then converted into a modified version of the adjacency matrix (A). The first thing to understand about this matrix is that if a page references many other pages, then the importance it contributes is distributed uniformly among the pages it links to. Consider the simple graph shown below:

[Figure: Example 1]

The matrix A for this graph is formed from the link structure, with each column normalized by the number of outgoing links of the corresponding node.

Let π(i) be the PageRank of node “i”. Then π(i) is the sum of the ranks of all the nodes pointing to i, each divided by the number of that node’s outgoing links:

                 π(i) = Σ π(j) / d(j)   over all nodes j that link to i

where d(j) is the number of outgoing links of node j.

Hence, the simplest form of PageRank can be represented by the following equation, where A is called the transition matrix (similar to a Markov chain transition matrix) and π is the ranking vector.

                         π = A * π      --- Eq (1)

To get the ranking vector π, we first assume an initial vector of equal rankings normalized by the number of nodes (here, π is [1/4, 1/4, 1/4, 1/4]) and multiply it by A iteratively. In an ideal scenario, this product finally converges to an equilibrium vector π* which will be our PageRank.

Some questions that are often asked in the context of PageRank are:

  • Under what circumstances or properties of A is it guaranteed to converge?
  • Will it converge to just one vector or multiple vectors?
  • Does the convergence depend on the starting vector π?

In Example 1, if we perform this iterative multiplication, we reach an equilibrium vector π* given by [0.38, 0.12, 0.29, 0.19].
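
To make this concrete, here is a minimal sketch of the iteration in numpy. The matrix A below is an assumption, reconstructed from Example 1 of the Cornell lecture cited in the references (node 1 links to nodes 2, 3, and 4; node 2 links to 3 and 4; node 3 links to 1; node 4 links to 1 and 3):

import numpy as np

# Column-stochastic transition matrix for Example 1: entry A[i, j] is
# 1/outdegree(j) if node j links to node i, and 0 otherwise.
A = np.array([[0,   0,   1, 1/2],
              [1/3, 0,   0, 0  ],
              [1/3, 1/2, 0, 1/2],
              [1/3, 1/2, 0, 0  ]])

pi = np.full(4, 1/4)  # start from the uniform ranking
for _ in range(100):  # iterate pi <- A @ pi until it settles
    pi = A @ pi

print(np.round(pi, 3))  # [0.387 0.129 0.29  0.194]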

Convergence is not guaranteed for an arbitrary matrix A. There are some special conditions that A must satisfy for convergence: stochasticity, irreducibility, and aperiodicity. If all these conditions are satisfied, then a stationary vector (which is the PageRank vector in this case) exists, is unique, and can be found by simple power iteration irrespective of the initial ranking (as long as the elements of the initial ranking are positive and sum to 1).

However, in practice A usually does not satisfy these conditions. The failure cases are explained below, together with how Google solves each of them.

Edge Cases

1. Nodes with no outgoing edges (Dangling Nodes)

[Figure: Example 2]

Here node 3 does not have any outgoing links. For this case, the column of the transition matrix A corresponding to node 3 contains only zeros.

This matrix is not stochastic, because the sum of the third column is 0. Hence, if we multiply x = [1/3, 1/3, 1/3] by A iteratively, we get the zero vector after two iterations, which does not make sense: ideally, the PageRank vector should contain all positive values.
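
A short sketch of this leak, assuming for concreteness that nodes 1 and 2 each link only to the dangling node 3 (the exact edges in Example 2 may differ, but any transition matrix with an all-zero column behaves the same way):

import numpy as np

# Node 3 has no outgoing links, so its column is all zeros and the
# matrix is not column-stochastic.
A = np.array([[0, 0, 0],
              [0, 0, 0],
              [1, 1, 0]])

x = np.full(3, 1/3)
for step in (1, 2):
    x = A @ x
    print(step, x)  # the importance drains out: step 2 is the zero vector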

2. Disconnected components

[Figure: Example 3]

The transition matrix for this graph is block-diagonal: there is no entry linking one component to the other.

This matrix is not irreducible, because the directed graph is not strongly connected. If we multiply x = [1/5, 1/5, 1/5, 1/5, 1/5] by A iteratively, we get the same vector x back. With this simple form of PageRank, we can only tell the relative importance of nodes within a component; it tells us nothing about how the different components compare, as the sketch below shows.
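
The sketch below assumes a concrete split: nodes 1 and 2 link to each other, and nodes 3, 4, and 5 form a directed cycle. The resulting matrix is block-diagonal, and no amount of iteration moves importance between the two blocks:

import numpy as np

# Two disconnected components, {1, 2} and {3, 4, 5}: no edge crosses
# between the blocks, so the matrix is reducible.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0]])

x = np.full(5, 1/5)
print(A @ (A @ x))  # still [0.2 0.2 0.2 0.2 0.2]: the uniform vector is stuck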

3. Problem of Cycles

[Figure: Example 4]

The transition matrix for this graph (node 1 links only to node 2, and node 2 links only back to node 1) is:

                 A = | 0  1 |
                     | 1  0 |

If we multiply x = (1, 0) by A iteratively, we get (1, 0) back after even powers of A and (0, 1) after odd powers. Intuitively, the surfer is stuck in an infinite loop, hopping between these two nodes forever. Hence there is no unique solution.
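
This is easy to verify directly; both the matrix and the oscillation follow from the description above:

import numpy as np

# Two nodes pointing at each other: a cycle of period 2.
A = np.array([[0, 1],
              [1, 0]])

x = np.array([1.0, 0.0])
for step in (1, 2, 3, 4):
    x = A @ x
    print(step, x)  # alternates [0. 1.], [1. 0.], [0. 1.], [1. 0.]: no limit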

Solution — Google Matrix

To solve the above problems, Google introduced a damping factor ‘α’ that gives the surfer a non-zero probability of moving from any node to every other node. The equation for the new matrix M is given by:

                  M = α*W + (1 − α)*R     --- Eq (2)

Here, W is our original transition matrix representing the random web surfer, and R is the matrix in which every entry equals 1/n, where n is the number of nodes. The damping factor ‘α’ lies between 0 and 1, and its value is generally taken to be 0.85.

So, how does this simple change to the previous equation solve our problems?

  • With the introduction of R, the new matrix M becomes stochastic: every column sums up to 1. Since every entry of M is also positive, M is irreducible and aperiodic as well, which removes the cycling behaviour of Example 4 and ensures convergence to a unique positive ranking
  • The intuition behind R is that in every step, with probability 1 − α, the random surfer “gets bored” by the current website and surfs to a new random site

It is important to pick a value of ‘α’ close to 1 so that we do not stray too far from the original link structure of the graph. The value of ‘α’ also affects the rate at which the PageRank vector converges.

The rate of convergence of the power method (see the Python implementation below) is the rate at which α^k → 0. As α approaches 1, the number of iterations required for convergence increases. Hence it is important to choose a value of α that balances speed and accuracy, as the short experiment below shows.
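
A rough way to see this trade-off is to count power-method iterations for different damping factors. The sketch below reuses the two-node cycle from Example 4, where the error happens to shrink by exactly a factor of α per step:

import numpy as np

# Two-node cycle: W has eigenvalues +1 and -1, so the subdominant
# eigenvalue of M is -alpha and the error decays like alpha^k.
W = np.array([[0, 1],
              [1, 0]])
n = W.shape[0]

for alpha in (0.5, 0.85, 0.99):
    M = alpha * W + (1 - alpha) / n * np.ones((n, n))
    rank = np.array([1.0, 0.0])  # deliberately far from the answer
    prev = np.zeros(n)
    steps = 0
    while np.linalg.norm(rank - prev) > 1e-10:
        prev, rank = rank, M @ rank
        steps += 1
    print(alpha, steps)  # the iteration count grows sharply as alpha -> 1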

Python Implementation

import numpy as np

def PageRank(A, alpha=0.85):
    n = A.shape[0]
    # patch dangling nodes: replace all-zero columns with uniform
    # columns so that A is column-stochastic
    A = np.where(A.sum(axis=0) == 0, 1 / n, A)
    # damping: blend A with the uniform jump matrix R, as in Eq (2)
    M = alpha * A + (1 - alpha) / n * np.ones((n, n))
    # power iteration, starting from the uniform ranking
    prev = np.zeros(n)
    rank = prev + 1 / n
    while (rank - prev) @ (rank - prev) > 1e-10:
        prev = rank
        rank = M @ rank
    return rank
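
As a quick sanity check, the damped matrix handles the cycle from Example 4, where the plain power iteration oscillated forever:

A = np.array([[0., 1.],
              [1., 0.]])
print(PageRank(A))  # [0.5 0.5]: both nodes equally important, no oscillation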

Personalized PageRank

PageRank is one of the most popular algorithms of the modern era, but with the ever-growing size of the Web there has been a rising demand for personalized ranking. In an ideal scenario, each user would define their own notion of the importance of web pages. According to a Stanford paper, “While in principle a personalized version of the PageRank algorithm can achieve this task, its naive implementation requires computing resources far beyond the realm of feasibility which increases the motivation to look for efficient computation of personalized variants of PageRank.”

One simple method of introducing personalization is to change the R matrix in Eq (2). Instead of giving equal importance to all web pages, we can bias the random jump towards a specific page or set of pages according to the user’s personal preference. The new equation becomes:

                x = α*W*x + (1 − α)*E     --- Eq (3)

Here, the vector E holds the personalized distribution, and W is still the same: our original transition matrix representing the random web surfer.

In normal PageRank, the user jumps to any other node with equal probability. The only difference in personalized PageRank (PPR) is that the jump (often called teleportation) is always performed back to some specific pre-selected nodes (called source nodes), according to the distribution of the E vector. For example, setting E = (1, 0, 0, 0) is the special case of PPR where the jump is always made back to node 1, as the sketch below shows.
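
Below is a minimal sketch of this variant, adapting the power iteration to Eq (3). The transition matrix W is the (assumed) Example 1 matrix from earlier, and E teleports back to node 1, matching the special case above:

import numpy as np

def PersonalizedPageRank(W, E, alpha=0.85):
    # power iteration for x = alpha * W @ x + (1 - alpha) * E, where E
    # is the personalized teleport distribution (entries sum to 1)
    n = W.shape[0]
    x = np.full(n, 1 / n)
    for _ in range(100):
        x = alpha * (W @ x) + (1 - alpha) * E
    return x

W = np.array([[0,   0,   1, 1/2],
              [1/3, 0,   0, 0  ],
              [1/3, 1/2, 0, 1/2],
              [1/3, 1/2, 0, 0  ]])
E = np.array([1.0, 0.0, 0.0, 0.0])  # every jump returns to node 1
print(np.round(PersonalizedPageRank(W, E), 3))  # [0.442 0.125 0.254 0.178]

Compared with the plain PageRank of [0.38, 0.12, 0.29, 0.19], node 1’s score is boosted because every teleport lands on it.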

To choose the vector E, we can draw on multiple resources: the user’s search history, attributes such as location, their social network, and so on.

To Be Continued…..

About Me: I graduated from USF with a Master’s in Data Science and hold a B.Tech. from IIT Bombay. I am currently working at Uber as a Data Scientist.
LinkedIn: www.linkedin.com/in/alvira-swalin

References

  1. USF Lecture by David Uminsky & Xuemei Chen
  2. Personalized Page Rank
  3. http://pi.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
  4. http://geza.kzoo.edu/~erdi/patent/langvillebook.pdf
  5. http://meyer.math.ncsu.edu/Meyer/PS_Files/DeeperInsidePR.pdf
