A Handwritten Introduction to PageRank

Zakarie A · Published in The Startup · Sep 27, 2020

Google owes a great part of its success to the algorithm it originally used to rank webpages. This algorithm, PageRank, sorts all the pages of a network according to their popularity. The more webpages link to a page A, the higher its score is and the more likely it is to appear among the top results of a search. PageRank is commonly used as an example in linear algebra courses, as it is a good illustration of the applications of eigenvalues and eigenvectors. This article aims to build an intuitive understanding of PageRank and shows the maths and the code behind it.

It assumes basic knowledge of linear algebra, including manipulating matrices and vectors. Prior knowledge of eigenvalues and eigenvectors will be helpful, albeit not indispensable. We will illustrate the algorithm with a pure implementation in Python and Numpy as well as with the library NetworkX.

Let’s start by setting up a tiny internet. It consists of six webpages which contain links to other pages or themselves. We can represent this network as a directed graph, like in the messy drawing below.

Gamma.

We will represent this graph with an adjacency matrix where the value at row r and column c is the probability of landing on the webpage r given that you have just followed a link from the webpage c. For example, page A links to pages B, D and F. Since there are three destinations, we normalise our vector by a factor of 1/3, which gives us:
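The vector itself was shown as an image in the original article; assuming the six pages are ordered from A to F, it presumably reads:

    \begin{pmatrix} 0 & 1/3 & 0 & 1/3 & 0 & 1/3 \end{pmatrix}^{T}

with a 1/3 in the rows corresponding to B, D and F, and a 0 everywhere else.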

We called this vector L[A], which stands for Link from page A.

By applying the same process to all our pages, we get the link matrix shown below.

Although you may find implementations that behave differently, our link matrix ignores self-referencing links. We don’t consider multiple references to the same target either.
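The article enters this matrix directly; as a minimal sketch of the process described above (the function name and the 0/1 adjacency convention are my own, not from the article), it could be built programmatically like this:

import numpy as np

def build_link_matrix(adjacency: np.ndarray) -> np.ndarray:
    # adjacency[r, c] == 1 when page c links to page r (columns are sources).
    A = adjacency.astype(float)
    # Ignore self-referencing links, as described above.
    np.fill_diagonal(A, 0.0)
    # Normalise each column so that it sums to 1.
    column_sums = A.sum(axis=0)
    column_sums[column_sums == 0] = 1.0  # avoid dividing by zero for pages with no links
    return A / column_sums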

Before we jump into a more abstract description of PageRank, this extract from the original paper published in 1998 (link in the footnotes) gives a good sense of what the algorithm does:

Intuitively, this can be thought of as modeling the behavior of a “random surfer”. The “random surfer” simply keeps clicking on successive links at random.

The idea behind PageRank is to give each page a popularity score, which we call a rank. If the surfers often land on some webpage, it means that many other pages recommend it, and, therefore, that it may be good. The goal of the algorithm is to find the probability that the random surfers end up on each page as they stroll around the internet.

The n-th row of the matrix indicates all the pages that contain a reference to the n-th page, together with the probability of reaching it from each of them. By summing all the values on each row, we could determine a first popularity score. But we also need to take into account the popularity of the page which contains the reference, as it is better to be mentioned by a popular page than by one which is rarely visited. We can translate this description into an iterative formula where the popularity of page A depends on each reference that links to it, weighted by the popularity of the page that contains the reference:
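The formula itself appeared as an image in the original article; reconstructed from the description above, with r_X denoting the popularity of page X, it presumably reads:

    r_A = \sum_{j=1}^{n} L_{A,j} \, r_j

where n is the number of pages (six in our example) and L_{A,j} is the entry of the link matrix in row A and column j.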

We can generalise and simplify the expression above by writing it as the product of the link matrix L with a column vector R which contains the popularity score of each webpage:
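Again, the original showed the expression as an image; in matrix form it presumably reads:

    R^{(i+1)} = L \, R^{(i)}

so that a converged ranking R satisfies R = L R.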

We evaluate this expression repeatedly until it converges, i.e. until the popularity of the webpages no longer changes when the vector is multiplied by the link matrix.
In other words, this operation boils down to finding an eigenvector of the link matrix with eigenvalue 1. Given the size of the matrix, which can quickly become extremely large, this iterative scheme, known as the power method, is one of the most efficient ways of solving the problem.

Let’s take a look at a simple implementation in Python:

import numpy as np

def rank_websites(links: np.ndarray, threshold: float = 1e-4):
    size = links.shape[0]
    # Start with a uniformly distributed popularity vector.
    initial = np.ones(size) * 1/size
    previous_ranking: np.ndarray = initial
    iterations = 1
    popularity = links @ initial

    # Repeat until the ranking stops changing (up to the given tolerance).
    while np.linalg.norm(previous_ranking - popularity) > threshold:
        previous_ranking = popularity
        popularity = links @ popularity
        iterations += 1

    return popularity, iterations

The function takes in two parameters: the link matrix as a Numpy array and a threshold, set to 1×10^{-4} by default. The threshold corresponds to the tolerance when checking if the popularity has converged.

We start by creating our initial popularity vector, where the popularity of each page is uniformly distributed.

We also need to store the previous popularity vector, which we will use to check if it has converged.

We can now iterate while the norm of the difference between the previous and the current ranking vector is greater than the threshold.
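As a usage sketch (assuming the link matrix from the drawing above has been entered as a 6×6 Numpy array called L):

    # Hypothetical call; L is the link matrix as a Numpy array.
    popularity, iterations = rank_websites(L)
    print(iterations, 100 * popularity)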

If we evaluate this function with our link matrix L, we get the following output (I multiplied it by 100 to make it more legible) after 35 iterations:

[29.12342755 22.33189164 11.64928625 9.71001964 13.98001348 13.20536144]

Keep in mind that our method only gives an approximation of the eigenvectors of eigenvalue 1 of the link matrix. Since the network is small enough, we can compare our results with the ones returned by the eigenvector calculator of Numpy:

eigenvalues, eigenvectors = np.linalg.eig(L)
order = np.absolute(eigenvalues).argsort()[::-1]
eigenvectors = eigenvectors[:, order]
r = eigenvectors[:, 0]
normalised_r = r / np.sum(r)
print("using numpy eig function, r =", 100 * normalised_r)

This code starts by fetching the eigenvalues and eigenvectors of the link matrix. We then compute the order of the indices based on the absolute value of their eigenvalue (line 2), which we use to sort the eigenvectors by decreasing eigenvalue (line 3). We extract the eigenvector with the highest eigenvalue in line 4, which we normalise to make it stochastic. This gives us the following result:

[29.12621359-0.j 22.33009709-0.j 11.65048544-0.j  9.70873786-0.j
13.98058252-0.j 13.2038835 -0.j]

The Damping Factor

The previous implementation worked perfectly fine with a simple example. But we don’t expect all networks to be as well-behaved: if we consider more examples and some special cases, we may stumble upon some unexpected results. Let’s take the following graph, Delta, as an example:

Delta.

The network is separated into two sub-networks, which we will call (i) and (ii), connected by a one-way link. Therefore, if the random surfers start their quest in (i), the probability that they end up in (ii) is not zero.

However, once they reach it, they cannot go back to (i). If we build the link matrix and compute its main eigenvector, we get the following results:

[-3.58205484e-01, 0.00000000e+00, 7.61186653e-01, 2.23878427e-02, 4.06650495e+14, -8.13300990e+14, 1.21995149e+15, -8.13300990e+14]

As we could expect, the pages in (ii) have a much higher rank than those in (i). A common example of such a sub-network is Wikipedia: a lot of pages contain a reference to it, but Wikipedia pages only contain internal links (external links are “no-follow links”, and therefore are not registered by search engines).

The problem with this implementation of PageRank is that it assumes that once you have landed on any page of Wikipedia, you keep browsing its pages without ever going back or teleporting to another website by typing its URL. Obviously, this does not reflect the real behaviour of a web surfer, and the results it gives are not realistic.

A more extreme example of this problem arises when the network consists of several completely disconnected sub-graphs, like in the example below.

Epsilon.

Since there are no links between the two sub-networks that make up Epsilon, it is impossible to get a global popularity ranking.

If we try to use our algorithm to compute the PageRank of this network, here is the r vector after each of the first 10 iterations:

Initial [0.2 0.2 0.2 0.2 0.2]
1 [0. 0.2 0.4 0.2 0.2]
2 [0. 0.4 0.2 0.2 0.2]
3 [0. 0.2 0.4 0.2 0.2]
4 [0. 0.4 0.2 0.2 0.2]
5 [0. 0.2 0.4 0.2 0.2]
6 [0. 0.4 0.2 0.2 0.2]
7 [0. 0.2 0.4 0.2 0.2]
8 [0. 0.4 0.2 0.2 0.2]
9 [0. 0.2 0.4 0.2 0.2]
10 [0. 0.4 0.2 0.2 0.2]

The rank of pages B and C oscillates between 0.2 and 0.4 without settling. If we let it run for longer, it keeps iterating (more than a million times) without ever converging.

We will use Numpy's eigenvector calculator to try to understand the reason behind this behaviour.

eigenvalues: [ 1. -1.  1. -1.  0.]
first eigenvector of eigenvalue 1: [0. 0. 0. 0.70710678 0.70710678]
second eigenvector of eigenvalue 1: [0. 0.70710678 0.70710678 0. 0. ]

The problem with this network is that its link matrix has multiple eigenvectors of eigenvalue 1: the first vector gives the rank of the sub-network {D, E} without considering the other one, and the second vector gives the rank of the sub-network {A, B, C}. The eigenvalues of -1 also explain the oscillation we observed: the components along their eigenvectors change sign at every iteration, so the power method keeps flipping between two states instead of settling.

These two examples explain why we need to add an extra parameter to our formula: the damping factor. It is described as follows in the original paper:

[…] if a real Web surfer ever gets into a small loop of web pages, it is unlikely that the surfer will continue in the loop forever. Instead, the surfer will jump to some other page. The additional factor E can be viewed as a way of modeling this behavior: the surfer periodically “gets bored” and jumps to a random page chosen based on the distribution in E.

The damping factor (which we will call d) corresponds to the probability that the random surfer follows a link on the page they are currently visiting. Conversely, 1 - d is the probability that they type a URL in the search bar and land on a random website.

When using the damping factor, we try to compute the eigenvector of eigenvalue 1 of a new matrix — we will call it L_hat (couldn’t find a way to put a circumflex on a consonant on my keyboard…) — which is an “improved” version of the link matrix L. It is given in terms of d and L by the formula below:
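The formula itself was an image in the original article; judging from the code change below, it is the standard damped matrix:

    \hat{L} = d \, L + \frac{1 - d}{n} \, J

where J is the n×n matrix filled with 1's and n is the number of pages.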

This method helps fix the problems we encountered by removing all zero entries from the link matrix: every page can now be reached from any other one with a small probability.

To update our Python implementation, all we need is to add a new parameter d and write the following line after initialising the size variable:

    links = (d * links + (1 - d) * 1/size)

Note that the matrix containing 1's everywhere is necessary in the mathematical statement, because adding a scalar to a matrix is not defined. With Numpy arrays, however, the operation is implicit thanks to broadcasting: the scalar (1 - d)/size is added to every entry of the matrix.
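Putting the two changes together, the updated function could look like the sketch below (the default value of 0.85 for d is my choice, matching the value used in the tests that follow):

import numpy as np

def rank_websites(links: np.ndarray, d: float = 0.85, threshold: float = 1e-4):
    size = links.shape[0]
    # Damped link matrix: follow a link with probability d,
    # jump to a random page with probability 1 - d.
    links = d * links + (1 - d) * 1/size
    initial = np.ones(size) * 1/size
    previous_ranking = initial
    iterations = 1
    popularity = links @ initial

    while np.linalg.norm(previous_ranking - popularity) > threshold:
        previous_ranking = popularity
        popularity = links @ popularity
        iterations += 1

    return popularity, iterations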

Let’s now test our improved implementation with the graph Delta. Instead of getting huge ranks in the magnitude of 10^{14} for the pages in the second sub-network and minuscule ones orbiting around 10^{-1} for the pages in the first network, we obtain much more realistic results when using a damping factor of 0.85:

[0.1011186, 0.10702601, 0.07014488, 0.07014488, 0.10941522, 0.15981798, 0.22251445, 0.15981798]

Overall, the sub-network (ii) remains more popular than (i). But all ranks now have the same order of magnitude, which matches our intuition and prevents pages with no outbound links from absorbing all the traffic.

The damping factor also enables our random surfers to teleport through Epsilon. Therefore, PageRank now yields a consistent distribution across the network:

[0.03 0.2918891 0.2781109 0.2 0.2 ]

Page A has a non-zero rank, even though nobody recommends it, and all entries add up to 1. This means that they correspond to the probability of landing on each page with respect to the entire network instead of local sub-graphs.

Using NetworkX

Now that we know how PageRank works, we can simply use its implementation in NetworkX, a Python library for creating and manipulating graphs. Here is how you can do it.

1. Installing networkx.

$ pip install networkx

2. Building the graph. There are several ways of doing it, one of the most useful being the function from_numpy_matrix. It takes in a Numpy adjacency matrix (the link matrix) and returns the graph:

import networkx as nx

internet = nx.from_numpy_matrix(L)
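Note that from_numpy_matrix builds an undirected graph by default and treats rows as link sources, whereas our link matrix uses columns as sources. If you want to preserve the directions of the links, an adjustment like the one below (my own, not from the original article) should be closer to the intended behaviour:

# Transpose so that rows become the sources, and ask for a directed graph.
internet = nx.from_numpy_matrix(L.T, create_using=nx.DiGraph)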

3. Computing the page rank. Here again, there exist multiple functions that do the work differently. The two main ones are pagerank, which iterates a given number of times and raises an exception if the network does not converge; and pagerank_numpy, which uses Numpy to calculate the eigenvector of the adjacency matrix of the graph.

nx.pagerank(internet, alpha=0.85, max_iter=100)
nx.pagerank_numpy(internet, alpha=0.85)

The example above shows a basic usage of these two functions. The parameter alpha is the damping factor. For more information, refer to the NetworkX documentation of both functions.
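Both functions return a dictionary mapping each node to its rank, so a quick way to inspect the result (assuming the internet graph built above) might be:

ranks = nx.pagerank(internet, alpha=0.85, max_iter=100)
# Sort the pages from most to least popular.
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))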

Resources:

Original paper: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

Wikipedia: https://en.wikipedia.org/wiki/PageRank

Video from ICL’s Linear Algebra for Machine Learning course on Coursera: “M4ML — Linear Algebra — 5.7 Introduction to PageRank” (YouTube).

YouTube playlist by Global Software Support: “Google’s PageRank Algorithm”.
