What is PageRank? Why does it matter? How is it calculated?

Polo Chau
Polo Club of Data Science | Georgia Tech
10 min read · Nov 12, 2023


Why PageRank?

PageRank was created by Google’s founders Larry Page and Sergey Brin to rank web pages, treating the internet as a directed graph. The goal is to identify the most central or interesting nodes within this graph, based on the intuition that a node is important if it’s connected to other important nodes. This recursive definition, though intuitive, raises the question of how to translate it into a numerical score. The proposed solution is the random walk model, where the most popular nodes are those with the highest steady-state probability (SSP). But what does this concept truly entail, and is there a simpler way to understand it?

PageRank employs a random walk model

Imagine a random walker, someone whose only task is to click through web pages across the entire internet. The walker can start the random walk from any node in the graph. They visit a web page and then decide, through a coin flip, which of its neighboring web pages to visit next.

Understanding random walk with an example

Suppose we have a toy internet graph with only 5 nodes, as shown in the figure below. If the random walker first visits node 1, they flip a coin and proceed to node 2, the only connected node (flipping the coin doesn’t really matter here, but the walker does it anyway!). Once at node 2, the walker flips the coin again to decide whether to go to node 3 or node 5, with an equal chance for each. This process repeats indefinitely, and the long-run probability of being on each node, referred to as the steady-state probability (SSP), is that node’s PageRank score. In this example, “states” correspond to web pages.

The most popular webpages are the ones that have the highest steady-state probability or SSP.
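
To make this concrete, here is a minimal Python sketch of the random walk. Only the edges 1→2, 2→3, and 2→5 appear in the example above; the remaining edges are hypothetical, added for illustration so the walk can always continue.

```python
import random
from collections import Counter

# Toy graph as an adjacency list. Edges 1->2, 2->3, 2->5 come from the
# example above; the rest are hypothetical, added for illustration.
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}

def random_walk_ssp(graph, steps=100_000, start=1):
    """Estimate each node's steady-state probability by visit frequency."""
    visits = Counter()
    node = start
    for _ in range(steps):
        node = random.choice(graph[node])  # the "coin flip" among neighbors
        visits[node] += 1
    return {n: count / steps for n, count in sorted(visits.items())}

print(random_walk_ssp(graph))  # visit frequencies approximate the SSP
```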

PageRank: Simplified (without Teleportation)

To compute the steady-state probability, recall from linear algebra that we construct a transition matrix. In our example with 5 states (or web pages), we create a 5 by 5 transition matrix, where the number of nodes matches the number of rows and columns. This matrix is the transposed, column-normalized adjacency matrix. The transposition arises from reading columns as where the walker comes from and rows as where it goes to. For instance, node 2 has two options, going to node 3 or node 5, so column 2 has non-zero entries in rows 3 and 5. Column normalization ensures these values sum to 1, signifying an equal probability of transitioning to node 3 or node 5. To calculate the steady-state probability, we solve the equation Bp = p for the vector p, whose entries sum to 1; this linear system can be solved exactly, for example with matrix inversion.
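
As a sketch, this is one way to build that transposed, column-normalized transition matrix from an adjacency list, reusing the hypothetical toy edges from the random-walk sketch:

```python
import numpy as np

# Hypothetical toy edges (same as the random-walk sketch above).
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}
n = len(graph)

# B[i, j] = probability of moving from node j+1 to node i+1:
# columns say where the walker comes from, rows where it goes to.
B = np.zeros((n, n))
for src, targets in graph.items():
    for dst in targets:
        B[dst - 1, src - 1] = 1 / len(targets)

print(B)
print(B.sum(axis=0))  # every column sums to 1
```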

Another perspective on this formula: p is the eigenvector corresponding to the largest eigenvalue of the transition matrix B. Because of the column normalization, that largest eigenvalue is always 1. So we can construct the transition matrix from the graph using only its nodes and edges, and compute the steady-state probability stored in the vector p. The Perron-Frobenius theorem guarantees the existence of p as long as the matrix is square (n by n, matching the number of nodes, rows, and columns), nonnegative (its entries are transition probabilities), and irreducible (we’ll discuss this shortly).
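
Here is a minimal sketch of this eigenvector view, under the same hypothetical toy edges: the leading eigenvalue comes out as 1, and its normalized eigenvector is the SSP.

```python
import numpy as np

# Same hypothetical toy transition matrix as above.
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}
n = len(graph)
B = np.zeros((n, n))
for src, targets in graph.items():
    for dst in targets:
        B[dst - 1, src - 1] = 1 / len(targets)

eigvals, eigvecs = np.linalg.eig(B)
k = np.argmax(eigvals.real)        # index of the leading eigenvalue
p = np.abs(eigvecs[:, k].real)
p /= p.sum()                       # rescale so probabilities sum to 1

print(round(eigvals[k].real, 6))   # 1.0, thanks to column normalization
print(p)                           # the steady-state probability vector
```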

To recap, before delving into the PageRank algorithm: we envision a random walker traversing nodes and edges in the graph. A node’s PageRank score is the steady-state probability of finding the walker at that node, and this lets us calculate scores for all nodes in the graph.

Addressing irreducibility, let’s consider a scenario where the random walker could get trapped in one part of the network due to a lack of connecting edges between two subgraphs. If you start in one subgraph, you can’t reach the other because there are no links to follow. To tackle this, we introduce occasional random jumps or teleportation to any node in the graph, making the matrix irreducible and resolving potential isolation between subgraphs.

PageRank: Full Algorithm (with Teleportation)

Now, let’s examine the formal definition of irreducibility. It means that from any state (in our case, any node), there is a non-zero probability of transitioning to any other state, or any other node. This essentially aligns with the random jump or teleportation allowed for the random walker.
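
To make irreducibility concrete, here is a small check, again assuming the hypothetical toy graph (c = 0.85 is a common choice of link-following probability, not a value from this article; the full formula it belongs to appears in the next paragraph): once teleportation is mixed in, every entry of the effective transition matrix is positive.

```python
import numpy as np

# Same hypothetical toy transition matrix as before.
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}
n, c = len(graph), 0.85  # assumed link-following probability
B = np.zeros((n, n))
for src, targets in graph.items():
    for dst in targets:
        B[dst - 1, src - 1] = 1 / len(targets)

# Teleportation mixes in a uniform jump, so every entry becomes positive:
# from any state there is a non-zero probability of reaching any other.
M = c * B + (1 - c) / n * np.ones((n, n))
print((M > 0).all())  # True: the matrix is irreducible
```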

Now, looking at the full algorithm, the formula is a small extension of the p = Bp you saw before:

p = cBp + (1-c)/n · 1

where 1 is the all-ones vector. As before, we can solve this exactly using inverse operations. The new addition is the constant c (or rather 1-c). We refer to 1-c as the fly-out probability: the teleportation probability for the random walker to move to any random node in the graph. Let’s break down the parts of the formula:

  • cBp: the link-following term, where c is the probability of following a link.
  • (1-c)/n · 1: the teleportation term. Here 1-c is the fly-out probability, n is the number of nodes, and 1 is the all-ones vector. Dividing the all-ones vector by n turns it into a uniform distribution, so each teleport lands on any of the n nodes with equal probability 1/n. (A sketch of solving the full formula exactly follows this list.)
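
Here is a minimal sketch of that exact solve, under the same assumptions as before (hypothetical toy graph, assumed c = 0.85):

```python
import numpy as np

# Rearranging p = cBp + (1-c)/n * 1 gives (I - cB) p = (1-c)/n * 1,
# a linear system we can solve exactly. Hypothetical toy graph as before.
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}
n, c = len(graph), 0.85  # assumed fly-out probability 1 - c = 0.15
B = np.zeros((n, n))
for src, targets in graph.items():
    for dst in targets:
        B[dst - 1, src - 1] = 1 / len(targets)

p = np.linalg.solve(np.eye(n) - c * B, (1 - c) / n * np.ones(n))
print(p, p.sum())  # PageRank scores; they sum to 1
```

Solving the system directly is fine for a toy graph, but inverting a matrix with billions of rows is not practical, which is exactly why the next section turns to power iteration.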

How to Run PageRank on a Huge Matrix?

This formula yields a highly scalable method for computing PageRank on large matrices or graphs, capable of handling billions of nodes and edges. It employs power iteration, a concept often covered in linear algebra classes, where you repeatedly apply the update.

For instance, in iteration zero, you initialize the vector p on the right-hand side to any non-zero value, often an all-ones vector or values proportional to node degrees. You then apply the formula, generating a new vector p′. In the next iteration, p′ takes the place of p, producing p″. This process continues until you observe that the vector no longer changes; at that point the algorithm has stabilized, and it is guaranteed to do so, giving you the PageRank scores for all nodes in the graph. In practice, for large graphs, you typically don’t need many iterations: often 20 or 30 suffice, as the PageRank values change very little after that.
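
A minimal power-iteration sketch, under the same assumptions as before (hypothetical toy graph, assumed c = 0.85):

```python
import numpy as np

# Power iteration on p = cBp + (1-c)/n * 1.
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}  # hypothetical toy graph
n, c = len(graph), 0.85
B = np.zeros((n, n))
for src, targets in graph.items():
    for dst in targets:
        B[dst - 1, src - 1] = 1 / len(targets)

p = np.ones(n) / n                     # iteration zero: any non-zero start
for i in range(100):
    p_next = c * B @ p + (1 - c) / n * np.ones(n)
    if np.allclose(p, p_next, atol=1e-12):  # the vector no longer changes
        break
    p = p_next

print(i, p)  # typically converges within a few dozen iterations
```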

PageRank Explained with JavaScript

When implementing the PageRank algorithm, it’s essential to perform a sanity check to ensure its correctness and accuracy. To facilitate this, I recommend using an interactive PageRank demo. With this tool, you can create simple toy graphs by adding nodes and connecting them with edges. The demo provides you with immediate feedback on the PageRank values, displayed as white text on a black background. Additionally, you can adjust the fly-out probability, referred to as the damping factor in this context, to check your PageRank calculations.

Run PageRank on Any Graph

PageRank is a versatile algorithm that can be applied to various types of graphs. It requires only the graph’s edges to operate, making it a valuable addition to your algorithm toolbox. It is generally better than degree centrality, which considers only direct neighbors, as PageRank accounts for longer-range relationships, such as those two or three steps away, and aggregates this information. PageRank is well-suited for large graphs due to its linear runtime in the number of edges, allowing scalability to billion-scale graphs.

However, PageRank has a limitation, prompting Google to employ a combination of algorithms. One method to manipulate PageRank scores is through a technique known as a “Google Bomb,” aimed at misleading the PageRank algorithm. To boost the PageRank score of a specific node, individuals create numerous fake nodes in the graph, all pointing to the node of interest. In the context of the algorithm, having more incoming links to a node can increase its PageRank score.

Personalizing PageRank

While the PageRank algorithm produces consistent results regardless of who runs it, this uniformity may not align with the personalized preferences of different individuals. For instance, if you’re interested in sports websites and someone else favors news websites, not all web pages are equally relevant to both of you. To address this, we can make a small modification to the PageRank algorithm.

This change involves adjusting the all-ones vector, which is responsible for the uniform results we discussed earlier. To compute a personalized PageRank score, you replace the all-ones vector with a 0–1 style vector: zeros everywhere except non-zero values at the nodes that align with your interests, focusing on a limited set of sites. For example, if you’re interested in news websites, you might assign non-zero values to nodes like the New York Times and the Wall Street Journal.

The impact of this change is that when the algorithm is about to perform a random jump, it no longer selects any node in the graph with equal probability. Instead, it shows a preference for jumping to the websites that match your interests. Additionally, this adjustment takes into account the interconnected nature of news websites, meaning that by assigning a non-zero value to nodes like the Wall Street Journal or New York Times, you increase the likelihood of the random walker not only preferring to jump to those websites but also visiting other related news websites you haven’t assigned a value to yet.
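
A minimal sketch of this personalization, assuming the same hypothetical toy graph, with nodes 1 and 3 standing in for sites that match your interests:

```python
import numpy as np

# Personalized PageRank: swap the uniform jump vector for a preference
# vector v. Hypothetical toy graph; nodes 1 and 3 are the (hypothetical)
# sites you care about.
graph = {1: [2], 2: [3, 5], 3: [4], 4: [5], 5: [1]}
n, c = len(graph), 0.85
B = np.zeros((n, n))
for src, targets in graph.items():
    for dst in targets:
        B[dst - 1, src - 1] = 1 / len(targets)

v = np.zeros(n)
v[[0, 2]] = 1          # non-zero only at the nodes of interest
v /= v.sum()           # normalize into a probability distribution

p = np.ones(n) / n
for _ in range(100):
    p = c * B @ p + (1 - c) * v  # jumps now land only on preferred nodes
print(p)
```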

Why Personalized PageRank?

Personalized PageRank is valuable because it can be employed for recommendation purposes. Consider a scenario where you want to rank products, such as in the case of Amazon, using a customer-product graph. As a new Amazon customer who has made one or two purchases, you construct a bipartite graph connecting products to the customers who have bought them. By placing a “1” in this graph for the products you’ve purchased and running the personalized PageRank algorithm, you can obtain scores for related products — those commonly purchased by other customers, aside from your selections.
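
As a sketch of this recommendation setup, assuming a tiny hypothetical customer-product graph (customers c1 and c2, products p1 through p3):

```python
import numpy as np

# Hypothetical customer-product bipartite graph: customers c1, c2 are
# linked to the products p1..p3 they bought.
edges = [("c1", "p1"), ("c1", "p2"), ("c2", "p2"), ("c2", "p3")]
nodes = sorted({u for edge in edges for u in edge})
idx = {u: i for i, u in enumerate(nodes)}
n, c = len(nodes), 0.85

A = np.zeros((n, n))
for u, w in edges:                     # purchases as undirected links
    A[idx[u], idx[w]] = A[idx[w], idx[u]] = 1
B = A / A.sum(axis=0)                  # column-normalize into transitions

v = np.zeros(n)
v[idx["p1"]] = 1                       # the "1" goes on a product you bought

p = np.ones(n) / n
for _ in range(100):
    p = c * B @ p + (1 - c) * v
print(dict(zip(nodes, p.round(3))))    # higher-scoring products are related
```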

This technique is highly adaptable because it requires only the graph structure, and the specific way you create the graph is not crucial. As long as the graph has edges, you can compute personalized PageRank scores. Moreover, it serves as a valuable tool for graph visualization. Instead of visualizing every node and edge, you can gather user feedback to identify interesting nodes (assigning them a “1”) and then run the personalized PageRank algorithm. This allows you to focus on visualizing nodes with the highest scores tailored to each user’s preferences.

Personalized PageRank belongs to a broader class of algorithms, often referred to as diffusion-based or guilt-by-association algorithms, where connections signify likely relationships. It is sometimes called “Random Walk with Restart.” In other domains like Human-Computer Interaction (HCI), it is known as “spreading activation” or “degree of interest.” There is also an algorithm called Belief Propagation, which achieves similar outcomes to personalized PageRank; it is a versatile algorithm used in a wide range of applications, from fraud detection to image segmentation and error-correcting codes. Powerful techniques like PageRank and personalized PageRank find applications across various domains, albeit under different names.

Why are These Algorithms Popular?

All these algorithms are popular because they tend to be intuitive to interpret: they leverage network effects and homophily. They are also easy to implement, and the math is relatively simple. PageRank, for example, is a single formula, and its implementation is essentially a few matrix-vector multiplications. In practice they are also fast to run, linear in the number of edges, which means they scale to very large graphs. And, very nicely, they often have probabilistic meaning: the scores are not just empirical rankings but can be interpreted as probabilities.

Want to watch the video version of this article instead?

Here it is:

