Parallel PageRank: An overview of algorithms and their performance

Chinmay Chiplunkar · Published in Analytics Vidhya · Feb 24, 2021

If you have studied graph algorithms in the past, chances are that you have heard of an algorithm called PageRank. It was developed in 1999 by none other than Larry Page and Sergey Brin, the co-founders of Google! They needed a new algorithm to rank the ever-increasing number of web pages indexed by Google and PageRank was the brainchild of their efforts in that direction.

Note: All you geeks who want to dive into the original paper can find it here: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

The internet is completely open and democratic, which, while great for people like us, can be a headache for search engines like Google. Since anyone can start a website on the internet, it is filled with low-quality content, fake news and even websites that are dangerous to visit. How does Google ensure that your top search results contain up-to-date, original, high-quality content?

While Google’s current search algorithm is bound to be much more complex and nuanced than PageRank, the PageRank algorithm does provide a wonderful insight into how the internet can be thought of as a graph. The algorithm envisages the internet as a directed graph with unweighted edges. Each node represents a webpage, and a hyperlink from one webpage to another is denoted by a directed edge. Each node receives a score based on its ‘importance’. Pages with high-quality content should receive higher scores than pages with low-quality content.

There are two key factors which decide the importance of a webpage:

  1. The number of webpages that contain a link to it
  2. The importance of the webpages that contain a link to it

Sometimes a webpage has very few links to it but they are from credible webpages. For example, a newly published article on the New York Times’ website will contain only one link to it which will be from the homepage of the New York Times. But since the New York Times homepage is an important webpage, the PageRank score of the new article will be high and it will show up in the search results.

Other times, a page will contain links from many smaller, less known websites. For example, many bloggers in the past few days have written about how redditors on r/WallStreetBets made huge gains from investing in GameStop. Their blog posts contain a link to the subreddit. This increases the importance of the r/WallStreetBets page and makes it more likely to appear in the search results.

Understanding the PageRank formula

Let’s understand how PageRank is calculated with the help of a small example graph with four vertices A, B, C and D, and edges A→B, A→C, A→D, B→C, C→A and D→B (you can read these edges off the formulas below).

PR(A) = PR(C) / out(C)

PR(B) = PR(A) / out(A) + PR(D) / out(D)

PR(C) = PR(A) / out(A) + PR(B) / out(B)

PR(D) = PR(A) / out(A)

Where PR(X) is the PageRank score of vertex X and out(X) is the out-degree of vertex X.

To summarise, the PageRank score of a vertex is the sum of the PageRank scores of the vertices that it receives an edge from, each divided by that vertex’s out-degree.

Initially, the PageRank of each vertex is initialised to (1 / number of vertices in the graph). The standard serial algorithm proceeds in iterations, recomputing the PageRank of every vertex in each iteration, until the change in every vertex’s PageRank within an iteration is less than a certain threshold called the tolerance. Note that the order in which the vertices are processed within an iteration is irrelevant because we always consider the PageRank calculated in the previous iteration during the calculation.
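As a quick sanity check, this simplified iteration can be sketched in Python for the four-vertex example graph (a minimal serial sketch, not an optimised implementation):

```python
# Out-degrees and in-neighbours of the example graph
# (edges: A->B, A->C, A->D, B->C, C->A, D->B).
out = {"A": 3, "B": 1, "C": 1, "D": 1}
in_nbrs = {"A": ["C"], "B": ["A", "D"], "C": ["A", "B"], "D": ["A"]}

pr = {v: 1 / len(out) for v in out}          # initialise to 1/|V|
tolerance = 1e-9
while True:
    # New score: sum of in-neighbours' scores, each divided by its out-degree
    new_pr = {v: sum(pr[u] / out[u] for u in in_nbrs[v]) for v in pr}
    done = max(abs(new_pr[v] - pr[v]) for v in pr) < tolerance
    pr = new_pr
    if done:
        break
```

For this particular graph the iteration settles at PR(A) = PR(C) = 1/3, PR(B) = 2/9 and PR(D) = 1/9, which you can verify by plugging these values back into the four formulas above.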

The above was a simplified description of PageRank. The formula for PageRank that is used more frequently in practice is:

PR(v) = (1 - d) / |V| + d * Σ PR(u) / |N+(u)|, summed over u ∈ N-(v)    (source: [1])

V is the set of vertices of the graph, N+(v) is the set of vertices that receive an edge from v and N-(v) is the set of vertices that v receives an edge from. d is called the teleportation factor and is usually set to 0.85. (1-d) is the probability that a surfer randomly navigates away from the current webpage to a page that it does not contain a link to (like the way I sometimes start reading an article about graph algorithms but can’t resist the urge to check the score of the cricket match :D).
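Putting the damped formula and the tolerance-based stopping rule together, a minimal serial implementation might look like this (a sketch for arbitrary graphs without dangling vertices; the function and parameter names are my own):

```python
def pagerank(in_nbrs, out_deg, d=0.85, tol=1e-9):
    """Serial PageRank with teleportation, following the formula above:
    PR(v) = (1 - d)/|V| + d * sum over u in N-(v) of PR(u)/|N+(u)|.
    Assumes every vertex has out-degree > 0 (no dangling vertices)."""
    n = len(out_deg)
    pr = {v: 1 / n for v in out_deg}             # initialise to 1/|V|
    while True:
        new_pr = {
            v: (1 - d) / n + d * sum(pr[u] / out_deg[u] for u in in_nbrs[v])
            for v in pr
        }
        if max(abs(new_pr[v] - pr[v]) for v in pr) < tol:
            return new_pr
        pr = new_pr

# The four-vertex example graph from earlier:
ranks = pagerank(
    in_nbrs={"A": ["C"], "B": ["A", "D"], "C": ["A", "B"], "D": ["A"]},
    out_deg={"A": 3, "B": 1, "C": 1, "D": 1},
)
```

Since the graph has no dangling vertices, the scores still sum to 1 after damping; the teleportation term just smooths the distribution.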

Parallel PageRank

PageRank seems pretty straightforward so far, right? The problem is that there are more than 50 billion websites on the internet (source: a quick Google search). If we run the serial PageRank algorithm on a graph with 50 billion vertices, we would all likely be dead by the time it finishes executing (unless Elon Musk figures out a way to upload our consciousness into a computer, but that’s a topic for a different day). What’s the solution? Why, parallel PageRank of course!

There are three paradigms when it comes to executing PageRank in parallel on large graphs: in-memory, out-of-core and distributed. In-memory computation loads the whole graph into the main memory of the server (state-of-the-art servers with 1 TB of main memory are not unheard of). Out-of-core computation keeps the graph on disk and loads parts of it into main memory as required. Distributed computation splits the graph across multiple servers and coordinates the execution of PageRank between the servers. In this article, we’ll focus on in-memory parallel PageRank since it is the fastest. We’ll consider two main types of such algorithms: topology driven and data driven.

Topology Driven vs Data Driven Algorithms

A topology driven algorithm is one in which all vertices of a graph are processed in one iteration. The Bellman-Ford algorithm for finding the shortest path in a weighted graph is a typical example of a topology driven algorithm. In this algorithm, we relax the edges of all vertices of a graph in one iteration, irrespective of whether a particular vertex’s edges need to be relaxed or not.

On the other hand, data-driven algorithms process only the ‘hot’ vertices in the graph, i.e. the ones which require processing. Breadth First Search is a great example of such an algorithm. In each step of this algorithm, we extract a vertex from the worklist and add its neighbours to the worklist if they are undiscovered.
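The worklist pattern behind data-driven algorithms can be sketched with BFS itself (a minimal sketch; the adjacency-dict representation is just for illustration):

```python
from collections import deque

def bfs(adj, source):
    """Data-driven traversal: only vertices on the worklist get processed."""
    dist = {source: 0}
    worklist = deque([source])
    while worklist:
        v = worklist.popleft()
        for w in adj[v]:
            if w not in dist:            # only undiscovered ('hot') vertices
                dist[w] = dist[v] + 1    # are ever added to the worklist
                worklist.append(w)
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
distances = bfs(adj, 0)
```

Notice that vertex 3 is enqueued only once even though two vertices link to it; the worklist naturally avoids redundant work.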

Data-driven algorithms tend to be harder to analyse theoretically and can have worse worst-case time bounds than topology driven algorithms. However, in practice, carefully-designed data-driven algorithms have been shown to outperform topology driven algorithms, as they do not carry out redundant work [2].

Topology Driven PageRank

Topology driven PageRank (source: [4])

I know this one looks a bit more complex, but it is just the vectorized version of PageRank. x is the PageRank vector, e is a unit vector, alpha is ‘d’, the teleportation factor that we discussed above, and epsilon is the tolerance. On line 4, Sv is the set of vertices that vertex v receives an edge from, and Tw is the set of vertices that receive an edge from vertex w. It’s basically the same algorithm that we discussed earlier.
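For intuition, the vectorized update x ← alpha·Mx + (1 − alpha)/n·e can be sketched with NumPy, assuming the graph is small enough to store as a dense column-stochastic matrix (real implementations use sparse adjacency structures; the function name here is my own):

```python
import numpy as np

def pagerank_vectorized(M, alpha=0.85, eps=1e-9):
    """Iterate x <- alpha * M @ x + (1 - alpha)/n until the L1 change < eps.
    M must be column-stochastic: M[u, v] = 1/out(v) if there is an edge v -> u."""
    n = M.shape[0]
    x = np.full(n, 1.0 / n)
    while True:
        x_new = alpha * (M @ x) + (1 - alpha) / n
        if np.abs(x_new - x).sum() < eps:
            return x_new
        x = x_new

# Four-vertex example from earlier (vertex order A, B, C, D):
# edges A->B, A->C, A->D, B->C, C->A, D->B
M = np.array([[0.0, 0.0, 1.0, 0.0],
              [1/3, 0.0, 0.0, 1.0],
              [1/3, 1.0, 0.0, 0.0],
              [1/3, 0.0, 0.0, 0.0]])
ranks_vec = pagerank_vectorized(M)
```

The dense matrix-vector product is exactly where the random-access problem described next comes from: in the sparse, realistic setting, the entries of x that each row touches are scattered across memory.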

The bottleneck of this algorithm is that accessing the neighbours of a vertex leads to random DRAM accesses, since the neighbours may be located anywhere in DRAM. This leads to poor spatial locality of data access, which causes the cache to be ineffective and most of the time is spent trying to get data from the DRAM. Note that temporal locality of data access is very poor too since a vertex’s adjacency data once accessed will only be accessed again in the next iteration.

A potential solution to this problem is to apply the Gather-Apply-Scatter model(GAS) and rewrite the algorithm. This divides the computation into three phases:

  1. Gather: each vertex reads the updates to its in-neighbours’ PageRank values from a dedicated location in memory, where those neighbours wrote them
  2. Apply: the vertex updates its own PageRank value using the values gathered from its neighbours
  3. Scatter: the vertex writes its new PageRank value to the dedicated memory locations of its out-neighbours for them to process in the next iteration

The advantage of doing this is that in the Gather phase, the updates are stored contiguously, which improves the spatial locality of data access.
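A single-threaded sketch of one GAS-style PageRank iteration makes the three phases concrete (illustrative only; names are my own and the per-vertex “inbox” stands in for the dedicated memory locations described above):

```python
def gas_iteration(out_nbrs, out_deg, pr, d=0.85):
    """One Gather-Apply-Scatter iteration of PageRank (no dangling vertices)."""
    n = len(pr)
    # Scatter: each vertex writes PR/out-degree into its out-neighbours' inboxes.
    inbox = {v: [] for v in pr}
    for v in pr:
        share = pr[v] / out_deg[v]
        for w in out_nbrs[v]:
            inbox[w].append(share)
    # Gather + Apply: each vertex reads its own inbox sequentially
    # and updates its score with the damped formula.
    return {v: (1 - d) / n + d * sum(inbox[v]) for v in pr}

# One iteration on the four-vertex example, starting from uniform scores:
out_nbrs = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["B"]}
out_deg = {v: len(ws) for v, ws in out_nbrs.items()}
pr1 = gas_iteration(out_nbrs, out_deg, {v: 0.25 for v in out_nbrs})
```

The point is that each vertex’s incoming updates sit in one contiguous list, so the gather reads are sequential; the random writes have been pushed into the scatter phase.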

While this is an improvement, a further optimisation called Partition Centric Processing Methodology (PCPM) was proposed a couple of years ago [3]. In PCPM, the vertices of a graph are grouped into partitions, as shown in the figure below.

Example graph with partitions of size 3 (source: [3])

For each partition, the partitions from which it receives an in-edge are recorded, and contiguous space is allocated to each partition for receiving updates from the others. For example, P2 is the only partition from which P0 will receive updates, since 6->2, 7->0, 7->1 and 7->2 are the only in-edges to it. Thus we can pre-allocate contiguous space for these two updates (i.e. from vertices 6 and 7) even before the algorithm starts executing.

In the simple GAS approach, updates are sent out along each edge in the scatter phase. This results in a large number of redundant, random DRAM writes. These are eliminated using the above approach. In the parallel implementation of this algorithm, one thread gathers or scatters across all the nodes of a partition at a time. The scatter phase of any partition is not allowed to begin until the gather phase of all partitions is completed and vice versa.
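To make the deduplication concrete, here is a toy single-threaded sketch of a PCPM-style scatter that emits one update per (source vertex, destination partition) pair into that partition’s contiguous buffer (names are my own; the real implementation in [3] also stores destination lists so the gather phase knows which local vertices each update feeds):

```python
def pcpm_scatter(partitions, out_nbrs, part_of, pr, out_deg):
    """Scatter one update per (source vertex, destination partition) pair,
    instead of one per edge as in plain GAS."""
    bins = {p: [] for p in partitions}           # contiguous buffer per partition
    for p, verts in partitions.items():
        for v in verts:
            share = pr[v] / out_deg[v]
            for dst in {part_of[w] for w in out_nbrs[v]}:   # deduplicated targets
                bins[dst].append((v, share))
    return bins

# Two partitions over the four-vertex example: P0 = {A, B}, P1 = {C, D}
partitions = {0: ["A", "B"], 1: ["C", "D"]}
part_of = {"A": 0, "B": 0, "C": 1, "D": 1}
out_nbrs = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["B"]}
bins = pcpm_scatter(partitions, out_nbrs, part_of,
                    pr={v: 0.25 for v in part_of},
                    out_deg={v: len(out_nbrs[v]) for v in part_of})
```

Here vertex A has two edges into P1 (to C and to D) but writes only one update into P1’s buffer, so five updates are written in total instead of one per edge (six).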

However, random DRAM accesses still occur, as one partition’s nodes may contain edges to multiple partitions and updates may get interleaved across all of these partitions in the scatter phase. To prevent this, a per-partition bipartite graph is constructed, as shown below.

Per-partition bipartite graph (source: [3])

Now in the scatter phase, one thread collects all the updates that a partition is supposed to receive and writes them serially. This still results in random reads of the update values, but since the bipartite graphs are much smaller than the actual graph, the random reads are more cache-friendly.

Data Driven PageRank

Data driven PageRank works just like topology driven PageRank except that it computes the PageRank of a vertex and then pushes its neighbours into the worklist (just like BFS). The below image shows the pseudocode for the same:

Data Driven PageRank (source: [4])

The notation used here is exactly the same as in the pseudocode of the topology driven algorithm above.

But this is not the only way to implement data driven PageRank. What we saw above is referred to as the ‘pull’ implementation, because whenever a vertex is popped from the worklist, it pulls (reads) the values of its neighbours’ PageRanks and then updates its own value. Another way is to push the updated PageRank of a vertex to its neighbours. This is called the ‘push’ implementation and its pseudocode is shown below:

Push based data driven PageRank (source: [4])

As we can see, here we have an additional vector r which contains |V| elements. This vector is called the residual vector. It stores the contributions (also called residuals) of each vertex to its neighbours when its PageRank gets updated. It has been proved previously that this algorithm converges and eventually all the residuals become 0.
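A single-threaded sketch of the push-based variant with a residual vector might look like this (a simplified rendering of the idea in [4], not their exact pseudocode; names are my own):

```python
from collections import deque

def pagerank_push(out_nbrs, out_deg, d=0.85, tol=1e-9):
    """Push-based data-driven PageRank with a residual vector r.
    Only vertices whose residual exceeds tol are (re)processed.
    Assumes no dangling vertices."""
    n = len(out_nbrs)
    pr = {v: 0.0 for v in out_nbrs}
    r = {v: (1 - d) / n for v in out_nbrs}       # initial residuals
    worklist = deque(out_nbrs)
    on_list = set(out_nbrs)
    while worklist:
        v = worklist.popleft()
        on_list.discard(v)
        res, r[v] = r[v], 0.0
        pr[v] += res                              # apply this vertex's residual
        for w in out_nbrs[v]:                     # push contributions outwards
            r[w] += d * res / out_deg[v]
            if r[w] > tol and w not in on_list:   # activate only 'hot' vertices
                worklist.append(w)
                on_list.add(w)
    return pr

# Same four-vertex example graph as earlier:
pr_push = pagerank_push(
    out_nbrs={"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["B"]},
    out_deg={"A": 3, "B": 1, "C": 1, "D": 1},
)
```

Note how a vertex is re-enqueued only when its residual exceeds the tolerance; this selective activation is exactly the source of the work savings discussed below.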

It was observed in a previous study[4] that in terms of execution time:

push based data driven < pull based data driven < naive topology driven

While the read-mostly implementations of the topology driven and pull based data driven algorithms are more cache-friendly than the push based one, the amount of work required while pushing is much smaller. This is because while pushing, a vertex can selectively activate only those neighbours whose residual is greater than the tolerance and hence need to be processed. The other two algorithms perform much more work as they process a larger number of vertices.

An interesting future work could be to compare the GAS and PCPM implementations of topology driven PageRank with the push based data driven version. Will the increased cache friendliness of GAS and PCPM help them to compensate for the time spent doing extra work as compared to the push version? Another interesting question is how can the push version be optimised further?

These are the problems that I am exploring currently as a part of my research under the guidance of Dr Vijay Chidambaram, who is a professor at UT Austin. Most of my work is done using the Galois graph analytics framework, developed in-house at UT Austin(https://iss.oden.utexas.edu/?p=projects/galois). I would recommend interested readers to check it out.

In case of any further questions or feedback, I would love to hear from you. You can reach out to me at chinmayvc@gmail.com.

References:

  1. http://www.scottbeamer.net/pubs/beamer-ipdps2017.pdf
  2. https://www.cs.utexas.edu/~roshan/MassiveGraphAnalyticsOptane.pdf
  3. https://www.usenix.org/system/files/conference/atc18/atc18-lakhotia.pdf
  4. https://www.cs.utexas.edu/~inderjit/public_papers/scalable_pagerank_europar15.pdf
