A quick introduction to Google’s Pregel graph processing system

Aditya Chatterjee
4 min readMay 11, 2018

Pregel was first outlined in a paper published by Google in 2010. It is system for large scale graph processing (think billions of nodes), and has served as inspiration for Apache Giraph, which Facebook uses internally to analyze their social network, as well Apache Spark’s GraphX library, which provides an API for Pregel computations.

A Pregel computation consists of a sequence of iterations, each called a superstep.

“During a superstep the framework invokes a user-defined function for each vertex, conceptually in parallel. The function specifies behavior at a single vertex V and a single superstep S. It can read messages sent to V in superstep S − 1, send messages to other vertices that will be received at superstep S + 1, and modify the state of V and its outgoing edges. Messages are typically sent along outgoing edges, but a message may be sent to any vertex whose identifier is known.”

Vertices may choose to halt at a given step, switching its state from activated to deactivated. The Pregel framework will not invoke the function at this vertex from the next superstep onward, unless the vertex is reactivated by a message.

The function can be invoked at each vertex in parallel, since individual vertices communicate via message-passing rather than shared memory.

“Within each superstep the vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its outgoing edges, receive messages sent to it in the previous superstep, send messages to other vertices (to be received in the next superstep), or even mutate the topology of the graph. Edges are not first-class citizens in this model, having no associated computation.”

The paper also backs the decision to use message-passing rather than shared-memory communication by stating that the researchers had not found any graph algorithms in which message passing was not a sufficient means of communication between vertices.

The image below shows a sequence of supersteps to calculate the highest value among vertices. This illustration is taken from Adrian Coyler’s blog.

Step 1: All four vertices are active, and send their values across outgoing edges.

Step 2: Messages received by each vertex are combined, and the largest value is found. If this value is larger than the vertex’s value, then the vertex’s value is updated. If none of the messages are of higher value than a given vertex, it deactivates itself.

Step 3–4: The largest value is propagated to every vertex in the graph, and there are no active vertices remaining. Pregel computation complete!

In step 2, I mentioned that the messages at a vertex from the last superstep can be combined, in this case to produce the message with the highest value. The function that is responsible for this is called the combiner.

Additionally, vertices can exchange messages globally for communication across the entire graph. The aggregator is a function that combines these global messages.

Now let’s look at how the Pregel API can be used in Spark GraphX. The following is the method signature for pregel, a higher order function (aka functor? forgive me, I’m still new to this whole functional programming thing) in Graphx.


def pregel[A]
(initialMsg: A, maxIter: Int = Int.MaxValue, activeDir: EdgeDirection = EdgeDirection.Out) //parameter list 1(vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], mergeMsg: (A, A) => A) //parameter list 2: Graph[VD, ED] //output

As you can see, there are two sets of parameters for the pregel (Scala allows multiple sets of function parameters)

The first parameter list contains initialMsg (a message that each vertex starts with in the first superstep), maxIter (an upper bound on the number of supersteps this pregel computation can have) and activeDir (specifies direction for message propagation).

The second parameter list is for functions that define behavior at each vertex for each superstep. vprog defines how the current vertex can be updated depending on the output of mergeMsg, sendMsg determines messages to be sent and which vertices they should be sent to, and mergeMsg defines how the messages from the previous superstep should be combined (yep, mergeMsg is the combiner function!)

And that’s a wrap! I hope this post inspires you to learn more about pregel, graph processing or distributed/parallel computing!

If you are unfamiliar with functional programming concepts such as first-class functions, higher order functions, immutable data structures, program state etc. I recommend checking out MPJ’s introduction to functional programming in Javascript.

--

--