Graph Processing: Systems

Motivation and Design

SIGMOD14: Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

todo

CPU/Multi-node/In-memory

Challenge: Parallelism (mostly inter-node)

  1. communication: smart partitioning, or overlap communication with computation
  2. load balance: smart partitioning, or dynamic work-stealing

VLDB12: Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud

todo

OSDI12: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

EuroSys15: PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs

todo

SoCC15: GRAM: Scaling Graph Computation to the Trillions

GRAM exploits Remote Direct Memory Access (RDMA). It claims that message passing over RDMA is even faster than shared-memory access within the same server. The benefits of RDMA are non-blocking operation and no CPU involvement, and its overhead is negligible as long as communication is overlapped with computation. The disadvantage of intra-node memory access is its randomness (poor locality) and potential false sharing.

The graph is partitioned by source vertex (row) and all cores are symmetric, which means an all-to-all connection needs to be established. The authors suggest grouping the receive buffers and multiplexing incoming messages. As for the send buffers, we can either send directly or batch messages for better communication efficiency.
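
A minimal sketch of the batching idea, not GRAM's actual API (all names here are illustrative, and `rdma_write` is a stub standing in for a real one-sided verbs call): keep one send buffer per destination core and flush it as a single large write once full, so many tiny messages become one transfer into the receiver's grouped buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Msg { uint64_t dst_vertex; double value; };

class BatchedSender {
 public:
  BatchedSender(int num_cores, std::size_t batch_size)
      : bufs_(num_cores), batch_size_(batch_size) {}

  void send(int dst_core, const Msg& m) {
    auto& buf = bufs_[dst_core];
    buf.push_back(m);                                // batch instead of sending eagerly
    if (buf.size() >= batch_size_) flush(dst_core);
  }

  void flush(int dst_core) {
    auto& buf = bufs_[dst_core];
    if (buf.empty()) return;
    rdma_write(dst_core, buf.data(), buf.size() * sizeof(Msg));
    buf.clear();
  }

 private:
  // Stand-in for posting a one-sided RDMA write into the receiver's grouped
  // (multiplexed) receive buffer; a real implementation would use ibverbs.
  void rdma_write(int /*dst_core*/, const void* /*data*/, std::size_t /*bytes*/) {}

  std::vector<std::vector<Msg>> bufs_;  // one buffer per remote core
  std::size_t batch_size_;
};
```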

SC15: PGX.D: A Fast Distributed Graph Processing Engine

todo

OSDI16: Gemini: A Computation-centric Distributed Graph Processing System

The series of optimizations is not so exciting; most of them originate from previous publications: the push/pull interface (Ligra), master/mirror replication (PowerGraph), CSR/CSC plus some indexing, plain chunked partitioning (note the balance in vertex AND edge count), NUMA-aware sub-partitioning, and OpenMP lightweight work-stealing. The only point that interests me is the communication pattern: to implement the all-to-all operation and overlap it with computation, the framework leverages a virtual cyclic ring.
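
A minimal sketch of the cyclic-ring idea, expressed here with MPI for brevity (Gemini itself uses its own threads and sockets, and chunk sizes are assumed pre-agreed): in step s, rank r sends its chunk destined for rank (r+s)%p and receives from (r-s+p)%p, so every node sends and receives exactly one chunk per step, and the chunk just received can be processed while the next exchange is in flight.

```cpp
#include <mpi.h>
#include <vector>

void ring_all_to_all(const std::vector<std::vector<char>>& out_chunks,
                     std::vector<std::vector<char>>& in_chunks) {
  int rank, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  for (int s = 0; s < p; ++s) {
    int to = (rank + s) % p;        // ranks walk the ring in lockstep,
    int from = (rank - s + p) % p;  // so no node ever sits idle
    MPI_Sendrecv(out_chunks[to].data(), (int)out_chunks[to].size(), MPI_CHAR,
                 to, /*tag=*/0,
                 in_chunks[from].data(), (int)in_chunks[from].size(), MPI_CHAR,
                 from, /*tag=*/0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // ... process in_chunks[from] here, overlapped with the next iteration ...
  }
}
```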

CPU/Single-node/In-memory

Challenge: cache and memory locality

PPoPP13: Ligra: A Lightweight Graph Processing Framework for Shared Memory

todo

Keywords: dense/sparse frontier representation, push/pull edge traversal (see the sketch below).
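
A minimal sketch of the direction switching for one BFS step, not Ligra's edgeMap itself: with a small frontier, frontier vertices push to their out-neighbors; with a large one, every unvisited vertex pulls over its in-edges. The |E|/20 threshold is Ligra's default heuristic; the CSR/CSC layout and names here are my own. The caller seeds `parent` with -1 everywhere except `parent[src] = src` and `frontier = {src}`.

```cpp
#include <cstdint>
#include <vector>

struct Graph {
  std::vector<int64_t> out_offsets, in_offsets;   // CSR and CSC offsets
  std::vector<int32_t> out_edges, in_edges;
  int64_t num_edges() const { return (int64_t)out_edges.size(); }
};

std::vector<int32_t> bfs_step(const Graph& g, const std::vector<int32_t>& frontier,
                              std::vector<int32_t>& parent) {
  int64_t frontier_edges = 0;
  for (int32_t v : frontier)
    frontier_edges += g.out_offsets[v + 1] - g.out_offsets[v];
  std::vector<int32_t> next;
  if (frontier_edges > g.num_edges() / 20) {
    // Dense/pull: each unvisited vertex scans in-edges for a frontier parent.
    std::vector<char> in_frontier(parent.size(), 0);
    for (int32_t v : frontier) in_frontier[v] = 1;
    for (int32_t v = 0; v < (int32_t)parent.size(); ++v) {
      if (parent[v] != -1) continue;
      for (int64_t e = g.in_offsets[v]; e < g.in_offsets[v + 1]; ++e) {
        int32_t u = g.in_edges[e];
        if (in_frontier[u]) { parent[v] = u; next.push_back(v); break; }
      }
    }
  } else {
    // Sparse/push: frontier vertices push to their out-neighbors.
    for (int32_t u : frontier)
      for (int64_t e = g.out_offsets[u]; e < g.out_offsets[u + 1]; ++e) {
        int32_t v = g.out_edges[e];
        if (parent[v] == -1) { parent[v] = u; next.push_back(v); }
      }
  }
  return next;
}
```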

SOSP13: Galois: A Lightweight Infrastructure for Graph Analytics

It adopts a very general programming model, amorphous data-parallelism (ADP), which can be derived naturally from a sequential program description. Parallelization (scheduling and conflict detection) is handled by the framework rather than the user.
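
An illustrative sketch of the ADP pattern, not Galois code: the algorithm is written as a sequential loop over a worklist of active elements whose operator may create new work (here a worklist-based SSSP). Galois parallelizes exactly this shape by letting threads pull items concurrently and speculatively detecting neighborhood conflicts.

```cpp
#include <deque>
#include <limits>
#include <vector>

struct Edge { int dst; int weight; };

void sssp_worklist(const std::vector<std::vector<Edge>>& adj, int src,
                   std::vector<int>& dist) {
  dist.assign(adj.size(), std::numeric_limits<int>::max());
  dist[src] = 0;
  std::deque<int> worklist{src};            // active elements
  while (!worklist.empty()) {
    int u = worklist.front(); worklist.pop_front();
    for (const Edge& e : adj[u]) {          // operator touches a neighborhood
      if (dist[u] + e.weight < dist[e.dst]) {
        dist[e.dst] = dist[u] + e.weight;
        worklist.push_back(e.dst);          // operator may create new work
      }
    }
  }
}
```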

One takeaway: careful code-branch optimization keeps the system lightweight despite supporting many features.

VLDB15: GraphMat: High Performance Graph Analytics Made Productive

todo

CPU/Single-node/Out-of-core

Challenge: caching disk data in memory

  1. low disk random-access bandwidth: accesses should always be sequential
  2. large disk I/O volume (frequent cache replacement)
  3. slow random memory access: improve locality by partitioning

KDD13: TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC

todo

SOSP13: X-Stream: Edge-centric Graph Processing using Streaming Partitions

X-Stream proposes novel edge-centric processing, characterized by a new partitioning method. Each partition holds a vertex set plus its edges, stored with no preprocessing (unsorted and even uncompressed). This reduces preprocessing time, but ignores locality and load-balance issues.

For each partition, it first reads the edges sequentially and reads/writes vertex data randomly, while appending all updates sequentially. Many updates belong to other partitions and hence need an extra rearrangement. At last, it reads the updates sequentially and applies them to vertices randomly. The whole process is referred to as Scatter-Shuffle-Gather.
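
A minimal in-memory sketch of the scatter-shuffle-gather cycle for a PageRank-like update (on disk, the edge and update streams would be read and written sequentially per partition; the struct names and the uniform chunking by `verts_per_part` are my simplifications):

```cpp
#include <cstdint>
#include <vector>

struct Edge { int32_t src, dst; };
struct Update { int32_t dst; float value; };

void iterate(const std::vector<std::vector<Edge>>& part_edges,  // edges per partition
             std::vector<float>& rank,                          // vertex data
             int32_t verts_per_part) {
  size_t P = part_edges.size();
  std::vector<std::vector<Update>> shuffled(P);
  // Scatter: stream edges sequentially, read source vertices randomly,
  // append one update per edge to the destination partition's stream.
  for (size_t p = 0; p < P; ++p)
    for (const Edge& e : part_edges[p])
      shuffled[e.dst / verts_per_part].push_back({e.dst, rank[e.src]});
  // (The shuffle happened above; X-Stream performs it as its own
  //  sequential rearrangement pass between scatter and gather.)
  // Gather: stream updates sequentially, write destination vertices randomly.
  for (size_t p = 0; p < P; ++p)
    for (const Update& u : shuffled[p])
      rank[u.dst] += u.value;
}
```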

ATC15: GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning

GridGraph is based on 2D edge partitioning, which combines the (silly) three-phase design of X-Stream into a single phase.

However, 2D partitioning brings up the problem of finding the optimal partition count: fine-grained partitioning not only provides locality and flexibility, but also incurs proportionally more I/O on vertex data. The authors thus propose 2-level partitioning. Perfect!
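
A minimal sketch of the single-phase grid streaming (an in-memory stand-in for GridGraph's disk-based dual sliding windows; the grid layout and names are my own): blocks in grid column j all target destination chunk j, so that chunk stays resident while the source chunks slide past, and updates are applied in place with no intermediate update stream.

```cpp
#include <cstdint>
#include <vector>

struct Edge { int32_t src, dst; };

// grid[i][j] holds edges from source chunk i to destination chunk j.
void stream_grid(const std::vector<std::vector<std::vector<Edge>>>& grid,
                 const std::vector<float>& src_val, std::vector<float>& dst_val) {
  size_t P = grid.size();
  for (size_t j = 0; j < P; ++j)      // destination chunk j stays resident
    for (size_t i = 0; i < P; ++i)    // source chunks stream through
      for (const Edge& e : grid[i][j])
        dst_val[e.dst] += src_val[e.src];  // one phase: no scatter/gather split
}
```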

Its performance excels among state-of-the-art implementations and is even comparable to distributed graph systems. Solid!

ICDE16: NXgraph: An Efficient Graph Processing System on a Single Machine

The paper tells the same story as GridGraph and covers its ineptitude with fancy definitions… DSSS is the 2D partitioning, SPU is TurboGraph's update strategy, DPU is nonsense because it claims to operate under a tight memory budget yet allocates even more, and finally MPU is a freak. Not to mention its misleading and unfair comparison. A waste of time to read!

CPU/Multi-node/Out-of-core

SOSP15: Chaos: Scale-out Graph Processing from Secondary Storage

todo — key premise: the network is faster than secondary storage

GPU/Multi-node/GPU Out-of-memory

Challenge: caching host data in GPU memory

  1. small GPU memory capacity
  2. slow CPU-GPU data transfer (limited PCIe bandwidth)
  3. parallelism (mostly inter-node)

SC15: GraphReduce: Processing Large-Scale Graphs on Accelerator-Based Systems

Specialized Accelerator/Single-node/In-Memory

Challenge: overcome software limitations

  1. slow random memory access: memory sub-system
  2. high overhead scheduling: instruction set and hardware units
  3. faster sequential memory access: prefetchers and multiple streams
  4. parallelism (communication and load balance)

ISCA15: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

It is a Near-Data-Processing work based on Micron’s HMC.

The main contribution is an on-chip network and its corresponding communication interface to facilitate the collaboration of many cores. At any time, each core can issue remote put calls and get interrupted to receive and execute put calls from other cores. For efficiency, common techniques like batching and non-blocking operation are employed.
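
A software model of this interface, purely illustrative (the ISCA15 design is hardware; the types and the mutex standing in for the on-chip network port are my own): each core owns an inbox of remote put messages, `put()` returns without waiting for execution, and the receiver drains its inbox in a batch between chunks of local work, mimicking the interrupt-driven execution.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Put { uint32_t vertex; float delta; };

struct Core {
  std::vector<Put> inbox;
  std::mutex lock;            // stands in for the on-chip network port
  std::vector<float>* prop;   // slice of vertex properties this core owns

  void drain() {              // run on "interrupt" / between work chunks
    std::vector<Put> batch;
    { std::lock_guard<std::mutex> g(lock); batch.swap(inbox); }
    for (const Put& p : batch) (*prop)[p.vertex] += p.delta;  // execute puts
  }
};

inline void put(Core& remote, uint32_t v, float d) {
  std::lock_guard<std::mutex> g(remote.lock);
  remote.inbox.push_back({v, d});   // non-blocking: sender does not wait
}
```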

As far as I am concerned, the method is problematic in practice. Since graph processing is irregular, some cores might starve in specific contexts, causing load imbalance or even deadlock.

ISCA16: Energy Efficient Architecture for Graph Analytics Accelerators

The paper features asynchronous graph computing.

MICRO16: Graphicionado: High-Performance and Energy-Efficient Accelerator for Graph Analytics

It is intuitive to leverage on-chip scratchpad memory to cache random accesses. The cache improves both bandwidth and latency, and another subtle advantage is controlling the cache-line granularity to avoid bandwidth waste.

However, such on-chip designs always suffer from capacity issues, which is reminiscent of the problem out-of-core solutions face. Here, edges are divided by destination vertex id (column).

To avoid conflicts, the whole processing flow has two phases, which are replicated separately and connected by a crossbar switch. Compared with the ISCA15 idea, the hardware components are not identical but are assigned different roles in a pipelined fashion, eliminating the possibility of the troubles addressed above.

Miscellaneous: support for symmetric (undirected) graphs, support for large vertex properties, and support …