Large Graph Visualization with RAPIDS cuGraph

New Accelerated Force Atlas 2 API processes and visualizes graphs with millions of vertices and edges in seconds

Hugo Linsenmaier
RAPIDS AI
8 min read · Dec 3, 2020


Graph visualization helps us understand the interactions between different entities, such as relations in a social graph or communications in a computer network. However, current visualization pipelines often lack either an easy-to-use interface or good performance. In this work, RAPIDS cuGraph presents a fast and faithful GPU-accelerated version of the popular Force Atlas 2 layout, achieving a 3x-6x speedup over the state of the art.

Introduction

Data visualization is a critical step in any data science pipeline. It allows us to have a better overall understanding of the problem at hand and provides meaningful insight into the data we are manipulating. Graph layout algorithms, although available through open-source libraries such as NetworkX, are not fast enough for real-world problem settings and often force the user to work with a subset of the data. In this blog, we show that graphs with more than 50M vertices and edges can be visualized efficiently. Using RAPIDS cuGraph, we showcase how the Force Atlas 2 algorithm draws such graphs in just a few minutes. Our implementation can run 2788x faster than existing Python packages and 3x-6x faster than the previous GPU version by Govert G. Brinkmann et al.

Force Atlas 2 Overview

Force Atlas 2 was made popular by Gephi, one of the first visualization tools that let the user watch the drawing process of a graph and interact with it. Force Atlas 2 is built upon a force-directed model in which nodes are represented as particles and edges as springs. The simplicity of this model makes the final drawings easily interpretable and visually pleasing for the user.

In this physical system, we can consider the particles to be in an agitated state. The goal of Force Atlas 2 is to iteratively decrease the global energy of the system until an equilibrium state is reached, giving us the final layout. Each iteration computes, for every node, attraction forces between adjacent vertices and repulsion forces against every other vertex (also known as the N-body problem).

If we relied on these two forces only, nodes would keep oscillating back and forth and never reach a stable position. The strength of Force Atlas 2 lies in another component, known as the global speed, that ensures stable convergence. It maintains a cruising convergence speed throughout the iterations, making it possible to escape local minima by increasing or decreasing the factor by which we update the node positions. We encourage you to check out the original paper for a deeper understanding of the algorithm and the impact of the different parameters.

Accelerating Force Atlas 2 on the GPU

The algorithm can be summarized into five main components:

  1. Attraction
  2. Repulsion
  3. Gravity
  4. Global/Local Speed
  5. Displacement
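To make the structure concrete, here is a minimal, self-contained NumPy sketch that wires these five components together on a dense adjacency matrix. It is a toy O(V²) reference following the default formulas of the original paper, not the cuGraph CUDA implementation, and the adaptive global/local speed heuristic is reduced to a fixed damping factor for brevity.

```python
import numpy as np

def fa2_iteration(pos, adj, deg, speed=0.05, k_r=2.0, k_g=1.0):
    """One simplified Force Atlas 2 iteration (toy O(V^2) reference)."""
    delta = pos[None, :, :] - pos[:, None, :]        # delta[i, j] = pos[j] - pos[i]
    dist = np.linalg.norm(delta, axis=-1) + 1e-9

    # 1. Attraction: linear in distance, applied only along edges.
    f_attr = (adj[:, :, None] * delta).sum(axis=1)

    # 2. Repulsion: degree-weighted and inversely proportional to distance,
    #    computed here against every other vertex (the N-body part).
    rep = k_r * (deg[:, None] + 1.0) * (deg[None, :] + 1.0) / dist**2
    np.fill_diagonal(rep, 0.0)
    f_rep = -(rep[:, :, None] * delta).sum(axis=1)

    # 3. Gravity: pull every node toward the origin so components stay in view.
    f_grav = -k_g * (deg[:, None] + 1.0) * pos / (
        np.linalg.norm(pos, axis=1, keepdims=True) + 1e-9)

    # 4. + 5. Speed and displacement: the adaptive global/local speed of the
    #    real algorithm is replaced by a fixed damping factor in this sketch.
    return pos + speed * (f_attr + f_rep + f_grav)

# Example: lay out a small random undirected graph.
rng = np.random.default_rng(42)
adj = np.triu((rng.random((100, 100)) < 0.05).astype(float), 1)
adj += adj.T
deg, pos = adj.sum(axis=1), rng.standard_normal((100, 2))
for _ in range(500):
    pos = fa2_iteration(pos, adj, deg)
```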

Although steps such as gravity, speed, and node update can be efficiently parallelized, the attraction and, most importantly, the repulsion steps require more optimization to bring down the running time of the algorithm. We will cover the key points that greatly improve the performance of these steps.

Attraction Optimization

As usual in cuGraph, we want to use sparse matrix formats to store our graph efficiently. Most of our algorithms go through a COO (Coordinate List) to CSR (Compressed Sparse Row) conversion, as CSR minimizes memory consumption and allows us to process our graph in a parallel, vectorized, row-wise fashion. However, in the case of Force Atlas 2, using COO matrices instead of CSR allows us to maximize the use of all the GPU threads, as each thread can process an edge in parallel. Using CSR would result in a workload imbalance between threads, since some nodes (hubs) are much more highly connected than others. Moreover, the COO format minimizes the number of computations we have to make, since we can update the attraction force once for both directions. The main requirement with this approach is to sort the graph before starting the iterations to get coalesced memory accesses. Luckily, by using Thrust, which launches CUB's radix sort under the hood, this step becomes negligible, accounting for less than 1% of the computing time.
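The edge-parallel pattern can be pictured with a small Numba CUDA kernel (an illustrative sketch, not the actual cuGraph C++/CUDA kernel): each thread owns one COO edge, computes the linear attraction once, and applies it symmetrically to both endpoints with atomic adds, which is why sorting the edge list beforehand matters for coalesced accesses.

```python
from numba import cuda

@cuda.jit
def attraction_coo_kernel(src, dst, x, y, fx, fy):
    # One thread per COO edge: compute the attraction once and apply it
    # to both endpoints (+ for the source, - for the destination).
    e = cuda.grid(1)
    if e < src.shape[0]:
        s, d = src[e], dst[e]
        dx = x[d] - x[s]              # linear attraction along the edge
        dy = y[d] - y[s]
        cuda.atomic.add(fx, s, dx)    # atomics because hubs are shared
        cuda.atomic.add(fy, s, dy)    # between many edges/threads
        cuda.atomic.add(fx, d, -dx)
        cuda.atomic.add(fy, d, -dy)
```

A launch such as `attraction_coo_kernel.forall(len(src))(src, dst, x, y, fx, fy)` dispatches one thread per edge.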

Repulsion Optimization

The most expensive part of the algorithm is the repulsion, which is O(V²). This step is the main reason we want to leverage the GPU's computing power, but the exact repulsion remains quite costly because of its complexity. Thanks to the Barnes-Hut optimization, we are able to approximate the repulsion in O(V log V) with minimal error. The approximation uses a quadtree to summarize groups of nodes into a center of mass, allowing us to compute a single node-cell interaction instead of processing every node in a distant cell.
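The node-cell approximation can be sketched as follows (a simplified, CPU-side, recursive illustration of the Barnes-Hut criterion, not the cuGraph kernel). The `cell` object and its attributes are assumptions made for the sketch: a quadtree cell carrying its width, its center of mass, its mass (sum of degree + 1 of the nodes it contains), and its four children (empty for a leaf).

```python
def repulsion_on_node(node_pos, node_deg, cell, theta=0.5, k_r=2.0):
    """Accumulate the approximated repulsion exerted on one node by a quadtree cell.

    Assumes the traversal skips the node's own leaf cell.
    """
    dx = node_pos[0] - cell.center_of_mass[0]
    dy = node_pos[1] - cell.center_of_mass[1]
    dist = (dx * dx + dy * dy) ** 0.5 + 1e-9

    # Barnes-Hut criterion: if the cell looks "small" from the node's point
    # of view, interact with its center of mass only (one computation).
    if not cell.children or cell.width / dist < theta:
        factor = k_r * (node_deg + 1) * cell.mass / (dist * dist)
        return factor * dx, factor * dy

    # Otherwise, open the cell and descend into its children.
    fx = fy = 0.0
    for child in cell.children:
        cfx, cfy = repulsion_on_node(node_pos, node_deg, child, theta, k_r)
        fx += cfx
        fy += cfy
    return fx, fy
```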

Figure 1: Overview of the Barnes Hut approximation. The red cell’s center of gravity is close enough so the repulsion will be computed against every node. However, the yellow cell is considered far enough so only one computation will occur.

Our implementation is based on the cuML version that optimized the Barnes-Hut algorithm for t-SNE. With the efficient data structures described in the original paper, we were able to perform uniform branching in the repulsion kernel. We also enabled compiler optimizations through the use of const and restrict qualifiers, with as few volatile variables as possible. All of this translated into a 3x speedup of the repulsion step compared to the Leiden and Amsterdam University version. Since more than 80% of the time in Force Atlas 2 is spent on the repulsion, speeding up this part of the code was crucial to obtaining decent drawing times.

RAPIDS cuGraph Speedup

Previous work by Govert G. Brinkmann et al. showed that Force Atlas 2 can be efficiently parallelized on the GPU, obtaining high speedups over their classical C++ implementation. We benchmarked the two GPU versions on an NVIDIA V100 32GB GPU using CUDA 11. The total running time after 500 iterations is reported below.

Figure 2: Running time of cuGraph vs Leiden-Amsterdam University

For our experiments, we used datasets composed of real-world graphs ranging from fewer than 1K to more than 50M vertices and edges. Figure 3 summarizes the speedup obtained by cuGraph for each of these graphs.

Figure 3: Dataset characteristics and speedup achieved

On average, we achieve a 3x speedup for graphs with 1-5M vertices. On the largest graph of more than 50M vertices, we even reach a 6x speedup. Note that the speedup is lower for smaller workloads, since they underutilize the GPU. In the former implementation, when looking at the end-to-end pipeline, data preparation and memory transfers ended up taking even more time than the algorithm itself. RAPIDS cuDF and cuXfilter allow us to run the full visualization pipeline on the GPU without data transfers.

For a cyber graph of 706,529 vertices and 1,238,568 edges, cuGraph's Force Atlas 2 runs in 4.8s, while a pure Python implementation needs 3h43min to complete, a speedup of roughly 2788x (13,380s / 4.8s). This huge performance gap, giving almost instant results on the GPU, allows the user to perform many more experiments on graph analytics problems.

Memory Usage

With cuGraph, running analytics on a single GPU can typically scale up to 500M edges. As detailed in our previous blog, memory usage is mainly dominated by the ETL (extract, transform, and load) phases. We also saw that, for Force Atlas 2, storing the results has the smallest memory footprint of all our algorithms; the analytic's own usage is dominated by the optimized repulsion, which requires the allocation of several working arrays scaling to 2 to 4 times the number of vertices in the graph. When looking at the memory footprint of the europe_osm graph, a peak of 8.65GB was captured, which is around 27% of the V100's capacity. Since the memory usage in our case scales with the allocations made for the analytic, we expect Force Atlas 2 to be able to run on graphs of up to around 200M vertices.

Force Atlas 2 Inside a RAPIDS Workflow

This notebook illustrates how you can run Force Atlas 2 in just a few lines of code. You simply have to load your graph with cuDF, which places the data directly on the GPU.
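A minimal sketch of the loading step might look like this (the file name and column names are placeholders):

```python
import cudf
import cugraph

# Read the edge list straight into GPU memory with cuDF.
edges = cudf.read_csv("graph_edges.csv",
                      names=["src", "dst"],
                      dtype=["int32", "int32"])

# Build a cuGraph graph from the GPU DataFrame.
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")
```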

Following the existing Python interfaces, we added user parameters such as strong gravity mode and lin-log mode that produce different layouts depending on the use case. Calling Force Atlas 2 is straightforward.
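Continuing from the snippet above, the layout itself is a single function call; the parameter names below follow the cuGraph Python API at the time of writing, and only a few of the available options are shown:

```python
# Compute 2D positions for every vertex; the result is a cuDF DataFrame
# with 'x', 'y' and 'vertex' columns, kept on the GPU.
positions = cugraph.force_atlas2(
    G,
    max_iter=500,
    strong_gravity_mode=False,
    lin_log_mode=False,
    barnes_hut_optimize=True,   # use the O(V log V) approximated repulsion
)
```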

In just a few lines of code, we have generated 2D positions for every vertex in our graph. For rendering purposes, we opted to use Datashader, a tool well suited to visualizing very large datasets. Datashader does not accept the vertex positions as-is; instead, it uses a connected edge path format that we can create on the GPU using cuXfilter.
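One way to picture the connected edge path format is as the endpoints of every edge interleaved and separated by NaN rows, which Datashader's line rasterizer treats as breaks between segments. Here is a rough CPU-side illustration of that transformation (the notebook builds the equivalent structure on the GPU with cuXfilter); the column names match the placeholders used in the snippets above:

```python
import numpy as np
import pandas as pd

def connect_edges_cpu(positions: pd.DataFrame, edges: pd.DataFrame) -> pd.DataFrame:
    """Interleave the endpoints of every edge, separated by NaN rows, so a
    single Canvas.line call can rasterize all edges as one broken path."""
    idx = positions.set_index("vertex")
    xs, ys = [], []
    for s, d in zip(edges["src"], edges["dst"]):
        xs += [idx.at[s, "x"], idx.at[d, "x"], np.nan]
        ys += [idx.at[s, "y"], idx.at[d, "y"], np.nan]
    return pd.DataFrame({"x": xs, "y": ys})
```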

Now all that's left is to call datashade, which renders all the points. You will also notice that you can interact with the plot directly in your notebook. This can be enabled by setting up HoloViews, or as in the cuxfilter notebook, which integrates a dashboard viewer to explore your graph with colors and more.
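As a minimal rendering sketch, assuming the `positions` and `edge_paths` objects from the snippets above and using Datashader's API directly rather than the HoloViews datashade wrapper:

```python
import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=1000, plot_height=1000)

# Rasterize the NaN-separated edge paths as lines and the vertices as points,
# then shade and stack the two layers into the final image.
edge_layer = tf.shade(canvas.line(edge_paths, "x", "y"))
node_layer = tf.shade(canvas.points(positions.to_pandas(), "x", "y"))
image = tf.stack(edge_layer, node_layer)
```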

Learn more about the ongoing integration work between RAPIDS and HoloViews in a blog post from Plotly here.

Layout Quality

We validated the output of cuGraph's Force Atlas 2 using floating-point comparison for the exact version and the trustworthiness score for the approximated one. The visual results are similar to those of the other Python packages available.

Figure 4: Layout comparison of cuGraph (left) vs Python fa2 (right) on the cyber and netscience graphs.

Conclusion

We have seen how Force Atlas 2 benefits from GPU acceleration to draw very large graphs. We are now capable of processing graphs with millions of vertices and edges in just a few seconds. This is a big step for large-scale graph visualization, as this is, to our knowledge, the first open-source CUDA implementation available through a Python interface.

This work is the first of many for RAPIDS towards making large graph visualization easily accessible to anyone in the data science community.

Your feedback is very welcome; please let us know on GitHub or on our Google Group if you have any questions or suggestions related to graph visualization in RAPIDS.

References

Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, Mathieu Bastian: ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software (2014) https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0098679

Govert G. Brinkmann, Kristian F. D. Rietveld, Frank W. Takes: Exploiting GPUs for fast force-directed visualization of large-scale networks (2017) https://ieeexplore.ieee.org/document/8025312

Martin Burtscher, Keshav Pingali: An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm (2011) https://iss.oden.utexas.edu/Publications/Papers/burtscher11.pdf
