Measuring Graph Analytics Performance

The Diverse Landscape of Graph Analytics Requires a Comprehensive Benchmark

Henry Gabb

Published in

Intel Analytics Software

6 min readFeb 7, 2020

What Is Graph Analytics And Why Does It Matter?

A graph is a good way to represent a set of objects and the relations between them (Figure 1). Graph analytics is the set of techniques to extract information from connections between entities.

For example, graphs analytics can be used to:

Get recommendations among friends in a social network
Find cut points in a communication network or electrical grid
Determine drug effects on a biochemical pathway
Detect robocalls in telecommunications network
Find optimal routes between locations on a map

Graph analytics has been getting a lot of attention lately, possibly because Gartner listed it among the top 10 data and analytics technology trends for 2019:

The application of graph processing and graph DBMSs will grow at 100 percent annually through 2022 to continuously accelerate data preparation and enable more complex adaptive data science.
(Source: Gartner Identifies Top 10 Data and Analytics Technology Trends for 2019)

Intel has a long history of leadership in graph analysis. For example, Intel coauthored the GraphBLAS specification to formulate graph problems as sparse linear algebra. Though the GraphBLAS API was just published in 2017, the initial proposal and manifesto were published more than 10 years ago. Today, the same industry and academic partnership is coauthoring the forthcoming LAGraph specification for a library of graph algorithms. Intel was also selected by DARPA to develop a new processor to handle large graph datasets (see “DARPA Taps Intel for Graph Analytics Chip Project”). We will continue to innovate and push the graph analytics envelope.

Benchmarking Graph Analytics Performance

Graph analytics is a large and varied landscape. Even the simple examples in Figure 1 show differing characteristics. For example, some networks are highly connected; others are sparser. A network of webpages exhibits different connectivity than a network of Twitter users, where some users have millions of followers while most have only a few. Consequently, no single combination of graph algorithm, graph topology, or graph size can adequately represent the entire landscape.

Therefore, we use the GAP Benchmark Suite from the University of California, Berkeley to measure graph analytics performance. GAP specifies six widely-used algorithms (Table 1) and five small- to medium-sized graphs (Table 2). Each graph has different characteristics, which is important because optimizations that work well for one graph topology may not work well for others. For example, the Road graph is relatively small, but its high diameter can cause problems for some algorithms. Consequently, the 30 GAP data points provide good coverage of the graph analytics landscape.

Table 1. GAP measures the performance of six common graph analytics algorithms.

Table 2. GAP uses five graphs of varying size and topology to give a more complete picture of graph analytics performance.

GAP also has the advantages of being a clearly-defined, objective, off-the-shelf benchmark. It does not require special hardware or software configurations, so it is easy to run.

Here are the steps that I took to run the benchmark:

Download the GAP package.
Run make in the gapbs-master subdirectory to build the reference implementations for the six graph analytics kernels. For now, I’m just using the default GAP build parameters (the GNU C++ compiler with the -std=c++11 -O3 -Wall -fopenmp options). My system had the GNU v7.4.0 compiler installed.
In the same directory, run make -f benchmark/bench.mk bench-graphs to download or generate the five benchmark graphs and convert them to more efficient input formats.
The GAP reference implementations are parallelized with OpenMP, so it is important to set the number of threads. Otherwise, GAP will use all available cores, even if this doesn’t give the best parallel efficiency. The export OMP_NUM_THREADS=32 command sets the number of OpenMP threads to 32, for example.
Finally, run make -f benchmark/bench.mk bench-run to launch the benchmarks and generate results (Table 3).

Table 3. Compute times (in seconds, lower is better) for GAP running on a two-socket Intel® Xeon® system [1]. GAP was run with 1, 8, 16, 24, 32, 48, 64, and 96 threads. Best performance is shown for each test.

I tried to generate NVIDIA V100 comparative data, but ran into several technical barriers:

Only one of the GAP graphs (Road) fits in the memory of a single V100.
Only one of the GAP algorithms (PR) in RAPIDS cuGraph can use the aggregate memory of multiple V100s.
The current version of cuGraph (v0.9.0) does not provide a BC implementation.
The cuGraph APIs and documentation do not expose certain implementation details or algorithm parameters. For example, the multi-V100 PR implementation in cuGraph does not provide a convergence tolerance parameter, which makes an apples-to-apples comparison to the GAP results difficult.

Consequently, it’s only possible to run nine of the 30 GAP tests (Table 4). As you can see, where GAP can run on both architectures, the Intel Xeon processors outperformed the V100 on most tests, even when the graph is small enough to fit in GPU memory (i.e., Road) or when the algorithm can use multiple GPUs (i.e., PageRank). Also, total cost of ownership (TCO) clearly favors Intel Xeon processors for graph analytics (Table 5).

Figure 4. Compute times (in seconds, lower is better) for cuGraph running on an AWS EC2 p3.16xlarge instance (eight V100 GPUs connected via NVLink) [2]. All Road tests used a single V100. Twitter and Web tests used eight V100s. Note that the multi-V100 PR in cuGraph does not provide a convergence tolerance parameter so the default parameters were used for these tests. Kron and Urand tests failed with a “thrust::system::system_error.” DNR = Does not run because of insufficient memory. NA = Not available in cuGraph v0.9.0.

These results highlight the advantages of a benchmark suite that tests multiple algorithms on different graph topologies. For example, SSSP implementations typically allow users to specify a delta stepping parameter to improve performance for high-diameter graphs. The cuGraph implementation provides no such option, so its internal defaults may be suboptimal for a high-diameter graph like Road. The same could also be true for the cuGraph CC implementation.

Table 5. Relative TCO of the Xeon and V100 benchmark systems. Note that the AWS price comparison is only an approximation because no EC2 instances exactly match the Xeon-based benchmarking system, but the m4.16xlarge instance is similar.

Conclusions

I can probably improve the GAP results on the Intel Xeon processor platform (Table 3) with a few simple changes like using the Intel compiler and aggressive optimization and vectorization, tweaking the thread placement and affinity, experimenting with the numactl utility, adjusting page sizes, etc. [5, 6], but that’s not the purpose of this blog. My goal is to show how easy it is to get comprehensive, objective, and reproducible graph analytics performance data without obfuscation or resorting to benchmarking tricks.

References

Processor: Intel Xeon Gold 6252 (2.1 GHz, 24 cores), HyperThreading enabled (48 virtual cores per socket); Memory: 384 GB Micron DDR4–2666; Operating system: Ubuntu Linux release 4.15.0–29, kernel 31.x86_64; Software: GAP Benchmark Suite (downloaded and run September 2019).
Processor: NVIDIA Tesla V100; Memory: 16 GB; Operating system: Ubuntu Linux release 4.15.0–1047-aws, kernel 49.x86_64; Software: RAPIDS v0.9.0 (downloaded and run September 2019).
Source: https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/ (accessed September 2019) and https://ark.intel.com/content/www/us/en/ark/products/192447/intel-xeon-gold-6252-processor-35-75m-cache-2-10-ghz.html (accessed September 2019).
Source: https://aws.amazon.com/ec2/pricing/on-demand/ (accessed September 2019).
See Boosting the Performance of Graph Analytics Workloads if you’re interested in tuning GAP.
See Measuring the Impact of NUMA Migrations on Performance if you’re interested in experimenting with the numactl utility.

This article was originally published in The Parallel Universe (issue 39).