Relentlessly Improving Performance
NVIDIA DGX A100 640GB Systems & BlazingSQL Provide Big Value in a Small Space
Introduction
In the two years since its introduction, the RAPIDS team has been laser-focused on bringing the performance of GPUs to the Python data science ecosystem. If performance is a primary goal, it is critical to be able to measure it and how it changes over time. Benchmarks help us understand the impact of improvements (and the occasional regression!) in both software and hardware. A number of internal benchmarks have been used to measure RAPIDS over the years. In the last year, however, we have focused on building a robust, repeatable, realistic benchmark that consistently tests RAPIDS against an end-to-end Big Data workflow at a 10+ terabyte (TB) scale.
Today, we want to explain what we are using to benchmark RAPIDS and how our results have progressed. We also want to showcase what RAPIDS can do on the newly announced NVIDIA A100 80GB GPU. Spoiler alert: the new A100 80GB blows away our previous numbers, doing the same amount of work with roughly half the number of nodes. It's an impressive piece of hardware, ideally suited to data science workloads.
Summary of GPU Big Data Benchmarks in 2020
Our benchmark, which we are calling the GPU Big Data Benchmark (GPUbdb), uses 10 TB of simulated data designed to mimic real data from a large retail or finance company. It comprises a mix of structured and unstructured data, requiring large-scale ETL, natural language processing, and machine learning. Most importantly, the benchmark is evaluated "end-to-end," covering everything from loading data to writing output files: data starts on disk, is read into GPUs, analytics are performed, and the result is written back out to disk. This makes the benchmark directly relevant to the real-world workflows many enterprises run. Great, so how does RAPIDS look on this benchmark?
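Before we get to the numbers, here is a rough sketch of what that disk-to-GPU-to-disk pattern looks like with dask_cudf. The paths and column names below are hypothetical placeholders, and the actual benchmark queries involve far more ETL, natural language processing, and machine learning than a single aggregation.

```python
# A minimal sketch of the disk -> GPU -> disk pattern the benchmark measures.
# The file paths and column names are hypothetical placeholders.
import dask_cudf

# Read raw data from disk directly into GPU memory across the cluster
sales = dask_cudf.read_parquet("/data/store_sales/*.parquet")

# A representative ETL step: total revenue per item, computed on GPUs
revenue = sales.groupby("item_id")["net_paid"].sum().reset_index()

# Write results back to disk -- the clock only stops once output lands
revenue.to_parquet("/output/revenue_by_item/")
```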
May 2020 — NVIDIA DGX-1 + RAPIDS + Dask
At GTC Spring 2020, NVIDIA CEO Jensen Huang announced preliminary results: the entire GPUbdb benchmark workload completed in under 30 minutes on 16 NVIDIA DGX-1 nodes using 128 NVIDIA V100 GPUs. Our implementation relied primarily on GPU DataFrame analytics at scale, provided by RAPIDS 0.13 and PyData tools like Dask, Numba, and CuPy, among others. We relied on UCX to take advantage of NVIDIA NVLink, the high-speed GPU-to-GPU interconnect. The cost of the cluster was approximately $2M.
Setup:
- Scale Factor — 10TB
- Systems: 16x NVIDIA DGX-1
- Hardware: 128 total NVIDIA V100 GPUs with 4 TB of total GPU memory, connected locally over NVIDIA NVLink and node-to-node via NVIDIA Mellanox InfiniBand networking.
- Software: RAPIDS v0.14, Dask v2.16.0, UCX-Py v0.14
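For a sense of what this setup looks like in Python, the sketch below spins up a single-node GPU Dask cluster with UCX enabled via dask-cuda. The flags and pool size are illustrative rather than the exact benchmark configuration; the real runs used dask-cuda workers spread across all 16 DGX nodes, with InfiniBand carrying node-to-node traffic.

```python
# A minimal, single-node sketch of bringing up a GPU Dask cluster with UCX.
# These flags and sizes are illustrative, not the benchmark configuration.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    protocol="ucx",        # use UCX instead of TCP for worker communication
    enable_nvlink=True,    # route GPU-to-GPU transfers over NVLink
    rmm_pool_size="30GB",  # pre-allocate an RMM memory pool on each GPU
)
client = Client(cluster)   # work submitted via this client runs on the GPU workers
```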
June 2020 — NVIDIA DGX A100 320GB + RAPIDS + Dask
In June 2020, NVIDIA again announced breakthrough performance, completing the GPUbdb benchmark in under 15 minutes (half the time) using 16 NVIDIA DGX A100 320GB nodes and the newly released RAPIDS 0.14 software. Using the same number of nodes with the (then new) A100 GPUs, we were able to double our performance. Improvements to both hardware and software made this huge leap possible. The cluster cost was approximately $3.2M, higher than the previous result on 16 DGX-1 systems, but the increase in performance led to a lower total cost of ownership (TCO) than the DGX-1 solution.
Setup:
- Scale Factor — 10TB
- Systems: 16x NVIDIA DGX A100 320GB
- Hardware: 128 total NVIDIA A100 GPUs with 5 TB of total GPU memory, connected locally over NVIDIA NVLink and NVIDIA NVSwitch and node-to-node via NVIDIA Mellanox InfiniBand networking.
- Software: RAPIDS v0.15, Dask v2.17.0, UCX-Py v0.15
However, most of our code relied on DataFrames. While DataFrame APIs are great, many people are more familiar with SQL. We knew that implementing this benchmark primarily in SQL would be critical to demonstrating that the thousands of SQL queries that companies run on CPU clusters would be faster and cheaper on GPUs. Enter BlazingSQL.
October 2020 — NVIDIA DGX A100 320GB + RAPIDS + Dask + BlazingSQL
BlazingSQL is a high-performance distributed SQL engine in Python built on RAPIDS. The BlazingSQL implementation of this benchmark raised the bar even higher, completing the run in under 12 minutes with just 10 DGX A100 320GB nodes. That's 20% faster performance at only about 60% of the cost (approximately $2M for the cluster).
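To give a flavor of the programming model (this is not the benchmark code itself), here is a minimal BlazingSQL sketch. The table name, file path, and query are hypothetical; in distributed mode, BlazingContext is handed a Dask client so queries run across the whole cluster.

```python
# A minimal BlazingSQL sketch; table name, path, and query are hypothetical.
from blazingsql import BlazingContext

bc = BlazingContext()  # pass a Dask client here to run distributed

# Register a Parquet file on disk as a SQL table
bc.create_table("store_sales", "/data/store_sales.parquet")

# Run SQL on the GPU; the result comes back as a cuDF DataFrame
gdf = bc.sql("""
    SELECT item_id, SUM(net_paid) AS revenue
    FROM store_sales
    GROUP BY item_id
    ORDER BY revenue DESC
    LIMIT 10
""")
print(gdf)
```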
Setup:
- Scale Factor — 10TB
- Systems: 10x NVIDIA DGX A100 320GB
- Hardware: 80 total NVIDIA A100 GPUs with 3.2 TB of total GPU memory
- Software: RAPIDS v0.16, Dask v2.30.0, BlazingSQL v0.16
As excited as we were about these results, we were nowhere near stopping.
November 2020 — DGX POD with DGX A100 640GB + RAPIDS + Dask + BlazingSQL
At Supercomputing 2020 (SC20), NVIDIA announced a new version of the A100 GPU, increasing per-GPU memory from 40GB to 80GB. This is huge for data science workloads, which are commonly memory intensive and often spike in memory usage during intermediate steps. While RAPIDS has many tools to manage larger-than-memory workloads, there is no substitute for simply having more memory.
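One of those tools is dask-cuda's spilling mechanism: once a worker's GPU memory use crosses a configurable threshold, data is moved to host memory and brought back when needed. The sketch below shows where that threshold is set; the limits are illustrative, not the settings used for these runs.

```python
# A minimal sketch of dask-cuda's GPU-to-host spilling threshold.
# The limits below are illustrative, not the settings used for these runs.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    device_memory_limit="70GB",  # spill GPU data to host memory above this
    memory_limit="200GB",        # host memory limit for each worker process
)
client = Client(cluster)
```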
Based on this release, we're excited to show a step-function improvement in our performance and cost savings. Using BlazingSQL, RAPIDS, Dask, CuPy, and Numba on a single DGX POD with 6 DGX A100 640GB nodes, we completed the benchmark in under 11 minutes. That's faster still, while reducing cost by another 10% (approximately $1.8M for the cluster). The larger memory capacity lets us spike memory usage higher during key ETL operations, dramatically reducing the need to spill data from GPU memory to CPU memory.
Setup:
- Scale Factor — 10TB
- Systems: 6x NVIDIA DGX A100 640GB
- Hardware: 48 total NVIDIA A100 GPUs with 3.8 TB of total GPU memory
- Software: RAPIDS v0.16, Dask v2.30.0, BlazingSQL v0.16
Wrapping Up
These results are exciting because they give us not only a clear path to testing RAPIDS performance more effectively, but also a record of our improvement over time. Progressing from V100 32GB to A100 80GB GPUs, we've gone from completing the benchmark in under 30 minutes on an approximately $2M system with RAPIDS 0.14 to under 11 minutes on an approximately $1.8M system using BlazingSQL on RAPIDS 0.16.
The A100 80GB is an impressive piece of hardware, and its additional memory makes it ideally suited to the sorts of high-performance data analytics RAPIDS is designed to perform. A six-node, single-rack DGX POD with DGX A100 640GB can run 10TB workloads at blazing speed. Just as importantly, the RAPIDS software is becoming more and more efficient. By considering both software and hardware holistically, the RAPIDS team is making the vast potential of accelerated computing accessible to data practitioners across industries and institutions.