RAPIDS Makes GPUs More Accessible for Python users at the National Energy Research Scientific Computing Center

Nick Becker
Published in RAPIDS AI
May 14, 2020 · 8 min read

By: Rollin Thomas (NERSC), Nick Becker (NVIDIA), Laurie Stephey (NERSC)

Cabinet art for Perlmutter (Credit: LBNL)

This year, NERSC, the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory, will begin deploying its newest supercomputer, “Perlmutter.” Perlmutter will include a partition of thousands of nodes incorporating NVIDIA’s Ampere GPUs, making it NERSC’s first production system with GPUs (read the announcement from NERSC here). Much of NERSC’s simulation and deep learning workload will naturally migrate to Perlmutter’s GPUs to produce state-of-the-art accelerated science. But what is the picture like for data processing and analytics?

Part of being the primary scientific computing facility for the US Department of Energy’s Office of Science means that NERSC also supports a growing population of scientists who use large experimental and observational data (EOD) facilities. EOD facilities (like telescopes, particle accelerators, and fusion reactors) have blessed their user communities with opportunities for discovery, but have also presented them with the challenge of dealing with massive new data sets. NERSC has tracked a growing preference among these users for the abstraction power of high-level languages, software libraries, and programming environments like MATLAB, R, and, most especially, Python.

How can these users make the most of Perlmutter’s powerful NVIDIA GPUs to accelerate their path to insight without becoming expert CUDA programmers overnight?

This post is a window into NVIDIA’s partnership with NERSC to meet EOD users where they are, ensuring that Perlmutter is a productive engine for GPU-accelerated data processing and analytics for science through Python.

Enter RAPIDS

RAPIDS is a suite of open-source software libraries for running data science and analytics pipelines entirely on GPUs, unlocking potentially massive speedups over CPU-based workflows. Because the RAPIDS libraries mirror the APIs of popular Python libraries like pandas, NumPy, and scikit-learn, scientists can smoothly convert their existing code to run entirely on GPUs.
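As a rough illustration of that API consistency (with toy data and column names of our own, not from any NERSC workflow), the same dataframe code can target either library; here we fall back to pandas so the snippet runs without a GPU:

```python
# cuDF mirrors the pandas API, so the same code runs on a GPU when
# cuDF is available and on the CPU otherwise. Toy data is our own.
try:
    import cudf as xdf  # GPU dataframes (RAPIDS)
except ImportError:
    import pandas as xdf  # CPU fallback with a matching API

df = xdf.DataFrame({"sensor": ["a", "a", "b"], "reading": [1.0, 2.0, 3.0]})
means = df.groupby("sensor")["reading"].mean()
print(means)  # per-sensor means: a -> 1.5, b -> 3.0
```

The `try`/`except` import alias is a common pattern for writing one code path that is GPU-accelerated where RAPIDS is installed.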

To better support scientific users, the RAPIDS team at NVIDIA has been working closely with NERSC staff and users to both accelerate discovery and build the tools scientists need to test hypotheses faster. We’re excited to share preliminary results from two of our use cases, and give some insight into how we convert existing CPU workflows into RAPIDS workflows.

The first science use case from NERSC is an inelastic neutron scattering (INS) data analysis workflow, aimed at helping improve our understanding of material properties using neutrons. The second use case comes from a research project studying how applications use the high-speed network on NERSC’s current supercomputer, Cori, with the goal of improving interconnect efficiency on present and future systems.

Inelastic Neutron Scattering

This workflow comes from Claire Saunders, a researcher at Caltech. Claire is a user of the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory, an accelerator-based pulsed-neutron user facility operated by DOE for research and development involving neutrons. Claire uses SNS to measure atomic vibrations in single-crystal samples using INS: she scatters neutrons off of samples, taking measurements at various sample rotation angles, then reduces the data to a manageable size and converts it into physical quantities. While simulations (done at NERSC) are important for comparing to and interpreting the data, Claire shared her workflow for processing and analyzing the INS data itself with the RAPIDS team, and we came together to develop a GPU-accelerated version of her work.

We accelerated this code by leveraging CuPy, a GPU-accelerated array library with a NumPy-like API. We extracted two functions from the data processing and ported them; apart from the few lines changed from np.array to cp.array (mostly in the __init__ of the nxspe class), the code is unchanged from the original.

Note that we changed only the array-creation lines to create CuPy arrays; the lines that call NumPy functions were left as-is, thanks to NumPy’s array function protocol (`__array_function__`). This makes porting existing code very convenient.
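The pattern looks roughly like the sketch below (with hypothetical arrays of our own, since the original snippet is not reproduced here); only the array-creation backend changes, while unmodified NumPy calls dispatch to the GPU:

```python
import numpy as np

try:
    import cupy as cp  # GPU arrays with a NumPy-like API
    xp = cp
except ImportError:
    xp = np  # CPU fallback so the sketch runs without a GPU

# Only the array-creation lines reference the backend explicitly ...
angles = xp.linspace(0.0, np.pi, 5)
weights = xp.ones(5)

# ... while this line is identical to the CPU version: np.sin and np.dot
# see CuPy inputs and dispatch to CuPy via the array function protocol.
total = np.dot(np.sin(angles), weights)
print(float(total))  # 1 + sqrt(2), about 2.414
```

With CuPy installed, `angles` and `weights` live in GPU memory and the `np.sin`/`np.dot` calls execute on the GPU without any further code changes.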

After porting the full workflow, we obtained the following result for processing one file’s worth of experimental data: the RAPIDS version of the coordinate conversion was 17x faster overall than the CPU version, simply from a drop-in replacement of a few lines of code!

On one set of 154 files from Claire’s experiment, the RAPIDS version took about one minute, compared to 19 minutes for the CPU-based version. Since Claire runs this on hundreds of files every time she wants to test a new hypothesis, this speedup directly affects her research: she no longer needs to wait around 20 minutes to get results. These major improvements in turnaround could fundamentally change the way many scientists, including Claire, do their work.

Large-Scale Distributed Monitoring

This use case, shared by Taylor Groves in the Advanced Technologies Group at NERSC, aims to understand how NERSC applications actually use Cori’s interconnect. The goal is to learn:

  1. How does adaptive routing influence application performance?
  2. Are some applications more likely to experience congestion on the network?
  3. What is the typical latency experienced on the network?

There are thousands of counters per network switch and there are thousands of switches and network interface controllers all collecting data at 1 Hz. This results in hundreds of TB of time-series data that need to be analyzed quickly.
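To get a feel for that scale, here is a back-of-the-envelope estimate; the round counts below are our own assumptions, chosen only to match the “thousands … at 1 Hz” description, not actual Cori figures:

```python
# Back-of-envelope data volume. Every count here is an assumption for
# illustration; the post only says "thousands" of counters and switches.
counters_per_switch = 2_000      # assumed
switches_and_nics = 2_000        # assumed
bytes_per_sample = 8             # assumed: one 64-bit counter value
seconds_per_year = 365 * 24 * 3600

total_bytes = (counters_per_switch * switches_and_nics
               * bytes_per_sample * seconds_per_year)
print(f"~{total_bytes / 1e12:.0f} TB per year")
```

Even a few months of collection at this rate lands in the hundreds-of-terabytes range described above.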

While GPU libraries are often just a drop-in replacement, sometimes we need to restructure our code logic to be more column-oriented in nature to get the full benefit of GPUs and RAPIDS. This workflow was an example of this, and we’ll highlight one segment below.

A key portion of the original workflow applied a user-defined function (UDF) with nested loops to every row of the data, using Python’s built-in multiprocessing library to spread the work across CPU cores. The parallelize_on_rows helper in that workflow was just a thin wrapper around pandas and multiprocessing.
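Since the original snippet is not reproduced here, the sketch below shows the general shape of that pattern with hypothetical counter names; for brevity, the parallelize_on_rows stub applies the UDF serially rather than through a real multiprocessing Pool:

```python
import pandas as pd

# Hypothetical groups of related counters whose totals we want per row.
GROUPS = {"send": ["send_a", "send_b"], "recv": ["recv_a", "recv_b"]}

def add_group_totals(row):
    # Per-row UDF: one new "<name>_total" value per counter group.
    for name, cols in GROUPS.items():
        row[name + "_total"] = sum(row[c] for c in cols)
    return row

def parallelize_on_rows(df, func):
    # The original workflow chunked the frame and ran the UDF across
    # cores in a multiprocessing.Pool; the serial equivalent keeps this
    # sketch self-contained.
    return df.apply(func, axis=1)

df = pd.DataFrame({"send_a": [1, 2], "send_b": [3, 4],
                   "recv_a": [5, 6], "recv_b": [7, 8]})
out = parallelize_on_rows(df, add_group_totals)
print(out[["send_total", "recv_total"]])
```

The row-at-a-time Python loop is exactly the shape of work that a columnar GPU library can replace with vectorized reductions.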

At each iteration of the loop, the function creates new columns that capture the total sum of specific groups of columns. GPUs are amazing at these kinds of mathematical operations because the processing for each row is independent of every other row. But RAPIDS cuDF’s user-defined function API requires us to explicitly write the name of an index into every column in the dataframe, and since we have roughly a thousand columns, this becomes unwieldy. Instead, in the RAPIDS version, we restructured the function to use row-wise summation via the sum API, which also operates in parallel and provides a huge speedup compared to the CPU.
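In sketch form, with the same hypothetical counter names as before, the restructured version replaces the per-row UDF with one vectorized reduction per group (falling back to pandas here so the snippet runs without a GPU):

```python
try:
    import cudf as xdf  # GPU dataframes (RAPIDS)
except ImportError:
    import pandas as xdf  # CPU fallback with a matching API

# Hypothetical groups of related counters, as in the CPU version.
GROUPS = {"send": ["send_a", "send_b"], "recv": ["recv_a", "recv_b"]}

df = xdf.DataFrame({"send_a": [1, 2], "send_b": [3, 4],
                    "recv_a": [5, 6], "recv_b": [7, 8]})

for name, cols in GROUPS.items():
    # One parallel row-wise reduction per group replaces the Python
    # loop over rows; with cuDF this executes on the GPU.
    df[name + "_total"] = df[cols].sum(axis=1)

print(df[["send_total", "recv_total"]])
```

The loop now iterates over a handful of column groups instead of millions of rows, which is the column-oriented restructuring described above.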

The original workflow couldn’t actually scale to process all of the LDMS (Lightweight Distributed Metric Service) monitoring data due to limitations of Python’s built-in multiprocessing library. The RAPIDS version doesn’t share those limitations.

With these changes and similar ones across the rest of the workflow, the RAPIDS version, scaled out with Dask, is 191x faster than the original CPU workflow. The full dataset can now be analyzed, and information previously hidden in the data may come to light.

Summary and Future Work

Substantial improvements in scientific instrumentation over the past decade have enabled scientists to gather ever-increasing quantities of data to process, analyze, and understand, and the pressure to make sense of these data sets quickly, testing new hypotheses in near real time, is growing with them. To address this, NERSC and NVIDIA have teamed up to understand the kinds of issues NERSC data users encounter with GPUs in Python, to optimize key elements of the data analytics stack, and to teach NERSC users how to optimize their workflows with RAPIDS on Perlmutter.

Engineering teams at NVIDIA are working hard to improve the RAPIDS libraries to better serve the data science community. Partnering with NERSC helps the RAPIDS team identify key features and improvements needed to accelerate scientific discovery. We’ve only highlighted two use cases in this blog, but because of the development work resulting from the partnership, NERSC users and scientists across the world are now able to GPU-accelerate their Python code with minimal changes.

About the Authors

Rollin Thomas is a Data Architect in the Data and Analytics Services Group at NERSC. There he directs the Python support strategy, coordinates the NERSC Exascale Science Application Program for Data, and leads efforts to enable interactive supercomputing at NERSC through Jupyter. Prior to joining NERSC in 2015, he was a Staff Scientist in the Computational Research Division at Berkeley Lab. He has more than two decades of experience in scientific computing in astrophysics, covering both numerical simulations and experimental data processing workflows.

NERSC homepage | LinkedIn | Medium

Nick Becker is a Senior Software Engineer and Data Scientist on the RAPIDS team at NVIDIA, where his efforts are focused on building GPU-accelerated data science products. Nick has a professional background in finance, government, and technology. Prior to NVIDIA, he worked as a data scientist and technical lead at Enigma Technologies, a data science startup that has raised $130 million. Before Enigma, he conducted economics research and forecasting at the Federal Reserve Board of Governors, the central bank of the United States.

Webpage | Github | LinkedIn | Medium

Laurie Stephey is a Data Analytics Engineer at NERSC. She is working with the DESI (Dark Energy Spectroscopic Instrument) project to help make their large-scale data processing run efficiently at NERSC on both current systems (Cori) and future systems (Perlmutter). She is especially interested in how to port Python code to GPU architectures. In addition to high-performance computing, Laurie has a broad scientific background in plasma physics and fusion, musical acoustics, thin-layer deposition, and other projects.

NERSC homepage | LinkedIn
