Accessible, Easy to Use, and More Performant: The RAPIDS Accelerator for Apache Spark v0.3 Release

Karthikeyan Rajendran
Published in RAPIDS AI · Jan 28, 2021

We’re excited to announce the third release of the RAPIDS Accelerator for Apache Spark!

Here’s What We’re Thrilled About

  1. RAPIDS Accelerator for Apache Spark is now integrated into AWS EMR and Google Cloud Dataproc. Check out the getting started guides.
  2. We’re faster than ever.
  3. We’re covering more workloads for you.

Please refer to the changelog and release notes for all the details.

Focus Areas for Release 0.3

This release focuses on improvements in three main areas: performance, functionality, and user experience. By making strides in all three, we have made the RAPIDS Accelerator for Apache Spark accessible and easy to use, all while improving already impressive performance. Consistent with the fundamental design of the NVIDIA Spark GPU acceleration solution, all of these benefits are transparent to the application and end user, with no code changes required. Let's dive into these three improvement areas.

Performance

GPUs bring serious performance to Data Science. The impetus for creating the RAPIDS Accelerator for Apache Spark was to bring that performance to the highly used, accessible toolset. By increasing performance and usability on top of GPUs, we make it easy to realize the benefits of accelerated computation, help data scientists and machine learning engineers be more efficient, and help enterprises improve their bottom line.

Like several other vendors in the industry, NVIDIA has used nearly 100 representative decision-support queries, derived from a popular data analytics benchmark suite, to evaluate performance and TCO benefits. This release marks significant progress: with v0.3 we can now run 19 of the queries with improved performance and TCO, compared to just 2 queries in v0.2.

The chart below compares elapsed time and cost for those 19 queries on CPU (blue), GPU with v0.2 (dark green), and GPU with v0.3 (light green); lower is better.

This is an example of NVIDIA's commitment to continuous performance improvement, the benefits of which accrue to users automatically over time.

CPU Hardware Config: Master: n1-standard-4, Slave: 4 x n1-standard-32 (128 cores, 480 GB RAM). GPU Hardware Config: Master: n1-standard-4, Slave: 4 x n1-standard-32 (128 cores, 480 GB RAM + 16 x T4). CPU Cloud Cost: $7.399/hr (based on standard pricing for 4 x n1-standard-32 plus Dataproc pricing). GPU Cloud Cost: $12.999/hr (based on standard pricing for 4 x n1-standard-32 + 16 x T4 plus Dataproc pricing).

Enhancements from v0.2 to v0.3

There were multiple features and enhancements that contributed to the improvement from v0.2 to v0.3. Some of these improvements came from:

  1. The ability to execute tasks concurrently on the GPU
  2. Improvements to parallelize data loading for small files
  3. Support for high-performance Apache Spark shuffle

The ability to execute tasks concurrently on the GPU

Prior to v0.3, Apache Spark SQL operations could not execute concurrently on the GPU. With the v0.3 release, we can schedule multiple Spark tasks concurrently on the GPU without requiring any changes to application code. (For an accessible technical explanation of the mechanism underlying this change, please read the detailed blog post about the per-thread default stream.)
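As a sketch, GPU task concurrency is controlled by a single plugin setting that can go in spark-defaults.conf or be passed via --conf. The key name below is our recollection of the plugin's configuration option, and the value of 2 is only an illustrative starting point; check the configuration reference for your release.

```properties
# Number of Spark tasks allowed to execute concurrently on each GPU.
# Higher values can improve GPU utilization at the cost of GPU memory.
spark.rapids.sql.concurrentGpuTasks=2
```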

Improvements to parallelize data loading for small files

We introduced improvements that parallelize data loading on the GPU for small Apache Parquet-formatted files (hereafter "Parquet files"). This utilizes both the CPU and the GPU and increases overall I/O throughput: multiple files are read in parallel on the CPU side while data is transferred to the GPU for the GPU Parquet reader. This is beneficial in most use cases. For instance, in one case where a query read 55,000 files with a total size of 14.5 GB from Azure blob storage, the time to read the Parquet data improved from about 11 minutes down to 4.5 minutes.
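A rough sketch of the knobs involved follows. The key names are our best recollection of the v0.3-era configuration options and may differ in your release, so verify them against the plugin's configuration reference before use.

```properties
# Size of the CPU-side thread pool used to read Parquet files in parallel
# while data is transferred to the GPU (illustrative value).
spark.rapids.sql.format.parquet.multiThreadedRead.numThreads=20
# Cap on how many small files a single task may read in parallel
# (illustrative value; tune for your storage system).
spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel=1000
```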

Support for high-performance Apache Spark shuffle

Processing nontrivial database queries on partitioned data sets requires moving some data between partitions; Apache Spark calls these move operations "shuffles," and they involve disk I/O, data serialization, and network I/O. The RAPIDS Accelerator for Apache Spark now includes an accelerated shuffle implementation that leverages UCX, an open-source framework for data-centric and high-performance applications, to optimize GPU data transfers: it keeps as much data on the GPU as possible and uses the fastest interconnect available between GPUs. We recommend enabling the RAPIDS Shuffle Manager on machines with multiple GPUs and NVLink connectivity, and on machines connected via RDMA (InfiniBand or RDMA over Ethernet), to take advantage of fast intra-node and inter-node GPU communication; see the documentation for more details.
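For illustration, enabling the accelerated shuffle looks roughly like the fragment below. The shuffle manager class name is versioned per Spark release (shown here for a hypothetical Spark 3.0.1 build), and both the class name and the keys should be checked against the shuffle documentation for your release.

```properties
# Swap in the RAPIDS shuffle manager (class name varies by Spark version).
spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager
# Enable the UCX-based transport for GPU-to-GPU transfers.
spark.rapids.shuffle.transport.enabled=true
# Example UCX transport selection (CUDA copy, CUDA IPC, and TCP fallback).
spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,tcp
```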

New Features and Functionality

In general, adding new features to the RAPIDS Accelerator for Apache Spark means more parts of a user's Spark application run on the GPU and fewer run on the CPU. Adding features thus translates directly into better performance for more applications.

Initial Support for ArrayType, StructType, and MapType

Nested data types such as Struct, Map, and Array are very common in today's big data ETL pipelines. We're introducing GPU support for reading Parquet files with nested data types and for accessing members of Struct data types, for example:

SELECT AVG(structF1.byteF) FROM test_table
SELECT AVG(arrayF1[0]) FROM test_table
SELECT AVG(mapF[123]) FROM test_table
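To make the three queries above concrete, here is a plain-Python sketch of what each one computes, modeling structs and maps as dicts and arrays as lists. The table contents are made up for illustration; this is not plugin code.

```python
# A tiny stand-in for test_table: each row has a struct, an array, and a map column.
test_table = [
    {"structF1": {"byteF": 10}, "arrayF1": [1, 2], "mapF": {123: 5}},
    {"structF1": {"byteF": 20}, "arrayF1": [3, 4], "mapF": {123: 7}},
]

def avg(values):
    """Arithmetic mean, mirroring SQL AVG over non-null values."""
    values = list(values)
    return sum(values) / len(values)

# AVG(structF1.byteF): average of a struct member.
avg_struct = avg(row["structF1"]["byteF"] for row in test_table)
# AVG(arrayF1[0]): average of an array element by index.
avg_array = avg(row["arrayF1"][0] for row in test_table)
# AVG(mapF[123]): average of a map value looked up by key.
avg_map = avg(row["mapF"][123] for row in test_table)
```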

Support for DecimalType

In Apache Spark, DecimalType is the data type representing java.math.BigDecimal values (commonly used in financial calculations). A decimal has a fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). We're introducing GPU support for DecimalType with precision <= 18 (i.e., values that fit in 64 bits) as a beta feature. It is disabled by default, so users need to enable the flag "spark.rapids.sql.decimalType.enabled". Decimal support is currently limited; in upcoming releases we will continue to tune performance and expand operator support for the decimal data type. For the detailed list of supported operations, please refer to the supported operators documentation.
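Enabling the beta feature is a one-line configuration change (the flag name comes from the text above; everything else about your job stays the same):

```properties
# Opt in to GPU DecimalType support (precision <= 18); off by default in v0.3.
spark.rapids.sql.decimalType.enabled=true
```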

Support for Lead/Lag for Window Functions

Window functions operate on a group of rows, referred to as a window, and calculate a return value for each row based on that group. Window functions are useful for processing tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of a row at a given position relative to the current row. The LAG() function provides access to a value stored in a row before the current row, and LEAD() provides access to a value stored in a row after it, for example:

SELECT lag(intF) OVER (PARTITION BY byteF ORDER BY shortF) FROM test_table
SELECT lead(intF) OVER (PARTITION BY byteF ORDER BY shortF) FROM test_table
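As a plain-Python sketch of the window semantics the two queries rely on (not the plugin's implementation, and with made-up row data): within each partition, rows are ordered, and lag/lead look a fixed offset backward or forward, yielding null (None) when the offset falls outside the partition.

```python
from itertools import groupby
from operator import itemgetter

def lag_lead(rows, part_key, order_key, value_key, offset=1):
    """Return (value, lag, lead) per row, mirroring
    lag/lead OVER (PARTITION BY part_key ORDER BY order_key)."""
    out = []
    # Sort by (partition, order) so groupby yields each window's rows in order.
    rows = sorted(rows, key=lambda r: (r[part_key], r[order_key]))
    for _, group in groupby(rows, key=itemgetter(part_key)):
        g = list(group)
        for i, r in enumerate(g):
            lag = g[i - offset][value_key] if i - offset >= 0 else None
            lead = g[i + offset][value_key] if i + offset < len(g) else None
            out.append((r[value_key], lag, lead))
    return out
```

Note that lag/lead never cross a partition boundary: the first row of each partition has no lag value and the last has no lead value.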

Support for Greatest/Least

The greatest() function returns the largest value among its arguments, and the least() function returns the smallest, for example:

SELECT least(byteF, shortF, intF) FROM test_table
SELECT greatest(byteF, shortF, intF) FROM test_table
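A quick plain-Python sketch of the semantics as we understand them from Spark's SQL function documentation (not plugin code): both functions skip null arguments and return null only when every argument is null.

```python
def greatest(*args):
    """Largest non-null argument, or None if all arguments are null."""
    vals = [a for a in args if a is not None]
    return max(vals) if vals else None

def least(*args):
    """Smallest non-null argument, or None if all arguments are null."""
    vals = [a for a in args if a is not None]
    return min(vals) if vals else None
```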

User Experience

For the v0.3 release, we've updated the documentation to cover supported operations and data types. On the documentation page, you can now clearly see which operations are supported, unsupported, or partially supported for each execution operator, data type, and input file format. We would love to hear your feedback and comments on our GitHub site.

In addition to this, we have added a getting started guide for AWS EMR. AWS EMR offers native integration for the RAPIDS Accelerator for Apache Spark. The current release of AWS EMR (6.2) ships with the 0.2 release of the RAPIDS Accelerator; we expect that the next release of AWS EMR will include the 0.3 release of the plugin.

Looking Forward in 2021

As we continue to expand the functionality and improve the performance of the RAPIDS Accelerator, we recognize that Apache Spark users not only run standard SQL and DataFrame operators but may also have custom operations that need to run via user-defined functions (UDFs). We want to address those functions as well when we consider accelerating Apache Spark jobs on the GPU. In this release, we have included an experimental Scala UDF compiler that can translate user-defined functions into Catalyst expressions that are already supported on the GPU. For a detailed list of supported operations in Scala UDFs, please refer to the documentation. We have enabled basic operations, such as arithmetic, logical, equality, math, and string operations, and our roadmap includes more operations that we are working to enable on the GPU.

In upcoming releases, we will continue to optimize the performance of the Scala UDF compiler and the UCX-based shuffle manager. In addition to these optimizations, we will continue to focus on expanding support for nested data types, and we will work on introducing new architectural improvements for the RAPIDS Accelerator. These will include:

  1. Support for Apache Spark 3.1.0, which will include stage-level scheduling and improved caching and persistence for GPU DataFrames.
  2. Support for NVIDIA® GPUDirect® Storage (GDS), which will enable a direct data path for direct memory access (DMA) transfers between GPU memory and storage. This direct path increases system bandwidth and decreases latency and utilization load on the CPU.
  3. A unified Apache Spark shuffle manager for CPU and GPU that leverages the fastest interconnect available in a cluster.

With an increasing number of data sources and ever-growing volumes of data to analyze, traditional CPU-based processing will soon become prohibitively expensive and slow. The mission of the RAPIDS Accelerator for Apache Spark is to transparently accelerate Spark applications while reducing costs. It is definitely too early to claim that all Spark-based ETL applications will be sped up, but as we continue to expand coverage and improve performance over time, the RAPIDS Accelerator will speed up the majority of big data applications that are either too expensive or too slow on the CPU.

Get Started and Give Feedback

Give the RAPIDS Accelerator for Apache Spark a try. You can easily access it on Databricks, Google Cloud Dataproc, or AWS EMR.

We are excited to get your feedback. Let us know if there are specific features that you use which are not GPU-accelerated. Please call out any issues with setting up or using the solution on GitHub.
