Leveraging Spark 3 and NVIDIA’s GPUs to Reduce Cloud Cost by up to 70% for Big Data Pipelines

The PayPal Technology Blog
12 min read · Feb 21, 2024

By Ilay Chen and Tomer Akirav

At PayPal, hundreds of thousands of Apache Spark jobs run every hour, processing petabytes of data and requiring a high volume of resources. To handle the growth of machine learning solutions, PayPal requires scalable environments, cost awareness, and constant innovation. This blog explains how Apache Spark 3 and GPUs can help enterprises potentially reduce the cloud cost of Apache Spark jobs by up to 70% for big data processing and AI applications.

Our journey begins with a brief introduction to Spark RAPIDS, an accelerator for Apache Spark that leverages GPUs to speed up processing via the RAPIDS libraries. We will then review PayPal's CPU-based Spark 2 application and our upgrade to Spark 3 and its new capabilities, explore the migration of our Apache Spark application to a GPU cluster, and describe how we tuned the Spark RAPIDS parameters. Finally, we will discuss some challenges we encountered and the benefits of the changes.

Libra scales in the cloud, generated by AI: the computational-resources equilibrium of a big CPU cluster and a small GPU cluster

Background

GPUs are everywhere, and their parallelism characteristics are perfect for processing AI and graphics applications, among other things. For those unfamiliar: what makes GPUs different from CPUs, computation-wise, is that CPUs have a limited number of very powerful cores, whereas GPUs have thousands, or even tens of thousands, of relatively weak cores that work together very well. PayPal has been leveraging GPUs to train models for some time now, so we decided to evaluate whether the parallelism of the GPU can be helpful for processing big data applications based on Apache Spark.

In our research, we encountered NVIDIA's Spark RAPIDS open-source project. It serves many purposes; we focused on its cost-reduction potential, because enterprises like PayPal spend heavily on running Spark jobs in the cloud. Using Spark with GPUs isn't common in the industry yet, but according to the findings described in this blog, the potential benefits could be enormous.

What is Spark RAPIDS?

Spark RAPIDS is a project that enables the use of GPUs in a Spark application. NVIDIA's team adapted Apache Spark's design to harness the power of GPUs, which is especially beneficial for large joins, group-by aggregations, sorts, and similar operations. Spark RAPIDS can boost the performance of certain workloads; we'll cover how to identify them later in the blog. You can review the documentation here for more details.

There are a few reasons to use Spark RAPIDS to accelerate big data processing with GPUs. GPUs have their own execution environment and programming languages, so Python/Scala/Java/SQL code can't easily run on them; the code must be translated to a GPU programming language, and Spark RAPIDS does this translation transparently. Another notable design change Spark RAPIDS made is how tasks are handled within each stage of the job's Spark plan. In pure Spark, every task of a stage is sent to a single CPU core in the cluster, so parallelism exists only at the task level. In Spark RAPIDS, the parallelism is also intra-task: tasks run in parallel, and the data within each task is processed in parallel as well. The GPU is a powerful compute processor, which gives us an incentive to make the job more compute-bound, i.e., to work with large partitions.
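To make this concrete, here is a minimal sketch of the configuration that turns the accelerator on, assuming the Spark RAPIDS jar is already available on the cluster (the values are illustrative, not our production settings):

"spark.plugins" : "com.nvidia.spark.SQLPlugin"
"spark.rapids.sql.enabled" : "true"
"spark.executor.resource.gpu.amount" : "1"

With the SQL plugin loaded, supported operators in the Spark plan run on the GPU, while unsupported operators transparently fall back to the CPU.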

Task-level parallelism vs. data-level parallelism, provided by NVIDIA

For more information and thorough explanations, we recommend reading NVIDIA’s book, Accelerating Apache Spark 3.

Getting Started

Our initial experiment with Spark RAPIDS was successful in the PayPal research environment, an open environment with web access but limited resources and no production data. The next step was to bring the accelerator to production in order to measure it on real production applications.

According to the Spark RAPIDS documentation, not all jobs are a good fit for this accelerator, so we worked on finding the most relevant ones. We started with a Spark 2 (CPU cluster) job that handles large amounts of data (multiple inputs of ~10TB each), executes SQL operations on exceptionally large datasets, uses intense shuffles, and requires a fair number of machines. The job was predicted to have a high success rate based on NVIDIA's Qualification Tool, which analyzes Spark events from CPU-based Spark applications to help quantify the expected acceleration from migrating a Spark application to a GPU cluster.

As explained above, we understood that for the GPU to be well-leveraged, we had to manipulate our Spark job to work with large partitions. The objective is to make our queries and operations computation-bound, rather than I/O- or network-bound, thereby utilizing the GPU effectively.

To make our job work with large partitions, we changed two parameters. The first is the AQE (Adaptive Query Execution) parameter; AQE is a new optimization technique in Spark 3 that, among other things, adjusts the number of partitions in a shuffle stage so that each partition reaches a target size. The second is spark.sql.files.maxPartitionBytes, which controls the size of input partitions. The number of partitions in those shuffle/input stages affects many downstream stages as well.
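Concretely, these are the two knobs, shown here with the values we discuss below (a sketch; the right values are workload-dependent):

"spark.sql.files.maxPartitionBytes" : "2GB"
"spark.sql.adaptive.advisoryPartitionSizeInBytes" : "1GB"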

For the baseline run, we did not set the spark.sql.files.maxPartitionBytes parameter, so the Spark plan used the default value of 128MB. Let us see what the original stage that reads the large input looks like in the Spark UI:

As you can see, we get 9.5TB of data as input, and Spark divides it into ~185,000 partitions(!), which means every partition is around 9.5TB / 185,000 ≈ 50MB. The input files are around 1GB each, so it makes no sense to split each file into ~20 different partitions in the Spark cluster. This fragmentation causes significant network-communication overhead and results in longer latency at this stage.

Now, after setting the spark.sql.files.maxPartitionBytes parameter to 2GB (where we manipulate Spark to read larger input partitions and thus work with larger partitions in the next stages), let us see how the stage was affected:

Our 9.5TB was distributed across 10,000 partitions, nearly 20 times fewer than in the baseline run, and the stage's total time decreased to 40 minutes, a 30% reduction in runtime.

Now, let us look at all the heaviest input stages of our baseline run, where spark.sql.files.maxPartitionBytes is set to default:

After setting spark.sql.files.maxPartitionBytes to 2GB:

As we can see, the change lowered the number of tasks in the input-processing stages; this simple parameter change reduced the runtime of these stages by more than 20 minutes.

Spark 3 and AQE

To migrate our job to Apache Spark 3, a fair number of steps had to be taken. We had to update some syntax in our code, and every jar in our infrastructure and applications had to be recompiled with an updated Scala version. You can review the official migration guide.

Spark 3 added the ability to use GPUs and introduced the AQE optimization technique. As mentioned above, the goal is to manipulate Spark into working with large partitions, which means setting the AQE advisory partition size to at least 1GB (or reducing spark.sql.shuffle.partitions). For a Spark application to work with partitions of 1GB, these properties need to be configured:

"spark.sql.adaptive.enabled" : "true"
"spark.sql.adaptive.advisoryPartitionSizeInBytes" : "1GB"

As we can see below, in our use case, this kind of practice is beneficial in runtime terms:

A shuffle stage in our baseline run (no AQE):

A shuffle stage with AQE:

After tuning the candidate job to work with large partitions, we checked the cluster's utilization, saw that it was not fully utilized, and decided to try reducing the number of machines the application consumes. The baseline job ran with 140 machines; after tuning Spark and the cluster nodes, we ended up with 100 fairly utilized machines. This change only slightly affected the runtime of the job, but dramatically reduced the cost!

The intermediate result:
We cut ~20% of our application runtime and ~30% of our resources, resulting in a ~45% cost reduction!

As an example, if the initial cloud usage cost was 1,000 PYUSD, we would now potentially stand at around 550 PYUSD (0.8 × 0.7 ≈ 0.56 of the original cost)!
Chart of CPU runs

Overall, our original intention was to work with large partitions solely to benefit the GPUs, but as we can see, there is a significant performance boost even before using Spark RAPIDS, which is exciting!
(Disclaimer: This practice does not bring the same results for all jobs. It depends on the data and the operations you perform on it.)

So far, we had only prepared our job to be suitable for Spark RAPIDS and GPUs. Now the new challenges began: migrating to a GPU cluster, learning new tuning concepts, troubleshooting, and optimizing GPU usage.

Migration to GPU Cluster

The GPU migration included enabling the Spark RAPIDS init scripts, copying all their dependencies into PayPal's production environment, supporting GPU parameters in our internal infrastructure, learning the GPU cluster features of our cloud vendor, and more.
(Disclaimer: These days, cloud vendors release new, custom images with a built-in instance of Spark RAPIDS, so this work can be avoided.)

After running some simple jobs to make sure we had created a stable, reliable infrastructure where the GPU clusters run Spark RAPIDS as expected, we deep-dived into running our candidate production application. Thanks to the Spark RAPIDS documentation, we triaged the few runtime errors we encountered while tuning it for our needs. Let us quickly cover two issues that helped us better understand Spark RAPIDS tuning:

Could not allocate native memory: std::bad_alloc:
RMM failure at: arena.hpp:382: Maximum pool size exceeded

This error means that the GPU memory pool was exhausted. To resolve the issue, some pressure needs to be released from the GPU's memory. After reviewing the literature, it was clear that some configurations are critical for each job. For example:
spark.rapids.sql.concurrentGpuTasks — the number of tasks that the GPU handles concurrently.

Intending to maximize the performance of our execution, we aimed to run as many tasks in parallel as possible. We were over-ambitious at first, set this parameter too high, and immediately got the above error. This happened because we use Tesla T4 GPUs, which have only 16GB of memory. As a check, we set the spark.rapids.sql.concurrentGpuTasks parameter to 1 and observed no memory errors. To utilize our resources properly, we had to find the sweet spot of the GPU concurrency parameter. To find it, we looked at the GPU utilization metrics, which we explain later in the blog, and aimed for utilization of around 50%, as advised by NVIDIA's team, to keep a fair balance between GPU computation and its communication/data transfers with main memory. In our case, after some trial and error, we settled on running 2 tasks at a time, meaning spark.rapids.sql.concurrentGpuTasks = 2.
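For reference, the value we landed on, expressed in the same property style as above (specific to our 16GB T4s; the sweet spot depends on the GPU's memory and the workload):

"spark.rapids.sql.concurrentGpuTasks" : "2"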

Another interesting issue we encountered concerned runtime performance and stability. After reducing the number of machines in our cluster from 140 to 30, our Spark job was slower than expected and occasionally failed with the following error:

java.io.IOException: No space left on device

We looked deeper into our nodes and noticed that when we added the GPUs to the machines, we solved the computation bottleneck, but the "pressure" moved to the local SSDs. This is because our GPUs, with their low memory capacity, tend to spill memory onto local disks, and the fact that our Spark plan uses large partitions adds to the disk spill. Originally, when each node had 4 SSDs (of 375GB each), our job was slower than expected and sometimes even failed. To overcome this, we doubled the number of SSDs to 8, which gave us stable results and better performance. Adding local SSDs is relatively cheap with cloud vendors, so this solution didn't meaningfully affect our overall cost.
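A related note: Spark spreads shuffle and spill files across all directories listed in spark.local.dir, so pointing it at every local SSD mount lets the disks share the spill load. A sketch with hypothetical mount paths (depending on the platform, the cluster manager or vendor image may set this for you):

"spark.local.dir" : "/mnt/ssd0,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3"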

All interactions with local SSDs are much slower than main memory access. A critical parameter for this case is:
spark.rapids.memory.host.spillStorageSize — the amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disks.

Increasing the spill storage parameter to 32GB decreased our job’s runtime.

Spark RAPIDS Optimizations: Tips and Insights

  • Choosing NVIDIA’s Tesla T4 GPU: Among NVIDIA’s GPUs, we found that the Tesla T4 generally has the best performance/price ratio for this kind of computation; it was recommended to us by NVIDIA’s team for the purpose of cost reduction. (Disclaimer: The new L4 GPU may give better results.)
  • Considering memory overhead: Keep in mind that the GPU does not work with the executor’s memory, but with off-heap memory, hence we have to guarantee enough memory overhead for each executor. We set the memory overhead to 16GB.
  • Tuning spark.task.resource.gpu.amount: This parameter limits the number of tasks that are allowed to run concurrently on an executor, whether those tasks are using the GPU or not. At first we were greedy and tried to assign a lot of tasks to each executor. It slowed the stage’s runtime because of excessive I/O and spilling. In our case, we found that 0.0625 (1/16) was a good spot.
  • Using spark.rapids.memory.pinnedPool.size: Pinned memory refers to memory pages that the OS will keep in system RAM and will not relocate or swap to disk. Using pinned memory significantly improves performance of data transfers between the GPU and host memory. We set this parameter to 8GB.
  • Configuring NVMe local SSDs: The disks in the Spark RAPIDS cluster were configured to use the NVMe protocol, resulting in a 10% speedup. (The Spark settings from this list are consolidated in the sketch below.)
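Putting it all together, the tuning profile we converged on for this job looked roughly like this, collecting the values mentioned above (they fit our T4-based cluster; your sweet spots will differ):

"spark.rapids.sql.concurrentGpuTasks" : "2"
"spark.task.resource.gpu.amount" : "0.0625"
"spark.executor.memoryOverhead" : "16g"
"spark.rapids.memory.pinnedPool.size" : "8g"
"spark.rapids.memory.host.spillStorageSize" : "32g"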

With stronger compute power, we allowed ourselves to challenge the cluster and reduce the number of machines further. After some trial and error, we settled on a GPU cluster of 30 machines, each with 32 cores, 120GB RAM, 8 SSDs, and 2 Tesla T4 GPUs, with the job lasting 1.3 hours.

Spark RAPIDS Final Tuning

GPU Utilization

Our cloud vendor provided a tool/agent that extracts metrics such as GPU utilization and GPU memory from GPU VM instances. This allowed us to monitor the usage of our GPUs, which is crucial to identify underutilized GPUs and optimize our workloads.

GPU utilization graph

Final Cost Comparison

Below is a summary of our research findings. As an example, consider a job that costs 1,000 PYUSD: Spark 3 with GPUs reduces that cost to 300 PYUSD. Depending on the configuration, you can enjoy potential cost savings of up to 70% for processing large amounts of data using GPU clusters.

Final Cost Comparison
The price of the machine’s hardware is factored into the cost calculation

Key Learnings

  • GPUs can be effectively leveraged not only for training AI models, but for big data processing as well.
  • Spark jobs that consume large amounts of data to perform certain SQL operations on large datasets are good candidates to be accelerated with Spark RAPIDS. Their eligibility can be validated with NVIDIA’s Qualification Tool.
  • Certain workloads benefit from being compute-bound, which can be achieved by manipulating the Spark job to work with large partitions, via spark.sql.files.maxPartitionBytes and the AQE parameters.
  • Leveraging Spark 3 with GPUs and Spark RAPIDS can significantly reduce your cloud costs for eligible workloads.

Thoughts for the Future

We see great potential in running Spark RAPIDS on an autoscaling GPU cluster. This practice may significantly reduce the cost of large GPU machines, thanks to their lower spot prices compared to permanent (on-demand) instances.

Acknowledgments

Thanks to the significant contributions of Lena Polyak, Neta Golan, Roee Bashary, and Tomer Pinchasi for the project’s success. Thanks so much to NVIDIA’s Spark RAPIDS team for supporting us.
