How Using Spark Best Practices Significantly Improves Runtimes
About two years ago, we started working on a new billing system. In the beginning, our main goal was to get the process running and start collecting feedback so we could fix issues. But day after day, we saw that our process runtime had increased significantly, impacting our development and deployment processes.
So, my manager asked me to start thinking about what we could do to improve our process’s performance.
In this post, I’ll focus on how my journey through Spark’s best practices cut our process runtime by more than half.
Why we made the change in the first place
Our new billing system involved multiple processes from various teams, and running end-to-end tests often took a significant amount of time, sometimes exceeding six hours.
This process was written in Scala using Spark, which consisted of several jobs that ran sequentially.
To create a more efficient process, I began investigating the cause of the slowness.
Initially, I added more partitions. In Spark, a partition is an atomic chunk of data stored on a node in the cluster, serving as the basic unit of parallelism. However, we encountered out-of-memory (OOM) issues during the first run.
Adding more partitions allowed more tasks to run in parallel under the same executors, leading to increased memory usage and potential OOM errors if memory was insufficient.
To address this, I increased our resources, including memory, Spark executors, and executor cores. Despite these adjustments, we noticed that in some cases, performance remained unchanged while costs increased.
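To make the tuning knobs mentioned above concrete, here is a sketch of a `spark-submit` invocation showing where partitions, executors, cores, and memory are configured. The class name, jar, and all values are illustrative, not our actual settings:

```shell
# Illustrative only — tune these values to your workload and cluster.
spark-submit \
  --num-executors 20 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=400 \
  --class com.example.billing.BillingJob \
  billing-assembly.jar
```

Note the trade-off described above: raising `spark.sql.shuffle.partitions` increases parallelism, but more concurrent tasks per executor also means more concurrent memory pressure, which is what triggered our OOM errors.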
As a result, I began digging deeper into the Spark jobs’ code, collaborating with a focal point from one of our data platform teams to diagnose the cause of the slowness.
Implementing Spark best practices — my journey
As I mentioned above, our process consists of several Spark jobs, so I reviewed each of them in detail to figure out the exact code that was causing the slowness.
As a first step, I started learning more about Spark and its best practices. I also tried using Spark UI, which has many features that could help me investigate and figure out the problematic jobs, stages, and transformations.
Spark UI
Let’s dive into some of the features that were helpful in identifying the stages that caused the slowness:
Stages tab
In this screen, we can see each stage’s runtime, data input, shuffle read, and other metrics that help us understand exactly what data was processed in these stages.
SQL/DataFrame tab
Here, we can drill down into a stage and check the tasks running inside it, including data scans from external storage (like S3), transformations, etc.
Executors tab
This screen allows us to monitor our resource usage to see if there are any issues with our executors and drivers, such as excessive memory usage or long GC times.
The first section gives us an overview of the total usage for all the executors and drivers, including input, shuffle read/write, and GC time.
The second section provides a detailed breakdown of each executor’s usage.
Spark Best Practices
Now, let’s take a look at some Spark best practices that I’ve found very useful for our process:
Reduced shuffling
By partitioning wisely and carefully selecting only the necessary data from the DataFrame/Dataset, we significantly reduced the amount of data shuffled between stages and executors, resulting in better efficiency.
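A minimal sketch of the idea, with made-up table, path, and column names (this assumes a running Spark cluster and is not our actual job code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("billing").getOrCreate()
import spark.implicits._

// Hypothetical input path and schema, for illustration only.
val events = spark.read.parquet("s3://bucket/events")

// Select only the columns the aggregation needs *before* any shuffle,
// so each shuffle block carries less data between executors.
val slim = events.select("accountId", "amount")

// Repartitioning by the aggregation key co-locates related rows,
// so the groupBy that follows does not need an additional shuffle.
val result = slim
  .repartition($"accountId")
  .groupBy("accountId")
  .agg(sum("amount"))
```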
Transformation order
I noticed that when joining two large DataFrames, we initially performed the join and then selected the required columns. This approach was not efficient, as we saw prolonged durations in the Spark UI.
But by modifying the sequence of transformations — simply selecting the necessary columns before performing the join — I was able to improve the efficiency significantly. This reordering reduced the join runtime by more than half.
As a result, the Spark tasks now handled less data during the join, leading to less data being shuffled between tasks and executors and ultimately enhancing the overall performance.
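The reordering can be sketched like this; `invoices`, `customers`, and the column names are hypothetical stand-ins for our actual DataFrames:

```scala
// Before: join first, then select — full rows from both sides
// are shuffled across the network during the join.
val joinedSlow = invoices
  .join(customers, Seq("customerId"))
  .select("customerId", "invoiceTotal", "customerName")

// After: project each side down to the needed columns, then join —
// far less data crosses the network in the shuffle.
val joinedFast = invoices.select("customerId", "invoiceTotal")
  .join(customers.select("customerId", "customerName"), Seq("customerId"))
```

Both produce the same result; only the amount of data moved during the shuffle differs.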
Persist
Sometimes, I work with the same DataFrame for different operations. Due to Spark’s lazy evaluation, the DataFrame is recomputed by every action that uses it. To optimize this, you can persist the DataFrame in memory so it’s only computed once, by the first action that accesses it.
However, monitoring memory usage is crucial to avoid Out-of-Memory (OOM) errors. Remember to unpersist the data once you’re done to prevent memory overhead.
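A sketch of the pattern, with hypothetical DataFrames (`rawEvents`, `rates`) standing in for an expensive lineage:

```scala
import org.apache.spark.sql.functions.sum
import org.apache.spark.storage.StorageLevel

// Some expensive lineage we need more than once.
val enriched = rawEvents.join(rates, Seq("currency"))

// Materialize once; MEMORY_AND_DISK spills to disk instead of
// failing with OOM if the executors run out of memory.
enriched.persist(StorageLevel.MEMORY_AND_DISK)

val total      = enriched.agg(sum("amount")).first() // first action computes and caches
val perAccount = enriched.groupBy("accountId").count() // reuses the cached partitions

enriched.unpersist() // release executor memory once we're done
```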
Broadcast a small DataFrame when using a join transformation
Using Spark’s broadcast feature for a small DataFrame can enhance performance. Spark sends a full copy of the broadcast DataFrame to each executor, and it does this automatically for DataFrames below spark.sql.autoBroadcastJoinThreshold (10 MB by default).
We used this approach to minimize data shuffling between executors during the join operation, resulting in less data movement and more efficient processing.
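A sketch of an explicit broadcast hint; `lineItems` and `currencyRates` are hypothetical names:

```scala
import org.apache.spark.sql.functions.broadcast

// currencyRates is small enough to fit on every executor.
// The hint ships a full copy to each one, so the large side
// is joined locally with no shuffle of lineItems at all.
val priced = lineItems.join(broadcast(currencyRates), Seq("currency"))
```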
Early data filtering (row perspective)
Early filtering of our DataFrame reduces Spark action runtime and memory usage. So, by removing the unnecessary entities from the DataFrame, we were able to decrease the amount of data involved in subsequent calculations. This also contributed to more efficient processing.
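A sketch of filtering at the source, again with a hypothetical path and columns. With columnar sources like Parquet, Catalyst can push such predicates down into the scan itself, so the unwanted rows are never read at all:

```scala
import spark.implicits._

// Filter as close to the source as possible.
val active = spark.read.parquet("s3://bucket/accounts")
  .filter($"status" === "ACTIVE")
  .select("accountId", "plan")

// All downstream joins and aggregations now operate on the reduced row set.
```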
Need a Spark of runtime genius?
By applying Spark’s best practices and gaining a deep understanding of how it works, we were able to reduce our process runtime from 6 hours to 3 hours!
The changes we implemented led to a dramatic improvement in our development process, resulting in a significant boost in productivity.
As a rule of thumb, applying the best practices of any platform we use ensures our processes run faster and are more resource-efficient. This not only enhances our productivity but also provides a better experience.
Now is the perfect time to emphasize the importance of continuous learning and exploring the full potential of the technologies we use. By staying curious and proactive, we can unlock new opportunities and drive innovation forward.
