Announcing: Spark Performance Advisor

Vladimir Prus
Feb 28, 2023 · 5 min read


The first post of 2023 is a bit different: it’s a product announcement. Together with my colleagues at Joom, we have made public a tool that has been very helpful to us internally, Spark Performance Advisor.


Executive Summary

Spark Performance Advisor automatically finds specific performance problems in Spark applications, without requiring a senior data engineer to manually review every single job.

When you have a couple of data engineers and they maintain a dozen jobs, it is easy to notice performance regressions and pinpoint inefficiencies. But eventually, you’ll have hundreds of jobs, dozens of data sources, and a dozen authors, some of whom might be backend developers or analysts. And let’s be honest, Spark UI is not something you can throw at people and expect them to find their way around.

Even to detect a basic data skew, one has to find the long stages, click through them, and know where to look. To find an excessive spill, you need to know exactly what counts as excessive. And if half of the executors are idle most of the time, the UI will not tell you at all. We need a better UI, one we can offer to other colleagues and honestly expect to be useful.

The idea is not new at all; others have tried it before. But here’s why you might want to give our product a try today:

  • We focus on detecting high-level problems that you can immediately investigate, as opposed to just showing raw metrics
  • 2023 is a year when optimizing your cloud costs is important
  • We’re not up-selling you any enterprise offering

If you’re already interested, please head to https://cloud.joom.ai to try it; for more details, read on.

High-Level Problems: a Case Study

Since we had some experience with common performance problems, we thought we could automatically compute “problem scores” — a number that shows how likely and severe a problem might be in a particular job. We implemented several scores, ran them on our statistics, and had a few surprises.

Excessive Spill

Whenever a Spark task runs out of memory to process a partition, it starts using disk as scratch space. Most often, this happens when sorting data in preparation for a join. In that case, Spark switches from an in-memory sorting algorithm to a disk-based merge sort.

For example, suppose you have 64GB of memory available, and each executor runs 8 tasks in parallel. Each task, therefore, has at most 8GB of memory. If one partition of your data frame has 16GB of data, there’s no way to sort it in memory, and performance will suffer. The “excessive spill” problem metric quantifies the severity of the problem.

One of our affected jobs used 250 partitions and simply processed too much data per partition. After switching it to 1000 partitions, the total run time went from 19 minutes to 14 minutes, a 26% improvement. Another job had been configured to use 1000 partitions years ago, but on the current data, 3600 partitions were more appropriate; that change reduced the run time from 34 minutes to 20 minutes, a 40% improvement.
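For illustration, here is a minimal sketch of the two usual knobs in this situation, assuming a plain Spark SQL job. The path, column name, and partition counts are placeholders, not the exact settings of the jobs above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spill-fix-sketch").getOrCreate()

// Raise the default number of shuffle partitions so each task sorts a smaller slice.
// "1000" is illustrative; pick a value based on your shuffle volume.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

// Or repartition a specific DataFrame before a heavy join or aggregation.
// The path and column name below are placeholders.
val events = spark.read.parquet("s3://bucket/events")
val readyToJoin = events.repartition(1000, events("user_id"))
```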

Unused executors

The “unused executors” problem adds up idle time across all executors. We found several examples in our workload where this problem happens.

In a few cases, the adaptive query execution (AQE) mechanism gets in the way. Say, one stage uses 16 executors but generates small output. AQE decides to coalesce partitions and use only 4 executors, while the rest are idle. In one case, we just reduced the number of executors by half, without much impact on job time. In another case, we determined that AQE was simply too aggressive and used too few partitions; disabling AQE made the job run faster than before.
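If you hit the same behavior, these are the standard Spark settings to experiment with. This is a sketch with illustrative values, not a recommendation for every job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Keep AQE, but stop it from coalescing post-shuffle partitions...
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)

// ...or keep coalescing and ask for smaller target partitions,
// so more executors stay busy (the value is illustrative).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32MB")

// As a last resort, disable AQE for the problematic job entirely.
spark.conf.set("spark.sql.adaptive.enabled", false)
```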

Executors can also be idle if too much work is done by the driver. Say we run a query, and then the driver calls an external API. If the API is synchronous, and there are cached dataframes keeping the executors alive, the inefficiency can be dramatic. The primary solution in this case is to parallelize the calls to external APIs and perform them on the executors; both wall time and executor time can go down several times over.
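A minimal sketch of that pattern, assuming a dataset of identifiers and a synchronous API client; callExternalApi, the path, and the other names below are hypothetical stand-ins.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical stand-in for a synchronous external API call.
def callExternalApi(id: String): String = s"result-for-$id"

// Placeholder input: a dataset of identifiers to enrich.
val ids = spark.read.parquet("s3://bucket/ids").as[String]

// Make the calls on the executors: each partition issues its own API calls,
// so the work is spread across the cluster instead of serialized on the driver.
val enriched = ids.mapPartitions { partition =>
  partition.map(id => (id, callExternalApi(id)))
}
```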

Wasted task time

Spark can be too smart sometimes. If a task or stage fails, it is automatically retried a few times, and if the third retry succeeds, you never know that resources were wasted on the first two attempts. The “wasted task time” problem score essentially sums up the time wasted on discarded computations.

The retries and wasted time are usually caused by out-of-memory errors on the executors or, more rarely, by running out of disk space. Once the problem is detected, fixing it is easy; afterwards, jobs complete with no task retries, and in some cases execution time went down by more than 50%.
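The usual fix is simply to give each task more memory. A sketch of how that might look, with illustrative values; these settings must be in place before the executors are launched.

```scala
import org.apache.spark.sql.SparkSession

// Memory settings are applied when the session (or spark-submit) is configured,
// before executors start. The values below are illustrative, not recommendations.
val spark = SparkSession.builder()
  .appName("retry-free-job")
  .config("spark.executor.memory", "16g")          // more heap per executor
  .config("spark.executor.memoryOverhead", "4g")   // room for off-heap and shuffle buffers
  .config("spark.executor.cores", "4")             // fewer concurrent tasks, more memory per task
  .getOrCreate()
```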

Data skew

We can’t wrap up the case studies without mentioning the classic problem of data skew, where some executors process an unfair share of the data and block overall progress.

In our case, we were generally good at avoiding joins on null, or other singular values, in the join column. However, in some cases Spark found the right side of the join to be small, decided to use a broadcast join, and that backfired: since there is no shuffle to redistribute the other side of the join, any skew in the input data becomes skew in the join. Forcing Spark to use a merge join was effective, in one case reducing the run time from 1.5 hours to 3.4 minutes.
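For reference, here is a sketch of two ways to force that, using standard Spark join hints and configuration; the table names and join key are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Placeholder inputs and join key.
val facts = spark.read.parquet("s3://bucket/facts")
val dims  = spark.read.parquet("s3://bucket/dims")

// Option 1: hint the planner to use a sort-merge join for this particular join.
val joined = facts.join(dims.hint("merge"), Seq("key"))

// Option 2: disable automatic broadcast joins for the whole job.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```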

The Business Side

Now that we’ve seen a few cases where high-level problem metrics led us to significant performance improvements, let’s briefly look at the more boring, but probably more important, side.

First, costs are important in 2023. There is loud news about FAANG layoffs and the cooling of the VC market, and even for medium-sized companies that don’t have to appease either Wall Street or VCs, costs are important. Big Data is a naturally costly area, so investing in optimization and cost reduction can have a notable effect. Literally, a week of optimization can save the equivalent of an engineer’s salary going forward.

Second, it is very easy to try our service. You need the Spark listener we provide (it is open source) and a couple of Spark configuration options; a minimal setup sketch, with placeholder names, follows the list below. We process only telemetry and never see your actual data. The library is tiny, the network usage is negligible, and there’s no performance impact. Finally, the service is free in two important ways:

  • We don’t charge any dollars. Because the amount of data to process is relatively small, we hope we won’t need to ever charge for the typical usage.
  • We don’t waste your time. You only need a Google account to sign up. There will be no marketing emails and nobody will call you on the phone to upsell things.
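Roughly, the setup looks like the sketch below. It relies on the standard spark.extraListeners mechanism; the listener class name and the advisor-specific option are placeholders, so check the documentation at https://cloud.joom.ai for the real ones.

```scala
import org.apache.spark.sql.SparkSession

// The listener class and the "spark.advisor.*" key are placeholders; the real
// names come from the Spark Performance Advisor documentation. The mechanism
// itself (spark.extraListeners) is standard Spark configuration.
val spark = SparkSession.builder()
  .config("spark.extraListeners", "com.example.AdvisorListener")  // placeholder class name
  .config("spark.advisor.endpoint", "https://cloud.joom.ai")      // placeholder option
  .getOrCreate()
```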

Conclusion

Spark Performance Advisor is a free tool that finds potential performance problems in your jobs, and it will be a useful addition to your Spark toolbox.

We’ll be happy for you to try it. If you’re already convinced, just log in to the console, and ask questions or comment here or on Twitter.

Looking for more details? You can read the documentation and FAQ at https://cloud.joom.ai. In particular, you might find our documentation of potential problems useful.

Finally, for more background, the blog post from my colleague Sergey Kotlov describes how we used this tool internally, back in 2022.
