Improving Apache Spark performance on k8s
At Data Minded we offer a managed data platform, called Datafy, to build and run your data projects at scale. The platform incorporates software engineering best practices and aims to speed up development as well as reduce the operational overhead of managing and scaling a large number of jobs on Kubernetes.
One of the use cases that we support is Apache Spark. We often get questions on how to improve the performance of a job, or why a certain job is slower than another one working on the same data. As many of you probably know, there is no one-size-fits-all solution for improving the performance of your Spark job, as it heavily depends on your data, your code, and the specified tuning parameters. More information about the tuning options for Spark jobs can be found here.
In this blogpost we want to highlight one tuning option when running Spark on Kubernetes, namely adding external disks to your executors, and illustrate the performance improvement it can bring. We added support for configuring external disks to Datafy, but the same principle can be used by anybody running Spark on Kubernetes. For those of you who use the Spark on k8s operator from Google, it also supports specifying external disks; for more details look here.
TPC-DS benchmark
In order to compare the performance of our Spark jobs we use the TPC-DS benchmark, which is an industry standard for measuring database performance. It models a data warehouse and focuses on OLAP (online analytical processing) tasks. For more background information, I can recommend the following blogpost.
Using the TPC-DS benchmark
We used the following Github repositories and packaged both artifacts in a single docker image:
- TPC-DS toolkit provided by Databricks: https://github.com/databricks/tpcds-kit.git
- TPC-DS queries and performance testing framework, our fork of Databricks' spark-sql-perf: https://github.com/datamindedbe/spark-sql-perf.git
With both in place, you can easily create jobs that generate the data and run the queries. In Datafy we do this by defining Airflow tasks and using our custom SparkSubmitOperator, but most settings can be ported to another Spark operator.
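For readers outside Datafy, a minimal sketch of such a task using the standard SparkSubmitOperator from the Airflow Spark provider could look as follows. The image, jar, entry point, and bucket below are placeholders, not the actual Datafy or spark-sql-perf configuration.

```python
# Hypothetical sketch of an Airflow task submitting the data-generation job.
# Image, jar, entry point, and bucket are placeholders, not Datafy internals.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("tpcds_benchmark", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    generate_data = SparkSubmitOperator(
        task_id="generate_tpcds_data",
        application="local:///opt/spark/jars/spark-sql-perf-assembly.jar",  # placeholder jar path
        java_class="com.example.benchmark.GenerateTPCDSData",               # placeholder entry point
        application_args=["--scale-factor", "1000", "--location", "s3a://my-bucket/tpcds/"],
        conf={
            "spark.executor.instances": "6",
            "spark.kubernetes.container.image": "my-registry/tpcds-benchmark:latest",  # placeholder
        },
    )
```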
Generating data
The most important things to highlight are:
- We use a scale factor of 1000, resulting in a dataset of 1 TB
- Each fact table will be split into 500 files
- We use the S3A magic committer to avoid slow, rename-based commits of output directories in S3 (see the configuration sketch after this list). For more details about the magic committer, look at the official documentation.
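As a reference point, enabling the magic committer usually comes down to a handful of Spark settings; the sketch below follows the Hadoop S3A committer documentation and is not necessarily the exact configuration used in Datafy.

```python
# Typical Spark settings for the S3A magic committer (requires the hadoop-aws
# and spark-hadoop-cloud artifacts on the classpath).
magic_committer_conf = {
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    # Route Spark's commit protocol through the S3A committers instead of the
    # default rename-based FileOutputCommitter.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}
```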
Running queries
We run two similar jobs: one without extra disks attached and a second one with four 100 GB disks attached to the executors.
Other important configuration options:
- We use mx_4xlarge instances for our executors, which is a custom instance type in Datafy that has 16 cores and up to 54 GB of memory available for Spark.
- We have 6 executors available to perform Spark operations (the full executor configuration is sketched after this list)
- We use the S3A magic committer
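Spelled out as plain Spark configuration, the executor setup for these runs looks roughly like the snippet below; the split between heap and overhead memory is an assumption, and only the totals (6 executors, 16 cores, roughly 54 GB each) come from the numbers above.

```python
# Approximate executor configuration for the benchmark runs. The memory split
# between heap and overhead is an assumption; only the totals come from the text.
executor_conf = {
    "spark.executor.instances": "6",
    "spark.executor.cores": "16",
    "spark.executor.memory": "48g",         # assumption: heap part of the ~54 GB
    "spark.executor.memoryOverhead": "6g",  # assumption: remaining overhead
}
```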
Benchmark results
The results for 3 queries with and without additional storage are as follows:
Two things stand out when looking at the results:
- For two of the three queries, the results are roughly the same: no real improvement
- For query 64, there is a 60% performance increase when using the 400 GB of attached disks
Spark is a distributed compute engine and needs to exchange data between nodes when performing joins, aggregations, and sorts across multiple executors. It does this through a shuffle: data is written to local disk before being sent over the network to the executors that require it.
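To make this concrete, any wide transformation, such as the join and aggregation below, forces executors to write their map output to local disk before other executors can fetch it over the network. This is a generic PySpark illustration on TPC-DS-style tables, not one of the benchmark queries; the paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-illustration").getOrCreate()

# Placeholder paths to TPC-DS-style tables.
sales = spark.read.parquet("s3a://my-bucket/tpcds/store_sales/")
items = spark.read.parquet("s3a://my-bucket/tpcds/item/")

# Both the join and the groupBy are wide transformations: data is repartitioned
# by key, so shuffle files are written to the executors' local disks
# (spark.local.dir) before being served to the other executors.
result = (
    sales.join(items, sales["ss_item_sk"] == items["i_item_sk"])
    .groupBy("i_category")
    .count()
)
result.show()
```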
By default we add ephemeral disks to the Spark executors, but this can be a big performance bottleneck if your job requires a lot of shuffles. When you specify the executor_disk_size option, we attach one or more EBS volumes to your job, depending on the instance type of the executor, summing up to the specified size.
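Outside Datafy, a similar effect can be achieved with Spark's built-in Kubernetes volume support (Spark 3.1+), for example by mounting a dynamically provisioned persistent volume claim on each executor and letting Spark use it as scratch space. The storage class and size below are assumptions for illustration.

```python
# Sketch: attach an on-demand PVC to each executor. Because the volume name
# starts with "spark-local-dir-", Spark uses it for shuffle and spill data.
volume = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1"
disk_conf = {
    f"{volume}.options.claimName": "OnDemand",    # dynamically provisioned per executor
    f"{volume}.options.storageClass": "gp2",      # assumption: an EBS-backed storage class
    f"{volume}.options.sizeLimit": "100Gi",
    f"{volume}.mount.path": "/data/spark-local-dir",
    f"{volume}.mount.readOnly": "false",
}
```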
Conclusion
In this blogpost we explained why it is useful to add external disks when running Spark jobs on Kubernetes: it can significantly improve the performance of Spark jobs that require a lot of shuffling. This was illustrated with query 64 of the TPC-DS benchmark, which ran 60% faster on a 1 TB dataset. If you are running Spark on Kubernetes without Datafy and you notice that Spark is doing a lot of shuffling, it might be worth looking into adding external disks as well.
Accelerate your journey with Datafy
If you like what you see, you should definitely check out our product Datafy, which makes it super easy to containerize, deploy and orchestrate all your workloads at scale.