Performance of Apache Spark on Kubernetes has caught up with YARN

Learn our benchmark setup, results, as well as critical tips to make shuffles up to 10x faster when running on Kubernetes!

Published in

Data Mechanics

7 min readOct 20, 2020

Apache Spark is an open-sourced distributed computing framework, but it doesn’t manage the cluster of machines it runs on. You need a cluster manager (also called a scheduler) for that. The most commonly used one is Apache Hadoop YARN. Support for running Spark on Kubernetes was added with version 2.3, and Spark-on-k8s adoption has been accelerating ever since.

If you’re curious about the core notions of Spark-on-Kubernetes, the differences with Yarn as well as the benefits and drawbacks, read our previous article: The Pros And Cons of Running Spark on Kubernetes. For a deeper dive, you can also watch our session at Spark Summit 2020: Running Apache Spark on Kubernetes: Best Practices and Pitfalls.

In this article, we present benchmarks comparing the performance of deploying Spark on Kubernetes versus Yarn. Our results indicate that Kubernetes has caught up with Yarn — there are no significant performance differences between the two anymore. In particular, we will compare the performance of shuffle between YARN and Kubernetes, and give you critical tips to make shuffle performant…

Performance of Apache Spark on Kubernetes has caught up with YARN

Learn our benchmark setup, results, as well as critical tips to make shuffles up to 10x faster when running on Kubernetes!

Written by Hisham Itani