Apache Spark on Kubernetes: Lessons Learned from Launching Millions of Spark Executors (Databricks Data+AI Summit 2022)

Stan Lin
5 min read · Jul 15, 2022


This article summarizes an exciting session presented by Zhou Jiang and Aaruna Godthi from Apple at Data+AI Summit 2022. In the session, Zhou and Aaruna described how they built a centralized Apache Spark cluster on Kubernetes that processes 380K+ Spark jobs per day to support analytics workflows and data scientists’ experimentation at Apple.

Original talk: video

At a glance

The daily volume of Spark jobs on Apple’s data platform
  • Varied workload patterns (wide, deep, wide & deep) place different resource and processing requirements on the cluster.
  • Apple provides a unified interface (control plane) shared between the on-premises cloud and customer-owned cloud, with a fast and easy cluster onboarding process.
  • A large number of concurrent Spark applications puts a tremendous amount of stress on the Kubernetes cluster. To name a few challenges and their solutions:
  • Tuning Kubernetes for Spark workloads at scale: increase etcd storage size, and use the cluster autoscaler, priority classes and preemption (sketched after this list), and IPv6.
  • Spark orchestration (with the Operator) at scale: a global max-concurrent-job limit, gang scheduling, operator-side timeouts, and transition history pruning.
  • (Brilliant!) Use Spark metrics from previous runs to provide resource recommendations for future runs of the same job.
  • Enabling the “Dynamic Allocation” feature for better resource usage.
  • Scaling up the history server, which stores an aggregated view of the most recent jobs.
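To make the priority-class-and-preemption tuning concrete, here is a minimal sketch using the Python Kubernetes client. This is my illustration, not Apple’s actual configuration; the class name and priority value are hypothetical.

```python
# Minimal sketch (hypothetical name/value, not Apple's config): a
# PriorityClass that Spark driver pods can reference so the scheduler
# preempts lower-priority pods when the cluster is under pressure.
# Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

spark_driver_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="spark-driver-high"),
    value=1_000_000,  # higher value = scheduled first, preempts lower
    preemption_policy="PreemptLowerPriority",
    global_default=False,
    description="High priority for Spark driver pods",
)
client.SchedulingV1Api().create_priority_class(spark_driver_priority)
```

Pods opt in via `spec.priorityClassName`. Giving drivers a higher class than executors is a common pattern, since losing a driver kills the whole application, while a preempted executor can simply be re-launched.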

Data Platform in Apple

Stakeholders, tooling, and use cases

Apple’s data platform is designed to enable use cases from many disciplines and roles across the company, spanning big data (Data Engineers, Data Scientists, Machine Learning Engineers) and business intelligence (Business Analysts). The scale they achieved by 2022 is quite impressive!

In MSAI, we break down the “big data” use case into “big data” vs. “science” to better accommodate the differing requirements of Data Engineers vs. Data Scientists. With this structure we are able to provide the best toolsets and resources, in a compliant fashion, for each group’s specific needs. Imagine the differences between “compute a productivity index for a user” vs. “email smart-compose”.

E2E Spark Application Lifecycle

There is nothing fancy about the E2E infrastructure: it guarantees secure, routine deployment of code, with integration testing, monitoring, and security tooling. This is straight-up solid work, invested so the data platform business can thrive.

Running Spark jobs at scale (with Kubernetes)

Types of Spark workload

Different workloads place different requirements on the cluster

Classifying your workload as “a lot of small jobs” vs. “a few large jobs” is key to successfully scaling out the platform. In most cases, you can get away with a “batching” technique that turns “a lot of small jobs” into “a few large jobs” (see the sketch below), which is friendlier to the cluster and its scheduling components. However, there are cases where you cannot “merge” the smaller jobs. For example, when training tenant models that require end-to-end isolation in knowledge mining, what is the best solution for such a “wide” workload?
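For illustration, here is a minimal PySpark sketch of the batching idea: one Spark application works through a whole batch of small tenant jobs, paying the driver and executor startup cost once. The paths, tenant IDs, and aggregation are hypothetical placeholders; the dynamic-allocation settings are standard Spark configs that release executors left idle between tenants (shuffle tracking is needed on Kubernetes, where there is no external shuffle service).

```python
# Minimal "batching" sketch: one application handles many small tenant
# jobs instead of launching one application per tenant.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tenant-batch")
    # Release executors that sit idle between tenants.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)

tenant_ids = ["t001", "t002", "t003"]  # one batch of small, similar jobs

for tenant in tenant_ids:  # hypothetical per-tenant aggregation
    events = spark.read.parquet(f"/data/events/tenant={tenant}")
    daily = events.groupBy("event_date").count()
    daily.write.mode("overwrite").parquet(f"/output/{tenant}/daily_counts")

spark.stop()
```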

Spark Orchestrator Saves the Day

Spark Orchestrator for multi-cloud support and better scheduling

When training tenant graph embeddings, we had the most trouble handling the “wide” workload pattern: essentially, we need to construct tens of thousands of “tenant graphs” that vary a lot in size. While the number of users per tenant can differ by a factor of 1,000, the actual graph construction times only vary by a couple of hours.

A few problems MSAI came across handling the “wide” workload:

  1. Overheads: the less-than-expected variation in processing time suggests that per-job overhead takes up a good portion of the total processing time.
  2. Job scheduling and queuing: HDInsight does not provide an external scheduler, which means the internal YARN scheduler or the Livy session queue is used to buffer the “spiky” traffic (and performs poorly).
  3. Observability: due to the compliance requirements around processing customer content, logs and metrics are not easily accessible.

Job Concurrency and Resource Management for Scale

Apple’s data platform made a very good design choice in using an external scheduler, which largely gets around the problems we faced with a more old-fashioned setup. Using the Spark Orchestrator, Apple benefits in several ways:

  1. Multi-cloud support with fast rollout.
  2. Global job concurrency control: critical for keeping a good job SLA while the cluster is busy, avoiding job deadlock, and mitigating starvation (a toy sketch follows this list).
  3. Better observability to optimize job resource requirements.
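The talk does not spell out the orchestrator’s internals, but the essence of a global max-concurrent-job limit with operator-side timeouts can be sketched with a semaphore-style admission gate. This is a toy illustration of mine, not the actual Spark Orchestrator code; the cap and timeout values are made up.

```python
# Toy sketch of a global concurrency cap with a submission timeout.
# Not the actual Spark Orchestrator implementation.
import threading

MAX_CONCURRENT_JOBS = 100      # hypothetical global cap
SUBMIT_TIMEOUT_SECONDS = 600   # fail fast instead of queueing forever

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)

def run_job(submit_fn):
    """Admit a job only if a slot frees up within the timeout.

    Timing out instead of blocking indefinitely is what avoids deadlock
    and mitigates starvation when the cluster is saturated.
    """
    if not _slots.acquire(timeout=SUBMIT_TIMEOUT_SECONDS):
        raise TimeoutError("cluster busy; reject and retry with backoff")
    try:
        return submit_fn()  # e.g. create the SparkApplication resource
    finally:
        _slots.release()
```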
More highlights from the session:
  • Concurrency control saves the day for the job SLA.
  • Timeouts play a key role in avoiding deadlock and mitigating starvation.
  • Dynamic allocation frees up resources when a job is not actively using them.
  • Using Spark metrics to provide resource recommendations is brilliant (sketched below)!
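That metrics-to-recommendation loop could look roughly like the sketch below: pull executor metrics for a completed run from the Spark history server’s standard /api/v1 REST monitoring API and derive a memory setting for the next run. The server address, application ID, and 20% headroom rule are assumptions of mine, and the exact metric fields available vary by Spark version.

```python
# Sketch: recommend executor memory for the next run of a job from the
# peak usage observed in a previous run. Address, app ID, and headroom
# factor are hypothetical; field availability varies by Spark version.
import requests

HISTORY_SERVER = "http://history-server:18080"  # hypothetical address

def recommend_executor_memory_mb(app_id: str, headroom: float = 1.2) -> int:
    """Suggest spark.executor.memory (in MB) with ~20% headroom."""
    resp = requests.get(
        f"{HISTORY_SERVER}/api/v1/applications/{app_id}/allexecutors",
        timeout=30,
    )
    resp.raise_for_status()
    executors = [e for e in resp.json() if e["id"] != "driver"]
    peak_used = max(e["memoryUsed"] for e in executors)  # bytes
    return int(peak_used * headroom / (1024 * 1024))

# e.g. pass f"{recommend_executor_memory_mb('app-20220715-0001')}m"
# as spark.executor.memory on the next run of the same job.
```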

Last Words

Lastly, kudos to Zhou Jiang and Aaruna Godthi for the very detailed and impressive talk, and congratulations on a great achievement for Apple’s data platform!

Stan Lin — Tech Lead @MSAI | Snowboarder | Youtuber | Options Trader | Cryptocurrency enthusiast

Originally published at http://coderstan.com on July 15, 2022.
