Apache Spark on Kubernetes: Lessons Learned from Launching Millions of Spark Executors (Databricks Data+AI Summit 2022)

Stan Lin
5 min read · Jul 15, 2022


This article summarizes an exciting session presented by Zhou Jiang and Aaruna Godthi from Apple at Data+AI Summit 2022. In the session, Zhou and Aaruna described how they built a centralized Apache Spark cluster on Kubernetes that processes 380K+ Spark jobs per day to support analytics workflows and data scientists’ experimentation at Apple.

Original talk: video

At a glance

The daily volume of Spark jobs on Apple’s data platform
  • Varied workload patterns (wide, deep, wide & deep) place different resource and processing requirements on the cluster.
  • Apple provides a unified interface (control plane) shared between the on-premises cloud and customer-owned cloud, with a fast and easy cluster onboarding process.
  • A large number of concurrent Spark applications puts a tremendous amount of stress on the Kubernetes cluster. To name a few challenges and their solutions:
  • Tuning Kubernetes for Spark workloads at scale: increase etcd storage size, and use the cluster autoscaler, priority classes and preemption (sketched after this list), and IPv6.
  • Spark orchestration (with the Operator) at scale: a global max-concurrent-job limit, gang scheduling, operator-side timeouts, and transition history pruning.
  • (Brilliant!) Use Spark metrics from previous runs to provide resource recommendations for future runs of the same job.
  • Enabling the “Dynamic Allocation” feature for better resource usage.
  • Scaling up the history server, which stores an aggregated view of the most recent jobs.
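To make the priority-class-and-preemption tuning concrete, here is a minimal sketch using the Python Kubernetes client. This is my illustration, not Apple’s actual configuration; the class name and priority value are hypothetical.

```python
# Minimal sketch (hypothetical name/value, not Apple's config): a
# PriorityClass that Spark driver pods can reference so the scheduler
# preempts lower-priority pods when the cluster is under pressure.
# Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

spark_driver_priority = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="spark-driver-high"),
    value=1_000_000,  # higher value = scheduled first, preempts lower
    preemption_policy="PreemptLowerPriority",
    global_default=False,
    description="High priority for Spark driver pods",
)
client.SchedulingV1Api().create_priority_class(spark_driver_priority)
```

Pods opt in via `spec.priorityClassName`. Giving drivers a higher class than executors is a common pattern, since losing a driver kills the whole application, while a preempted executor can simply be re-launched.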

Data Platform in Apple

Stakeholders, tooling, and use cases

Apple’s data platform is designed to enable use cases from many disciplines and roles across the company, spanning big data (Data Engineers, Data Scientists, Machine Learning Engineers) and business intelligence (Business Analysts). The scale they achieved by 2022 is quite impressive!

In MSAI, we break down the “big data” use case into “big data” vs. “science” to better accommodate the differing requirements of Data Engineers vs. Data Scientists. With this structure we are able to provide the best toolsets and resources, in a compliant fashion, for each group’s specific needs. Imagine the differences between “compute a productivity index for a user” vs. “email smart-compose”.

E2E Spark Application Lifecycle

There is nothing fancy about the E2E infrastructure: it guarantees secure, routine deployment of code, with integration testing, monitoring, and security tooling. This is straight-up solid work, invested so the data platform business can thrive.

Running Spark jobs at scale (with Kubernetes)

Types of Spark workload

Different workloads place different requirements on the cluster

Classifying your workload as “a lot of small jobs” vs. “a few large jobs” is key to successfully scaling out the platform. In most cases, you can get away with a “batching” technique that turns “a lot of small jobs” into “a few large jobs” (see the sketch below), which is friendlier to the cluster and its scheduling components. However, there are cases where you cannot “merge” the smaller jobs. For example, when training tenant models that require end-to-end isolation in knowledge mining, what is the best solution for such a “wide” workload?
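For illustration, here is a minimal PySpark sketch of the batching idea: one Spark application works through a whole batch of small tenant jobs, paying the driver and executor startup cost once. The paths, tenant IDs, and aggregation are hypothetical placeholders; the dynamic-allocation settings are standard Spark configs that release executors left idle between tenants (shuffle tracking is needed on Kubernetes, where there is no external shuffle service).

```python
# Minimal "batching" sketch: one application handles many small tenant
# jobs instead of launching one application per tenant.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tenant-batch")
    # Release executors that sit idle between tenants.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)

tenant_ids = ["t001", "t002", "t003"]  # one batch of small, similar jobs

for tenant in tenant_ids:  # hypothetical per-tenant aggregation
    events = spark.read.parquet(f"/data/events/tenant={tenant}")
    daily = events.groupBy("event_date").count()
    daily.write.mode("overwrite").parquet(f"/output/{tenant}/daily_counts")

spark.stop()
```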

Spark Orchestrator Saves the Day

Spark Orchestrator for multi-cloud support and better scheduling

When training tenant graph embeddings, we had the most trouble handling the “wide” workload pattern: essentially, we need to construct tens of thousands of “tenant graphs” that vary a lot in size. While the number of users per tenant can differ by a factor of 1,000, the actual graph construction times only vary by a couple of hours.

A few problems MSAI came across handling the “wide” workload:

  1. Overheads: the less-than-expected variation in processing time suggests that per-job overhead takes up a good portion of the total processing time.
  2. Job scheduling and queuing: HDInsight does not provide an external scheduler, which means the internal YARN scheduler or the Livy session queue is used to buffer the “spiky” traffic (and performs poorly).
  3. Observability: due to the compliance requirements around processing customer content, logs and metrics are not easily accessible.

Job Concurrency and Resource Management for Scale

Apple’s data platform made a very good design choice in using an external scheduler, which largely gets around the problems we faced with a more old-fashioned setup. Using the Spark Orchestrator, Apple benefits in several ways:

  1. Multi-cloud support with fast rollout.
  2. Global job concurrency control: critical for keeping a good job SLA while the cluster is busy, avoiding job deadlock, and mitigating starvation (a toy sketch follows this list).
  3. Better observability to optimize job resource requirements.
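The talk does not spell out the orchestrator’s internals, but the essence of a global max-concurrent-job limit with operator-side timeouts can be sketched with a semaphore-style admission gate. This is a toy illustration of mine, not the actual Spark Orchestrator code; the cap and timeout values are made up.

```python
# Toy sketch of a global concurrency cap with a submission timeout.
# Not the actual Spark Orchestrator implementation.
import threading

MAX_CONCURRENT_JOBS = 100      # hypothetical global cap
SUBMIT_TIMEOUT_SECONDS = 600   # fail fast instead of queueing forever

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)

def run_job(submit_fn):
    """Admit a job only if a slot frees up within the timeout.

    Timing out instead of blocking indefinitely is what avoids deadlock
    and mitigates starvation when the cluster is saturated.
    """
    if not _slots.acquire(timeout=SUBMIT_TIMEOUT_SECONDS):
        raise TimeoutError("cluster busy; reject and retry with backoff")
    try:
        return submit_fn()  # e.g. create the SparkApplication resource
    finally:
        _slots.release()
```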
More highlights from the session:
  • Concurrency control saves the day for the job SLA.
  • Timeouts play a key role in avoiding deadlock and mitigating starvation.
  • Dynamic allocation frees up resources when a job is not actively using them.
  • Using Spark metrics to provide resource recommendations is brilliant (sketched below)!
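That metrics-to-recommendation loop could look roughly like the sketch below: pull executor metrics for a completed run from the Spark history server’s standard /api/v1 REST monitoring API and derive a memory setting for the next run. The server address, application ID, and 20% headroom rule are assumptions of mine, and the exact metric fields available vary by Spark version.

```python
# Sketch: recommend executor memory for the next run of a job from the
# peak usage observed in a previous run. Address, app ID, and headroom
# factor are hypothetical; field availability varies by Spark version.
import requests

HISTORY_SERVER = "http://history-server:18080"  # hypothetical address

def recommend_executor_memory_mb(app_id: str, headroom: float = 1.2) -> int:
    """Suggest spark.executor.memory (in MB) with ~20% headroom."""
    resp = requests.get(
        f"{HISTORY_SERVER}/api/v1/applications/{app_id}/allexecutors",
        timeout=30,
    )
    resp.raise_for_status()
    executors = [e for e in resp.json() if e["id"] != "driver"]
    peak_used = max(e["memoryUsed"] for e in executors)  # bytes
    return int(peak_used * headroom / (1024 * 1024))

# e.g. pass f"{recommend_executor_memory_mb('app-20220715-0001')}m"
# as spark.executor.memory on the next run of the same job.
```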

Last Words

Lastly, kudos to Zhou Jiang and Aaruna Godthi for the very detailed and impressive talk, and congratulations on a great achievement for Apple’s data platform!

Stan Lin — Tech Lead @MSAI | Snowboarder | Youtuber | Options Trader | Cryptocurrency enthusiast

Originally published at http://coderstan.com on July 15, 2022.
