Why run Spark on Kubernetes?

Rachit Arora
5 min read · Sep 1, 2018


Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications.

Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core, the foundation of Spark, provides the libraries for scheduling and basic I/O, and Spark offers hundreds of high-level operators that make it easy to build parallel apps.

Spark also includes prebuilt machine-learning and graph-analysis algorithms that are written specifically to execute in parallel and in memory, and it supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R and Scala.

You can run Spark using its standalone cluster mode, on the cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and it can access data in HDFS, Cassandra, HBase, Hive, object stores, and any other Hadoop data source.

Spark can run on clusters managed by Kubernetes. This feature makes use of the native Kubernetes scheduler that has been added to Spark.

How does it work?

Spark Submit on k8s

spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:

  • Spark creates a Spark driver running within a Kubernetes pod.
  • The driver creates executors, which also run within Kubernetes pods, connects to them, and executes application code.
  • When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
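
For illustration, here is a minimal cluster-mode submission along the lines of the Spark documentation; the API-server address, image name, and jar path are placeholders for your own cluster:

```
$ bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar
```

The local:// scheme points at a jar already baked into the container image, which you can build with the bin/docker-image-tool.sh script that ships with Spark.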

But why should we use Kubernetes as the cluster manager?

This integration is certainly very interesting, but the important question one should consider is why an organization should choose Kubernetes as its cluster manager instead of the standalone scheduler that ships with Spark, or a production-grade cluster manager like YARN.

Let me try to answer the question with the following points.

  1. Are you using a containerized data analytics pipeline? A typical company wants its data pipelines unified: if a pipeline requires a queue, Spark, a database, and visualization applications, and everything runs in containers, it makes sense to run Spark in containers as well and use Kubernetes to manage the entire pipeline.
  2. Resource sharing is better optimized. Instead of running each component of your pipeline on dedicated hardware, it is far more efficient to run them on a shared Kubernetes cluster, because not all components of a pipeline are running all the time.
  3. Leveraging the Kubernetes ecosystem. Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging (see the namespace and RBAC sketch after this list). Best of all, it requires no changes or new installations on your Kubernetes cluster: simply create a container image and set up the right RBAC roles for your Spark application and you're all set. When you use Kubernetes as your cluster manager, you get seamless integration with logging and monitoring solutions. The community is also exploring advanced use cases such as managing streaming workloads and leveraging service meshes like Istio.
  4. Limitations of the standalone scheduler. You can easily run Spark on Kubernetes by starting a Spark cluster in standalone mode, i.e. running the Spark master, the workers, and the full Spark cluster as pods on Kubernetes. This approach is entirely feasible and works for many scenarios. It comes with its own benefits: it is a very stable way to run Spark on Kubernetes, and you can leverage all the cool features of Kubernetes such as resource management, a variety of persistent storage options, and integrations for logging and service meshes (Istio). But Spark is not aware it is running on Kubernetes, and Kubernetes is not aware it is running Spark. This means some configuration has to be duplicated: if we want executors of 10 GB, we need to start the Kubernetes pods with those settings and specify the Spark daemon memory accordingly (a sketch of this duplication follows the list). Note that you need to give some extra memory to the Spark daemon processes, such as the Spark master and workers, and cannot use the entire memory of the pod for the driver or executor alone. Thus there is resource wastage, and it is easy to mess up the configuration if it is not done by experts :). Another limitation of the standalone approach is elasticity, which standalone mode does not provide by default.
  5. YARN vs. Kubernetes. This is still not an entirely fair comparison, as Kubernetes does not yet support all the frameworks that use YARN as their resource manager/negotiator. YARN is used for many production workloads and can be used to run any application. Spark treats YARN as a container management system: it requests containers with defined resources, and once it acquires them it builds RPC-based communication between the containers to run the driver and executors. YARN does not handle the service layer, so Spark has to handle it itself. YARN can scale automatically by acquiring and releasing containers. One major disadvantage of YARN containers is that they always run a JVM, whereas on Kubernetes you can use your own image, which can be very minimal.
  6. Kubernetes community support. If you as an organization need to choose between container orchestrators, you can easily choose Kubernetes simply because of the community support it has, apart from the fact that it can run on premises as well as on the cloud provider of your choice, so there is no cloud lock-in to suffer. Kubernetes is agnostic of the container runtime, and it has a very broad feature list: support for running clustered applications on containers, service load balancing, service upgrades without stops or disruption, and a well-defined storage story.
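
To make point 3 concrete, here is a sketch of the kind of namespace, quota, and RBAC setup you might create for Spark; the names spark-jobs and spark, and the quota figures, are made-up placeholders:

```
# Namespace with a CPU/memory quota shared by all Spark jobs in it
apiVersion: v1
kind: Namespace
metadata:
  name: spark-jobs
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark-jobs
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 100Gi
---
# Service account the driver runs as; it needs rights to create executor pods
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role
  namespace: spark-jobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: spark
  namespace: spark-jobs
```

A submission would then point at this setup with --conf spark.kubernetes.namespace=spark-jobs and --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark.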
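
And to illustrate the duplication described in point 4: in standalone mode, the pod size and the Spark memory settings live in two places and must be kept in sync by hand. A minimal sketch, assuming a worker container built from a standard Spark image:

```
# Container spec for a standalone Spark worker pod: the pod limit (12Gi)
# must leave headroom above what the worker offers to executors (10g),
# because the worker daemon's own JVM also lives inside the pod
containers:
- name: spark-worker
  image: <spark-image>
  env:
  - name: SPARK_WORKER_MEMORY   # memory the worker advertises for executors
    value: "10g"
  resources:
    limits:
      memory: 12Gi
```

A job submitted with --conf spark.executor.memory=10g then fits exactly; change any one of these numbers without the others and you either waste pod memory or get the worker OOM-killed. The native integration avoids this duplication because Spark itself requests right-sized pods.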

Why might you want to avoid using Kubernetes as the Spark cluster manager?

This is still a beta feature and not ready for production yet. Many features, such as dynamic resource allocation, in-cluster staging of dependencies, support for PySpark and SparkR, support for Kerberized HDFS clusters, client mode, and popular interactive notebook environments, are still being worked on and are not yet available.

Other features that need more work include storing executor logs and history-server events on a persistent volume so that they can be referred to later.

Summary

Apache Spark is an essential tool for data scientists, offering a robust platform for a variety of applications ranging from large-scale data transformation to analytics to machine learning.

Data scientists are adopting containers to improve their workflows by realizing benefits such as packaging of dependencies and creating reproducible artifacts.

Given that Kubernetes is the standard for managing containerized environments, it is a natural fit to have support for Kubernetes APIs within Spark.
