Spark Cluster Deployment with Kubernetes and Helm Charts

Happy Data Science

(λx.x)eranga
Effectz.AI
4 min read · Nov 14, 2021


Background

In this post I'm going to discuss deploying a Spark cluster with the Kubernetes-based cluster manager. I will cover two methods of deploying a Spark cluster on Kubernetes: 1) with a traditional Kubernetes deployment (simple services and pods), and 2) with Helm charts. As a use case for the deployment, I have run a Spark Cassandra Connector job in the Spark cluster via the Kubernetes cluster manager. All the deployments related to this post are available in GitLab. Please clone the repo and follow along with the post.

Spark Cluster Architecture

A Spark cluster follows a master-slave architecture. It consists of a Master (Driver), Workers (Executors), and a Cluster Manager. There is a single master in a Spark cluster. The master node runs the driver program, which drives the Spark application/job. A Spark job is split into multiple tasks (operating on partitioned RDDs) by the master node and distributed over the worker nodes.

Worker/Executor nodes are the slave nodes whose job is to execute the tasks assigned by the master node. These tasks are executed on the partitioned RDDs. Executors store computation results in memory, in cache, or on hard disk drives. After executing a task, a worker node returns the result to the master node (the master node aggregates the results from all worker nodes).

The Cluster Manager does all the resource-allocation work. It allocates resources to worker nodes based on the tasks created by the master, then distributes the tasks to the worker nodes. Once a task finishes, it takes the results back from the worker nodes to the master node. Spark can work with various cluster managers, such as the Standalone Cluster Manager, Yet Another Resource Negotiator (YARN), Mesos, and Kubernetes.

Kubernetes Deployment

I have run the Kubernetes cluster with Minikube on my dev machine. Read more about configuring Kubernetes with Minikube here. Following are the Kubernetes deployments of the single-master/single-worker Spark cluster. The Spark master is exposed via a headless ClusterIP service named spark-master. The worker is exposed via the spark-client service.
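A minimal sketch of such a deployment could look like the following. The bde2020/spark-master and bde2020/spark-worker images and the SPARK_MASTER environment variable are assumptions (the post only names the bde2020/spark-base image), so verify them against the repo:

```yaml
# Spark master: one replica, headless ClusterIP service named spark-master
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels: {app: spark-master}
  template:
    metadata:
      labels: {app: spark-master}
    spec:
      containers:
        - name: spark-master
          image: bde2020/spark-master:3.1.1-hadoop3.2   # assumed image
          ports:
            - containerPort: 7077   # cluster port
            - containerPort: 8080   # web UI
---
apiVersion: v1
kind: Service
metadata:
  name: spark-master
spec:
  clusterIP: None                   # headless ClusterIP service
  selector: {app: spark-master}
  ports:
    - {name: cluster, port: 7077}
    - {name: web-ui, port: 8080}
---
# Spark worker: one replica, pointed at the master service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker
spec:
  replicas: 1
  selector:
    matchLabels: {app: spark-worker}
  template:
    metadata:
      labels: {app: spark-worker}
    spec:
      containers:
        - name: spark-worker
          image: bde2020/spark-worker:3.1.1-hadoop3.2   # assumed image
          env:
            - name: SPARK_MASTER                        # assumed env var
              value: spark://spark-master:7077
---
apiVersion: v1
kind: Service
metadata:
  name: spark-client
spec:
  selector: {app: spark-worker}
  ports:
    - {name: worker, port: 8081}    # worker web UI port
```

Because the master service is headless, worker pods resolve spark-master directly to the master pod's IP via cluster DNS.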

Spark Web-UI

The Spark master's web UI runs on port 8080 (behind the headless ClusterIP service). To access this service from the host machine, I need to do a Kubernetes port-forward. Then the web UI can be accessed via http://localhost:8080.
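Assuming the spark-master service described above, the port-forward is a one-liner:

```shell
# forward local port 8080 to the spark-master service's web UI port;
# keep this running while browsing http://localhost:8080
kubectl port-forward service/spark-master 8080:8080
```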

Spark Cassandra Job

I have used the following Spark job, which reads/processes data in Cassandra storage. Read more about Spark and Cassandra integration here. I have submitted the job to the Spark cluster via the Kubernetes cluster manager. In the job, I did not set the SparkConf master or the Cassandra host/port configs; I configure them dynamically when submitting the job to the Spark cluster via spark-submit.
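A sketch of such a job is shown below; the object name, keyspace, and table are hypothetical placeholders, and the point is that neither the master URL nor the Cassandra connection settings are hard-coded:

```scala
import org.apache.spark.sql.SparkSession

object SCassandraJob {
  def main(args: Array[String]): Unit = {
    // no .master() and no spark.cassandra.connection.* here --
    // they are injected via spark-submit --conf at submission time
    val spark = SparkSession.builder()
      .appName("spark-cassandra-job")
      .getOrCreate()

    // read a Cassandra table through the Spark Cassandra Connector
    // (keyspace/table names are placeholders)
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "mykeyspace", "table" -> "mytable"))
      .load()

    // simple processing step: count the rows
    println(s"row count: ${df.count()}")

    spark.stop()
  }
}
```

Keeping the connection settings out of the code means the same jar can be submitted against any cluster and Cassandra instance.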

I have built the .jar file (named sjobs.jar) from the Spark job via sbt assembly. The .jar file is then served with a simple HTTP service (named sjobs) for use with spark-submit. The sjobs service runs on port 8080, so the .jar file can be accessed via http://&lt;host&gt;:8080/sjobs.jar. Instead of serving it via an HTTP service, we could publish the .jar file on HDFS as well (for simplicity I have used an HTTP server). To run this Spark job, I need to deploy a Cassandra instance. I have run the Cassandra node via Docker. Following is the docker-compose.yml to deploy Cassandra and the sjobs service.
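A minimal sketch of that compose file; the nginx-based static file server and the jar path on disk are assumptions about how sjobs serves the jar:

```yaml
version: '3'
services:
  cassandra:
    image: cassandra:3.11
    ports:
      - "9042:9042"          # CQL native transport port
  sjobs:
    # simple HTTP service serving the assembled sjobs.jar (assumed nginx)
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      # path to the sbt-assembly output is illustrative
      - ./target/scala-2.12/sjobs.jar:/usr/share/nginx/html/sjobs.jar:ro
```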

Submit Spark Job

The final step is to submit the job to the Spark cluster, which runs on the Kubernetes cluster manager. The Spark master can be accessed via the Kubernetes ClusterIP service at spark://spark-master:7077. I have used a separate Docker image (bde2020/spark-base:3.1.1-hadoop3.2) to run the spark-submit command (the spark-submit command resides in the bde2020/spark-base:3.1.1-hadoop3.2 container). Following is the spark-submit command I have used to submit the job. It dynamically injects the Spark master address and the Cassandra host/port configurations. Once the job is submitted, its details/status will be available on the Spark web UI.
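A sketch of the submission, run from inside the bde2020/spark-base container; the main class name and the Cassandra host placeholder are illustrative, and spark.cassandra.connection.host/port are the standard Spark Cassandra Connector settings:

```shell
# submit the job to the master exposed by the spark-master ClusterIP
# service, injecting the Cassandra connection settings at submit time
spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.SCassandraJob \
  --conf spark.cassandra.connection.host=<cassandra-host> \
  --conf spark.cassandra.connection.port=9042 \
  http://<host>:8080/sjobs.jar
```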

Helm Chart Deployment

In the above sections I discussed deploying the Spark cluster with a traditional Kubernetes deployment (simple services and pods). Here I will discuss deploying a Spark cluster on Kubernetes with Helm charts. Bitnami provides a Helm repository for Spark. The deployment is straightforward.
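With the Bitnami chart, the deployment comes down to a couple of commands; the release name and worker count below are illustrative:

```shell
# add the Bitnami chart repository and install the Spark chart
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install my-spark bitnami/spark --set worker.replicaCount=2
```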

The web UI of the Helm deployment is available on port 80, so we can do a Kubernetes port-forward and access it from the local machine. A Spark job can be submitted to the cluster by connecting to a worker node. Once the job is submitted, the status/results will be available on the web UI.
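For example, assuming a release named my-spark, something along these lines works; the exact service and pod names the chart generates should be confirmed with kubectl, as they are assumptions here:

```shell
# list the services the chart created to find the master service name
kubectl get svc

# forward the web UI (served on port 80 by the chart) to localhost:8080
kubectl port-forward svc/my-spark-master-svc 8080:80

# submit a job from inside a worker pod (pod and class names illustrative)
kubectl exec -it my-spark-worker-0 -- \
  spark-submit --master spark://my-spark-master-svc:7077 \
  --class com.example.SCassandraJob http://<host>:8080/sjobs.jar
```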

Reference

  1. https://medium.com/rahasak/spark-cassandra-connector-24e5c8c7a03c
  2. https://medium.com/rahasak/hacking-with-apache-spark-f6b0cabf0703
  3. https://github.com/JahstreetOrg/spark-on-kubernetes-helm
  4. https://dzone.com/articles/running-apache-spark-on-kubernetes
  5. https://infohub.delltechnologies.com/l/architecture-guide-dell-emc-ready-solutions-for-data-analytics-spark-on-kubernetes/running-spark-on-kubernetes
