Deploying Apache Spark on a Local Kubernetes Cluster: A Comprehensive Guide

Safouane Ennasser
15 min read · Jun 20, 2023


This is part 2 of 3 of the article series.

Please refer to the first part if you are not familiar with building your own Spark docker image from binaries.

Summary:

1 - Introduction
2 - Set up a Local Kubernetes Cluster
3 - Install Kubectl
4 - Build a Docker Image for Spark and Push it to Kubernetes Internal Repository
5 - Deploy Spark Job Using spark-submit
6 - Monitor the Application

Introduction

Welcome to the second part of our tutorial on deploying Apache Spark on a local Kubernetes cluster. If you haven’t read the first part yet, where we explored deploying Spark using Docker-compose, we encourage you to check it out to gain a solid understanding of that deployment method. In this article, we will dive into deploying Spark on a Kubernetes cluster, leveraging the power and scalability of Kubernetes to manage Spark applications efficiently.

Kubernetes, a leading container orchestration platform, provides a robust environment for deploying and managing distributed applications. By deploying Spark on Kubernetes, you can take advantage of Kubernetes’ features such as dynamic scaling, fault tolerance, and resource allocation, ensuring optimal performance and resource utilization.

Before we proceed, we will guide you through setting up a local Kubernetes cluster using Kind (Kubernetes IN Docker), a tool designed for running Kubernetes clusters using Docker container “nodes.” We will then install Kubectl, the Kubernetes command-line tool, on Windows and ensure connectivity to the local Kubernetes cluster.

Once our Kubernetes cluster is up and running, we will move on to creating a Docker image for Apache Spark, including all the necessary dependencies and configurations. We will push the Docker image to the Kubernetes internal repository, making it accessible within the cluster.

With the Spark Docker image ready, we will explore how to deploy Spark jobs on the Kubernetes cluster using the spark-submit command. We will configure the required parameters and monitor the Spark application’s execution and resource utilization.

Throughout this article, we will emphasize monitoring and optimizing the Spark application deployed on Kubernetes. By leveraging Kubernetes’ monitoring tools and practices, we can gain insights into application performance, troubleshoot issues, and fine-tune resource allocation for optimal Spark processing.

By the end of this tutorial, you will have a comprehensive understanding of deploying Apache Spark on a local Kubernetes cluster. You will be equipped with the knowledge and skills to harness the power of Kubernetes for efficient and scalable Spark processing, enabling you to tackle large-scale data challenges with ease. So, let’s dive in and explore the world of Spark and Kubernetes deployment together!

Set up a Local Kubernetes Cluster

To set up a Kubernetes cluster on our local machine, we have several options: we can either use a virtual-machine-based solution or a Docker-based one.
A Docker-based solution is better for starting out, as the cluster is easily managed as a simple container.
For that purpose we’ll use KIND.
Kind (Kubernetes IN Docker) is a tool for running local Kubernetes clusters using Docker container “nodes.”
It is primarily designed for testing Kubernetes itself but can also be used for local development or CI.

1. Install Kind :

First, we have to download the KIND binary from the official GitHub repository (https://github.com/kubernetes-sigs/kind/releases).
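For example, on Windows you can fetch the release directly from a terminal instead of the browser. This is only a sketch for v0.20.0 on amd64; adapt the version and platform to your setup:

curl.exe -Lo kind-windows-amd64.exe https://kind.sigs.k8s.io/dl/v0.20.0/kind-windows-amd64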

Then, rename the binary to “kind” (or “kind.exe” on Windows) and place it in a directory included in your $PATH.

Finally, open a new terminal and check if KIND is accessible

>kind version
kind v0.20.0 go1.20.4 windows/amd64

If you get a “command not found” error, please double-check that you have added the binary location to your $PATH environment variable.
Once installed, KIND will allow us to manage Kubernetes clusters with simple commands.
Let’s create a cluster.

2. Create a Kubernetes Cluster :

Creating a cluster with KIND is as easy as running

$kind create cluster

This will set up a local Kubernetes cluster with containerized nodes using Docker.
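By default, kind creates a single-node cluster, which is enough for this tutorial. If you want to experiment with several nodes, kind accepts a configuration file. Here is a minimal sketch (the file name kind-config.yaml is just an example):

# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker

kind create cluster --config kind-config.yaml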

Check if the cluster is created :

kind get clusters
# Result
>kind

The result indicates that we already have a running cluster called ‘kind’.

  • Reader: Wait a moment!
    You said that KIND manages the Kubernetes cluster using a Docker container. How can I check that? What if I shut down my Docker daemon process?

We can easily check this by listing all running containers on our local machine:

> docker ps
# Result
ca731dd7d65b kindest/node:v1.27.3 "/usr/local/bin/entr…" 30 hours ago Up 6 minutes 127.0.0.1:61508->6443/tcp kind-control-plane

You can see that we have a new container running the Kubernetes cluster, so to answer your questions:

- YES, the whole cluster is managed inside a Docker container.

- YES, if you shut down your Docker daemon, you will lose access to the cluster.
However, you can access it again by restarting your Docker daemon, as the cluster is not destroyed.

So the previous command kind create cluster creates a Kubernetes cluster running in a Docker container. In my case, it has the following info:
Container(id = ca731dd7d65b; image=kindest/node:v1.27.3)

Great! Now we have set up a Kubernetes cluster on our local machine .. easyyy!

To communicate with the newly installed cluster, we need a Kubernetes client.
The client allows us to interact with the cluster by sending instructions, configurations, and other management operations.

Kubernetes provides a command line tool for communicating with a Kubernetes cluster’s control plane, using the Kubernetes API.

This tool is named kubectl.

Install Kubectl

1- Download the binary

Installation steps are well described in the official documentation
https://kubernetes.io/docs/tasks/tools/#kubectl

Choose the right installation depending on your local OS (the installation process is listed for Linux, macOS, and Windows).

This is just a matter of downloading the binary file and adding it to your $PATH environment variable.
In my case (Windows machine):

curl.exe -LO "https://dl.k8s.io/release/v1.27.2/bin/windows/amd64/kubectl.exe"

This will download the kubectl binary into my local folder.
Next, we have to place it in a location known to $PATH,
or just add the current folder to $PATH. I advise copying it into a known location (on Windows you can create a lib folder and add it to your $PATH).
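Before verifying against the cluster in the next step, you can quickly confirm that the binary itself is reachable from your $PATH:

kubectl version --client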

2- Check kubectl installation

Verify the installation by checking the state of our kubernetes cluster. Open a new terminal and run :

kubectl cluster-info --context kind-kind
# Result
Kubernetes control plane is running at https://127.0.0.1:61508
CoreDNS is running at https://127.0.0.1:61508/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

As you can see, Kubernetes sent back a two-line response.

This indicates that the Kubernetes cluster is up and ready to receive instructions.
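You can also list the cluster nodes; for our kind cluster you should see a single node, kind-control-plane, with the status Ready:

kubectl get nodes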

Now we have installed both the Kubernetes cluster and kubectl (the Kubernetes client).

Next steps are more exciting, as we’ll package a Spark application into a docker image, then use it to deploy the application in a cluster mode, managed by Kubernetes.

Build a Docker Image for Spark and Push it to Kubernetes Internal Repository

In our previous tutorial, we explored deploying Apache Spark using Docker-compose, which provided a convenient way to set up a Spark cluster for local development. However, when it comes to deploying Spark on a Kubernetes cluster, we need a different approach and a new Docker image optimized for Kubernetes.

In this part of the tutorial, we will walk you through the process of creating a Docker image specifically designed to run Spark on Kubernetes.

Do you still remember the Docker image we created in part 1 of the tutorial?
It was built using a Dockerfile written from scratch.
Now, our goal is to build another image based on Spark’s official tooling (not from scratch).

Download Spark
Apache Spark provides a pre-built Dockerfile and a script that generates Docker images optimized for Kubernetes. To access these tools, simply download the Spark binaries from the official Spark website (https://spark.apache.org/downloads.html), ensuring that you select the appropriate versions of Spark and Hadoop (in this case, versions 3.4.0 and 3.3+ respectively).
(You can also find these resources by cloning the Apache Spark GitHub repository, but I assume you have already downloaded a Spark distribution in the past.)

Once you have downloaded the compressed file, such as “spark-3.4.0-bin-hadoop3.tgz”, you can proceed to extract its contents. If you are using Windows, you can use a tool like 7zip to perform the extraction.

Now, let’s dive into building the Docker image that will enable us to run Spark on Kubernetes.

1- Build the Spark Docker image for Kubernetes (Windows users, see the Windows section below)

Navigate to “spark-3.4.0-bin-hadoop3-scala2.13/kubernetes/dockerfiles/spark” and check its content.

You will find a Dockerfile optimized for creating Spark images for Kubernetes. We’ll use that Dockerfile as a base for our future Spark jobs, and we don’t have to build it manually, as Spark provides a script for that.
(Don’t hesitate to take a look at that Dockerfile: <SPARK_HOME>/kubernetes/dockerfiles/spark/Dockerfile)

To build a Docker image for Apache Spark, including the necessary dependencies and configurations, Spark provides a dedicated script.
Navigate back to SPARK_HOME (spark-3.4.0-bin-hadoop3-scala2.13) and run:

./bin/docker-image-tool.sh -t our-own-apache-spark-kb8 build

This will build a Docker image with the tag
our-own-apache-spark-kb8.
You can also play with that script to create versions adapted for Python or R. Check the script’s help for usage.
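For instance, a PySpark-capable image can be built by pointing the script at the Python bindings Dockerfile shipped with the distribution. This is a sketch; run ./bin/docker-image-tool.sh -h to confirm the options available in your Spark version:

./bin/docker-image-tool.sh -t our-own-apache-spark-kb8 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build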

Check the newly created image

docker images
# Result
REPOSITORY TAG IMAGE ID CREATED SIZE
spark our-own-apache-spark-kb8 5fc75c2efb99 35 hours ago 659MB

As you can see, the image was added to the local Docker repository, so it can be used by a program running on our local machine, but it is still not accessible from the Kubernetes cluster.
For that purpose, we have to make the Kubernetes cluster aware of the local image.

Windows machine case (skip it if you are not on Windows and go to 2- Push the image to the Kubernetes repository)
As Spark doesn’t provide a batch script for Windows (for building the image), we can run the same command using WSL (https://learn.microsoft.com/en-us/windows/wsl/install).
WSL allows us to run Linux commands from Windows using a kind of virtual machine, but don’t panic, it is an easy and handy tool.
- After installing WSL, open a WSL terminal and run

./bin/docker-image-tool.sh -t our-own-apache-spark-kb8 build

If you get this error

The command 'docker' could not be found in this WSL 2

Enable the WSL integration in Docker Desktop settings. For details about using Docker Desktop with WSL 2, visit: https://docs.docker.com/go/wsl2/
In my case I’m using an Ubuntu WSL terminal (you can use whichever distribution you want). Once in that terminal, we have to add the current user to the docker group; this allows the current Ubuntu user to build and push images. Run:

sudo gpasswd -a $USER docker
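The group change usually only takes effect in a new session, so if Docker still complains about permissions you may need to log out and back in, or refresh the group membership in the current shell:

newgrp docker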

Now that the current user can use Docker, build the image:

sudo ./bin/docker-image-tool.sh -t our-own-apache-spark-kb8 build

At this point you should have a Docker image built and available in your local repository.

2- Push the image to the Kubernetes repository

Kubernetes manages an internal Docker image repository, so all we have to do is send the image to the cluster so it becomes aware of it.
For that we’ll use KIND (installed in the previous section).

Push the Docker image to the Kubernetes internal repository to make it accessible within the cluster:

kind load docker-image spark:our-own-apache-spark-kb8
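If you want to double-check that the image really landed inside the cluster, you can list the images known to the node’s container runtime and look for the spark:our-own-apache-spark-kb8 entry. This is a sketch assuming the default node container name, kind-control-plane:

docker exec kind-control-plane crictl images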

Now that we have a fresh image ready to be used and managed by Kubernetes, let’s use it to deploy a Spark app in cluster mode. We’ll deploy the same application as in part 1 of this tutorial, the Spark Pi example.

Deploy a Spark program Using spark-submit

This is the last step of part 2 of this tutorial; it describes the steps to deploy a Spark job onto Kubernetes.

While we can establish communication with the Kubernetes cluster, Apache Spark’s spark-submit program requires additional configuration to interact with the cluster effectively. This is because spark-submit is responsible for coordinating with Kubernetes to create workers and allocate resources. To facilitate this interaction, we need to create a service account that spark-submit can use.
By creating that service account, we enable any user associated with it, specifically the spark-submit program, to perform operations on the Kubernetes cluster. This ensures that spark-submit has the required privileges to manage Spark application runs on Kubernetes.

1- Service Account

As defined in Kubernetes official documentation

A service account is a type of non-human account that, in Kubernetes, provides a distinct identity in a Kubernetes cluster. Application Pods, system components, and entities inside and outside the cluster can use a specific ServiceAccount’s credentials to identify as that ServiceAccount.

To create the service account, we will execute a couple of commands using kubectl, the Kubernetes command-line tool.
The first command will create the service account named “spark” :

kubectl create serviceaccount spark

The second command establishes a cluster-level role binding for the service account, granting it the necessary permissions within the default namespace :

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
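As a quick sanity check (optional), you can ask Kubernetes whether the spark service account is now allowed to create pods, which is what spark-submit will need to do:

kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default
# Expected answer: yes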

Once we have set up the service account for Spark, we can proceed with running the spark-submit command. This command accepts various arguments to manage the execution of the Spark application, including specifying the number of executors, linking necessary JAR files, and configuring the deployment mode.

With the service account in place and the spark-submit command ready, we are now equipped to submit Spark applications to Kubernetes and instruct it to execute the provided example or any other desired application.

2- Spark-submit

Here is the final command to run (but DON’T run it yet, not before understanding some of its parameters):

spark-submit --master k8s://<YOUR_KUBERNETES_CLUSTER_CP> --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=spark:our-own-apache-spark-kb8 --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 100

Before we proceed let’s list the needed arguments:

  • --master
    The Kubernetes cluster’s entrypoint, which serves as the control plane for the cluster. You can find this information by running the following command (see also the snippet after this list) :
kubectl cluster-info

which will display the cluster information, including the control plane URL.
For example, the URL might look like https://127.0.0.1:61508

  • --deploy-mode = cluster
    This mode allows the Spark application to be distributed across multiple worker nodes within the Kubernetes cluster.
  • --conf
    We also need to configure additional properties using the “--conf” parameter. Here are the properties we will set:
  • spark.executor.instances=2 specifies the number of executor instances to be used for the application.
  • spark.kubernetes.container.image=spark:our-own-apache-spark-kb8 specifies the Docker image to be used for the Spark application. This image should match the one we created earlier.
  • spark.kubernetes.container.image.pullPolicy=IfNotPresent specifies the policy for pulling the Docker image. In this case, the image will only be pulled if it is not already present.
  • spark.kubernetes.authenticate.driver.serviceAccountName sets the service account name to authenticate the Spark driver.
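As mentioned for the --master argument above, if you prefer not to copy the control plane URL by hand from the kubectl cluster-info output, you can read it straight from your kubeconfig. A small sketch using jsonpath:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# e.g. https://127.0.0.1:61508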

3- Run the app

Now, let’s deploy our app by running the spark-submit command with the required parameters. Note that spark-submit must be accessible through your $PATH environment variable.
You can either add the Spark binaries (the /bin folder) to your path, or run the spark-submit script (spark-submit.cmd if you are on Windows) directly from its location.

spark-submit --master k8s://https://127.0.0.1:61508 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=spark:our-own-apache-spark-kb8 --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar 100

After running the command, you will see a description of the running process in your terminal. These descriptions provide information about what is happening within Kubernetes, but they do not include the Spark program’s logs.

During the execution, you will observe the container’s status, including the container name, image, and state.
Initially, it will show “ContainerCreating” as the container is being created based on the previously built image “spark:our-own-apache-spark-kb8”.

container status:
container name: spark-kubernetes-driver
container image: spark:our-own-apache-spark-kb8
container state: waiting
pending reason: ContainerCreating

At the end of the execution, you will see the container’s final status, indicating that it has terminated, which means that the Spark application has finished executing.

container name: spark-kubernetes-driver
container image: docker.io/library/spark:our-own-apache-spark-kb8
container state: terminated

4- Access Spark logs

To verify the results of our Spark program, we need to access the logs from the Kubernetes pods where the driver and workers are running. There are three available pods: driver, executor_1, and executor_2. For this purpose, we are primarily interested in the driver logs.

To get the driver pod’s name, we can run the command

kubectl get pods

This will display a list of pods, so we need to locate the driver pod. It will have a name in the format “spark-pi-280f5b88d4315605-driver” (the alphanumeric code will be different in your case).
Take note of the name and run the command “kubectl logs <driver-pod-name>” to retrieve the logs.

For example, if the driver pod’s name is “spark-pi-280f5b88d4315605-driver” you can run

kubectl logs spark-pi-280f5b88d4315605-driver
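While the application is still running, you can also stream the driver logs live instead of fetching them once, using the -f (follow) flag:

kubectl logs -f spark-pi-280f5b88d4315605-driver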

After running the command, you will see the logs displayed in your terminal. It will show the execution progress, task completion, and other relevant information.

INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) in 48 ms on 10.244.0.23 (executor 1) (99/100)
INFO TaskSetManager: Finished task 99.0 in stage 0.0 (TID 99) in 33 ms on 10.244.0.24 (executor 2) (100/100)
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.739 s
INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.785108 s
Pi is roughly 3.1409835140983513
INFO SparkContext: SparkContext is stopping with exitCode 0.
INFO SparkUI: Stopped Spark web UI at http://spark-pi-280f5b88d4315605-driver-svc.default.svc:4040
INFO KubernetesClusterSchedulerBackend: Shutting down all executors
INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
INFO MemoryStore: MemoryStore cleared
INFO BlockManager: BlockManager stopped
INFO BlockManagerMaster: BlockManagerMaster stopped
INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
INFO SparkContext: Successfully stopped SparkContext
INFO ShutdownHookManager: Shutdown hook called
INFO ShutdownHookManager: Deleting directory /var/data/spark-07edd225-28e0-44a3-a47f-26245aee2180/spark-82288145-f3ae-431e-936e-3de755cd1549
INFO ShutdownHookManager: Deleting directory /tmp/spark-80fdf968-3374-41e3-9237-00a4e2df5c86

At the end of the logs, you will find the calculated value of Pi. For example, the log might indicate that “Pi is roughly 3.1409835140983513.”

Congratulations! You have successfully deployed your first Apache Spark program on a Kubernetes cluster.
By accessing the logs, you can monitor the execution and obtain the results of your Spark application.

Monitor the application

To monitor the execution of Spark jobs and access the Spark UI, we need to forward the ports from the driver pod to our host machine. Here are the steps to follow:

1- Run the command ‘kubectl get pods’ to retrieve the name of the running Spark driver pod, then take note of the pod name.

kubectl get pods

2- To forward the ports, use the command

kubectl port-forward <driver-pod-name> 4040:4040

Replace <driver-pod-name> with the actual name of the Spark driver pod. This command will establish a port forwarding connection between the driver pod’s port 4040 and the same port on your local machine.

For example, if the driver pod’s name is “spark-pi-XXXXXXXXXXX-driver” you can run “kubectl port-forward spark-pi-XXXXXXXXXXX-driver 4040:4040”.
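If port 4040 is already busy on your machine, you can map the UI to any other free local port; the first number is the local port and the second is the pod’s port (8080 here is just an example):

kubectl port-forward spark-pi-XXXXXXXXXXX-driver 8080:4040
# Then browse to http://localhost:8080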

3- Once the port forwarding is established, open your web browser and go to localhost:4040.
This will give you access to the Spark UI.

Note that the Spark UI is only available when the pod’s status is “Running.” If the pod’s status is “Completed” or any other non-running state, the server will not respond as it is no longer active.

By accessing the Spark UI at “localhost:4040,” you can monitor the progress of your Spark jobs, view detailed job information, and analyze performance metrics.

Remember to keep the port forwarding active as long as you want to monitor the Spark UI. You can stop the port forwarding by terminating the command running in the terminal.

Enjoy monitoring your Spark jobs and exploring the Spark UI!

FAQ

  • Question: Why can’t I see the worker pods when running ‘kubectl get pods’?
  • Answer: At the end of the execution, these pods get destroyed, so you can no longer access them.
  • Question: How can I be sure that workers were created during the run?
  • Answer: Since we are running a small calculation, the program ends fast. However, that doesn’t prevent us from seeing the instantiation and running status of the pods by running ‘kubectl get pods’ while the spark-submit Pi calculation is in progress.
    So during the calculation you can run ‘kubectl get pods’ several times to watch the pods’ statuses evolve (or watch them continuously, as shown below).
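A convenient way to catch these short-lived executor pods is to watch the pod list from a second terminal while spark-submit is running; the -w flag keeps refreshing the output as pod statuses change:

kubectl get pods -w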

Conclusion

In this second part of the tutorial, we explored deploying Apache Spark on a Kubernetes cluster using a custom binary Spark image. We started by setting up a local Kubernetes cluster using Kind, a tool that allows us to run Kubernetes clusters using Docker container nodes. With the cluster up and running, we proceeded to create a Docker image for Spark optimized for Kubernetes.

We learned how to create a service account in Kubernetes to enable communication between Spark and the cluster. Then, we ran a Spark job using spark-submit, specifying the necessary parameters such as the Kubernetes cluster entrypoint, deploy mode, and the image for the Spark containers. We also checked the status of the Kubernetes cluster and observed the execution of the Spark job.

To view the logs and results of the Spark job, we accessed the logs of the driver pod. Additionally, we forwarded the ports to our local machine, allowing us to monitor the job’s progress and access the Spark UI. This provided us with a visual interface to analyze performance metrics and gain insights into the execution of our Spark job.

Now that we have successfully deployed and monitored a Spark job on Kubernetes using a custom binary Spark image, we are ready to explore another approach in the third part of this tutorial. In the next part, we will delve into deploying Spark on Kubernetes using Helm charts, a package manager for Kubernetes applications. This method offers an efficient and scalable way to manage Spark deployments, simplifying the deployment process and providing additional configuration options.

Stay tuned for the third part of our tutorial, “Deploy Spark on Kubernetes using Helm charts,” where we will explore the benefits of using Helm charts and guide you through the process of deploying Spark with ease and flexibility.
