A Detailed Walkthrough
Build Your Own Big Data Ecosystem — Part 1
Running Spark on Kubernetes
Data is an integral part of everything we do nowadays. Almost everything around us generates data in some form or another. This data is useful to us only if we can extract meaningful insights from it, and that is where data analytics comes in. There are plenty of products out there that satisfy our Big Data needs, but they usually come with hefty license costs.
In this multipart series, we will create our own cloud-agnostic Big Data ecosystem using only open source tools.
A typical Big Data ecosystem consists of the following:
- Core data analytics engine to run your big data queries
- Notebook/UI interface to create queries interactively
- Communication channel with an underlying Data Lake
- Framework to manage the lifecycle of ML Models
- Visualization tool to create graphs
Spark is well known for its high query performance and is widely used as the underlying data analytics engine in many successful products such as AWS EMR, Databricks, and Cloudera. We too will use Spark as our data analytics engine.
To decouple the ecosystem from the underlying infrastructure, we will run it on Kubernetes, so that we can run it wherever we want and scale it on the go. Spark can natively run on Kubernetes since version 2.3.
This blog post in a nutshell
We will create a Spark Docker image from the official Spark distribution. We will then create a Kubernetes cluster in Azure and use that image to run Spark driver and executor pods inside the cluster. Finally, we will run a sample Spark job in cluster mode to calculate the value of Pi on this Kubernetes cluster.
Pre-requisites
Although we can set everything up on a Windows machine, I strongly recommend using an Ubuntu-based machine for broader tooling support and ease of use.
The following need to be present on your dev machine (if not using a cloud shell):
- Docker
- kubectl
- minikube (Only if you plan to run Kubernetes on your local system)
- Java-11
- Docker Container Registry (ACR / ECR / Docker Hub etc.)
I have used Azure’s Cloud Shell to run most of my commands. This reduces the setup effort, as required binaries like Java, az cli, kubectl, and python come pre-installed with it.
Spark + Kubernetes — Some basic fundamentals
There are three major parts of the Spark architecture:
- Driver — Responsible for breaking an incoming Spark job into smaller chunks and governing the execution of those chunks
- Executor — Responsible for running an individual chunk of the overall job and reporting the result back to the driver
- Cluster Manager — Responsible for managing the underlying infrastructure on which the Spark driver and executors run
Traditionally, Spark shipped with YARN, Mesos, and Standalone cluster managers, but from Spark 2.3 onwards, Kubernetes can act as its cluster manager as well.
In our use case, the Kubernetes scheduler is responsible for creating the appropriate Driver and Executor Pods in the available nodes.
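Conceptually, the driver and executors Spark creates are just pods in the cluster. Below is a simplified sketch of roughly what a driver pod looks like; the field values are assumptions for illustration, not the exact spec Spark generates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver            # illustrative name
  labels:
    spark-role: driver             # Spark labels driver and executor pods
spec:
  serviceAccountName: spark        # account allowed to spawn executor pods
  containers:
    - name: spark-kubernetes-driver
      image: <<docker registry url>>/spark-py:<<tag>>   # the image we build below
```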
We can execute our jobs in two Spark modes: cluster mode and client mode. In this blog post, we will cover cluster mode, where we use spark-submit to submit a job to the Kubernetes cluster, which creates the Spark context, driver, and executor pods on the go.
In Part-2 of this series, we will go deeper into the client mode.
Step 1 — Creating the Spark Docker Image
In this step, we create a Spark Docker image that supports PySpark. We will use this image for the Kubernetes pods down the line.
Download and Extract Spark
Run the below curl command to download the compiled spark distribution. Alternatively, you can also visit the URL in a web browser and initiate the download.
curl https://mirrors.estointernet.in/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz --output spark-3.0.1-bin-hadoop3.2.tgz
At the time of writing this post, the latest available Spark version is 3.0.1, built for Hadoop 3.2.
After downloading, extract the downloaded archive
tar -xvzf spark-3.0.1-bin-hadoop3.2.tgz
Build the Spark Docker image
Spark comes with an inbuilt script, “docker-image-tool.sh”, which creates the Docker image for us. From the root of the extracted folder, run the following command.
./bin/docker-image-tool.sh \
  -r <<docker registry url>> \
  -t v3.0.1-j11 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  -b java_image_tag=11-jre-slim \
  build
Note — Docker installation is required on the machine where the script is being run.
In the above command, we execute docker-image-tool.sh with parameters as follows
- -r sets the docker registry
- -t sets the tag for the output docker images
- -p is the path of the Dockerfile to use. We point it at the Python Dockerfile, which adds PySpark support.
- -b passes build args to the Dockerfile. In our case, we set the Java base image to 11-jre-slim.
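The -b build arg works because Spark’s base Dockerfile (at kubernetes/dockerfiles/spark/Dockerfile in the extracted distribution) parameterises its base image. The relevant lines look roughly like this in the 3.0.1 distribution:

```dockerfile
# Default is Java 8; our -b flag overrides this to 11-jre-slim
ARG java_image_tag=8-jre-slim
FROM openjdk:${java_image_tag}
```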
After docker-image-tool.sh finishes successfully, you should see two Docker images in your local registry: one named <<docker registry url>>/spark and the other named <<docker registry url>>/spark-py.
As we plan to run Python queries in the upcoming parts, we will use the spark-py image in the subsequent steps.
Push the Image to a Container Registry
The next step is to push the Docker image to the Docker repository. We will refer to this image location while running the spark-submit command later in this post.
docker push <<docker registry url>>/spark-py:v3.0.1-j11
Step 2 — Creating the Kubernetes Cluster
Now that our Spark docker image is ready, it's time to spin up a Kubernetes cluster for running Spark.
I am using Azure Kubernetes Service to create a cluster on Azure. But as this approach doesn’t really care where your cluster is, you can deploy it to the location of your choice.
To create an AKS cluster, follow this official doc from Microsoft.
Setting up kubectl
The first task for us is to configure kubectl to connect to our cluster. kubectl works off a kubeconfig file, normally present at $HOME/.kube/config on your system. Various Kubernetes services provide different ways of configuring kubectl to work with the cluster.
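A kubeconfig file ties together clusters, users, and contexts. A trimmed sketch of its shape is below; the names and server URL are illustrative, not values from a real cluster:

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: myAKSCluster
    cluster:
      server: https://myakscluster-dns.hcp.eastus.azmk8s.io:443   # API server URL
      certificate-authority-data: <base64 CA cert>
contexts:
  - name: myAKSCluster
    context:
      cluster: myAKSCluster
      user: clusterUser_myResourceGroup_myAKSCluster
users:
  - name: clusterUser_myResourceGroup_myAKSCluster
    user:
      token: <access token>
current-context: myAKSCluster    # the context kubectl uses by default
```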
For AKS cluster, run the following az-cli command to make kubectl point to it.
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
Note — You will need to run “az login” before running the above command if you are doing this from a local system
You can run this command directly on the Azure cloud shell and start using kubectl there.
Create a Kubernetes Namespace
Next, we create a namespace within the cluster. We could use the default namespace, but a dedicated namespace is a cleaner and more scalable approach once we deploy other tools into the same cluster.
To create the namespace, run the below command
kubectl create namespace spark
Create a Service Account and Bind a Role to it
As part of the Spark execution cycle, the driver pod needs to change the state of the cluster, creating and deleting pods as required. We therefore need a Kubernetes service account with edit permissions on the cluster. We will use this account for running our spark-submit command.
kubectl create serviceaccount spark --namespace=spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark:spark --namespace=spark
We create a service account named “spark” in our spark namespace and bind the edit cluster role to it.
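The two kubectl commands above are equivalent to applying manifests like these (a declarative sketch, useful if you keep your cluster configuration in version control):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
subjects:
  - kind: ServiceAccount   # the account our driver pod will run under
    name: spark
    namespace: spark
roleRef:
  kind: ClusterRole
  name: edit               # built-in role allowing create/delete of pods
  apiGroup: rbac.authorization.k8s.io
```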
Assigning Image Pull permission on Docker Registry
For creating the containers in pods, Kubernetes needs access to your docker registry for pulling the docker image. For this, we create a secret in our cluster with the credentials and then associate that secret with the service account that we created in the previous step.
Note — While creating the AKS cluster, we can associate an Azure Container Registry with it from the creation wizard itself. If you have already done that, you can skip this step
Run the below command to create the secret
kubectl create secret docker-registry docker-repo-access \
  --namespace=spark \
  --docker-server=<<docker registry url>> \
  --docker-username=<<username>> \
  --docker-password=<<password>>
Note — If you are using ACR with a cluster outside Azure, you can create a Service Principal in AAD, grant it the AcrPull role on the registry, and use the Service Principal’s client id & client secret in place of the username & password
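Under the hood, kubectl stores these credentials in a secret of type kubernetes.io/dockerconfigjson, whose payload is base64-encoded JSON. A small Python sketch of that encoding; the registry URL and credentials here are placeholders, not real values:

```python
import base64
import json

def docker_config_json(server: str, username: str, password: str) -> str:
    """Build the base64 payload stored under the secret's .dockerconfigjson key."""
    # The "auth" field is the base64 of "username:password"
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {"auths": {server: {"username": username,
                                 "password": password,
                                 "auth": auth}}}
    return base64.b64encode(json.dumps(config).encode()).decode()

# Hypothetical credentials for illustration only
payload = docker_config_json("myregistry.azurecr.io", "myuser", "mypassword")
```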
Assign the secret to our service account
kubectl patch serviceaccount spark -p '{"imagePullSecrets": [{"name": "docker-repo-access"}]}' -n spark
Step 3 — Running a Spark Job on the Cluster
With all of the actions we performed in step 2, our cluster is now ready to run a spark job.
Finally, it's time to run our first-ever Spark Job on this cluster.
If you are running from the Azure cloud shell, open another shell in a new tab and run the below command to open a proxy connection to our cluster at 127.0.0.1:8001
kubectl proxy
Then use the command spark-submit to run our job.
./bin/spark-submit \
  --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<<docker registry url>>/spark-py:v3.0.1-j11 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar 1000
- --master — The URL of the API server of our cluster. k8s:// is a convention that Spark uses to identify the URL as a Kubernetes cluster endpoint.
- --name — The name that we want for our Spark job
- --class — The entry point Java class from which our execution will begin
- --conf — Using this we provide details such as the number of executor pods we need, the service account to associate with the pods, and the container image to use for spinning up the pods
- local:// — A convention that tells Spark that the code package for our job is present within the container itself. Using it, we specify that we want to run the jar file named spark-examples_2.12-3.0.1.jar present at the respective path within the image.
After submitting the Command, you can put a watch on the cluster pods using the below command
watch kubectl get pods -n spark
The life cycle of the pods will be as follows: the driver pod comes up first, it spawns the requested executor pods, the executors terminate once their chunks of work are done, and finally the driver pod moves to the Completed state.
If you can see the pods coming up and going down in the “watch” command above as per this lifecycle, then everything went well.
For our last verification, we check the logs of the driver pod using the below command
kubectl logs <<driver_pods_name>>
You should see a line in the output as follows
Pi is roughly 3.152155760778804
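That figure comes from SparkPi’s Monte Carlo approach: throw random points into the unit square and count how many land inside the quarter circle. A plain-Python sketch of the same idea (Spark parallelises this loop across the executor pods; the function name and seed are our own for illustration):

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi: the fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(1_000_000))  # roughly 3.14
```

The estimate gets closer to Pi as the sample count grows, which is why the SparkPi job takes a partition count (the trailing 1000 in our spark-submit command) to fan the sampling out.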
With this, we wrap up Part 1. We have now built the foundation of our ecosystem.
In the next part, we will explore the usage of a Jupyter notebook in the cluster for running interactive Spark queries in client mode.