Cloud Native Data with Spark 2.3 and Kubernetes

The moment that many folks excited about the potential for Kubernetes in the big data space had been waiting for finally arrived at the end of February, when the Apache Spark project released 2.3 and, with it, native support for Kubernetes as an orchestrator.

In this blog post, I’ll walk through running a Spark job on a Kubernetes cluster running on Microsoft Azure, but nearly all of the steps are equally applicable to the other public clouds.

For this I’m going to use a simple job that copies a set of Avro-encoded data from one container in Azure’s object storage to another. It’s simple, but complex enough to show you how to build the Scala project and include the dependencies that need to be submitted with the job. I’m using Scala because it’s the only language supported in the Spark 2.3 release, but the maintainers are busy bringing full language and feature support to Kubernetes.

You’ll need a functional Scala 2.11 environment, along with git and Docker, installed on your system.

With that, let’s first clone this project:

$ git clone https://github.com/timfpark/copyData

Looking at src/main/scala/copyData.scala, we can see that it is a standard-issue Spark Scala job that, at its core, reads and writes data serialized in the Avro format:

...
val data = spark.read
  .format("com.databricks.spark.avro")
  .load(conf.get("spark.copydata.frompath"))

data.write
  .mode("overwrite")
  .format("com.databricks.spark.avro")
  .save(conf.get("spark.copydata.topath"))
...
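Not shown in the excerpt is how the storage account key from the spark.copydata.storageaccountkey setting gets wired into the Hadoop configuration so the wasb:// reads and writes can authenticate. The hadoop-azure connector looks for the key under a property named after the storage account; a hypothetical helper (not part of the sample project) that builds that property name:

```scala
// Hypothetical helper (not in the sample project): the hadoop-azure
// connector reads an Azure storage account key from a property of this
// shape, which a job would set on the Hadoop configuration, e.g. via
// spark.sparkContext.hadoopConfiguration.set(prop, key).
object AzureConf {
  def accountKeyProperty(storageAccount: String): String =
    s"fs.azure.account.key.$storageAccount.blob.core.windows.net"
}
```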

Let’s build a JAR from this that we can deploy. The cloned project includes the build.sbt definition for doing this, so all you need to do from the project root is:

$ sbt clean package

This builds a JAR for the job at target/scala-2.11/copy-data_2.11-0.1.0-SNAPSHOT.jar
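For reference, a build.sbt that produces a JAR with this name looks roughly like the following; the versions here are assumptions, so check the cloned repository for the authoritative definition:

```scala
// Sketch of a build.sbt for this project (versions are assumptions).
name := "copy-data"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.11.12"

// "provided" because Spark itself is supplied by the container at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
```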

Native support for Kubernetes means that executing this JAR is done the container way: by deriving from the base Spark image and injecting your job’s JAR into it.

As of this writing, there are no public Docker Hub images for Spark 2.3, so you need to download the distribution and build one yourself, replacing timfpark below with your user name on Docker Hub:

$ wget http://apache.mirrors.hoobly.com/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz
$ cd spark-2.3.0-bin-hadoop2.7
$ docker build -t timfpark/spark:2.3.0 -f kubernetes/dockerfiles/spark/Dockerfile .
$ docker push timfpark/spark:2.3.0

With this base image in place, we can build the image for our job. I’ve included a Dockerfile in the sample project that looks like this:

FROM timfpark/spark:2.3.0
RUN mkdir -p /opt/spark/jars
COPY target/scala-2.11/copy-data_2.11-0.1.0-SNAPSHOT.jar /opt/spark/jars

Again, adjust the FROM line to reference the base image you just pushed to your Docker Hub namespace, and then build and push the job image:

$ docker build -t timfpark/copy-data:latest .
$ docker push timfpark/copy-data:latest

With this, we have everything we need to start the job. First, find your Kubernetes master’s endpoint:

$ kubectl cluster-info
Kubernetes master is running at https://test-cluster.eastus2.cloudapp.azure.com
...
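spark-submit (which we’ll use below) wants this endpoint expressed with a k8s:// prefix and an explicit port. As a trivial sketch of that mapping, using a hypothetical helper name:

```scala
// Hypothetical helper: turn the https:// endpoint reported by
// `kubectl cluster-info` into the k8s:// master URL that spark-submit
// expects, defaulting to port 443 (the API server's TLS port).
object K8sMaster {
  def apply(httpsEndpoint: String, port: Int = 443): String =
    "k8s://" + httpsEndpoint.stripPrefix("https://").stripSuffix("/") + s":$port"
}
```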

If you are running RBAC on your Kubernetes cluster, also create a service account and cluster role binding for Spark:

$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Then change directory back to the Spark distribution we unpacked a few steps ago and execute the job, substituting your details from the previous steps:

$ cd ~/spark-2.3.0-bin-hadoop2.7
$ bin/spark-submit \
    --master k8s://test-cluster.eastus2.cloudapp.azure.com:443 \
    --deploy-mode cluster \
    --name copyData \
    --class io.timpark.CopyData \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.copydata.containerpath=wasb://CONTAINERS@STORAGE_ACCOUNT.blob.core.windows.net \
    --conf spark.copydata.storageaccount=STORAGE_ACCOUNT \
    --conf spark.copydata.storageaccountkey=STORAGE_ACCOUNT_KEY \
    --conf spark.copydata.frompath=wasb://CONTAINER1@STORAGE_ACCOUNT.blob.core.windows.net/PATH1 \
    --conf spark.copydata.topath=wasb://CONTAINER2@STORAGE_ACCOUNT.blob.core.windows.net/PATH2 \
    --conf spark.executor.instances=16 \
    --conf spark.kubernetes.container.image=timfpark/copy-data:latest \
    --jars http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.2/hadoop-azure-2.7.2.jar,http://central.maven.org/maven2/com/microsoft/azure/azure-storage/3.1.0/azure-storage-3.1.0.jar,http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
    local:///opt/spark/jars/copy-data_2.11-0.1.0-SNAPSHOT.jar
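The wasb:// URLs in the frompath and topath settings follow the naming scheme the hadoop-azure connector expects: container name, then the storage account host, then a path within the container. A hypothetical helper (not part of the sample project) that composes one:

```scala
// Hypothetical helper (not in the sample project): compose a wasb:// URL
// in the form hadoop-azure expects, i.e.
// wasb://CONTAINER@ACCOUNT.blob.core.windows.net/PATH
object WasbPath {
  def apply(container: String, storageAccount: String, path: String): String =
    s"wasb://$container@$storageAccount.blob.core.windows.net/" + path.stripPrefix("/")
}
```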

After a few moments the driver will spin up:

NAME                                               READY     STATUS     RESTARTS   AGE
copydata-df8de4b8e1753024ac8790d72ccf3c9b-driver   0/1       Init:0/1   0          4s

Followed by the 16 executor instances shortly afterwards:

NAME                                                READY     STATUS    RESTARTS   AGE
copydata-df8de4b8e1753024ac8790d72ccf3c9b-driver    1/1       Running   0          56s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-1    1/1       Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-10   1/1       Running   0          37s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-11   1/1       Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-12   1/1       Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-13   1/1       Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-14   1/1       Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-15   1/1       Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-16   1/1       Running   0          21s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-2    1/1       Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-3    1/1       Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-4    1/1       Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-5    1/1       Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-6    1/1       Running   0          37s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-7    1/1       Running   0          38s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-8    1/1       Running   0          38s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-9    1/1       Running   0          37s

And there you have it — you’ve built, containerized, and executed your first Spark job on Kubernetes.