Cloud native Data with Spark 2.3 and Kubernetes
The moment that many folks excited about the potential for Kubernetes in the big data space had been waiting for finally arrived when the Apache Spark project released version 2.3 at the end of February, and with it, native support for Kubernetes as an orchestrator.
In this blog post, I’ll walk through running a Spark job on a Kubernetes cluster running on Microsoft Azure, but nearly all of the steps apply equally to the other public clouds.
For this I’m going to use a simple job that copies a set of Avro-encoded data from one container in Azure’s object storage to another, but one that is complex enough to show you how to build the Scala project and include the dependencies that need to be submitted with the job. I’m using Scala because it’s the only language supported in the Spark 2.3 release, but the maintainers are busy bringing full language and feature support to Kubernetes.
You’ll need a functional Scala 2.11 environment, plus git and Docker installed on your system.
With that, let’s first clone this project:
$ git clone https://github.com/timfpark/copyData

Looking at src/main/scala/copyData.scala, we can see that it is a standard-issue Spark Scala job that at its core does a read and write of data serialized in the Avro format:
...
val data = spark.read
  .format("com.databricks.spark.avro")
  .load(conf.get("spark.copydata.frompath"))

data.write
  .mode("overwrite")
  .format("com.databricks.spark.avro")
  .save(conf.get("spark.copydata.topath"))
...
Let’s build a JAR from this that we can deploy. The cloned project includes the build.sbt definition for doing this, so all you need to do from the project root is:
$ sbt clean package

This will build a JAR for the job at target/scala-2.11/copy-data_2.11-0.1.0-SNAPSHOT.jar.
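For reference, a build.sbt for a job like this typically looks something like the following. This is a sketch, not the cloned project’s actual file, and the library versions here are assumptions; consult the project for the authoritative definition. Marking the Spark dependency as "provided" keeps it out of the packaged JAR, since the cluster’s Spark distribution supplies it at runtime:

```scala
name := "copy-data"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.11.12"

// "provided" scope: available at compile time, supplied by the cluster at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.3.0" % "provided",
  "com.databricks"   %% "spark-avro" % "4.0.0" % "provided"
)
```

Because spark-avro is "provided" here, it must also be passed to spark-submit with --jars, as we do below.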
Native support for Kubernetes means that executing this JAR is done in the container way: by deriving from the base Spark image and injecting your job’s JARs so they can be run.
As of this writing, there are no public Docker Hub images for Spark 2.3, so you need to download the distribution and build the base image yourself, replacing timfpark below with your user name on Docker Hub:
$ wget http://apache.mirrors.hoobly.com/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz
$ cd spark-2.3.0-bin-hadoop2.7
$ docker build -t timfpark/spark:2.3.0 -f kubernetes/dockerfiles/spark/Dockerfile .
$ docker push timfpark/spark:2.3.0
With this, we can build our first image. I’ve included a Dockerfile in the sample project that looks like this:
FROM timfpark/spark:2.3.0

RUN mkdir -p /opt/spark/jars
COPY target/scala-2.11/copy-data_2.11-0.1.0-SNAPSHOT.jar /opt/spark/jars
Again, adjust this to use your recently pushed image in your Docker Hub namespace, and then build and push a Docker image for the job:
$ docker build -t timfpark/copy-data:latest .
$ docker push timfpark/copy-data:latest

With this, we have everything we need to start the job. First, find your Kubernetes master’s endpoint:
$ kubectl cluster-info
Kubernetes master is running at https://test-cluster.eastus2.cloudapp.azure.com
...

If you are running RBAC on your Kubernetes cluster, also create a service account and a role binding for Spark:
$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Then change directory back to the Spark distribution we unpacked a few steps ago and execute the job, substituting all of your details from the previous steps:
$ cd ~/spark-2.3.0-bin-hadoop2.7
$ bin/spark-submit \
    --master k8s://test-cluster.eastus2.cloudapp.azure.com:443 \
    --deploy-mode cluster \
    --name copyLocations \
    --class io.timpark.CopyData \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.copydata.containerpath=wasb://CONTAINERS@STORAGE_ACCOUNT.blob.core.windows.net \
    --conf spark.copydata.storageaccount=STORAGE_ACCOUNT \
    --conf spark.copydata.storageaccountkey=STORAGE_ACCOUNT_KEY \
    --conf spark.copydata.frompath=wasb://CONTAINER1@STORAGE_ACCOUNT.blob.core.windows.net/PATH1 \
    --conf spark.copydata.topath=wasb://CONTAINER2@STORAGE_ACCOUNT.blob.core.windows.net/PATH2 \
    --conf spark.executor.instances=16 \
    --conf spark.kubernetes.container.image=timfpark/copy-data:latest \
    --jars http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.2/hadoop-azure-2.7.2.jar,http://central.maven.org/maven2/com/microsoft/azure/azure-storage/3.1.0/azure-storage-3.1.0.jar,http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
    local:///opt/spark/jars/copy-data_2.11-0.1.0-SNAPSHOT.jar
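It’s worth noting how the custom spark.copydata.* settings travel: spark-submit collects every --conf KEY=VALUE pair into the driver’s SparkConf, and the job reads them back by key, as in the read/write snippet earlier. The sketch below models that flow with a plain Map standing in for SparkConf; the account and container names are made up for illustration:

```scala
// Sketch (not from the project): --conf KEY=VALUE pairs become entries the
// job can look up by key. A plain Map stands in for SparkConf here.
val submitted = Map(
  "spark.copydata.frompath"  -> "wasb://container1@myaccount.blob.core.windows.net/path1",
  "spark.copydata.topath"    -> "wasb://container2@myaccount.blob.core.windows.net/path2",
  "spark.executor.instances" -> "16"
)

// Like SparkConf.get, fail loudly when a required setting was not submitted.
def getConf(key: String): String =
  submitted.getOrElse(key, throw new NoSuchElementException(key))

// The wasb:// scheme encodes both pieces of the Azure location:
//   wasb://CONTAINER@STORAGE_ACCOUNT.blob.core.windows.net/PATH
val fromPath  = getConf("spark.copydata.frompath")
val container = fromPath.stripPrefix("wasb://").takeWhile(_ != '@')
```

This is also why the storage account key has to be passed as a --conf flag: the executors resolving those wasb:// paths need credentials at runtime, not at build time.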
After a few moments the driver will spin up:
NAME                                               READY   STATUS     RESTARTS   AGE
copydata-df8de4b8e1753024ac8790d72ccf3c9b-driver   0/1     Init:0/1   0          4s
Followed by the 16 executor instances shortly afterwards:
NAME                                                READY   STATUS    RESTARTS   AGE
copydata-df8de4b8e1753024ac8790d72ccf3c9b-driver    1/1     Running   0          56s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-1    1/1     Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-10   1/1     Running   0          37s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-11   1/1     Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-12   1/1     Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-13   1/1     Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-14   1/1     Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-15   1/1     Running   0          29s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-16   1/1     Running   0          21s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-2    1/1     Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-3    1/1     Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-4    1/1     Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-5    1/1     Running   0          46s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-6    1/1     Running   0          37s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-7    1/1     Running   0          38s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-8    1/1     Running   0          38s
copydata-df8de4b8e1753024ac8790d72ccf3c9b-exec-9    1/1     Running   0          37s
And there you have it: you’ve built, containerized, and executed your first job in Spark on Kubernetes.