Running Spark Jobs with Kubernetes on DigitalOcean High CPU Droplets

With the release of high CPU Droplets on DigitalOcean, running data workloads like stream and batch processing can be considerably more efficient. One approach is to deploy these data tools as containerized services, and with Kubernetes, deploying something like Apache Spark can be highly automated, with resources scheduled across the cluster.

Deploying Kubernetes on DigitalOcean

There are a few fully automated options here with excellent control planes, but my favorite is Stackpoint. However, its use requires a subscription, which might not be ideal for a testing/development environment. If you already use another method to deploy a cluster, you can use it again, substituting the node size for one of the now-available high CPU sizes for this use case (in my example, c-4).

Kris Nova has added a DigitalOcean profile to the excellent kubicorn tool, which will deploy Kubernetes on DigitalOcean with an encrypted VPN service mesh:

The only thing you’ll need to do differently from the above, so your deployments can be scheduled onto the high CPU Droplets, is to modify this line in the Go package:

to reflect the high CPU droplet size.

A less maintained, but fairly standard, deploy method is to use my repo for the now-deprecated kube-up.sh process with DigitalOcean as the provider; it still functions to create a basic cluster:

Spark on Kubernetes

This segment comes from an excellent blog post on the subject, and I recommend reading it for the full context, but I’ll re-post the highlights here to get up and running with an example Spark job. The goal here is to run these Spark jobs on the high CPU worker nodes, and with this approach to running Spark, that’s definitely as easily done as said.

With the above Master IP noted, you can grab the Spark package:

wget https://github.com/apache-spark-on-k8s/spark/releases/download/v2.1.0-kubernetes-0.1.0-alpha.1/spark-2.1.0-k8s-0.1.0-without-hadoop.tgz
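Extracting it is straightforward; for example (the filename matches the download link above):

tar -xzf spark-2.1.0-k8s-0.1.0-without-hadoop.tgz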

From the extracted package’s directory, submitting a job can be done like this:

bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://YOUR_MASTER_IP:PORT \
--kubernetes-namespace default \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
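Once the job is submitted, the driver and executor pods come up in the default namespace, and you can follow along with kubectl (the driver pod name below is a placeholder; the real name is generated from the app name and a timestamp):

kubectl get pods -w
kubectl logs -f <spark-pi-driver-pod>

The SparkPi example prints a line like "Pi is roughly 3.14..." in the driver log once it completes.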

Making the most of Kubernetes

If you want your Spark jobs to run on specific worker nodes (for example, if you run an environment with mixed Droplet sizes and workloads), or if you need to isolate workload traffic, you can use things like nodeSelector and the different affinity options to control where these jobs run after submission.

For example, something like this can be used to match a grouping of hosts:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "whatever"
            operator: Exists   # matches any node carrying the "whatever" label
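For that term to match anything, the nodes you want to target need the label applied first; for example (the node name here is a placeholder for one of your high CPU workers):

kubectl label nodes your-highcpu-worker whatever=highcpu

Since the Exists operator only checks for the label key, any value will do.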

This is great, and provides some redundancy to a fairly robust system, but the obvious remaining point of failure is the Droplet fleet making up the cluster itself. Since pods (and other controllers) can, in theory, all be provisioned on the same node, losing that node for whatever reason can mean an outage or, in the case of pods, complete pod loss, so you can further target your deployment by:

  1. Labeling your nodes (as in the example above)
  2. Using the nodeSelector key in your configuration to manually target pods to specific nodes, keeping them relatively isolated from too much of the rest of the cluster’s workload so that, in this case, they remain online if a worker node drops out of the Kubernetes cluster (a quick sketch follows this list).
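As a minimal sketch of that second point (assuming the whatever=highcpu label from the example above has been applied), the pod spec only needs a nodeSelector entry:

spec:
  nodeSelector:
    whatever: "highcpu"

Unlike the affinity match above, nodeSelector compares exact key/value pairs, so the value has to match the label on the node.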

Another (more automated) approach to this is to effectively reserve the node’s resources, so if you want to avoid deploying to a specific node, you can use kubectl’s taint feature on said node to prevent Kubernetes from (re)scheduling onto that cluster member unless a workload matches the behavior defined in your taint command (for example, reserving it for specific namespaces, which is helpful if your cluster is mixed-use, running a number of different workloads that may not require the same scope as your Spark jobs).
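For example (the node name and the dedicated=spark key/value are placeholders), tainting a node and then tolerating that taint from the pods you do want landing there looks something like this:

kubectl taint nodes your-highcpu-worker dedicated=spark:NoSchedule

and, in the spec of those pods:

spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "spark"
    effect: "NoSchedule"

Pods without a matching toleration are simply scheduled elsewhere, keeping the tainted node reserved for the Spark workload.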

Kubernetes, like Docker Swarm, has multiple affinity options, meaning how pods are scheduled adheres to defined placement behavior (comparable to the binpack strategy in Swarm, but with a bit more flexibility out of the box):

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
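The anti-affinity side is handy for the single-node failure concern above; as a rough sketch (the app: spark-pi label is a placeholder), this keeps pods sharing that label off of the same node:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - "spark-pi"
        topologyKey: "kubernetes.io/hostname"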

There are many other amazing things that can be done with Kubernetes, and if this is your first exposure to the ecosystem, I recommend checking out these resources:

Both are excellent, efficient resources to get you up and running on Kubernetes.