In the previous post, we configured our local environment to run Spark on Kubernetes. Now we can take advantage of some features this approach provides, such as controlling which machines run the Spark driver and executors.

By the end of this post, you will be able to schedule the Spark driver on ON_DEMAND instances and the Spark executors on SPOT instances.

Constraining Spark Application Pods

As we all know, the Spark driver is the execution core of a Spark application; if we lose it, the whole execution stops. Executors are different: when one is lost, Spark can reschedule its tasks on another executor. So, how can we avoid losing the Spark driver while also lowering cloud costs for batch processing?

Kubernetes lets us constrain pods to nodes with specific labels through a feature known as node affinity.

The driver pod will only be scheduled on nodes with the label kubernetes.io/instanceType=ON_DEMAND, while the executor pods will only be scheduled on nodes with the label kubernetes.io/instanceType=SPOT.

The following example defines a SparkApplication that specifies the required instance type for the driver and for the executors.
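For these rules to match anything, the nodes themselves must carry the label. As a rough sketch, the relevant part of a node object could look like the fragment below. The node name is hypothetical, and in a managed cluster the label would normally be applied by your node group or provisioning configuration rather than written by hand.

```yaml
# Illustrative node metadata only; the node name is hypothetical
apiVersion: v1
kind: Node
metadata:
  name: example-on-demand-node
  labels:
    # The label that the driver's node affinity rule matches
    kubernetes.io/instanceType: ON_DEMAND
```

A node intended for executors would carry the same key with the value SPOT.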

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
...
  driver:
    ...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: 'kubernetes.io/instanceType'
                  operator: In
                  values:
                    - ON_DEMAND
  executor:
    ...
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: 'kubernetes.io/instanceType'
                  operator: In
                  values:
                    - SPOT

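Run the Spark job

kubectl apply -f examples/node-affinity/node-affinity.yaml

While the job runs, list the pods together with the nodes they were scheduled on (for example, with kubectl get pods -o wide) to verify that the driver and executors landed on the expected instance types.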

What's going on in the background?

According to the Spark Operator documentation, pod customizations such as affinity require a mutating admission webhook, which modifies the driver and executor pods before they are created. We enabled it in the previous post.

For background on admission controllers, see: https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/#what-are-kubernetes-admission-controllers
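To make the mutation concrete, here is a sketch of what the affinity section of the driver pod might look like once the webhook has copied the rule from the SparkApplication into the pod spec. The pod name is hypothetical and all other fields are omitted; treat it as an illustration of the standard Pod affinity structure, not as the operator's exact output.

```yaml
# Abridged driver pod after mutation; the pod name is hypothetical
apiVersion: v1
kind: Pod
metadata:
  name: example-spark-driver
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Same rule as declared under driver.affinity in the SparkApplication
              - key: kubernetes.io/instanceType
                operator: In
                values:
                  - ON_DEMAND
```

You can compare this against a real pod by dumping it with kubectl get pod followed by the pod name and -o yaml. If the affinity section is missing, the webhook is most likely not enabled.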

Conclusion

In this post, we used node affinity to define which instance type runs the Spark driver and which runs the executors.

Running executors on SPOT instances is a good way to reduce the cloud costs of Spark jobs, and we also benefit from the scheduling features Kubernetes already provides.

I hope this post was useful. You can find a complete example in my git repository.

Thank you for your time!

Join me: https://www.linkedin.com/in/tiagotxm/

Tiago Xavier
Engenharia de Dados Academy

Data Engineer | I write about data engineering, kubernetes and open-source projects