Kubernetes for data engineering

Yes, it’s a good idea.


What’s special about data engineering

Active infrastructure-as-code

helm upgrade --install --namespace metabase metabase stable/metabase --values=values.yaml
# values.yaml
image:
  repository: <internal docker repo>/analytics/metabase
  tag: 0.40.1-joom-1
database:
  type: mysql
  host: <internal rds database>.rds.amazonaws.com
  port: 3306
  dbname: metabase
  # ... username and password settings ...
ingress:
  enabled: true
  hosts:
    - metabase.<internal domain>
  path: /*
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/group.name: analytics
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: <redacted>
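
Before an upgrade like this is applied, the rendered manifests can be previewed without touching the cluster; a minimal sketch, using the same chart and values file as above:

helm template metabase stable/metabase --namespace metabase --values=values.yaml
# or simulate the upgrade itself without applying it:
helm upgrade --install --dry-run --namespace metabase metabase stable/metabase --values=values.yaml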
# Nodegroup definition
- name: infra-b-k1-19
  availabilityZones: ["eu-central-1b"]
  instanceType: "m6i.2xlarge"
  minSize: 1
  maxSize: 7
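
The nodegroup entry above appears to be a fragment of an eksctl ClusterConfig. For context, here is a minimal sketch of the surrounding file, with the cluster name and region as illustrative assumptions rather than the real values:

# cluster.yaml: minimal sketch; metadata values are assumptions, not the real config
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: analytics        # hypothetical cluster name
  region: eu-central-1   # inferred from the eu-central-1b availability zone
nodeGroups:
  - name: infra-b-k1-19
    availabilityZones: ["eu-central-1b"]
    instanceType: "m6i.2xlarge"
    minSize: 1
    maxSize: 7

A file like this is typically applied with eksctl create nodegroup --config-file=cluster.yaml when adding a nodegroup to an existing cluster.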

Keeping it simple

The dark side of automation

Hardware returns

# Nodegroup definition
- name: spark-executors
  instancesDistribution:
    instanceTypes: ["r5d.2xlarge", "r5dn.2xlarge"]
    ...
  labels:
    purpose: spark-executor

# Pod definition
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: purpose
              operator: In
              values:
                - spark-executor

# Nodegroup definition
- name: spark-executors
  taints:
    purpose: "spark-executor:NoSchedule"

# Pod definition for Spark executors
tolerations:
  - key: "purpose"
    operator: "Equal"
    value: "spark-executor"
    effect: "NoSchedule"

Conclusion
