Reserving a Kubernetes node for specific pods

Kubernetes is a wonderful thing, but in my experience the project moves so fast that the documentation has a hard time keeping up with the changes.

We are running Kubernetes on top of CoreOS on bare metal. How we got there is a story for another day; for now, let’s focus on the fact that we wanted to reserve some of our bare-metal nodes for a very special kind of service, without having to create a whole separate k8s cluster to achieve this kind of isolation.

Tainting (reserving) a node

If you head to the current (1.4 as of this writing) kubectl documentation, you will find an entry about kubectl taint, which sounded intriguing, but the docs would not say much more about it. So, as any stubborn person would do, I headed over to Google, searched for more on the topic, and found this very intriguing document on GitHub about taints. It is exactly what I needed. There is even a very cool and simple example:

$ kubectl taint nodes foo dedicated=banana:NoSchedule
$ kubectl taint nodes bar dedicated=banana:NoSchedule
$ kubectl taint nodes baz dedicated=banana:NoSchedule

According to that document this will do the following:

Let’s say that the cluster administrator wants to make nodes foo, bar, and baz available only to pods in a particular namespace banana.

So I thought, “interesting, taints restrict on a namespace level”, but as you might have guessed, this was wrong. Also, the NoSchedule effect (edit: used to be NoScheduleNoAdmitNoExecute) is still commented out in the code on the current master branch of Kubernetes on GitHub. So those examples won’t really fly, but there is still a way of achieving our goal.

So instead of the example from the docs, please try:

kubectl taint nodes node05.example.com dedicated=search:NoSchedule

According to the comments in the code, NoSchedule does the following:

Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler.

That is good enough for me right now. The upcoming NoScheduleNoAdmitNoExecute will do that and also evict any already-running pods that do not tolerate the taint, but we will have to live with NoSchedule for now.
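If a node later needs to go back into the general pool, the taint can be removed again. This is a minimal sketch, assuming your kubectl version supports the removal syntax from the kubectl taint help text (the key followed by a trailing dash). The function name untaint_node is hypothetical; the node name and key are the ones used in this post.

```shell
# Hypothetical helper: remove every taint with the given key from a node.
# The trailing "-" after the key asks kubectl to delete the taint.
untaint_node() {
  node="$1"
  key="$2"
  kubectl taint nodes "$node" "${key}-"
}

# Usage (requires a configured kubectl):
#   untaint_node node05.example.com dedicated
```

Already-running pods are not affected either way, since NoSchedule only influences new scheduling decisions.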

Getting a list of the nodes with taints is not really that straightforward. I had to come up with the following inline template to achieve that:

kubectl get nodes -o template --template='{{printf "%-50s %-12s\n" "Node" "Taint"}}{{range .items}}{{printf "%-50s %-12s" .metadata.name ("None" | or (index .metadata.annotations "scheduler.alpha.kubernetes.io/taints"))}}{{"\n"}}{{end}}'
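On a larger cluster most rows will simply read None. As a rough sketch, the same template can be wrapped in a small function that keeps only the header and the tainted nodes. The function name list_tainted_nodes is hypothetical, and the awk filter assumes node names contain no spaces.

```shell
# Hypothetical helper: print the header plus only the nodes that carry a taint.
list_tainted_nodes() {
  kubectl get nodes -o template --template='{{printf "%-50s %-12s\n" "Node" "Taint"}}{{range .items}}{{printf "%-50s %-12s" .metadata.name ("None" | or (index .metadata.annotations "scheduler.alpha.kubernetes.io/taints"))}}{{"\n"}}{{end}}' |
    awk 'NR == 1 || $2 != "None"'  # keep the header line and rows whose taint column is not "None"
}

# Usage (requires a configured kubectl):
#   list_tainted_nodes
```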

Using the tainted node

Now this is where the docs fall flat: they only mention that the toleration must be included in the PodSpec, but no example is provided on how to achieve that.

It turns out that this means one should add the following annotation (replace search with whatever you have used above):

scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"search"}]'
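To make the placement of that annotation concrete, here is a minimal sketch that writes a bare Pod manifest with the toleration under metadata.annotations. The pod name search-pod and the file name toleration-pod.yaml are hypothetical, and the image is simply the demo image used later in this post.

```shell
# Write a minimal Pod manifest that carries the toleration annotation.
# The pod name "search-pod" and the file name are hypothetical.
cat > toleration-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: search-pod
  annotations:
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"search"}]'
spec:
  containers:
  - name: search-pod
    image: registry.example.com/alex/k8s-demo:0.1.0-SNAPSHOT
EOF

# Then submit it with a configured kubectl:
#   kubectl create -f toleration-pod.yaml
```

The single quotes around the JSON matter: they keep YAML from interpreting the brackets and braces as its own flow syntax, so the annotation value stays a plain string.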

Please note that including that in your PodSpec will allow those pods to run on the tainted node, but it won’t ensure that they run only on tainted nodes. If you want to achieve that, you will need to label the nodes and use the affinity selectors.

kubectl label nodes node05.example.com example/nodetype=search

You can view existing labels on the pods of your namespace with:

kubectl get pods --show-labels
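Since the label above was put on a node rather than a pod, it can be handy to check the nodes as well. A small sketch, assuming the standard -l label selector and --show-labels flags of kubectl get; the function name nodes_with_label is hypothetical.

```shell
# Hypothetical helper: list the nodes that carry a given label,
# printing all of their labels in an extra column.
nodes_with_label() {
  kubectl get nodes -l "$1" --show-labels
}

# Usage (requires a configured kubectl):
#   nodes_with_label example/nodetype=search
```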

Now we must add another annotation to our PodSpec to select only nodes with this label, using the new affinity selectors:

annotations:
  scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"search"}]'
  scheduler.alpha.kubernetes.io/affinity: >
    {
      "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
          "nodeSelectorTerms": [
            {
              "matchExpressions": [
                {
                  "key": "example/nodetype",
                  "operator": "In",
                  "values": ["search"]
                }
              ]
            }
          ]
        }
      }
    }

How would this look in a deployment?

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: weather-app
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: weather-app
        track: stable
      annotations:
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"search"}]'
        scheduler.alpha.kubernetes.io/affinity: >
          {
            "nodeAffinity": {
              "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                  {
                    "matchExpressions": [
                      {
                        "key": "example/nodetype",
                        "operator": "In",
                        "values": ["search"]
                      }
                    ]
                  }
                ]
              }
            }
          }
    spec:
      containers:
      - name: weather-app
        image: registry.example.com/alex/k8s-demo:0.1.0-SNAPSHOT
        ports:
        - containerPort: 8080
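Once the deployment has been created, it is worth confirming that the replicas really landed on the reserved node. A small sketch, assuming the -o wide output of kubectl get pods, which adds a NODE column; the function name pods_with_nodes is hypothetical, and app=weather-app is the label from the deployment above.

```shell
# Hypothetical helper: list pods matching a label selector together with
# the node each one was scheduled onto (the NODE column of -o wide).
pods_with_nodes() {
  kubectl get pods -l "$1" -o wide
}

# Usage (requires a configured kubectl):
#   pods_with_nodes app=weather-app
```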

Wrap up

The future looks really bright, with a lot of very cool features coming to Kubernetes over the next releases. Sadly, it is very hard to figure out what is already available and what is still on its way, but by looking at the code and reading GitHub issues you can figure it out as well.

Hope you find this small guide helpful, have fun!