Herding pods: taints, tolerations and affinity in kubernetes

Mark Betz
10 min read · May 12, 2018


The general theory of pod scheduling in kubernetes is to let the scheduler handle it. You tell the cluster to start a pod, the cluster looks at all the available nodes and decides where to put the new thing, based on comparing available resources with what the pod declares it needs. That’s scheduling in a nutshell. Sometimes, however, you need a little more input into the process. For example you may have been asked to run a thing that requires more resources than any single node in your cluster offers. You can add a new node with enough juice, maybe using a nodepool if you’re running on GKE, but how do you make sure the right pods run on it? How do you make sure the wrong pods don’t run on it?

You can often nudge the scheduler in the right direction simply by setting resource requests appropriately. If your new pod needs 5 GB of ram and the only node big enough is the one you added for it to run on, then setting the memory request for that pod to 5 GB will force the scheduler to put it there. This is a fairly fragile approach, however, and while it will get your pod onto a node with sufficient resources it won’t keep the scheduler from putting other things there as well, as long as they will fit. Maybe that’s not important, but if it is, or if for some other reason you need positive control over which nodes your pod schedules to then you need the finer level of scheduling control that kubernetes offers through the use of taints, tolerations and affinity.
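
For illustration, here is a minimal sketch of what that request might look like in the PodSpec (the container name and image are placeholders; only the resources block matters here):

containers:
- name: app                  # placeholder container name
  image: example/app:1.0     # placeholder image
  resources:
    requests:
      memory: 5Gi            # the scheduler will only consider nodes with at least this much allocatable memory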

Taints and tolerations

Taints and tolerations graduated to beta in kubernetes 1.6. You can think of a taint as a bad smell that is added to a node. Once a node is tainted pods that don’t declare a toleration for the taint won’t be scheduled to that node. Depending on how bad the smell is (the strength of the taint) the prohibition may be soft or hard, and running pods that don’t tolerate the taint may be evicted from the node.

Nodes are tainted using kubectl:

kubectl taint nodes nodename key=value:effect

If you’re running on GKE you can also taint the nodes in a nodepool at creation:

gcloud beta container node-pools create newpool --cluster mycluster --machine-type n1-standard-4 --node-taints=key=value:effect ...

A taint consists of a key, a value for the key, and an effect. The key and value can be anything and act just like key:value pairs act throughout kubernetes: something matches if it specifies the same key and value. The effect can be one of PreferNoSchedule, NoSchedule or NoExecute. The first tells the scheduler to prefer not to schedule intolerant pods to the tainted node, the second prohibits it from doing so, and the third tells it to also evict intolerant pods that are already running there. So you can think of these as the three levels of strength for a taint: soft, hard and hardest.

For example let’s say that I have added a node to run elasticsearch, and I don’t want anything else running on that node:

kubectl taint nodes es-node elasticsearch=false:NoExecute

Once the node is tainted we can see the taint when using kubectl to describe the node:

$ kubectl describe nodes es-node
Name: es-node
...
Taints: elasticsearch=false:NoExecute

Let’s look at the structure of the taint. First, I’ve chosen effect NoExecute because this node was already created without the taint: it’s likely things are already running there and I want them evicted. Second, the key name and value are completely arbitrary, but it’s natural to try to say something about the taint when creating these. In this case I want to say “if you’re not elasticsearch don’t schedule here” but as we’ll see below that metaphor doesn’t hold up very well. The key:value could as easily have been “foo:bar” or “dalek:bad” as long as it matches a toleration on my elasticsearch pod.
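
Incidentally, if you later need to undo this, a taint can be removed with the same kubectl command by appending a dash to the effect:

kubectl taint nodes es-node elasticsearch=false:NoExecute-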

A toleration is how a pod declares that it can stand the smell of a taint. Tolerations are a property of the PodSpec and a toleration for the taint above might look like this:

tolerations:
- key: elasticsearch
  operator: Equal
  value: "false"
  effect: NoExecute

The toleration consists of the key name, an operator, a value and an effect. To tolerate a specific taint the key name should be set to the same name used for the taint’s key, “elasticsearch” in this example. The operator can be one of Equal or Exists. If set to Equal then the value is required and must match the key value on the taint. If set to Exists then the value should be omitted and the toleration will match any taint with the specified key name. If effect is provided it must be PreferNoSchedule, NoSchedule or NoExecute and should match the effect on the taint. If effect is omitted then the toleration will match any taint with any effect as long as the key and value match.

Clear? Yeah it’s a little odd. When I first started using taints and tolerations the strangest thing for me was coming to grips with the fact that the toleration doesn’t really say anything about the pod other than that it matches a specific taint. Going back to my point about the “meaningfulness” of the taint, if you made a taint with the key “elasticsearch” and the value “false” you might read that as “if the pod is not an elasticsearch pod don’t schedule it here.” When you get to the toleration you’d expect to say something like “I am an elasticsearch pod” by putting “elasticsearch=true” somewhere. That’s just not how it works. The taint is a labeled thing that has to be matched by a toleration, period. Picking meaningful key:value pairs is left as an exercise for the reader.

As an aside you can also specify a toleration that will match any taint:

tolerations:
- operator: Exists

That toleration is basically a pass that will allow the pod onto any node with any taint. You might want to do this, for example, if you want to run a daemonset and make sure it gets scheduled onto every node in the cluster regardless of taints. In a GKE cluster running Google’s Stackdriver logging agent, the fluentd-gcp daemonset has the following tolerations to make sure it gets past any node taint:

tolerations:
- operator: Exists
  effect: NoExecute
- operator: Exists
  effect: NoSchedule

This could have been accomplished with a single toleration by omitting the effect, but listing the effects explicitly is presumably a little more future proof: if a new effect is added to kubernetes later, these tolerations won’t automatically cover it, so it will still apply to this daemonset unless the tolerations are updated.

So that is how we can use a taint to shoo pods away from nodes we don’t want them on. However, a taint will not make sure the elasticsearch pod in our example gets scheduled onto the node where we intend it to run. Taints are repulsion and are targeted at the pods we don’t want. In order to get the pods we do want onto this node we could rely on resource requests as mentioned above, but again that’s a fragile approach: a node could be added later that also has sufficient resources but is intended for something else. In order to ensure the elasticsearch pod runs on the right node we need attraction as well as repulsion, and we can get that from affinity.

Affinity

Taints are a way to repel intolerant pods from a node or set of nodes, and are a property of the tainted nodes. Affinity, which also graduated to beta in 1.6, is a property of a pod and provides a powerful mechanism for either attracting it to or repelling it from other pods and nodes. If we wanted to control the scheduling of our hypothetical elasticsearch pod prior to kubernetes 1.6 we could have taken advantage of an earlier feature called a nodeSelector. Node selectors have been in kubernetes at least since 1.2 and are a field of PodSpec that requires the scheduler to place the pod only on nodes carrying a given set of labels. As we’ll see below affinity can accomplish the same thing but provides a more flexible and useful mechanism.
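
For comparison, a nodeSelector is just a map of labels that candidate nodes must carry. A minimal sketch, with a made-up label key and value:

spec:
  nodeSelector:
    disktype: ssd    # the pod will only schedule to nodes labeled disktype=ssd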

Affinity comes in two flavors: node affinity and pod affinity. It’s important to read these as properties of a pod. So node affinity attracts a pod to certain nodes, and pod affinity attracts a pod to certain pods. Pod affinity furthermore has its opposite in pod anti-affinity, which as you would expect repels a pod from other pods. This last feature can be pretty handy when working out scheduling for a distributed stateful application like elasticsearch, where you probably don’t want your elasticsearch data nodes collocating with each other. Node anti-affinity can also be achieved using the node affinity syntax and I’ll look at that briefly below, but first let’s look specifically at how to solve our hypothetical scheduling problem.

For this example we want to force the scheduler to place our elasticsearch pod onto the node we created for it, and which we already tainted so that it would not be occupied by other pods. Since the elasticsearch pod will have a toleration for the taint it can schedule there, but to make sure it does we have to add a node affinity:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - es-node

That’s quite a chunk of stuff to say “put me on the node named es-node” so let’s tear it down. Affinity is a property of PodSpec, and there are three types of affinity: nodeAffinity, podAffinity and podAntiAffinity. Here we’ve added a nodeAffinity because we want the pod to be attracted to a node. The type of nodeAffinity being added is requiredDuringSchedulingIgnoredDuringExecution. You can’t fault the kubernetes contributors for being insufficiently descriptive in their naming. Following the type of nodeAffinity are the nodeSelectorTerms, in this case a single match expression that requires the label “kubernetes.io/hostname” to be set to the value “es-node.” This expression relies on one of the default labels that kubernetes assigns to every node. In a real-world scenario you’d typically use a custom label.
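
To sketch what the custom-label version might look like (the label key and value here are invented for the example), you could apply your own label to the node:

kubectl label nodes es-node workload=elasticsearch

and then match on it instead of the hostname:

- matchExpressions:
  - key: workload
    operator: In
    values:
    - elasticsearch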

The requiredDuringSchedulingIgnoredDuringExecution nodeAffinity works as you might expect: it requires the scheduler to select a node that matches the match expression, but will not evict a running pod whose affinity has been changed so that it no longer matches the node it is on. The “softer” version of this is preferredDuringSchedulingIgnoredDuringExecution, which includes an integer weight that is added to each node’s scheduling preference if the node matches the match expression. The match expression itself can take advantage of a number of operators in addition to “In.” For example “NotIn” can be used to achieve node anti-affinity as mentioned above. There are far too many possible permutations to go into them all here. The official documentation is a good source for additional details.
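
To give a sense of the shape of the softer form, here is a rough sketch of a preferred rule that also uses NotIn, which effectively gives us node anti-affinity (the weight, label key and value are placeholders):

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50               # added to the node's score when the expression matches
      preference:
        matchExpressions:
        - key: disktype        # hypothetical node label
          operator: NotIn      # prefer nodes that do not carry this label value
          values:
          - hdd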

The use case for node affinity seems pretty straightforward. Together with taints and tolerations it allows us to utilize our compute resources in a way that best maximizes efficiency and reliability. Consider an extreme case where you have 100 small services running in a cluster, none of which requires more than 1GB of ram. You run 10 relatively small nodes with four cores and 15 GB of ram, and that is over-provisioned on memory. Now you want to run a memory intensive pod that will need 64GB of ram. If you want to keep nodes homogeneous you could remake the cluster on three 64GB nodes, but the downside of this is that the big app will take one node, the services will get scheduled on the other two, and a node down event will take out 50% of your capacity. From a reliability perspective it makes more sense to keep the 10 small nodes and partition off one or two nodes for the more memory-intensive pods.

The case for pod affinity or anti-affinity seems less straightforward, but it is a useful tool whenever you do or don’t want two pods to collocate on the same node. In my experience pod anti-affinity has been much more useful than affinity: it’s not hard to imagine scenarios where you want to discourage or prohibit two pods from collocating. The previously mentioned elasticsearch data nodes are a good example, as would be any critical service where you don’t want a node-down event to take out more than one pod. Here’s an example of a pod anti-affinity declaration to keep two elasticsearch data nodes from ending up on the same cluster node:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: component
          operator: In
          values:
          - elasticsearch
        - key: role
          operator: In
          values:
          - data
      topologyKey: kubernetes.io/hostname

This anti-affinity relies on the elasticsearch pods having two labels applied: component:elasticsearch and role:data. The second label reflects the main rule we’re trying to achieve: don’t collocate elasticsearch data pods; and the first keeps us from accidentally applying the rule to other pods that might have the label role:data. Just specific enough to target the right things is the main rule to follow when constructing match expressions.
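
For completeness, those labels would be applied in the pod template of whatever controller runs the data nodes, roughly like this:

metadata:
  labels:
    component: elasticsearch
    role: data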

Probably the most interesting thing here is the topology key. Essentially it determines the granularity at which the rule will be applied. More formally, with respect to a pod A which has the above anti-affinity rule, and a pod B which matches the match expression in the rule: if a candidate node shares the same value for the topology key as a node already running pod B, do not schedule pod A on that node. In the example above the topology key is “kubernetes.io/hostname.” Only one node will have any given value for this key, so any other node would be a candidate as long as it is not running pod B. If you set the topology key to “failure-domain.beta.kubernetes.io/zone” for example, and a node in zone “us-central1-a” is running pod B, any other node in that zone will have the same value for this key and will not be a candidate, so the scheduler will look for a node in a different zone.
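
In manifest form that variation is just a change to the last line of the rule above (this was the standard zone label at the time of writing):

topologyKey: failure-domain.beta.kubernetes.io/zone    # spread matching pods across zones instead of hosts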

There is a lot more that can be done with taints, tolerations and affinity than I have had space to explore in this post, which really just scratches the surface. These features are in that class of tools that you don’t need until you need them, but once you do they begin to solve problems that are hard to tackle any other way. The larger your clusters, and the more heterogeneous your workloads, the more valuable these ideas will become. To begin exploring in more detail a good place to start is with the official documentation on taints and tolerations, and affinity/anti-affinity.


Mark Betz

Senior Devops Engineer at Olark, husband, father of three smart kids, two unruly dogs, and a resentful cat.