Resilient Affinity Assistant

Priti Desai
Published in Tekton Pipelines
Jun 27, 2023

This article explains a fix in Tekton Pipelines that allows cluster maintenance without losing pipelineRuns.

What is an affinity assistant in Tekton Pipelines?

Tekton Pipelines creates one affinity assistant per pipelineRun to constrain the taskRun pods to run on a single node. This constraint is necessary for tasks that share the same workspace. The affinity assistant itself is a Kubernetes StatefulSet.
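To enforce this constraint, each taskRun pod is scheduled with a pod affinity rule targeting the assistant pod's labels, roughly like the sketch below (the instance label value here is illustrative; the exact selector is an implementation detail of the controller):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: affinity-assistant
            app.kubernetes.io/instance: affinity-assistant-6d8794b076
        topologyKey: kubernetes.io/hostname

Because the persistent volume behind a shared workspace is typically ReadWriteOnce, co-locating all the pods on one node is what lets every task mount the same volume.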

If you rely on Tekton Pipelines workspaces, please follow along to understand how a cluster operator can cordon a node without losing a running pipelineRun.

Here is an example pipelineRun for understanding the issue with the affinity assistant during cluster maintenance. It has two tasks (first-task and last-task) and a shared workspace (source).

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: pipeline-run-
spec:
  workspaces:
    - name: source
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Mi
  pipelineSpec:
    workspaces:
      - name: source
    tasks:
      - name: first-task
        taskSpec:
          workspaces:
            - name: source
          steps:
            - image: alpine
              script: |
                echo $(workspaces.source.path)
                sleep 60
        workspaces:
          - name: source
      - name: last-task
        taskSpec:
          workspaces:
            - name: source
          steps:
            - image: alpine
              script: |
                echo $(workspaces.source.path)
                sleep 60
        runAfter: ["first-task"]
        workspaces:
          - name: source

I have a kind cluster with three nodes:

kubectl get node
NAME                           STATUS   ROLES           AGE   VERSION
kind-multinode-control-plane   Ready    control-plane   1d    v1.26.3
kind-multinode-worker1         Ready    <none>          1d    v1.26.3
kind-multinode-worker2         Ready    <none>          1d    v1.26.3

Now, let’s create the pipelineRun and get the list of pods:

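The pipelineRun above is assumed to be saved as pipelinerun.yaml (the file name is for illustration only; kubectl create, rather than kubectl apply, is needed because the manifest uses generateName):

kubectl create -f pipelinerun.yaml
pipelinerun.tekton.dev/pipeline-run-kcsr4 created
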
kubectl get pods -o=custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                                NODE
affinity-assistant-6d8794b076-0     kind-multinode-worker1
pipeline-run-kcsr4-first-task-pod   kind-multinode-worker1

The Tekton controller creates a pod named affinity-assistant-* on the same node as the taskRun pod.
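You can confirm that the affinity assistant pod is backed by a StatefulSet (the READY and AGE values below are illustrative):

kubectl get statefulset
NAME                            READY   AGE
affinity-assistant-6d8794b076   1/1     2m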

This example is a simplified demonstration of a real-world scenario. Many real-world CI/CD pipelines run for hours, and some run for days; a Kubernetes cluster generally hosts many such long-running pipelineRuns. A cluster might have multiple nodes, and there is often a need to bring a node down. It's a challenge for a cluster operator to find the perfect time to bring a node down without impacting any running pipelines. In this example, if worker1 needs attention and is cordoned for maintenance, the running pipelineRun will not be able to schedule any subsequent tasks and will wait until it times out and eventually fails.

kubectl cordon kind-multinode-worker1
node/kind-multinode-worker1 cordoned

kubectl get node
NAME                           STATUS                     ROLES           AGE   VERSION
kind-multinode-control-plane   Ready                      control-plane   1d    v1.26.3
kind-multinode-worker1         Ready,SchedulingDisabled   <none>          1d    v1.26.3
kind-multinode-worker2         Ready                      <none>          1d    v1.26.3

kubectl get pods
NAME                                READY   STATUS      RESTARTS   AGE
affinity-assistant-6d8794b076-0     1/1     Running     0          117s
pipeline-run-kcsr4-first-task-pod   0/1     Completed   0          117s
pipeline-run-kcsr4-last-task-pod    0/1     Pending     0          26s

Tekton Pipelines introduced a fix in v0.48.0 and v0.47.3 (LTS) that addresses the pod being stuck in a Pending state. The controller now checks the health of the node on which the affinity assistant pod is running. If that node has been cordoned, the controller deletes the affinity assistant pod so that the StatefulSet can maintain the desired number of replicas by creating a replacement pod on another schedulable node.
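Conceptually, the controller's check boils down to the following kubectl equivalents (shown purely as an illustration of what the controller reads and does through the Kubernetes API; a cordoned node has .spec.unschedulable set to true):

kubectl get pod affinity-assistant-c7b485007a-0 -o jsonpath='{.spec.nodeName}'
kind-multinode-worker1

kubectl get node kind-multinode-worker1 -o jsonpath='{.spec.unschedulable}'
true

kubectl delete pod affinity-assistant-c7b485007a-0

Watching the affinity assistant pods while the node is cordoned shows the pod terminating on worker1 and the StatefulSet recreating it on worker2: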

kubectl get pods -l app.kubernetes.io/component=affinity-assistant -o wide -w
NAME                              READY   STATUS              RESTARTS   AGE   IP             NODE                     NOMINATED NODE   READINESS GATES
affinity-assistant-c7b485007a-0   1/1     Running             0          49s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   1/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   1/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Pending             0          0s    <none>         <none>                   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Pending             0          1s    <none>         kind-multinode-worker2   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     ContainerCreating   0          1s    <none>         kind-multinode-worker2   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     ContainerCreating   0          2s    <none>         kind-multinode-worker2   <none>           <none>
affinity-assistant-c7b485007a-0   1/1     Running             0          4s    10.244.2.144   kind-multinode-worker2   <none>           <none>

This allows the cluster operator to cordon a node at any time without losing any workloads.
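Once the maintenance is complete, the node can be brought back into rotation:

kubectl uncordon kind-multinode-worker1
node/kind-multinode-worker1 uncordoned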

The issue reported in Tekton Pipelines: https://github.com/tektoncd/pipeline/issues/6586

The PR which helped fix this issue: https://github.com/tektoncd/pipeline/pull/6596

Hope this was helpful!

A special shout out to Lee A Bernick for all the support, starting from authoring a bug report, prototyping a potential fix, and carefully reviewing the PR.

Thank you to all the contributors for building and maintaining the Tekton project.
