Resilient Affinity Assistant

Priti Desai
Published in Tekton Pipelines
Jun 27, 2023

This article explains a fix in Tekton Pipelines that allows cluster maintenance without losing pipelineRuns.

What is an affinity assistant in Tekton Pipelines?

Tekton Pipelines creates one affinity assistant per pipelineRun to constrain the taskRun pods to run on a single node. This constraint is necessary for tasks that share the same workspace. The affinity assistant itself is a Kubernetes StatefulSet.
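To enforce this constraint, each taskRun pod is scheduled with a pod affinity rule targeting the assistant pod's labels, roughly like the sketch below (the instance label value here is illustrative; the exact selector is an implementation detail of the controller):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: affinity-assistant
            app.kubernetes.io/instance: affinity-assistant-6d8794b076
        topologyKey: kubernetes.io/hostname

Because the persistent volume behind a shared workspace is typically ReadWriteOnce, co-locating all the pods on one node is what lets every task mount the same volume.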

If you rely on Tekton Pipelines workspaces, please follow along to understand how a cluster operator can cordon a node without losing a running pipelineRun.

Here is an example pipelineRun for understanding the issue with the affinity assistant during cluster maintenance. It has two tasks (first-task and last-task) and a shared workspace (source).

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: pipeline-run-
spec:
  workspaces:
    - name: source
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Mi
  pipelineSpec:
    workspaces:
      - name: source
    tasks:
      - name: first-task
        taskSpec:
          workspaces:
            - name: source
          steps:
            - image: alpine
              script: |
                echo $(workspaces.source.path)
                sleep 60
        workspaces:
          - name: source
      - name: last-task
        taskSpec:
          workspaces:
            - name: source
          steps:
            - image: alpine
              script: |
                echo $(workspaces.source.path)
                sleep 60
        runAfter: ["first-task"]
        workspaces:
          - name: source

I have a kind cluster with three nodes:

kubectl get node
NAME                           STATUS   ROLES           AGE   VERSION
kind-multinode-control-plane   Ready    control-plane   1d    v1.26.3
kind-multinode-worker1         Ready    <none>          1d    v1.26.3
kind-multinode-worker2         Ready    <none>          1d    v1.26.3

Now, let’s create the pipelineRun and get the list of pods:

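The pipelineRun above is assumed to be saved as pipelinerun.yaml (the file name is for illustration only; kubectl create, rather than kubectl apply, is needed because the manifest uses generateName):

kubectl create -f pipelinerun.yaml
pipelinerun.tekton.dev/pipeline-run-kcsr4 created
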
kubectl get pods -o=custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME                                NODE
affinity-assistant-6d8794b076-0     kind-multinode-worker1
pipeline-run-kcsr4-first-task-pod   kind-multinode-worker1

The Tekton controller creates a pod named affinity-assistant-* on the same node as the taskRun pod.
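You can confirm that the affinity assistant pod is backed by a StatefulSet (the READY and AGE values below are illustrative):

kubectl get statefulset
NAME                            READY   AGE
affinity-assistant-6d8794b076   1/1     2m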

This example is a simplified demonstration of a real-world scenario. Many real-world CI/CD pipelines run for hours, and some run for days; a Kubernetes cluster generally hosts many such long-running pipelineRuns. A cluster might have multiple nodes, and there is often a need to bring a node down. It's a challenge for a cluster operator to find the perfect time to bring a node down without impacting any running pipelines. In this example, if worker1 needs attention and is cordoned for maintenance, the running pipelineRun will not be able to schedule any subsequent tasks and will wait until it times out and eventually fails.

kubectl cordon kind-multinode-worker1
node/kind-multinode-worker1 cordoned

kubectl get node
NAME                           STATUS                     ROLES           AGE   VERSION
kind-multinode-control-plane   Ready                      control-plane   1d    v1.26.3
kind-multinode-worker1         Ready,SchedulingDisabled   <none>          1d    v1.26.3
kind-multinode-worker2         Ready                      <none>          1d    v1.26.3

kubectl get pods
NAME                                READY   STATUS      RESTARTS   AGE
affinity-assistant-6d8794b076-0     1/1     Running     0          117s
pipeline-run-kcsr4-first-task-pod   0/1     Completed   0          117s
pipeline-run-kcsr4-last-task-pod    0/1     Pending     0          26s

Tekton Pipelines introduced a fix in v0.48.0 and v0.47.3 (LTS) that addresses the pod being stuck in a Pending state. The controller now checks the health of the node on which the affinity assistant pod is running. If that node has been cordoned, the controller deletes the affinity assistant pod so that the StatefulSet can maintain the desired number of replicas by creating a replacement pod on another schedulable node.
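Conceptually, the controller's check boils down to the following kubectl equivalents (shown purely as an illustration of what the controller reads and does through the Kubernetes API; a cordoned node has .spec.unschedulable set to true):

kubectl get pod affinity-assistant-c7b485007a-0 -o jsonpath='{.spec.nodeName}'
kind-multinode-worker1

kubectl get node kind-multinode-worker1 -o jsonpath='{.spec.unschedulable}'
true

kubectl delete pod affinity-assistant-c7b485007a-0

Watching the affinity assistant pods while the node is cordoned shows the pod terminating on worker1 and the StatefulSet recreating it on worker2: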

kubectl get pods -l app.kubernetes.io/component=affinity-assistant -o wide -w
NAME                              READY   STATUS              RESTARTS   AGE   IP             NODE                     NOMINATED NODE   READINESS GATES
affinity-assistant-c7b485007a-0   1/1     Running             0          49s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   1/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   1/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Terminating         0          70s   10.244.1.144   kind-multinode-worker1   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Pending             0          0s    <none>         <none>                   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     Pending             0          1s    <none>         kind-multinode-worker2   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     ContainerCreating   0          1s    <none>         kind-multinode-worker2   <none>           <none>
affinity-assistant-c7b485007a-0   0/1     ContainerCreating   0          2s    <none>         kind-multinode-worker2   <none>           <none>
affinity-assistant-c7b485007a-0   1/1     Running             0          4s    10.244.2.144   kind-multinode-worker2   <none>           <none>

This allows the cluster operator to cordon a node at any time without losing any workloads.
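Once the maintenance is complete, the node can be brought back into rotation:

kubectl uncordon kind-multinode-worker1
node/kind-multinode-worker1 uncordoned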

The issue reported in Tekton Pipelines: https://github.com/tektoncd/pipeline/issues/6586

The PR which helped fix this issue: https://github.com/tektoncd/pipeline/pull/6596

Hope this was helpful!

A special shout out to Lee A Bernick for all the support, starting from authoring a bug report, prototyping a potential fix, and carefully reviewing the PR.

Thank you to all the contributors for building and maintaining the Tekton project.
