10 Things You Should Know Before Writing a Kubernetes Controller

Lance Galletti
Dec 6, 2021


The controllers in this article respond to APIs defined by Custom Resource Definitions (as opposed to Aggregated API servers). We will be using controllers built with controller-runtime and kubebuilder.

1. It’s All About State

Controllers track at least one resource and are responsible for reconciling the current state to the desired state described by the watched resource. The desired state is defined by the spec of the resource (see more conventions here).

Reconciliation is triggered by cluster events related to the resources watched by the controller. By design, the event is not passed to the reconciler so that it is forced to define and act on a state instead. This approach is referred to as level-based, as opposed to edge-based.

For example, a controller could reconcile state declared by a custom resource Mykind that looks as follows:

apiVersion: grp.example.com/v1alpha1
kind: Mykind
metadata:
  name: mykind-sample
spec:
  numpods: 1

The controller’s purpose would be to ensure that the desired number of pods (numpods) is running at any given moment.

Special cases may seem to require the controller to know what type of event triggered the reconciliation. For instance, once the reconciler has satisfied the initial request by creating a pod, a reconciliation triggered by a create event for pods would tell it that an extra pod appeared and should be deleted. Similarly, a reconciliation triggered by a delete event would tell it that a pod is missing and should be recreated. Right?

This is not the Kubernetes way of thinking.

Our controller needs to look at the current number of running pods and compare it to the desired number specified in the CR, then reconcile that difference by adding or deleting pods. In this way, even if an event is missed, we would eventually come to the desired state by checking state on a regular interval (which controller-runtime does). And, if many events happen in a short time they can all be covered by a single reconciliation iteration.

What the event is should not matter when writing the reconciliation logic. You can actually notice that a Delete is being performed by looking at the state of the CR:

if !cr.ObjectMeta.DeletionTimestamp.IsZero() {
    return h.deletionReconciler(ctx, cr)
}

Here is a high-level reconciler template:
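A minimal sketch of such a template, assuming the usual kubebuilder scaffolding (the reconciler embeds a client.Client) and an illustrative deletionReconciler helper:

func (r *MykindReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the CR that this request refers to.
    cr := &grpv1alpha1.Mykind{}
    if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
        // The CR may already be gone; nothing left to do.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // A non-zero deletion timestamp means the CR is being deleted.
    if !cr.ObjectMeta.DeletionTimestamp.IsZero() {
        return r.deletionReconciler(ctx, cr)
    }

    // Compare the current state (running pods) with the desired state
    // (spec.numpods) and create or delete pods to close the gap.
    // ...

    // Record the outcome in the status (see below).
    return ctrl.Result{}, nil
}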

2. How to Think About The Actual State

Controllers respond to declarative APIs defined on the cluster. When writing code for Kubernetes we need to adopt a declarative mindset (as opposed to an imperative mindset). Instead of specifying how, specify what.

This impacts how state should be characterized. The actual state is not captured simply by what currently exists on the cluster. It also includes what our controller has already declared as desired to other controllers: that is, what was successfully scheduled or offloaded to another controller or component by a previous reconciliation.

For example, let’s look at how a restaurant operates. There are usually two actors (or controllers): the waiter and the cook. The waiter takes your order and gives it to the cook. When they have done that, they update the status of your order (ordered). Next time, when the waiter looks at the tables and sees that you have not yet been served, they also see that the food has been ordered. No need to retake your order.

In the meantime the cook is working hard and your food may eventually be ready. The waiter is also watching that. They then bring you the food and may also update the status of your order (served). Served orders are of no more interest to the waiter and they can be ignored in the future.

In our example of creating pods, by the time the next reconciliation starts, the pods created by the previous reconciliation may not have started and the pods deleted by the previous reconciliation may not have been removed. So if the controller asks how many pods there are, it may get an accurate representation of the current state but not one that is useful because that state is scheduled to change.

On one hand, this latency may be unexpected by the controller, in which case it might be best to reschedule pods for creation / deletion on the next reconciliation to ensure that the desired numpods are running.

On the other hand, this latency may be expected (because of the specifics of what each pod is doing) and scheduling additional pods for creation / deletion won’t help.

In either case, the actual state must capture both what is and what has been declared as desired to other controllers.
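As a sketch of this idea in our pod example, the reconciler can count terminating pods separately from running ones instead of treating the raw pod count as the actual state (myLabelKey and myLabelVal are illustrative labels that the controller puts on the pods it creates; see the cache discussion below):

podList := &core.PodList{}
if err := r.List(ctx, podList,
    client.InNamespace(cr.Namespace),
    client.MatchingLabels{myLabelKey: myLabelVal}); err != nil {
    return ctrl.Result{}, err
}

running, terminating := 0, 0
for _, pod := range podList.Items {
    if pod.ObjectMeta.DeletionTimestamp.IsZero() {
        running++
    } else {
        // Deletion has already been declared; another component will finish it.
        terminating++
    }
}
// Decide whether terminating pods count as "already handled" or not,
// depending on whether the latency is expected (as discussed above).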

3. How to Use Status

Status should at least reflect the outcome of the reconciliation: what was done, whether it succeeded, and so on.

It’s important that the controller take special care in updating the status. If not done properly, updates to the status will trigger update events which in turn trigger the reconciler in an infinite loop. This is the incorrect way:

cr.Status.LastUpdate = metav1.Now()
cr.Status.Reason = reason
cr.Status.Status = metav1.StatusSuccess

// incorrect update
updateErr := r.Update(ctx, cr)
if updateErr != nil {
    log.Info("Error when updating status. Let's try again")
    return ctrl.Result{RequeueAfter: time.Second * 3}, updateErr
}
return ctrl.Result{}, nil

The infinite loop is actually harder to run into than you would think because:

  1. By default, controller-runtime filters out update events for which the object’s Generation has not changed.
  2. The Generation is only changed (and update events are only triggered) if the object changes.

So an infinite loop is only possible if each new status is guaranteed to differ from the previous one (for example, a LastUpdate timestamp). Still, the controller would likely be reconciling more often than it should.

To circumvent this, status should be modeled as a subresource. This will allow updating the status without incrementing the Generation.

First, you must enable the status subresource in the CRD by adding the comment //+kubebuilder:subresource:status above the type definition:

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
type Mykind struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   MykindSpec   `json:"spec,omitempty"`
    Status MykindStatus `json:"status,omitempty"`
}

Then you can update the status as follows:

cr.Status.LastUpdate = metav1.Now()
cr.Status.Reason = reason
cr.Status.Status = metav1.StatusSuccess

// correct way
updateErr := r.Status().Update(ctx, cr)
if updateErr != nil {
    log.Info("Error when updating status. Let's try again")
    return ctrl.Result{RequeueAfter: time.Second * 3}, updateErr
}
return ctrl.Result{}, nil

Please see more status conventions here.

Note:

  1. Once the status subresource is enabled, the controller will no longer be able to update the status using r.Update(ctx, cr).
  2. You can still create an infinite loop by using a predicate that does not filter out update events where the Generation has not changed (more on predicates below).
  3. Status should not be used to determine the current state.

4. Prepare for Updates to Fail

As you may have noticed above, when the update to status fails we decide to requeue the request at a fixed interval of time instead of erroring and utilizing exponential backoff (more on this later). This is because Kubernetes operates with optimistic concurrency and writes to resources are expected to fail.

Kubernetes resources have a resourceVersion field as part of their metadata which is used by clients to determine when objects have changed. When a record is about to be updated, its version is checked against a pre-saved value, and if it doesn’t match, the update fails. Unlike generation, resourceVersion is updated even if only the status was updated.

This means that when there are multiple actors attempting to write to the same resource, optimistic concurrency chooses one and rejects all others. It’s then up to the actors that were rejected (here our controller) to try again.
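One common way for the rejected actor to try again is client-go's conflict-retry helper, which re-reads the object before each attempt. A minimal sketch, assuming the Mykind CR and status fields used earlier (uses k8s.io/client-go/util/retry):

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Re-fetch the latest version of the CR before each attempt so the
    // resourceVersion check can succeed.
    if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
        return err
    }
    cr.Status.Reason = reason
    cr.Status.Status = metav1.StatusSuccess
    return r.Status().Update(ctx, cr)
})
if err != nil {
    return ctrl.Result{}, err
}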

5. Use Predicates

We mentioned above that we can filter events that trigger the reconciler. In general, it’s a good idea to set up such filters to have better control over what triggers the reconciler and limit resource consumption.

Filtering should be refined throughout the development of the controller — be cautious not to prematurely filter out events.

In our example of creating pods, we may not want our reconciler to be triggered by update events involving pods; we may only care about updates to Mykind. We can achieve this by creating a predicate:

func myPredicate() predicate.Predicate {
    return predicate.Funcs{
        CreateFunc: func(e event.CreateEvent) bool {
            return true
        },
        UpdateFunc: func(e event.UpdateEvent) bool {
            if _, ok := e.ObjectOld.(*core.Pod); !ok {
                // Is Not Pod
                return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
            }
            // Is Pod
            return false
        },
        DeleteFunc: func(e event.DeleteEvent) bool {
            return !e.DeleteStateUnknown
        },
    }
}

and passing it to the controller as follows:

func (r *MykindReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&grpv1alpha1.Mykind{}).
        Owns(&core.Pod{}).
        WithEventFilter(myPredicate()).
        Complete(r)
}

Note: when getting the object type in the predicate, you should not use e.Object.GetObjectKind().GroupVersionKind().Kind as this field may be empty (see #1735).

If you’re looking to filter out objects to limit memory utilization, this is not the place to do it: predicates only filter the events that trigger reconciliation, while the underlying informers still cache every object they watch. This is where the cache comes in handy.

We can override the default cache to add a Label or Field to filter on as follows:

resyncInterval, _ := time.ParseDuration("5m")
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    NewCache: cache.BuilderWithOptions(cache.Options{
        Resync: &resyncInterval,
        SelectorsByObject: cache.SelectorsByObject{
            &core.Pod{}: {
                Label: labels.SelectorFromSet(labels.Set{
                    myLabelKey: myLabelVal,
                }),
            },
            &grpv1alpha1.Mykind{}: {},
        },
    }),
})

The reconciler should then make sure to add this special label to the pods it creates. This is especially helpful for limiting the controller’s memory utilization when it tracks resource types that many other controllers on the cluster may also be managing (more on this below).

6. Create Webhooks

There are two types of webhooks that can be enabled: conversion webhooks and admission webhooks. Note that these processes should be served separately from the controller, because webhooks are intended to be highly available, whereas you may need to uninstall and reinstall the controller, as you will see below.

Conversion webhooks allow for managing multiple versions of the APIs your controller has defined.

New versions of an API are created in order to drastically change the interface specifying the desired state. Providing additional optional fields (leaving all other fields the same) is not considered a breaking change and can be done without changing the API version. However, if a field of the API needs to be renamed, have its type changed, or be removed entirely, or if a new required field needs to be added, then an API version change is required. More on API compatibility here.

For example, we may decide to change numpods to replicas. In the v1alpha2 version of our Mykind API you can specify:

apiVersion: grp.example.com/v1alpha2
kind: Mykind
metadata:
  name: mykind-sample
spec:
  replicas: 3

We can create a conversion webhook to convert from v1alpha1 to v1alpha2 and back:

// ConvertTo converts this Mykind (v1alpha2) to the hub version (v1alpha1).
func (src *Mykind) ConvertTo(dstRaw conversion.Hub) error {
    dst := dstRaw.(*v1alpha1.Mykind)
    dst.Spec.NumPods = src.Spec.Replicas
    return nil
}

// ConvertFrom converts the hub version (v1alpha1) to this Mykind (v1alpha2).
func (dst *Mykind) ConvertFrom(srcRaw conversion.Hub) error {
    src := srcRaw.(*v1alpha1.Mykind)
    dst.Spec.Replicas = src.Spec.NumPods
    return nil
}

Even if there are multiple versions of the API that the controller can respond to, only one version of the API can be stored in etcd. This is referred to as the stored version, which can be specified by adding the comment +kubebuilder:storageversion above the Go type for that version.
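For example, marking v1alpha2 as the stored version could look like this (a sketch, assuming the type keeps the same structure as the earlier v1alpha1 definition):

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:storageversion
type Mykind struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   MykindSpec   `json:"spec,omitempty"`
    Status MykindStatus `json:"status,omitempty"`
}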

Note:

  1. Avoid tightening schemas between API versions, as it can result in objects stored in etcd that are invalid under the new schema.
  2. In practice, the above example would take at least two versions to roll out properly, so that users have at least one version of the API that supports specifying both numpods and replicas. See more here about API conventions.

There are two types of admission webhooks: mutating and validating.

Mutating webhooks allow for optional fields in the CR to be populated with default values. This can also be done to an extent within the CRD schema. This separation of concerns helps prevent overcomplicating the controller code.

For example we may decide to add a new optional field that allows users to provide a pod template:

// MykindSpec defines the desired state of Mykind
type MykindSpec struct {
    Replicas int                  `json:"replicas"`
    Template core.PodTemplateSpec `json:"template,omitempty"`
}

If users don’t specify a template, the mutating webhook will fill in a default one. We can write the following:

var _ webhook.Defaulter = &Mykind{}

func (r *Mykind) Default() {
    // Template is a struct, so compare against its zero value to detect "not set"
    // (uses the standard library's reflect package).
    if reflect.DeepEqual(r.Spec.Template, core.PodTemplateSpec{}) {
        r.Spec.Template = defaultTemplate()
    }
}

Which could generate the following defaults:

apiVersion: grp.example.com/v1alpha2
kind: Mykind
metadata:
  name: mykind-sample
spec:
  replicas: 3
  template:
    metadata:
      labels:
        mylabel: mylabel
    spec:
      containers:
      - name: busybox
        image: busybox
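For completeness, here is a sketch of what the defaultTemplate helper (an assumed helper, not part of kubebuilder) might return to produce those defaults:

func defaultTemplate() core.PodTemplateSpec {
    return core.PodTemplateSpec{
        ObjectMeta: metav1.ObjectMeta{
            Labels: map[string]string{"mylabel": "mylabel"},
        },
        Spec: core.PodSpec{
            Containers: []core.Container{
                {Name: "busybox", Image: "busybox"},
            },
        },
    }
}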

Validating webhooks allow for semantic validation of the CR. This is in contrast to syntactic validation performed by the OpenAPI v3 Schema defined on the CRD.

Semantic validation is complex validation logic that can’t be done using syntactic validation. You should encode as much of the logic as possible in the syntactic validator. For example, while you can create a validating webhook to verify that the number of replicas is not negative:

var _ webhook.Validator = &Mykind{}

func (r *Mykind) ValidateCreate() error {
    if r.Spec.Replicas < 0 {
        return errors.New("spec.replicas: value must be non-negative")
    }
    return nil
}

func (r *Mykind) ValidateUpdate(old runtime.Object) error {
    if r.Spec.Replicas < 0 {
        return errors.New("spec.replicas: value must be non-negative")
    }
    return nil
}

func (r *Mykind) ValidateDelete() error {
    return nil
}

You can also do this in the structural schema:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: mykinds.grp.example.com
spec:
  group: grp.example.com
  names:
    kind: Mykind
    plural: mykinds
    singular: mykind
  scope: Namespaced
  versions:
  - name: v1alpha2
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer
                minimum: 0

7. Don’t Expect Webhooks

The controller can’t guarantee that admission webhooks will be enabled and working unless the webhooks are delivered alongside the CRD and controller.

As a fallback mechanism, it’s good practice to create an initialization method for the reconciler that mutates and validates (in this order) the CR within the controller. If the resource is not valid, we should report the error in the same way we would a controller error (see below).
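A sketch of what such a fallback could look like, reusing the webhook methods defined above (initialize is an illustrative name, and grpv1alpha2 is an assumed import alias for the v1alpha2 API package):

func (r *MykindReconciler) initialize(cr *grpv1alpha2.Mykind) error {
    // Mutate first: fill in defaults the mutating webhook would normally set.
    // This is done in memory only; the spec is not written back (see the note below).
    cr.Default()
    // Then validate, reusing the validating webhook's logic.
    return cr.ValidateCreate()
}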

The difference between this and webhooks is that by the time the controller performs this initialization, etcd has already stored that version of the CR. And once an invalid CR is in etcd, the controller that owns it cannot reject or remove it; it can only report the problem.

Note: It’s dangerous for the reconciler to touch the spec of the resource because spec should describe the desired state and the controller should not be modifying that desired specification. It’s also dangerous because of the infinite loop possibility.

8. Manage Errors Gracefully

Failures should be expected and should be handled gracefully by the controller. This is done by:

  • Reporting errors in the CR status instead of infinitely printing to the log
  • Communicating errors by generating events so these can be captured by kubectl describe

For example, you can create the following helper functions:
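A minimal sketch of such a helper, assuming the status fields used earlier and an event recorder on the reconciler (r.Recorder, a record.EventRecorder wired in from the manager):

func (r *MykindReconciler) reportError(ctx context.Context, cr *grpv1alpha1.Mykind, reason string, err error) (ctrl.Result, error) {
    // Record the failure in the CR status so it survives restarts.
    cr.Status.LastUpdate = metav1.Now()
    cr.Status.Reason = reason
    cr.Status.Status = metav1.StatusFailure
    if updateErr := r.Status().Update(ctx, cr); updateErr != nil {
        return ctrl.Result{RequeueAfter: time.Second * 3}, updateErr
    }

    // Emit an event so the failure shows up in `kubectl describe`.
    r.Recorder.Event(cr, core.EventTypeWarning, reason, err.Error())

    // Returning the error lets controller-runtime requeue with exponential backoff.
    return ctrl.Result{}, err
}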

The reconciler should in many cases re-queue the event to try again. By default, when the reconciler returns an error, controller-runtime requeues the request with exponential backoff (i.e. an exponentially increasing requeue delay). So, even if the error seems unrecoverable, you can requeue the event and exponential backoff will ensure that you don’t keep retrying too often. For example:

  • An image ref in the spec gives a 404 from the registry. That seems unrecoverable, right? Wrong: that image might show up in the future.
  • An image ref in the spec gives a 401 unauthorized. That won’t be resolved until the user sets credentials, right? Wrong: that image might be made public in the future.

9. Leave No Trace

If a Mykind CR is deleted, all the pods that were created by the CR should also be deleted. In order to achieve this, owner references can be set on the pods created by the controller for a particular CR:

ctrl.SetControllerReference(crInstance, pod, r.Scheme)

This relies on garbage collection, which is only possible for resources internal to the Kubernetes cluster. There are also quite a few restrictions for owner references relating to resource scope which you can find here.

If the CR creates external resources, the controller must make use of finalizers to specify clean up logic.
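A minimal sketch of the finalizer flow inside the reconciler, using controller-runtime's controllerutil helpers (the finalizer name and the cleanupExternalResources helper are illustrative assumptions):

const myFinalizer = "grp.example.com/finalizer"

if cr.ObjectMeta.DeletionTimestamp.IsZero() {
    // CR is not being deleted: make sure our finalizer is present.
    if !controllerutil.ContainsFinalizer(cr, myFinalizer) {
        controllerutil.AddFinalizer(cr, myFinalizer)
        if err := r.Update(ctx, cr); err != nil {
            return ctrl.Result{}, err
        }
    }
} else if controllerutil.ContainsFinalizer(cr, myFinalizer) {
    // CR is being deleted: clean up external resources, then remove the finalizer
    // so Kubernetes can finish the deletion.
    if err := r.cleanupExternalResources(ctx, cr); err != nil {
        return ctrl.Result{}, err
    }
    controllerutil.RemoveFinalizer(cr, myFinalizer)
    if err := r.Update(ctx, cr); err != nil {
        return ctrl.Result{}, err
    }
}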

10. Be Aware of Utilization

Controllers should be cautious of resource scarcity and avoid requesting or consuming more resources than what is needed.

Define Resource Requests: Many clusters specify ResourceQuota configurations to prevent tenants from being able to consume all resources or block other workloads from being scheduled. As a result, controllers should define resource requests for the controller pod and any other pod managed by the controller.
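For the pods our controller creates, that can be as simple as setting requests on the container spec before creating the pod (the values below are illustrative; resource is k8s.io/apimachinery/pkg/api/resource):

pod.Spec.Containers[0].Resources = core.ResourceRequirements{
    Requests: core.ResourceList{
        core.ResourceCPU:    resource.MustParse("100m"),
        core.ResourceMemory: resource.MustParse("64Mi"),
    },
}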

Controller memory utilization: By default, the in-memory cache is set up with informers for all resources the controller tracks, which can exhaust the controller’s memory. By creating labels specific to the controller on the resources it creates (and referencing these labels in a label selector in the cache, as shown above) you can substantially limit the cache size (this is different from owner references). One can also make use of paginated List() calls, as sketched below.
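A sketch of a paginated List that reads directly from the API server rather than the cache (r.APIReader is an assumed client.Reader field, e.g. wired in from mgr.GetAPIReader()):

podList := &core.PodList{}
continueToken := ""
for {
    if err := r.APIReader.List(ctx, podList,
        client.InNamespace(cr.Namespace),
        client.Limit(100),
        client.Continue(continueToken)); err != nil {
        return ctrl.Result{}, err
    }
    // ... process podList.Items ...
    continueToken = podList.Continue
    if continueToken == "" {
        break
    }
}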

Kube-Apiserver memory usage: Depending on the resources tracked by the controller, it’s possible to exhaust the apiserver’s memory allocation. Pay special attention when watching resources like secrets and configmaps which can be numerous on large clusters. You may consider using immutable configmaps as a possible solution.

References

  1. https://sdk.operatorframework.io/docs/best-practices/best-practices/
  2. https://cloud.redhat.com/blog/kubernetes-operators-best-practices
  3. https://www.oreilly.com/library/view/programming-kubernetes/9781492047094/

Acknowledgements

Thank you to Joe Lanford, Camila Macedo, Frederic Giloux, Alex Greene, and Varsha Prasad for their contributions.
