One-off jobs on Kubernetes Nodes

Mikhail Advani · Published in NN Tech · Jul 15, 2021

In this article, I will walk you through how we ran one-off jobs on Kubernetes nodes as soon as they join the cluster.

Disclaimer: This is a mutating approach, which goes against many of the principles Kubernetes stands for, and you should use it only if you have no other option. We had no other option, since our cloud provider did not allow custom OS images at the time of writing.

Problem Statement

To drop CA certificates into the trust store of the node's container runtime so that images can be pulled from registries whose certificates are signed by a CA that is not well known.

Future Support

The solution we picked had to support both Docker and containerd as container runtimes. This was because we were just a couple of weeks away from a Kubernetes version upgrade, which also meant a change of container runtime from Docker to containerd.

For Docker, the task was simple: mount the node's /etc/docker/certs.d directory into a container, create a folder named after each registry domain, and copy the public CA certificate into it.
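As a rough sketch (not the exact manifest we used), the Docker case can be handled by a container that mounts the node's /etc/docker/certs.d directory via a hostPath volume and copies the CA certificate from a ConfigMap into a folder named after the registry. The registry domain, ConfigMap name, and image below are placeholders.

```yaml
# Container + volume fragment of a pod spec for the Docker task; all names are placeholders.
containers:
  - name: deploy-docker-certs
    image: registry.example.com/tools/alpine:3.18      # placeholder utility image
    command:
      - sh
      - -c
      - |
        set -e
        # Create a folder named after the registry domain and drop the CA cert into it.
        mkdir -p /host/docker-certs/registry.internal.example.com
        cp /certs/ca.crt /host/docker-certs/registry.internal.example.com/ca.crt
    volumeMounts:
      - name: docker-certs
        mountPath: /host/docker-certs
      - name: ca-cert
        mountPath: /certs
volumes:
  - name: docker-certs
    hostPath:
      path: /etc/docker/certs.d                        # Docker's per-registry trust store on the node
      type: DirectoryOrCreate
  - name: ca-cert
    configMap:
      name: private-ca-cert                            # placeholder ConfigMap holding ca.crt
```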

For containerd, things were a little trickier, since containerd did not have a way to “hot reload” its trust store, either through its config.toml or through the OS trust store. So here we needed a privileged container to restart containerd after copying the certificates.
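A sketch for the containerd case, assuming a systemd-managed containerd and a Debian/Ubuntu-style OS trust store (paths, names, and the image are placeholders); the pod spec also needs hostPID: true so nsenter can reach the host's namespaces:

```yaml
# Privileged container + volume fragment for the containerd task; all names are placeholders.
containers:
  - name: deploy-containerd-certs
    image: registry.example.com/tools/ubuntu:22.04     # placeholder image that ships nsenter
    securityContext:
      privileged: true                                 # needed to enter host namespaces and restart containerd
    command:
      - sh
      - -c
      - |
        set -e
        # Copy the CA cert into the host's trust store (Debian/Ubuntu layout assumed).
        cp /certs/ca.crt /host-ca-certs/private-ca.crt
        # Rebuild the OS trust store and restart containerd inside the host's namespaces.
        nsenter -t 1 -m -u -i -n -p -- update-ca-certificates
        nsenter -t 1 -m -u -i -n -p -- systemctl restart containerd
    volumeMounts:
      - name: host-ca-certs
        mountPath: /host-ca-certs
      - name: ca-cert
        mountPath: /certs
volumes:
  - name: host-ca-certs
    hostPath:
      path: /usr/local/share/ca-certificates           # host trust store location (assumption)
      type: DirectoryOrCreate
  - name: ca-cert
    configMap:
      name: private-ca-cert                            # placeholder ConfigMap holding ca.crt
```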

Solution

Scheduling

Given the goal was to run a task once and then terminate, Kubernetes Jobs felt like the right candidate. The challenge, though, was that the task had to run once on every node as soon as it joined the cluster. We would have needed a sort of node controller that created Jobs and controlled their scheduling by passing nodeName/nodeSelector in the job spec. Adding such a controller seemed like overkill, and recovering from failures also seemed quite challenging in terms of retries, reporting, etc.
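For illustration, a Job created by such a hypothetical controller could be pinned to one specific node via nodeName (the node name and image are placeholders):

```yaml
# Hypothetical sketch: one Job per node, created and pinned by a custom controller.
apiVersion: batch/v1
kind: Job
metadata:
  name: deploy-ca-certs-node-abc123                    # placeholder, one per node
spec:
  backoffLimit: 3
  template:
    spec:
      nodeName: node-abc123                            # pins the pod to this specific node
      restartPolicy: OnFailure
      containers:
        - name: deploy-ca-certs
          image: registry.example.com/ca-cert-deployer:latest   # placeholder image
```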

Because of the run-on-every-node aspect, a DaemonSet felt like a better fit, but that meant we would have to wrap it up with an infinite sleep, or else the control plane would launch a new pod once the DaemonSet pod reached the Completed state. Also, after one execution the pod was no longer needed, and running a privileged container when it is not needed is bad from a security perspective. To tackle this, we took a three-step approach.

  1. Add an additional label to the node configurations in the cloud templates for all nodes.
  2. Set that label as the nodeSelector for the DaemonSet.
  3. After deploying the certificates, have the pod remove the label from its own node. The node's name is passed to the pod as an environment variable using the Downward API (see the sketch after this list).
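Putting the three steps together, a minimal sketch of the DaemonSet could look like the following. The ca-certs=pending label, the image (assumed to ship kubectl), and the copy-certs.sh script wrapping the certificate steps above are all hypothetical.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: deploy-ca-certs
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: deploy-ca-certs
  template:
    metadata:
      labels:
        app: deploy-ca-certs
    spec:
      serviceAccountName: deploy-ca-certs              # bound to the ClusterRole shown below
      priorityClassName: system-node-critical          # schedule ahead of ordinary workloads
      hostPID: true                                    # needed for the containerd restart
      nodeSelector:
        ca-certs: pending                              # hypothetical label set in the cloud templates
      containers:
        - name: deploy-ca-certs
          image: registry.example.com/tools/deploy-ca-certs:latest   # placeholder image with kubectl
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName             # Downward API: the node this pod landed on
          command:
            - sh
            - -c
            - |
              set -e
              # 1. Deploy the certificates (hypothetical script wrapping the steps above).
              /scripts/copy-certs.sh
              # 2. Remove the label so the DaemonSet no longer targets this node;
              #    the controller then terminates the pod cleanly.
              kubectl label node "$NODE_NAME" ca-certs-
```

A non-zero exit anywhere before the unlabel step leaves the label in place, so the pod is simply restarted and the work retried.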

This did need an extra ClusterRole and ClusterRoleBinding, but it allowed us to piggyback on the built-in Kubernetes machinery for scheduling and for clean termination of the pod once execution was complete. Additionally, retries in case of intermittent failures became trivial: by just exiting with a non-zero exit code, the pod would be restarted without any extra logic.
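The extra RBAC boils down to letting the pod's ServiceAccount patch nodes (removing a label is a patch on the Node object); the names below follow the hypothetical DaemonSet sketch above.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: deploy-ca-certs
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deploy-ca-certs
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch"]                            # removing a label patches the Node object
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: deploy-ca-certs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: deploy-ca-certs
subjects:
  - kind: ServiceAccount
    name: deploy-ca-certs
    namespace: kube-system
```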

Sequencing

Of course, we had to pull this image from a registry whose certificate was signed by a well-known CA, or else we would have been in a chicken-and-egg situation.

This task had to run very early in the lifecycle of the node, since the image pulls of all other pods depended on the execution of this container. We used the system-node-critical priority class to ensure that these pods are deployed first. In general, give all DaemonSet pods a priority class higher than any other workload, so that DaemonSet pods are always scheduled on the nodes they need to run on.
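system-node-critical is a built-in priority class, so it only needs to be referenced via priorityClassName in the pod spec (as in the DaemonSet sketch above). If you prefer a dedicated class for your own DaemonSets, a sketch might look like this (the name and value are arbitrary choices):

```yaml
# Hypothetical custom PriorityClass that outranks ordinary workloads (default priority 0).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical
value: 1000000
globalDefault: false
description: "Ensures DaemonSet pods are scheduled ahead of regular workloads."
```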

Limitations

A couple of limitations we observed when deploying certificates with this approach were on nodes running containerd as the container runtime. When pods were scheduled while containerd was still starting, we saw them go into ImagePullBackOff or ImageInspectError states because the kubelet could not connect to the containerd socket. Once containerd was back up, execution continued seamlessly. This was something we decided to live with, since we were eventually consistent. The only alternative that came to mind was to stop the kubelet before the containerd restart and start it again afterwards, but a failure in one of the intermediate steps would have resulted in node failure, and recovery would have meant scaling up a new node and/or manual intervention.

Conclusion

To repeat, the above approach was purely to work around the cloud provider's limitation of not allowing custom OS images. As much as possible, these configurations should be baked into the node image, but when that is not possible, this hack worked well for us.
