Kubernetes, GitOps, and Odd Jobs

Jack · Published in Mothership · 5 min read · Apr 27, 2022

We love Kubernetes at Mothership, but one area we’ve found lacking is its ability to run one-off jobs. It’s not that Kubernetes can’t run one-off jobs — that’s why we have the Job resource after all — it’s that these one-off jobs don’t fit neatly into a GitOps deployment workflow.

The GitOps Problem

GitOps works incredibly well for slow-changing or persistent state, but starts to fall apart when talking about rapidly-changing or ephemeral state.

GitOps in a nutshell: updates in git trigger automation pipelines that ultimately update the runtime environment to match the source of truth.

I like to think of GitOps as storing your ideal system state in git. The system may not always be able to reach the desired state, but the intention is there.

Let’s say I’m maintaining my service’s replica count in git — if I push a commit that scales that service to three replicas, I generally want the system to maintain three replicas until I push a change to scale it back down. A lack of compute resources or a crashed server doesn’t mean I want any less than my three replicas.

But in this example, we consider replica count to be persistent state. We have no requirement to change it frequently. Now imagine we want to scale throughout the day according to the load on the service. If we continue to think of replica count as persistent state, this would require close monitoring and pushing commits as the load changes.

If we reframe replica count as ephemeral state, we also need to reframe our ideal system state so that it works well in a GitOps model. Kubernetes gives us the HorizontalPodAutoscaler, which does just that. The configuration for our HPA becomes persistent state that we manage manually, while replica count becomes ephemeral state managed by the controller.
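As a rough sketch of that split, here's what the persistent HPA policy might look like when created with the kubernetes Python client (the service name, namespace, and CPU target are assumptions for illustration, not our actual setup, and this assumes a recent client that exposes the autoscaling/v2 API):

```python
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV2Api()

# Persistent state: the autoscaling policy itself changes rarely and can live in git.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="my-service"),  # hypothetical service name
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="my-service"
        ),
        min_replicas=3,
        max_replicas=10,
        # Ephemeral state: the controller adjusts replicas to hold roughly 70% CPU utilization.
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                ),
            )
        ],
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)
```

In practice the policy would be applied by the GitOps pipeline rather than by hand; the point is that only the policy is hand-managed, while the replica count it produces is not.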

One-off Jobs

Just like replica counts, one-off jobs fall into this category of ephemeral state. If I want to run a script to backfill a new database, I generally want it run once to completion and then never again. Once it completes, it no longer has a place in my ideal system state.

If I’m treating one-off jobs as persistent state, I have to go remove my job from git after it runs. If I forget, the next commit can hit one of two problems: blocked deploys, or jobs running more than once. Nobody wants that kind of surprise when they’re trying to ship an unrelated, urgent bug fix.

Leaving a job checked in can block deployment pipelines, or cause it to run twice!

When we looked at how our developers were using jobs, we found most had the following in common:

  • The job was truly single-use, backfilling a database or vendor, repairing data from a bug, etc.
  • The job was a custom script, but made use of the service’s type definitions and business logic
  • The job needed the same configuration/secrets as the service, or at least a subset of it

In a way, most of this is already persistent state managed under GitOps — it’s the configuration and code that make up our Deployment in the first place. The ephemeral state is just the presence of the job in our cluster. In that case, we’d need some kind of controller to manage this state, creating jobs from deployments in the cluster on demand.

OddJob: The Mothership Job Service

That is, in a nutshell, what our job service does for us. OddJob is our solution to creating and running jobs in the cluster on demand. It can take an existing Deployment in our cluster, and build/deploy a usable Job manifest from it.
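A minimal sketch of that core idea with the kubernetes Python client follows. This is not OddJob’s actual implementation; the deployment name, namespace, and command are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
batch = client.BatchV1Api()

# Start from the Deployment that already carries the image, env vars, and secret mounts.
deployment = apps.read_namespaced_deployment(name="my-service", namespace="default")

# Reuse its pod spec, overriding only what a one-off job needs.
pod_spec = deployment.spec.template.spec
pod_spec.restart_policy = "Never"
pod_spec.containers[0].command = ["node", "scripts/backfill.js"]  # hypothetical script

job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name="my-service-oddjob-"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(spec=pod_spec),
        backoff_limit=0,  # a one-off job should not retry on its own
    ),
)
batch.create_namespaced_job(namespace="default", body=job)
```

Because the Job reuses the Deployment’s pod spec, it picks up the same configuration and secrets the service already has, which is exactly what most of our one-off jobs needed.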

For the most part, this gives us the best of both worlds. Our engineers can write one-off tasks the same way they’d develop anything else. Their code will get reviewed, tested and deployed using the workflow they’re already familiar with — and there is no cleanup step or concern that their job will run twice. The difference is that we no longer use git to manage when a job is run, or the command run in the container.

Since it would be dangerous to let anybody create a job from any deployment, we let teams manage permissions through annotations on the resource. Service owners can opt in to OddJob by adding annotations specifying which teams or users can create jobs for their deployment. If no annotations are present on a given deployment, OddJob ignores it and refuses to let anyone create a job from it.

Allowed-teams / allowed-users annotations on a deployment dictate who can create jobs from it; annotations like these would allow our carrier and marketplace teams to do so.
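A rough sketch of what that opt-in check could look like (the annotation keys here are made up for illustration; the post doesn’t show OddJob’s real keys):

```python
from typing import Iterable

# Hypothetical annotation keys, purely for illustration.
ALLOWED_TEAMS_ANNOTATION = "oddjob/allowed-teams"
ALLOWED_USERS_ANNOTATION = "oddjob/allowed-users"

def can_create_job(deployment, user: str, teams: Iterable[str]) -> bool:
    """Return True only if the deployment has opted in and lists the user or one of their teams."""
    annotations = deployment.metadata.annotations or {}
    allowed_teams = {t.strip() for t in annotations.get(ALLOWED_TEAMS_ANNOTATION, "").split(",") if t.strip()}
    allowed_users = {u.strip() for u in annotations.get(ALLOWED_USERS_ANNOTATION, "").split(",") if u.strip()}

    # No annotations means the service owner never opted in: ignore the deployment entirely.
    if not allowed_teams and not allowed_users:
        return False

    return user in allowed_users or any(team in allowed_teams for team in teams)
```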

OddJob logs every command that gets run and the user who ran it. While we trust our team and have a way to audit what has been run, it’s still more access to the cluster than we feel comfortable with in the long run. An ideal system would let us audit commands before they are run, rather than after. We have plans to revisit this soon, but decided we can live with the tradeoff while proving out the developer experience.

So what does this look like in action?

Terminal recording showing our job service in action.

What’s next for OddJob?

There are a few features we’d still like to add, but we’re waiting on either upgrades to our clusters or more bandwidth. A few ideas we’ve thrown around:

  • Creating our own ‘Job’ CRD that lets us save a job template, including the job’s command. This would let the team review and pre-approve the commands being run in the containers. It would also let us cover jobs that need different settings (ex: more resources) than the deployment they’re based on.
  • Once we upgrade our cluster, we’ll have the suspended-job feature. With it, we could create jobs in a suspended state and add a Slack approval workflow, which would let us audit commands before they run as well as when they run (a rough sketch of this flow follows the list).
  • Maybe a stretch, but indexing the file paths in our container images. This would allow us to auto-complete paths to scripts or files, making it even easier for engineers to find the right path to their script in the container.
  • Open sourcing — this will come once we’ve tightened up our security model, cleaned up the code, and written ample instructions for deploying and running. But we’ve already seen a lot of value and think others might too!
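For the suspended-job idea above, a minimal sketch with the kubernetes Python client, assuming Kubernetes 1.21+ and a client version that exposes the suspend field (the image and command are placeholders, and the Slack approval step is only implied by the comment):

```python
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

# Create the Job suspended, so no pods start until someone approves the command.
job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name="my-service-oddjob-"),
    spec=client.V1JobSpec(
        suspend=True,  # requires Kubernetes 1.21+ (batch/v1 suspended jobs)
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="oddjob",
                        image="registry.example.com/my-service:latest",  # placeholder image
                        command=["node", "scripts/backfill.js"],         # placeholder command
                    )
                ],
            )
        ),
    ),
)
created = batch.create_namespaced_job(namespace="default", body=job)

# ...later, once the command is approved (e.g. from Slack), un-suspend it to start the pods.
batch.patch_namespaced_job(
    name=created.metadata.name,
    namespace="default",
    body={"spec": {"suspend": False}},
)
```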

Come help us build it!
