Migrating our cron jobs to Kubernetes

Liam Wharton
Published in Kudos Engineering
Jul 28, 2022

You may have seen in previous blog posts that back in 2020 the Engineering team at Kudos completed an epic called ‘cattle not pets’, which involved migrating our Ruby application away from our Amazon Web Services Elastic Compute Cloud (EC2) ‘pet’ servers to our ‘cattle’-like Google Kubernetes Engine (GKE) cluster. The end result was hugely successful, but it still left us with one EC2 instance of the Ruby application, whose sole purpose was to run all our scheduled tasks. Fast forward to the dawn of 2022 and the inception of our current engineering roadmap, where this frustrating quirk was the most popular item on the agenda!

Photo by Barth Bailey on Unsplash

Why now?

Although we knew this migration needed to happen, all tasks appeared to be running and it was business as usual, which meant it never really featured in our list of priorities. That was until issues began in the second half of 2021, which highlighted the flaws of this approach and the main benefits that migrating to our Kubernetes cluster would offer:

  • Cattle not pets — You may have seen in previous posts that we have a drive to move away from pet servers. Our EC2 instance for scheduled tasks is a prime example of a pet, and it consumes time that could be better spent elsewhere.
  • Better monitoring — The current observability of scheduled tasks within our EC2 instance does not meet our standards and makes it challenging to identify problems early.
  • Improved performance — Running tasks within Kubernetes allows us to easily run each task in isolation, with its own resource allocation. Our current setup offers exactly the opposite, so jobs have a negative impact on one another’s performance and completion.

The plan

Having identified a problem we all felt so passionately about, the next question was: how should we fix it? These scheduled tasks are crucial to ensure the information we serve is correct, meaning we couldn’t afford to decommission a task on our EC2 instance without knowing that it’s running successfully in our Kubernetes cluster. In stark contrast, we also couldn’t afford to have the same job running multiple times, at the same time, from different locations.

The tasks themselves are Ruby Rake tasks, and need to run within the latest version of our monolithic app, which is stored as an image in Google Container Registry. After looking at our scheduling and job run times, we discovered that the majority start before 06:00 and complete in less than 12 hours, on an EC2 instance which is very stretched for resources, so it was safe to assume they would always take less time in Kubernetes. Jumping straight to a CronJob, which runs automatically on a schedule, felt like too big a first step for any migration. So we decided to run each task as a Job, triggered manually 12 hours after the start of the scheduled task on the EC2 instance, with successful completion being the requirement for converting it to a CronJob.
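To make that concrete, here is a rough sketch of what one of these manually triggered Jobs could look like. The names, namespace and Rake task below are placeholders rather than our real config, and the real spec also includes the sidecar containers described in the next section.

apiVersion: batch/v1
kind: Job
metadata:
  name: example-rake-task        # placeholder job name
  namespace: production
spec:
  backoffLimit: 0                # don't retry automatically while validating the migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ruby-application
          image: gcr.io/example-project/monolith:abc1234   # placeholder image and tag from Container Registry
          command: ['bundle', 'exec', 'rake', 'example:task']   # placeholder Rake task

Creating this with kubectl apply runs the task once; the Job only counts as complete when every container in its pod exits successfully, which is exactly where the problems described next came from.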

Overview of the pod and containers used by the Ruby application.

Dependencies

On the surface the initial run appeared to work, as the logs confirmed that the process had finished. However, some of the containers within the pod were still running, and as a result the job was not marked as complete.

Why? Dependencies. The two remaining containers were sidecars to the main container running the Rake task, and the current implementation gave them no way of knowing that the job had finished, so they could never gracefully terminate. This meant we needed to be able to control the dependencies between containers in a pod. Enter Kubexit, a command supervisor for coordinated pod container termination. It allowed us to create death dependencies, meaning that when the main container exited we could trigger the exit of the two sidecar containers. Once all the containers had exited error-free, the job could mark itself as complete, allowing the removal of all the containers and the pod encompassing them.

env: [
  {
    name: 'KUBEXIT_NAME',
    value: 'logging-agent',
  },
  {
    name: 'KUBEXIT_DEATH_DEPS',
    value: 'ruby-application',
  },
],

Declaring a death dependency in the environment variables for the FluentD logging agent container.

But as well as death dependencies, Kubexit also offered us another benefit. We could now control birth dependencies too, to ensure certain containers didn’t attempt to start until one or more other containers within the same pod were up and running. In our case we ideally wanted the reverse of the death dependencies, as it’s pointless writing to the logs or attempting to communicate with our SQL database if the relevant sidecar containers aren’t ready for traffic.

env: [
  {
    name: 'KUBEXIT_NAME',
    value: 'ruby-application',
  },
  {
    name: 'KUBEXIT_BIRTH_DEPS',
    value: 'cloudsql-proxy',
  },
],

Declaring a birth dependency in the environment variables for the main Ruby application container.

After implementing birth and death dependencies via Kubexit, we were in a position to try again, but the job still didn’t complete. Kubexit needs to be able to get and list the containers within a pod, a permission that the default GKE service account, which is used to run jobs unless otherwise specified, doesn’t have. Adding the extra permissions to the default service account was quickly ruled out, and it was decided that a new service account would be created with the correct permissions. This meant we’d also need to declare a new role and then bind the service account to that role for the job. After more in-depth discussion, we decided that the best solution would be:

  • A new role — To be used by all jobs to allow Kubexit to get and list the containers within a pod. As there is only one, this was added to our infrastructure repository for future use.
  • A new service account per job — To allow us to be able to change the permissions given on a job-by-job basis.
  • A role binding — To bind the permissions allocated in the role to the service account used by the job.

This solution offered the best balance of minimal code and future-proofing our implementation.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
      - watch

New role added to our cluster to allow Kubexit to function correctly.
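For illustration, the per-job service account and role binding that sit alongside that role would look roughly like the following (the job name is a placeholder); the job’s pod spec then points at the service account via serviceAccountName.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-rake-task        # one service account per job (placeholder name)
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-rake-task-pod-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: example-rake-task
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader               # the shared role shown above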

Once the service account, role and role binding had been created, we tried the job again. The Rake task completed, all containers terminated gracefully thanks to the services of Kubexit, and the job was finally able to mark itself as complete.

Templating

Following the success of the first job, we converted the task from a Job to a CronJob within Kubernetes, the process of which was fairly simple. The next step was to review the process and begin to think about what would be required to repeat the task for the remaining jobs. Whilst carrying out this process we realised that a large percentage of the YAML file containing the Kubernetes config for a job would be the same, with minor modifications for schedule, job name, and so on. This naturally led to thoughts of some form of cron job template, to help streamline the process and keep the codebase DRY.

Our first port of call was Kustomize, as it has been supported by Kubectl since Kubernetes version 1.14. We quickly discounted it, as it didn’t seem to offer the template-based approach we were looking for, instead being better suited to other areas, such as deployments across different environments. We then looked into YQ, which describes itself as a YAML command line processor, based on JQ for JSON. On the surface it looked like it could assist with some of what we were trying to achieve, but after a deeper dive we again realised that YQ wasn’t designed with the template-based approach we had in mind.

After more searching we were fortunate enough to come across this two-part blog post documenting the use of Jsonnet with Kubernetes. Jsonnet describes itself as a data templating language for apps and developers and a simple extension of JSON. This sounded promising: it billed itself as a templating language designed for JSON rather than YAML, but since YAML is a superset of JSON, and Jsonnet can also emit YAML directly, that wasn’t an issue. Even better, the Kubernetes community had already used Jsonnet and had produced a small library of resources along the way.

After reading the docs and completing the tutorials, we were able to see that Jsonnet would allow us to define all the Kubernetes Kinds we require for each cron job as their own Jsonnet template. These templates could accept variables, via function input parameters, and could therefore be passed a config object containing the job-specific information, such as schedule and job name. Exactly the approach we had in mind! In the end we decided not to make use of any additional tooling listed in the community resources and stuck with a standard Jsonnet implementation, as out of the box it supported our needs and the Jsonnet files retained a very similar look and feel to standard Kubernetes YAML configuration files.
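As a rough sketch only, heavily simplified and with a made-up image name, such a template might look something like this; the real template also defines the sidecar containers, the Kubexit environment variables and everything else a job needs, all driven by the same config object.

// template/cron-job-template.libsonnet (simplified sketch)
function(config) {
  apiVersion: 'batch/v1',  // batch/v1beta1 on older clusters
  kind: 'CronJob',
  metadata: {
    name: config.cronName,
    namespace: 'production',
  },
  spec: {
    schedule: config.schedule,
    jobTemplate: {
      spec: {
        template: {
          spec: {
            serviceAccountName: config.cronName,  // assumes the per-job service account shares the job's name
            restartPolicy: 'Never',
            containers: [
              {
                name: 'ruby-application',
                image: 'gcr.io/example-project/monolith:latest',  // placeholder image
                command: ['bundle', 'exec', 'rake', config.rakeTask],
              },
            ],
          },
        },
      },
    },
  },
}

Each cron job then simply imports this function and calls it with its own config object, as in the snippet further below.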

Final template structure used for our cron jobs.

After successfully implementing the structure above we were able to create a Jsonnet file for the cron job we’d already migrated. The new file was only 8 lines long, down from 262, and the new structure was clean and easy to understand.

local cronTemplate = import 'template/cron-job-template.libsonnet';

cronTemplate({
  cronName: 'kudos-track-clicks',
  schedule: '02 00 * * *',
  rakeTask: 'articles:kudos_track_clicks',
  serviceAccountSecretName: 'kudos-track-clicks-token-rvbp7'
})

New Jsonnet file for the job already migrated to the cluster

Building in the cloud

With a templating solution now in place, the next question was: when and where should the Jsonnet files be run through the interpreter? One option was to do it locally and commit the YAML output to our repository. This seemed like a good idea, as it would avoid running the process on every build in our CI/CD pipeline, but the image of the main container changes with every build because we use the latest commit hash in the image tag. That meant this approach would still require some form of substitution at build time. To keep the whole process in one place without increasing the time of the main build, it became apparent that the best way to handle it was a separate build, triggered by the main one.

The new build process contains the following steps:

  1. Generate a Cloud Build image containing an installation of Jsonnet.
  2. Run all the Jsonnet files through the Jsonnet interpreter, saving the YAML output.
  3. Apply the saved YAML to the Kubernetes cluster.
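As a sketch of what that Cloud Build configuration could look like (the builder image, file paths and cluster details here are illustrative assumptions rather than our actual setup):

steps:
  # 1. Build an image containing the Jsonnet interpreter
  #    (assumes a suitable Dockerfile lives at build/jsonnet/ in the repo)
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/jsonnet', 'build/jsonnet']

  # 2. Run every cron job's Jsonnet file through the interpreter
  #    (the JSON that Jsonnet emits is itself valid YAML, so it can be applied as-is)
  - name: 'gcr.io/$PROJECT_ID/jsonnet'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        mkdir -p output
        for f in cron-jobs/*.jsonnet; do
          jsonnet "$f" > "output/$(basename "$f" .jsonnet).yaml"
        done

  # 3. Apply the generated config to the GKE cluster
  - name: 'gcr.io/cloud-builders/kubectl'
    args: ['apply', '-f', 'output/']
    env:
      - 'CLOUDSDK_COMPUTE_ZONE=europe-west2-a'        # placeholder zone
      - 'CLOUDSDK_CONTAINER_CLUSTER=example-cluster'  # placeholder cluster name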

With the additional build process in place and fully functioning, the first migration was complete, and we now have solid foundations to use for the remaining jobs. The benefits of migrating the job to Kubernetes were visible within the first week: it completed in record time thanks to better resourcing, and on the one occasion it failed to run, due to a third-party issue, we were immediately alerted, allowing us to reschedule it for later that day. All that’s left now is to pick another job from the ones remaining to migrate, and start the (now simple) process again.
