Managing Long-Running Tasks with Celery and Kubernetes (or, Keeping Your Sanity During Deploys)

Lee Wang
Merge
5 min read · Jun 23, 2022


You have critical processes running in asynchronous tasks that block important activities like customer onboarding or data synchronization, you have to restart those tasks on every deploy, and you don’t want to have to choose between confident deploys and responsive application behavior.

The Problem

At Merge API, our backend services run on AWS EKS and use the Celery framework for scheduling asynchronous work. Many of our tasks, especially ones that respond to the initial creation of a customer’s account, can take hours to run. The reasons for this are many and vary per API that we integrate with, but as a unified API platform, certain things are out of our control, such as rate limits on the APIs we call on our customers' behalf.

One issue that has become increasingly important as our customer base has grown is that we have to restart these long-running tasks on every deployment to a Kubernetes cluster. You can imagine how choosing between deploying our code and restarting tasks that could run for an hour or more felt like the two-buttons meme.

(Image: the high-wire act of balancing different deployments with Kubernetes)

This was ultimately due to a combination of three factors, all linked to how we restart our servers:

  • The requirements of the Kubernetes shutdown process. This involves sending SIGTERM to the container, waiting a configured number of seconds (terminationGracePeriodSeconds) in the hope that the container stops itself, then sending SIGKILL to the container, forcing the end of all running processes.
  • Celery’s SIGTERM behavior: the worker stops pulling in new tasks while continuing to work on tasks it has already pulled from the Celery broker (the global, persistent task queue). This is known as a “warm shutdown” in Celery lingo.
  • The default terminationGracePeriodSeconds is 30 seconds, which, while plenty for most tasks, is sometimes barely enough to run even a single API request, depending on the Accounting / ATS / HRIS / Ticketing / CRM platform we are unifying. (A sketch of how these pieces fit together follows this list.)
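To make the interplay concrete, here is a minimal sketch of where the grace period lives in a worker Deployment. This is not our actual manifest: the names, image, and the "-A app" module are placeholders, and the worker command is just the standard Celery invocation.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker                    # hypothetical name, for illustration
spec:
  replicas: 2
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      # Kubernetes sends SIGTERM, waits this many seconds, then sends SIGKILL
      terminationGracePeriodSeconds: 30  # the Kubernetes default
      containers:
        - name: worker
          image: our-backend:latest      # placeholder image
          # On SIGTERM, Celery performs a "warm shutdown": the worker stops
          # pulling new tasks but keeps working on the ones it already claimed
          command: ["celery", "-A", "app", "worker"]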

If the way Kubernetes restarts containers is troubling to you, but you’re using a different framework or language, do not despair! Plenty of companies have experienced and solved similar issues, such as our friends over at Brex. Even though our root problems and frameworks differed (Brex wanted to keep their dependencies in sync with Elixir/Kafka; we want to keep long-running tasks moving with Python/Celery), the methods to solve these issues were the same.

I highly recommend reading Thomas’ article on how they solved their issues in conjunction with this article.

The Fix: Rolling Updates

Let’s start with what didn’t work for us. A naive approach involves just increasing the terminationGracePeriodSeconds setting to several minutes. While this would work from a Celery and Kubernetes perspective, it would not play well with our continuous integration scripts, which waited along with Kubernetes for that entire duration:

Successful removal is awaited before any Pod of the new revision is created.

Rather than have our CI control the entire process and wait alongside Kubernetes, we’d rather have Kubernetes be intelligent about its own restarts and let the CI scripts hand off that responsibility to the cluster.
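For context, here is a rough sketch of what that naive approach would have looked like under our old strategy; the one-hour value is purely illustrative, not something we shipped.

spec:
  strategy:
    type: Recreate          # all old pods must terminate before new ones start
  template:
    spec:
      # Naive fix: give warm-shutdown tasks up to an hour to finish...
      terminationGracePeriodSeconds: 3600
      # ...but every rollout (and the CI job watching it) can now stall for
      # that same hour while the old pods drain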

Thankfully, such a capability exists in the RollingUpdate deployment type. Prior to our current configuration, we were using the Recreate deployment type, which had the undesired behavior above. The magic of RollingUpdate is that it crucially lets you configure:

maxSurge: the percentage above your normal pod count that you can have during the deployment. So if you normally have 10 pods and configure a maxSurge of 20%, you will have up to 12 pods during the deploy.

maxUnavailable: the percentage of your normal pod count that can be in any state OTHER than READY during the deploy. So for 10 pods and a maxUnavailable of 20%, at least 8 will be READY, because the setting stipulates that at most 2 may be non-READY. The double negative may be confusing, but you can think of it as ~minAvailable if your name is De Morgan or something.
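Putting the two together, here is a minimal sketch of the strategy stanza using the example numbers above (the 20% figures are just this post’s illustration, not a recommendation):

spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "20%"         # up to 2 extra pods: at most 12 total during the rollout
      maxUnavailable: "20%"   # at most 2 pods non-READY: at least 8 stay READY

Old pods still get their full terminationGracePeriodSeconds to warm-shut-down, while Kubernetes keeps enough READY pods around and lets CI hand the rollout off to the cluster.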

These additional properties are enough to let the Kubernetes cluster handle the rollout on its own, which means no more sweating between two equally important goals! With rolling updates, we can be confident that long-running tasks have enough time to wrap themselves up while still deploying the latest versions of our services to the necessary clusters.

Complications

Astute readers and thorough link clickers will note that we came to a different conclusion than Brex did in terms of our settings. For them, the key to the whole process was the maxUnavailable setting, which allowed them to guarantee all pods were running and alive rather than being stuck in a loop of mismatched deployments. It’s worth pointing out that their Elixir libraries do not behave in the same manner as our Celery tasks, whose warm shutdown thankfully matches Kubernetes’ expected behavior on SIGTERM, a fortuitous alignment for Merge.

As Thomas says in Brex’s article, “Graceful shutdown is hard to implement right, as all the pieces of the infrastructure and the software must be properly configured and implemented in harmony.”


Couldn’t have said it any better. Time will tell if we run into cross-dependency issues for other reasons, though! For example, we only make backward-compatible database changes, but if a migration ever were not backward compatible, we’d be in trouble: our long-running tasks would still be running old code against the new schema.

On Using Helm (an added bonus)

Merge also uses Helm for managing our Kubernetes configurations. While Helm is great to work with, I found it strangely difficult to find documentation for where the RollingUpdate settings go in a chart.

So, here is where both settings are in our chart:

spec:
  replicas: ###
  strategy:
    type: "RollingUpdate"    # formerly Recreate
    rollingUpdate:
      maxSurge: "## %"
      maxUnavailable: "## %"
  selector:
    matchLabels:
      ...
  template:
    metadata:
      ...
    spec:
      ...
      terminationGracePeriodSeconds: ###
      ...
