Health checks for Celery in Kubernetes

A solid and functional way to build readiness and liveness probes for k8s

Ronny Vedrilla
ambient-digital
Nov 21, 2022

Once upon a time, we decided to migrate our Hetzner root server architecture into the cloud, taking a best guess with AWS and Kubernetes (k8s). The upcoming migration of all our hosted projects brought along the regular pain you experience when moving from a bare-metal server with what feels like infinite resources to a cloud-based pod which barely surpasses a laptop from 2010.

Nevertheless, we stayed on the ball and after a couple of weeks, everything was up and running smoothly. Everything? Well, except for the health checks of Python's “Celery” library. We heavily rely on it for async processing and things like talking to external APIs.


Health checks in k8s

You might be wondering how health checks work in k8s. It's quite simple: when a pod gets fired up, the cluster needs to know when it is up and running. It uses the “readiness probe” to determine this. Once the pod is running, k8s will check from time to time whether it is still alive, using the “liveness probe”. Straightforward, isn't it?
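
To make this concrete, both probes are defined on the container in the pod spec. Here is a minimal sketch, assuming a hypothetical web container that exposes a /health/ endpoint on port 8000; the name, path, port and timing values are made up for illustration.

# Illustrative pod spec excerpt - container name, port, path and timings are assumptions
containers:
  - name: backend
    image: my-backend:latest
    readinessProbe:
      httpGet:
        path: /health/
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health/
        port: 8000
      periodSeconds: 30
      failureThreshold: 3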

The challenge

So why is it more complicated to set up a health check for Celery services than for our back- and frontends? Web-based applications have to be accessible via a certain port. If the application responds on that port, it is more or less certain that it is ready and alive. Imagine you browse to any URL: if you see the website, you are good to go and can interact with it. If it does not load, well, then we have a problem.

Celery services, specifically beat and worker, are not accessible via a certain port. They run constantly, either waiting for tasks provided by the broker or watching the clock and scheduling new tasks. Therefore, we need a different approach.


RTFM?

When googling around for Celery health checks for k8s, you find a number of GitHub posts and endless threads which are hard to grasp if you are new to the topic and/or not an absolute Celery pro.

The Celery docs for monitoring and management commands unfortunately don’t state anything related to this topic. If you keep at it, you eventually find a command which sounds promising for your worker:

celery -A apps.config.celery_settings inspect ping

Perfect, we are good to go, right?

Long-running tasks

The command from the documentation works fine: if your worker is alive, the cluster will detect it. But, and that's a big bummer, the worker will NOT respond to the ping while it is busy, for example while it is processing a long-running task.

You might argue at this point that long-running tasks are to be avoided anyway and therefore shouldn't pose a problem.

Well, in the real world, tasks which used to be fast often get slower over time as the number of records they have to deal with grows. Quite often, nobody even notices that a task now takes longer. And even if you do notice, you might not have the time and/or budget to fix the issue at hand.

So, in conclusion, we need a different approach to check whether our workers are still alive.


File-based probing

I have to admit, we struggled for quite some time to find this solution, but we finally discovered a GitHub thread providing a completely different approach, utilising Celery tools I had never heard of: signals and bootsteps.

The general idea is to create a file when the worker is up and running and to touch a second file at short intervals to prove that it is still alive.

Celery provides signals for worker start-up and shutdown which we can use to realise the readiness check. For the liveness probe, we register a timer-based bootstep class.

The probing itself is realised via simple Python scripts. It comes in VERY handy to put in some “print” statements so you can manually check without any hassle whether your scripts succeed.

The solution

Worker readiness setup

Registering worker signals: put the constant at the top and the following signal handlers at the bottom of your Celery configuration file.

# backend/apps/config/celery_settings.py
from pathlib import Path

from celery.signals import beat_init, worker_ready, worker_shutdown

# File for validating worker readiness
READINESS_FILE = Path('/tmp/celery_ready')

...

@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    READINESS_FILE.unlink(missing_ok=True)

Now create a script for the cluster.

# backend/scripts/celery_readiness.py
import sys
from pathlib import Path

READINESS_FILE = Path('/tmp/celery_ready')

if not READINESS_FILE.is_file():
    sys.exit(1)
sys.exit(0)

Worker liveness setup

Create a class utilising Celery bootsteps: create a new file called celery/bootstraps.py somewhere in your codebase and add the following contents.

# backend/apps/config/celery/bootstraps.py
from pathlib import Path

from celery import bootsteps

# File for validating worker liveness
HEARTBEAT_FILE = Path('/tmp/celery_worker_heartbeat')


class LivenessProbe(bootsteps.StartStopStep):
    requires = {'celery.worker.components:Timer'}

    def __init__(self, parent, **kwargs):
        super().__init__(parent, **kwargs)
        self.requests = []
        self.tref = None

    def start(self, worker):
        self.tref = worker.timer.call_repeatedly(
            1.0, self.update_heartbeat_file, (worker,), priority=10,
        )

    def stop(self, worker):
        HEARTBEAT_FILE.unlink(missing_ok=True)

    def update_heartbeat_file(self, worker):
        HEARTBEAT_FILE.touch()

Register the probing class

Add the following after the initialisation of your Celery instance and make sure to import the LivenessProbe class from the file you just created.

# backend/apps/config/celery_settings.py
...
from apps.config.celery.bootstraps import LivenessProbe  # adjust to where you put bootstraps.py

# Setup celery
celery_app = Celery('my-project', broker=CELERY_BROKER_URL)

# Add liveness probe for k8s
celery_app.steps["worker"].add(LivenessProbe)
...

Now create a script for the actual probing.

# backend/scripts/celery_worker_liveness.py
import sys
import time
from pathlib import Path

LIVENESS_FILE = Path('/tmp/celery_worker_heartbeat')

if not LIVENESS_FILE.is_file():
    print("Celery liveness file NOT found.")
    sys.exit(1)

stats = LIVENESS_FILE.stat()
heartbeat_timestamp = stats.st_mtime
current_timestamp = time.time()
time_diff = current_timestamp - heartbeat_timestamp

if time_diff > 60:
    print("Celery worker liveness file timestamp DOES NOT match the given constraint.")
    sys.exit(1)

print("Celery worker liveness file found and timestamp matches the given constraint.")
sys.exit(0)

Commands for Kubernetes

Now you can use the following commands in your k8s yaml files to set up the health checks.

Worker Readiness probe

bash -c "python scripts/celery_readiness.py"

Worker Liveness probe

bash -c "python scripts/celery_worker_liveness.py"

What about the beat?

You might also use celery-beat to schedule tasks on top of the workers. The general idea is similar, and the readiness check can be easily set up via a Celery signal. For the liveness check, we have to set and probe a specific PID file.

Readiness probing

You can register a signal for the beat initialisation like this. You’d do this in the same file where the worker signals live.

from celery.signals import beat_init

@beat_init.connect
def beat_ready(**_):
    READINESS_FILE.touch()

Liveness probing

We have to extend the Celery beat command with the “pidfile” option.

celery -A apps.config.celery_settings beat -l info --pidfile=/tmp/celery-beat.pid

Celery will then create a file containing the process ID of the beat process. The following script checks that this file exists to make sure the beat process came up and is still registered.

# backend/scripts/celery_beat_liveness.py
import sys
from pathlib import Path

PID_FILE = Path('/tmp/celery-beat.pid')

if not PID_FILE.is_file():
    print("Celery beat PID file NOT found.")
    sys.exit(1)

print("Celery beat PID file found.")
sys.exit(0)

Register the probes just like for the worker and you are good to go; a sketch follows below.
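
Analogously, a minimal sketch for the beat container, assuming the same hypothetical image and illustrative timings as in the worker example; note the --pidfile option in the start command.

# Illustrative beat container spec excerpt - name, image, command and timings are assumptions
containers:
  - name: celery-beat
    image: my-backend:latest
    command: ["celery", "-A", "apps.config.celery_settings", "beat", "-l", "info", "--pidfile=/tmp/celery-beat.pid"]
    readinessProbe:
      exec:
        command: ["bash", "-c", "python scripts/celery_readiness.py"]
      initialDelaySeconds: 10
      periodSeconds: 10
    livenessProbe:
      exec:
        command: ["bash", "-c", "python scripts/celery_beat_liveness.py"]
      initialDelaySeconds: 30
      periodSeconds: 30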

Conclusion

Using Celery in a cloud setup is quite common and works pretty well. I find it curious that it is so hard to find information about how to set up proper health checks. Especially the worker tends to hang itself up every now and then, and if you don't have any protection against that, all the work you think is happening is actually not. You don't want to be in that position, believe me.

I hope this article helps you quick-start your setup. Feel free to contact me with suggestions or comments.


Tech Evangelist and Senior Developer at Ambient in Cologne, Germany.