Three quick tips from two years with Celery

A few small tweaks might help you sleep through the night while running Celery task queues in production


Hi, I’m Taylor Hughes. I’m building LaunchKit, a set of tools to help mobile makers launch apps. I also built Cluster, an app for small groups. Once upon a time, I was the web frontend lead on the YouTube homepage.


LaunchKit and Cluster use Celery extensively (using Redis as a broker) to handle all sorts of out-of-band background tasks. We have sent millions of push notifications, generated and delivered an insane amount of email, backed up millions of photos and much more all using Celery tasks over the past few years.

As a result, I have been woken up in the middle of the night many times by various Celery-related issues — and there are a few essential configuration tips I’ve taken away from the experience.

Here are a few tips for sleeping through the night while running a Celery task queue in production:

1. Set a global task timeout

By default, tasks don’t time out. If a network connection inside your task hangs indefinitely, your queue will eventually back up and something about your service will mysteriously stop working.

So you should set a generous global default timeout for all tasks, and shorter, more specific timeouts on individual tasks that you know should finish quickly.

In a Django project, you can set a global timeout by adding this line to settings.py:

# Add a one-minute timeout to all Celery tasks.
CELERYD_TASK_SOFT_TIME_LIMIT = 60

… which you can override in specific tasks:

@celery_app.task(soft_time_limit=5)
def send_push_notification(device_token, message, data=None):
    notification_json = build_notification_json(message, data=data)
    ...

This will prevent unexpectedly never-ending tasks from clogging your queues.
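A soft limit only raises an exception (SoftTimeLimitExceeded) inside the running task, so a task that swallows it can still hang. Pairing it with a slightly longer hard limit is a common safeguard. The fragment below is a sketch, not our exact config: it assumes Celery 4 or later with settings loaded under the CELERY namespace, and the 90-second value is purely illustrative.

```python
# settings.py -- a sketch, assuming Celery 4+ loaded with
# app.config_from_object("django.conf:settings", namespace="CELERY").
# On Celery 3.x the equivalent names are CELERYD_TASK_SOFT_TIME_LIMIT
# and CELERYD_TASK_TIME_LIMIT.

# Soft limit: SoftTimeLimitExceeded is raised inside the task,
# which gives the task a chance to clean up before exiting.
CELERY_TASK_SOFT_TIME_LIMIT = 60

# Hard limit (illustrative value): if the task is still running at this
# point, the worker process executing it is killed and replaced.
CELERY_TASK_TIME_LIMIT = 90
```

Keeping the hard limit above the soft limit leaves a window for cleanup code to run before the process is terminated.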

Read more: Soft time limits in Celery

2. Use -Ofair for your preforking workers

By default, preforking Celery workers distribute tasks to their worker processes as soon as they are received, regardless of whether the process is currently busy with other tasks.

If you have a set of tasks that take varying amounts of time to complete — either deliberately or due to unpredictable network conditions, etc. — this will cause unexpected delays in total execution time for tasks in the queue.

To demonstrate, here’s an example: Let’s say you have 20 tasks, each of which calls some remote API, and each takes 1 second to finish.

You set up 4 workers to run through these 20 tasks:

celery worker -A ... -Q random-tasks --concurrency=4

This will take about 5 seconds to finish. 4 subprocesses, 5 tasks each.

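But suppose the first task (task 1 of 20) takes 10 seconds instead of 1. How long will the queue take to finish? Not 10 seconds: it takes 14.

That’s because the tasks get distributed evenly up front, so each subprocess gets 5 of the 20 tasks. The four 1-second tasks assigned to the same subprocess as the slow task have to wait behind it (10 + 4 = 14 seconds), even though the other three subprocesses sit idle after 5 seconds.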


The -Ofair option disables this behavior, waiting to distribute tasks until each worker process is actually available for work.

This option comes with a coordination cost, but results in much more predictable behavior if your tasks have varying execution times, as most IO-bound tasks will.
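The arithmetic in the example can be checked with a small simulation. This is a deliberately simplified model, not Celery's actual scheduler: eager_makespan and fair_makespan are hypothetical helpers, the first assigning tasks round-robin up front and the second handing each task to whichever process frees up first.

```python
import heapq


def eager_makespan(durations, workers):
    """Total time when tasks are dealt out round-robin before any work starts."""
    totals = [0.0] * workers
    for index, duration in enumerate(durations):
        totals[index % workers] += duration
    # The queue is finished when the most heavily loaded process finishes.
    return max(totals)


def fair_makespan(durations, workers):
    """Total time when each task goes to the process that becomes free first."""
    free_at = [0.0] * workers  # time at which each process is next available
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# One 10-second task followed by nineteen 1-second tasks, 4 processes.
durations = [10.0] + [1.0] * 19
print(eager_makespan(durations, 4))  # 14.0
print(fair_makespan(durations, 4))   # 10.0
```

In this model the fair strategy finishes as soon as the slow task does, because the other three processes absorb the remaining 19 tasks in the meantime. To enable the real thing, add -Ofair to the worker command shown earlier.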

Read more: Optimizing Celery, Gist demonstrating this behavior

3. Use exponential retry delays

Hey, you should really retry that task when it fails — I bet your third-party provider will be back up in a jiffy.

The best way to retry a task is soon, but not over and over if the service you’re depending on is currently down. So your first retry should be quick, but you should back off fast as failures persist.

The canonical way to do this is by exponentially increasing the delay between retry attempts. To do that, you need to find the number of times you’ve already retried and calculate the next countdown using that as an exponent on some base retry time.

Here’s an example:

@celery_app.task(max_retries=10)
def notify_gcm_device(device_token, message, data=None):
    notification_json = build_gcm_json(message, data=data)

    try:
        gcm.notify_device(device_token, json=notification_json)
    except ServiceTemporarilyDownError:
        # Find the number of attempts so far
        num_retries = notify_gcm_device.request.retries
        seconds_to_wait = 2.0 ** num_retries
        # First countdown will be 1.0, then 2.0, 4.0, etc.
        raise notify_gcm_device.retry(countdown=seconds_to_wait)

Note that you should set a sane max_retries, both on individual tasks and globally.
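It helps to see how quickly the delays grow for a given max_retries. The helper below is a sketch: backoff_delay, its cap parameter and the optional jitter are assumptions of mine layered on top of the plain 2.0 ** num_retries calculation, added so that the last few waits stay bounded and retries from many tasks don't all fire at the same instant.

```python
import random


def backoff_delay(num_retries, base=1.0, cap=300.0, jitter=False):
    """Seconds to wait before the next retry: base * 2^retries, capped at `cap`."""
    delay = min(cap, base * (2.0 ** num_retries))
    if jitter:
        # Optional "full jitter": pick a random wait between 0 and the delay.
        delay = random.uniform(0.0, delay)
    return delay


# With max_retries=10, retries are scheduled for attempts 0 through 9.
schedule = [backoff_delay(n) for n in range(10)]
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0]
print(sum(schedule))  # 811.0
```

Inside a task you would pass the result as countdown=backoff_delay(num_retries). With these illustrative values, ten retries span roughly 13.5 minutes in total. Celery 4.2 and later also ship built-in task options for this (autoretry_for together with retry_backoff), which are worth checking against the docs for your version.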

Read more: Exponential backoff, Stack Overflow question

4. Sleep soundly

That’s it. Celery has been fast and reliable for us, and these tweaks have made it even more so.

Beyond these simple lessons, our task queues have operated normally with very little fanfare — Redis in particular has been a powerful and reliable broker for the service.

I hope these tips save you some sleep while on pager duty!


Thanks for reading. Connect with me on Twitter @taylorhughes with any comments or thoughts.