Celery in Production: How to be ready and quickly resolve issues

Published in The Andela Way, Oct 15, 2019

The original photo by Daniel Cheung on Unsplash

Celery is a simple, flexible, and reliable distributed system for processing vast amounts of messages, while providing operations with the tools required to maintain such a system. In fact, Celery has become the de facto standard for this kind of task in the Python ecosystem.

Many large companies rely on it for tasks ranging from sending emails to processing large files or generating image thumbnails of different sizes in the background. Instagram, Mozilla, Udemy, and Bitbucket are some of them. You can check more here.

What this means is that it's a reliable and battle-tested technology. Awesome, right? But here comes the problem: Celery behaves very differently in production than it does in development. Things go wrong: you discover that you misconfigured it, you hit scalability issues, tasks get executed more than once, and so on. None of these issues are addressed in the getting-started tutorials. Things get really dirty.

I don't want to sound too pessimistic. In fact, I find it a learning process, and yes, you will learn to cope by experiencing production issues and fixing them. But this gets really hard if you don't have a good control plane and the right tools to help you first know that there is an issue, then identify it, fix it, and make sure it won't happen again.

So, what is this post about?

This post is about how to be ready when bad things happen in your production environment, and about the Celery setup that will help you during such incidents. Although the things mentioned here might sound intuitive to some, especially experienced engineers, they can serve as a set of best practices or a checklist for a healthy Celery production setup.

This post is not about a specific issue. There are lots of articles from big and small companies about production issues they ran into and how they resolved them, and I might write about the issues I experienced in other posts.

Enough introduction, let's go.

#1 Choose your message broker wisely

Celery supports several message brokers. RabbitMQ is the default one, but you might want to use Redis or Amazon SQS. My point is that you should know and decide which one to use before starting development. They all do the job, but they differ, especially in their configuration. I learned this the hard way: I joined a project that was using Redis, and we had a critical task (payment) that was expected to run once, but we found out that it would run more than 20 times! Thousands of dollars would have been lost because of what we later discovered to be a misconfiguration of the Redis visibility timeout. With the Redis transport, a task that is not acknowledged within the visibility timeout is redelivered to another worker and executed again, which typically bites tasks scheduled with an ETA or countdown longer than that timeout.

Takeaway:

First, check the brokers Celery supports here. Once you decide which one to go with, read carefully through its configuration options.

#2 Monitoring

Nothing helps you during an incident like a monitoring tool. Even when nothing is going wrong, the ability to track and monitor your tasks, queues, and workers in real time is valuable. Celery comes with a few options you should familiarize yourself with, because they will be your Swiss Army knife.

Here are the fundamental ones:

Flower: this is the most important one. It's a real-time web dashboard in which you can monitor task progress and history, see graphs and statistics, and remotely control workers, queues, and the task lifecycle.
Broker client: depending on the broker, you can use its own client, for example redis-cli or rabbitmqctl. These can sometimes be helpful for inspecting the queues and seeing what's inside them.
Management command-line utilities: these are celery commands for inspecting and controlling the workers. For example, you can list the tasks that are currently being executed with the command `celery -A proj inspect active`

You can check everything about Celery monitoring here.
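The same information that `celery -A proj inspect active` prints is also available from Python through `app.control.inspect()`. The sketch below assumes you pass in your own Celery application as `app`; the helper names, the sample worker names, and the task names are made up for illustration. It counts running tasks per task name, which is handy for a quick health check script.

```python
from collections import Counter


def fetch_active(app, timeout=2.0):
    """Ask live workers which tasks they are executing right now.

    `app` is your Celery application instance. The result maps worker
    names to lists of task dicts, or is None when no worker replies.
    """
    return app.control.inspect(timeout=timeout).active()


def summarize_active(active):
    """Count running tasks per task name across all workers."""
    counts = Counter()
    for tasks in (active or {}).values():
        for task in tasks:
            counts[task["name"]] += 1
    return dict(counts)


# Hand-written sample in the shape of an inspect().active() reply,
# trimmed to the fields used here.
sample = {
    "celery@worker1": [
        {"id": "a1", "name": "proj.tasks.send_email"},
        {"id": "a2", "name": "proj.tasks.make_thumbnail"},
    ],
    "celery@worker2": [{"id": "b1", "name": "proj.tasks.send_email"}],
}

print(summarize_active(sample))
```

With a real application you would call `summarize_active(fetch_active(app))`. An empty result can mean either that nothing is running or that no worker answered within the timeout, so treat it as a hint rather than proof.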

#3 Logging

Most of the time, your logs will be the source of truth to go back to when something unexpected happens, which is why logging is a key feature in modern software applications. So make sure you set up Celery's logging configuration properly. Also, make sure enough information is logged when something happens in your code. For example, if you expect an exception, make sure you log the details of the item being processed, like its `ID`, to make debugging and troubleshooting easier.
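Here is a minimal sketch of what logging the item details can look like. It uses only the standard `logging` module so that it runs anywhere; inside a Celery task you would typically get the logger from `celery.utils.log.get_task_logger(__name__)` instead. The functions `charge` and `process_order` and their fields are hypothetical and exist only for this example.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("proj.tasks")


def charge(order_id, amount):
    """Hypothetical business function that may fail."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"order_id": order_id, "charged": amount}


def process_order(order_id, amount):
    """Run the charge and always log which order was involved."""
    try:
        result = charge(order_id, amount)
    except ValueError:
        # logger.exception records the traceback; the ID makes the
        # failing item easy to find later.
        logger.exception("charge failed order_id=%s amount=%s", order_id, amount)
        return None
    logger.info("charge ok order_id=%s amount=%s", order_id, amount)
    return result


process_order(1001, 25)   # logs an INFO line with the order ID
process_order(1002, 0)    # logs an ERROR line with traceback and order ID
```

Whether to swallow the exception, as done here, or re-raise it so the task is marked as failed depends on your retry strategy. Either way, the log line should carry the ID.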

#4 Real-time application error monitoring

“Software errors are inevitable. Chaos is not.” sentry.io

Errors in production will always happen, but you can discover them early and resolve them quickly before they become monsters. So you need a system that monitors your application and instantly alerts your team via channels like email or Slack. Not only that, it can also record who is responsible for an issue and what its priority is, and give you full context about it.

Nowadays, Sentry is the de facto standard for real-time application error monitoring. Companies like Uber, Microsoft, PayPal, and more are leveraging it.
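Wiring Sentry into a Celery project is usually a few lines. The sketch below assumes the `sentry-sdk` package is installed and that the DSN is provided through a `SENTRY_DSN` environment variable; the helper name `init_sentry` is made up for this example. `sentry_sdk.init` and `CeleryIntegration` come from the Sentry Python SDK.

```python
import os


def init_sentry(environ=os.environ):
    """Initialise Sentry with its Celery integration if a DSN is configured.

    Returns True when Sentry was initialised, False otherwise.
    """
    dsn = environ.get("SENTRY_DSN")
    if not dsn:
        # Nothing configured, for example in local development.
        return False

    # Imported lazily so the module still loads where sentry-sdk is absent.
    import sentry_sdk
    from sentry_sdk.integrations.celery import CeleryIntegration

    sentry_sdk.init(dsn=dsn, integrations=[CeleryIntegration()])
    return True
```

Call it once at startup in every process that loads your Celery app, both the code that sends tasks and the workers. Sentry's own documentation describes the recommended place to do this for worker processes, so check it for your version.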

Summary

I tried to list the things I've learned down the road while using Celery. There might be things I missed or should have mentioned, or something you disagree with. I'd like you to share them in the comments section so that we can all learn and grow.