Solving RabbitMQ High CPU/Memory Usage Problem With Celery

Md. Al-Amin
Jun 30

Recently, our company started cutting unnecessary burn, so we reduced our GKE (Google Kubernetes Engine) node resources to a level that still let production services run smoothly.

One day after the reduction, the food dispatch service started to fail at peak hours (6 pm to 9 pm GMT+6:00). The Food SRE team was notified a few minutes later that the service was failing. After an initial investigation, we found that RabbitMQ had gone down and couldn't recover automatically. We were running RabbitMQ in Kubernetes and thought that maybe it couldn't claim enough resources at peak hours. As an immediate measure, we booted up another RabbitMQ StatefulSet using Helm and used it as the broker for our Celery-based dispatcher.

The next day, we faced the same issue around 9 pm: our dispatch RabbitMQ pod went down again. This time we took a bit more time to investigate and figured out that RabbitMQ's CPU and memory usage had climbed until it couldn't claim any more resources from the node. When we tried to restart it, the pod failed to start because it couldn't recover all of its messages, so we had to drop the old messages and then restart the pod.

We were prepared for another fire drill (system down) the next day. As a precaution, we moved our dispatch RabbitMQ to a dedicated VM so that it would have enough resources. Starting from 6 pm, we monitored CPU usage on the VM. At 7 pm we saw a spike, and within half an hour CPU usage went up to 90% (of 2 vCPUs).

CPU usage of the RabbitMQ VM

To prevent a disaster, we immediately booted up yet another RabbitMQ in Kubernetes and pointed the producer to push new messages to it, so that in the meantime the consumer workers could process and clear out the old messages. Luckily, this time the node had enough resources. But the CPU and memory usage was still really bad:

dispatch-rabbit-rabbitmq-0                                             6037m        5179Mi

We were asking ourselves: really? That much CPU? And that much RAM?

We also took a look at the workers' CPU and RAM usage, which was also bad. Super bad. It felt like a nightmare:

worker-5d47c97b96-8bm9g                                     571m         1428Mi
worker-5d47c97b96-gtjd4                                     362m         1434Mi
worker-5d47c97b96-hh8l8                                     511m         1425Mi
worker-5d47c97b96-kqj66                                     616m         1429Mi
worker-5d47c97b96-z2mhr                                     424m         1426Mi

So we started to dig into the real issue. By looking at the VM's network I/O and RabbitMQ's incoming/outgoing message rates, we found our first clue: RabbitMQ seemed to be receiving more messages than it should. Our initial thought was that the number of retries of the dispatch tasks might be causing this high message volume. We knew this wasn't conclusive, but we reduced the retry count to see the impact. The impact was still the same.
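For reference, the retry cap is just a parameter on the task definition, so lowering it was a one-line change. The sketch below is purely illustrative; the task, the helper it calls, and the numbers are hypothetical, not the real dispatcher code:

from celery import Celery

app = Celery("consumer", broker="amqp://guest:guest@rabbitmq:5672//")

def assign_rider(order_id):
    # Hypothetical stand-in for the real dispatch logic that can fail at peak hour.
    ...

@app.task(bind=True, max_retries=3, default_retry_delay=10)  # cap lowered from a larger value
def dispatch_order(self, order_id):
    try:
        assign_rider(order_id)
    except Exception as exc:
        # Every retry re-publishes the task, i.e. one more message through RabbitMQ.
        raise self.retry(exc=exc)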

After reading some articles online, we figured out the actual issue. Here is the relevant piece of the Celery documentation:

Old results will not be cleaned automatically, so you must make sure to consume the results or else the number of queues will eventually go out of control

In other words, we were storing Celery task results in RabbitMQ, and because we never consumed or cleaned them up, those results piled up and messed up RabbitMQ. This Stack Overflow discussion helped a lot: https://stackoverflow.com/questions/6362829/rabbitmq-on-ec2-consuming-tons-of-cpu
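To make the failure mode concrete, here is a minimal sketch of the anti-pattern. The app name, task, and URLs are placeholders, and the backend shown is the old (now-deprecated) AMQP result backend that the quoted warning refers to:

from celery import Celery

# RabbitMQ used as both the broker *and* the result backend.
app = Celery(
    "consumer",
    broker="amqp://guest:guest@rabbitmq:5672//",
    backend="amqp://guest:guest@rabbitmq:5672//",
)

@app.task
def dispatch_order(order_id):
    return {"order_id": order_id, "status": "assigned"}

# The producer fires tasks, but nothing ever consumes the results...
dispatch_order.delay(42)
# ...so every return value is published back to RabbitMQ and lingers there
# (with this backend, as a per-task result queue) instead of being cleaned up.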

Another factor was that Celery workers send a lot of “unnecessary” messages of their own, due to gossip, mingle, and events.

Solution

1. Once we figured out our fault, we immediately made a hotfix to our codebase: we removed the Celery result backend from our code (a rough sketch of the change follows below).

Another way to solve this is to disable storing results altogether with CELERY_IGNORE_RESULT = True in the Celery config file.
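Here is a minimal sketch of both variants of the fix, using the same kind of hypothetical app as the earlier sketches (the module name and broker URL are placeholders, not our actual code):

from celery import Celery

app = Celery(
    "consumer",
    broker="amqp://guest:guest@rabbitmq:5672//",
    # backend="amqp://..."  removed: with no result backend configured,
    # Celery publishes no result messages and creates no result queues.
)

# Alternative: keep a result backend but never store task results.
app.conf.task_ignore_result = True  # new-style name for CELERY_IGNORE_RESULT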

2. We disabled Celery gossip, mingle, and event heartbeats by adding command-line arguments, and also enabled the Celery worker autoscale feature.

Before:

celery -A consumer worker -l ERROR -c 50

After:

celery -A consumer worker -l ERROR --autoscale=50,5 --without-heartbeat --without-gossip --without-mingle
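A quick note on these flags: --autoscale=50,5 lets the worker pool grow to at most 50 processes under load and shrink back to 5 when idle (instead of always running 50 processes with -c 50), while --without-gossip, --without-mingle, and --without-heartbeat stop the workers from subscribing to each other's events, synchronizing with each other at start-up, and sending event heartbeats, which removes a steady stream of worker-to-worker messages flowing through RabbitMQ that our dispatcher never needed.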

You can find more details on these options in the Celery documentation.

Result

Last 12 hours — RabbitMQ (VM) CPU
Workers' CPU and memory usage
