Recently, our company started an effort to reduce unnecessary burn, so we trimmed our GKE (Google Kubernetes Engine) node resources down to a limit that still let production services run smoothly.
One day after the reduction, the food dispatch service started to fail at peak hour (6 pm to 9 pm GMT+6). The Food SRE team was notified a few minutes after the failures began. An initial investigation showed that RabbitMQ had gone down and could not recover automatically. We were running RabbitMQ in Kubernetes and suspected it couldn't claim enough resources at peak hour. As an immediate fix, we booted up another RabbitMQ StatefulSet using Helm and used it as the broker for our Celery-based dispatcher.
The next day, we faced the same issue around 9 pm: our dispatch RabbitMQ pod went down again. This time we took more time to investigate and found that RabbitMQ's CPU and memory usage had climbed until it couldn't claim any more resources from the node. When we tried to restart it, the pod failed to start because it couldn't recover all of its messages, so we had to drop the old messages and then restart the pod.
We were prepared for another fire drill (system down) the next day. As a precaution, we moved the dispatch RabbitMQ to a dedicated VM so it would have enough resources. Starting at 6 pm, we monitored CPU usage on the VM. At 7 pm we saw a spike, and within half an hour usage climbed to 90% (of 2 vCPUs).
To prevent another outage, we immediately booted up a new RabbitMQ in Kubernetes and pointed the producer at it for all new messages, so that in the meantime the consumer workers could process and clear out the old ones. Luckily, this time the node had enough resources. But the CPU and memory usage were still really bad:
dispatch-rabbit-rabbitmq-0 6037m 5179Mi
We kept asking ourselves: really? That much CPU? And that much RAM?
We also took a look at the workers' CPU and RAM usage, which was just as bad. It felt like a nightmare.
So we started to dig into the real issue. Looking at the VM's network I/O and RabbitMQ's incoming/outgoing message rates gave us our first clue: RabbitMQ was receiving far more messages than it should. Our initial thought was that the retry count on dispatch tasks might be causing the high message volume. We knew this wasn't conclusive, but we reduced the number of retries to see the impact. The impact was still the same.
Then we came across this warning about storing results in RabbitMQ: "Old results will not be cleaned automatically, so you must make sure to consume the results or else the number of queues will eventually go out of control."
In other words, we were storing Celery task results in RabbitMQ, and because we never cleaned them up, the result queues piled up and overwhelmed the broker. This StackOverflow discussion helped a lot: https://stackoverflow.com/questions/6362829/rabbitmq-on-ec2-consuming-tons-of-cpu
Another factor: Celery sends a lot of "unnecessary" messages due to its gossip, mingle, and events features.
1. As soon as we figured out our fault, we made a hotfix to the codebase: we removed the Celery result backend from our code.
Another way to solve this is to disable result storing with CELERY_IGNORE_RESULT = True in the Celery config file.
2. We disabled Celery gossip, mingle, and events by adding command-line arguments, and enabled the Celery worker autoscale feature.
Before:
celery -A consumer worker -l ERROR -c 50
After:
celery -A consumer worker -l ERROR --autoscale=50,5 --without-heartbeat --without-gossip --without-mingle
You can find more details in the Celery documentation.