What’s wrong with SQS and why we gave it up

Michal Haták
Aug 18, 2017

Last week, I tweeted that we gave up using SQS for our Celery workers because we determined that SQS is one of the worst options out there. And I promised a blog post in the comments, so here we are :)

Before we began testing SQS, this was our stack at AUTOPROP: we ran completely on AWS (except for a few services, such as user management). In front, two web servers behind an ELB shot new tasks to a bunch of workers, all managed by Celery. As the queue we used Redis (we had been using RabbitMQ, but we ran into issues with Celery's chords). We usually have two workers running all the time, and during the day we spin up spot instances to add capacity. We treat tasks a bit differently because our users expect results ASAP: most tasks finish in under a second, but some can run for 3–4 minutes, depending on the actual task itself (some rendering, or a geo query against our MongoDB replica).
Results from the workers are sent back to the browser through Websockets, which has worked fine for us.
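For readers who haven't met chords: a chord fans a group of tasks out to the workers and fires a callback once all of them have finished. A rough sketch of the pattern, with made-up task names rather than our real code:

```python
from celery import Celery, chord

app = Celery('tasks', broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def render_section(section_id):
    # Stand-in for one of our short tasks (rendering, geo queries, ...).
    return {'section': section_id}

@app.task
def assemble_report(sections):
    # The chord callback: runs once every render_section call has finished.
    return {'section_count': len(sections)}

# Fan out ten sections, then combine their results in a single callback.
result = chord(render_section.s(i) for i in range(10))(assemble_report.s())
print(result.get())
```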

This architecture was designed a few years ago, so we felt it was time for a general review to see if we could improve efficiency, reliability, and AWS spend. Two weeks ago we started revisiting our servers, and eventually we got to analyzing the queue. The main idea behind switching to a new queue — besides reduced cost — was that we feel it's better to use specialized services than to deploy a server and manage the service ourselves (as long as the specialized service works!). And because we run in AWS, we decided to try SQS. A bit of foreshadowing: we had some bad experiences with SQS in the past, but that was more than two years ago, and many other AWS services have improved since then.

So, as always, we started with some research into SQS and, unfortunately, there were a couple red flags right from the start.

Not a backend?

http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html#polling-interval

Problem. This means you cannot fetch results back from the worker. As mentioned at the beginning, we send results back through WebSockets, but we found two or three cases where we need to get results back through the backend. We could have kept Redis (or RabbitMQ) as the backend, but then we would have missed the point of migrating. So what are the options? AWS S3. I know, it sounds so, so weird, but given that we need a result fewer than 20 times per day, why not (and in the future we will rewrite or change those tasks anyway). So we found a library, fixed it, and it worked — yay!
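For a sense of what that combination looks like, here is a minimal sketch of a Celery app with SQS as the broker and S3 as the result backend. The bucket and region are made up, and the setting names follow the S3 backend that later Celery releases ship; at the time we were patching a third-party library to do the same job.

```python
from celery import Celery

app = Celery('tasks')
app.conf.update(
    # SQS as the broker; credentials are picked up from the environment
    # or the instance's IAM role, so nothing sensitive goes in the URL.
    broker_url='sqs://',
    broker_transport_options={'region': 'us-west-2'},
    # S3 as the result backend, since SQS itself cannot store results.
    result_backend='s3',
    s3_bucket='example-celery-results',  # hypothetical bucket name
    s3_base_path='/results/',
    s3_region='us-west-2',
)
```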

Polling

http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html#polling-interval

Like, really? In the end it means your queue will be noticeably slower: every second of polling interval is another second of latency before a worker even sees a task. But we said we could live with this because of other trade-offs, such as monitoring (I will get to that later). (And I hope you can see that we really did try to make SQS work for us…)
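To make the trade-off concrete: with the SQS transport, each worker sleeps between polls, and that sleep is what you tune through the broker transport options described in the Celery docs linked above. The values here are just examples:

```python
app.conf.broker_transport_options = {
    'region': 'us-west-2',
    # Seconds the transport sleeps between polls of the queue.
    # Lower means less added latency, but more (billed) SQS requests.
    'polling_interval': 1,
    # A received message stays hidden from other workers for this long,
    # so it must comfortably exceed our slowest tasks (3-4 minutes).
    'visibility_timeout': 600,
}
```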

CloudWatch

So, after we set up the queue, and after a few other 'little struggles', we were ready to go. We moved right along to setting up alarms and putting metrics into our CloudWatch dashboard. After a few minutes and a quick look into the AWS docs, I found that you cannot enable detailed monitoring for an SQS queue, only the regular kind, which means metrics arrive just every 5 minutes. This alone is unacceptable for us. I was so sure that SQS would have this, and it didn't… my whole team thought this feature would be a sure thing. On top of that, you cannot use Celery events (for custom monitoring) or remote control commands.
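You can see the limitation for yourself by pulling the queue depth with boto3; the queue name below is hypothetical, and at the time SQS only published one datapoint per five minutes, so asking for anything finer simply came back empty.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'example-celery-queue'}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,              # 5-minute buckets: the best SQS offered then
    Statistics=['Maximum'],
)

for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Maximum'])
```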

It somehow gets worse

After a few hours, the workers stopped working. And, as it turned out, it was not primarily the fault of SQS, nor of the tooling, but…

What happened was that SQS somehow 'messed up' a response, and kombu (the messaging library Celery uses, which talks to SQS through boto3, the official AWS Python SDK) could not recover the connection, which left the workers dead. You can search for '599' on github.com/celery/celery, or check out this issue directly: https://github.com/celery/celery/issues/3990.

It was at this point that we killed any dreams of using SQS and immediately switched back to Redis. SQS simply came with too many big disadvantages. In my opinion, this makes SQS one of the worst options for Celery. I personally think it's better to spend money on a RabbitMQ-as-a-service subscription, or just spend the time to build and run it yourself.

The positive

There is one positive thing about our journey into SQS: we learnt a lot more about how it works 'under the hood', and we also cleaned up a part of our codebase that needed some dusting.

Thanks for reading. If anyone has had a different experience, please let me know. I will gladly hear what was wrong with our approach, and I will also try to answer questions in the comments — eventually.