Notes from processing 480 Million SQS messages per week


Setup

Producer: writes to the queue from a Hadoop cluster of 8 m3.large EC2 instances. Consumer: 16 processes across 6 m3.large EC2 instances. SQS is in the same AWS region as the Hadoop cluster.

  • Scaling SQS writes
    1. Use multiple connections and parallel writers/producers.
    2. Write in batches; SendMessageBatch accepts up to 10 messages per call (see the producer sketch after this list).
    3. If you can logically pack several records into a single message under the 256KB limit, do that.
    4. Compress messages if your content offers a good compression ratio. Per message the saving is insignificant; across 480 million messages it's a different story.
    5. Write latency varies anywhere between 70ms and 210ms, given you are in the same AWS region.
  • SQS Reads
    1. Read in batches; ReceiveMessage returns up to 10 messages per call.
    2. Receive latency varies from 180ms to 4 minutes, where 4 minutes is a boundary condition; average latency is 340ms. Yes, SQS reads are slow if you are coming from the ZeroMQ/RabbitMQ world.
    3. If your consumer does network IO or CPU-intensive work after reading a message, offload it to a secondary process/thread pool with an IPC setup, and factor that processing time into your total consumer throughput (see the consumer sketch after this list).
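
A minimal producer sketch tying the write points together, assuming boto3 as the client library and a hypothetical queue URL (neither is named above): records are packed together, gzip-compressed, and written with SendMessageBatch.

```python
import base64
import gzip
import json

import boto3

# Hypothetical queue URL; substitute your own.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"

sqs = boto3.client("sqs")

def pack(records):
    """Pack several logical records into one message body (stay under 256KB).
    SQS bodies must be text, so the gzipped bytes are base64-encoded."""
    return base64.b64encode(gzip.compress(json.dumps(records).encode())).decode()

def send_batch(bodies):
    """SendMessageBatch accepts at most 10 entries per call."""
    entries = [{"Id": str(i), "MessageBody": b} for i, b in enumerate(bodies)]
    resp = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    # Batch calls can partially fail; return the failed bodies for a resend.
    return [bodies[int(f["Id"])] for f in resp.get("Failed", [])]
```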
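
And a minimal consumer sketch under the same assumptions: long-poll in batches of 10, hand the payloads to a worker process pool (the secondary process pool from point 3 above), and delete each batch only after processing succeeds so failures get redelivered.

```python
import multiprocessing

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # hypothetical

def process_body(body):
    # Network IO / CPU-heavy work belongs here, in worker processes,
    # not in the read loop.
    pass

def consume():
    sqs = boto3.client("sqs")
    with multiprocessing.Pool(processes=4) as pool:
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,  # read in batches (the SQS max per call)
                WaitTimeSeconds=20,      # long polling cuts down empty receives
            )
            messages = resp.get("Messages", [])
            if not messages:
                continue
            pool.map(process_body, [m["Body"] for m in messages])
            # Delete only after processing, so failed messages reappear
            # once the visibility timeout expires.
            sqs.delete_message_batch(
                QueueUrl=QUEUE_URL,
                Entries=[
                    {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                    for i, m in enumerate(messages)
                ],
            )
```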

Tips

  • Every API call will fail at some point, be it reading from SQS, deleting messages from the queue, a connection failure, etc. The occurrence varies from 1 in 100k to 1 in a million, with no pattern to it. Expect everything to fail and retry (see the retry sketch after this list).
  • Things will fail and land in the dead letter queue. Have a transfer script in place that re-inserts messages from the dead letter queue into the main queue (see the redrive sketch below).
  • Having a timestamp inside every message is helpful for debugging.
  • Always set your VisibilityTimeout keeping in mind the rate at which your consumers can work through the queue; if processing takes longer than the timeout, messages are redelivered while still in flight.
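
A retry helper along the lines of the first tip; this is not from the original notes, just one common pattern for wrapping any SQS call in exponential backoff with jitter (boto3 also retries some errors internally, so treat this as a last line of defence):

```python
import random
import time

def with_retries(call, attempts=5):
    """Retry a callable with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up; the message will surface in the DLQ
            time.sleep(2 ** attempt + random.random())

# Usage, e.g. around a delete:
# with_retries(lambda: sqs.delete_message(QueueUrl=url, ReceiptHandle=rh))
```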
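
And a sketch of the dead-letter-queue transfer script from the second tip, again assuming boto3 and hypothetical queue URLs: drain the DLQ and re-insert each message into the main queue, deleting from the DLQ only after the send has succeeded so nothing is lost.

```python
import boto3

MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # hypothetical
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events-dlq"     # hypothetical

def redrive():
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # DLQ drained
        for msg in messages:
            # Send first, delete second: the worst case is a duplicate,
            # never a lost message.
            sqs.send_message(QueueUrl=MAIN_QUEUE_URL, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```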
