Lessons learnt from using SQS and Lambda.

Ravi
3 min read · Dec 18, 2021


Concurrency, Throttling, DLQ and best practices.


Tl;dr:

1. Set reserved concurrency to at least 5.
2. Set the queue visibility timeout to at least 6 times the Lambda function timeout plus MaximumBatchingWindowInSeconds.
3. Set the maximum receive count to at least 5.
4. Set MaximumBatchingWindowInSeconds to at least 1 second.
5. Use SQS dead-letter queue redrive to source queues.
6. Increase the Lambda function's memory.

In a recent work project, I used SQS with Lambda triggers. While the solution provided an out-of-the-box serverless, scalable, and highly fault-tolerant architecture, it also introduced an interesting problem: the Lambda function was getting throttled and messages were ending up in the dead-letter queue.

What happens when the queue has more messages than the Lambda function can process? Shouldn't Lambda scale up to handle this? What happens if a third-party API that your Lambda function calls has its own throttle? How do you handle that without losing messages in the queue?

Basic Application Architecture

In the basic application architecture above, a user action triggers the application to send one or more SQS messages. The queue has a Lambda trigger configured. Lambda can scale up with incoming requests, so one would imagine this is a solid, fault-tolerant, scalable solution. Why, then, did this architecture end up with messages in the dead-letter queue?

Understanding how the Lambda-SQS trigger works!

Lambda Long Polls SQS

When an SQS trigger is added to a Lambda function, Lambda begins long polling SQS with 5 parallel connections. Each connection passes the messages it picks up from SQS to a function invocation. If there are more messages to be processed, Lambda adds up to 60 connections per minute, to a maximum of 1,000. The Lambda poller also scales connections down as the backlog shrinks. This behavior is managed by AWS and cannot be changed.
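For context, the trigger itself is just an event source mapping between the queue and the function. Here is a minimal boto3 sketch of how one might be created; the queue ARN, function name, and parameter values are hypothetical placeholders, and the 5 parallel pollers described above are managed by AWS regardless of what you set here:

```python
import boto3

lambda_client = boto3.client("lambda")

# Wire an SQS queue to a Lambda function. AWS manages the pollers;
# you only control the batch size and batching window.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders-queue",
    FunctionName="process-order",
    BatchSize=10,                      # messages handed to each invocation
    MaximumBatchingWindowInSeconds=1,  # wait up to 1s to fill a batch
)
```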

The Lambda poller scaling its connections up and down is not directly correlated with your function's concurrency. This disconnect leads to the issue where you experience throttling and messages end up in the DLQ.

In my work project, I had set the reserved concurrency to 5 and the batch size to 1. When we received a burst of requests (1,000 messages), the architecture could only process 5 batches at a time (because of the concurrency limit), and each batch carried a single message. This severely limited throughput. On top of that, the poller kept handing messages to the function, but the concurrency limit throttled the invocations, so the messages returned to the queue with their receive count incremented. This repeated until the maximum receive count was reached, and most messages landed in the DLQ before a function instance ever processed them.
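A quick back-of-envelope calculation shows why the burst overwhelmed this setup. The concurrency and batch size are the real values from the project; the average invocation duration below is an assumption for illustration:

```python
# Rough numbers for the burst scenario described above.
burst_size = 1000       # messages arriving at once
concurrency = 5         # reserved concurrency on the function
batch_size = 1          # messages handed to each invocation
avg_duration_s = 2.0    # assumed average invocation duration

messages_in_flight = concurrency * batch_size            # only 5 at a time
drain_time_s = (burst_size / messages_in_flight) * avg_duration_s
print(f"Time to drain the burst: {drain_time_s:.0f}s")   # ~400s

# While the backlog drains, the poller keeps pulling messages and invoking
# the throttled function. Every throttled attempt returns the message to
# the queue with its receive count incremented, so a message can hit the
# maximum receive count and land in the DLQ before a worker is ever free.
```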

How do we solve it:

1. Reserved concurrency: AWS recommends setting this value greater than 5 so the function is not throttled and can handle more messages concurrently (see the configuration sketch after this list).
2. Queue visibility timeout: AWS recommends setting this value to at least 6 times the Lambda function timeout, plus the value of MaximumBatchingWindowInSeconds.
3. Maximum receive count: Set this to at least 5 so messages do not end up in the DLQ because of throttling alone. This gives each message enough retries to be processed by a function invocation.
4. Use the newly announced SQS dead-letter queue redrive to source queues. This lets you manage the lifecycle of unconsumed messages in the DLQ (see the redrive sketch below).
5. MaximumBatchingWindowInSeconds: Set this to at least 1 second so Lambda builds larger batches instead of starting an invocation with fewer messages than the batch size.
6. Increase the memory of the Lambda function so it can process messages faster; note that increasing memory also increases vCPU. If the function finishes sooner, throughput goes up.
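Putting recommendations 1, 2, 3, 5, and 6 together, here is a minimal boto3 sketch (the batching window from recommendation 5 is set on the event source mapping shown earlier). The function name, queue name, timeout, memory size, and ARN are all hypothetical placeholders:

```python
import json

import boto3

lambda_client = boto3.client("lambda")
sqs = boto3.client("sqs")

function_timeout_s = 30   # assumed function timeout
batching_window_s = 1     # recommendation 5, set on the event source mapping

# Recommendations 1 and 6: reserved concurrency and memory.
lambda_client.put_function_concurrency(
    FunctionName="process-order",
    ReservedConcurrentExecutions=50,  # comfortably above the minimum of 5
)
lambda_client.update_function_configuration(
    FunctionName="process-order",
    MemorySize=1024,                  # more memory also allocates more vCPU
    Timeout=function_timeout_s,
)

# Recommendations 2 and 3: visibility timeout and redrive policy on the
# source queue. 6 x function timeout + batching window = 181 seconds here.
queue_url = sqs.get_queue_url(QueueName="orders-queue")["QueueUrl"]
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "VisibilityTimeout": str(6 * function_timeout_s + batching_window_s),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
            "maxReceiveCount": "5",
        }),
    },
)
```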
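And for recommendation 4: the redrive-to-source feature was console-only when it was announced, but boto3 has since added a StartMessageMoveTask API for the same workflow. A sketch with a hypothetical DLQ ARN:

```python
import boto3

sqs = boto3.client("sqs")

# Kick off a redrive of messages in the DLQ back to their source queue.
sqs.start_message_move_task(
    SourceArn="arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    # DestinationArn omitted: messages return to their original source queue.
    MaxNumberOfMessagesPerSecond=10,  # optional cap so the redrive itself
                                      # does not re-trigger throttling
)
```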
