How to successfully create a push queue using SQS + Lambda

juanjolainez · Published in CreditorWatch · 5 min read · Apr 6, 2021

Due to the nature of our company, at CreditorWatch we have to deal with a vast number of jobs and deferred tasks (from 300k to 3 million per day). The tasks we perform are very different, ranging from performing a data audit for a company to updating several data points in our DB or generating a credit score for our company records.

CreditorWatch is currently implementing micro-services that live in Docker containers, so we wanted a task system that allowed us to keep the codebase in one place and was inexpensive, easy to monitor, easily testable and replaceable if needed.

Ideally, the code wouldn’t need to know anything about the infrastructure the tasks come from.

This implementation of the last concept has been quite successful for us, since we are using the same principle to implement the listeners for queues and our event-driven architecture. If the events look the same and the task to execute is the same, we can re-use code that is already tested.

This means that our listeners would look the same whether we used SQS, RabbitMQ, Kafka or Kinesis Streams, which we already use extensively (you can read about it here: https://medium.com/swlh/aws-dynamodb-triggers-event-driven-architecture-61dea6336efb).

Also, we wanted to be able to send the same event to multiple listeners. For this, ActiveMQ or RabbitMQ would have helped, but it would also have meant maintaining our own queue servers and sizing them for our usage peaks, which means they might have been under-used during lower-load periods.

Our solution:

It’s for all those reasons that we came up with the idea of combining SQS and Lambda to create a push queue: a queue that pushes the content of the SQS message to a POST endpoint of any of our micro-services.

With this implementation, our code does not know (or need to know) the queue infrastructure behind it, which allows us to completely replace it without changing the code (we effectively use Kinesis Streams and SQS jobs seamlessly).

Basic push queue configuration
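To make the idea concrete, here is a minimal sketch (in Python) of what such a Lambda listener could look like. The endpoint URL is a made-up placeholder; a real handler would read it from configuration.

```python
import urllib.request

# Hypothetical micro-service endpoint; in practice this would come from configuration.
SERVICE_ENDPOINT = "https://example-service.internal/tasks"


def handler(event, context):
    """Forward every SQS record to the micro-service as a plain POST request."""
    for record in event["Records"]:
        request = urllib.request.Request(
            SERVICE_ENDPOINT,
            data=record["body"].encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # urlopen raises on 4xx/5xx responses; an unhandled exception makes
        # Lambda hand the message batch back to SQS so it can be retried.
        with urllib.request.urlopen(request) as response:
            response.read()
```

The micro-service only ever sees an HTTP POST, so swapping SQS for Kinesis (or anything else) only means changing how the records are read out of the event, not the service itself.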

As mentioned before, we might want to have a configuration similar to ActiveMQ or RabbitMQ, where 1 message goes to several subscribers. To do this, we’d use Kinesis streams by default (we wouldn’t use SQS), but we’ll illustrate how we’d use this approach if we wanted that functionality, as an exercise.

In this case, we can attach a Lambda function to ActiveMQ (explained here: https://aws.amazon.com/blogs/compute/invoking-aws-lambda-from-amazon-mq/) or use the setup shown below:

SQS with multiple listeners by using a Kinesis stream
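As a rough sketch of the fan-out step in that diagram, the Lambda attached to the SQS queue could simply republish each message into a Kinesis stream, and every subscriber then gets its own Lambda listener on that stream. The stream name below is a placeholder.

```python
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name, used purely for illustration.
FANOUT_STREAM = "task-fanout-stream"


def handler(event, context):
    """Republish each SQS message into a Kinesis stream so that several
    Lambda listeners (one per subscriber) can consume the same event."""
    for record in event["Records"]:
        kinesis.put_record(
            StreamName=FANOUT_STREAM,
            Data=record["body"].encode("utf-8"),
            PartitionKey=record["messageId"],
        )
```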

Another virtue of this design is that we don’t need monitoring and alerting different from what we already use for our micro-services. Since we are, in the end, monitoring an endpoint, we can use our existing monitoring and alerting tools. Of course, we’d need to tweak response-time alerts and others, but that happens with every endpoint as well.

In case of failure to execute the code, Lambda will keep retrying the failing message up to a point (configurable through the Maximum Receives parameter of the SQS queue). Once it reaches that point, the message goes to a Dead Letter Queue for further inspection and debugging. Having a built-in retry mechanism might seem like a clear advantage, but it also means that the Lambda event source will be blocked during the retries, so a failing job will prevent other perfectly fine jobs from processing. To catch this, it’s a good idea to keep an eye on the Approximate Age of Oldest Message metric (and probably add a CloudWatch Alarm on it) so you detect these situations as soon as possible.
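As a sketch, such an alarm could be created with boto3 roughly like this; the queue name, threshold and SNS topic are placeholders, not our actual configuration.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names/values for illustration only.
QUEUE_NAME = "my-push-queue"
ALERT_TOPIC_ARN = "arn:aws:sns:ap-southeast-2:123456789012:queue-alerts"

cloudwatch.put_metric_alarm(
    AlarmName=f"{QUEUE_NAME}-oldest-message-age",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
    Statistic="Maximum",
    Period=300,                      # evaluate over 5-minute windows
    EvaluationPeriods=3,
    Threshold=900,                   # alert if the oldest message is over 15 minutes old
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC_ARN],  # e.g. notify an SNS topic
)
```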

With a pull queue, you can use supervisord to control concurrency, i.e. how many listeners are processing jobs from the queue at the same time. With Lambda, this is just as seamless: you control it with Reserved and Provisioned concurrency (https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html). Just tell Lambda how many processes you want running at the same time, and you are done, as shown in the sketch below.
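For example, capping a listener at a fixed number of concurrent executions is a single API call; the function name and limit below are illustrative.

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder function name; the limit caps how many copies run in parallel.
lambda_client.put_function_concurrency(
    FunctionName="push-queue-listener",
    ReservedConcurrentExecutions=25,
)
```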

So, what happens if you have more messages in the queue than you can process (due to your concurrency limitations)? SQS will try to trigger the Lambda functions, but, since you are at your limit, the invocation will fail and the messages will go back to SQS with their receive attempts incremented. Be careful when setting the Maximum Receives parameter in your SQS queue, because in this case it doesn’t mean “how many times the message processing has been retried”, but “how many times the message has failed to successfully invoke and run a Lambda function”. We learnt this lesson early on, luckily before going to production, but it can be a big headache if you don’t.
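In practice this means giving the maximum receive count some headroom when configuring the Dead Letter Queue. A sketch of that configuration with boto3, where the queue URL and DLQ ARN are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Placeholder identifiers for illustration.
QUEUE_URL = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/my-push-queue"
DLQ_ARN = "arn:aws:sqs:ap-southeast-2:123456789012:my-push-queue-dlq"

sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": DLQ_ARN,
            # Generous enough that throttled (not just failed) deliveries
            # don't push healthy messages into the DLQ prematurely.
            "maxReceiveCount": "10",
        })
    },
)
```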

In other bad news, Lambda has a 15-minute execution time limit, so if your processes run for longer than that, this approach is not for you. So far this seems to be an AWS hard limit, but, who knows, it may change in the future.

Pros

  • Seamless scalability
  • No maintenance needed
  • Retries already implemented
  • Easily monitored using existing APM tools
  • Easy to create alerts in AWS

Cons

  • Cost
  • Retries can block processes
  • 15 minutes process hard limit
  • Lambda throttling can be tricky

With this approach, we have processed hundreds of millions of jobs in a very robust, scalable and monitored way, with very little setup effort and maintenance.

On top of this approach, we even have Lambda processes that change the number of concurrent processes of other Lambda listeners, so we can process more jobs after hours than during business hours, but that’s probably another story for another article, so stay tuned!

juanjolainez · CreditorWatch · Software Architect, tech enthusiast, probably average writer