Lambda Dead Letter Errors: When You Can’t Successfully Fail

Published in pageup-tech, May 11, 2017

In my team at PageUp we’ve recently started delving into serverless architectures (like AWS Lambda). These architectures help us solve problems of scale and introduce more event-driven behaviours into our system.

Fail safely — our focus

Naturally, a large focus of our implementation has been to ensure any errors occurring will cause the system to fail gracefully and pass the bad data away for later examination and diagnosis by a developer. Thankfully, a lot of our needs in this regard can be taken care of with AWS’s other integrated services, for example:

Dead Letter Queues to take in all messages that cannot be successfully handled by Lambda (now directly supported with SQS in Lambda configuration)
CloudWatch Alarms at every step in the process to notify of errors (e.g. failure to deliver an SNS message, or messages appearing in the Dead Letter Queue)
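As a rough illustration of the first item, here is a minimal sketch of attaching an SQS queue to a function as its Dead Letter Queue with boto3. We set ours up with Terraform, so this is not our actual configuration; the function name and queue ARN are placeholders, and the helper names are made up for this example. It also assumes the function's execution role is allowed to call sqs:SendMessage on the queue.

```python
def dlq_config_params(function_name, queue_arn):
    """Build the arguments for Lambda's UpdateFunctionConfiguration call.

    Both arguments are placeholders supplied by the caller.
    """
    return {
        "FunctionName": function_name,
        # DeadLetterConfig.TargetArn accepts an SQS queue or SNS topic ARN.
        "DeadLetterConfig": {"TargetArn": queue_arn},
    }


def attach_dead_letter_queue(function_name, queue_arn):
    """Apply the configuration (needs AWS credentials and boto3 installed)."""
    import boto3  # imported lazily so the builder above works offline

    client = boto3.client("lambda")
    return client.update_function_configuration(
        **dlq_config_params(function_name, queue_arn)
    )
```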

Armed with these and our own pre-existing alerts set up for our internal systems, we eagerly got everything set up and began testing our infrastructure for as many failure cases as we could.

Failing at failing - hitting a snag

While testing our Dead Letter Queue config, we found that none of the failed messages were appearing in the configured Dead Letter Queue. What could we have gotten wrong? We double-checked our configuration, both in the AWS Console and in the Terraform scripts that had set everything up, but could find no fault.

Cue some trawling through the documentation for Lambda, whereupon we eventually found this tidbit:

If for some reason, the event payload consistently fails to reach the target [Dead Letter Queue] ARN, Lambda increments a CloudWatch metric called DeadLetterErrors and then deletes the event payload. (http://docs.aws.amazon.com/lambda/latest/dg/dlq.html)

That’s to say, even writing to your Dead Letter Queue might fail. If this happens, the contents of the erroneous message are lost.

So what can you possibly do about it?

Well, for starters, you can set up an additional CloudWatch alarm against the DeadLetterErrors metric that can send you an email whenever this occurs, just like all your other failure cases.
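A minimal sketch of such an alarm using boto3 might look like the following. The alarm name, period and threshold are assumptions chosen for illustration, not the values we use, and the SNS topic ARN is a placeholder for a topic that has an email subscription.

```python
def dead_letter_alarm_params(function_name, sns_topic_arn):
    """Build the arguments for CloudWatch's PutMetricAlarm call.

    The alarm name, period and threshold below are illustrative choices.
    """
    return {
        "AlarmName": f"{function_name}-dead-letter-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "DeadLetterErrors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,  # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        # Fire as soon as a single dead letter error is recorded.
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points means nothing failed, so do not alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_dead_letter_alarm(function_name, sns_topic_arn):
    """Create the alarm (needs AWS credentials and boto3 installed)."""
    import boto3  # imported lazily so the builder above works offline

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        **dead_letter_alarm_params(function_name, sns_topic_arn)
    )
```

Treating missing data as not breaching keeps the alarm quiet during the long stretches when the metric has no data points at all.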

Additionally, if you log the contents of all messages that pass through your Lambda code, you may well be able to locate those in CloudWatch and match up the time of the Dead Letter Error against your logs in order to determine what message failed to process.
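One way to do that logging, sketched here as a Python handler, is to write every incoming event to the log together with the request ID before any processing happens. The process function is a hypothetical stand-in for real business logic, and the log format is just one possible choice.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def format_event(event, request_id):
    """Serialise the event and request ID as one searchable JSON line."""
    return json.dumps(
        {"request_id": request_id, "event": event},
        sort_keys=True,
        default=str,  # fall back to str() for non-JSON values
    )


def process(event):
    """Hypothetical business logic; rejects events without a payload."""
    if "payload" not in event:
        raise ValueError("event has no payload")
    return {"handled": event["payload"]}


def handler(event, context):
    # Log first, so the message survives in CloudWatch Logs even if
    # processing fails and delivery to the Dead Letter Queue fails too.
    request_id = getattr(context, "aws_request_id", None)
    logger.info("Received event: %s", format_event(event, request_id))
    return process(event)
```

With the timestamp of the DeadLetterErrors data point in hand, searching the function's log group for "Received event" around that time should surface the lost message.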

We had initially implemented the Dead Letter Queue configuration expecting guaranteed storage for every failed message. If that is not possible with SQS alone, it can at least be achieved with the two additions above.

Wisdom gained…

It could be argued we should have realised even Dead Letter Queues can fail, but I think it’s fair to say that it’s an understandable scenario to overlook, especially when it happens incredibly rarely. In fact, ever since our first setup, I’ve never seen it happen again!

At the end of the day, the moral of the story is to always set up alarms on DeadLetterErrors for your Lambdas.

Because even your failure handlers can fail sometimes.


Written by Chris Lewis