
Event Failures (and Retries) with AWS Serverless Messaging Services

Dhaval Nagar
Published in AppGambit
9 min read · Dec 24, 2020


This is the 3rd post in the Serverless messaging series.

[POST #1] Select the right Event-Routing service from Amazon EventBridge, Amazon SNS, and Amazon SQS

[POST #2] PAY per USE can derail your Serverless (dream) budget

[POST #3] Event Failures (and Retries) with AWS Serverless Messaging Services

Everything fails, all the time — Dr. Werner Vogels, CTO, Amazon.com

Loose-coupling comes with a price

Serverless architecture gives you the flexibility of independently isolated micro-services that can run, scale, and sustain on their own. However, this can make the overall environment hard to manage (and observe) when thousands of events are flowing through it.

Let's take this example flow from the previous post.

A single event will pass through 4 communication services, processed by 4 Lambda functions, and saved into 2 database services.

Each endpoint is connected to a messaging service, so a failure at any point can break the event flow, drop events from the service, and possibly lose them altogether.

Logs, Traces, and Insights — Monitoring & Observing

The first place to look for failures is the logs. Logs, Traces, and Insights summarise what exactly the system is doing, and whether it is working or not.

AWS provides different services to check logs, traces, and insights (stats). However, if you check these at the detail level, you will find that they are structured differently.

CloudWatch Logs maintains a log stream for each running Lambda container. If a container is re-used for an event, its log stream captures those logs; if a new Lambda container is created, a new log stream is created.

This means that, at the lower level, the logs for a single event flow end up distributed across multiple log streams.

Similarly, X-Ray and Lambda Insights have to be enabled explicitly for each function. For Lambda Insights, you will also have to enable and configure a Lambda Layer.

Each function needs Tracing and Lambda Insights enabled separately.
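If you manage functions through the API rather than the console, active tracing can be enabled with a small boto3 call. This is only a minimal sketch: the function name below is a placeholder, and Lambda Insights would additionally require attaching the region-specific LambdaInsightsExtension layer.

```python
# Minimal sketch (placeholder function name): turn on active X-Ray tracing for a
# Lambda function. Lambda Insights is a separate step that needs its layer attached.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="order-process",        # hypothetical function name
    TracingConfig={"Mode": "Active"},    # send traces to AWS X-Ray
)
```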

Idempotent Functions

Serverless promotes the stateless programming principle: functions (ideally) should not keep any state from previous executions. So the function itself does not know how many times it has been called for the same event.

As a best practice, we should write Lambda functions that are repeatable (idempotent) in nature.

Idempotency means that an operation produces the same result no matter how many times you perform it.

Put simply, for every delivery of the same event, the Lambda function should produce exactly the same output and side effects.

As easy as this is to WRITE, it's equally difficult to CODE.

SQS Standard, SNS, and EventBridge all by default provide “at-least-once-delivery” for messages.

That means the receiver MAY get the messages more than once.

The function should be written to ensure that multiple deliveries of the same message do not change the overall state of the application.

For example, if the Order Process function receives an order placed event multiple times, it should still treat it as one order, instead of processing the order multiple times.
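Here is a minimal sketch of what such an idempotent handler could look like, assuming a hypothetical DynamoDB "orders" table keyed on orderId and an EventBridge-shaped event; a conditional write turns a redelivered event into a no-op.

```python
# Minimal sketch of an idempotent order handler. The table name, key, and event
# shape are assumptions for illustration, not the author's actual implementation.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table name


def handler(event, context):
    order = event.get("detail", event)              # assumes an EventBridge-style event
    try:
        # succeeds only the first time this orderId is seen
        table.put_item(
            Item={"orderId": order["orderId"], "status": "PROCESSED"},
            ConditionExpression="attribute_not_exists(orderId)",
        )
        # first delivery: run the real order processing here
        return {"status": "processed"}
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # redelivery of the same event: safely ignore it
            return {"status": "duplicate-ignored"}
        raise
```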

Dead-letter Queue and Retry Policy

As failures are inevitable, we need to make sure that we fail gracefully and also protect important information to recover from the failures.

Dead letter mail or undeliverable mail is mail that cannot be delivered to the addressee or returned to the sender. A dead letter office (DLO) is a facility within a postal system where undeliverable mail is processed. — Wikipedia

Messages may fail due to a variety of issues: erroneous conditions in the code, bad message data (a poison message), an unexpected state change that breaks the code, function timeouts, or simply a downstream service failure.

Each of the messaging services has an option to configure retries with a redrive policy.

The redrive policy specifies the source, the dead-letter queue, and the conditions under which the service moves messages from the source to the DLQ if the consumer of the source fails to process a message a specified number of times.

It's likely that a persistent error will never allow a message to fully process, keeping it stuck inside the queue until it reaches the maximum retention period. But we don't want to keep reprocessing the same message indefinitely, failing every time. This is where retries and dead-letter queues (an Amazon SQS queue) help: after exhausting all the retries, the service moves the message to a DLQ.

Dead-letter queues are useful for debugging your application or messaging system because they let you isolate problematic messages to determine why their processing doesn’t succeed.

The retry policy works a bit differently for different services, so let's check how Amazon SQS, SNS, and EventBridge manage it.

Failures in SQS Queue receiver

Amazon SQS persists all the messages and relies on the receiver to pull, process, and remove messages from the queue.

In the case of SQS and Lambda integration, the Lambda service internally polls the queue and invokes your Lambda function synchronously with an event that contains queue messages.

Failures in the receiver function will block the SQS queue from processing newer messages as the SQS queue only removes the messages after successful execution.

Once a message exceeds the configured receive count, it is moved to the associated dead-letter queue.

SQS Message Failure Redrive Policy with DLQ
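If the queues are managed through the API rather than the console, the redrive policy shown above can be attached with a call like the following sketch; the queue URL and DLQ ARN are placeholders.

```python
# Minimal sketch, assuming both queues already exist: messages received more than
# maxReceiveCount times are moved to the dead-letter queue instead of cycling forever.
import json
import boto3

sqs = boto3.client("sqs")

sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",  # placeholder
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",  # placeholder
            "maxReceiveCount": "5",  # receive attempts before the move to the DLQ
        })
    },
)
```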

Amazon SQS can deliver messages in batches. This increases efficiency and performance, but it also makes failures more complicated to manage.

If one of the messages from the batch fails, the entire batch is treated as failed, and all of its messages become visible again in the SQS queue after the visibility timeout expires.

Use the SQS FIFO Queues if you want “exactly-once” delivery of critical events.

Even with the "exactly-once" delivery method, it is still the application's responsibility to ensure a message is not re-processed, using the Message Group ID and Message Deduplication ID provided as additional attributes on each message.

For Reference: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
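As a rough sketch, publishing to a FIFO queue with those deduplication attributes looks like this; the queue URL, group ID, and deduplication ID are placeholders chosen for illustration.

```python
# Minimal sketch of sending to an SQS FIFO queue. MessageGroupId preserves ordering
# within a group; MessageDeduplicationId lets SQS drop duplicates sent within the
# 5-minute deduplication window.
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",  # placeholder
    MessageBody='{"orderId": "1001", "status": "PLACED"}',
    MessageGroupId="orders",              # ordering scope
    MessageDeduplicationId="order-1001",  # e.g. derived from the order ID
)
```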

To summarise,

  • Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages.
  • The Lambda service can poll with a batch of messages, so a Lambda function may receive a batch of messages.
  • In the case of throttling, the Lambda poller internally retries the execution until the messages hit the Visibility Timeout limit, at which point they reappear in the SQS queue.
  • In the case of function failures, the Lambda poller drops the messages and waits for the Visibility Timeout limit, at which point they reappear in the SQS queue.
  • While processing a batch of messages, a Lambda function failure will by default put all the messages back in the queue. Use the (bulk) delete message API to confirm successful execution for the messages in the batch that did succeed (see the sketch below).
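Here is a minimal sketch of that batch-handling pattern, with a hypothetical queue URL and a placeholder process_record function: delete the messages that succeeded, then raise so only the failed ones come back after the visibility timeout.

```python
# Minimal sketch, not the author's code: per-record handling of an SQS batch.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder


def process_record(record):
    """Placeholder for the real business logic; raise an exception to signal a failure."""


def handler(event, context):
    succeeded, failed = [], []
    for record in event["Records"]:
        try:
            process_record(record)
            succeeded.append({
                "Id": record["messageId"],
                "ReceiptHandle": record["receiptHandle"],
            })
        except Exception:
            failed.append(record["messageId"])

    if succeeded:
        # remove the successfully processed messages so they are not retried
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=succeeded)

    if failed:
        # raising marks the invocation as failed; only the messages that were
        # not deleted will become visible again after the visibility timeout
        raise RuntimeError(f"Failed to process messages: {failed}")
```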

Failures in SNS Topic subscriber

Amazon SNS by default does not persist the messages, so if the receiver fails for any reason the message is practically lost (after retries).

The Amazon SNS message delivery policy defines how it retries the delivery of messages when an error occurs on the receiver side. Once the delivery policy's retries are exhausted, Amazon SNS stops retrying the delivery and either discards the message or transfers it to a dead-letter queue, if one is configured.

The Lambda service buffers the messages received from SNS and then invokes the Lambda function for each message.

SNS Lambda Subscription Redrive configuration
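For reference, the subscription-level redrive policy can be attached through the API roughly like this; the subscription and DLQ ARNs are placeholders.

```python
# Minimal sketch, assuming the subscription and the DLQ already exist: messages that
# SNS cannot deliver (after its retries) are sent to an SQS dead-letter queue.
import json
import boto3

sns = boto3.client("sns")

sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:orders:subscription-id",  # placeholder
    AttributeName="RedrivePolicy",
    AttributeValue=json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq"    # placeholder
    }),
)
```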

On failure, the Lambda service retries up to 2 more times, for a total of 3 attempts per message.

SNS Lambda Trigger with Retries configuration

The difference between the Maximum age of event and Retry attempts settings lies in where the error originates.

For example, if the Lambda service throttles because not enough concurrency is available to process new messages, it retries until the Maximum age of event is reached. This is a service-level issue.

If the Lambda function itself throws an error, the message is retried based on the configured Retry attempts. This is an execution-level issue.

Lambda functions also have an option to configure Destinations. This is a great feature to orchestrate the event-flow based on the Lambda function’s success or failure result.

Lambda Destination with On Failure Target options
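Both the retry settings and the on-failure destination for asynchronous invocations can be configured through the API; below is a rough sketch with placeholder names and ARNs.

```python
# Minimal sketch: cap the retry attempts and event age for asynchronous invocations,
# and route failed events to an on-failure destination (an SQS queue here).
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_event_invoke_config(
    FunctionName="order-process",          # hypothetical function name
    MaximumRetryAttempts=2,                # up to 2 retries (3 attempts total)
    MaximumEventAgeInSeconds=3600,         # discard events older than 1 hour
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:orders-failed"  # placeholder
        }
    },
)
```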

To summarise,

  • SNS to Lambda is an asynchronous invocation: SNS sends the message to the Lambda service and forgets about it.
  • The Lambda service internally manages a queue; the Lambda service poller pulls messages from this internal queue and invokes the Lambda function for each message.
  • In the case of throttling, the Lambda service poller retries the execution until the message reaches the Maximum age limit.
  • In the case of a function failure, the Lambda service poller retries the execution up to 2 more times.

Failures in the EventBridge Rule targets

An Amazon EventBridge rule can be configured with up to 5 targets. Those targets might be unavailable or throw errors for various reasons.

When an event is not successfully delivered to a target, EventBridge retries sending the event. By default, EventBridge retries sending the event for 24 hours and up to 185 times with randomized delay. When retry attempts are exhausted, EventBridge will either drop the event or transfer the event to a Dead-letter queue, if one is configured.

Note: Not all event errors are handled in the same way. Some events are dropped or sent to a dead-letter queue without any retry attempts.

EventBridge Rule Retry policy configuration
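A rough sketch of attaching a target with a custom retry policy and a dead-letter queue through the API; the rule name, target ARN, and DLQ ARN are placeholders.

```python
# Minimal sketch: per-target retry policy and DLQ on an EventBridge rule.
import boto3

events = boto3.client("events")

events.put_targets(
    Rule="orders-placed",                  # hypothetical rule name
    Targets=[{
        "Id": "order-process-function",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:order-process",  # placeholder
        "RetryPolicy": {
            "MaximumRetryAttempts": 10,
            "MaximumEventAgeInSeconds": 3600,
        },
        "DeadLetterConfig": {
            "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-events-dlq"       # placeholder
        },
    }],
)
```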

To summarise,

  • EventBridge-to-Lambda invocation works the same way as SNS-to-Lambda.
  • An EventBridge rule allows you to configure the Maximum age, Retry attempts, and DLQ for each of its targets.

Detect Failures with Lumigo

With so many ways for events to fail, retry, and be handled, how do you actually detect the failures?

AWS provides baseline services like CloudWatch and X-Ray to record and analyze request failures.

However, when many requests are processed in parallel and multiple services are used to transmit events, it becomes really difficult to piece together all the distributed logs when one of the downstream services fails.

Failed Transactions

The following screenshot is taken from the Lumigo.io transactions screen, where you can see all the executions and follow the flow of each one, including any errors.

Graph of functions with multiple retries

As you can see in the flow, some of the functions failed internally, and the service attempted two retries.

Lumigo also keeps the event message along with the failure stack trace. This helps in identifying the exact input while going through the logs.

You can also configure various alert options to receive notifications including function errors.

Sessions from the AWS re:Invent 2020

These are some good sessions to watch from the AWS re:Invent 2020.

Scalable serverless event-driven architectures with SNS, SQS & Lambda

https://virtual.awsevents.com/media/1_5bijynes

Decoupling serverless workloads with Amazon EventBridge

https://virtual.awsevents.com/media/1_upe6mtew

Next post…

In the next post, I will extend this example to illustrate how to retain events for backup and how to replay them when required.


Dhaval Nagar
AppGambit

AWS Hero, 12x AWS Certified, AWS SME, Founder @ AppGambit