Handle SQS message failure in batch with partial batch response feature
Recently, AWS has announced, “AWS Lambda now supports partial batch response for SQS as an event source”. Before we go through, how it works let understand that how SQS messages were handled before, and then we will go through how the partial batch response feature will add value.
Assumption: The SQS trigger is configured for the lambda function. The exception is not handled in case of any message processing failure for a given batch.
The lambda function is triggered with the batch of 5 messages and if the lambda function fails to process any of the messages and throws an error then all the 5 messages would be kept on the queue to be reprocessed after a visibility timeout period. In this case, either the batch processing would be completely successful and the messages would be deleted from the SQS queue or it would completely fail and put the whole batch in the queue for reprocessing.
Re-processing already processed message is not feasible, so to avoid it at some level let’s delete each message from the batch once it is processed.
Assumption: The SQS trigger is configured for the lambda function. The exception is not handled in case of any message processing failure. But the delete functionality is added to delete each message after it is processed successfully.
In this case, let’s say the first 2 messages in the batch are processed successfully and deleted. The 3rd message failed and lambda returns an error, so for this failure, the 3rd, 4th & 5th messages will be set for a retry since the 1st & 2nd messages are processed successfully and deleted from the queue. Hence, the processing of already processed messages will not happen.
The lambda failed to process the 3rd message from the batch and due to that further processing of the rest of the messages is interrupted. But let’s say we want all the messages to be processed even if any message is failing and the successfully processed messages should be deleted and only the failed message should be retried.
Assumption: The SQS trigger is configured for the lambda function. Exception handling for failed messages is configured on top of delete functionality.
Here, we will maintain a flag to determine if any message is failing. Let’s say again the 3rd message failed. Now, since we have error handling in place, it will handle the failed message and process the rest of the messages in the batch followed by the deletion. Finally, the manual exception will be raised based on the flag condition which will cause the failed message to retry since the rest of the messages are processed and deleted successfully. In this scenario, we cannot control which messages we want lambda to retry with and if we want to control the messages that lambda should retry then the partial batch response feature is the answer.
All the 3 scenarios were about how lambda can handle the messages depending on the requirements before the partial batch response was introduced.
With the partial batch response feature, a lambda can identify the failed messages from the batch and return the identified messages back to the queue which will allow reprocessing of only the failed or asked messages. This will make the SQS queue processing more efficient, kill the need for repetitive data transfer with increased throughput, improve processing performance, and on top of that it does not come with any additional cost beyond the standard price.
While using this feature, exception handling should be in place and the lambda function has to return the message ids of the messages that requires reprocessing in the particular format given below.
To enable and handle partial batch failure, check the Report batch item failures option under Additional settings while adding the SQS trigger.
After the SQS event source configuration, the response part in the code should be in a particular format that is given below for the partial batch response failure to work.
Here is the video tutorial on the same.
For most of the situations/scenarios adopting the implementation used in the 4th scenario would be beneficial and have added advantage of efficient & fast processing, reduced repetitive data transfer hence increased throughput.