AWS Lambda now supports partial batch failures. But is it the best approach?
At re:Invent 2021, AWS announced a new Lambda feature: partial batch responses. The idea is that if your lambda is invoked from a queue with a batch of messages, and only some of those messages are processed successfully, you can return a “batchItemFailures” array from your lambda listing the messages that failed. So if the lambda was invoked with 20 messages and two of them failed during processing, this tells AWS to delete the 18 successful messages from the queue and retry only the two that failed.
How this was handled before: Approach #1
The simplest solution was to throw an exception from the lambda, which failed the whole batch and caused all 20 messages to be retried. This relies on your lambda’s idempotency implementation to keep the 18 successful messages from being processed twice (if reprocessing matters for your use case).
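Here is a minimal sketch of Approach #1 in Python, assuming an SQS-triggered function. The process_message function is a hypothetical stand-in for your own business logic, not an AWS API; the "Records" and "body" fields follow the shape of the SQS event.

```python
def process_message(body: str) -> None:
    """Hypothetical business logic: raises if the message cannot be processed."""
    if body == "bad":
        raise ValueError(f"cannot process message body {body!r}")


def handler(event, context):
    # No try/except here: the first exception escapes the handler, the
    # invocation fails, and every message in the batch is eligible for retry.
    for record in event["Records"]:
        process_message(record["body"])
```

Note that messages after the failing one are not even attempted in this invocation; they are simply retried along with everything else.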
How this was handled before: Approach #2
Another solution was to delete each message from the queue yourself. As your lambda loops over the messages in the batch, it deletes each message as soon as it has processed it successfully. The loop also keeps a flag indicating whether any message failed. Once the whole batch has been processed, the lambda throws an exception if any message failed, causing AWS to retry the batch. But since the 18 successful messages are already gone from the queue, only the 2 failed messages are retried.
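A sketch of Approach #2 might look like the following. It assumes an SQS client object with a delete_message method that takes QueueUrl and ReceiptHandle (as boto3's SQS client does); the client and queue URL are passed in so the loop is easy to exercise without AWS. The names process_message and make_handler are hypothetical and only for illustration.

```python
def process_message(body: str) -> None:
    """Hypothetical business logic: raises if the message cannot be processed."""
    if body == "bad":
        raise ValueError(f"cannot process message body {body!r}")


def make_handler(sqs_client, queue_url: str):
    """Build a handler around an SQS client, e.g. boto3.client("sqs")."""

    def handler(event, context):
        any_failed = False
        for record in event["Records"]:
            try:
                process_message(record["body"])
            except Exception:
                any_failed = True  # remember the failure and keep going
                continue
            # One service call per successfully processed message.
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=record["receiptHandle"],
            )
        if any_failed:
            # Fail the invocation; only the undeleted messages are left to retry.
            raise RuntimeError("one or more messages failed")

    return handler
```

In a real deployment you would build the handler once at module load, something like handler = make_handler(boto3.client("sqs"), QUEUE_URL), where QUEUE_URL is whatever your configuration provides.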
The new way: Approach #3
And now we have the new solution, which involves tracking your failed message IDs and returning the batchItemFailures array. This causes only the failed messages to be retried.
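As a rough sketch, Approach #3 for an SQS-triggered function could look like this in Python. Each entry in batchItemFailures is an object with an itemIdentifier, which for SQS is the message ID. The response is only honored if ReportBatchItemFailures is enabled in the function response types of the event source mapping; without it, a normal return is treated as success for the whole batch. As before, process_message is a hypothetical placeholder for your own logic.

```python
def process_message(body: str) -> None:
    """Hypothetical business logic: raises if the message cannot be processed."""
    if body == "bad":
        raise ValueError(f"cannot process message body {body!r}")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_message(record["body"])
        except Exception:
            # Track the failed message by ID instead of failing the invocation.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list means every message in the batch succeeded.
    return {"batchItemFailures": failures}
```

The handler never throws for a per-message failure, so there are no delete calls and no whole-batch retries in the normal case.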
Advantages and disadvantages of the three approaches
Approach #1 is the simplest to implement, but it has the downside of retrying all the successfully processed messages. Failures are usually rare, but with large batches a single failure still forces every other message in the batch to be retried unnecessarily. And if your lambda doesn’t do idempotency checking (which it may not if the transaction can be re-run with impunity), every one of those messages is fully reprocessed, which might be costly.
Approach #2 has the advantage that successfully processed messages are not retried. This holds even if the lambda exits prematurely (for instance, due to a timeout), because those messages have already been deleted. However, it has the disadvantage that each message has to be deleted from the queue individually. Each delete is an HTTPS service call, which takes time.
Approach #3 is the fastest solution in almost all cases, since there are no unnecessary retries and you don’t have to delete messages from the queue yourself. The only case where it may be slower is when the lambda exits prematurely, for example on a timeout, because it then never gets to return a batchItemFailures array. The whole batch is retried, so if this happens late in the batch, many messages are retried unnecessarily. This approach is also more proprietary and less portable than Approach #1 or #2, which have a better chance of working as-is with another FaaS provider should you need to move in the future.
Is portability a serious consideration?
In reality, most serverless/FaaS applications are not very portable. You can (and should) modularize your code to keep most of the non-portable parts localized. But unless your development team includes experts on all the platforms, chances are it won’t be easy to port to another serverless provider.
In all honesty, I have experience only with AWS. I have not tried the Azure or Google FaaS offerings, and I am not sure I can speak intelligently about the costs of porting. I only know that the serverless applications I have worked on tend to use many AWS-specific features, and I’m upfront in disclosing to my management that such a future move would likely incur significant porting costs.
AWS is constantly adding new features that its competitors don’t have. They call it smart business to differentiate their product. Some might call it vendor lock-in.
The truth is, I can’t really predict the costs. There are probably engineers out there with experience on multiple platforms who can speak to this better.
Recommendations
If your message processing is fast and you don’t mind occasionally reprocessing an entire batch, even if only one message failed, consider Approach #1. But for most situations, the new Approach #3 will be the fastest. In my book, fast processing beats portability (to some fictitious future FaaS provider) most of the time.