Managing Partial Batch Failure in AWS Lambda

Shivang Kar
Published in Fasal Engineering
3 min read, Dec 22, 2021

At Fasal, we constantly explore new technologies to improve our services. Background job processing is one such core service, and it has improved a lot since the early days of our journey. Because we are an agritech platform, many of our crucial offerings depend on advisories that are generated in the background.

With the rapid growth of Fasal, we had to think about the scalability and reliability of our technical infrastructure, and one of the bottlenecks was our precious background job processing! What started out as a heavily blocking cron on the Meteor server of our main application moved to a dedicated Meteor server (using Fasal-Tech synced cron) and, most recently, to AWS serverless. Our existing infrastructure would soon have been on the verge of collapse had it not been upgraded at the right time! One of our engineers has explained it in detail here.

Of course, moving to serverless was great, but it did not mean all our problems had vanished! To handle massive quantities of tasks efficiently, we were taking advantage of batch processing, which had problems of its own. What happens if one (or more) of the tasks in a batch fails? If you guessed that the whole batch gets processed again, you are spot on. So the next challenge was to manage partial batch failures.

Enter Middy

Middy is a middleware engine that simplifies AWS Lambda code written in Node.js. One of its middlewares, sqs-partial-batch-failure, solves our problem of handling failed events. It manually deletes the successfully processed messages from the queue. When the function then fails, only the remaining undeleted messages are retried automatically, and they are eventually moved to the Dead Letter Queue (DLQ) if they continue to fail. This prevents the entire batch from being retried or sent to the DLQ.
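To make the difference concrete, here is a minimal in-memory sketch of one batch delivery. It does not call AWS or Middy at all; messagesToRetry and its deleteSuccessful flag are hypothetical names used only to model which messages would come back to the worker under each strategy.

```javascript
// Hypothetical model of a single batch delivery; not the AWS or Middy API.
// Returns the messages that would be delivered again after this attempt.
function messagesToRetry(messages, processFn, deleteSuccessful) {
  const failed = [];
  for (const message of messages) {
    try {
      processFn(message);
    } catch {
      failed.push(message);
    }
  }
  // Nothing failed: the function succeeds and the whole batch is removed.
  if (failed.length === 0) return [];
  // Something failed: the function fails. If the successful messages were
  // deleted first, only the failed ones return; otherwise the whole batch does.
  return deleteSuccessful ? failed : messages;
}
```

With a batch of three messages where only the second one throws, the sketch returns all three messages when deleteSuccessful is false and only the failing one when it is true.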

https://github.com/middyjs/middy

Middy documents the sqs-partial-batch-failure middleware, but only at a basic level that is good enough to get started. Enterprise-level software, however, involves database transactions along with other necessary async calls. To implement the middleware on a worker whose purpose is to run the async function foo for every event in a batch, add the following npm packages to the project and import them into the worker:

@middy/core
@middy/sqs-partial-batch-failure
@middy/do-not-wait-for-empty-event-loop

const middy = require('@middy/core');
const sqsBatch = require('@middy/sqs-partial-batch-failure');
const doNotWaitForEmptyEventLoop = require('@middy/do-not-wait-for-empty-event-loop');

// Processes a single SQS message body; rejects (throws) if the task fails.
const foo = async function (db, messageBody) {
  // Performs db transactions
  // Performs some other tasks
};

You might have noticed the do-not-wait-for-empty-event-loop dependency. This is another middleware from Middy. It sets context.callbackWaitsForEmptyEventLoop to false, which keeps Lambda from timing out while it waits on things like open database connections.
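As a rough illustration of what this amounts to, the sketch below sets the same flag by hand on the Lambda context object. skipEmptyEventLoopWait is a hypothetical helper written for this post, not part of Middy; only the callbackWaitsForEmptyEventLoop property is the real Lambda setting.

```javascript
// Hypothetical helper, shown only to illustrate the flag being toggled.
// `context` is the second argument Lambda passes to a Node.js handler.
function skipEmptyEventLoopWait(context) {
  // Tell Lambda not to wait for pending work in the event loop
  // (for example an open database connection) before finishing.
  context.callbackWaitsForEmptyEventLoop = false;
  return context;
}
```

Using the middleware instead of a helper like this keeps the setting out of the business logic and lets it be applied in the before, after, and error phases.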

Our handler, which was originally the entry point of our worker (let’s call it originalHandler), first sets up the database connection for the whole batch, since we do not want to open a separate connection for each event.

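const originalHandler = async (event, context) => {
  let db;
  try {
    // connectToDatabase() is the application's own connection helper
    db = await connectToDatabase();
  } catch (err) {
    // Without a connection no record can be processed, so fail the whole
    // invocation instead of continuing with an undefined db
    throw err;
  }
  // One promise per record; a rejection marks that record as failed
  const recordPromises = event.Records.map(async (record) => {
    await foo(db, record.body);
  });
  // allSettled never rejects, so every record's outcome reaches the middleware
  const settledPromises = await Promise.allSettled(recordPromises);
  return settledPromises;
};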

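sqs-partial-batch-failure relies on the handler returning the result of Promise.allSettled().
If every message was processed successfully, it leaves the deletion of the messages to Lambda’s native functionality. If some were rejected, it deletes the fulfilled ones itself and then fails the invocation, so that only the rejected messages are retried.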
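The sketch below shows how the result of Promise.allSettled() can be matched back to the SQS records. It mirrors the idea only and is not the middleware's actual source; splitSettled is a hypothetical name. The messageId and receiptHandle fields come from the SQS event record, and the Id/ReceiptHandle pair is the entry shape of an SQS DeleteMessageBatch request.

```javascript
// Sketch only: mirrors the idea behind the middleware, not its real code.
// `records` are the SQS event records; `settled` is the array returned by
// Promise.allSettled(), in the same order as `records`.
function splitSettled(records, settled) {
  const toDelete = [];
  const failures = [];
  settled.forEach((result, index) => {
    const record = records[index];
    if (result.status === 'fulfilled') {
      // Processed successfully: safe to delete from the queue.
      toDelete.push({ Id: record.messageId, ReceiptHandle: record.receiptHandle });
    } else {
      // Rejected: leave it on the queue so it is retried.
      failures.push({ messageId: record.messageId, reason: result.reason });
    }
  });
  return { toDelete, failures };
}
```

If failures is empty, nothing needs to be deleted by hand. Otherwise the entries in toDelete are the ones to remove before the invocation is failed.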

Finally, wrapping originalHandler in the middlewares as shown below does the job.

const fooHandler = middy(originalHandler)
  .use(sqsBatch())
  .use(
    doNotWaitForEmptyEventLoop({
      runOnBefore: true,
      runOnAfter: true,
      runOnError: true
    })
  );

// CommonJS export, matching the require() calls used for the imports
module.exports = { fooHandler };

Note: Your worker's execution role must have the sqs:DeleteMessage IAM permission, since the middleware deletes messages from the queue itself.

In conclusion, with Middy and its sqs-partial-batch-failure middleware, we can now handle partial batch failures with just a few lines of code.


Shivang Kar
Fasal Engineering

I’m a Lead Product Engineer at Fasal, where I focus on building tech and products that help farmers increase yield, optimize resource utilization, and more.