Lambda: How I Finally Won My Race Against Timeout

Mathieu Tamer
Published in precogs-tech
Jul 7, 2017

Edit: Updated with async/await version (19/02/2024)

Five months ago, I chose to join Precogs: a great team with a nice project and a cool stack. I was quickly put in charge of refactoring a monolithic code base into microservices, which was perfect for getting a clear understanding of the code.

We are using Lambda and, unfortunately, I quickly ran into Lambda timeouts. The problem arose when I started testing with a Kinesis stream (the AWS streaming service). Kinesis is designed to receive a stream of messages and to maintain the ordering of the records, so if the application consuming the stream doesn’t acknowledge the processing, the same record is delivered again. In our case, the Lambda function timed out on every call, and the same record was read over and over again, until all the records expired. That’s when we started thinking about how to anticipate a Lambda timeout.

Welcome to Lambda racing

Lambda is designed for microservices. Each function has to win a race within a short lapse of time (configurable between 1 second and 15 minutes). Unfortunately, for functions that call other services such as databases or APIs, I often lost that race…

In this article, I’ll share the issues I’ve encountered with Lambda timeouts, explain why the solution provided by AWS (the DLQ) was not satisfying, and present the solution I put in place (for Node.js).

My issue with timeout

The key point about a Lambda timeout is that you don’t know where in your code Lambda stopped.

See this simple piece of code:

exports.handler = async (event) => {
  await initConfig(event);
  await parseData();
  await db.connect();
  await saveDataInTransaction();
  await db.disconnect();
  await sendSnsMessage();
};

It’s a simplified version of what we used. It parses data from the event, stores the data (in a transaction) and sends a message through SNS. When this Lambda ran, I sometimes got the error “Task timed out after 10.00 seconds”, but I had no way of knowing which action had last succeeded! Worse: it sometimes ended with duplicate data…

The explanation lies in two behaviors. First, depending on how the function is called (in our case, after an asynchronous non-stream-based event), Lambda retries after a failure. Second, once a query has been sent to RDS, it runs to completion even if the client that launched it has died. See the chart below:

Duplicate data caused by a Lambda timeout

Lambda had enough time to initiate a retry and save the same data again before the first attempt rolled back... If only it rolled back! In fact, since the first query was still running, its save and commit could succeed. The function may have saved the data twice and, in the end, I had duplicate data…

After that, we tried to run our transaction as three separate queries instead of a single one. But the Lambda function could time out during the commit… and the issue remained exactly the same: the commit from the first invocation finished even though the Lambda function was no longer active, while a retry was already starting… We still had duplicate data.
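
For context, here is roughly what that attempt looked like. This is only a sketch: it assumes node-postgres (the pg package) and a hypothetical records table, neither of which exactly matches our real code.

const { Client } = require('pg'); // assumption: node-postgres client

async function saveDataInTransaction(data) {
  const client = new Client(); // connection settings come from environment variables
  await client.connect();
  try {
    await client.query('BEGIN');                                              // query 1: open the transaction
    await client.query('INSERT INTO records (payload) VALUES ($1)', [data]); // query 2: insert the data
    await client.query('COMMIT');                                             // query 3: commit
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    await client.end();
  }
}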

Lambda is a microservice solution!

It may be obvious, but the conclusion for us was that we must think about and use Lambda the microservice way.

I think there are two ways to ensure that. Either each step of the function can be re-run multiple times regardless of previous failures, or the function must behave like a transaction. In the first case, checking that the data has not already been saved is mandatory before trying to insert it. In the second case, at the first error, the system must roll back to the state it was in before the function was called.
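
For the first approach, a minimal sketch could let the database itself reject duplicates. This is only an illustration: it assumes node-postgres and a hypothetical records table with a unique record_id column, which are not part of our actual code.

const { Client } = require('pg'); // assumption: node-postgres client

// Idempotent insert: if a retry delivers the same record again,
// the ON CONFLICT clause turns the second insert into a no-op.
async function saveRecordIdempotently(record) {
  const client = new Client();
  await client.connect();
  try {
    await client.query(
      'INSERT INTO records (record_id, payload) VALUES ($1, $2) ON CONFLICT (record_id) DO NOTHING',
      [record.id, record.payload]
    );
  } finally {
    await client.end();
  }
}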

At Precogs, we adopted the second approach. Each microservice must not interfere with other functions. A successful function is a function that has done everything we expected. And if a function fails, we know what the state of the system is: as if nothing had happened. It might not be the best solution, but at least we do not have to track every step of the function to know the current state. It’s exactly like a relational database transaction (which we use with RDS). Anticipating the timeout therefore allows us to roll back easily, before any retry or subsequent call.

AWS provides a solution to deal with failures (and possibly undo what was done): the DLQ (Dead Letter Queue). A DLQ can easily be attached to a Lambda function and receives the event when your function fails. It’s great, and it’s simple to configure. But because of the timeout issue, we still didn’t know in which state the function was when it failed, so when we needed to undo things, we had to check each step one by one. Moreover, in the case of a transaction, the Lambda function consuming the DLQ doesn’t share the same connection as the main one, so rolling back is not as easy as sending a “ROLLBACK” command to the database.

The npm package that solves it: throw-error-on-timeout

This is why we developed an npm package to anticipate the Lambda timeout. It’s on GitHub, open source and easy to use:

It’s an object that can just be initialized with a timeout period:

const ThrowErrorOnTimeout = require('throw-error-on-timeout');
const timeoutError = new ThrowErrorOnTimeout(context.getRemainingTimeInMillis() - 500);

As you can see in this example, the object is initialized using the synchronous method getRemainingTimeInMillis() of the context object provided by Lambda, so there is no need to change the timeout inside the function if the Lambda timeout is modified. Pretty useful, no? :)

And we just have to encapsulate the async functions to anticipate the timeout:

try {
  const globalResult = await timeoutError.raceWithTimeout(async () => {
    //
    // Put your async function(s) here
    //
    // Any value returned here will be returned by
    // the "raceWithTimeout" function (in case of success)
  });
} catch (err) {
  // "err" will be "Async function timeout" if the global async function times out,
  // or the error thrown by the function otherwise.
}

Our first version only had this function. But in most cases, an async function is a sequence of awaited functions. And even if we anticipate the Lambda timeout, the event loop keeps being processed until the function actually times out, so the issue remains the same: we didn’t know in which state the function was when it failed. That’s why we implemented a method that encapsulates each sub-function so that no further step is processed in case of timeout. And if there is a rollback in the event loop, Lambda will wait until it finishes.

This second method allows you to add break points in order to check whether the global function has timed out:

try {
  const globalResult = await timeoutError.raceWithTimeout(async () => {
    await oneAsyncFunction();

    timeoutError.checkExpiration();

    // Will be executed only if the global function had not timed out
    // at the time of the "checkExpiration" call on the line above
    await anotherAsyncFunction();
  });
} catch (err) {
  // "err" will still be "Async function timeout" if the global async function times out
}

With this method, we know precisely where the function timed out, so we can roll back accordingly. Now we have all we need to anticipate a Lambda timeout and apply all the required corrections before it happens.
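
Putting everything together, a handler could look roughly like the sketch below. The db.query helper, the BEGIN/COMMIT/ROLLBACK statements, the saveData step and the 500 ms safety margin are our own assumptions, not part of the package; adapt them to your setup.

const ThrowErrorOnTimeout = require('throw-error-on-timeout');

exports.handler = async (event, context) => {
  // Keep a 500 ms margin so we time out *before* Lambda kills the function
  const timeoutError = new ThrowErrorOnTimeout(context.getRemainingTimeInMillis() - 500);

  await initConfig(event);
  await parseData();
  await db.connect();
  try {
    return await timeoutError.raceWithTimeout(async () => {
      await db.query('BEGIN'); // assumption: db exposes a query() helper
      await saveData();        // hypothetical step that inserts the rows

      // Nothing below this line runs if the global function has already timed out
      timeoutError.checkExpiration();

      await db.query('COMMIT');
      await sendSnsMessage();
    });
  } catch (err) {
    // Whether "err" is "Async function timeout" or a real failure,
    // the state is known: roll back so the system looks as if nothing happened
    await db.query('ROLLBACK');
    throw err;
  } finally {
    await db.disconnect();
  }
};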

A note on the Lambda event loop

Instead of using this second method, one option for us might have been to use the callbackWaitsForEmptyEventLoop parameter of the context object. This parameter can force Lambda to stop (almost) as soon as the callback is called. But the process is simply frozen and may be resumed in a later execution of the function, so we cannot be sure of what is executed and what is not. If you’re interested in how Lambda handles the event loop, an article about how we struggled with it will be published soon. Stay in touch! :)
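
For reference, that alternative looks roughly like the sketch below (doSomeAsyncWork is a hypothetical step; we do not use this approach, for the reasons above):

exports.handler = (event, context, callback) => {
  // Tell Lambda not to wait for the event loop to drain before freezing the process
  context.callbackWaitsForEmptyEventLoop = false;

  doSomeAsyncWork(event) // hypothetical async step
    .then((result) => callback(null, result))
    .catch((err) => callback(err));

  // Anything still pending in the event loop when the callback fires is frozen,
  // and may or may not resume during a later invocation in the same container
};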

I hope our package will be as useful for you as it is for us. If you can think of any improvement to the package, feel free to open issues or submit pull requests. And do not hesitate to share your point of view in the comments and share our articles; it would mean a lot to us ❤

You can also subscribe to our monthly newsletter to be kept informed of Precogs news and events.
