AWS SDK Timeouts for Lambda

Beware the default timeout: it’ll get you in the end.

Yesterday I was reminded that even AWS hasn’t attained 100% network reliability (although they’re often so close that it’s easy to forget). One of the ways this manifests itself is lambda functions that hang and then time out.

The symptoms are hard to spot. The lambda function starts fine, but at some point it hangs for a long time (typically longer than the life of the lambda function). This affects only a small percentage of invocations, but those are usually concentrated on calls to a specific service (S3 and DynamoDB have been our biggest culprits, but nothing is safe).

The cause is fairly simple. The AWS-SDK defaults to the semi-standard node/unix socket timeout of two minutes. I’m not convinced two minutes was ever a sensible amount of time for even the slowest of clients to wait, but it’s a lethal amount of time for a lambda function.

With the typical lambda configured to time out in well under 30 seconds, the AWS-SDK default options translate to “succeed, or just wait out the life of the lambda”.

The solution isn’t anything special: you just need to change the default HTTP timeouts (httpOptions.connectTimeout and httpOptions.timeout) on the service configuration. You can do this across the whole aws-sdk (check out AWS.Config), but I find it’s often better to be explicit and set them on each service client.
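As a rough sketch of the per-client approach, assuming the AWS SDK for JavaScript v2 (the aws-sdk package): httpOptions.connectTimeout, httpOptions.timeout and maxRetries are v2 client options, while the helper name shortTimeouts and the 1 second / 5 second / 3 retries values are made up for illustration and should be tuned per service.

```javascript
// Hypothetical helper: builds the options object passed to an AWS SDK v2 client.
// The default values below are illustrative assumptions, not SDK recommendations.
function shortTimeouts({ connectTimeoutMs = 1000, timeoutMs = 5000, maxRetries = 3 } = {}) {
  return {
    maxRetries, // how many times the SDK retries a failed request
    httpOptions: {
      connectTimeout: connectTimeoutMs, // ms allowed to establish the connection
      timeout: timeoutMs,               // ms of socket inactivity before the request is aborted
    },
  };
}

// Usage (assumes the aws-sdk v2 package is installed):
//   const AWS = require('aws-sdk');
//   const s3 = new AWS.S3(shortTimeouts());
//   const dynamo = new AWS.DynamoDB(shortTimeouts({ timeoutMs: 2000 }));
```

Passing the options to each constructor keeps the timeout visible next to the client it applies to, which is the explicitness argued for above.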

It’s worth noting the auto-retry mechanism baked into the AWS-SDK. This means that if either the connect timeout or the socket timeout is hit, the SDK will automatically retry the request. Arguably this does expose you to an edge case where the same action takes effect twice; however, if your lambda had timed out instead, chances are the trigger/client would have retried anyway.
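Because retries multiply the wait, it can help to check that the timeouts and retry count together still fit inside the lambda’s own timeout. The following is a rough estimate only: worstCaseMs is a hypothetical helper, the numbers are assumed, the backoff delay between attempts is ignored, and since the socket timeout measures inactivity, a slowly trickling response can run longer than this figure.

```javascript
// Hypothetical helper: rough estimate of how long one SDK call can block,
// ignoring the backoff delay the SDK adds between attempts.
function worstCaseMs({ connectTimeoutMs, timeoutMs, maxRetries }) {
  const attempts = maxRetries + 1; // the first try plus each retry
  return attempts * (connectTimeoutMs + timeoutMs);
}

const lambdaTimeoutMs = 30000; // assumed function timeout of 30 seconds
const estimate = worstCaseMs({ connectTimeoutMs: 1000, timeoutMs: 5000, maxRetries: 3 });
console.log(estimate, estimate < lambdaTimeoutMs); // 24000 true
```

If the estimate exceeds the function timeout, either the per-attempt timeouts or the retry count needs to come down.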

Of course, this option goes both ways. If you are ignoring better practice and having your lambda synchronously invoke another lambda that runs for five minutes, then you’ll need to increase the timeout so you don’t time out mid-request.

The lesson: Make sure you tune the AWS-SDK timeouts for lambda. The default two minutes is just not sensible.