Photo by David Pupăză on Unsplash

Error handling in AWS Lambda

Ensuring robust error reporting in large scale systems built on AWS Lambda

Frederick Swadling
12 min read · May 24, 2023

When building large-scale systems on AWS Lambda, it is easy to underestimate the complexity and nuance of dealing with failures correctly. Getting it wrong can lead to failures being swallowed and silently ignored, or to over-reporting, where developers are so flooded with errors that they cannot tell what is real from what isn’t. Here I shall outline some guiding principles and common gotchas.

This article is not intended as comprehensive documentation of the error-handling tools available in AWS; those are already well documented. Rather, it is intended as a broad overview of error-handling considerations and strategy when developing on AWS Lambda.

At TotallyMoney we develop on AWS mostly using F#. Unfortunately, Medium’s code snippet feature doesn’t support F#, so where appropriate I will provide examples in C#. In any case, the principles are universal, so it shouldn’t make a difference.

An error reporting system should aim to be exhaustive but discerning. It is important to be informed when systems start failing; without visibility, the system will likely stay broken until a customer reports it, which can be very costly. On the other hand, if your alerts are not discerning and report every failure with equal severity regardless of how serious it is, developers will quickly switch off and start ignoring them.

Before getting into the details of how to handle errors, we must first discuss the various categories of error. It is important to categorise errors correctly, as failing to do so will lead to an alerting system that isn’t discerning.

What is an error?

An “error” is a broad term that is often used to describe things that aren’t actually problems. An example I like to use is the divide-by-zero result in Excel: when you do a calculation in Excel that leads to a division by zero, an error is shown in the result cell. This is a kind of error, but it’s not a failure in the system; it’s one of the possible outcomes allowed by the freedom given to the client. When Excel shows the error result on a divide by zero, it is working correctly. Imagine if some poor developer at Microsoft received an alert every time that happened! For the sake of this article, I shall refer to these as non-exceptional errors.

Non-exceptional errors should not be reported as errors; nothing has actually gone wrong, and the system is behaving correctly. Instead, the developer must ensure that the possible states of the success response support the possible non-exceptional errors.

Transient errors and Systematic errors

The remaining errors can be divided into two categories: transient and systematic. Transient errors are temporary issues, such as connection failures, database timeouts, and random acts of God, that can be resolved with a retry. Systematic errors, on the other hand, cannot be fixed with a retry; they are usually caused by bugs in the code, but can also be caused by problems in the system design.

Transient errors are, in large systems, to some extent unavoidable. Because of this, a large-scale system must be able to tolerate them when they happen. Systematic errors are avoidable and should be fixed when they arise. In order to resolve systematic errors efficiently, there must be an alerting system that can pick up on them.

With the various types of errors categorised, we can now move onto the specifics of errors in AWS.

Ensuring the lambda actually fails

Before looking at anything else, it is important to understand one of the most basic principles of error handling in AWS: ensuring that the lambda actually fails when something goes wrong.

It is important to remember that a completed lambda will be logged as a success or a failure depending on the outcome of the execution. When developing with the .NET SDK, a lambda is recorded as having failed in CloudWatch when the execution ends with an unhandled exception.

Perhaps this sounds like an obvious point, but it is also an easy detail to get wrong. Often developers will put a try/catch statement around the handler body to do logging, but forget to rethrow the exception. This means that as far as AWS is concerned, the lambda has succeeded:

public class Handlers
{
    [LambdaSerializer(typeof(DefaultLambdaJsonSerializer))]
    public async Task Handler(RequestDTO request)
    {
        try
        {
            var firstSetOfData = await GetFirstDataSet(request);

            await SendNotification(firstSetOfData);
        }
        // Even though an exception has been thrown, this lambda will be seen
        // as having succeeded in CloudWatch as the exception isn't rethrown.
        catch (Exception e)
        {
            LogError(e);
        }
    }
}

This anti-pattern often comes from an ingrained assumption among developers that uncaught exceptions are bad. When a lambda fails with a transient or systematic error, we should aim for the lambda to always correctly register as a failure in CloudWatch. In these cases always allow the exception to terminate the execution.

When a lambda fails due to an exception, the stack trace is included in the response. This can be helpful for debugging, but be aware that a stack trace can also be useful to attackers if it leaks to the client. In such cases, consider logging the stack trace and squashing it in the exception handler, rethrowing a more generic exception instead.
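
A minimal sketch of that pattern (ProcessRequest and LogError here are hypothetical helpers, not part of the SDK):

public class Handlers
{
    public async Task<ResponseDTO> Handler(RequestDTO request)
    {
        try
        {
            return await ProcessRequest(request);
        }
        catch (Exception e)
        {
            // The full stack trace goes to the logs only.
            LogError(e);

            // Rethrowing a generic exception still fails the lambda in
            // CloudWatch, but the response no longer exposes internal details.
            throw new Exception("An unexpected error occurred");
        }
    }
}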

Categorising invalid input errors: APIGateway vs Direct Invoke

The categorisation of non-exceptional errors is usually specific to the domain. In AWS in particular though there is one area of ambiguity that often causes confusion: Are input validation errors non-exceptional or systematic?

In the case of APIs developed using APIGateway, errors caused by invalid input should be considered non-exceptional. The API is publicly available, and it is the responsibility of the developer to ensure all possible inputs are handled gracefully. This should be done by returning an appropriate HTTP status code; for example, return BadRequest if the request body cannot be parsed as valid JSON. Not doing so may allow potential attackers to flood your alerting systems by intentionally sending junk requests.
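
As a sketch, an APIGateway handler might translate a parse failure into a 400 response rather than letting the exception escape (the DTOs and ProcessRequest are hypothetical):

using System.Text.Json;
using Amazon.Lambda.APIGatewayEvents;

public class ApiHandlers
{
    public async Task<APIGatewayProxyResponse> Handler(APIGatewayProxyRequest request)
    {
        RequestDTO dto;
        try
        {
            dto = JsonSerializer.Deserialize<RequestDTO>(request.Body);
        }
        catch (JsonException)
        {
            // A junk request is a non-exceptional error: respond with 400
            // rather than failing the lambda.
            return new APIGatewayProxyResponse
            {
                StatusCode = 400,
                Body = "Request body is not valid JSON"
            };
        }

        var result = await ProcessRequest(dto);

        return new APIGatewayProxyResponse
        {
            StatusCode = 200,
            Body = JsonSerializer.Serialize(result)
        };
    }
}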

You might assume that lambdas that are invoked directly would follow the same principle for handling invalid inputs as APIGateway; however, this is not the case. For directly invoked lambdas, invalid input should in general be treated as a systematic error, causing an exception to be thrown and the lambda to fail. To understand why, one must understand the key difference between directly invoked lambdas and APIGateway lambdas: directly invoked lambdas are not publicly exposed, but rather are (usually) only called internally within the system. Any error in the input is therefore caused by a systematic error in the system, and in these cases one should follow the principle of failing fast. This behaviour is already built into how the default JSON deserialiser works in the .NET SDK.

A parallel to this difference exists in how we throw exceptions from guard clauses in private methods rather than trying to make every method handle every possible input; doing so would just lead to unnecessary over-complication.
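
As a sketch, a directly invoked handler might validate its input with guard clauses and simply throw (the CustomerId field and ProcessRequest are purely illustrative):

public class DirectInvokeHandlers
{
    public async Task<ResponseDTO> Handler(RequestDTO request)
    {
        // A bad payload here means a bug in the calling code, so failing
        // fast is the right behaviour; the failure shows up in CloudWatch.
        if (request is null)
            throw new ArgumentNullException(nameof(request));

        if (string.IsNullOrEmpty(request.CustomerId))
            throw new ArgumentException("CustomerId is required", nameof(request));

        return await ProcessRequest(request);
    }
}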

Handling transient errors

So far we have established that transient errors are to some extent unavoidable, but steps can and should be taken to mitigate them. In order to reduce the risk of failure, lambdas should adhere to the single responsibility principle and keep the number of failure points as low as possible.

When lambdas fail due to transient errors, it should be possible to solve the issue by retrying. This is easier said than done; writing exception-safe code is very hard, as realistically few developers know in advance everywhere the code could fail. A simple strategy that helps is to ensure that all IO calls fetching the data that the lambda needs (typically the main source of failures) are made before any calls that trigger side effects changing the state of the broader system, such as publishing SNS notifications or updating a database:

public class Handlers
{
    // Bad:
    public async Task Handler(RequestDTO request)
    {
        var firstSetOfData = await GetFirstDataSet(request);

        await SendNotification(firstSetOfData);

        // If this fails, a retry will send the notification again.
        var secondSetOfData = await GetSecondSetOfData(request);

        await AddEntryToDatabase(firstSetOfData, secondSetOfData);
    }

    // Better:
    public async Task Handler2(RequestDTO request)
    {
        var firstSetOfData = await GetFirstDataSet(request);
        var secondSetOfData = await GetSecondSetOfData(request);

        // Leave side effects until the end.
        await AddEntryToDatabase(firstSetOfData, secondSetOfData);
        await SendNotification(firstSetOfData);
    }
}

Of course, you can go further than this to ensure idempotency: what if the database update succeeds but the notification fails? How far you go depends on the needs of the system.
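
One common approach, sketched below rather than taken from a real system, is to record an idempotency key with a conditional write so that a retried lambda skips side effects it has already performed. The table name, key schema and RequestId field are all assumptions:

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public class IdempotentHandlers
{
    private readonly IAmazonDynamoDB _dynamoDb = new AmazonDynamoDBClient();

    public async Task Handler(RequestDTO request)
    {
        var firstSetOfData = await GetFirstDataSet(request);

        try
        {
            await _dynamoDb.PutItemAsync(new PutItemRequest
            {
                TableName = "processed-requests",
                Item = new Dictionary<string, AttributeValue>
                {
                    ["pk"] = new AttributeValue { S = request.RequestId }
                },
                // Only write if this request has not been processed before.
                ConditionExpression = "attribute_not_exists(pk)"
            });
        }
        catch (ConditionalCheckFailedException)
        {
            // Already processed on a previous attempt; skip the side effect.
            return;
        }

        await SendNotification(firstSetOfData);
    }
}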

Internal retry strategies

Often, when faced with a flaky external dependency, developers will reach for a simple retry strategy: calling the service in a loop, sleeping on failure and breaking on success.

public class Handlers
{
    // An example of internal retries; in general avoid doing this.
    public async Task Handler(RequestDTO request)
    {
        var data = await GetDataSet(request);

        int count = 0;
        while (true)
        {
            try
            {
                await InvokeExternalDependency(data);
                break;
            }
            catch
            {
                count++;
                if (count < 3)
                {
                    await WaitSeconds(1);
                    continue;
                }
                throw;
            }
        }
    }
}

A more advanced version of this would be to use an external library such as Polly. Although the appeal of this is its apparent simplicity, it is in most cases an anti-pattern: when you retry a failing service internally, you are effectively hiding the failure; no failure will appear in CloudWatch. At best you may see errors in your logs, but that requires extra log monitoring to work with. On top of that, if you add a wait period between retries, you will pay for the extra invocation time, as the lambda still runs while it is waiting.

Usually it is better to rely on the retry mechanisms provided by AWS. This has the added benefit of keeping your code free from the clutter of retry loops. AWS provides sophisticated retry strategies out of the box; implementing these in code would be unnecessarily reinventing the wheel.

There are times when an internal retry strategy may be a good idea. For example, if retrying the whole lambda adds an unacceptably large performance cost, then perhaps an internal retry makes sense; if you are designing an API for external consumption via APIGateway, then perhaps handing responsibility for retrying to the caller is unacceptable. These situations are rare though.

External retry strategies

As a general rule, the responsibility for retrying a failing lambda should fall on the invoker. If one lambda invokes another lambda directly and that lambda fails, it should be the responsibility of the invoking lambda to retry. The invoking lambda may then itself pass the responsibility for retrying on to whatever invoked it, by failing itself (note: long chains of lambdas invoking lambdas are an anti-pattern). In this way, the responsibility for retrying can bubble up to the root lambda invocation.

When a lambda fails, it causes a failure in the invoking lambda, which bubbles to the root and triggers a retry.

Asynchronous vs Synchronous lambdas

Lambdas are invoked either synchronously or asynchronously. Asynchronous lambdas are fire-and-forget: the invoker is not interested in the response and moves on after invocation. With synchronously invoked lambdas, the invoker waits until it gets a response. All lambdas that are triggered by AWS events such as SNS, SQS, StepFunction or EventBridge events are asynchronous.

By default, asynchronously invoked lambdas already have a simple retry strategy. This happens by default because the root invoker (and thus the one with responsibility for retries) is AWS itself. The default behaviour can be changed via the configuration of the lambda. Synchronously invoked lambdas, on the other hand, do not retry by default: the invoker is the system’s own code, which has the responsibility to retry, or to pass that responsibility on to its own invoker by failing.
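
For reference, the retry behaviour for asynchronous invocations can be tuned through infrastructure-as-code or through the SDK. A minimal sketch using the .NET SDK (the function name and values are placeholders, not a recommendation):

using Amazon.Lambda;
using Amazon.Lambda.Model;

public static class AsyncRetryConfiguration
{
    public static async Task Configure()
    {
        var lambdaClient = new AmazonLambdaClient();

        await lambdaClient.PutFunctionEventInvokeConfigAsync(
            new PutFunctionEventInvokeConfigRequest
            {
                FunctionName = "my-notification-lambda",
                // Up to two retries after the initial attempt.
                MaximumRetryAttempts = 2,
                // Discard events older than an hour.
                MaximumEventAgeInSeconds = 3600
            });
    }
}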

Invocation vs Function Errors

When a lambda is invoked directly, there are two kinds of errors that can occur: invocation errors and function errors. Invocation errors occur when the lambda could not be invoked at all, usually due to an incorrect function identifier or missing permissions. Invocation errors are systematic errors and throw an exception by default.

Function errors occur when there is an error in the execution of the invoked lambda itself. One of the gotchas of directly invoking lambdas is that these errors do not cause exceptions in the calling client; the caller must manually check the response to see whether an error has occurred. This catches many developers out, as it is inconsistent with how other AWS services work. The reason is that AWS does not want to assume that an error in the invoked lambda is actually an error for the invoker; if a lambda is invoked asynchronously, it usually isn’t. When directly invoking a lambda synchronously, the best thing to do is usually to check the response and throw an exception if a function error is present.
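
A sketch of what that check might look like with the .NET SDK (the function name is a placeholder):

using System.IO;
using Amazon.Lambda;
using Amazon.Lambda.Model;

public class Invoker
{
    private readonly IAmazonLambda _lambdaClient = new AmazonLambdaClient();

    public async Task<string> InvokeDownstreamLambda(string payload)
    {
        var response = await _lambdaClient.InvokeAsync(new InvokeRequest
        {
            FunctionName = "my-downstream-lambda",
            InvocationType = InvocationType.RequestResponse,
            Payload = payload
        });

        using var reader = new StreamReader(response.Payload);
        var responseBody = await reader.ReadToEndAsync();

        // A function error does not throw on the client; it is reported in
        // the response and has to be checked explicitly.
        if (!string.IsNullOrEmpty(response.FunctionError))
            throw new Exception($"Downstream lambda failed: {responseBody}");

        return responseBody;
    }
}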

Dead-letter-queues

Sometimes a retry strategy will be unable to prevent a failure. This may be because a transient failure is caused by a server being down for longer than the retry strategy can handle, or because once in a while a lambda happens to hit several unrelated transient errors in a row. More often than not, though, if a lambda fails all of its retries it is due to a systematic error.

A lambda failing all of its retries can potentially mean customer data getting lost from the system, and in that case it is desirable not to lose the data entirely. For this reason you should consider setting up a dead-letter-queue (DLQ) on asynchronously invoked lambdas. This is quite easy and supported out of the box: a dead-letter-queue simply stores the event once all retries have failed. The developer is then free to recycle the events in the DLQ at their convenience, meaning that once the systematic error has been fixed, the lost events can be fired again with no loss of data (albeit with a delay).
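
As a sketch, attaching an SQS queue as a DLQ can be done through infrastructure-as-code, or via the .NET SDK roughly like this (the function name and queue ARN are placeholders):

using Amazon.Lambda;
using Amazon.Lambda.Model;

public static class DlqConfiguration
{
    public static async Task AttachDeadLetterQueue()
    {
        var lambdaClient = new AmazonLambdaClient();

        await lambdaClient.UpdateFunctionConfigurationAsync(
            new UpdateFunctionConfigurationRequest
            {
                FunctionName = "my-notification-lambda",
                DeadLetterConfig = new DeadLetterConfig
                {
                    // Events that exhaust all retries are sent here
                    // instead of being dropped.
                    TargetArn = "arn:aws:sqs:eu-west-1:123456789012:my-notification-dlq"
                }
            });
    }
}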

It isn’t always necessary to have a DLQ; sometimes it simply isn’t worth worrying about every lost event. For example, if an event for a daily customer notification fails for a week, a customer is unlikely to be grateful to receive a week’s worth of notifications after the issue is resolved. Some good judgement is required.

An early temptation many developers have when finding out about DLQs is to automate recycling the queue. I would advise against this; a DLQ is intended as the final stage after the retry cycle, not a part of it. Items appearing in the DLQ should be a sign of a systematic error; if too many transient errors are finding their way into a DLQ, then that is a sign that the retry strategy should be revisited.

Earlier I mentioned the importance of being discerning with alerts and not being flooded by every transient failure. A very good strategy for filtering out transient errors is triggering an alert not on the lambda failure itself, but on the DLQ being populated. At that point we know that all the retries have failed and the issue is likely to be systematic. This can be set up quite easily in CloudWatch.
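
A sketch of such an alarm created through the .NET SDK, watching the number of visible messages on the DLQ and publishing to an SNS alert topic (the queue name, topic ARN and thresholds are placeholders):

using System.Collections.Generic;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public static class DlqAlerting
{
    public static async Task CreateDlqAlarm()
    {
        var cloudWatchClient = new AmazonCloudWatchClient();

        await cloudWatchClient.PutMetricAlarmAsync(new PutMetricAlarmRequest
        {
            AlarmName = "my-notification-dlq-not-empty",
            Namespace = "AWS/SQS",
            MetricName = "ApproximateNumberOfMessagesVisible",
            Dimensions = new List<Dimension>
            {
                new Dimension { Name = "QueueName", Value = "my-notification-dlq" }
            },
            Statistic = Statistic.Maximum,
            Period = 300,
            EvaluationPeriods = 1,
            // Fire as soon as anything appears in the DLQ.
            Threshold = 0,
            ComparisonOperator = ComparisonOperator.GreaterThanThreshold,
            // The SNS topic that feeds email / Slack alerts.
            AlarmActions = new List<string>
            {
                "arn:aws:sns:eu-west-1:123456789012:alerts"
            }
        });
    }
}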

Alerting systems

In general I think the best way to set up alerts is to integrate with CloudWatch’s alerting system. CloudWatch tracks a lot of metrics across its services, including lambda failures, and can even alert on more complicated metrics built with simple arithmetic on existing metrics. One must first set up an alert topic in SNS and then listen to it. Triggering emails from an SNS topic is trivial. Sending alerts to a Slack channel is a little more complicated and requires setting up an AWS Chatbot to listen to the topic and send notifications.

Figuring out what to alert on in an exhaustive but discerning manner is an art. Triggering off DLQs is a neat way of filtering out transient errors; another is to set up a metric that monitors the ratio of errors to successes and only sends an alert when the ratio crosses a threshold. A good strategy is to approach the problem exhaustively first: start by alerting on all errors, and then add filtering mechanisms as they become necessary to deal with transient errors.

Log based alerts

A common and simple way of implementing alerts is to trigger them based on the logs: you watch for error logs and send a notification when one appears. This approach is quite easy to set up and we have used it a lot. We send our logs to a third-party structured logging tool such as Loggly or Datadog, and then set up alerts within those tools to send error alerts directly to Slack.

While log based alerting is simple to set up, it has some glaring weaknesses. Discerning between systematic and transient errors is difficult: you have to rely on logging at various levels of severity, which is not always a simple decision, and you also have to ensure non-exceptional errors don’t use the same error logging used for alerting. On top of this, it only works if the error is actually logged; a lambda timeout will simply cause execution to end, and any error logging code will never be hit. In my opinion log based alerting can be a useful supplement to CloudWatch alerts, but probably shouldn’t be considered a replacement.

Summary

  • Ensure errors are correctly categorised, and that anything reported as an error is actually a real error
  • In the case where a real error occurs, remember to fail the lambda
  • Understand the difference between transient and systematic errors
  • Ensure lambdas can be safely retried in the event of a transient error
  • Prefer to let the invoker be responsible for retries rather than trying to retry internally
  • Prefer AWS built-in retry mechanisms to custom ones
  • Use dead-letter-queues to protect against data loss from transient errors
  • Use the tools provided by CloudWatch to ensure exhaustive but discerning error reporting
  • Consider supplementing with log based alerts, but be aware of the limitations
