Your AWS Lambda Function Failed, What Now?

How we fix errors faster than two shakes of a lamb’s tail.

Paul Singman
Oct 28, 2020 · 5 min read

On the analytics team at Equinox Media, we invoke thousands of Lambda functions daily to perform a variety of data processing tasks. Examples range from the mundane shuffling of files around on S3, to the more stimulating generation of real-time fitness content recommendations on the Equinox+ app.

Because of our reliance on Lambda, it’s critical to diagnose issues as quickly as possible.

Here’s a diagram of the process we’ve set up to do so:

[Diagram: Serverless error handling architecture]

If you are also a user of Lambda, what does your error alerting look like? If you find yourself struggling to figure out why a failure occurred, or worse — unaware one happened at all — we hope sharing our solution will help you become a more effective serverless practitioner!

After every invocation of a Lambda function, AWS sends a few metrics to the CloudWatch service by default. Per AWS documentation:

Invocation metrics are binary indicators of the outcome of an invocation. For example, if the function returns an error, Lambda sends the Errors metric with a value of 1. To get a count of the number of function errors that occurred each minute, view the Sum of the Errors metric with a period of one minute.

To make us aware of any failures, we create a CloudWatch Alarm based on the Errors metric for a specific Lambda resource. The exact threshold of the alarm depends on how frequently a job runs and how critical it is, but most commonly this value is set to trigger upon three* failures in a five-minute period.

*One for the original failure, plus two automatic retries.
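Creating such an alarm with boto3 might look like the sketch below. The function name, topic ARN, and alarm-naming scheme are placeholders, not our exact resources:

```python
# Minimal sketch of the Errors alarm described above, assuming boto3.
# A pure builder function keeps the parameters easy to inspect and test.
def error_alarm_params(function_name, topic_arn, threshold=3, period=300):
    """Build put_metric_alarm kwargs: trigger when a function logs
    `threshold` errors (one failure plus two retries) in one period."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period,  # seconds; 300 = the five-minute window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
        "AlarmActions": [topic_arn],  # notification target, e.g. an SNS topic
    }

def create_error_alarm(function_name, topic_arn):
    import boto3  # imported lazily so the builder above runs without AWS
    boto3.client("cloudwatch").put_metric_alarm(
        **error_alarm_params(function_name, topic_arn))
```

Keeping the parameter dict separate from the API call makes it trivial to vary the threshold per job without touching the plumbing.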

For some, generic alerting of this sort is sufficient, and notifications are simply directed to a work email or perhaps a PagerDuty Service tied to an on-call schedule.

However, we know in this scenario valuable information about the failed invocation is being ignored. To be most efficient, we strive to automate more of the debugging process.

Our journey, eager Lambda user, is only beginning.

Instead of sending straight to an alerting service, we send alarm notifications to a centralized SNS topic that handles failure events for all Lambda functions across our cloud data infrastructure.

[Screenshot: Configuring a CloudWatch Alarm to send to SNS.]

What happens to an Alarm record sent to the topic? It triggers another Lambda function of course!

We call this special Lambda function the Alerting Lambda, and it performs three main steps:

  1. Sends a message to Slack with details about the failure.
  2. Creates an incident in PagerDuty, also populated with helpful details.
  3. Queries CloudWatch Logs for log messages related to the failure, and if found, sends to Slack.
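The three steps above can be sketched as a handler skeleton. The helper names below are our own illustration (stubbed so the shape is clear), not a published API:

```python
import json

def parse_alarm(event):
    """Unwrap the CloudWatch Alarm JSON from the SNS envelope."""
    return json.loads(event["Records"][0]["Sns"]["Message"])

# Illustrative stubs; each is fleshed out later in the post.
def post_to_slack(payload):            # steps 1 and 3: Slack webhook call
    pass

def create_pagerduty_incident(alarm):  # step 2: PagerDuty event
    pass

def fetch_error_logs(alarm):           # step 3: CloudWatch Logs Insights
    return None

def handler(event, context):
    alarm = parse_alarm(event)
    post_to_slack(alarm)
    create_pagerduty_incident(alarm)
    logs = fetch_error_logs(alarm)
    if logs:  # only post logs when the Insights query found any
        post_to_slack(logs)
```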

The first two steps are relatively straightforward so we’ll quickly cover how they work before diving into the third.

If you inspect the payload sent from CloudWatch Alarms to SNS, you’ll see it contains data related to the alarm itself like the name, trigger threshold, old and current alarm state, and relevant CloudWatch Metric.

The Alerting Lambda takes this data and parses it into a super-helpful Slack message (via a webhook) that looks like this:

[Screenshot: Slack message from #data-alerts channel]
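A hedged sketch of that parsing step: the field names (AlarmName, NewStateValue, OldStateValue, NewStateReason) come from the CloudWatch Alarm SNS schema, but the message layout here is illustrative, not our production formatting:

```python
import json
from urllib import request

def build_slack_message(alarm):
    """Turn the alarm payload into a Slack webhook message body."""
    return {
        "text": (
            f":rotating_light: *{alarm['AlarmName']}* is now "
            f"{alarm['NewStateValue']} (was {alarm['OldStateValue']})\n"
            f"> {alarm['NewStateReason']}"
        )
    }

def post_to_slack(webhook_url, alarm):
    """POST the message to a Slack incoming webhook."""
    body = json.dumps(build_slack_message(alarm)).encode("utf-8")
    req = request.Request(
        webhook_url, data=body,
        headers={"Content-Type": "application/json"})
    request.urlopen(req)
```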

Similarly, using the pypd package we create a PagerDuty event, populated with helpful custom details and a link to the AWS console:

[Screenshot: PagerDuty Incident with Alarm data populated as Custom Events]
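This step might look like the following sketch using pypd's Events v2 interface. The routing key, severity, and custom-details layout are assumptions to adapt to your own PagerDuty service:

```python
def build_pd_event(alarm, routing_key):
    """Map the CloudWatch alarm payload onto a PagerDuty Events v2 body."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{alarm['AlarmName']}: {alarm['NewStateReason']}",
            "source": "cloudwatch",
            "severity": "error",
            # Trigger holds the metric, threshold, and dimensions
            "custom_details": alarm.get("Trigger", {}),
        },
    }

def trigger_incident(alarm, routing_key):
    import pypd  # pip install pypd; imported lazily in this sketch
    pypd.EventV2.create(data=build_pd_event(alarm, routing_key))
```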

Both of these notifications help us instantly determine if an alert is legitimate or perhaps falls more into the “false alarm” category. When managing 100+ tasks, this provides a quality-of-life improvement for everyone on the team.

The third step of the Alerting Lambda was recently implemented (inspired by this post on effective Lambda logging) and has proven to be a beloved shortcut for Lambda debugging.

The output is a message in Slack containing log messages from recent Lambda failures that looks something like this:

[Screenshot: CloudWatch logs automatically appear in Slack!]

How does this work exactly?

The first step is to parse the Lambda function name out of the SNS event. This tells us which CloudWatch Log Group to query for recent errors.
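Under our assumptions, that might look like the sketch below. The dimension keys are lowercase in the SNS alarm payload; the query string and 15-minute lookback are illustrative choices:

```python
import time

def function_name_from_alarm(alarm):
    """Pull the FunctionName dimension from the alarm's metric trigger."""
    dims = alarm["Trigger"]["Dimensions"]  # lowercase keys in the SNS payload
    return next(d["value"] for d in dims if d["name"] == "FunctionName")

ERROR_QUERY = """
fields @timestamp, @message, @requestId
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
"""

def query_recent_errors(alarm, lookback_seconds=900):
    import boto3  # imported lazily so the parsing above runs without AWS
    logs = boto3.client("logs")
    name = function_name_from_alarm(alarm)
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName=f"/aws/lambda/{name}",  # default Lambda log group
        startTime=now - lookback_seconds,
        endTime=now,
        queryString=ERROR_QUERY,
    )["queryId"]
    while True:  # Insights queries are asynchronous; poll until done
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] == "Complete":
            return resp["results"]
        time.sleep(1)
```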

And after parsing the query response for a requestId, we run a second Insights query filtered on that requestId, re-format the log messages returned in the response, and send the results to Slack.

Fetch query logs and post to Slack
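A sketch of that second half follows; the helper and query are our illustration of the approach, not the exact production code. Insights returns each row as a list of field/value pairs, which we flatten into plain text for the Slack webhook shown earlier:

```python
def format_log_lines(results):
    """Flatten Insights result rows into one text line per log message."""
    lines = []
    for row in results:
        fields = {f["field"]: f["value"] for f in row}
        lines.append(
            f"{fields.get('@timestamp', '')}  "
            f"{fields.get('@message', '').rstrip()}")
    return "\n".join(lines)

REQUEST_QUERY = """
fields @timestamp, @message
| filter @requestId = '{request_id}'
| sort @timestamp asc
| limit 50
"""

def build_request_query(request_id):
    # assumption: requestIds are UUIDs, so safe to interpolate directly
    return REQUEST_QUERY.format(request_id=request_id)
```

Running `build_request_query(...)` through the same start_query/get_query_results flow as the first query, then posting `format_log_lines(...)` to the webhook, completes the third step.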

Place code like this in your Alerting Lambda, and before you know it you'll be getting helpful log messages sent to Slack too!

Final Thoughts

Though this solution has proven effective for our needs, there is room for improvement. Notably, while we query CloudWatch Logs when a Lambda errors, we don't yet handle other failure modes (like timeouts or throttling).

The idea to run an Insights query when a Lambda fails didn't come to us in a "Eureka!" moment of inspiration, but rather from watching for consistent, predictable actions we perform that could be automated. Maintaining an awareness of these situations will serve any developer well in their career.

Another lesson for those getting started with serverless technologies is that you cannot be afraid of managing many, many cloud resources. Critically, the marginal cost of adding another Lambda function or SQS queue to your architecture should be near zero.

The idea of spinning up an additional SNS topic and Lambda just for error handling was a turn-off to some. We hope we've shown the benefits of growing past that limiting mindset. If you want to read more on this topic, check out our post on painlessly deploying Lambda functions.

One final thought: if all the other Lambdas are monitored by the Alerting Lambda, you may be wondering what, then, monitors the Alerting Lambda?

Hmmm.


Equinox Media Tech

One app. A world of unlimited fitness.

Paul Singman

Written by

ML Engineering Lead at Equinox. Whisperer of data and productivity wisdom. Standing on the shoulders of giants.
