Your AWS Lambda Function Failed, What Now?
On the analytics team at Equinox Media, we invoke thousands of Lambda functions daily to perform a variety of data processing tasks. Examples range from the mundane shuffling of files around on S3, to the more stimulating generation of real-time fitness content recommendations on the Equinox+ app.
Because of our reliance on Lambda, it’s critical to diagnose issues as quickly as possible.
Here’s a diagram of the process we’ve set up to do so:
If you are also a user of Lambda, what does your error alerting look like? If you find yourself struggling to figure out why a failure occurred, or worse — unaware one happened at all — we hope sharing our solution will help you become a more effective serverless practitioner!
Step #1: Create An Error Metric-Based CloudWatch Alarm
After every single run of a Lambda function, AWS sends a few metrics to the CloudWatch service by default. Per AWS documentation:
Invocation metrics are binary indicators of the outcome of an invocation. For example, if the function returns an error, Lambda sends the Errors metric with a value of 1. To get a count of the number of function errors that occurred each minute, view the Errors metric with a period of one minute.
To make us aware of any failures, we create a CloudWatch Alarm based on the Errors metric for a specific Lambda resource. The exact threshold of the alarm depends on how frequently a job runs and its criticality, but most commonly this value is set to trigger upon three* failures in a five-minute period.
*One for the original failure, plus two automatic retries.
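As a sketch, an alarm like this can be created with boto3's CloudWatch client. The function name, topic ARN, and alarm naming convention below are placeholders, not our actual resources:

```python
def error_alarm_kwargs(function_name: str, sns_topic_arn: str) -> dict:
    """Build the arguments for a CloudWatch alarm on a Lambda's Errors metric."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,                # five-minute window
        "EvaluationPeriods": 1,
        "Threshold": 3,               # original failure + two automatic retries
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }

# Creating the alarm would then be (assuming AWS credentials are configured):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **error_alarm_kwargs("my-etl-job", "arn:aws:sns:us-east-1:123456789012:lambda-failures"))
```

With `TreatMissingData` set to `notBreaching`, quiet periods with no invocations don't trip the alarm.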
For some, generic alerting of this sort is sufficient, and notifications are simply directed to a work email or perhaps a PagerDuty Service tied to an on-call schedule.
However, we know in this scenario valuable information about the failed invocation is being ignored. To be most efficient, we strive to automate more of the debugging process.
Our journey, eager Lambda user, is only beginning.
Step #2: With A Little Help From An SNS Topic + Lambda Friends
Instead of sending straight to an alerting service, we send alarm notifications to a centralized SNS topic that handles failure events for all Lambda functions across our cloud data infrastructure.
What happens to an Alarm record sent to the topic? It triggers another Lambda function of course!
We call this special Lambda function the Alerting Lambda and it performs three main steps:
- Sends a message to Slack with details about the failure.
- Creates an incident in PagerDuty, also populated with helpful details.
- Queries CloudWatch Logs for log messages related to the failure, and if found, sends to Slack.
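The three steps above can be sketched as a handler skeleton. The helper names are ours for illustration, not the actual implementation:

```python
import json

def parse_alarm(event: dict) -> dict:
    """Extract the CloudWatch Alarm record from the SNS event that triggered us."""
    return json.loads(event["Records"][0]["Sns"]["Message"])

def handler(event, context):
    alarm = parse_alarm(event)
    # 1. Post failure details to Slack (via an incoming webhook).
    notify_slack(alarm)                 # hypothetical helpers
    # 2. Open a PagerDuty incident with the same details.
    open_pagerduty_incident(alarm)
    # 3. Query CloudWatch Logs Insights for related log messages.
    send_error_logs_to_slack(alarm)
```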
The first two steps are relatively straightforward so we’ll quickly cover how they work before diving into the third.
If you inspect the payload sent from CloudWatch Alarms to SNS, you’ll see it contains data related to the alarm itself like the name, trigger threshold, old and current alarm state, and relevant CloudWatch Metric.
The Alerting Lambda takes this data and parses it into a super-helpful Slack message (via a webhook) that looks like this:
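The screenshot aside, the formatting logic might look roughly like this. The field names follow the CloudWatch Alarm SNS payload; the exact message layout here is illustrative:

```python
import json
import urllib.request

def build_slack_payload(alarm: dict) -> dict:
    """Format a CloudWatch Alarm record as a Slack incoming-webhook message."""
    trigger = alarm.get("Trigger", {})
    text = (
        f":rotating_light: *{alarm['AlarmName']}* is in state "
        f"*{alarm['NewStateValue']}* (was {alarm['OldStateValue']})\n"
        f"Metric: {trigger.get('Namespace')}/{trigger.get('MetricName')}\n"
        f"Reason: {alarm['NewStateReason']}"
    )
    return {"text": text}

def post_to_slack(payload: dict, webhook_url: str) -> None:
    """POST the payload to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```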
Similarly, using the pypd package we create a PagerDuty event populated with helpful custom details and an AWS console link:
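A sketch of that step: we build an Events API v2 payload and hand it to pypd. The routing key and field choices below are illustrative assumptions:

```python
def build_pagerduty_event(alarm: dict, routing_key: str) -> dict:
    """Build a PagerDuty Events API v2 payload from a CloudWatch Alarm record."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": alarm["AlarmName"],   # one open incident per alarm
        "payload": {
            "summary": f"{alarm['AlarmName']} entered {alarm['NewStateValue']}",
            "source": "cloudwatch",
            "severity": "error",
            "custom_details": {
                "reason": alarm.get("NewStateReason"),
                "region": alarm.get("Region"),
            },
        },
    }

# Sending it with pypd would look roughly like:
#   import pypd
#   pypd.EventV2.create(data=build_pagerduty_event(alarm, ROUTING_KEY))
```

Using the alarm name as the `dedup_key` keeps repeated alarm firings grouped into a single PagerDuty incident.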
Both of these notifications help us instantly determine if an alert is legitimate or perhaps falls more into the “false alarm” category. When managing 100+ tasks, this provides a quality-of-life improvement for everyone on the team.
The third step of the Alerting Lambda was implemented recently (inspired by this post on effective Lambda logging) and has proven to be a beloved shortcut for Lambda debugging.
The output is a message in Slack containing log messages from recent Lambda failures that looks something like this:
How does this work exactly?
The first step is to parse out the Lambda function name from the SNS event. This allows us to know which CloudWatch Log Group to query against for recent errors, shown in the code snippet below:
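Since the original snippet isn't reproduced here, a sketch of those two steps using boto3's CloudWatch Logs client (the log group follows the standard /aws/lambda/<function> convention; the query string and limits are our illustrative choices):

```python
import time

def parse_function_name(alarm: dict) -> str:
    """Pull the Lambda function name out of the alarm's metric dimensions."""
    for dim in alarm["Trigger"]["Dimensions"]:
        # The SNS alarm payload may use lowercase keys for dimensions.
        if (dim.get("name") or dim.get("Name")) == "FunctionName":
            return dim.get("value") or dim.get("Value")
    raise ValueError("no FunctionName dimension on alarm")

def query_recent_errors(logs_client, function_name: str, start: int, end: int) -> list:
    """Run a Logs Insights query for recent ERROR lines in the function's log group."""
    query_id = logs_client.start_query(
        logGroupName=f"/aws/lambda/{function_name}",
        startTime=start,
        endTime=end,
        queryString=(
            "fields @timestamp, @message, @requestId "
            "| filter @message like /ERROR/ "
            "| sort @timestamp desc | limit 20"
        ),
    )["queryId"]
    # Insights queries run asynchronously: poll until the query finishes.
    while True:
        resp = logs_client.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp.get("results", [])
        time.sleep(1)
```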
And after parsing the query response for a requestId, we run a second Insights query filtered on that requestId, re-format the log messages returned in the response, and send the results to Slack.
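A sketch of that re-formatting step; the input shape matches what get_query_results returns (a list of rows, each a list of field/value pairs), and the code-fenced Slack styling is our choice:

```python
def format_results_for_slack(results: list) -> str:
    """Flatten Logs Insights results into a code-block-friendly string for Slack."""
    lines = []
    for row in results:
        fields = {f["field"]: f["value"] for f in row}
        timestamp = fields.get("@timestamp", "")
        message = fields.get("@message", "").rstrip()
        lines.append(f"{timestamp}  {message}")
    # Wrap in triple backticks so Slack renders it as a monospaced block.
    return "```" + "\n".join(lines) + "```"
```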
Place code like this in your Alerting Lambda and before you know it, you’ll be getting helpful log messages sent to Slack too!
Though this solution has proven effective for our needs, there is room for improvement. Notably, while we query CloudWatch Logs when a Lambda reports Errors, we don’t handle other Lambda failure modes (like timeouts or throttling).
The idea to run an Insights query when a Lambda fails didn’t come to us in a “Eureka!” moment of inspiration — rather, it came from observing the consistent, predictable actions we perform by hand that could be automated. Maintaining an awareness of these situations will serve any developer well in their career.
Another lesson for some getting started with serverless technologies is that you cannot be afraid of managing many, many cloud resources. Critically, the marginal cost of adding an additional Lambda function or SQS queue to your architecture should be near-zero.
The idea of spinning up an additional SNS topic and Lambda just for error handling was a turn-off to some. We hope we’ve shown the benefits of growing past that limiting mindset. If you want to read more on this topic, check out our post on painlessly deploying Lambda functions.
One final thought: you may be wondering, if all other Lambdas are monitored by the Alerting Lambda, what then monitors the Alerting Lambda?