Self-Hosted Lambda Monitoring and Alerting With Slack

Chris Plankey
11 min read · Sep 20, 2020


Over the past few years I have tried many monitoring solutions for my Serverless stacks, and while there are plenty of options out there, what I felt was lacking was the ability to host the solution in your own account without sending logs to a vendor’s AWS account.

What you decide to monitor will differ based on your use case, but for me, the most valuable monitoring and alerting I have found is sending my Lambda errors to Slack, and that is what this article is about.

Background:

This is not about monitoring Lambda duration, invocations, or anomalies; this is about being alerted when a Lambda fails and receiving context about the error. There are many tools such as Datadog, Dashbird, Seed, and others, but each has its own trade-offs. For most of them, it comes down to cost, but for just about all of them, the biggest risk is that they send your logs to a 3rd party AWS account, and for me, that is just not acceptable. My only goal was to alert myself when Lambdas errored and to not send those logs to a 3rd party (well, we still use Slack, so you have to draw the line somewhere).

After evaluating many of the tools out there, and out of a bit of curiosity, I decided I would build an open source, scalable solution that I could share with anyone who wanted the security of hosting their own error alerting solution.

I wanted alerts to be fired on the following events:

  1. A Lambda Error
  2. A Lambda timeout
  3. Any output written to console.error()

Solution:

A CloudWatch Rule to Invoke a Subscription Lambda that adds CloudWatch Subscriptions with the destination of an Alerter Function

A deployable stack that takes one parameter (a Slack Webhook URL) and monitors any Lambda in your account that has a tag of monitoring set to true. You can deploy this solution to your account using either of the following two methods:

  1. Clone my repo and deploy to your account using the Serverless Framework: https://github.com/cplankey/lambda-errors-to-slack
  2. Use the AWS Serverless Application Repo and deploy this stack: https://serverlessrepo.aws.amazon.com/applications/us-east-1/675087241163/lambda-errors-to-slack
lambda-errors-to-slack in the Serverless Application Repository

Stack Requirements:

A Slack Webhook URL which can be generated following the below steps:

Create The Slack App

(These instructions came from a previous tutorial I wrote in 2019)

The first thing you will need to do is create a Slack application that will leverage Incoming Webhooks. Visit: https://api.slack.com and click the ‘Start Building’ button.

Slack API Homepage

You will be prompted to enter an App Name and to map the Application to an existing workspace:

Creating A New Slack App

Click ‘Create App’ and on the next screen, select ‘Incoming Webhooks’.

Select Incoming Webhooks

On the next screen, toggle the ‘Activate Incoming Webhooks’ to ‘On’ then scroll down and click ‘Add New Webhook to Workspace’. Once you do that, you will have to map the webhook to a channel and then you will receive a URL for your webhook.

Activating A New Webhook

The value after https://hooks.slack.com/services/ is what you will need as your stack’s parameter.

Now just log into AWS, set the monitoring tag to true on a few Lambdas, and any time you write to console.error(), the Lambda times out, or the Lambda errors, you will get a Slack notification that looks like this:

Example Slack Error Alert
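
If you want a quick way to exercise all three alert paths on a tagged function, a throwaway test handler (my own example, not part of the stack) might look something like this:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Throwaway handler for testing the three alert paths.
// Deploy it with a short timeout and the monitoring tag set to true.
exports.handler = async (event) => {
  if (event.mode === 'consoleError') {
    console.error('THIS IS A CONSOLE ERROR TYPE'); // fires the console.error() alert
    return 'logged an error';
  }
  if (event.mode === 'timeout') {
    await sleep(10 * 60 * 1000); // sleep past the function timeout to fire the timeout alert
  }
  throw new Error('This is a demo'); // fires the standard Lambda error alert
};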

If that is all you were looking for, thanks for reading, and feel free to submit a pull request on the GitHub Repo to make this project better! If you’re interested in how I implemented this solution, keep reading!

The process:

Through evaluating different tools that sent Lambda errors to Slack, I identified the following requirements:

  1. I need to create CloudWatch Log Subscriptions on each log group that I wish to monitor
  2. I need a destination for the CloudWatch Subscription that can forward the logs to Slack
  3. I need a mechanism for scaling this solution, such as auto-subscribing to specified Lambdas

I will break down each requirement with the information I found and how I chose to solve each component. Because a subscription filter has no value without a destination, we will focus on the destination first.

Understanding destinations for CloudWatch Log Subscription Filters

When you enable Serverless monitoring from many of the industry leaders, it is clear to see in CloudWatch that they subscribe to your log groups with a Kinesis stream. Upon researching CloudWatch Log Subscriptions, I learned that subscriptions can deliver events to the following destinations:

  1. Kinesis
  2. Lambda
  3. Kinesis Firehose

So my first question was: why do all the vendors use Kinesis for log subscriptions? The answer: Kinesis is the only option that supports cross-account destinations for subscriptions, the very root of the problem I was trying to solve.

You can make a case that Kinesis is also a better way to handle large volumes of events, but for the task at hand, I figured I could get away with a Lambda function hosted in my own account.

Creating the Lambda Function

When beginning this, I was not sure what the event of a CloudWatch Subscription would look like to my Lambda, but luckily AWS documents this well. The event contains a Base64 encoded string that is also compressed:
"awslogs": {"data": "BASE64ENCODED_GZIP_COMPRESSED_DATA"} }

Once you decompress and decode the data element, the event looks like this:

{
  messageType: 'DATA_MESSAGE',
  owner: '1234567890',
  logGroup: '/aws/lambda/mediumDemo',
  logStream: '2020/09/20/[$LATEST]e69be1a1fe5649f3ba086ae870b7fa86',
  subscriptionFilters: [ 'alerter-lambda' ],
  logEvents: [
    {
      id: '35695071040888639396093471058274561676918578645330952193',
      timestamp: 1600622343474,
      message: '2020-09-20T17:19:03.474Z\t56ebbed6-ccc8-4e88-9712-367223298b46\tERROR\tThis is a demo\n'
    }
  ]
}

So the first task was decoding and decompressing the event to begin to make use of it. Luckily for us, AWS again provides a great example:

var zlib = require('zlib');
exports.handler = function(input, context) {
  var payload = Buffer.from(input.awslogs.data, 'base64');
  zlib.gunzip(payload, function(e, result) {
    if (e) {
      context.fail(e);
    } else {
      result = JSON.parse(result.toString('ascii'));
      console.log("Event Data:", JSON.stringify(result, null, 2));
      context.succeed();
    }
  });
};

However, this example is not using an async handler and it doesn’t post the error message to Slack.

Making the function async

I spent some time trying to hack together some async implementations of zlib but was making no headway. This is when I began to research how others solved this problem and was very happy to stumble upon this repo from Yan Cui which helped me wrap my head around an async pattern for zlib.
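
The pattern essentially boils down to promisifying zlib.gunzip so the handler can simply await it. A minimal sketch of that idea (the Slack-posting piece is left out here):

const zlib = require('zlib');
const { promisify } = require('util');
const gunzip = promisify(zlib.gunzip);

exports.handler = async (input) => {
  // Decode the Base64 payload, then await the promisified gunzip
  const payload = Buffer.from(input.awslogs.data, 'base64');
  const eventDetails = JSON.parse((await gunzip(payload)).toString('utf8'));
  console.log('Event Data:', JSON.stringify(eventDetails, null, 2));
  return eventDetails;
};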

Posting the Error Message to Slack

This task was made up of a few different parts:

  1. Parse the error message for useful info
  2. Format and send a message using Slack Webhooks

Parsing the Message For Useful Info:

The fields I wanted in my Slack alerts were:

  1. Error Type
  2. Error Message
  3. Timestamp
  4. Lambda Function Name
  5. Log Stream

If we revisit the decoded/decompressed event, some of these fields are easier to grab than others:

{
  messageType: 'DATA_MESSAGE',
  owner: '1234567890',
  logGroup: '/aws/lambda/mediumDemo',
  logStream: '2020/09/20/[$LATEST]e69be1a1fe5649f3ba086ae870b7fa86',
  subscriptionFilters: [ 'alerter-lambda' ],
  logEvents: [
    {
      id: '35695071040888639396093471058274561676918578645330952193',
      timestamp: 1600622343474,
      message: '2020-09-20T17:19:03.474Z\t56ebbed6-ccc8-4e88-9712-367223298b46\tERROR\tThis is a demo\n'
    }
  ]
}

We can grab the Log Stream very easily, and the logGroup has a value that will let us parse out the Function Name. We can get the timestamp by looking into the logEvents array, but the rest of our values are buried in the logEvents message, and they aren’t in any great format to parse (JSON or bust).

To get values for Function Name, Error Type, and Error Message, we need to split some strings on various characters. The Function Name was easy: split logGroup on '/' and grab the last element of the resulting array. The message was a bit trickier, because it has a different format for different error types. Through some forced errors and splitting the message on '\t', I found the possible values for this new message array were:

For an Error:

[
  '2020-09-04T00:38:00.810Z',
  'd440b814-371d-4077-a11d-47615727f4ec',
  'ERROR',
  'Invoke Error ',
  '{"errorType":"TypeError","errorMessage":"Cannot read property \'x\' of undefined","stack":["TypeError: Cannot read property \'x\' of undefined"," at Runtime.exports.main [as handler] (/var/task/services/webhooks/webpack:/tmp/example.js:1:1)"," at Runtime.handleOnce (/var/runtime/Runtime.js:66:25)"]}\n'
]

For a timeout:

[ '2020-09-06T13:57:55.672Z 64cad227-917f-4159-8791-f1c3818dc206 Task timed out after 1.00 seconds\n\n' ]

For a console.error():

[
  '2020-09-06T13:02:05.184Z',
  '466e6c7a-8cbf-4e53-bbf2-3409486f4b59',
  'ERROR',
  'THIS IS A CONSOLE ERROR TYPE\n'
]

So with these three variances in mind, this is the logic I implemented to build useful Slack alerts:

let errorType, errorMessage, errorJSON, timestamp;
let messageArray = eventDetails.logEvents[0].message.split('\t');
if (messageArray[4]) {
  // Invoke Error: the 5th element is a JSON blob with the error details
  errorJSON = JSON.parse(messageArray[4]);
  errorType = errorJSON.errorType;
  errorMessage = errorJSON.errorMessage;
} else {
  if (messageArray.length > 1) {
    // console.error(): plain text message in the 4th element
    errorType = 'console.error()';
    errorMessage = messageArray[3];
  } else {
    // Timeout: a single string containing the timestamp and the timeout message
    errorType = 'TIMEOUT';
    errorMessage = messageArray[0].substr(messageArray[0].indexOf('Task'));
    timestamp = messageArray[0].substr(0, messageArray[0].indexOf('Z') + 1);
  }
}

From here, all I needed to do was send a pretty Slack message and call it a day.

Format and Send a Message Using Slack Webhooks

This part was a bit more fun. Slack has an awesome tool for building messages with their Block Kit.

Custom Slack Message for Lambda Errors

I played around with this tool for a bit and ended up with the following JSON:

{
  "blocks": [
    {
      "type": "section",
      "fields": [
        {
          "type": "mrkdwn",
          "text": "*Type:*\n${errorType}"
        },
        {
          "type": "mrkdwn",
          "text": "*Timestamp:*\n${timestamp}"
        },
        {
          "type": "mrkdwn",
          "text": "*Error:*\n${errorMessage}"
        }
      ]
    },
    {
      "type": "context",
      "elements": [
        {
          "type": "plain_text",
          "text": "Lambda: ${functionName}",
          "emoji": true
        }
      ]
    },
    {
      "type": "context",
      "elements": [
        {
          "type": "plain_text",
          "text": "Log Stream: ${logStream}",
          "emoji": true
        }
      ]
    }
  ]
}

After I had this JSON, I reused an old function I had written that could post to Slack.

The final Lambda function (available in the GitHub repo linked above) pulls all of this together.

Typically I would use node-fetch for my HTTP calls, but because the https module is included in the Lambda runtime environment, I figured fewer dependencies would be best.
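
As a rough illustration of that choice, posting a Block Kit payload to a Slack Incoming Webhook with nothing but the built-in https module can look something like this (SLACK_PATH is a placeholder for the stack parameter; the helper in the repo may differ):

const https = require('https');

// Sketch: POST a JSON body to a Slack Incoming Webhook using only https.
// SLACK_PATH stands in for the value after https://hooks.slack.com/services/.
function postToSlack(body) {
  const data = JSON.stringify(body);
  const options = {
    hostname: 'hooks.slack.com',
    path: `/services/${process.env.SLACK_PATH}`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(data)
    }
  };
  return new Promise((resolve, reject) => {
    const req = https.request(options, (res) => {
      let responseBody = '';
      res.on('data', (chunk) => (responseBody += chunk));
      res.on('end', () => resolve(responseBody));
    });
    req.on('error', reject);
    req.write(data);
    req.end();
  });
}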

Great, I have a CloudWatch Subscription Destination, now I just need a CloudWatch Subscription Filter and a way to scale subscribing to Log Groups!

Creating The Subscription Filter

This part can be tricky depending on what you’re trying to accomplish. You can spend some time in the AWS docs to learn the filter pattern syntax. You can also evaluate some vendors, let them subscribe to your log groups, and then use the AWS CLI (aws logs describe-subscription-filters) to see what types of patterns they are using.

While digging into how to get these filter patterns, you might again end up looking at docs from Yan Cui, but through all my research I found the best pattern to be the one used by the folks at Seed. Again, keep in mind the goal was to catch all timeouts, function errors, and console writes at the error level.

The Filter Expression I ended up with is:

?"Error: Runtime exited" ?"Task timed out after" ?"\tERROR\t" ?"\\"level\\":\\"error\\""

Now that I have the Subscription Filter, I need a scalable way to apply it to the relevant Lambda functions.

Scaling the Solution by Auto-Subscribing to Specified Lambdas

There are a few trains of thought on how you might accomplish this: you can take the blanket approach and subscribe to every Lambda, you can manually add the subscription to a small group of Lambdas, or you can create a CloudWatch Event rule that looks for specific tags and fires events when matches are found. CloudWatch Events are the path I decided to take.

So for my implementation, I wanted a CloudWatch Event to be triggered every time a Lambda function had a tag changed on it. This event would trigger a Lambda (Subscriber Lambda) and the Subscriber would look for the tag monitoring.

Lambda Function With The Monitoring Tag Set To True

Depending on the value of this tag, I needed the Subscriber to subscribe to or unsubscribe from the other Lambda function’s log group (the function whose tag was modified). I found this AWS tutorial to be the most valuable during this phase, and the event pattern I ended up with for the trigger looked like this:

{
  "detail-type": [
    "Tag Change on Resource"
  ],
  "source": [
    "aws.tag"
  ],
  "detail": {
    "service": [
      "lambda"
    ],
    "resource-type": [
      "function"
    ]
  }
}

Within the Subscriber Lambda, I added some basic logic to check whether the monitoring tag existed and what its value was. I also leveraged the aws-sdk to create or delete a Subscription Filter on the modified Lambda’s log group. The event that the Subscriber Lambda is invoked with looks like this:

{
  "version": "0",
  "id": "fc5b2296-7107-e45f-5e24-fe909fa7bf51",
  "detail-type": "Tag Change on Resource",
  "source": "aws.tag",
  "account": "1234567890",
  "time": "2020-09-20T17:17:45Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:lambda:us-east-1:1234567890:function:mediumDemo"
  ],
  "detail": {
    "changed-tag-keys": [
      "monitoring"
    ],
    "service": "lambda",
    "resource-type": "function",
    "version": 1,
    "tags": {
      "monitoring": "true"
    }
  }
}

This event is pretty easy to parse, and after establishing that the monitoring tag is set to true, I build a request to put a subscription filter that looks like this:

// cloudwatchlogs is an AWS.CloudWatchLogs client from the aws-sdk;
// functionName is parsed from the ARN in the tag-change event.
let params = {
  destinationArn: process.env.ALERTER_LAMBDA, /* required */
  filterName: 'alerter-lambda', /* required */
  filterPattern: '?"Error: Runtime exited" ?"Task timed out after" ?"\tERROR\t" ?"\\"level\\":\\"error\\""', /* required */
  logGroupName: `/aws/lambda/${functionName}`, /* required */
  distribution: 'ByLogStream'
};
try {
  await cloudwatchlogs.putSubscriptionFilter(params).promise();
} catch (err) {
  console.log(err);
  throw err;
}

Here you can see that I pass the value of our Alerter Lambda as the destinationArn, create a relevant name, use our filter pattern from the above step and specify which log group I am adding a subscription filter to. The aws-sdk docs for subscription filters will be your friend during this process.

I had a bit more fun with this Lambda and added some functionality to remove the subscription filter when the tag was removed or set to false. The full function is in the repo; the sketch below captures its general shape.
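
A minimal sketch of that subscribe/unsubscribe logic, assuming the same filter name and pattern as above (the exact helpers in the repo may differ):

const AWS = require('aws-sdk');
const cloudwatchlogs = new AWS.CloudWatchLogs();

const FILTER_NAME = 'alerter-lambda';
const FILTER_PATTERN = '?"Error: Runtime exited" ?"Task timed out after" ?"\tERROR\t" ?"\\"level\\":\\"error\\""';

exports.handler = async (event) => {
  // The tag-change event carries the function ARN; the log group follows from the name
  const functionName = event.resources[0].split(':').pop();
  const logGroupName = `/aws/lambda/${functionName}`;
  const tags = event.detail.tags || {};

  if (tags.monitoring === 'true') {
    // Subscribe the Alerter Lambda to this function's log group
    await cloudwatchlogs.putSubscriptionFilter({
      destinationArn: process.env.ALERTER_LAMBDA,
      filterName: FILTER_NAME,
      filterPattern: FILTER_PATTERN,
      logGroupName,
      distribution: 'ByLogStream'
    }).promise();
  } else {
    // Tag removed or not 'true': remove the subscription filter if it exists
    try {
      await cloudwatchlogs.deleteSubscriptionFilter({
        filterName: FILTER_NAME,
        logGroupName
      }).promise();
    } catch (err) {
      if (err.code !== 'ResourceNotFoundException') throw err;
    }
  }
};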

Conclusion

Once you create a Slack Webhook URL and deploy this stack to your AWS account, you will be able to set the monitoring tag to true on any Lambda function you desire, and any time that Lambda has an error, a timeout, or a log written at the error level, it will relay that message to Slack.

Next Steps

  1. Contribute this pattern to more open source projects like CDK-Patterns
  2. Enhance the subscriber Lambda and CloudWatch Event to pick up more than just Lambda Functions
  3. Switch to Kinesis

Thanks for taking the time to read this post! Feel free to leave a comment below on anything you wish to learn more about or open a PR on the repo with any enhancements you think are needed.

