Elevating CloudWatch Logs: Smart Alerts with Chatbot, SNS, and Lambda

Louis Fiori
7 min read · Oct 11, 2023

Hello world 👋

I’m Louis, a Cloud and DevOps Engineer. Today, I’m excited to introduce you to my first Terraform module, which offers a fresh approach to managing CloudWatch Logs notifications. Additionally, it provides a way to receive stylish Slack notifications directly from AWS.

Without further ado, let’s dive in.

What is the issue with CloudWatch Alarms?

Consider a scenario where you have an architecture with hundreds of Lambda functions, and you opt to utilize the CloudWatch Logs Metric Filter and CloudWatch Alarms for receiving alerts in case of errors.

Here’s how it works: when an error message appears in the logs, the metric filter emits a value of 1 for the error metric. Once this metric crosses a certain threshold (let’s say 0, to catch every error), it triggers an alarm, which in turn can send a message to either Slack or an email address (using SNS and/or AWS Chatbot).
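
For context, here’s roughly what that classic setup looks like, sketched with boto3 rather than Terraform (the log group name, namespace, and topic ARN below are made-up placeholders):

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Turn matching log lines into a metric...
logs.put_metric_filter(
    logGroupName="/aws/lambda/my-function",  # hypothetical log group
    filterName="errors",
    filterPattern="ERROR",
    metricTransformations=[{
        "metricName": "my-function-errors",
        "metricNamespace": "Custom/Errors",
        "metricValue": "1",
    }],
)

# ...and alarm as soon as it goes above 0
cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",
    Namespace="Custom/Errors",
    MetricName="my-function-errors",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alerts"],  # hypothetical SNS topic
)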

The main drawback of this alerting method is that, for recurring errors, only one notification is dispatched: the alarm notifies only when it transitions into the ALARM state. If the error level persists and the alarm stays in that state, you won’t receive additional notifications, which could lead to the mistaken belief that the errors were temporary.

Furthermore, there’s another scenario: if two distinct types of errors occur close together, only one notification is generated, even though there are two or more different errors.

My goal is to implement a system that can analyze each error originating from the logs and determine whether a notification is warranted. All while ensuring it’s easy to integrate into an existing infrastructure and remains cost-effective.

And I said: “Let there be…

CloudWatch Logs Enhanced Alerts

To tackle this problem, here’s the idea: let’s create a Lambda function that triggers when an error occurs in CloudWatch Logs. This function will decide whether an alert is needed, for example by checking whether we’ve already seen this error in the last few minutes.

I understand this might sound complex, so allow me to explain the process in detail with a diagram:

  1. An error occurs in CloudWatch Logs, which triggers a Lambda function using the CloudWatch Subscription Filter.
  2. The function performs a GetItem operation on a DynamoDB table to check if the error recently occurred. Based on this, it decides if an alert is necessary. (I’ll explain below how we decide if a notification is needed.)
  3. If a notification is required, the Lambda function publishes a message to an SNS topic.
  4. The SNS topic can be connected to AWS Chatbot for a Slack notification (though it’s also possible to set up email alerts using only the SNS topic).
  5. Finally, the notification is delivered (in this case, on Slack).

So now that we have a general overview, let’s delve into how the Lambda function operates. When triggered, the Lambda checks if there’s an item in the DynamoDB table with the error message’s hash as the primary key. Each record in DynamoDB will have the following attributes:

  • error_message_hash: as mentioned earlier, this is the hash of the error message used as the primary key.
  • timestamp: this attribute is used to determine if we should send another notification (in the case of multiple alerts in quick succession).
  • n and m: two integers used to compute an exponential backoff for updating the timestamp and ttl.
  • payload: an attribute to store the error’s details.
  • counter: represents the number of errors since the last sent notification.
  • ttl: used to remove the entry after a certain amount of time has passed.
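
To make this more concrete, here’s what an item could look like (the values below are purely illustrative):

# Purely illustrative item; real attribute values depend on the module's logic
item = {
    "error_message_hash": "3f5a9c1d...",              # hash of the error message (primary key)
    "timestamp": 1697040000,                          # used to decide if another notification is due
    "n": 2,                                           # backoff counters, advanced
    "m": 3,                                           #   following the Fibonacci sequence
    "payload": "ERROR Unable to reach the database",  # the error's details
    "counter": 4,                                     # errors seen since the last notification
    "ttl": 1697043600,                                # epoch second when DynamoDB expires the item
}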

If an item exists, the function checks the timestamp to decide if a new notification is necessary. If not, it only updates the counter (allowing us to track the number of errors between notifications).

If an alert is needed, it updates:

  • the timestamp with an exponential backoff computed with n and m (this prevents an influx of messages if there are multiple errors).
  • the ttl with an exponential backoff computed with n and m,
  • n and m (following the Fibonacci sequence).

And, if there is no item, the function creates a new entry and sends a notification.
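
If the Fibonacci-based backoff sounds abstract, here’s a minimal sketch of how the update could work; the base delay and the cap are assumptions (in the Terraform below, the MAX environment variable caps it at 600 seconds):

def next_backoff(n, m, base=60, cap=600):
    # Advance n and m along the Fibonacci sequence and derive the new delay,
    # capped so a long-lived error never waits more than `cap` seconds.
    n, m = m, n + m
    delay = min(m * base, cap)
    return n, m, delay

# Starting from n = m = 1, successive delays would be:
# 120, 180, 300, 480, 600, 600, ... seconds between notifications.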

I understand this might be a bit intricate, so I’ve included a visual schema of the process for those who, like me, prefer visual aids:

Now, let’s address a potential scenario: if we encounter multiple errors, this system will only send one notification. However, what happens when the ttl expires for an item with a counter greater than 0? We might lose some information.

To resolve this, we can set up a DynamoDB Stream that triggers the same Lambda function when an item expires. In that case, the function processes the event differently: if the expired item’s counter is not 0, it sends an alert directly (steps 3 and 4), skipping the usual analysis and decision of whether a message is needed.
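
For reference, TTL deletions are easy to spot in the stream: DynamoDB flags them as service-initiated removals, which is the same property the event source mapping filter shown later relies on. A check in Python might look like this:

def is_ttl_expiry(record):
    # DynamoDB Streams marks TTL-driven deletions as REMOVE events
    # performed by the dynamodb.amazonaws.com service principal.
    identity = record.get("userIdentity", {})
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == "dynamodb.amazonaws.com"
    )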

While theory is intriguing, I believe in hands-on experience, so let’s put this plan into action.

Let’s set up everything (using Terraform and Slack)

Alright, let’s roll up our sleeves and dive in. You can find the full code, along with an example on GitHub. Alternatively, if you’d like to use the module directly, you can find it on the Terraform Registry.
I’ll demonstrate the example below using Terraform, but rest assured, you can achieve the same using the AWS CLI or the AWS Console with equal ease.

Setting up the roots

  • First, you’ll need to set up a DynamoDB table with a DynamoDB Stream enabled:
resource "aws_dynamodb_table" "logs_errors_table" {
name = "logs_errors_table"
hash_key = "error_message_hash"
billing_mode = "PAY_PER_REQUEST"
stream_enabled = true
stream_view_type = "OLD_IMAGE"

attribute {
name = "error_message_hash"
type = "S"
}

ttl {
attribute_name = "ttl"
enabled = true
}
}
  • Then you’ll need an SNS topic:
resource "aws_sns_topic" "alerts_sns_topic" {
name = "alerts_sns_topic"
}
  • If you opt for Slack integration, you’ll need to configure AWS Chatbot with the appropriate IAM settings (if you prefer email notifications, a plain SNS subscription is enough). If you go with Slack, make sure to add the AWS Chatbot application to your workspace first. Here’s how to go about it:
//IAM Settings
data "aws_iam_policy_document" "assume_role_chatbot" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["chatbot.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "chatbot_slack_role" {
  name_prefix        = "chatbot_slack_role"
  assume_role_policy = data.aws_iam_policy_document.assume_role_chatbot.json
}

data "aws_iam_policy_document" "chatbot_slack_policy" {
  statement {
    actions = [
      "cloudwatch:Describe*",
      "cloudwatch:Get*",
      "cloudwatch:List*",
    ]
    resources = ["arn:aws:cloudwatch:REGION:ACCOUNT_ID:*"]
    //Don't forget to replace the REGION and ACCOUNT_ID
  }
}

resource "aws_iam_role_policy" "chatbot_slack_role_policy" {
  name   = "chatbot_slack_role_policy"
  role   = aws_iam_role.chatbot_slack_role.id
  policy = data.aws_iam_policy_document.chatbot_slack_policy.json
}

//AWS Chatbot Settings
resource "awscc_chatbot_slack_channel_configuration" "slack_alerts" {
  configuration_name = "slack_alerts"
  iam_role_arn       = aws_iam_role.chatbot_slack_role.arn
  slack_channel_id   = "00000000000" //Replace with your own Channel ID
  slack_workspace_id = "00000000000" //Replace with your own Workspace ID
  sns_topic_arns     = [aws_sns_topic.alerts_sns_topic.arn]
}
  • After all of that, let’s get to the main piece: the Lambda function. First, let’s configure the Terraform part:
//IAM ROLE & POLICIES
data "aws_iam_policy_document" "assume_role_lambda" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}

resource "aws_iam_role" "lambda_role" {
name_prefix = "lambda_role"
assume_role_policy = data.aws_iam_policy_document.assume_role_lambda.json
}

data "aws_iam_policy_document" "lambda_policy" {
statement {
actions = [
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:DeleteItem",
"dynamodb:GetItem",
"dynamodb:Scan",
"dynamodb:Query",
]
resources = [aws_dynamodb_table.logs_errors_table.arn]
}
statement {
actions = [
"dynamodb:GetRecords",
"dynamodb:GetShardIterator",
"dynamodb:DescribeStream",
"dynamodb:ListStreams"
]
resources = ["${aws_dynamodb_table.logs_errors_table.arn}/stream/*"]
}
statement {
actions = [
"sns:Publish",
]
resources = [aws_sns_topic.alerts_sns_topic.arn]
}
}

resource "aws_iam_role_policy" "lambda_policy" {
name_prefix = "lambda_policy"
role = aws_iam_role.lambda_role.id
policy = data.aws_iam_policy_document.lambda_policy.json
}

//Lambda function
data "archive_file" "lambda_zip" {
type = "zip"
source_file = "alerts.py"
output_path = "build.zip"
}

resource "aws_lambda_function" "lambda" {
function_name = "CloudWatchLogsAlerts"
description = "Trigger alerts based on Cloudwatch Logs Subscription Filter trigger"
role = aws_iam_role.lambda_role.arn
filename = data.archive_file.lambda_zip.output_path
source_code_hash = data.archive_file.lambda_zip.output_base64sha256
timeout = 10
runtime = "python3.8"
handler = "logs_alerts.lambda_handler"
environment {
variables {
SNS_ARN = aws_sns_topic.alerts_sns_topic.arn
DYNAMODB_TABLE = aws_dynamodb_table.logs_errors_table.name
MAX = 600 //10 minutes
}
}
depends_on = [
aws_dynamodb_table.logs_errors_table,
aws_sns_topic.alerts_sns_topic,
aws_iam_role.lambda_role
]
}

resource "aws_lambda_permission" "allow_cloudwatch_logs" {
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.lambda.function_name
principal = "logs.amazonaws.com"
source_account = "ACCOUNT_ID"
source_arn = "arn:aws:logs:REGION:ACCOUNT_ID:log-group:*"
//Don't forget to replace the REGION and ACCOUNT_ID
}

resource "aws_lambda_event_source_mapping" "trigger" {
event_source_arn = aws_dynamodb_table.logs_errors_table.stream_arn
function_name = aws_lambda_function.lambda.function_name
starting_position = "LATEST"
filter_criteria {
filter {
pattern = "{\\"userIdentity\\":{\\"type\\":[\\"Service\\"],\\"principalId\\":[\\"dynamodb.amazonaws.com\\"]}}"
}
}
}
  • And the last thing you’ll need is the code for the function:
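
The complete alerts.py lives in the GitHub repository; the sketch below condenses the logic described above into a single handler. The attribute handling, message format, and exact backoff formula here are simplified assumptions, not the module’s real code:

import base64
import gzip
import hashlib
import json
import os
import time

import boto3

dynamodb = boto3.client("dynamodb")
sns = boto3.client("sns")

TABLE = os.environ["DYNAMODB_TABLE"]
SNS_ARN = os.environ["SNS_ARN"]
MAX = int(os.environ.get("MAX", "600"))  # backoff cap in seconds (10 minutes)


def notify(body):
    # Publish the alert to the SNS topic (AWS Chatbot or email subscribers pick it up)
    sns.publish(TopicArn=SNS_ARN, Message=json.dumps(body))


def handle_log_events(event):
    # CloudWatch Logs subscription payloads arrive gzipped and base64-encoded
    data = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    now = int(time.time())

    for log_event in data["logEvents"]:
        message = log_event["message"]
        key = {"error_message_hash": {"S": hashlib.sha256(message.encode()).hexdigest()}}
        item = dynamodb.get_item(TableName=TABLE, Key=key).get("Item")

        if item is None:
            # First occurrence: store it and alert right away
            dynamodb.put_item(TableName=TABLE, Item={
                **key,
                "timestamp": {"N": str(now + 120)},
                "n": {"N": "1"}, "m": {"N": "2"},
                "payload": {"S": message},
                "counter": {"N": "0"},
                "ttl": {"N": str(now + 240)},
            })
            notify({"logGroup": data["logGroup"], "error": message})
        elif now >= int(item["timestamp"]["N"]):
            # Enough time has passed: alert again and grow the backoff (Fibonacci-style)
            n, m = int(item["n"]["N"]), int(item["m"]["N"])
            n, m = m, n + m
            delay = min(m * 60, MAX)
            dynamodb.update_item(
                TableName=TABLE, Key=key,
                UpdateExpression="SET #t = :t, #ttl = :ttl, #n = :n, #m = :m, #c = :zero",
                ExpressionAttributeNames={"#t": "timestamp", "#ttl": "ttl",
                                          "#n": "n", "#m": "m", "#c": "counter"},
                ExpressionAttributeValues={":t": {"N": str(now + delay)},
                                           ":ttl": {"N": str(now + 2 * delay)},
                                           ":n": {"N": str(n)}, ":m": {"N": str(m)},
                                           ":zero": {"N": "0"}},
            )
            notify({"logGroup": data["logGroup"], "error": message,
                    "occurrences_since_last_alert": item["counter"]["N"]})
        else:
            # Too soon for another alert: just count the occurrence
            dynamodb.update_item(
                TableName=TABLE, Key=key,
                UpdateExpression="ADD #c :one",
                ExpressionAttributeNames={"#c": "counter"},
                ExpressionAttributeValues={":one": {"N": "1"}},
            )


def handle_expired_items(event):
    # Items removed by the TTL arrive through the DynamoDB Stream with their OLD_IMAGE
    for record in event["Records"]:
        old = record["dynamodb"]["OldImage"]
        if int(old["counter"]["N"]) > 0:
            notify({"error": old["payload"]["S"],
                    "unreported_occurrences": old["counter"]["N"]})


def lambda_handler(event, context):
    if "awslogs" in event:
        handle_log_events(event)      # triggered by the subscription filter
    else:
        handle_expired_items(event)   # triggered by the DynamoDB Stream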

Connect CloudWatch Logs to the Lambda

Now that everything is set up and ready (after a terraform init and terraform apply), you have to connect your CloudWatch log group to your Lambda. For that, you have several options:

  • Using Terraform:
resource "aws_cloudwatch_log_subscription_filter" "alarm_error_lambda" {
name = "filterName"
log_group_name = aws_cloudwatch_log_group.myLogGroup.name
filter_pattern = ""
destination_arn = aws_lambda_function.yourLambda.arn
}
  • Using the AWS CLI:
aws logs put-subscription-filter \
  --log-group-name myLogGroup \
  --filter-name filterName \
  --filter-pattern "" \
  --destination-arn yourLambdaArn
  • Or using the AWS Console:
Thanks AWS for this tutorial 🫶

The results

Now that you’ve configured everything… let’s put it to the test!
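
The quickest way is the built-in “CloudWatch Logs” test event in the Lambda console, but you can also craft an equivalent payload yourself, since a subscription event is just gzipped, base64-encoded JSON (the log group and message below are made up):

import base64
import gzip
import json

fake_logs = {
    "logGroup": "/aws/lambda/my-function",
    "logStream": "2023/10/11/[$LATEST]abcdef",
    "logEvents": [{"id": "1", "timestamp": 1697040000000,
                   "message": "ERROR Unable to reach the database"}],
}

# This is the shape the subscription filter delivers to the Lambda
test_event = {"awslogs": {"data": base64.b64encode(
    gzip.compress(json.dumps(fake_logs).encode())).decode()}}

print(json.dumps(test_event))  # paste this into a Lambda test event, or pass it to lambda_handler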

If you execute the test case named “CloudWatch Logs” on Lambda, you should see this message in your Slack channel:

Above, we’ve conducted a brief test. Now, let’s delve into a real-world example (with obfuscated data):

Conclusion

Well, there you have it! This marks my debut article on Medium and the release of my inaugural Terraform Module… it’s a significant first step!

Thanks to David who also worked a lot on this system. And who knows, perhaps I’ll have the pleasure of seeing you again soon on Medium! 👋
