Monitor your AWS infrastructure with Cloudwatch alarms & Slack alerts

Nolwen Brosson · Published in Legalstart · Jan 27, 2023

Introduction

Context

Providing top-quality legal services can require a lot of computing power. At Legalstart, we believe that user experience and page speed should not be impacted by these heavy workloads. For this reason, we split our code between our main back-end services and Celery workers that perform heavy tasks in the background. All these applications are deployed in multiple ECS clusters. Of course, most of these services are critical, so we need as little downtime as possible.

The problem

We are increasingly confident in our testing system's ability to ship stable versions to production, but our biggest pain point is hardware issues. While AWS takes care of a lot of things with ECS, it is our responsibility to keep our EC2 instances and Docker containers up and running. One example is memory usage, which, when too high, can impact some applications. Multiple times, asynchronous tasks such as document generation and data updates have failed because of infrastructure issues. Moreover, our logging strategy is not as effective for hardware errors as it is for application errors, and it can take several minutes, if not more, to discover these issues. Fortunately, AWS provides a service called Cloudwatch that can help us tackle this problem.

Cloudwatch alarms

Monitor hardware metrics

Metrics, conditions and thresholds

Cloudwatch monitors AWS resources, and Cloudwatch alarms can trigger events based on metrics, conditions, and thresholds. Metrics are data about the performance of your systems. A condition compares a metric to a threshold: greater than, less than, or equal to it.

You can find the list of resources that publish metrics to AWS here. The metrics provided by AWS differ from service to service; EC2 and ECS, for instance, expose different ones. Also, note that you can build custom metrics on top of the existing metrics provided by AWS. Finally, note that you have to enable Container Insights on your ECS cluster to use Cloudwatch alarms on its container-level metrics.
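
For reference, here is a minimal sketch of how Container Insights could be enabled outside the console, using boto3; the cluster name is a placeholder, and in our setup this setting lives in Terraform:

import boto3

# Illustrative sketch: turn on Container Insights for an existing ECS cluster.
# "my-ecs-cluster" is a placeholder; adapt it to your own cluster name.
ecs = boto3.client("ecs")
ecs.update_cluster_settings(
    cluster="my-ecs-cluster",
    settings=[{"name": "containerInsights", "value": "enabled"}],
)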

An example

Let’s say you want to be notified when a server (here, an EC2 instance) is having an issue. To do so, you need to define the type of issue (for example, a disk is over-used), when it counts as an issue (for example, when more than 65% of the disk is used), and what should happen when this situation occurs (for example, receiving a message on Slack).

To do that, you can create a metric such as DiskUsage, a condition/threshold such as GreaterThanOrEqualTo 65%, and an event based on that.
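
To make the idea more concrete, here is a hedged boto3 sketch of such an alarm. The disk_used_percent metric in the CWAgent namespace assumes the CloudWatch agent is publishing disk metrics, and the instance ID and SNS topic ARN are placeholders; the alarms we actually use are defined in Terraform further down.

import boto3

# Hypothetical disk-usage alarm matching the example above:
# fire when average disk usage stays >= 65% over two 1-minute periods.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="ec2-disk-usage-high",
    Namespace="CWAgent",                     # assumes the CloudWatch agent is installed
    MetricName="disk_used_percent",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=65.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    # Notify an SNS topic when the alarm changes state (placeholder ARN)
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:slack-alerting"],
)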

Trigger a lambda function when an event is sent

A Cloudwatch alarm can be in three different states: OK, ALARM, and INSUFFICIENT_DATA (the last one means there is not yet enough data for the alarm to work). You can configure a Cloudwatch alarm so that whenever its state changes, Cloudwatch sends an event to AWS EventBridge. From there, we can send notifications to an Amazon Simple Notification Service (SNS) topic that we define. Then, we can trigger services such as Lambda functions when we receive such notifications.

Lambda can send alerts anywhere we want (in our case, Slack)

We can decide when a notification is sent to SNS, for example when the state changes from OK to ALARM, or from ALARM to OK. When the notification is received by SNS, we can configure it to trigger a Lambda function. The Lambda receives as input a JSON payload containing all the information about the alarm; we can make it process this input and send an alert to a Slack channel (or any other messaging system) accordingly.
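
To give an idea of what that Lambda looks like, here is a minimal sketch (the full version we actually use is linked at the end of this article). It unwraps the alarm JSON delivered by SNS and posts a short message to a Slack incoming webhook:

import json
import os
import urllib.request

def lambda_handler(event, context):
    # SNS delivers the Cloudwatch alarm payload as a JSON string
    # inside Records[0].Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    alarm_name = message["AlarmName"]
    new_state = message["NewStateValue"]   # e.g. "ALARM" or "OK"
    reason = message["NewStateReason"]

    slack_payload = {
        "channel": os.environ["SLACK_CHANNEL"],
        "username": os.environ["SLACK_USERNAME"],
        "text": f"{alarm_name} is now {new_state}: {reason}",
    }
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(slack_payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return {"status": response.status}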

The architecture

The code

Below, you will find the code to monitor:

  • An EC2 instance in a specific ECS cluster
  • An ECS task

We monitor metrics such as disk or CPU usage.

Terraform code

Note that all the Terraform code is provided via a link at the end of the article.

Lambda function

Below is a basic Lambda function with a Python 3.8 runtime. Note that there are some Terraform variables to set. In this code block, we also allow SNS to invoke the Lambda function.

resource "aws_lambda_function" "send_message_slack" {
function_name = "send_message_slack_${local.env_type}"
role = aws_iam_role.slack_alerting_lambda_role.arn
s3_bucket = var.lambda_s3_bucket
s3_key = "slack_alerting.zip"
handler = "main.lambda_handler"
timeout = 60
memory_size = 256
runtime = "python3.8"

environment {
variables = {
SLACK_WEBHOOK_URL = var.slack_webhook_url
SLACK_CHANNEL = var.slack_channel
SLACK_USERNAME = var.slack_username
}
}

depends_on = [
aws_iam_role_policy_attachment.sqs_permissions,
aws_iam_role_policy_attachment.attach_basic_role_to_slack_lambda,
]
}

resource "aws_lambda_permission" "with_sns" {
statement_id = "allow_slack_lambda_exec_from_sns"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.send_message_slack.function_name
principal = "sns.amazonaws.com"
source_arn = aws_sns_topic.sns_slack_alert_topic.arn
}

IAM configuration

We need some policy configuration for:

  • Giving Lambda some permissions for SNS topics
  • Allowing Lambda to send its logs to Cloudwatch

resource "aws_iam_role" "slack_alerting_lambda_role" {
name = "slack_alerting_lambda"

assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "lambda.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}

resource "aws_iam_policy" "permissions_for_sns" {
name = "slack_alerting_for_sns"
policy = jsonencode(
{
"Version" : "2012-10-17",
"Statement" : [
{
Sid : "",
Effect : "Allow",
Action : [
"SNS:Subscribe",
"SNS:SetTopicAttributes",
"SNS:RemovePermission",
"SNS:Receive",
"SNS:Publish",
"SNS:ListSubscriptionsByTopic",
"SNS:GetTopicAttributes",
"SNS:DeleteTopic",
"SNS:AddPermission",
],
Resource : [
aws_sns_topic.sns_slack_alert_topic.arn
]
},
]
}
)
}

resource "aws_iam_role_policy_attachment" "sqs_permissions" {
role = aws_iam_role.slack_alerting_lambda_role.name
policy_arn = aws_iam_policy.permissions_for_sns.arn

depends_on = [aws_iam_policy.permissions_for_sns, aws_iam_role.slack_alerting_lambda_role]
}

data "aws_iam_policy" "lambda_basic_execution_role" {
arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy_attachment" "attach_basic_role_to_slack_lambda" {
role = aws_iam_role.slack_alerting_lambda_role.name
# this policy permits to log in cloudwatch
policy_arn = data.aws_iam_policy.lambda_basic_execution_role.arn

depends_on = [aws_iam_role.slack_alerting_lambda_role]
}

# Allow Cloudwatch (via EventBridge) to invoke the Lambda

resource "aws_iam_role" "cloudwatch_role" {
  name_prefix = "cloudwatch-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Allow Cloudwatch to publish on SNS

resource "aws_sns_topic_policy" "slack_topic_policy" {
  arn    = aws_sns_topic.sns_slack_alert_topic.arn
  policy = data.aws_iam_policy_document.sns_topic_policy.json
}

data "aws_iam_policy_document" "sns_topic_policy" {
  statement {
    sid       = "Allow CloudwatchEvents"
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.sns_slack_alert_topic.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com", "cloudwatch.amazonaws.com"]
    }
  }
}

SNS configuration

Here, we simply create an SNS topic and specify which Lambda function to trigger when messages are sent to the topic.

resource "aws_sns_topic" "sns_slack_alert_topic" {
name = "slack-alerting"
}

resource "aws_sns_topic_subscription" "sns_notify_slack" {
topic_arn = aws_sns_topic.sns_slack_alert_topic.arn
protocol = "lambda"
endpoint = aws_lambda_function.send_message_slack.arn
}

Alarms

EC2 instance

resource "aws_cloudwatch_metric_alarm" "monitor_cpu_usage" {
alarm_name = "monitor-ec2-instance"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "2"
period = "60"
datapoints_to_alarm = "2"
insufficient_data_actions = []
alarm_description = "It looks like at least one EC2 instance of the cluster has a very high CPU usage."
alarm_actions = [
aws_sns_topic.sns_slack_alert_topic.arn
]
ok_actions = [
aws_sns_topic.sns_slack_alert_topic.arn
]
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
dimensions = {
ClusterName = var.ecs_cluster_name
}
statistic = "Average"
threshold = "80"
}

ECS task instance

In order to get the CPU usage percentage of a container, as AWS does not provide this as a built-in metric, we need to build a custom metric (called e1) based on two other metrics, m1 and m2. m1 gives the amount of CPU used in the container, while m2 gives the total amount of CPU allocated to the container. Then e1 simply calculates the percentage used as 100*m1/m2. You can find the documentation on how to build custom queries in this AWS documentation.

resource "aws_cloudwatch_metric_alarm" "monitor_containers_cpu_usage" {
alarm_name = "monitor-containers-cpu"
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "2"
datapoints_to_alarm = "2"
threshold = "80"
insufficient_data_actions = []
alarm_description = "It looks like at least one container of the cluster has a very high CPU usage."
alarm_actions = [
aws_sns_topic.sns_slack_alert_topic.arn
]
ok_actions = [
aws_sns_topic.sns_slack_alert_topic.arn
]

metric_query {
id = "e1"
expression = "100*m1/m2"
label = "CPU Usage"
return_data = "true"
}

metric_query {
id = "m1"

metric {
metric_name = "CpuUtilized"
namespace = "ECS/ContainerInsights"
period = "60"
stat = "Average"

dimensions = {
ClusterName = var.ecs_cluster_name
}
}
}

metric_query {
id = "m2"

metric {
metric_name = "CpuReserved"
namespace = "ECS/ContainerInsights"
period = "60"
stat = "Maximum"

dimensions = {
ClusterName = var.ecs_cluster_name
}
}
}
}

The lambda function code

This code (in Python) is responsible for parsing the JSON input and sending messages to Slack based on its content.

The requirements:

boto3==1.26.23
botocore==1.29.23

The code

The Python code used in the Lambda can be found here: https://github.com/terraform-aws-modules/terraform-aws-notify-slack/blob/master/functions/notify_slack.py.

Conclusion: Thousands of euros saved

With this, you have all the knowledge you need to monitor your own infrastructure hardware and get notified on Slack. For us, having such alerts resulted in:

  • Prevention of 12 potential downtimes in a specific asynchronous service in 3 months. Each of these downtimes would have required around 3 hours of dev work to fix, so roughly 12 × 3 = 36 hours, or about 36 × 60 € = 2,160 € saved in three months.
  • Better user experience
  • And, of course, the ability to better secure all our services

Let us know in the comments if you have any improvement suggestions!

Thanks for reading

Link to terraform code: https://github.com/legalstart/aws-cloudwatch-alarms-terraform
