Monitoring EC2 instances deployed with Blue/Green deployment

6 min read · Feb 26, 2024

Introduction:

In this post, we set up monitoring for EC2 instances deployed with the Blue/Green deployment strategy. The configuration consists of the following resources:

Lambda function with the necessary permissions, which does the following:

gets the instance IDs based on the instance names,
gets the metric values for each instance from the AWS/EC2 CloudWatch namespace,
enriches the metric data with the instance name,
puts the enriched metric data into the custom ws-deployment namespace;

EventBridge schedule rule, which runs the Lambda function every 5 minutes;
CloudWatch alarms, which monitor the EC2 instances based on the metrics from the custom ws-deployment namespace.
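
The enrichment step in the list above amounts to re-keying each datapoint from an instance ID to an instance name. Here is a minimal Python sketch of that idea. The helper name to_custom_metric is hypothetical and is not part of the template; it only builds the dictionary that the CloudWatch PutMetricData call expects.

```python
def to_custom_metric(metric_name, value, unit, instance_name):
    """Build one PutMetricData entry keyed by instance name instead of instance ID."""
    return {
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
        "Value": value,
        "Unit": unit,
    }


# Illustrative values only; real values come from the AWS/EC2 namespace.
entry = to_custom_metric("CPUUtilization", 42.5, "Percent", "ws-instance")
print(entry["Dimensions"])
```

Because the dimension is the instance name, an alarm defined on it should keep pointing at the right metric after a deployment replaces the instance and its ID.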

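This configuration is needed because metrics in the AWS/EC2 namespace identify an individual instance only by the InstanceId dimension, not by the instance name, and a Blue/Green deployment replaces the instance and therefore its ID (more information about CloudWatch metrics is here). For example: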

aws cloudwatch list-metrics --namespace AWS/EC2 --metric-name CPUUtilization   
{
    "Metrics": [
        {
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [
                {
                    "Name": "InstanceId",
                    "Value": "i-1234567890abcde"
                }
            ]
        },

This post is the third part of a series of posts about Blue/Green deployment on AWS EC2 instances with the Systems Manager Automation runbook. The first part is here, and the second part is here.

About the project:

All infrastructure is created with the CloudFormation template infrastructure/ec2_monitoring.yaml and is deployed independently from the ec2-bluegreen-deployment stack. The Systems Manager Automation runbook configuration creates only one EC2 instance, but the Lambda function can work with many instances: we only need to specify the instance names as a comma-separated list in the “InstanceNames” parameter. Note that the alarms in the template cover only the first name in that list.
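
As a small illustration of how a comma-separated parameter value becomes a list of names, here is a hedged Python sketch. The function name parse_names and the second instance name are hypothetical. Unlike the Lambda code in the template, this version also drops empty entries, which is a deliberate difference to guard against a trailing comma.

```python
def parse_names(raw):
    """Split a comma-separated string, trim whitespace, and drop empty entries."""
    return [name.strip() for name in raw.split(",") if name.strip()]


# "api-instance" is a made-up second name used only for this example.
print(parse_names("ws-instance, api-instance ,"))
# → ['ws-instance', 'api-instance']
```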

ec2_monitoring.yaml template:

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudWatch metrics and alarms for monitoring the deployed EC2 instances'

Parameters:
  TransformMetricsLfName:
    Type: String
    Default: 'TransformEc2Metrics'
  CustomNamespace:
    Type: String
    Default: 'ws-deployment'
  InstanceNames:
    Type: String
    Description: 'Comma-separated list of the instance names'
    Default: 'ws-instance'
  MetricNames:
    Type: String
    Description: 'Comma-separated list of the metric names; the alarms expect the order CPU, system status check, instance status check'
    Default: 'CPUUtilization,StatusCheckFailed_System,StatusCheckFailed_Instance'
  MetricUnits:
    Type: String
    Description: 'Comma-separated list of the metric units'
    Default: 'Percent,Count,Count'
  SnsTopicName:
    Type: String
    Default: 'blue-green-deployment-notifications'

Resources:
#####################################
#  Lambda Function configuration
#####################################
  TransformEc2MetricsRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: TransformEc2MetricsRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      Policies:
        - PolicyName: TransformEc2Metricspolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:PutMetricData
                  - cloudwatch:GetMetricData
                Resource: '*'

  LambdaLogsGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Retain
    Properties:
      LogGroupName: !Sub '/aws/lambda/${TransformMetricsLfName}'
      RetentionInDays: '7'

  TransformCustomMetrics:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref TransformMetricsLfName
      Description: 'transforming and processing metrics from AWS/EC2 namespace'
      Runtime: python3.12
      Handler: index.lambda_handler
      Timeout: 30
      Role: !GetAtt TransformEc2MetricsRole.Arn
      LoggingConfig:
        # Referencing the log group resource makes CloudFormation create it before the function
        LogGroup: !Ref LambdaLogsGroup
      Environment:
        Variables:
          instance_names: !Ref InstanceNames
          metric_names: !Ref MetricNames
          metric_units: !Ref MetricUnits
          custom_namespace: !Ref CustomNamespace
      Code:
        ZipFile: |
          import boto3
          import os
          from datetime import datetime, timedelta

          def lambda_handler(event, context):
              try:
                  instance_names = [name.strip() for name in os.environ['instance_names'].split(',')]
                  metric_names =  [metric.strip() for metric in os.environ['metric_names'].split(',')]
                  metric_units =  [unit.strip() for unit in os.environ['metric_units'].split(',')]
                  custom_namespace = os.environ['custom_namespace']

                  # Initialize EC2 and CloudWatch clients
                  ec2_client = boto3.client('ec2')
                  cloudwatch_client = boto3.client('cloudwatch')

                  # Get instance IDs based on the instance names
                  instance_ids, instance_names_result = get_instance_ids(instance_names, ec2_client)
                  if not instance_ids:
                      print("[INFO] No instances found for transforming metrics.")
                      return

                  # Get metrics for each instance
                  for instance_id, instance_name in zip(instance_ids, instance_names_result):
                      metrics = get_instance_metrics(instance_id, metric_names, metric_units, instance_name, cloudwatch_client)

                      # Put formatted metrics to custom CloudWatch namespace
                      put_custom_metrics(metrics, custom_namespace, cloudwatch_client)

              except Exception as e:
                  print(f"[ERROR] Metrics transformation failed: {e}")

          def get_instance_ids(instance_names, ec2_client):
              instance_ids = []
              instance_names_result = []

              for instance_name in instance_names:
                  response = ec2_client.describe_instances(
                      Filters=[
                          {'Name': 'tag:Name', 'Values': [instance_name]},
                          {'Name': 'instance-state-name', 'Values': ['running']}
                      ]
                  )

                  # Extract instance IDs from the response
                  ids = [instance['InstanceId'] for reservation in response['Reservations'] for instance in reservation['Instances']]
                  
                  # Record the name once per matching ID so both lists stay aligned for zip()
                  if ids:
                      instance_ids.extend(ids)
                      instance_names_result.extend([instance_name] * len(ids))
                  
              return instance_ids, instance_names_result

          def get_instance_metrics(instance_id, metric_names, metric_units, instance_name, cloudwatch_client):
              end_time = datetime.utcnow()
              start_time = end_time - timedelta(minutes=10)

              metrics_dict = {}

              for metric_name, unit in zip(metric_names, metric_units):
                  # take necessary metric values
                  id_for_query = metric_name.lower()

                  query = {
                      "Id": id_for_query,
                      "MetricStat": {
                          "Metric": {
                              "Namespace": "AWS/EC2",
                              "MetricName": metric_name,
                              "Dimensions": [
                                  {"Name": "InstanceId", "Value": instance_id}
                              ]
                          },
                          "Period": 300,
                          "Stat": "Average",
                          "Unit": unit
                      },
                      "ReturnData": True
                  }

                  response = cloudwatch_client.get_metric_data(
                      MetricDataQueries=[query],
                      StartTime=start_time,
                      EndTime=end_time
                  )

                  # Extract data from the response
                  metric_data_results = response.get('MetricDataResults', [])

                  if not metric_data_results:
                      print(f"No data available for metric: {metric_name} related to {instance_name}")
                      continue

                  values = metric_data_results[0].get('Values', [])

                  if not values:
                      print(f"No values available for metric: {metric_name} related to {instance_name}")
                      continue

                  # Get the latest value: GetMetricData returns the newest datapoint first
                  # by default (ScanBy=TimestampDescending)
                  latest_value = values[0]

                  # Combine metric name, value, unit, instance name into a dictionary
                  metrics_dict[id_for_query] = {
                      'MetricName': metric_name,
                      'Value': latest_value,
                      'Unit': unit,
                      'InstanceId': instance_id,
                      'InstanceName': instance_name
                  }
              return metrics_dict

          def put_custom_metrics(metrics_dict, custom_namespace, cloudwatch_client):
              for metric_id, metric_info in metrics_dict.items():
                  metric_name = metric_info['MetricName']
                  value = metric_info['Value']
                  dimensions = [
                      {'Name': 'InstanceName', 'Value': metric_info['InstanceName']}
                  ]
                  unit = metric_info['Unit']
                  instance_name = metric_info['InstanceName']

                  response = cloudwatch_client.put_metric_data(
                      Namespace=custom_namespace,
                      MetricData=[
                          {
                              'MetricName': metric_name,
                              'Dimensions': dimensions,
                              'Value': value,
                              'Unit': unit
                          }
                      ]
                  )

                  # Print information about the success or failure process
                  if response['ResponseMetadata']['HTTPStatusCode'] == 200:
                      print(f"Successfully put metric data for {metric_name} in {custom_namespace} related to {instance_name}")
                  else:
                      print(f"Failed to put metric data for {metric_name} in {custom_namespace}. Response: {response}")

  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref TransformCustomMetrics
      Principal: events.amazonaws.com
      SourceArn: !GetAtt ScheduleRule.Arn

  ScheduleRule:
    Type: AWS::Events::Rule
    Properties:
      Name: TransformCustomMetricsScheduleRule
      ScheduleExpression: 'rate(5 minutes)'
      Targets:
        - Arn: !GetAtt TransformCustomMetrics.Arn
          Id: TransformCustomMetricsTarget

#####################################
#  CloudWatch Alarms
#####################################
  CPUAlarm: 
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: 
        !Sub 
          - '${InstanceName} - High CPU Usage'
          - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
      AlarmDescription: 'High CPU Usage'
      AlarmActions:
      - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      OKActions:
      - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      MetricName: !Select [0, !Split [",", !Ref MetricNames]]
      Unit: !Select [0, !Split [",", !Ref MetricUnits]]
      Namespace: !Ref CustomNamespace
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 90
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
      - Name: InstanceName
        Value: !Select [0, !Split [",", !Ref InstanceNames]]

  SystemStatusAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: 
        !Sub 
          - '${InstanceName} - System Status Check Failed'
          - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
      AlarmDescription: 'System Status Check Failed'
      Namespace: !Ref CustomNamespace
      MetricName: !Select [1, !Split [",", !Ref MetricNames]]
      Unit: !Select [1, !Split [",", !Ref MetricUnits]]
      Statistic: Minimum
      Period: 300
      EvaluationPeriods: 1
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      AlarmActions:
      - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      OKActions:
      - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      Dimensions:
      - Name: InstanceName
        Value: !Select [0, !Split [",", !Ref InstanceNames]]

  InstanceStatusAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: 
        !Sub 
          - '${InstanceName} - Instance Status Check Failed'
          - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
      AlarmDescription: 'Instance Status Check Failed'
      Namespace: !Ref CustomNamespace
      MetricName: !Select [2, !Split [",", !Ref MetricNames]]
      Unit: !Select [2, !Split [",", !Ref MetricUnits]]
      Statistic: Minimum
      Period: 300
      EvaluationPeriods: 1
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      AlarmActions:
      - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      OKActions:
      - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      Dimensions:
      - Name: InstanceName
        Value: !Select [0, !Split [",", !Ref InstanceNames]]
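
One detail of the Lambda code is worth illustrating separately: a single Name tag can match more than one running instance, so every returned ID needs its own name entry. Whether this happens during a Blue/Green switch depends on the runbook, so treat that as an assumption. The sketch below uses a hand-written dictionary shaped like an EC2 DescribeInstances response. The helper name pair_ids_with_name and the instance IDs are hypothetical.

```python
def pair_ids_with_name(response, instance_name):
    """Return one (instance_id, instance_name) pair per instance in the response."""
    return [
        (instance["InstanceId"], instance_name)
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


# Hand-written sample in the shape of a DescribeInstances response (made-up IDs).
sample_response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0aaa"}]},
        {"Instances": [{"InstanceId": "i-0bbb"}]},
    ]
}

pairs = pair_ids_with_name(sample_response, "ws-instance")
print(pairs)
```

Returning pairs instead of two parallel lists removes the risk that the lists drift out of step when they are later combined.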

Infrastructure schema and list of metrics from the ws-deployment namespace:

*CloudWatch EC2 metrics in the custom namespace*

Deployment:

1. Clone the repository (if you haven’t already cloned it).

git clone https://gitlab.com/Andr1500/ssm_runbook_bluegreen.git

2. Put “dummy” metric data into the custom namespace with the AWS CLI, once for each metric. This is necessary because, without this data, the CloudWatch alarms were not created correctly. The example below seeds CPUUtilization; repeat it for the other metrics.

aws cloudwatch put-metric-data \
    --namespace "ws-deployment" \
    --metric-name "CPUUtilization" \
    --dimensions "InstanceName=ws-instance" \
    --value 70 --unit Percent
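
The CLI command above seeds only CPUUtilization. As a hedged alternative, the Python sketch below builds the seed payload for all three default metrics in one place. The helper name build_seed_metric_data and the value 0 for the two status checks are assumptions made for illustration. The actual boto3 call is left commented out because it needs AWS credentials.

```python
def build_seed_metric_data(instance_name, metrics):
    """Build PutMetricData entries for (metric_name, value, unit) tuples."""
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
            "Value": value,
            "Unit": unit,
        }
        for name, value, unit in metrics
    ]


# Seed values are placeholders; 0 for the status checks is an assumption.
SEED_METRICS = [
    ("CPUUtilization", 70, "Percent"),
    ("StatusCheckFailed_System", 0, "Count"),
    ("StatusCheckFailed_Instance", 0, "Count"),
]

metric_data = build_seed_metric_data("ws-instance", SEED_METRICS)
print(len(metric_data))

# To send the data (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="ws-deployment", MetricData=metric_data
# )
```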

3. Create the CloudFormation stack (run the command from the directory that contains ec2_monitoring.yaml).

aws cloudformation create-stack \
    --stack-name ws-ec2-monitoring \
    --template-body file://ec2_monitoring.yaml \
    --capabilities CAPABILITY_NAMED_IAM --disable-rollback

Conclusion:

In this post, we showed how to monitor EC2 instances deployed with the Blue/Green deployment strategy. To make sure that the Lambda function keeps working correctly, we can configure the “InsufficientDataActions” parameter in the CloudWatch alarms so that a notification is sent when an alarm changes to the “Insufficient data” state. If you need more specific CloudWatch metrics from the EC2 instances, here is my post about Monitoring Disk Space (as an example) with the CloudWatch agent.
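
For the “InsufficientDataActions” suggestion above, the template-level change would be to add that property to each AWS::CloudWatch::Alarm resource. The Python sketch below shows the same idea as keyword arguments for the boto3 put_metric_alarm call. The helper name with_insufficient_data_action and the topic ARN are placeholders. The call itself is commented out, since changing a stack-managed alarm outside CloudFormation would cause drift.

```python
def with_insufficient_data_action(alarm_kwargs, topic_arn):
    """Return a copy of the alarm arguments that also notifies on INSUFFICIENT_DATA."""
    updated = dict(alarm_kwargs)
    updated["InsufficientDataActions"] = [topic_arn]
    return updated


# Values mirror the CPU alarm in the template.
cpu_alarm = {
    "AlarmName": "ws-instance - High CPU Usage",
    "Namespace": "ws-deployment",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceName", "Value": "ws-instance"}],
    "Statistic": "Average",
    "Period": 300,
    "EvaluationPeriods": 3,
    "Threshold": 90,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}

# Placeholder ARN for illustration only; substitute the real topic ARN.
topic_arn = "arn:aws:sns:us-east-1:123456789012:blue-green-deployment-notifications"
notifying_alarm = with_insufficient_data_action(cpu_alarm, topic_arn)
print(notifying_alarm["InsufficientDataActions"])

# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**notifying_alarm)
```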



Written by Andrii Shykhov