Monitoring EC2 instances deployed with Blue/Green deployment

Andrii Shykhov
6 min read · Feb 26, 2024


Introduction:

In this post, we set up monitoring for EC2 instances deployed with the Blue/Green deployment strategy. The configuration consists of the following resources:
Lambda function with the necessary permissions, which:

  • gets the instance IDs based on the instance names,
  • gets the metric values for each instance from the AWS/EC2 CloudWatch namespace,
  • enriches the metric data with additional information (the instance name),
  • puts the modified metric data into the custom ws-deployment namespace;

EventBridge schedule rule, which runs the Lambda function every 5 minutes;
CloudWatch alarms, which monitor the EC2 instances based on the metrics from the custom ws-deployment namespace.

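The reason for this configuration is that per-instance metrics in the AWS/EC2 namespace have only InstanceId as a dimension, not InstanceName. With Blue/Green deployment the instance is replaced on each deployment and gets a new ID, so an alarm bound to InstanceId would lose its data source, while the instance name stays the same. More information about the available metrics is in the CloudWatch documentation for EC2; the dimension is also visible in the list-metrics output: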

aws cloudwatch list-metrics --namespace AWS/EC2 --metric-name CPUUtilization
{
    "Metrics": [
        {
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [
                {
                    "Name": "InstanceId",
                    "Value": "i-1234567890abcde"
                }
            ]
        },
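The same check can be done programmatically. The following is a minimal sketch, assuming the list-metrics response is already available as a dictionary; the helper name dimension_names is hypothetical, and the sample data only mirrors the shape of the CLI output above.

```python
import json


def dimension_names(list_metrics_response):
    """Return the set of dimension names used by the metrics in the response."""
    names = set()
    for metric in list_metrics_response.get("Metrics", []):
        for dimension in metric.get("Dimensions", []):
            names.add(dimension["Name"])
    return names


# Sample shaped like the CLI output above (the instance ID is a placeholder).
sample = json.loads("""
{
  "Metrics": [
    {
      "Namespace": "AWS/EC2",
      "MetricName": "CPUUtilization",
      "Dimensions": [{"Name": "InstanceId", "Value": "i-1234567890abcde"}]
    }
  ]
}
""")

if __name__ == "__main__":
    found = dimension_names(sample)
    print("InstanceName available:", "InstanceName" in found)
```

With real data, the dictionary could come from the boto3 CloudWatch client's list_metrics call instead of the embedded sample.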

This post is the third part of a series of posts about Blue/Green deployment on AWS EC2 instances with a Systems Manager Automation runbook; the first part is here, and the second part is here.

About the project:

All infrastructure is created with the CloudFormation template infrastructure/ec2_monitoring.yaml and is deployed independently of the ec2-bluegreen-deployment stack. The Systems Manager Automation runbook creates only one EC2 instance, but the Lambda function can work with many instances: we only need to specify the instance names as a comma-separated list in the “InstanceNames” parameter. Note that the alarms in the template are created only for the first name in that list.
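The comma-separated parameters are split inside the Lambda function, and metric names are paired with metric units by position. The sketch below illustrates that parsing under two assumptions that are not part of the template: empty items are dropped, and a length mismatch between names and units raises an error instead of being silently truncated by zip. The function names are hypothetical.

```python
def split_csv(value):
    """Split a comma-separated parameter value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def pair_metrics_with_units(metric_names_csv, metric_units_csv):
    """Pair each metric name with the unit at the same position."""
    names = split_csv(metric_names_csv)
    units = split_csv(metric_units_csv)
    if len(names) != len(units):
        # Assumption: fail loudly rather than let zip() drop the extra items.
        raise ValueError(
            f"{len(names)} metric names but {len(units)} metric units"
        )
    return list(zip(names, units))


if __name__ == "__main__":
    print(split_csv("ws-instance, ws-instance-2"))
    print(pair_metrics_with_units(
        "CPUUtilization,StatusCheckFailed_Instance,StatusCheckFailed_System",
        "Percent,Count,Count",
    ))
```

A check like this would surface a misconfigured “MetricUnits” parameter at the first invocation rather than as a missing metric later on.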

ec2_monitoring.yaml template:

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudWatch metrics and alarms for monitoring the deployed EC2 instances'

Parameters:
  TransformMetricsLfName:
    Type: String
    Default: 'TransformEc2Metrics'
  CustomNamespace:
    Type: String
    Default: 'ws-deployment'
  InstanceNames:
    Type: String
    Description: 'Comma-separated list of the instance names'
    Default: 'ws-instance'
  MetricNames:
    Type: String
    Description: 'Comma-separated list of the metric names'
    Default: 'CPUUtilization,StatusCheckFailed_Instance,StatusCheckFailed_System'
  MetricUnits:
    Type: String
    Description: 'Comma-separated list of the metric units'
    Default: 'Percent,Count,Count'
  SnsTopicName:
    Type: String
    Default: 'blue-green-deployment-notifications'

Resources:
  #####################################
  # Lambda Function configuration
  #####################################
  TransformEc2MetricsRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: TransformEc2MetricsRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      Policies:
        - PolicyName: TransformEc2Metricspolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:PutMetricData
                  - cloudwatch:GetMetricData
                Resource: '*'

  LambdaLogsGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Retain
    Properties:
      LogGroupName: !Sub '/aws/lambda/${TransformMetricsLfName}'
      RetentionInDays: '7'

  TransformCustomMetrics:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Ref TransformMetricsLfName
      Description: 'transforming and processing metrics from AWS/EC2 namespace'
      Runtime: python3.12
      Handler: index.lambda_handler
      Timeout: 30
      Role: !GetAtt TransformEc2MetricsRole.Arn
      LoggingConfig:
        LogGroup: !Sub '/aws/lambda/${TransformMetricsLfName}'
      Environment:
        Variables:
          instance_names: !Ref InstanceNames
          metric_names: !Ref MetricNames
          metric_units: !Ref MetricUnits
          custom_namespace: !Ref CustomNamespace
      Code:
        ZipFile: |
          import boto3
          import os
          from datetime import datetime, timedelta, timezone

          def lambda_handler(event, context):
              try:
                  instance_names = [name.strip() for name in os.environ['instance_names'].split(',')]
                  metric_names = [metric.strip() for metric in os.environ['metric_names'].split(',')]
                  metric_units = [unit.strip() for unit in os.environ['metric_units'].split(',')]
                  custom_namespace = os.environ['custom_namespace']

                  # Initialize EC2 and CloudWatch clients
                  ec2_client = boto3.client('ec2')
                  cloudwatch_client = boto3.client('cloudwatch')

                  # Get instance IDs based on the instance names
                  instance_ids, instance_names_result = get_instance_ids(instance_names, ec2_client)
                  if not instance_ids:
                      print("[INFO] No instances found for transforming metrics.")
                      return

                  # Get metrics for each instance
                  for instance_id, instance_name in zip(instance_ids, instance_names_result):
                      metrics = get_instance_metrics(instance_id, metric_names, metric_units, instance_name, cloudwatch_client)

                      # Put formatted metrics to custom CloudWatch namespace
                      put_custom_metrics(metrics, custom_namespace, cloudwatch_client)

              except Exception as e:
                  print(f"Error while processing metrics transformation: {str(e)}")

          def get_instance_ids(instance_names, ec2_client):
              instance_ids = []
              instance_names_result = []

              for instance_name in instance_names:
                  response = ec2_client.describe_instances(
                      Filters=[
                          {'Name': 'tag:Name', 'Values': [instance_name]},
                          {'Name': 'instance-state-name', 'Values': ['running']}
                      ]
                  )

                  # Extract instance IDs from the response
                  ids = [instance['InstanceId'] for reservation in response['Reservations'] for instance in reservation['Instances']]

                  # Append one name per instance ID so that both lists stay aligned
                  # when several running instances share the same name
                  for instance_id in ids:
                      instance_ids.append(instance_id)
                      instance_names_result.append(instance_name)

              return instance_ids, instance_names_result

          def get_instance_metrics(instance_id, metric_names, metric_units, instance_name, cloudwatch_client):
              # datetime.utcnow() is deprecated in Python 3.12; use an aware UTC timestamp
              end_time = datetime.now(timezone.utc)
              start_time = end_time - timedelta(minutes=10)

              metrics_dict = {}

              for metric_name, unit in zip(metric_names, metric_units):
                  # Build the query for the necessary metric values
                  id_for_query = metric_name.lower()

                  query = {
                      "Id": id_for_query,
                      "MetricStat": {
                          "Metric": {
                              "Namespace": "AWS/EC2",
                              "MetricName": metric_name,
                              "Dimensions": [
                                  {"Name": "InstanceId", "Value": instance_id}
                              ]
                          },
                          "Period": 300,
                          "Stat": "Average",
                          "Unit": unit
                      },
                      "ReturnData": True
                  }

                  # Request the newest data points first
                  response = cloudwatch_client.get_metric_data(
                      MetricDataQueries=[query],
                      StartTime=start_time,
                      EndTime=end_time,
                      ScanBy='TimestampDescending'
                  )

                  # Extract data from the response
                  metric_data_results = response.get('MetricDataResults', [])

                  if not metric_data_results:
                      print(f"No data available for metric: {metric_name} related to {instance_name}")
                      continue

                  values = metric_data_results[0].get('Values', [])

                  if not values:
                      print(f"No values available for metric: {metric_name} related to {instance_name}")
                      continue

                  # Get the latest value (first item with descending timestamps)
                  latest_value = values[0]

                  # Combine metric name, value, unit, instance name into a dictionary
                  metrics_dict[id_for_query] = {
                      'MetricName': metric_name,
                      'Value': latest_value,
                      'Unit': unit,
                      'InstanceId': instance_id,
                      'InstanceName': instance_name
                  }
              return metrics_dict

          def put_custom_metrics(metrics_dict, custom_namespace, cloudwatch_client):
              for metric_id, metric_info in metrics_dict.items():
                  metric_name = metric_info['MetricName']
                  value = metric_info['Value']
                  dimensions = [
                      {'Name': 'InstanceName', 'Value': metric_info['InstanceName']}
                  ]
                  unit = metric_info['Unit']
                  instance_name = metric_info['InstanceName']

                  response = cloudwatch_client.put_metric_data(
                      Namespace=custom_namespace,
                      MetricData=[
                          {
                              'MetricName': metric_name,
                              'Dimensions': dimensions,
                              'Value': value,
                              'Unit': unit
                          }
                      ]
                  )

                  # Print information about the success or failure of the request
                  if response['ResponseMetadata']['HTTPStatusCode'] == 200:
                      print(f"Successfully put metric data for {metric_name} in {custom_namespace} related to {instance_name}")
                  else:
                      print(f"Failed to put metric data for {metric_name} in {custom_namespace}. Response: {response}")

  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref TransformCustomMetrics
      Principal: events.amazonaws.com
      SourceArn: !GetAtt ScheduleRule.Arn

  ScheduleRule:
    Type: AWS::Events::Rule
    Properties:
      Name: TransformCustomMetricsScheduleRule
      ScheduleExpression: 'rate(5 minutes)'
      Targets:
        - Arn: !GetAtt TransformCustomMetrics.Arn
          Id: TransformCustomMetricsTarget

  #####################################
  # CloudWatch Alarms
  #####################################
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName:
        !Sub
          - '${InstanceName} - High CPU Usage'
          - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
      AlarmDescription: 'High CPU Usage'
      AlarmActions:
        - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      OKActions:
        - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      MetricName: !Select [0, !Split [",", !Ref MetricNames]]
      Unit: !Select [0, !Split [",", !Ref MetricUnits]]
      Namespace: !Ref CustomNamespace
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 90
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceName
          Value: !Select [0, !Split [",", !Ref InstanceNames]]

  SystemStatusAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName:
        !Sub
          - '${InstanceName} - System Status Check Failed'
          - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
      AlarmDescription: 'System Status Check Failed'
      Namespace: !Ref CustomNamespace
      # Index 2 of the default MetricNames list is StatusCheckFailed_System
      MetricName: !Select [2, !Split [",", !Ref MetricNames]]
      Unit: !Select [2, !Split [",", !Ref MetricUnits]]
      Statistic: Minimum
      Period: 300
      EvaluationPeriods: 1
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      AlarmActions:
        - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      OKActions:
        - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      Dimensions:
        - Name: InstanceName
          Value: !Select [0, !Split [",", !Ref InstanceNames]]

  InstanceStatusAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName:
        !Sub
          - '${InstanceName} - Instance Status Check Failed'
          - InstanceName: !Select [0, !Split [",", !Ref InstanceNames]]
      AlarmDescription: 'Instance Status Check Failed'
      Namespace: !Ref CustomNamespace
      # Index 1 of the default MetricNames list is StatusCheckFailed_Instance
      MetricName: !Select [1, !Split [",", !Ref MetricNames]]
      Unit: !Select [1, !Split [",", !Ref MetricUnits]]
      Statistic: Minimum
      Period: 300
      EvaluationPeriods: 1
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      AlarmActions:
        - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      OKActions:
        - !Sub 'arn:${AWS::Partition}:sns:${AWS::Region}:${AWS::AccountId}:${SnsTopicName}'
      Dimensions:
        - Name: InstanceName
          Value: !Select [0, !Split [",", !Ref InstanceNames]]

Infrastructure schema and list of metrics from the ws-deployment namespace:

[Figure: Full infrastructure schema]
[Figure: CloudWatch EC2 metrics in the custom namespace]

Deployment:

1. Clone the repository (if you have not already cloned it).

git clone https://gitlab.com/Andr1500/ssm_runbook_bluegreen.git

2. Put “dummy” metric data into the custom namespace with the AWS CLI for each metric. This is necessary because without this data the CloudWatch alarms are not created correctly. The command below covers CPUUtilization; repeat it for the other metrics.

aws cloudwatch put-metric-data \
--namespace "ws-deployment" \
--metric-name "CPUUtilization" \
--dimensions "InstanceName=ws-instance" \
--value 70 --unit Percent
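Since the dummy data is needed for every metric, the step can also be scripted. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the default value of 0 is an assumption chosen so that the status-check alarms (threshold greater than 0) are not triggered by the seed data, and boto3 is imported lazily so the builder can be tried without AWS access.

```python
def build_dummy_metric_data(instance_names, metric_names, metric_units, value=0.0):
    """Build one MetricData entry per (instance name, metric) combination."""
    entries = []
    for instance_name in instance_names:
        for metric_name, unit in zip(metric_names, metric_units):
            entries.append({
                "MetricName": metric_name,
                "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
                "Value": value,
                "Unit": unit,
            })
    return entries


def seed_namespace(namespace, entries):
    """Send the entries to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # lazy import: only needed when data is actually sent
    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=entries)


if __name__ == "__main__":
    entries = build_dummy_metric_data(
        ["ws-instance"],
        ["CPUUtilization", "StatusCheckFailed_Instance", "StatusCheckFailed_System"],
        ["Percent", "Count", "Count"],
    )
    for entry in entries:
        print(entry["MetricName"], entry["Unit"])
    # seed_namespace("ws-deployment", entries)  # uncomment to send the data
```

The names, units, and namespace in the usage part are the template defaults; with other parameter values they would need to match what is passed to the stack.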

3. Create CloudFormation stack.

aws cloudformation create-stack \
--stack-name ws-ec2-monitoring \
--template-body file://ec2_monitoring.yaml \
--capabilities CAPABILITY_NAMED_IAM --disable-rollback

Conclusion:

In this post, we showed how to monitor EC2 instances deployed with the Blue/Green deployment strategy. To be sure that the Lambda function works correctly, we can configure the “InsufficientDataActions” parameter in the CloudWatch alarms to send notifications when an alarm changes its state to “Insufficient data”. If you need more specific CloudWatch metrics from the EC2 instances, here is my post about monitoring disk space (as an example) with the CloudWatch agent.
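In the template, this means adding an InsufficientDataActions property next to AlarmActions and OKActions in each alarm. To show the resulting alarm definition, here is a rough sketch in Python that builds the arguments for the boto3 put_metric_alarm call, mirroring the CPU alarm. The function names are hypothetical, and the region and account ID in the usage part are placeholders. Applying it with boto3 to an alarm managed by the stack would cause drift, so it is meant as an illustration only.

```python
def sns_topic_arn(partition, region, account_id, topic_name):
    """Build the topic ARN in the same form as the template's !Sub expression."""
    return f"arn:{partition}:sns:{region}:{account_id}:{topic_name}"


def cpu_alarm_kwargs(instance_name, topic_arn, namespace="ws-deployment"):
    """Arguments for put_metric_alarm: the CPU alarm plus InsufficientDataActions."""
    return {
        "AlarmName": f"{instance_name} - High CPU Usage",
        "AlarmDescription": "High CPU Usage",
        "Namespace": namespace,
        "MetricName": "CPUUtilization",
        "Unit": "Percent",
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": 90,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Dimensions": [{"Name": "InstanceName", "Value": instance_name}],
        "AlarmActions": [topic_arn],
        "OKActions": [topic_arn],
        # Notify when the alarm has no data, e.g. the Lambda stopped publishing
        "InsufficientDataActions": [topic_arn],
    }


if __name__ == "__main__":
    # Placeholder region and account ID, for illustration only.
    arn = sns_topic_arn("aws", "eu-central-1", "123456789012",
                        "blue-green-deployment-notifications")
    kwargs = cpu_alarm_kwargs("ws-instance", arn)
    print(kwargs["InsufficientDataActions"])
    # With AWS credentials: boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```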


Andrii Shykhov

DevOps engineer: AWS, Infrastructure as Code, CI/CD pipelines