Automating Container Instance Draining in Amazon ECS Using Lambda and an ASG Lifecycle Hook

dushyant chouhan
Oct 1, 2019

If you use Amazon Elastic Container Service (ECS), you may have run into this issue: when ECS cluster autoscaling terminates an ECS instance that still has tasks running, those tasks are lost or rescheduled onto other available ECS instances, which leads to some downtime.

In this post we discuss and implement a way to migrate tasks automatically, with zero downtime, before ECS cluster autoscaling fully terminates the instance. We will use the following AWS services:

  • AWS Auto Scaling lifecycle hook
  • AWS Simple Notification Service (SNS)
  • AWS Lambda
  • AWS IAM roles

1. AWS IAM roles

We need to create the following two IAM roles:

  • The first role is for Auto Scaling. Choose “EC2 Auto Scaling” as the role entity and attach the “AutoScalingNotificationAccessRole” permission, which grants access to the SNS topic. This role lets Auto Scaling publish to the SNS topic whenever a terminate lifecycle hook fires.
  • The second role is for Lambda. Choose “Lambda” as the role entity and attach the permissions below, which grant access to CloudWatch Logs, the ECS cluster instances, and the Auto Scaling lifecycle action. The policy also includes “sns:Publish”, because the function in step 3 republishes its message to the SNS topic while tasks are still draining.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "autoscaling:CompleteLifecycleAction",
                "ecs:UpdateContainerInstancesState",
                "ecs:ListContainerInstances",
                "ecs:DescribeContainerInstances",
                "sns:Publish",
                "logs:CreateLogGroup",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

2. AWS Simple Notification Service (SNS) topic

  • We need to create an SNS topic for the cluster autoscaling lifecycle hook. Whenever the terminate lifecycle hook fires on a scale-in event, the notification published to this topic triggers the Lambda function.
  • You can add other subscribers if you also want a notification whenever the lifecycle hook fires.
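
The wiring between the topic and the function can also be scripted. The following is a minimal sketch, assuming boto3 clients for SNS and Lambda are passed in and that the Lambda function from step 3 already exists; the helper name `wire_topic_to_lambda` and the statement id are hypothetical, not part of the original setup.

```python
def wire_topic_to_lambda(sns, lambda_client, topic_name, function_arn):
    """Create the topic, let SNS invoke the function, and subscribe the function.

    `sns` and `lambda_client` are expected to be boto3 clients
    (boto3.client('sns') and boto3.client('lambda')).
    """
    # create_topic returns the existing topic ARN if the name is already taken.
    topic_arn = sns.create_topic(Name=topic_name)['TopicArn']

    # Resource-based permission so that this topic may invoke the function.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId='AllowInvokeFromDrainTopic',  # hypothetical statement id
        Action='lambda:InvokeFunction',
        Principal='sns.amazonaws.com',
        SourceArn=topic_arn,
    )

    # Subscribe the function so that every published message triggers it.
    sns.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)
    return topic_arn


# Example call (placeholders, not real resources):
#   import boto3
#   wire_topic_to_lambda(boto3.client('sns'), boto3.client('lambda'),
#                        'ecs-drain-topic', '<LAMBDA_FUNCTION_ARN>')
```

Passing the clients in as arguments keeps the helper easy to exercise with stand-in objects before pointing it at a real account. Note that `add_permission` fails if a statement with the same id already exists on the function.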

3. AWS Lambda function

  • Create a Lambda function with the Lambda role created in step 1, and add the SNS topic from step 2 as its trigger so that lifecycle notifications invoke it.
  • Use the Python code below (Python 3.7). It first sets the ECS instance to DRAINING so that its tasks move elsewhere. While tasks are still running, the function republishes the message to the SNS topic to invoke itself again. Once the task count on the instance reaches zero, it calls the boto3 “complete_lifecycle_action” function so that cluster autoscaling can terminate the instance.
import json
import time

import boto3

ECS = boto3.client('ecs')
ASG = boto3.client('autoscaling')
SNS = boto3.client('sns')

TERMINATING = 'autoscaling:EC2_INSTANCE_TERMINATING'


def find_ecs_instance_info(instance_id, cluster):
    """Return (container instance ARN, status, running task count) for an EC2 instance."""
    paginator = ECS.get_paginator('list_container_instances')
    for list_resp in paginator.paginate(cluster=cluster):
        arns = list_resp['containerInstanceArns']
        if not arns:
            # describe_container_instances rejects an empty list.
            continue
        desc_resp = ECS.describe_container_instances(cluster=cluster,
                                                     containerInstances=arns)
        for container_instance in desc_resp['containerInstances']:
            if container_instance['ec2InstanceId'] != instance_id:
                continue
            print('Found instance: id=%s, arn=%s, status=%s, runningTasksCount=%s' %
                  (instance_id, container_instance['containerInstanceArn'],
                   container_instance['status'], container_instance['runningTasksCount']))
            return (container_instance['containerInstanceArn'],
                    container_instance['status'], container_instance['runningTasksCount'])
    # The instance is not registered in this cluster.
    return None, None, 0


def instance_has_running_tasks(instance_id, cluster):
    """Set the instance to DRAINING if needed and report whether tasks remain on it."""
    (instance_arn, container_status, running_tasks) = find_ecs_instance_info(instance_id, cluster)
    if instance_arn is None:
        print('Could not find instance ID %s. Letting autoscaling kill the instance.' %
              (instance_id))
        return False
    if container_status != 'DRAINING':
        print('Setting container instance %s (%s) to DRAINING' %
              (instance_id, instance_arn))
        ECS.update_container_instances_state(cluster=cluster,
                                             containerInstances=[instance_arn],
                                             status='DRAINING')
    return running_tasks > 0


def lambda_handler(event, context):
    sns_record = event['Records'][0]['Sns']
    msg = json.loads(sns_record['Message'])
    if TERMINATING not in msg.get('LifecycleTransition', ''):
        print('Exiting since the lifecycle transition is not EC2_INSTANCE_TERMINATING.')
        return
    # Read the cluster name only after the check above: other messages on the
    # topic may not carry NotificationMetadata.
    clustername = msg['NotificationMetadata']
    if instance_has_running_tasks(msg['EC2InstanceId'], clustername):
        print('Tasks are still running on instance %s; posting msg to SNS topic %s' %
              (msg['EC2InstanceId'], sns_record['TopicArn']))
        # Wait briefly, then republish the same message to invoke this function again.
        time.sleep(5)
        sns_resp = SNS.publish(TopicArn=sns_record['TopicArn'],
                               Message=json.dumps(msg),
                               Subject='Publishing SNS msg to invoke Lambda again.')
        print('Posted msg %s to SNS topic.' % (sns_resp['MessageId']))
    else:
        print('No tasks are running on instance %s; setting lifecycle to complete' %
              (msg['EC2InstanceId']))
        # Tell autoscaling it may proceed with terminating the instance.
        ASG.complete_lifecycle_action(LifecycleHookName=msg['LifecycleHookName'],
                                      AutoScalingGroupName=msg['AutoScalingGroupName'],
                                      LifecycleActionResult='CONTINUE',
                                      InstanceId=msg['EC2InstanceId'])
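
To see which fields the handler depends on, it can help to build an event by hand. The sketch below is an illustration only: the helper names `build_sns_event` and `parse_lifecycle_message` are hypothetical, and every value in `SAMPLE_MESSAGE` is made up. Only the key names are taken from the handler code above.

```python
import json

TERMINATING = 'autoscaling:EC2_INSTANCE_TERMINATING'

# Hypothetical lifecycle message; only the keys the handler reads are included.
SAMPLE_MESSAGE = {
    'LifecycleTransition': TERMINATING,
    'LifecycleHookName': 'ecs-drain-hook',
    'AutoScalingGroupName': 'ecs-cluster-asg',
    'EC2InstanceId': 'i-0123456789abcdef0',
    'NotificationMetadata': 'my-ecs-cluster',  # the ECS cluster name
}


def build_sns_event(message, topic_arn):
    """Wrap a message dict the way SNS delivers it to Lambda (JSON string in 'Message')."""
    return {'Records': [{'Sns': {'TopicArn': topic_arn,
                                 'Message': json.dumps(message)}}]}


def parse_lifecycle_message(event):
    """Return the fields the handler uses, or None for non-terminating messages."""
    msg = json.loads(event['Records'][0]['Sns']['Message'])
    if TERMINATING not in msg.get('LifecycleTransition', ''):
        return None
    return {
        'instance_id': msg['EC2InstanceId'],
        'cluster': msg['NotificationMetadata'],
        'hook': msg['LifecycleHookName'],
        'asg': msg['AutoScalingGroupName'],
    }


event = build_sns_event(SAMPLE_MESSAGE, 'arn:aws:sns:us-east-1:123456789012:ecs-drain-topic')
print(parse_lifecycle_message(event))
```

An event built this way can also be pasted into a Lambda test event to exercise the function, provided the instance id and cluster name are replaced with real ones.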

4. Cluster auto scaling lifecycle hook

Autoscaling provides two types of lifecycle hook: launch and terminate. They give you some time to perform an activity on an instance before autoscaling finishes launching or terminating it.

We use the AWS CLI command below to create the terminate lifecycle hook, because the AWS console does not offer an option to add an SNS topic.

$ aws autoscaling put-lifecycle-hook --lifecycle-hook-name <ASG_LIFECYCLE_HOOK_NAME> --auto-scaling-group-name <ECS_CLUSTER_AUTOSCALING_NAME> --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING --notification-target-arn <SNS_TOPIC_ARN> --role-arn <IAM__AUTOSCALING_ROLE_ARN> --notification-metadata <ECS_CLUSTER_NAME> --heartbeat-timeout <HEARTBEAT_TIMEOUT_IN_SECONDS>

This AWS CLI command uses the following variables:

  • ASG_LIFECYCLE_HOOK_NAME : any name you want to give the lifecycle hook.
  • ECS_CLUSTER_AUTOSCALING_NAME : name of the ECS cluster's Auto Scaling group.
  • SNS_TOPIC_ARN : ARN of the SNS topic created in step 2.
  • IAM__AUTOSCALING_ROLE_ARN : ARN of the autoscaling role created in step 1.
  • ECS_CLUSTER_NAME : ECS cluster name, which the Lambda function reads from the notification metadata.
  • HEARTBEAT_TIMEOUT_IN_SECONDS : autoscaling heartbeat timeout, i.e. the maximum time, in seconds, that can elapse before the lifecycle hook times out. The range is from 30 to 7200 seconds. The default value is 3600 seconds (1 hour).

With all of these steps in place, AWS Lambda automatically drains the tasks from an ECS instance before it is terminated.
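
The same hook can be created from Python with the boto3 `put_lifecycle_hook` call. This is a sketch under assumptions rather than part of the original setup: the helper names `lifecycle_hook_params` and `create_terminate_hook` are hypothetical, and the range check simply mirrors the 30 to 7200 second limit described above.

```python
def lifecycle_hook_params(hook_name, asg_name, topic_arn, role_arn,
                          cluster_name, heartbeat_timeout=3600):
    """Build the keyword arguments for autoscaling.put_lifecycle_hook."""
    if not 30 <= heartbeat_timeout <= 7200:
        raise ValueError('heartbeat_timeout must be between 30 and 7200 seconds')
    return {
        'LifecycleHookName': hook_name,
        'AutoScalingGroupName': asg_name,
        'LifecycleTransition': 'autoscaling:EC2_INSTANCE_TERMINATING',
        'NotificationTargetARN': topic_arn,    # SNS topic from step 2
        'RoleARN': role_arn,                   # autoscaling role from step 1
        'NotificationMetadata': cluster_name,  # read by the Lambda function
        'HeartbeatTimeout': heartbeat_timeout,
    }


def create_terminate_hook(**kwargs):
    """Create the hook in AWS; needs boto3 and valid credentials."""
    import boto3  # imported here so the builder above works without boto3
    client = boto3.client('autoscaling')
    return client.put_lifecycle_hook(**lifecycle_hook_params(**kwargs))


# Example call (placeholders, not real resources):
#   create_terminate_hook(hook_name='<ASG_LIFECYCLE_HOOK_NAME>',
#                         asg_name='<ECS_CLUSTER_AUTOSCALING_NAME>',
#                         topic_arn='<SNS_TOPIC_ARN>',
#                         role_arn='<IAM__AUTOSCALING_ROLE_ARN>',
#                         cluster_name='<ECS_CLUSTER_NAME>',
#                         heartbeat_timeout=900)
```

Choose a heartbeat timeout long enough for the slowest task to stop, since the instance is terminated when the hook times out even if draining has not finished.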

Thanks for reading.
