The “last mile” of ECS cluster automation

Luca Persegani
Growens Innovation Blog
Jan 9, 2020

A “fire and forget” solution for seamless management of an ECS cluster’s instances.

When ECS was released back in 2015, it was described as a “managed” solution to run Docker containers in the AWS Cloud (before Fargate kicked in).
Although ECS is a powerful, feature-rich service and is widely used by the company where I work today, it still has some drawbacks. One of them is the way it handles updating and replacing EC2 instances under the hood.

OK, let’s assume we already have an Elastic Container Service (ECS) cluster, deployed with CloudFormation and managed by an Auto Scaling Group (ASG) that is triggered by CloudWatch alarms.

Problem

Now, what happens when the cluster needs to scale in?
Well, the Auto Scaling Group does what it is designed to do: it “randomly” chooses an EC2 instance and terminates it.

“What’s wrong with this approach?”, you may ask, given that all the other ecs-agents are aware of the sudden death of the node and will start the same number of Docker containers elsewhere.
Well, “terminate”, as the word itself says, is not a graceful operation: although this approach may be fine for a development environment, it can be dangerous for a production one.

And what about updating the EC2 AMI in the ASG’s Launch Configuration?
In some cases this process requires replacing all the instances, which risks causing a “chain reaction”.

Solution (hard way)

Since the World Wide Web is a wonderful place and the wheel doesn’t need to be re-invented every time, the first thing I did was search the Web for a solution… and what I found was an amazing blog post in which the author explains how to use a Lambda function, triggered by an ASG Lifecycle Hook, to “Automate Container Instance Draining in Amazon ECS”.

So I modified our CloudFormation template by adding the following resources and adapting the code to our specific case:

The IAM Role assigned to the ASG’s LifecycleHook, which allows the ASG to publish notifications when any instance enters a “wait” state.

IAM Role assigned to the ASG’s LifecycleHook

The Lifecycle hook that pauses the instance as the ASG terminates it.

Lifecycle hook
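As a rough illustration of the hook described above, here is a minimal sketch that builds the CloudFormation resource as a Python dictionary. The logical IDs passed in (`EcsAutoScalingGroup`, `DrainTopic`, `HookRole`) and the 900-second timeout are assumptions made for the example, not values taken from our template.

```python
import json


def lifecycle_hook_resource(asg_id, topic_id, role_id, heartbeat_timeout=900):
    """Build a termination lifecycle hook as a CloudFormation resource dict.

    The three *_id arguments are logical IDs of other resources in the same
    template; the names used by callers are hypothetical.
    """
    return {
        "Type": "AWS::AutoScaling::LifecycleHook",
        "Properties": {
            "AutoScalingGroupName": {"Ref": asg_id},
            # Pause instances when the ASG is about to terminate them.
            "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
            # Seconds the instance may stay in the wait state (assumed value).
            "HeartbeatTimeout": heartbeat_timeout,
            # What happens when the timeout expires without a response;
            # for a terminating instance, termination proceeds either way.
            "DefaultResult": "ABANDON",
            # Ref on an SNS topic resolves to the topic ARN.
            "NotificationTargetARN": {"Ref": topic_id},
            "RoleARN": {"Fn::GetAtt": [role_id, "Arn"]},
        },
    }


hook = lifecycle_hook_resource("EcsAutoScalingGroup", "DrainTopic", "HookRole")
hook_json = json.dumps({"TerminationHook": hook}, indent=2)
```

The resulting `hook_json` string can be pasted under the `Resources` section of a JSON template, or converted to YAML if that is the format in use.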

The IAM Role assumed by the Lambda function, which enables it to manage Auto Scaling lifecycle actions, write logs to CloudWatch, get information about the status of the Docker containers, and publish messages to the dedicated SNS topic.

IAM Role assumed by the Lambda function
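To make the permissions listed above concrete, the following sketch assembles a policy document for the function's role. The exact set of actions is an assumption based on what a draining function typically calls, and the `"*"` resources are deliberately broad for brevity; a real policy would probably be scoped more tightly.

```python
def drain_lambda_policy(topic_arn):
    """Return an IAM policy document for a hypothetical draining Lambda."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Tell the ASG that the paused instance may now terminate.
                "Effect": "Allow",
                "Action": ["autoscaling:CompleteLifecycleAction"],
                "Resource": "*",
            },
            {   # Write the function's logs to CloudWatch Logs.
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",
            },
            {   # Find the container instance, drain it, count its tasks.
                "Effect": "Allow",
                "Action": [
                    "ecs:ListContainerInstances",
                    "ecs:DescribeContainerInstances",
                    "ecs:UpdateContainerInstancesState",
                    "ecs:ListTasks",
                ],
                "Resource": "*",
            },
            {   # Re-publish the notification while tasks are still running.
                "Effect": "Allow",
                "Action": ["sns:Publish"],
                "Resource": topic_arn,
            },
        ],
    }


# Placeholder ARN, used only to show the shape of the output.
policy = drain_lambda_policy("arn:aws:sns:eu-west-1:123456789012:drain-topic")
```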

The SNS topic collecting all messages coming from both Lifecycle Hooks and Lambda executions

SNS topic

The Lambda permission that allows the SNS topic to invoke the Lambda

Lambda permission
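A permission of this kind could look roughly like the sketch below, again expressed as a Python dictionary. The logical IDs `DrainFunction` and `DrainTopic` are hypothetical names chosen for the example.

```python
def sns_invoke_permission(function_id, topic_id):
    """Build a resource that lets one SNS topic invoke one Lambda function."""
    return {
        "Type": "AWS::Lambda::Permission",
        "Properties": {
            "Action": "lambda:InvokeFunction",
            # Ref on a Lambda function resolves to the function name.
            "FunctionName": {"Ref": function_id},
            "Principal": "sns.amazonaws.com",
            # Restrict the permission to this topic only.
            "SourceArn": {"Ref": topic_id},
        },
    }


permission = sns_invoke_permission("DrainFunction", "DrainTopic")
```

Restricting `SourceArn` to the dedicated topic keeps other topics in the account from triggering the function.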

And finally the Lambda function itself.

Lambda function
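The core of such a function can be sketched as follows. This is a simplified illustration of the draining loop, not the code we actually deployed: the `ECS_CLUSTER` environment variable is a hypothetical name, pagination of container instances is ignored, and a real function would probably wait a few seconds before re-publishing the message. The AWS clients are passed in as arguments so the logic can be exercised without AWS access.

```python
import json
import os


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance backed by the EC2 instance, or None."""
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance
    return None


def handle_termination(message, cluster, topic_arn, ecs, autoscaling, sns):
    """Drain the terminating instance; return 'waiting' or 'completed'."""
    instance = find_container_instance(ecs, cluster, message["EC2InstanceId"])
    if instance is not None:
        arn = instance["containerInstanceArn"]
        if instance["status"] != "DRAINING":
            # Ask ECS to stop placing tasks here and move existing ones away.
            ecs.update_container_instances_state(
                cluster=cluster, containerInstances=[arn], status="DRAINING"
            )
        tasks = ecs.list_tasks(cluster=cluster, containerInstance=arn)["taskArns"]
        if tasks:
            # Tasks are still running: re-publish the same message so the
            # function is invoked again to re-check.
            sns.publish(
                TopicArn=topic_arn,
                Message=json.dumps(message),
                Subject="Draining in progress",
            )
            return "waiting"
    # Nothing left to drain: let the ASG proceed with the termination.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=message["LifecycleHookName"],
        AutoScalingGroupName=message["AutoScalingGroupName"],
        LifecycleActionResult="CONTINUE",
        InstanceId=message["EC2InstanceId"],
    )
    return "completed"


def lambda_handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime

    record = event["Records"][0]["Sns"]
    message = json.loads(record["Message"])
    if message.get("LifecycleTransition") != "autoscaling:EC2_INSTANCE_TERMINATING":
        return "ignored"  # e.g. the test notification sent for a new hook
    return handle_termination(
        message,
        cluster=os.environ["ECS_CLUSTER"],  # hypothetical variable name
        topic_arn=record["TopicArn"],
        ecs=boto3.client("ecs"),
        autoscaling=boto3.client("autoscaling"),
        sns=boto3.client("sns"),
    )
```

If draining never finishes, the hook's heartbeat timeout eventually expires and the ASG terminates the instance anyway, so the loop cannot run forever.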

Here is a schematic visualization of the process explained above.

Now take a break, double check your changes, create a change-set for your stack, review again and apply it.

Now that the draining phase is automated, we can safely add the following “Update Policy” to the ASG in our CloudFormation template, so that it can handle the EC2 AMI update.

This piece of configuration defines how the ASG will replace the EC2 instances in case any change is detected in its Launch Configuration.
Obviously the process described in the first step will intervene to make this operation graceful.
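For readers who have not used the attribute before, a rolling update policy could be sketched like this. The batch size, the minimum number of instances in service, the pause time and the list of suspended processes are illustrative assumptions and should be tuned to the size of the cluster.

```python
def rolling_update_policy(min_in_service=1, batch_size=1, pause_time="PT5M"):
    """Build an UpdatePolicy attribute for an Auto Scaling Group resource."""
    return {
        "AutoScalingRollingUpdate": {
            # Instances that must stay in service while a batch is replaced.
            "MinInstancesInService": min_in_service,
            # How many instances are replaced at a time.
            "MaxBatchSize": batch_size,
            # ISO 8601 duration to wait between batches.
            "PauseTime": pause_time,
            # Processes paused during the update so that alarms and health
            # checks do not interfere with the replacement.
            "SuspendProcesses": [
                "HealthCheck",
                "ReplaceUnhealthy",
                "AZRebalance",
                "AlarmNotification",
                "ScheduledActions",
            ],
        }
    }


update_policy = rolling_update_policy(min_in_service=2, batch_size=1)
```

The dictionary goes next to `Type` and `Properties` on the ASG resource, under the `UpdatePolicy` key. Note that the `Terminate` process is not suspended, so the lifecycle hook still gets a chance to drain each instance.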

Note: Be sure to apply this change on its own. Otherwise a rollback could lead to unpredictable behavior, since CloudFormation will revert your changes without honoring the new Update Policy.

Good, we have finally reached our two main goals. It’s time to spruce things up!

A possible optimization is to modify the CloudFormation template once again, changing the AMI reference from a static parameter to a dynamic one. But which one?
Have you ever heard of AWS Systems Manager Parameter Store?
In short, it is a key-value store integrated with many AWS services, CloudFormation included.

AWS itself keeps the value of this parameter up to date, so that it always refers to the latest AMI version.
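A template parameter of this kind might look like the sketch below. It assumes the cluster runs the Amazon Linux 2 ECS-optimized AMI; other AMI families are published under different Parameter Store paths.

```python
def ecs_ami_parameter():
    """Build a template parameter that resolves the AMI ID via Parameter Store."""
    return {
        # CloudFormation reads the value from Parameter Store on each
        # stack create or update instead of taking a literal AMI ID.
        "Type": "AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>",
        # Public parameter maintained by AWS (Amazon Linux 2 is assumed).
        "Default": "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id",
    }


ami_parameter = ecs_ami_parameter()
```

The Launch Configuration can then reference the parameter with a plain `Ref`, exactly as it would a static AMI ID.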

MISSION ACCOMPLISHED!
From now on, every time an EC2 instance within your cluster needs to be terminated, a fully automated procedure will take care of it.
You can therefore focus on that new managed Kubernetes cluster you are planning to migrate to. 😁

Solution (easy way)

Be brave and wait a little longer until AWS decides to natively support this process:
https://github.com/aws/containers-roadmap/issues/256

Improvements

Are you an extreme automation addict?
What do you think of automating the automation?
If you like the idea, you can set up a CloudWatch rule that is triggered every time AWS updates the AMI version stored in the Parameter Store.
This CloudWatch rule then triggers the simplest Lambda ever, which initiates a CloudFormation Stack Update where the only mandatory parameter is the stack name itself.
But be careful… this could happen anytime, even that night you planned to enjoy watching Netflix.
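A Lambda of that kind could be as small as the following sketch. The `STACK_NAME` and `PARAMETER_KEYS` environment variables are hypothetical names, and the `CAPABILITY_IAM` capability is assumed because the stack described here creates IAM roles. The CloudFormation client is injected so the logic can be tried without AWS access.

```python
import os


def refresh_stack(cloudformation, stack_name, parameter_keys=()):
    """Re-run a stack update with the same template and parameter values.

    Parameters backed by Parameter Store are resolved again during the
    update, so a newer AMI ID is picked up without editing the template.
    """
    response = cloudformation.update_stack(
        StackName=stack_name,
        UsePreviousTemplate=True,
        # Keep every listed parameter at its current setting.
        Parameters=[
            {"ParameterKey": key, "UsePreviousValue": True}
            for key in parameter_keys
        ],
        Capabilities=["CAPABILITY_IAM"],  # assumed: the stack manages IAM roles
    )
    return response["StackId"]


def lambda_handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime

    # Comma-separated list of parameter names; empty entries are dropped.
    keys = [k for k in os.environ.get("PARAMETER_KEYS", "").split(",") if k]
    return refresh_stack(
        boto3.client("cloudformation"), os.environ["STACK_NAME"], keys
    )
```

When nothing has changed, CloudFormation rejects the update with a "No updates are to be performed" validation error, which a production version of this function would want to catch.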

Source / Documentation

Automate Container Instance Draining in Amazon ECS
CloudFormation “Update Policy” attribute
Set Up Notifications or Trigger Actions Based on Parameter Store Events
AWS Container Roadmap issue
