AWS ECS host auto-scaling with custom CloudWatch metrics and AWS Lambda

How we implemented a custom host auto-scaling strategy that we think might fit most of our and your needs too

Luca Tiozzo
THRON tech blog
Apr 16, 2019


In this two-part series we will describe how we went from human-managed scaling to fully automated scaling of hosts and containers. This first article covers host scaling.

Auto-scaling is a good answer to the very complex task of capacity planning. Most cloud vendors offer built-in techniques to trigger cluster scale-out and scale-in based on common metrics and expected system behavior… unfortunately, that doesn’t always work out as you would expect.

We are using Amazon ECS (Kubernetes is not our cup of tea, yet): it is AWS’s proprietary container orchestration service, which uses EC2 instances as hosts for container clusters. The host fleet can either be scaled manually or, as AWS suggests, automatically through an Auto Scaling Group, which basically binds several EC2 instances together. Each of those EC2 instances runs an ECS agent, and every container started in that cluster automatically registers itself with ECS.

Our goals

Our desire to fully automate scaling was not driven only by the will to limit manual operations, reduce errors and improve reaction times and system availability; it was also a step towards:

  • Horizontal container cluster scaling. Our product has many different services, each with a variable load profile: manually managing the scaling process can be a very complex task. Automated host scaling is a prerequisite for automated service scaling, the ability to dynamically scale the number of containers that deliver a given service, ultimately providing better performance, availability and infrastructure cost efficiency;
  • Developers’ independence from the operations team that manages the infrastructure. Developers can size their cluster based on their knowledge of the service and its implementation, and the infrastructure will simply scale to cope with the demand, granting 100% availability of resources to the developer. This means that deploying new services or service updates becomes much easier and involves fewer people, reducing communication overhead too.

Efficient host auto-scaling is hard

Managing host auto-scaling efficiently is a complex problem: it covers both adding resources to the cluster and the (trickier) task of removing resources from it to optimize costs.

In AWS, scaling events are triggered by CloudWatch (the AWS proprietary monitoring and alerting system) based on thresholds on given metrics. ECS uses simple CPU and RAM metrics to select the host on which to place a new service. ECS also provides CPU and RAM CloudWatch metrics out of the box (how much CPU and RAM is left after subtracting all container-reserved resources), and you might be tempted to use them to manage host scaling events.

The easiest way to do so is to set up four alarms (a sketch follows the list):

  • scale out when available CPU is below a certain threshold
  • scale out when available RAM is below a certain threshold
  • scale in when available CPU is above a certain threshold
  • scale in when available RAM is above a certain threshold
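
To make this concrete, here is a minimal boto3 sketch of those four alarms. The cluster name, scaling-policy ARNs and threshold values are placeholders; it uses the built-in AWS/ECS CPUReservation and MemoryReservation metrics, so “available CPU below a threshold” is expressed as “CPU reservation above a threshold”.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-ecs-cluster"                                 # placeholder cluster name
SCALE_OUT_ARN = "arn:aws:autoscaling:...:scaleOutPolicy"   # placeholder policy ARNs
SCALE_IN_ARN = "arn:aws:autoscaling:...:scaleInPolicy"

# One alarm per (metric, direction) pair. ECS publishes CPUReservation and
# MemoryReservation: the percentage of cluster resources reserved by containers.
alarms = [
    ("cpu-scale-out", "CPUReservation",    "GreaterThanThreshold", 80, SCALE_OUT_ARN),
    ("mem-scale-out", "MemoryReservation", "GreaterThanThreshold", 80, SCALE_OUT_ARN),
    ("cpu-scale-in",  "CPUReservation",    "LessThanThreshold",    40, SCALE_IN_ARN),
    ("mem-scale-in",  "MemoryReservation", "LessThanThreshold",    40, SCALE_IN_ARN),
]

for name, metric, operator, threshold, action in alarms:
    cloudwatch.put_metric_alarm(
        AlarmName=f"{CLUSTER}-{name}",
        Namespace="AWS/ECS",
        MetricName=metric,
        Dimensions=[{"Name": "ClusterName", "Value": CLUSTER}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator=operator,
        AlarmActions=[action],
    )
```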

Since our services’ load is heterogeneous (some are CPU-intensive, some are memory-intensive), we need to trigger scaling on both CPU and RAM; we cannot be lazy and use just one.

The reason why this is not a good design is that adding a new EC2 host increases both available CPU and available RAM at the same time:

if your cluster is starving for CPU but has plenty of free RAM, adding a new host will bring CPU back to normal usage, but your “scale in” RAM rule will trigger and your cluster will keep scaling up and down, never finding a good balance.

Use a custom metric to manage host scaling

The solution is to combine CPU and RAM into a single custom metric that you can easily tweak based on your needs. You can leverage AWS Lambda to compute this metric in a very cost-efficient way (we love serverless for this type of workload). The metric we chose to implement represents the number of “worst case” containers that we can still add to the hosts.

The Lambda function is triggered every minute by CloudWatch to monitor the cluster health: it reads the maximum reserved RAM and CPU across active services and uses the (max RAM, max CPU) couple as the “worst case” container size that might be requested. For each host, the function then checks available RAM and CPU and determines how many “worst case” containers fit into it. The sum of those values is pushed to CloudWatch as a custom metric: a value of 7 means that we can instantiate up to 7 such containers in the cluster.
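
The following is a minimal Python/boto3 sketch of such a function. The cluster name, metric namespace and metric name are illustrative, pagination and the per-call limits of the ECS APIs are omitted, and real task definitions may use memoryReservation rather than a hard memory limit; it shows the shape of the computation, not our exact implementation.

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-ecs-cluster"   # placeholder cluster name

def lambda_handler(event, context):
    # 1. "Worst case" container: the largest CPU and memory reservation among
    #    the active services (start at 1 to avoid dividing by zero when no
    #    reservation is set).
    service_arns = ecs.list_services(cluster=CLUSTER)["serviceArns"]
    services = ecs.describe_services(cluster=CLUSTER, services=service_arns)["services"]
    worst_cpu, worst_mem = 1, 1
    for svc in services:
        task_def = ecs.describe_task_definition(
            taskDefinition=svc["taskDefinition"])["taskDefinition"]
        for container in task_def["containerDefinitions"]:
            worst_cpu = max(worst_cpu, container.get("cpu", 0))
            worst_mem = max(worst_mem, container.get("memory", 0))

    # 2. For every host, how many "worst case" containers still fit into its
    #    remaining (unreserved) CPU units and MiB of memory?
    instance_arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    instances = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=instance_arns)["containerInstances"]
    schedulable = 0
    for instance in instances:
        remaining = {r["name"]: r["integerValue"] for r in instance["remainingResources"]}
        schedulable += min(remaining.get("CPU", 0) // worst_cpu,
                           remaining.get("MEMORY", 0) // worst_mem)

    # 3. Publish the custom metric: a value of 7 means the cluster can still
    #    host 7 "worst case" containers.
    cloudwatch.put_metric_data(
        Namespace="Custom/ECS",   # illustrative namespace
        MetricData=[{
            "MetricName": "SchedulableWorstCaseContainers",
            "Dimensions": [{"Name": "ClusterName", "Value": CLUSTER}],
            "Value": schedulable,
            "Unit": "Count",
        }],
    )
    return schedulable
```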

Thanks to the custom metric we need just two alarms to manage host auto-scaling: one to trigger scale-out and one to trigger scale-in.

Once you realize that the “worst case” changes all the time, it becomes clear that the Lambda function also needs to update the CloudWatch alarms, because the decision to trigger a new host also depends on the size of the “worst case” container. Updating both the metric and the alarms allows developers to remove containers from the cluster without paying attention to their relative “weight” in the cluster.

To prevent false scale-out and scale-in requests, the alarm that triggers a scale-out or scale-in is configured to require two consecutive metric updates outside the threshold (so 120 s pass before the alarm fires).
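
As a sketch of the alarm-maintenance part (same assumptions and placeholder names as above): put_metric_alarm overwrites an alarm that already has the same name, so the Lambda can simply re-issue both alarms with freshly computed thresholds on every run.

```python
def update_alarms(cloudwatch, cluster, scale_out_arn, scale_in_arn,
                  scale_out_threshold, scale_in_threshold):
    # Two alarms on the single custom metric; thresholds are recomputed by the
    # Lambda whenever the "worst case" container changes.
    for name, operator, threshold, action in [
        ("scale-out", "LessThanThreshold",    scale_out_threshold, scale_out_arn),
        ("scale-in",  "GreaterThanThreshold", scale_in_threshold,  scale_in_arn),
    ]:
        cloudwatch.put_metric_alarm(
            AlarmName=f"{cluster}-worst-case-{name}",
            Namespace="Custom/ECS",
            MetricName="SchedulableWorstCaseContainers",
            Dimensions=[{"Name": "ClusterName", "Value": cluster}],
            Statistic="Average",
            Period=60,             # the metric is pushed every minute
            EvaluationPeriods=2,   # two datapoints outside the threshold, about 120 s
            Threshold=threshold,
            ComparisonOperator=operator,
            AlarmActions=[action],
        )
```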

It’s also worth noting that CloudWatch pricing grows with both the number of custom metrics and the number of alarms. It might seem a cheap service, but it quickly adds up to hundreds of dollars per month on complex systems.

It took almost 10 minutes to update the cluster size

Everything is fine on paper, but once we tried the custom Lambda metric under test load we encountered an unexpected behavior: when a service was being scaled (either manually or automatically) there was a long delay between the initial request and the moment the cluster actually increased its host count.

The root cause is related to how ECS retries the action of creating a new container: the first retry happens after 30 s, the second after another 30 s, and then ECS waits about 5 minutes before starting again.

Since the alarm is triggered after 2 minutes, it fires while ECS is sitting in its roughly 5-minute wait before retrying to create the container, wasting precious time. As a solution we added a “reset” call to ECS that is triggered after adding or removing hosts to the cluster. This “reset” feature has been added to the same Lambda function: it collects information about which services are waiting for a new host and forces ECS to retry right away. We have been able to reduce reaction time from 7–8 minutes to about 3–4 minutes. We can still optimize it by working on the frequency of the Lambda invocation and on the alarm trigger rules.
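
The article does not name the exact ECS call behind this “reset”, so the sketch below is only one plausible way to do it: find services that are running fewer tasks than desired and force a new deployment to wake the scheduler immediately.

```python
def reset_pending_services(ecs, cluster):
    # Services whose runningCount is below desiredCount are waiting for capacity.
    # (Pagination of list_services/describe_services omitted for brevity.)
    service_arns = ecs.list_services(cluster=cluster)["serviceArns"]
    for svc in ecs.describe_services(cluster=cluster, services=service_arns)["services"]:
        if svc["runningCount"] < svc["desiredCount"]:
            ecs.update_service(
                cluster=cluster,
                service=svc["serviceName"],
                forceNewDeployment=True,  # assumption: nudges ECS to retry placement right away
            )
```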

Using a single custom metric to manage step scaling too

Our scaling policy was configured to add a fixed number of hosts each time (simple scaling); this was too slow to cope with load bursts, or with our developers pushing many containers into the cluster at once. The situation is made worse by the need for a delay between consecutive host additions: this is required to let ECS instantiate new containers and let the system compute the new scaling metrics.

We still find the simple scaling policy very useful for scale-in: we prefer to remove hosts from the cluster in a slow, controlled way. This sacrifices some cost efficiency but allows better handling of unpredictable peaks that might arrive right after a scale-in, improving performance and reliability.

To fully manage a step scaling policy we need to compute how many hosts should be added to the cluster each time. We can do this by extending the usual “cluster size check” Lambda function: it already collects reserved CPU and RAM, it can also collect which containers are waiting for more resources, and thus it can easily compute how much CPU and RAM are needed to cope with all the requests. EC2 instance size is fixed in our case, so it’s straightforward to define how many instances we need to add at a given time to be able to schedule all the required containers.

We also didn’t want to add a separate metric describing how many hosts to add, so we extended the meaning of the custom metric: positive values indicate how many “worst case” containers can still be scheduled, negative values indicate how many hosts are needed to cope with current demand. A value of -3 means that 3 more EC2 hosts must be added to the cluster.
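
A sketch of how the extended metric could be computed, assuming the Lambda has already summed the CPU and memory reservations of the containers that could not be placed (the hypothetical pending_cpu / pending_mem values below) and knows the fixed size of our EC2 instances:

```python
import math

def signed_cluster_metric(schedulable, pending_cpu, pending_mem, host_cpu, host_mem):
    """Positive: how many "worst case" containers still fit in the cluster.
    Negative: how many hosts must be added to place every pending container."""
    if pending_cpu == 0 and pending_mem == 0:
        return schedulable
    hosts_needed = max(math.ceil(pending_cpu / host_cpu),
                       math.ceil(pending_mem / host_mem))
    return -hosts_needed   # e.g. -3 means "add 3 EC2 hosts"
```

A step scaling policy on the Auto Scaling Group can then map increasingly negative alarm values to increasingly large capacity additions.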

How to drain your hosts to scale in

Scaling down relies entirely on how you “drain” your hosts of all active services so that you can safely remove them. When the Auto Scaling Group receives a “scale in” request, it will simply terminate the host (or hosts) even though services might still be running on it. AWS wrote a guide that explains how to manage host draining. With that approach, when the Auto Scaling Group decides which host must be removed, the host is put in a “draining” state and a Lambda function waits for the draining operation to finish. When the Lambda function terminates, the host is removed by the Auto Scaling Group.
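
Below is a condensed sketch of the draining step described in that guide, with placeholder names: the instance is marked DRAINING so ECS relocates its tasks, and only when no tasks remain is the Auto Scaling Group lifecycle hook allowed to terminate it (the real function is re-triggered by the lifecycle-hook notification rather than polling in a loop).

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "my-ecs-cluster"   # placeholder cluster name

def drain_and_release(container_instance_arn, hook):
    """hook mirrors the fields carried by the lifecycle-hook notification."""
    # Stop new placements and let ECS move existing tasks elsewhere.
    ecs.update_container_instances_state(
        cluster=CLUSTER,
        containerInstances=[container_instance_arn],
        status="DRAINING",
    )
    running = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=[container_instance_arn],
    )["containerInstances"][0]["runningTasksCount"]
    if running == 0:
        # Nothing left on the host: let the Auto Scaling Group terminate it.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=hook["LifecycleHookName"],
            AutoScalingGroupName=hook["AutoScalingGroupName"],
            LifecycleActionToken=hook["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
        )
```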

Conclusions

By adding just two Lambda functions we have been able to manage host auto-scaling in a very efficient way, increasing performance and reducing both infrastructure and management costs.

There are still improvements we plan to make:

  • the “host draining” Lambda function has an active wait. This is not how you want to use serverless functions, and we plan to design a different approach to host draining that does not rely on active waits;
  • the “custom metric” Lambda function assumes that all hosts are the same size (which is true at the moment), but it would be helpful to instantiate different host sizes based on different needs, to optimize costs and overall efficiency;
  • some containers have needs that cannot be solved by auto-scaling alone, such as host-mapped ports, which require a host where that port is still free. The plan is to remove such cases by refactoring the containers.

In the next part we will see how we manage container scaling inside these hosts.

Are you using a different approach for auto-scaling? Are you using a different scaling metric? Feel free to reach out to us to discuss this.
