Autoscaling in AWS Part 3: Autoscale EC2 machines

Santi Muñoz
Signaturit Tech Blog
6 min read · Sep 30, 2018

This is the last and most challenging part of a series about autoscaling in AWS. In this post we will talk about scaling the machines that back the ECS clusters.

As mentioned in Part 1, our infrastructure is based on ECS clusters provisioned with machines from two Autoscaling Groups per cluster. When scaling, we will add and remove machines only from the SPOT Autoscaling Group of each cluster.

Scale Out machines

In ECS we can configure the memory and CPU reservation per container, which gives us a global view of how many resources the tasks deployed in each cluster are using.

When a cluster is reaching its maximum available resources, we should add more machines. To do that we will create a Lambda that continually checks the available resources in each cluster and decides if it needs to scale out.

But the question here is: how many resources should be left in a cluster before we scale it out?

As a metric we will use the number of copies of the largest container that could still be scheduled on each machine with the currently available resources. The idea of using this metric comes from Philipp Garbe’s article called A better solution to ECS AutoScaling. If you have already read it you will notice that the implementation described in this post is a little bit different; in the end, each company has different requirements and approaches.

This metric will be based on the memory reservation of each service, since our services don’t use the CPU reservation option.

For each machine in the cluster we calculate how many copies of the largest container could be scheduled with the remaining resources; if the whole cluster cannot place more than N copies of the largest container, it needs to scale out.

The starting point is to collect some data from the clusters. We will use the following info to make the calculations and decisions that follow:

Cluster’s data

When looking for the largest container, it’s important to make sure that it’s not a DAEMON service.
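The snippet below is a simplified sketch of that data collection, written with the AWS SDK for JavaScript (the ClusterData shape and the function name are illustrative, and pagination and error handling are omitted):

```typescript
import { ECS } from 'aws-sdk';

const ecs = new ECS();

// Illustrative shape of the data gathered per cluster.
interface ClusterData {
  instances: ECS.ContainerInstance[]; // ACTIVE container instances
  largestContainerMemory: number;     // memory reservation of the largest non-DAEMON container
}

async function collectClusterData(cluster: string): Promise<ClusterData> {
  // Only ACTIVE instances can schedule new containers.
  const { containerInstanceArns = [] } = await ecs
    .listContainerInstances({ cluster, status: 'ACTIVE' })
    .promise();
  const { containerInstances = [] } = await ecs
    .describeContainerInstances({ cluster, containerInstances: containerInstanceArns })
    .promise();

  // Find the largest memory reservation among the cluster's services,
  // skipping DAEMON services, which run on every instance anyway.
  const { serviceArns = [] } = await ecs.listServices({ cluster }).promise();
  const { services = [] } = await ecs
    .describeServices({ cluster, services: serviceArns })
    .promise();

  let largestContainerMemory = 0;
  for (const service of services) {
    if (service.schedulingStrategy === 'DAEMON') continue;
    const { taskDefinition } = await ecs
      .describeTaskDefinition({ taskDefinition: service.taskDefinition! })
      .promise();
    for (const container of taskDefinition?.containerDefinitions || []) {
      largestContainerMemory = Math.max(
        largestContainerMemory,
        container.memoryReservation || container.memory || 0
      );
    }
  }

  return { instances: containerInstances, largestContainerMemory };
}
```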

The method below decides if the cluster needs to scale out. Notice that we ignore the DRAINING instances, because they are not scheduling containers.
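A sketch of that decision, reusing the ClusterData shape from the previous snippet (the threshold of copies to keep free is an example value):

```typescript
// Copies of the largest container we always want to be able to place;
// below this number the cluster scales out (example value, tune per cluster).
const SCALE_OUT_THRESHOLD = 2;

function needsScaleOut(data: ClusterData): boolean {
  // DRAINING instances were already filtered out while collecting the data,
  // since they do not schedule new containers.
  let schedulableContainers = 0;
  for (const instance of data.instances) {
    const remainingMemory =
      instance.remainingResources?.find((r) => r.name === 'MEMORY')?.integerValue || 0;
    // How many copies of the largest container still fit on this machine.
    schedulableContainers += Math.floor(remainingMemory / data.largestContainerMemory);
  }
  return schedulableContainers < SCALE_OUT_THRESHOLD;
}
```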

Then, if we decide that the cluster should scale out, we will publish a custom metric to CloudWatch called clusterName-needs-scaleOut.

For this metric we will create an alarm that will be triggered when its value reaches 1.
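A sketch of both pieces with the AWS SDK (the namespace and the way the alarm is wired to the Scaling Policy are illustrative):

```typescript
import { CloudWatch } from 'aws-sdk';

const cloudwatch = new CloudWatch();

// Publish a data point of 1 whenever the cluster needs more machines.
async function publishScaleOutMetric(clusterName: string): Promise<void> {
  await cloudwatch
    .putMetricData({
      Namespace: 'ECS/Autoscaling', // illustrative namespace
      MetricData: [{ MetricName: `${clusterName}-needs-scaleOut`, Value: 1 }],
    })
    .promise();
}

// One-time setup: the alarm fires when the metric reaches 1 and triggers the
// Scaling Policy attached to the SPOT Autoscaling Group (ARN passed in).
async function createScaleOutAlarm(clusterName: string, scalingPolicyArn: string): Promise<void> {
  await cloudwatch
    .putMetricAlarm({
      AlarmName: `${clusterName}-needs-scaleOut`,
      Namespace: 'ECS/Autoscaling',
      MetricName: `${clusterName}-needs-scaleOut`,
      Statistic: 'Maximum',
      Period: 60,
      EvaluationPeriods: 1,
      Threshold: 1,
      ComparisonOperator: 'GreaterThanOrEqualToThreshold',
      AlarmActions: [scalingPolicyArn],
    })
    .promise();
}
```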

And the last piece of the puzzle is the Scaling Policy attached to the SPOT Autoscaling Group of the cluster, which will be triggered when the alarm enters the ALARM state.

This Scaling Policy will increase the number of desired machines of the cluster by 1.
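The policy itself can be created once, for example with a snippet like the following (group and policy names are placeholders; the returned ARN is what the CloudWatch alarm above uses as its action):

```typescript
import { AutoScaling } from 'aws-sdk';

const autoscaling = new AutoScaling();

// One-time setup: a simple scaling policy on the SPOT Autoscaling Group that
// adds one machine each time the alarm fires.
async function createScaleOutPolicy(spotAsgName: string): Promise<string> {
  const { PolicyARN } = await autoscaling
    .putScalingPolicy({
      AutoScalingGroupName: spotAsgName,
      PolicyName: 'ecs-cluster-scale-out', // placeholder name
      PolicyType: 'SimpleScaling',
      AdjustmentType: 'ChangeInCapacity',
      ScalingAdjustment: 1,
      Cooldown: 300, // give the new machine time to join the cluster
    })
    .promise();
  return PolicyARN!; // used as the AlarmAction of the CloudWatch alarm
}
```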

With this stack of Lambdas and resources, when the cluster is getting full we will automatically scale out machines so there is enough space for future containers.
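Putting the scale-out pieces together, the scheduled Lambda could look roughly like this (cluster names and the schedule are placeholders; the helper functions are the sketches from above):

```typescript
// Entry point of the scheduled Lambda (e.g. a CloudWatch Events rule every minute).
export const scaleOutHandler = async (): Promise<void> => {
  const clusters = ['production', 'staging']; // placeholder cluster names

  for (const cluster of clusters) {
    const data = await collectClusterData(cluster);
    if (needsScaleOut(data)) {
      await publishScaleOutMetric(cluster);
    }
  }
};
```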

Scale In machines

Disclaimer: this algorithm only works when all the machines in the cluster are of the same type. If you are using multiple types of machines, jump to Part 4 of this series, where we implement a new version of this algorithm that works with mixed instance types.

This is the most delicate part: removing instances without causing downtime or killing active containers.

A cluster can scale down when it has enough free resources to remove one machine, reschedule all the tasks, and end up with space to schedule one more container of the largest service.

And it must always follow the requirements below:

  • Maintain the AZ balance in the Autoscaling Group.
  • Remove the machine with the most free resources.
  • Wait until all the containers are stopped before terminating a machine.
  • Remove only Spot machines.

First, we will evaluate if the cluster needs to scale down, using the same cluster info we used to scale out. We count the number of copies of the largest container that could be scheduled in the cluster; if the result is greater than the number of copies of the largest container that fit in one empty instance, plus the number reserved for scaling out, then we can assume that the cluster can reschedule all the tasks if we remove an instance, while still leaving enough free space to scale out the largest service.

The method below implements this check:
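A sketch of the check, again reusing the ClusterData shape and the SCALE_OUT_THRESHOLD reserved for scaling out (since all the machines are of the same type, the memory of one empty instance can be read from any instance's registered resources):

```typescript
function needsScaleIn(data: ClusterData, instanceMemory: number): boolean {
  // Copies of the largest container that can still be scheduled in the cluster.
  let schedulableContainers = 0;
  for (const instance of data.instances) {
    const remainingMemory =
      instance.remainingResources?.find((r) => r.name === 'MEMORY')?.integerValue || 0;
    schedulableContainers += Math.floor(remainingMemory / data.largestContainerMemory);
  }

  // Copies of the largest container that fit in one empty machine.
  const containersPerEmptyInstance = Math.floor(instanceMemory / data.largestContainerMemory);

  // We can remove a machine only if, after rescheduling everything it runs,
  // the cluster still keeps the room reserved for scaling out.
  return schedulableContainers > containersPerEmptyInstance + SCALE_OUT_THRESHOLD;
}
```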

Once we have decided that the cluster needs to scale down, we have to choose which machine to remove. The choice has three requirements:

  • Maintain the AZ balance in the Autoscaling Group. If you break the balance of the availability zones of an Autoscaling Group, it will automatically terminate another machine to restore the balance, and this will probably kill running containers of your cluster, resulting in downtime and failures.
  • Remove the machine with the most free resources. To minimize the interruptions and the rescheduling of the running containers, it’s important to remove the machine with the fewest containers running.
  • Remove only Spot machines. As we specified before, in our case we only want to use the spot cluster to scale in/out, so it’s important not to remove machines from the OnDemand cluster.

The method below chooses the machine to remove:
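A sketch of that choice (the SPOT Autoscaling Group name is passed in; instances that do not belong to it are discarded, which also covers the "only Spot machines" requirement):

```typescript
import { AutoScaling, ECS } from 'aws-sdk';

const autoscaling = new AutoScaling();

// Pick the instance to remove: only instances of the SPOT Autoscaling Group,
// only from the most populated availability zone(s) so terminating one cannot
// unbalance the group, and among those the one with the most free memory.
async function chooseInstanceToRemove(
  instances: ECS.ContainerInstance[],
  spotAsgName: string
): Promise<ECS.ContainerInstance | undefined> {
  const { AutoScalingGroups } = await autoscaling
    .describeAutoScalingGroups({ AutoScalingGroupNames: [spotAsgName] })
    .promise();
  const asgInstances = AutoScalingGroups?.[0]?.Instances || [];

  // Count instances per availability zone to find the zone(s) with the most machines.
  const azByInstanceId: Record<string, string> = {};
  const azCount: Record<string, number> = {};
  for (const i of asgInstances) {
    azByInstanceId[i.InstanceId] = i.AvailabilityZone;
    azCount[i.AvailabilityZone] = (azCount[i.AvailabilityZone] || 0) + 1;
  }
  const maxPerAz = Math.max(...Object.values(azCount));
  const removableAzs = new Set(Object.keys(azCount).filter((az) => azCount[az] === maxPerAz));

  // Keep only Spot instances that live in one of the most populated zones.
  const candidates = instances.filter((ci) => {
    const az = azByInstanceId[ci.ec2InstanceId || ''];
    return az !== undefined && removableAzs.has(az);
  });

  // Among the candidates, remove the one with the most free memory,
  // i.e. the one running the fewest containers.
  const freeMemory = (ci: ECS.ContainerInstance) =>
    ci.remainingResources?.find((r) => r.name === 'MEMORY')?.integerValue || 0;
  return candidates.sort((a, b) => freeMemory(b) - freeMemory(a))[0];
}
```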

Once selected, we will put the machine in the DRAINING state and wait until all the containers are stopped.
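A sketch of that step (the polling interval is arbitrary; in a real Lambda the wait would have to respect the function timeout, for example by re-checking on the next scheduled run):

```typescript
import { ECS } from 'aws-sdk';

const ecs = new ECS();

// Put the chosen container instance in DRAINING state: ECS stops placing new
// tasks on it and the services reschedule their containers on other machines.
async function drainAndWait(cluster: string, containerInstanceArn: string): Promise<void> {
  await ecs
    .updateContainerInstancesState({
      cluster,
      containerInstances: [containerInstanceArn],
      status: 'DRAINING',
    })
    .promise();

  // Poll until no tasks remain on the instance.
  for (;;) {
    const { containerInstances } = await ecs
      .describeContainerInstances({ cluster, containerInstances: [containerInstanceArn] })
      .promise();
    if ((containerInstances?.[0]?.runningTasksCount || 0) === 0) return;
    await new Promise((resolve) => setTimeout(resolve, 30000)); // wait 30s between checks
  }
}
```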

And finally, once all the containers are stopped, we can remove the instance from the Autoscaling Group and terminate it using the terminateInstanceInAutoScalingGroup method from the AWS SDK.
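A sketch of that final call (decrementing the desired capacity prevents the Autoscaling Group from launching a replacement):

```typescript
import { AutoScaling } from 'aws-sdk';

const autoscaling = new AutoScaling();

// Terminate the drained instance and shrink the group's desired capacity by one.
async function terminateInstance(instanceId: string): Promise<void> {
  await autoscaling
    .terminateInstanceInAutoScalingGroup({
      InstanceId: instanceId,
      ShouldDecrementDesiredCapacity: true,
    })
    .promise();
}
```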

With all these steps we can safely scale in without interrupting our services.
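Putting the scale-in pieces together, a scheduled Lambda could drive the process roughly like this (names and values reuse the sketches above and are illustrative):

```typescript
// Scheduled Lambda that drives the scale-in process.
export const scaleInHandler = async (): Promise<void> => {
  const cluster = 'production';              // placeholder cluster name
  const spotAsgName = 'production-spot-asg'; // placeholder Autoscaling Group name

  const data = await collectClusterData(cluster);

  // Memory of one empty machine; all instances are of the same type.
  const instanceMemory =
    data.instances[0]?.registeredResources?.find((r) => r.name === 'MEMORY')?.integerValue || 0;

  if (!needsScaleIn(data, instanceMemory)) return;

  const instance = await chooseInstanceToRemove(data.instances, spotAsgName);
  if (!instance) return;

  await drainAndWait(cluster, instance.containerInstanceArn!);
  await terminateInstance(instance.ec2InstanceId!);
};
```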

An overview of the whole process

Summary

We have used the number of copies of the largest container that could be scheduled as a metric to decide when a cluster should scale in/out. The toughest part is scaling in: the risk is very high, and you have to be sure to remove a machine only when all its containers are stopped and the Autoscaling Group will not become unbalanced.

This is the end of a three-post series: in Part 1 we implemented autoscaling for the ECS services that use a load balancer, in Part 2 we did the same for the workers, and we finish the series by implementing autoscaling for the machines of the clusters.

UPDATE: We have released Part 4 of this series, where we implement a scale-in algorithm for mixed instance types.

About Signaturit

Signaturit is a trust service provider that offers innovative solutions in the field of electronic signatures (eSignatures), certified registered delivery (eDelivery) and electronic identification (EID).

Open Positions

We are always looking for talented people who share our vision and want to join our international team in sunny Barcelona :) Be a SignaBuddy > jobs
