
Autoscaling using Custom Metrics

Khoa Tran · Published in Qbits · Apr 20, 2018

AWS CloudWatch provides a rich set of tools to monitor the health and resource utilization of various AWS services. The metrics CloudWatch collects can then be used to set up alarms, send notifications, and trigger actions when those alarms fire.

At Quorum, we follow an Immutable Deployment pipeline and autoscale our production instances using an AWS Auto Scaling Group (ASG) and AWS CloudWatch. The initial version of our scaling policy was quite simple: we started with two production instances (for redundancy, it’s best practice to have at least two). When we experienced high CPU load for a set period, we added one more instance to the ASG (also known as scaling out). Similarly, when we recorded low CPU load across all instances, we removed one instance (a.k.a. scaling in), as long as we still had more than two instances.
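For illustration, here is roughly how such a CPU-based policy could be wired up with boto3. This is only a sketch, not our actual setup: the ASG name, threshold, cooldown, and alarm names below are made-up placeholders.

import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "production-asg"  # placeholder ASG name

# Simple scaling policy: add one instance whenever the alarm below fires.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out-on-high-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=120,  # illustrative cooldown, in seconds
)

# Fire when the ASG's average CPU stays above 70% for five minutes.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)

The scale-in half is symmetric: a policy with ScalingAdjustment=-1 attached to a low-CPU alarm.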

Side note: since we are adding (or removing) instances of the same type (otherwise known as horizontal scaling), we’ll use scaling out/in going forward rather than scaling up/down, which more typically refers to vertical scaling.

This simple policy served our needs very well most of the time, particularly for our Quorum Grassroots product. Most visitors (also known as advocates or supporters) do not perform particularly intensive requests or queries. However, at any given time (especially right after a campaign is launched), we can reasonably expect many advocates coming to an Action Center to take action or write to their members. This often results in high web traffic, which is a primary cause of high CPU utilization. To meet the increase in traffic, we add more instances to the ASG using our CPU scaling policies and let the Amazon Elastic Load Balancer (ELB) distribute load across the instances, thereby reducing the per-instance load.


Memory Scaling

While CPU scaling policies work great for a high volume of low-intensity web requests, they don’t solve every scaling problem. A problem we frequently ran into was that one of our Gunicorn workers would eat up all available memory while executing a memory-intensive task. This prevented other processes from utilizing resources, and as a result, the AWS ELB health check would mark the instance as unhealthy. The instance was then taken out of service, and a new instance was launched to meet the ASG’s minimum instance count. The vicious cycle continued for as long as the memory-intensive task was still being carried out, because our scaling policies were not responsive to memory utilization.

The solution is to add memory-sensitive scaling policies. Unfortunately, memory utilization is one of the metrics not available by default in CloudWatch. Since AWS does not have access to the instance at the OS level, only metrics that can be monitored through the hypervisor layer (such as CPU and network utilization) are recorded. There are various ways to solve this problem; in fact, AWS offers a solution itself. Since we primarily leverage Python and Boto in our daily work, we settled on the following game plan (a rough sketch follows the list).

  1. We record the memory usage per instance. This can be done by grep-ing the output of free -m or /proc/meminfo.
  2. Using Boto and its CloudWatch API, we then send the collected metrics to AWS.
  3. We turn step 2 into a task that is executed every minute. This can be done with a cron.
  4. We create new scaling policies based on the newly recorded metrics (e.g., scale out when memory usage exceeds 70% across all instances, scale in when it falls below 20%).
  5. Finally, we attach the new scaling policies to the Auto Scaling Group created on every deployment and let AWS take care of the magic from there.
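A minimal sketch of steps 1 through 3 might look like the script below. It reads /proc/meminfo directly and uses boto3; the namespace, metric name, and ASG name are placeholders rather than our exact production values.

import boto3

def memory_used_percent():
    # Parse /proc/meminfo; values are reported in kB.
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            meminfo[key] = int(rest.split()[0])
    total = meminfo["MemTotal"]
    available = meminfo.get("MemAvailable", meminfo["MemFree"])
    return 100.0 * (total - available) / total

def publish(asg_name="production-asg"):  # placeholder ASG name
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="EC2/Memory",  # custom namespace; must not start with "AWS"
        MetricData=[{
            "MetricName": "MemoryUtilization",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
            "Unit": "Percent",
            "Value": memory_used_percent(),
        }],
    )

if __name__ == "__main__":
    # Step 3: schedule this with cron so it runs every minute.
    publish()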

Problem Solved?

We can now scale based on either CPU or memory utilization, and we have a game plan to add more custom metrics in the future should the need arise. It looks like our job is done here.

.

.

.

Or is it? With four alarms (High CPU, Low CPU, High Memory, Low Memory), it seems like we can respond to high resource utilization well in general. However, we must not forget that any pair of the four alarms (as long as the metrics are different) can fire at the same time. Problems arise when one metric is used to scale out while another is used to scale in. Imagine having high CPU utilization but low memory usage overall. We would end up in a situation where a new instance is launched and another terminated, resulting in another vicious cycle for as long as both alarms’ conditions are met.

A limitation of CloudWatch alarms is that each alarm can only respond to a single metric, not to a combination of metrics. What we actually want is to scale out whenever either CPU or memory is in high demand, and to scale in only when both CPU and memory are underutilized. The desired behavior can be written as the following pseudocode.

if high_memory or high_cpu:
    scale_out()
elif low_memory and low_cpu:
    scale_in()
else:
    do_nothing()

Darn! And we were so close too.

But wait! 💡 What if we compute the if conditions one level before CloudWatch, on the instance itself? Then we could send CloudWatch just three different enum values: one to scale out, one to scale in, and one to stay where we’re at. The above pseudocode can be rewritten as follows.

if high_memory or high_cpu:
    return 1   # instances += 1
elif low_memory and low_cpu:
    return -1  # instances -= 1
else:
    return 0

One Metric to Rule ’Em All

So far, we’ve learned that scaling with two (or more) independent metrics can be problematic, but we can get around that by scaling with a single metric calculated from both CPU and memory. Let’s now translate the above pseudocode into actual code. We calculate CPU and memory usage separately and branch based on the aforementioned conditions. The result is then sent to CloudWatch through the put-metric-data API call, which is also available in boto. This call takes a few parameters, of which the following are important to us:

  • namespace: a custom namespace that does not start with AWS, since that prefix is reserved for official AWS services. To stay consistent with the AWS/Service format, we’ll go with EC2/Autoscaling.
  • unit: CloudWatch offers a wide range of options here. To keep things simple, we’ll go with Count, as it represents our enum values best. Since a Count of -1 does not make sense for an event (in fact, CloudWatch limits the value to a positive double), we’ll shift our enum values up by one. As we’ll see later, this shift also simplifies the scaling-threshold calculation.
  • Metric name/value: “Autoscaling” / a value of 0, 1, or 2.
  • dimension: we want to associate the metric with the AutoScalingGroupName.

Putting it all together, we arrive at the following script.
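The original script isn’t reproduced here, but a minimal version of it might look like the sketch below. It uses boto3 and psutil for brevity (neither is required; parsing /proc works just as well), and the 70%/20% utilization thresholds and the ASG name are assumptions for illustration.

import boto3
import psutil  # assumption: psutil for CPU/memory sampling

ASG_NAME = "production-asg"  # placeholder; could be read from config or instance tags
HIGH, LOW = 70.0, 20.0       # illustrative utilization thresholds, in percent

def scaling_signal():
    # Returns 2 (scale out), 0 (scale in), or 1 (do nothing).
    # The enum is shifted up by one so the value sent to CloudWatch stays positive.
    cpu = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory().percent
    if cpu > HIGH or memory > HIGH:
        return 2   # instances += 1
    elif cpu < LOW and memory < LOW:
        return 0   # instances -= 1
    return 1       # stay put

def publish():
    boto3.client("cloudwatch").put_metric_data(
        Namespace="EC2/Autoscaling",  # custom namespace in the AWS/Service style
        MetricData=[{
            "MetricName": "Autoscaling",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
            "Unit": "Count",
            "Value": scaling_signal(),
        }],
    )

if __name__ == "__main__":
    publish()  # scheduled via cron to run every minute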


Integrating What We’ve Learned into the Deployment Playbook

At long last, let’s put everything together and integrate what we’ve learned into the Ansible deployment playbook. We cover in depth how software deployment is done at Quorum in this article, which might serve as a good refresher before moving forward.

First, in the playbook that spins up a Staging server/environment, we mark the above script as executable and add a cron job that runs it and publishes data to CloudWatch every minute.

Second, in the deployment playbook, we configure two scaling policies: one for scaling out and one for scaling in. We make the scale-out policy’s cooldown shorter than the scale-in policy’s so that we can add instances faster during periods of high load.
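In our playbook these are Ansible tasks, but in boto3 terms the two policies amount to roughly the following (the policy names and cooldown values here are placeholders):

import boto3

autoscaling = boto3.client("autoscaling")
ASG_NAME = "production-asg"  # placeholder

# Scale out aggressively: short cooldown so we can add capacity quickly.
scale_out = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=120,
)

# Scale in conservatively: longer cooldown before removing another instance.
scale_in = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="scale-in",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=-1,
    Cooldown=600,
)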

Finally, we configure the metric alarms and attach the scaling policies to them using the ec2_metric_alarm module.

The two threshold values are specific to our workload and deserve further explanation. Given a configuration with two instances, if one instance is utilizing a lot of resources and sending CloudWatch a value of 2 while the other is fine and sending a value of 1, we want to scale out. The average value here is (2 + 1) / 2 = 1.5, which is what we set as our scale-out threshold.

Similarly, say we are now at three instances after a scale-out event takes place. At least two of our instances are recording low resource utilization and sending CloudWatch a value of 0, while the other one is normal with a value of 1. In this scenario, the average value recorded would be (0 + 0 + 1) / 3 ≈ 0.33, which is why we set the scale-in threshold to 0.34.
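Continuing the boto3 sketch above (again, only an approximation of what the ec2_metric_alarm tasks do; the period and evaluation counts are assumptions), the two alarms attach those policies to the custom metric and its thresholds:

cloudwatch = boto3.client("cloudwatch")

# Scale out when the fleet-wide average of the custom metric reaches 1.5.
cloudwatch.put_metric_alarm(
    AlarmName="autoscaling-signal-high",
    Namespace="EC2/Autoscaling",
    MetricName="Autoscaling",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1.5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[scale_out["PolicyARN"]],
)

# Scale in only when the average drops below 0.34.
cloudwatch.put_metric_alarm(
    AlarmName="autoscaling-signal-low",
    Namespace="EC2/Autoscaling",
    MetricName="Autoscaling",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0.34,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[scale_in["PolicyARN"]],
)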

The point here is that we want to be very responsive when scaling out: an overwhelmed instance does not provide a great user experience. On the other hand, we don’t want to be too eager to scale in. Keep in mind that AWS and most infrastructure providers round instance billing up to the next hour: renting an instance from 2:00 to 2:30 costs the same as renting it from 2:00 to 2:59. Furthermore, a rental that spans two clock hours (say, 2:50 to 3:10) bills our account for both hours, so we might as well keep our scaled-out instance(s) around until we are certain that resources across all instances are underutilized.

Summary

We’ve learned the following things:

  • We can autoscale using custom metrics that do not come out of the box with AWS CloudWatch.
  • That being said, scaling with two (or more) metrics independently can cause undesirable results, particularly when one metric wants to scale out while another wants to scale in.
  • For a good user experience, be liberal about scaling out (add instances at the first sign of high load) and conservative about scaling in (it doesn’t hurt to have more resources than needed, and instance-billing quirks make keeping them around cheap).

Interested in working at Quorum? We’re hiring!
