Autoscaling Microservices on AWS — Part 1

1. Introduction

At FiNC we have adopted a Microservice Architecture to handle a significant increase in developer activity such as adding new features, fixing bugs, and paying down technical debt. Most of our microservices are built with Ruby on Rails and deployed on Amazon ECS as containerized applications. Like monolithic applications, microservices need to be scaled out at peak hours and scaled in at off-peak hours: scaling out serves more incoming requests, while scaling in releases computing resources and thus saves costs. In this blog post, I will cover the way we do it at FiNC.

2. Infrastructure Architecture

Basically, our infrastructure architecture looks like this:

Figure 1. Infrastructure Architecture

We deploy both real-time services and asynchronous workers on ECS in separate ECS clusters. Real-time services are responsible for serving requests (mostly coming from users) that require quick responses and async workers process delayed jobs enqueued in Message Queues (Redis or Amazon SQS).

3. Autoscaling Microservices

As shown in Figure 1, a microservice utilizes three components: ECS, Message Queues and RDS. Therefore, in order to make a microservice automatically scalable, we need to make those three components automatically scalable. In this section, I will only focus on discussing the way we set up Auto Scaling for the ECS component. I’ll cover Auto Scaling for the other two components in the next parts.

On Amazon ECS, a cluster is backed by an Auto Scaling Group that contains a collection of EC2 instances. These EC2 instances are computing resources for our containerized applications.

3.1. Real-time services

To orchestrate microservices independently in the same ECS cluster, we create a separate ECS service for each microservice. Each ECS service runs a specified number of tasks and each task contains several application instances.

At FiNC, most of our microservices are built with the Ruby on Rails framework and Unicorn (a Rack HTTP server) is used to enable Rails applications to process requests concurrently.

When the number of concurrent users (CCU) increases significantly, the traffic to the microservices increases accordingly, and the number of application instances may not be enough to serve all concurrent requests. Consequently, a capacity shortage may occur. To adapt to this situation, we need to observe the capacity of each microservice and request the ECS services to run more tasks in a timely, smooth manner. However, the more tasks that run, the more computing resources (CPU and memory) are consumed, which will probably lead to a shortage of computing resources at some point. Therefore, we also need to observe computing resource utilization and scale out EC2 instances in a timely, smooth manner.

Conversely, when CCU decreases, the traffic coming to the microservices decreases accordingly, and some application instances become redundant. At that time, we should scale in application instances to release computing resources; that is, we request the ECS services to stop some running tasks gracefully. After computing resource utilization drops below a certain threshold, we can terminate some EC2 instances gracefully to save costs.

In Figure 2 below, the left-hand side box marked with the number (1) shows components constituting an autoscaling workflow for EC2 instances backing a real-time services cluster. The right-hand side box marked with the number (2) shows components constituting an autoscaling workflow for an ECS service.

Figure 2. An ECS cluster hosting real-time services

(1) Autoscaling EC2 instances

As mentioned above, EC2 instances backing the real-time services cluster belong to an Auto Scaling group, which supports Scaling Policies:

Figure 3. Scaling Policies

To enable the Auto Scaling group to scale out EC2 instances automatically, we create a Simple Scaling Policy named ScaleOut. When this policy is executed, it adds two new EC2 instances to the Auto Scaling group. The policy requires a CloudWatch Alarm to trigger its execution. To create that CloudWatch alarm (the scale-out alarm), we need a metric that appropriately represents computing resource utilization; however, no such metric is readily available. Looking at the metrics published by the ECS cluster, we find two rather relevant ones: CPUUtilization and MemoryUtilization. We cannot use either metric alone to create the scale-out alarm, because each represents the utilization of a single computing resource; sometimes CPUUtilization is much higher than MemoryUtilization, and vice versa. Therefore, we combine these two metrics to produce a more relevant one, using a lambda function (𝝺1 in Figure 2). The combination logic is as follows:

Compute CPUMemoryUtilizationNormalized
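As a minimal sketch of 𝝺1, assuming the two utilizations are simply averaged (the exact formula, namespace, and dimension names here are illustrative assumptions, not necessarily our production values):

```python
# Sketch of lambda function λ1: merge the two cluster metrics into one.
# The averaging formula, namespace, and dimension names are assumptions.

def combine_utilization(cpu, mem):
    """Weight CPUUtilization and MemoryUtilization equally."""
    return (cpu + mem) / 2.0

def publish_normalized_metric(cluster_name, cpu, mem):
    import boto3  # AWS SDK; credentials come from the Lambda execution role
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Custom/ECS",
        MetricData=[{
            "MetricName": "CPUMemoryUtilizationNormalized",
            "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
            "Value": combine_utilization(cpu, mem),
            "Unit": "Percent",
        }],
    )
```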

where cpu stands for CPUUtilization and mem stands for MemoryUtilization. With this combination, we ensure that CPUUtilization and MemoryUtilization are taken into account equally. The lambda function publishes the result to CloudWatch as the CPUMemoryUtilizationNormalized metric. Once this metric is available in CloudWatch, we can create the scale-out alarm from it, and then we can proceed to create the ScaleOut policy like this:

Figure 4. ScaleOut policy

In contrast with scaling out, we cannot use a Simple Scaling Policy for scaling in EC2 instances, because any tasks running on the removed EC2 instances would be killed abruptly, which may destabilize the ECS services. To deal with this problem, we set up two lambda functions, 𝝺2 and 𝝺3 (in Figure 2). 𝝺2 periodically queries CloudWatch to check the CPUMemoryUtilizationNormalized metric. If its value is under a certain threshold, 𝝺2 asks Amazon ECS to DRAIN the container instance corresponding to the oldest EC2 instance in the Auto Scaling group. The status of that container instance becomes DRAINING like this:

Figure 5. A container instance with the DRAINING status
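The draining step in 𝝺2 can be sketched as follows, assuming boto3 and taking "oldest" to mean the earliest launch time (cluster and instance records are placeholders):

```python
# Sketch of λ2: drain the container instance that corresponds to the
# oldest EC2 instance in the Auto Scaling group.

def oldest_instance_id(instances):
    """instances: list of {'InstanceId': ..., 'LaunchTime': ...} records."""
    return min(instances, key=lambda inst: inst["LaunchTime"])["InstanceId"]

def drain_oldest(cluster, instances):
    import boto3  # AWS SDK, available in the Lambda runtime
    ecs = boto3.client("ecs")
    target = oldest_instance_id(instances)
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns)
    for ci in described["containerInstances"]:
        if ci["ec2InstanceId"] == target:
            # Mark the matching container instance as DRAINING so ECS
            # reschedules its tasks elsewhere before we terminate it.
            ecs.update_container_instances_state(
                cluster=cluster,
                containerInstances=[ci["containerInstanceArn"]],
                status="DRAINING",
            )
```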

𝝺3 periodically queries Amazon ECS (via its APIs) to check the statuses of container instances in the ECS cluster. If a container instance has the status DRAINING and its running tasks count is 0, 𝝺3 terminates the corresponding EC2 instance in the Auto Scaling group and decrements the desired capacity as well.

Terminate an EC2 instance in Auto Scaling group
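The terminate step in 𝝺3 can be sketched like this, assuming boto3; the selection logic mirrors the description above:

```python
# Sketch of λ3: terminate EC2 instances whose ECS container instance is
# DRAINING with zero running tasks, shrinking the group's desired capacity.

def instances_safe_to_terminate(container_instances):
    """Return the EC2 instance ids that are fully drained."""
    return [
        ci["ec2InstanceId"]
        for ci in container_instances
        if ci["status"] == "DRAINING" and ci["runningTasksCount"] == 0
    ]

def terminate_drained(container_instances):
    import boto3
    autoscaling = boto3.client("autoscaling")
    for instance_id in instances_safe_to_terminate(container_instances):
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            ShouldDecrementDesiredCapacity=True,  # also decrement desired capacity
        )
```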

(2) Autoscaling ECS services

Amazon ECS allows us to configure Auto Scaling for ECS services independently. For example, the following is the Auto Scaling configuration of an ECS service in our production environment:

Figure 6. Auto Scaling configuration of an ECS service

We have created two Step Scaling policies, ScaleOut and ScaleIn, for scaling out and scaling in tasks, respectively. Each policy requires its own CloudWatch alarm to trigger its execution. When the ScaleOut policy is triggered by the scale-out alarm, it adds a specified number of new tasks to the ECS service. Conversely, when the ScaleIn policy is triggered by the scale-in alarm, it gracefully removes a specified number of running tasks from the ECS service.
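Such a service-level policy can also be defined through the Application Auto Scaling API. A sketch, where all names, capacities, and step sizes are illustrative rather than our production settings:

```python
# Sketch: attach a StepScaling ScaleOut policy to an ECS service via
# Application Auto Scaling. Names, capacities, and step sizes are illustrative.

def step_scaling_config(tasks_to_add, cooldown=60):
    """Step configuration that adds tasks once the alarm threshold is crossed."""
    return {
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": cooldown,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": tasks_to_add},
        ],
    }

def create_scale_out_policy(cluster, service, tasks_to_add):
    import boto3
    client = boto3.client("application-autoscaling")
    resource_id = "service/{}/{}".format(cluster, service)
    client.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=2,
        MaxCapacity=20,
    )
    client.put_scaling_policy(
        PolicyName="ScaleOut",
        ServiceNamespace="ecs",
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration=step_scaling_config(tasks_to_add),
    )
```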

As shown in Figure 6, we have created the CloudWatch alarms for both policies from the same metric: AppInstanceBusyPercent. But what does AppInstanceBusyPercent mean? At FiNC, we use New Relic APM to monitor our applications' performance, so we can understand how busy our applications are by checking the App instance busy chart in the capacity analysis report. For example, the following chart shows App instance busy for a Ruby on Rails application in our production environment:

Figure 7. App instance busy chart

Here, one app instance is equivalent to one Unicorn worker process, and the App instance busy chart shows what percentage of their time Unicorn worker processes spend processing requests. The more time the worker processes spend processing requests, the fewer concurrent requests they can serve. Obviously, to keep our services fast and stable, we need to keep App instance busy under a certain threshold. Therefore, we can use this metric to create the CloudWatch alarms for the ScaleOut and ScaleIn policies. To do so, we set up a lambda function, 𝝺4 (in Figure 2), to get App instance busy from New Relic and publish it to CloudWatch as the AppInstanceBusyPercent metric.

The Python code snippet below shows how we get the average value of App instance busy from New Relic by calling their API endpoint.

Get App instance busy from New Relic
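A minimal sketch of that request, assuming New Relic's REST API v2 metric-data endpoint; the metric name (`Instance/Busy`) and the `busy_percent` value field are our assumptions, so check your own capacity report before relying on them:

```python
# Sketch of λ4's New Relic call. The metric name ("Instance/Busy") and the
# "busy_percent" value field are assumptions.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://api.newrelic.com/v2/applications/{app_id}/metrics/data.json"

def average_busy_percent(payload):
    """Extract the summarized busy percentage from the API response."""
    metric = payload["metric_data"]["metrics"][0]
    return metric["timeslices"][0]["values"]["busy_percent"]

def fetch_app_instance_busy(api_key, app_id, from_ts, to_ts):
    query = urllib.parse.urlencode([
        ("names[]", "Instance/Busy"),
        ("from", from_ts),
        ("to", to_ts),
        ("summarize", "true"),  # one timeslice averaged over the window
    ])
    request = urllib.request.Request(
        ENDPOINT.format(app_id=app_id) + "?" + query,
        headers={"X-Api-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return average_busy_percent(json.load(response))
```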

Figure 8 below shows how well the Auto Scaling configurations work on a real-time services cluster in our production environment. The first graph shows the variations in the RequestCount metrics of the Application Load Balancers (each ECS service uses its own Application Load Balancer, as shown in Figure 2; the RequestCount metric represents the number of requests processed successfully by an ECS service). The second graph shows how the running tasks counts of the ECS services vary in accordance with those RequestCount metrics. Lastly, the third graph shows how the healthy instances count of the Auto Scaling group that backs the real-time services cluster varies in accordance with the running tasks counts. All of these variations happen automatically, without any human intervention.

Figure 8. The effect of Auto Scaling configurations on an ECS cluster

3.2. Async workers

As mentioned before, we also deploy async workers to ECS clusters, so we need to configure Auto Scaling for the EC2 instances backing these async worker clusters, just as we do for real-time service clusters.

We run async workers as orphan tasks; we don't use ECS services to maintain them. The reason is that when an ECS service is updated with a new task definition revision, the ECS scheduler kills the currently running tasks abruptly using SIGTERM and SIGKILL signals, whereas our async workers need to be shut down gracefully with SIGUSR1.

Figure 9. Autoscaling async workers

As shown in Figure 9, we have developed a tool named daemon_watcher to maintain the async worker tasks running in ECS clusters. A Jenkins job executes daemon_watcher periodically. It queries AWS Systems Manager Parameter Store for the settings related to autoscaling async workers (Desired count, Minimum tasks, Maximum tasks, Scale-out threshold, Scale-in threshold, Incremental tasks) and queries CloudWatch for the queue size. Depending on the queue size, daemon_watcher either scales out async workers by running Incremental tasks or scales in async workers by killing a specified number of running tasks gracefully. Currently, we use both Redis and Amazon SQS as Message Queues. SQS automatically publishes its queue size to CloudWatch as the ApproximateNumberOfMessagesVisible metric, while for Redis we had to write a lambda function that calculates the queue size and publishes it to CloudWatch.
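The decision step inside daemon_watcher can be sketched like this (the settings keys mirror the Parameter Store entries listed above, but the field names and logic are illustrative):

```python
# Sketch of daemon_watcher's scaling decision. Positive return value:
# number of worker tasks to start; negative: number to stop; 0: no change.
# The settings keys mirror the Parameter Store entries described above.

def scaling_decision(queue_size, running_tasks, settings):
    step = settings["incremental_tasks"]
    if queue_size > settings["scale_out_threshold"]:
        # Scale out, but never beyond the configured maximum.
        return min(step, settings["max_tasks"] - running_tasks)
    if queue_size < settings["scale_in_threshold"]:
        # Scale in, but never below the configured minimum.
        return -min(step, running_tasks - settings["min_tasks"])
    return 0
```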

4. Risks and Countermeasures

For real-time services, Auto Scaling relies completely on New Relic. This is a risk in our design, because the New Relic API servers may go down at any point in time; in fact, they have gone down several times. When that happens, the lambda function 𝝺4 cannot get App instance busy from New Relic, and the CloudWatch metric AppInstanceBusyPercent has no data. We have two countermeasures for this situation: we can configure the scale-out alarms to treat missing data points as breaching, so that ECS services scale out tasks as usual, or we can check how busy our services are by looking at Application Load Balancer metrics such as RequestCount and ActiveConnectionCount. Depending on the values of those metrics, we can manually scale out tasks to keep our services stable.
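The first countermeasure is just an alarm setting. A sketch of creating the scale-out alarm with missing data treated as breaching (alarm name, namespace, and thresholds are illustrative, not our production values):

```python
# Sketch: create the scale-out alarm so missing AppInstanceBusyPercent
# data points are treated as breaching. Names and thresholds are illustrative.

def scale_out_alarm_config(threshold, alarm_actions):
    return {
        "AlarmName": "app-instance-busy-scale-out",
        "Namespace": "Custom/NewRelic",  # assumed custom namespace for λ4's metric
        "MetricName": "AppInstanceBusyPercent",
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # scale out even when New Relic is down
        "AlarmActions": alarm_actions,
    }

def create_scale_out_alarm(threshold, alarm_actions):
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        **scale_out_alarm_config(threshold, alarm_actions))
```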

Autoscaling both real-time services and async workers may also cause a MySQL Denial of Service (DoS) due to too many connections or high query throughput. To deal with this kind of risk, we also need to prepare appropriate countermeasures beforehand. Let me discuss this interesting topic in the next parts.

5. Conclusion

Autoscaling the ECS component brings us great benefits:

  • Microservices can seamlessly adapt to an increase in traffic.
  • Freeing SRE members from tedious manual work (scaling out/in by hand).
  • Reducing AWS costs, because Auto Scaling reduces the number of virtual machines at off-peak hours, especially around midnight.

At the time we set up Auto Scaling for EC2 instances, AWS Fargate didn't exist yet. However, it is now ready for use:

With AWS Fargate, you no longer have to provision, configure, and scale clusters of virtual machines to run containers. This removes the need to choose server types, decide when to scale your clusters, or optimize cluster packing. AWS Fargate removes the need for you to interact with or think about servers or clusters. Fargate lets you focus on designing and building your applications instead of managing the infrastructure that runs them.

With that useful feature, we are going to migrate our ECS clusters from the EC2 launch type to the Fargate launch type. Hopefully, with AWS Fargate, we can save even more costs.

Auto Scaling is a really good challenge for SRE members managing microservices on AWS. In the next parts, I will cover Auto Scaling for the remaining two components: Message Queues and RDS. Stay tuned.

By the way, we are currently seeking a Site Reliability Engineer to join our growing SRE team. Check out the job description on Wantedly: