Episode 5: The one with auto-scaling.

Concurrency and Distributed Systems: A guide with Kubernetes

--

Any business wants to maximize its profit: get the maximum amount of work and output from the least amount of resources, the most bang for your buck. In any software business, that means maximizing server utilization. Ideally, we would use 100% of a server’s capacity. But is that possible? And even if it is, is it the right thing to do?

In order to make efficient and economical use of the available hardware and stay responsive at the same time, we should horizontally scale our services up and down. What is horizontal scaling? When the load is high, we increase the number of worker nodes (or containers) to accommodate the higher load and stay within the required SLA. On the other hand, when the load is low, we decrease the number of workers and turn down the excess hardware to save money. That is horizontal scaling.

Let’s lead this discussion with some questions, looking back at episode 2, where we introduced replication.

  • What metric shall we use for server utilization?
  • How does server utilization affect performance and response time?
  • What is the best server utilization target value?

We are going to re-run the experiments from episode 2 to see the effect of utilization on performance.

Prerequisites

I am using the sysstat Linux package to make CPU/Memory measurements on my system. To install this on Ubuntu, simply run the following command:
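
    # Install sysstat on Ubuntu; it provides the sar and pidstat tools
    sudo apt-get update
    sudo apt-get install -y sysstat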

Once you install this package, you will have access to the pidstat and sar binaries, which are very useful when we are load testing.

Utilization

What metric shall we use for server utilization? The metrics we can easily measure on the server are CPU, memory, and I/O usage. Among these, CPU is usually the most expensive in terms of dollar value; we usually want to reduce CPU usage or get the most out of it. So we will mostly focus on this metric for server utilization measurements.

Using sar, you can sample the average CPU usage over a period of time. For example, the following sar command will print CPU usage every 5 seconds:
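
    # Report CPU utilization every 5 seconds until interrupted (Ctrl+C)
    sar -u 5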

The experiment we want to run here is as follows: run the replication code from episode 2 multiple times, each time with a different request arrivalRate, then sample the CPU usage and compare it with the performance we get from each run. Our performance metric is the p95 execution time. For the purpose of this experiment, we are going to effectively turn off throttling by setting the throttling threshold to 2000. After running the load tests five times for arrival rates of 5, 10, 15, 20, and 25 and collecting the average CPU load and running times, we can chart the p95 against the average CPU usage as shown in the following graph:

P95 vs Average CPU Utilization

As you can see in the above diagram, performance (p95) starts to really suffer when CPU utilization goes above 70%. As a matter of fact, by the time CPU utilization is above 75%, we have violated the SLA terms (> 5 seconds to run).

Increasing CPU utilization only pays off up to a certain point. Beyond it, servers spend so much time context switching between tasks that they cannot finish any of them within an acceptable time frame, hence the unacceptable performance.

As we saw in the above graph, 66% CPU utilization is actually the sweet spot and a good utilization target for many applications. Going above this target could put us at risk of over-utilization, hurting performance and missing the SLA. We will use this value when we set up auto-scaling in K8s. Ideally, we always want to keep the average CPU utilization near this target value by adding or removing worker nodes dynamically. That is where auto-scaling comes into play.

To learn more about utilization, I highly recommend this video:

HPA

Horizontal Pod Autoscaling (HPA) is a K8s mechanism that helps us automate horizontal scaling. K8s offers a rich API with plenty of configuration options to control pod scaling based on various system metrics.

For example, an HPA declaration along the following lines will scale the respective pods based on CPU utilization (the Deployment name and replica bounds in this sketch are placeholders):
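
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service                 # placeholder name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service               # placeholder Deployment to scale
      minReplicas: 1                   # illustrative bounds
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60   # keep average CPU utilization near 60%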

It increases or decreases the number of pod instances to keep the average CPU utilization near 60%. An important thing to remember is that this is still bound by hardware resource availability: new pods can only be scheduled if the cluster has capacity for them.

K8s follows a specific algorithm (which can be parameterized) to do the auto-scaling. The details of the algorithm can be found in this link. But in simple terms, and in the case of the above config, when it sees that the CPU utilization is above 60% for more than 15 seconds (this is configurable), it adds new pods. On the other hand, if the CPU utilization is less than 60% for more than 5 minutes (again configurable), it scales down the pods.
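
At the heart of that algorithm is a simple ratio, taken from the Kubernetes documentation:

    desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

For example, two replicas averaging 90% CPU against a 60% target gives ceil(2 * 90 / 60) = 3 replicas.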

Let’s put this to use for our little Sum of Primes service. The HPA config for our service looks roughly like the following (the Deployment name and maximum replica count shown here are placeholders):
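
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: sum-of-primes              # placeholder name for our service’s HPA
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: sum-of-primes            # the Sum of Primes Deployment (name assumed)
      minReplicas: 1                   # start with a single replica
      maxReplicas: 5                   # illustrative upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 66   # the target we picked from the experiments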

For this episode we changed the code to deploy only one replica at the beginning. Using the above HPA config, K8s will automatically create a new replica when the CPU utilization goes above 66%.
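
To see the autoscaler react while a load test is running, you can watch the HPA status (the HPA name below matches the placeholder used in the config above):

    # Watch current vs. target CPU utilization and the replica count in real time
    kubectl get hpa sum-of-primes --watch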
