Kubernetes Scheduling and Timescales

Dimitri Stiliadis
Apr 3, 2020 · 10 min read

--

In our effort to optimize performance and better understand the characteristics of our platforms, we had to deeply understand the resource allocation mechanism of Kubernetes and CPU limit enforcement by Linux systems. We believe that some of these details are often missed by Kubernetes users and operators. In this blog post we will attempt to clarify how one can actually estimate the right resource allocations in a Kubernetes cluster and explain some concepts around timescales, statistical multiplexing gains and process burstiness that are often ignored.

The main conclusion is that the lack of control over the timescale at which CPU limits are enforced by the Linux kernel can either lead to low CPU utilization or artificially constrain application performance. In order to set CPU limits properly, one has to understand the performance requirements of an application over short timescales, which is almost impossible for any parallel or multi-threaded workload. At the same time, statistical multiplexing gains can be greatly reduced, resulting in significant under-utilization of CPU resources.

Kubernetes Resources Limits

Kubernetes (and container systems more generally) allows operators to place resource limits when deploying workloads. We focus here on CPU allocation. There are two parts to any container resource specification:

1. CPU requests define the minimum CPU capacity that a container requires. The request is used by the Kubernetes scheduler when assigning PODs to nodes, with the goal that all such requests are satisfied. In other words, the sum of the CPU requests of all the PODs scheduled on a node must not exceed the node's capacity.

2. CPU limits identify the maximum resource allocation for a given container, even when the node is otherwise not utilized (note that limits are per container, not per POD). This is essentially a hard upper bound on the amount of CPU that is allocated to a container. CPU requests alone cannot guarantee any isolation or quality of service, since individual containers would otherwise be allowed to exceed their allocation up to the node capacity. Requests are simply a hint to the scheduler about the minimum resources required. Limits, on the other hand, place an absolute bound on the CPU allocation of a container, thus explicitly restricting the load it can place on the system (see the example right after this list).
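In a container spec, both settings live under a resources block. A minimal, illustrative example (the values are arbitrary) looks like this:

resources:
  requests:
    cpu: "500m"   # the scheduler reserves at least half a core on the node
  limits:
    cpu: "2"      # the kernel throttles the container above 2 cores per period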

The important question, though, and probably the most critical parameter, is the timescale over which the average resource utilization is calculated. During the execution of an application, its resource utilization fluctuates based on incoming load, I/O operations, and so on. In the case of Kubernetes, this averaging interval is configurable only through a kubelet parameter (`--cpu-cfs-quota-period`) and it is 100ms by default. In other words, if an application uses more CPU during a 100ms interval than its limit allows, its CPU allocation is throttled for the remainder of that interval.
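Concretely, the limit is enforced by the kernel's CFS bandwidth controller. Under cgroup v1, a container limited to 2 cores with the default 100ms period ends up with settings along these lines (the exact cgroup path depends on the container runtime):

cpu.cfs_period_us = 100000   # enforcement period: 100ms
cpu.cfs_quota_us  = 200000   # CPU time allowed per period: 200ms, i.e. 2 cores

Once the container's threads have consumed the 200ms of quota inside a period, they are throttled until the next period starts, regardless of how idle the rest of the node is.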

The operational problem this introduces is how to predict the right limit for an application in a way that does not introduce artificial bottlenecks. Especially for highly multi-threaded or parallel applications, the instantaneous load can fluctuate dramatically over time, depending on how many threads are active in parallel, which in turn depends on the arriving traffic.

A Synthetic Benchmark

In order to better illustrate the problems with the limit calculation, we wrote a small program that exercises these limits in a controlled way. The program, written in Go, instantiates 4 parallel Go-routines that execute a CPU-bound addition loop for 200ms and then sleep for 2 seconds. We chose this synthetic workload because it makes the problem easy to observe.
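A minimal sketch of such a benchmark (the cycle count and batch size are arbitrary) looks roughly like this:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		workers = 4                      // parallel Go-routines
		busy    = 200 * time.Millisecond // CPU-bound phase per cycle
		idle    = 2 * time.Second        // sleep phase per cycle
		cycles  = 10                     // busy/idle cycles per Go-routine
	)

	var total uint64 // aggregate "goodput": additions completed by all Go-routines
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := 0; c < cycles; c++ {
				var sum uint64
				deadline := time.Now().Add(busy)
				// Tight addition loop; the clock is checked in batches so
				// that the loop stays dominated by the additions themselves.
				for time.Now().Before(deadline) {
					for i := 0; i < 1000; i++ {
						sum++
					}
				}
				atomic.AddUint64(&total, sum)
				time.Sleep(idle) // idle phase
			}
		}()
	}
	wg.Wait()
	fmt.Printf("total additions (goodput): %d\n", total)
}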

We ran the program on a small Kubernetes cluster in GKE with n1-standard-4 VM instances (4 cores). Note that the 200ms loop duration was deliberately chosen to be longer than the 100ms averaging interval used by the kubelet.

Initially, we applied no resource limits to the program, using a definition like the one below:
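A representative Pod spec, with an illustrative image name and no resources section at all:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-burst
spec:
  containers:
  - name: cpu-burst
    image: example/cpu-burst:latest   # illustrative image
    # no resources block: neither a request nor a limit is set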

During this execution, each of the Go-routines achieved approximately 1M additions during the 200ms loop, for a total “goodput” of 4M additions. We then modified the spec and limited the CPU to 2 cores. Note that this limit was selected on purpose, since it is lower than the parallelism level of the program itself. The corresponding definition was:
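Again as a representative sketch rather than the original manifest:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-burst-limited
spec:
  containers:
  - name: cpu-burst
    image: example/cpu-burst:latest   # illustrative image
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"   # capped at 2 cores, below the program's parallelism of 4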

The “goodput” of the program was immediately reduced by almost 50%, and each Go-routine was now capped at about 500–600K additions in 200ms. The total goodput dropped to approximately 2.3M. One could argue that this is expected behavior. However, if we plot the CPU utilization of the program over 100ms intervals we can see what the CPU was actually doing. The utilization was measured on the nodes using:

top -b -n 1000 -d 0.1 -p <pid> | grep scheduler

As can be seen in the figure, the average utilization over 1-second intervals when no caps or limits are applied is well below 1 core. However, because the instantaneous CPU utilization over shorter timescales exceeds the limit, the cap significantly degrades performance. One could reasonably expect that, since the CPU is under-utilized over 1-second intervals, there should be no difference in performance.

Oversubscribing and Statistical Multiplexing

So far we have been estimating the CPU load by applying limits, and we noticed that when limits are applied, both CPU utilization and the actual work done by our program are reduced. In order to illustrate this point better, we now run four instances of our program without any limits. Through a `nodeSelector` (see the snippet below) we force all instances of the containers to run on the same node. When there is no limit, the aggregate goodput of all PODs is now about 16M, which scales essentially linearly with the number of PODs since we are not over-utilizing the CPU. The aggregate capacity used remains below the four-core maximum.
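The relevant portion of each Pod spec is a `nodeSelector` keyed on the node's hostname label; the node name below is illustrative:

spec:
  nodeSelector:
    kubernetes.io/hostname: gke-test-node-1   # pin all four PODs to this node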

By not placing any limits we have achieved a higher overall system utilization without any degradation in performance. On the other hand, if we re-introduce the limits, the statistical multiplexing gain is removed, CPUs are idle and the programs take a longer time to complete.

In other words, by artificially constraining the CPU allocation of a POD over short timescales that do not match the parallelism properties of the program, we significantly limit the ability of the system to take advantage of statistical multiplexing gains. This leads to lower overall resource utilization, which translates into either higher costs or lower performance. Limits are useful only if we know the right timescales to use.

Performance of a Simple Web Server

We now extend the analysis to a simple web application serving API calls, representing a basic microservice. For simplicity, we assume that all API calls return immediately, and we do not account at this point for variable processing times. We also create a simple client that issues API calls following a Markov-modulated On/Off process with exponentially distributed On and Off periods.
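A minimal version of such a server, with an illustrative endpoint path, could look like this:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Every API call returns immediately, so CPU usage tracks the
	// request arrival rate rather than per-request processing time.
	http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}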

Traffic Client

Each client starts in the ON state and makes API requests as fast as it can. The time spent in the ON state is drawn from an exponential distribution. When the ON period ends, the client enters the OFF state for an exponentially distributed time, during which it remains idle. We superimpose many of these processes in order to create a realistic arrival pattern (a sketch of one such client follows).
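This is only a sketch of such a client, not the program used in the experiments; the endpoint, mean durations, and client count are illustrative:

package main

import (
	"math/rand"
	"net/http"
	"time"
)

// onOffClient alternates between an ON period, during which it issues
// requests back to back, and an OFF period, during which it stays idle.
// Both period lengths are drawn from exponential distributions.
func onOffClient(url string, meanOn, meanOff time.Duration) {
	for {
		// ON state: fire requests as fast as possible until the deadline.
		on := time.Duration(rand.ExpFloat64() * float64(meanOn))
		deadline := time.Now().Add(on)
		for time.Now().Before(deadline) {
			if resp, err := http.Get(url); err == nil {
				resp.Body.Close()
			}
		}
		// OFF state: stay idle for an exponentially distributed duration.
		time.Sleep(time.Duration(rand.ExpFloat64() * float64(meanOff)))
	}
}

func main() {
	const clients = 64 // superimpose many independent on/off sources
	for i := 0; i < clients; i++ {
		// The endpoint and mean durations are illustrative values.
		go onOffClient("http://api-server:8080/api", 500*time.Millisecond, 2*time.Second)
	}
	select {} // run until the process is killed
}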

Performance Analysis

In order to demonstrate the effects of CPU limiting, we deploy the client and server on different nodes of a Kubernetes cluster using 8-core machines in GCP. Initially, we deploy the server without limits, and then we introduce limits of 2, 3, and 4 cores. We chose the arrival rate so that, on average (over 5-second intervals), an unrestricted server has a utilization of less than 180% (essentially less than 2 cores). We then supply traffic and measure the average CPU utilization, the rate of API calls served per second, and the average latency of the API calls. The arrival traffic uses 64 independent clients.

The server CPU utilization without any limits over different time-scales is illustrated in the figure below:

As can be seen, in the unconstrained case the CPU utilization over longer timescales stays below 2 cores, although the 100ms CPU utilization can reach a maximum of 3 cores. The instantaneous utilization obviously depends on the number of concurrently active clients and on their on/off state.

We then repeat the same experiment, limiting the CPU utilization through the Kubernetes limit parameter to 2, 3, and 4 cores. The aggregate results are shown in the table below:

| Metric            | Unlimited | 2-cores | 3-cores | 4-cores |
|-------------------|-----------|---------|---------|---------|
| API calls /second | 22969     | 21563   | 22000   | 22500   |
| Average latency   | 14.3us    | 30.67us | 16.2us  | 16.1us  |

As we can see from the above results, given our traffic patterns, restricting the server to two cores reduced the API rate by about 6%, but, far more significantly, roughly doubled the average latency. Essentially, our clients observed twice the latency for executing the same API workload.

Increasing Source Burstiness

In order to further illustrate the effects of traffic burstiness, we modified our setup and launched client processes on different nodes, while keeping the aggregate CPU requirement of the server below 2 cores. Note that when all the clients run on a single node, the parallelism of the client processes is limited by the number of cores of that node. By launching clients on multiple machines, doubling the number of simulated sources while lowering their individual rate, we end up with a much burstier traffic profile even though the total traffic to the server is not affected. Below we illustrate this burstier traffic pattern:

Measuring the performance with and without limits, we see the same kind of latency increase and throughput drop as before. Note that the total arrival rate is actually smaller now and the total CPU utilization of the server is well below the 2-core limit, yet the effect on observed latency is almost 5x. When limits are used, traffic burstiness can dramatically increase latencies, and this can have serious consequences for overall application performance.

| Metric            | Unlimited | 2-cores |
|-------------------|-----------|---------|
| API calls /second | 13332     | 11900   |
| Average latency   | 3.1us     | 15.8us  |

Multi-hop Impacts

In a micro-service environment, where processing a user request might involve the execution of several micro-services, potentially in a series of calls, the above variations can have dramatic results. Not only is the latency of each individual server increased, but clients also see these latencies add up across hops, all while the overall CPU utilization remains low.

Know Your Limits

Unfortunately, it is extremely hard for a software team to estimate the instantaneous resource needs of an application. In several instances, engineering teams may measure longer-term averages under specific test workloads, but these averages rarely match the timescales at which the scheduler enforces limits. Even though in the synthetic application we can explicitly set the degree of parallelism and the active/sleeping periods, in production use it is not easy to estimate the effect of limits on application performance.

One has a few choices for properly setting limits:

1. Launch applications without limits, actively monitor the nodes, and hope that none of the microservices violates its expected behavior. Given that PODs from multiple users and/or namespaces can be scheduled on the same node, this approach provides very little comfort to operations teams regarding CPU boundaries. More importantly, in any soft multi-tenant environment developers will not be able to see the complete effect on their applications, since nodes are shared by several users.

2. Correctly estimate the limits. We believe that in most cases estimating parallelism over 100ms intervals (or, for that matter, any short timeframe) will be very hard for any application; more importantly, it limits the statistical multiplexing gains. In most scenarios not all micro-services burst in CPU utilization at the same time, and taking advantage of statistical multiplexing gains is critical to maintaining a relatively high overall node utilization.

3. Explicitly control the parallelism of applications so that the Kubernetes limits match the expected behavior. In this case an application is tested with specific parallelism settings (for example, by limiting GOMAXPROCS in Go, as in the sketch below), and the Kubernetes deployment is then configured to match those limits. When this approach is chosen, performance is predictable, but the statistical multiplexing gains are also limited, and therefore overall CPU utilization will be limited as well.
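A sketch of this approach in Go, assuming the deployment sets a matching 2-core limit:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap the number of OS threads executing Go code at 2, so that the
	// program's parallelism matches a Kubernetes CPU limit of 2 cores.
	runtime.GOMAXPROCS(2)
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 only queries the value

	// ... start the application's workers here ...
}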

Essentially we are seeing the same performance tradeoffs that the networking world has been studying for years. Hard limits mean low CPU utilization and potentially longer latencies for bursty applications. Lack of limits results in better CPU utilization but uncontrolled behavior and no separation between applications.

As a first step, a better balance is to increase the averaging period in your Kubernetes clusters, since at least then you get a more predictable performance boundary. Longer term, CPU scheduling and cgroup limits need significant work to account for statistical multiplexing gains.
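The period can be raised on the kubelet, either with the `--cpu-cfs-quota-period` flag or through the kubelet configuration file. A sketch with an illustrative 500ms period (on current Kubernetes versions this typically also requires the CustomCPUCFSQuotaPeriod feature gate):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CustomCPUCFSQuotaPeriod: true
cpuCFSQuotaPeriod: "500ms"   # default is 100ms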
