Application autoscaling strategy in Kubernetes

Dmitriy Paunin · The Startup · Apr 22, 2020

With all the cloud providers giving us ever more precise control over the resources behind our application infrastructure, we still need to define (and closely monitor and control) how much CPU, RAM, and I/O our system needs, and pay for just enough of it. And what about resilience? How do we avoid overpaying for stability? How do we control the budget without breaching the SLA? What defines scalability bottlenecks, and how do we find them in advance?

Before we dive deeper into the problem of application scaling, we should understand where this problem comes from… otherwise, we won’t reach the correct conclusions!

The reasons to autoscale are economic

The load profile of any Internet-facing application is almost certainly not constant over a time cycle (hour, day, week), and it changes from time to time (after updates, on new underlying infrastructure). With this in mind, everyone wants to save some money whenever the cloud “hardware” is doing less work than at the peak of the incoming-requests chart.

Let’s call “requests” all the jobs that an application deployed in a Kubernetes (or any other) cluster can execute: HTTP requests, queue consumption, and so on. The rate of such requests is also known as throughput.

The main objective of autoscaling is to keep the delta between Available and Used resources as small as possible at any moment of the system’s life cycle. In Kubernetes, there is also an intermediate objective: to keep the deltas between Requested and Used resources, and between Available and Requested resources, at a minimum.

If the deltas are equal to 0, no budget is wasted on resources that are not in use.
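
To ground the terms: in Kubernetes, “Requested” is what a pod asks the scheduler to reserve, “Available” is the node’s allocatable capacity, and “Used” is actual consumption. A minimal container spec as a sketch (names and numbers are illustrative, not from the article):

```yaml
# Illustrative container spec: "requests" are reserved against the node's
# Available (allocatable) capacity; "limits" cap what the pod may actually Use.
spec:
  containers:
  - name: app
    image: example.com/my-app:1.0   # placeholder image
    resources:
      requests:            # the "Requested" resources
        cpu: 250m
        memory: 256Mi
      limits:              # the ceiling on "Used" resources
        cpu: 500m
        memory: 512Mi
```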

Methods of autoscaling

Let’s consider four available types of scaling:

  • Vertical for nodes: the process of changing resources on each of the Kubernetes nodes
  • Horizontal for nodes: the process of changing the number of nodes in the cluster
  • Vertical for pods: the process of changing the amount of resources requested/allocated for a pod
  • Horizontal for pods: the process of changing the number of pods for a specific deployment of an application

By “autoscaling” we mean the ability of the system to automatically adjust (with the help of a script or a program) specific parameters based on inputs, metrics, and rules.

With Kubernetes, I believe it is hard (and probably pointless) to rely on only one method of autoscaling and still achieve a decent economic result without a significant risk of production performance issues. So, in the picture below, I’ve illustrated how the different parts of the ecosystem depend on each other in a sort of loop.

Autoscaling loop

For some, I would imagine, the budget supporting the entire infrastructure is semi-unlimited, so the application can scale semi-infinitely… but in real life that is rare, if not unrealistic. Also, please read the section below on “The Universal Scalability Law”, which explains why a system cannot be scaled limitlessly anyway.

Let’s walk through the cause-effect analysis of this loop and understand how adjusting one parameter affects the others.

(0). Maximum Pod Size = F(Pod Quality)

In general, I consider that the maximum amount of resources (CPU, RAM) given to any pod should be determined by the results of load testing and satisfy the condition of Best Resource Efficiency. From the chart, you can see that at some point a further increase of the resource allowance is less efficient than spinning up a new pod and giving those resources to the newly created pod.

Maximum Pod Size is a function of Pod Quality; it defines the maximum resources that should be given to a pod, limited by the Best Efficiency condition.
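
As a rough formalization of that chart (the notation here is mine, not from the article): let X(r) be the measured throughput of a single pod given resources r; then

```latex
% Resource efficiency of one pod, and the Best Efficiency point r*.
\[
  E(r) = \frac{X(r)}{r},
  \qquad
  r^{*} = \operatorname*{arg\,max}_{r} \, E(r)
\]
```

Beyond r*, each extra unit of resources buys less throughput than giving that same unit to a fresh pod of size r*.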

(1). Pod Size = F(Requests Per Second)

Pod Size is a function of the current load/Requests Per Second on the application, limited by Maximum Pod Size. It defines how many resources a pod needs to successfully process the current load/throughput.

In practice, adjusting Pod Size in your cluster can be done by the Vertical Pod Autoscaler (VPA). The VPA monitors metrics (representing the current number of requests or system load) and modifies pod resources (requests/limits) based on a formula or simple rules. The upper limit for this modification should be the Best Efficiency Point.
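
A minimal sketch of such a VPA object, assuming the VPA CRDs from the kubernetes/autoscaler project are installed in the cluster (names and numbers are illustrative):

```yaml
# Hypothetical VPA: let the recommender resize my-app's pods,
# but never above the Best Efficiency Point found by load testing.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"          # recreate pods with updated requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:               # the Best Efficiency Point acts as the cap (assumed values)
        cpu: "1"
        memory: 1Gi
```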

(2). Pods Count = F(Requests Per Second)

Pods Count is a function of Requests Per Second and defines the number of pods required to successfully process the current load (number of requests).

The Horizontal Pod Autoscaler (HPA) can do this for you, changing the number of pods based on metrics that effectively represent the current system load, throughput, or number of Requests Per Second.
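
A minimal sketch, assuming the cluster serves the autoscaling/v2 API (older clusters expose autoscaling/v2beta2) and using CPU utilization as a proxy for load; the bounds are placeholders:

```yaml
# Hypothetical HPA: keep average CPU utilization around 70%
# by adding or removing replicas of my-app.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2              # redundancy floor (see the resilience section below)
  maxReplicas: 20             # ceiling derived from Budget / Rate Limit
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

One caveat documented by the VPA project: don’t let the VPA from step (1) and an HPA act on the same CPU/memory metric for the same workload.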

(3). Network I/O = F(Pods Count)

If you have many pods and applications in your cluster, you should keep an eye on the network between nodes and make sure its performance (bandwidth, throughput, latency, etc.) is sufficient to support the communication the application needs in its operational mode.

Network I/O is a function of Pods Count and defines the required network performance between nodes, given the current pods layout on the nodes.

Pods layout is an interesting topic in itself and is explored in one of the following sections.

(4). Node Size = F(Network I/O)

The fewer nodes that have to communicate with each other, the less network performance you need to support that communication. Here I disregard internal node performance issues with shared resources (disk I/O, caches, etc.) that arise when too many pods run on one “physical” node: in most cases, inter-node communication issues appear sooner, and the Node Size term should in any case include these internal shared resources along with CPU and RAM.

Node Size is a function of Network I/O and defines how big the nodes (in terms of available resources) should be so as not to exceed the limits of network performance while still accommodating all pods.

I would assume this parameter is hard, but still possible, to automate. For example, based on network performance, you could change the cloud server type once in a while and replace (if required) the cluster’s nodes.

(5). Nodes Count = F(Node Size)

Knowing Node Size (which is based, in a way, on the system throughput), you should have no trouble calculating the required Nodes Count.

Nodes Count is a function of Node Size and defines the number of nodes required to support the given system load within the limits of network performance, Node Size, and the Best Efficiency condition.

The Cluster Autoscaler is the tool designed to automatically change the number of nodes in a cluster, with input from historical or real-time metrics.
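
The Cluster Autoscaler typically runs as a Deployment inside the cluster itself. The excerpt below sketches its container arguments for an AWS setup; the node-group name and bounds are assumptions:

```yaml
# Excerpt from a hypothetical cluster-autoscaler Deployment spec:
# the --nodes flag bounds Nodes Count for one node group.
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:my-node-group       # min:max:node-group-name (assumed)
  - --expander=least-waste           # prefer the node group that wastes the least resources
  - --balance-similar-node-groups    # keep equivalent node groups at equal size
```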

(6). Budget = F(Nodes Count)

The more nodes you need, the more money they eat. Nothing else to say here :)

Budget is a function of Nodes Count and dictates how much money is needed to build and maintain the desired system infrastructure.

(7). Rate Limit = F(Budget)

As I mentioned before, Budget is usually the main restriction for any company. It hence becomes the first parameter to calibrate, triggering a sort of chain reaction that affects the rest of the parameters, starting with Rate Limit.

Rate Limit is a function of Budget and defines the maximum number of Requests Per Second the system is allowed to process so as not to breach the company’s spending limits for the infrastructure.

Here and above I was assuming that SLA is a constant parameter and is not adjustable once agreed on.
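
Enforcing the Rate Limit usually happens at the edge. As one hedged example, the ingress-nginx controller supports per-client rate limiting through annotations (host, service name, and numbers are placeholders):

```yaml
# Hypothetical Ingress using ingress-nginx rate limiting: requests over
# ~100 RPS from one client IP are rejected (HTTP 503 by default).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "100"   # requests per second per client IP
spec:
  ingressClassName: nginx
  rules:
  - host: my-app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80
```

Note this annotation limits each client IP rather than total system throughput; a global budget-driven Rate Limit needs an aggregate limiter in front of the cluster.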

Scalability Model complications

This whole model sounds like La-La Land, with unicorns 🦄, butterflies 🦋, and some magical 🧙‍♂️ scalers that solve all of your problems… but you know what? Life is not that simple.

There are some complications that could change the model and bring extra inputs into the calculation of some parameters. The following parts of the article explain how to work with them.

Pods layout

From time to time, you will face a situation where the pods are spread across the cluster, but the effect they have on the nodes is unequal. This is far from the perfect scenario, as it indicates overload on some nodes and/or underutilization on others.

Rebalancing

The goal of the rebalancing process is to ensure there are no differences in load or available resources between separate nodes and cluster segments (unless that is really required). Balanced clusters are more resilient to unexpected events affecting your application and the underlying infrastructure.

Optimizing the pods layout requires a fairly complex approach that takes many system parameters into consideration, including, but not limited to, individual pods’ load profiles, node and network performance, and redundancy requirements. Preferably, rebalancing should be iterative, analyzing system behavior after every reorganization cycle.
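
One hedged building block for such an iterative cycle is the Kubernetes descheduler project, which evicts pods from overutilized nodes so the scheduler can place them somewhere better. A sketch of its (older, v1alpha1) policy format, with illustrative thresholds:

```yaml
# Hypothetical descheduler policy: treat nodes under 20% utilization as
# underused and nodes over 70% as overused, and evict pods to even them out.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # below this, a node counts as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # above this, a node counts as overutilized
          cpu: 70
          memory: 70
          pods: 70
```

Evicted pods are re-placed by the scheduler, so each descheduler run acts as one “reorganization cycle” in the iterative sense described above.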

Scheduling strategy and Quality of Service for pod

Ultimately, pod QoS (Quality of Service) classes and the Kubernetes scheduler give you a lot of flexibility to tweak the cluster as precisely as you need to achieve the desired pods layout.

Requests, limits, and affinities dictate the best location for all the pods, but it might take quite a few cycles before you can consider the exercise finished. Depending on how equivalent you need the nodes’ states to be, the automation can be simple or heavy to implement.
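
For illustration, here is a pod template fragment that both earns the Guaranteed QoS class (requests equal to limits for every container) and softly asks the scheduler to spread replicas across nodes; the labels and image are assumptions:

```yaml
# Hypothetical pod template fragment: Guaranteed QoS plus a soft
# anti-affinity that spreads my-app replicas across nodes.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: kubernetes.io/hostname   # one replica per node, if possible
  containers:
  - name: app
    image: example.com/my-app:1.0
    resources:
      requests:              # requests == limits => Guaranteed QoS class
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 512Mi
```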

Redundancy for resilience

Computational Resources

So far I have kept redundancy (extra resources) for system resilience out of scope. Reliability is a separate problem, usually solved with additional resources put in place to keep your application’s performance at some constant level (SLA).

You should understand (from history, or by obtaining the information in other ways) how many instances of the ATOMIC entities (i.e. pods in a deployment or nodes in the cluster) can be unavailable for various reasons during your application’s life cycle. Such events include deployments, crashes, maintenance, switchovers, and rebalancing.

Once you know that, the solution becomes obvious: add more entities in advance, or for periods with a high risk of such an “event” happening. You cover the risk with extra c̶a̶s̶h̶ resources.
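
On the Kubernetes side, a PodDisruptionBudget is one way to defend that floor of entities during voluntary events such as deployments, node drains, and rebalancing. A minimal sketch (the policy/v1 API assumes Kubernetes 1.21+; name and numbers are illustrative):

```yaml
# Hypothetical PDB: never let voluntary disruptions (drains, rebalancing)
# take my-app below 2 available replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2            # the redundancy "factor" expressed as a floor
  selector:
    matchLabels:
      app: my-app
```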

Once the redundancy “factor” is calculated and you know how many more pods/nodes you need…

  • …and you are tolerant of TEMPORARY system performance degradation: you can slightly adjust the parameters that depend on, or determine, the number of these ATOMIC entities.
    In other words, for Nodes Count you will have to reconsider Node Size or Budget to support this number of nodes; for Pods Count you will need to change the Rate Limit or make sure that Network I/O stays at a decent level.
  • …and you are NOT tolerant of any system performance degradation: you should only adjust the parameters that depend on the ATOMIC entities.
    In other words, for Nodes Count your Budget will be affected; for Pods Count, Network I/O will change.

The rest of the system’s parameters are then calculated along this loop/cascade, as explained above in the article. Given the updates dictated by redundancy demands, you should introduce coefficients or constant additions into the formulas or rules you decide to use for autoscaling.

Network (I/O) Resources

When it comes to complex network topologies of Kubernetes clusters, or indeed any other clusters (different Availability Zones, multi-region clusters, multi-cloud clusters, any other type of federated cluster), this whole story becomes even more exciting. Still, the high-level (auto)scaling strategy should remain the same as in the ‘Computational Resources’ section… with perhaps minor differences.

If you can’t tolerate any performance issues, there is no magic: you just need to make sure your network segments duplicate the entire infrastructure so they are ready to cross-handle outages.
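
For example, a topology spread constraint can keep replicas evenly distributed across Availability Zones, so that losing one segment leaves the others proportionally provisioned (this assumes Kubernetes 1.19+ and the standard topology.kubernetes.io/zone node label; the app label is a placeholder):

```yaml
# Hypothetical pod template fragment: keep my-app replicas spread
# evenly across Availability Zones (at most 1 replica of skew).
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft constraint; use DoNotSchedule for a hard one
    labelSelector:
      matchLabels:
        app: my-app
```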

Different expectations about network segment downtime dictate different cluster installation tweaks:

  • Healthy segments should handle the extra load from unhealthy ones by scaling themselves up. This means the overall Rate Limit across all healthy segments should stay constant, regardless of how many unhealthy segments there are in the system. The rest of the parameters should be calculated automatically, and resources scaled accordingly.
  • Healthy segments should NOT cover for unavailable parts of the system. This means the Rate Limit can vary with the number of healthy network segments.

Based on your target Recovery Time Objective (RTO) and cluster autoscaling speed, you can pick one of these strategies, or a combination, and follow it.

You can also disregard the need for redundancy if the cost of supporting it is substantially higher than the potential loss from degraded performance or even downtime. So be careful and data-driven. (Run some tests?)

The Universal Scalability Law

No system can be scaled infinitely, and this is something to always keep in mind when you are designing a system, trying to understand its scalability limits, and finding bottlenecks. This knowledge is also useful in early development or even early design stages.

It’s worth diving into the Universal Scalability Law to get familiar with the answer to the question: “Why can increasing resources not only fail to deliver higher throughput, but actually cause issues and introduce performance degradation?”
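
For reference, the law (due to Neil Gunther) models throughput X at a concurrency or node count N, with a contention coefficient σ and a coherency (crosstalk) coefficient κ:

```latex
% Universal Scalability Law: lambda is single-unit throughput,
% sigma models contention, kappa models coherency/crosstalk cost.
\[
  X(N) \;=\; \frac{\lambda N}{\,1 + \sigma\,(N-1) + \kappa\,N\,(N-1)\,}
\]
```

With κ > 0, throughput peaks near N* = √((1−σ)/κ) and declines beyond it, which is exactly the degradation the question above refers to.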

P.S.

In closing, always remember that any solution must be tested and challenged against your particular situation and system, with its unique SLO and SLA requirements. This article is written to give you only a high-level overview of how to approach an autoscaling framework… it lacks many details on implementation, specific tools, and practices.

Thanks.
