Cost-Based Node Group Ranking for Cluster Autoscaler

When your Kubernetes cluster has different types of nodes (like some with GPUs and others without), the Cluster Autoscaler (CA) needs to figure out which type of node to add when more resources are needed. Currently, it often just picks one at random. This can lead to situations where a powerful, expensive node is added to handle a simple task, which is inefficient.

We need a better way to choose the node type for expansion: a way to compare the cost of adding different node types and pick the most economical option.

Estimating Node Costs

To make informed choices, we need a way to estimate the cost of each node type. For cloud providers like Google Cloud Platform (GCP), node prices are readily available, just not through the Kubernetes API, so a simple configuration file can be used to store these costs.

We don’t need to be super precise; we just need the relative ordering to be right, so that cheaper nodes rank ahead of more expensive ones.
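
As a concrete illustration, the configuration could be as simple as a map from machine type to hourly price. The sketch below uses made-up (but roughly realistic) GCP on-demand prices and a hypothetical `node_cost` helper; the exact numbers matter less than their relative order.

```python
# A static price table for node types (illustrative values only; in
# practice these would be loaded from a configuration file).
NODE_HOURLY_COST = {
    "n1-standard-2": 0.095,
    "n1-standard-8": 0.380,
    "n1-standard-8-gpu": 1.080,  # assumed: n1-standard-8 plus one GPU
}

def node_cost(machine_type: str) -> float:
    # Unknown machine types get infinite cost, so they rank last.
    return NODE_HOURLY_COST.get(machine_type, float("inf"))
```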

Picking the Best Node Pool

Knowing the cost of individual nodes is only part of the solution. We need a way to choose the node pool that is best suited for all the unscheduled pods. Sometimes, using only one node pool isn’t the best idea because:

* Node pool limits: adding enough nodes for every pod might push a pool past its maximum size.
* Pod fit: some pods might not fit on certain node types.
* Node selectors: different pods might require different node types.

Different node pools could potentially handle different pods at very different costs, making a simple comparison challenging. Let’s represent the cost of expanding a specific node pool as “C”.

An Example

Imagine two expansion options:

* Option 1: adds 3 nodes of Type 1, accommodates pods P1, P2, P3, and costs $10 (C1).
* Option 2: adds 2 nodes of Type 2, accommodates pods P1, P3, P4, P5, and costs $20 (C2).

It’s tricky to determine if spending $10 to run 3 pods is better than spending $20 to run 4 pods. We need a better way to compare C1 and C2.

Theoretical Cost (T)

We can estimate the cost of running each pod on a node perfectly suited to its needs. Based on pricing from a provider like GCP, we can use values like:

* 1 CPU core costs $0.033174 per hour.
* 1GB of memory costs $0.004446 per hour.
* 1 GPU costs $0.7 per hour.
* 50GB of local SSD storage costs $0.01 per hour.

We can use this information to calculate the theoretical cost (T) of running all pending pods on ideally sized machines for each expansion option. Then, we can look at the ratio of actual cost to theoretical cost (C/T).

* If C/T is 2, it means we’re paying double the ideal cost.
* If C/T is 1.05, we’re close to the ideal, suggesting that it’s probably a good option.

By always picking the expansion with the lowest C/T, we get a good approximation of the cheapest cluster setup.
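
Here is a minimal sketch of this calculation, using the unit prices listed above. The pod resource requests (CPU cores, GB of memory) and the helper names are invented purely to revisit the earlier two-option example; note that C and T must be expressed in the same units for the ratio to make sense.

```python
# Unit prices from the list above ($ per hour).
CPU_HOUR = 0.033174
MEM_HOUR = 0.004446
GPU_HOUR = 0.7

def pod_theoretical_cost(cpu: float, memory_gb: float, gpus: int = 0) -> float:
    # Cost of running one pod on a perfectly sized machine.
    return cpu * CPU_HOUR + memory_gb * MEM_HOUR + gpus * GPU_HOUR

# Hypothetical requests (cpu, memory_gb) for the pods from the example.
pods = {"P1": (1, 4), "P2": (2, 8), "P3": (0.5, 2), "P4": (1, 2), "P5": (0.5, 2)}

# T for each option sums the ideal cost of the pods that option schedules.
t1 = sum(pod_theoretical_cost(*pods[p]) for p in ("P1", "P2", "P3"))
t2 = sum(pod_theoretical_cost(*pods[p]) for p in ("P1", "P3", "P4", "P5"))

# Rank by C/T, treating the $10 and $20 as hourly costs: lower wins.
print("Option 1:", 10 / t1)  # ~56.1
print("Option 2:", 20 / t2)  # ~138.9
```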

Adding Node Preference

C/T is a good starting point, but it doesn’t capture preferences, such as wanting bigger machines in large clusters or keeping node types consistent. Bigger nodes usually mean less resource fragmentation (wasted space) and a better chance of fitting future pods.

NodeUnfitness

We introduce a “NodeUnfitness” metric for each node type to account for this preference. It’s a value that indicates how far a node is from the desired “shape” of the cluster: the higher the value, the worse the fit.

We can calculate NodeUnfitness as the ratio between the preferred node size and the current node’s size:

NodeUnfitness = max(preferred_cpu/node_cpu, node_cpu/preferred_cpu)

For example, if the preferred node is `n1-standard-8`, the `NodeUnfitness` of an `n1-standard-2` node would be 4.
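
Transcribed directly into Python (a sketch; a real implementation could compare memory and other dimensions the same way):

```python
def node_unfitness(preferred_cpu: float, node_cpu: float) -> float:
    # Symmetric ratio: 1.0 is a perfect match, and nodes that are too
    # small or too large are penalized equally.
    return max(preferred_cpu / node_cpu, node_cpu / preferred_cpu)

node_unfitness(8, 2)   # n1-standard-2 vs. preferred n1-standard-8 -> 4.0
node_unfitness(8, 32)  # an oversized node is penalized the same way -> 4.0
```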

Combining Cost and NodeUnfitness

Ideally, we’d combine C/T and NodeUnfitness in a single formula. However, simply adding them linearly is problematic (the two terms aren’t on comparable scales). Instead, we introduce a “big cluster damper” (X), which we add to both the actual and theoretical costs in the C/T ratio:

(C + X) / (T + X)

Experimenting with various values for X, such as the cost of running a 0.5-CPU pod, shows that the damper can be used to tilt the ranking in favor of bigger nodes.
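
A quick numeric sketch of the damper’s effect, assuming X is the hourly cost of a 0.5-CPU pod (using the CPU price from earlier):

```python
X = 0.5 * 0.033174  # assumed damper: hourly cost of a 0.5-CPU pod

def damped_ratio(c: float, t: float, x: float = X) -> float:
    return (c + x) / (t + x)

# For tiny expansions X pulls the ratio toward 1, muting noisy cost
# differences; for large expansions X is negligible and C/T dominates.
print(damped_ratio(0.10, 0.05))   # small expansion: ~1.75 instead of 2.0
print(damped_ratio(100.0, 50.0))  # large expansion: still ~2.0
```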

The Final Formula

After considering these issues and experimenting, we arrive at a final ranking function (a code sketch follows the definitions below):

suppress(NodeUnfitness, NodeCount) * (C + X) / (T + X)

where:

* `suppress(u, n)` is a function that gradually reduces the influence of `NodeUnfitness` as the number of nodes needed increases (explained further below).
* `NodeUnfitness` is calculated as mentioned earlier.
* `C` is the actual cost of adding the nodes.
* `T` is the theoretical cost of accommodating the pending pods ideally.
* `X` is a “damper” value to stabilize the C/T ratio, typically the cost of running a small pod.
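
Putting the pieces together, here is a minimal sketch of the whole ranking. The name `expansion_score` is ours, and the `suppress` definition is taken from the next section; lower scores are better.

```python
import math

def suppress(u: float, n: int) -> float:
    # Defined in the next section: damps unfitness as the scale-up grows.
    return (u - 1) * (1 - math.tanh((n - 1) / 15.0)) + 1

def expansion_score(c: float, t: float, unfitness: float,
                    node_count: int, x: float) -> float:
    # Lower is better: the cheapest, best-fitting expansion wins.
    return suppress(unfitness, node_count) * (c + x) / (t + x)
```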

Suppressing NodeUnfitness for Large Expansions

The suppression function `suppress(u, n)` aims to:

* Keep `NodeUnfitness` the same when adding only one node.
* Decrease the effect of `NodeUnfitness` for larger expansions (multiple nodes).

This is useful because when expanding with a lot of nodes, the cost difference might be the primary driver, and we might prefer cheaper but “less-ideal” nodes.

The final function uses a modified sigmoid function (based on tanh) for suppression:

suppress(u,n) = (u-1)*(1-math.tanh((n-1)/15.0))+1

This effectively achieves the desired suppression behavior, smoothly reducing `NodeUnfitness`’s influence as the number of nodes (`n`) increases.
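
A quick check of that behavior, evaluating the function for an unfitness of 4 at a few expansion sizes (values rounded):

```python
import math

def suppress(u: float, n: int) -> float:
    return (u - 1) * (1 - math.tanh((n - 1) / 15.0)) + 1

for n in (1, 5, 15, 50):
    print(n, round(suppress(4.0, n), 2))
# 1 -> 4.0 (full effect), 5 -> 3.22, 15 -> 1.8, 50 -> 1.01 (almost gone)
```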

Preferred Node Size

We need a mechanism for selecting the “preferred node” size based on the overall cluster size. A simple, hard-coded mapping can be used (a code sketch follows the list):

* Cluster size 1–2: `n1-standard-1`
* Cluster size 3–6: `n1-standard-2`
* Cluster size 7–20: `n1-standard-4`
* Cluster size 21–80: `n1-standard-8`
* Cluster size 81–300: `n1-standard-16`
* Cluster size 300+: `n1-standard-32`
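
As a sketch, the mapping is just a threshold lookup (helper name and structure are ours):

```python
# Hard-coded thresholds mirroring the list above:
# (max cluster size, preferred machine type).
PREFERRED_BY_SIZE = [
    (2, "n1-standard-1"),
    (6, "n1-standard-2"),
    (20, "n1-standard-4"),
    (80, "n1-standard-8"),
    (300, "n1-standard-16"),
]

def preferred_node(cluster_size: int) -> str:
    for max_size, machine_type in PREFERRED_BY_SIZE:
        if cluster_size <= max_size:
            return machine_type
    return "n1-standard-32"  # clusters larger than 300 nodes
```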

Summary

This blog post presents a cost-based approach to improve node selection for the cluster autoscaler, making it more intelligent in heterogeneous Kubernetes clusters.

This ranking function considers the cost of different node types, takes pod resource requests into account, and incorporates a preference for larger, more consistent nodes, resulting in a more efficient and cost-effective cluster. By combining a NodeUnfitness metric with a suppression function based on the scale-up size, this approach promotes cost savings and improves the cluster’s resource management.

Source: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/pricing.md#adding-preferred-node-type-to-the-formula
