3.14 Tips for using the Kubernetes Cluster Autoscaler on Oracle Cloud

In this article, we introduce two implementations of Oracle Cloud’s Cluster Autoscaler Node Groups and share some tips we’ve learned using them at scale.

Jesse Millan
Oracle Developers
7 min read · Mar 14, 2022


The Kubernetes Cluster Autoscaler (CA) automatically adjusts the number of Kubernetes Nodes in your cluster based on the workload by dynamically adjusting the size of scaling groups on the backing cloud provider. Many cloud providers have one or more implementations that act as a “Node Group” for the Cluster Autoscaler including Oracle Cloud Infrastructure (OCI). If you are running Oracle Container Engine for Kubernetes (OKE), Node Groups are implemented as OKE Node Pools on the backend (--cloud-provider=oke). For those running a non-hosted Kubernetes distribution on OCI infrastructure, we’ve developed a new implementation using OCI Instance Pools (--cloud-provider=oci).

In celebration of Pi (π) Day, we thought we’d take a look at 3 important areas we focus on when configuring autoscaling of Kubernetes Nodes at scale: the configuration of the backing Instance Pools and OKE Node Pools, the Cluster Autoscaler configuration, and the application Pods running in our production clusters. Across these 3 areas, we’ve selected the top 14 tips we’ve learned… 3.14, get it?

Note: If you haven’t already done so, you can sign up for an Oracle Cloud Free Tier account today.

Cluster Autoscaling in a nutshell

Before diving into specific tips and recommendations, the first thing we learned is that you can avoid a lot of frustration by being familiar with how the CA works, how it doesn’t work, and how it can be tuned and customized.

What CA does:

  1. Wait --scan-interval, check for Pending Pods in the Kubernetes cluster.
  2. For each node-group in the CA configuration, ask provider for a “model” of what a new Node from this group would look like (i.e. resources, labels, taints, etc.).
  3. Simulate the Kubernetes scheduler logic to predict whether the Pending Pod(s) could be scheduled on a hypothetical new Node from this group.
  4. If they could, and neither this node-group’s max size nor the cluster’s total Node limit has been reached, have the cloud provider perform the resize (potentially adding many Nodes across multiple groups).
  5. Loop back to step 1.

What CA doesn’t do:

  1. Take into account actual CPU, memory, or GPU utilization on Nodes (only the combined resource requests).
  2. Register Nodes to your Kubernetes cluster.
  3. Install software or configure Nodes.
  4. Add labels or taints to Nodes.
  5. Schedule Pods to Nodes.

The CA Frequently Asked Questions on GitHub has a lot more information and is worth a read.

1. Cloud Provider Node Group Configuration Tips

Tip 1.1: Configure the CA Deployment to manage multiple node-groups (i.e., OKE node-pools or OCI instance-pools), each scoped to a single Availability Domain (AD). This configuration allows the autoscaler’s scheduling simulator to accurately and deterministically scale up Kubernetes Nodes based on the AD (e.g., when anti-affinity for topology.kubernetes.io/zone is the reason a particular Pod is pending/unschedulable).

Tip 1.2: Set the user-defined max node count --nodes=<min>:<max>:<id> on node groups as large as possible. Because scaling is a time-consuming process that the CA performs serially, we’ve observed increased performance by having it provision the most Nodes in the fewest separate operations.

Tip 1.3: Set the min node count of node-groups to 0 --nodes=0:<max>:<id>. The CA is capable of scaling unused node-groups down completely (i.e., to 0) when no Nodes from the group are needed. Allowing a complete scale down reduces or eliminates the cost of unneeded resources on your cloud provider.
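Putting these node-group tips together, a CA Deployment arg fragment might look like the following sketch. The instance-pool OCIDs and the max of 50 are placeholders, one pool per AD, each allowed to scale from 0:

```yaml
# Fragment of a Cluster Autoscaler Deployment spec (illustrative values):
args:
- --cloud-provider=oci
- --nodes=0:50:ocid1.instancepool.oc1..aaaa-ad1   # pool scoped to AD-1
- --nodes=0:50:ocid1.instancepool.oc1..aaaa-ad2   # pool scoped to AD-2
- --nodes=0:50:ocid1.instancepool.oc1..aaaa-ad3   # pool scoped to AD-3
```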

Tip 1.4: When using more than one node-group, use expanders to control which group to scale up if more than one would satisfy the Pod requirements. As noted, we almost always configure the CA to manage more than one node-group. We also prefer to use flexible compute shapes when they are available. The --expander=priority option lets us tell the CA which node-groups we prefer, e.g., prioritizing our flex-instance based pools over our vm-standard instance based pools.
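The priority expander reads its ordering from a ConfigMap named cluster-autoscaler-priority-expander in the CA’s namespace. A minimal sketch, assuming hypothetical pool-name patterns (higher number wins; entries are regexes matched against node-group names):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
      - .*flex.*          # prefer flex-shape based pools
    10:
      - .*vm-standard.*   # fall back to vm-standard based pools
```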

2. Cluster Autoscaler Configuration Tips

Tip 2.5: Use the same version of the Cluster Autoscaler (CA) as your Kubernetes server. For example, if your cluster is running Kubernetes v1.23, then you should run Cluster Autoscaler v1.23. The reason for this is that the CA imports the scheduling logic directly from Kubernetes itself. To avoid difficult-to-diagnose scheduling mismatches between the CA simulator and your actual scheduler, use the same version of the CA as Kubernetes. Also, consider the implications for the CA before using a custom scheduler in your Deployments (i.e., Scheduling Policies may differ from the CA’s scheduling simulator).

Tip 2.6: Isolate the CA Pod(s) from your containerized workloads. Because the CA is fairly CPU and memory intensive, we prefer to run it on our Kubernetes master Nodes by having the CA Deployment tolerate our masters’ taints. We avoid allowing the CA Deployment to run alongside our application Pods on Nodes since application load can negatively impact the CA’s performance, or worse, cause it to be evicted leaving it unable to take scaling action to resolve the contention.
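One way to express this isolation is to schedule the CA Pod onto the (tainted) masters. A sketch, assuming the standard node-role.kubernetes.io/master label and taint on your control-plane Nodes:

```yaml
# Fragment of the Cluster Autoscaler Deployment's pod template:
spec:
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
```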

Tip 2.7: Run multiple replicas of the CA. Only one autoscaler replica can be the leader at any given time. Still, we’ve found running multiple replicas is beneficial since it increases the probability that one is always healthy and functional.

Tip 2.8: Customize the max time the CA will wait for a new Node to be provisioned & join your cluster on your cloud provider with --max-node-provision-time (default 15 minutes). Even within the same provider, the time it takes for a new Node to be provisioned and register itself can vary by several minutes based on factors including the operating system and the bootstrap procedure for new Nodes joining the cluster (OKE, RKE, kops, etc.).

Tip 2.9: Customize how often the cluster is reevaluated with --scan-interval (default 10 seconds). Consider the trade-off between your responsiveness requirements for the CA and the volume of regular API calls that will be required of your Kubernetes API server and your cloud provider. The last thing you want is for the CA to be throttled at the very moment you need it to take scaling action. On most cloud providers, provisioning Nodes takes several minutes. We’ve found we can increase the scan-interval to 2+ minutes without noticeably impacting the CA’s responsiveness.
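These two timing knobs are plain CA flags; the values below are illustrative, not recommendations:

```yaml
# Fragment of a Cluster Autoscaler Deployment spec:
args:
- --scan-interval=2m              # reevaluate the cluster every 2 minutes
- --max-node-provision-time=25m   # allow slower OS/bootstrap paths to finish
```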

Tip 2.10: Monitor the cluster-autoscaler-status configmap, events, and the CA’s Prometheus-style /metrics endpoint. In addition to the CA container logs, the cluster-autoscaler-status configmap is helpful for debugging issues. It contains the status of your node-groups as perceived by the CA, including the health of each, how many Nodes have been requested versus how many have actually joined the cluster, scaling events, etc. The CA’s internal /metrics endpoint contains even more detailed metrics (goroutine counts, CPU and memory details, etc.) as well as metrics on the time taken by various parts of the CA framework.
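As an illustration, here is roughly what the status configmap looks like; the timestamps, counts, and node-group name below are hypothetical, and the exact text format can vary by CA version:

```yaml
# Retrieved with: kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-status
  namespace: kube-system
data:
  status: |
    Cluster-autoscaler status at 2022-03-14 09:26:53:
    Cluster-wide:
      Health:  Healthy (ready=12 unready=0 notStarted=0 registered=12)
      ScaleUp: InProgress (ready=12 cloudProviderTarget=14)
    NodeGroups:
      Name:    ocid1.instancepool.oc1..aaaa-ad1
      Health:  Healthy (ready=4 cloudProviderTarget=4)
```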

Tip 2.11: Run Cluster Autoscaler (CA) alongside Horizontal Pod Autoscaler (HPA) or the Cluster Proportional Autoscaler (CPA) for diverse scalability. For certain workloads, these different Kubernetes autoscalers solve different problems and can complement each other nicely when used together. The HPA and the CPA both dynamically increase Deployment replica counts, which increase the required resources. The CA steps in and adjusts the Node count to meet the new resource requirements of the newly added replicas.
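For example, a CPU-based HPA like the sketch below (the Deployment name and thresholds are hypothetical) raises the replica count under load; when the new replicas can’t be scheduled, the CA adds Nodes:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out above 70% average CPU utilization
```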

3. Application Pod Configuration Tips

Tip 3.12: Wherever possible, ensure your application Pods are resilient to being moved. In addition to watching for unschedulable Pods, the CA is regularly looking for opportunities to move Pods around to different Nodes in an attempt to decrease the Node count in the cluster (i.e. to remove underutilized Nodes). It’s important to remember that deleting a given Node requires restarting all the Pods that were previously running on it.

Tip 3.13: Prevent any application Pods that aren’t resilient from being moved. Some Pods aren’t resilient or are inherently expensive to move. We annotate such Pods with cluster-autoscaler.kubernetes.io/safe-to-evict=false to prevent them from being forced onto different Nodes by the CA.
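The annotation goes on the Pod itself, typically via the workload’s pod template:

```yaml
# Fragment of a Deployment's pod template for a Pod that is expensive to move:
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```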

Tip 3.14: Ensure all your Pods have resource requests set in their specs. The CA does not monitor actual resource usage on the Node — only the combined resource requests in the pod specs. Unless all of your Pods set spec.containers[].resources.requests, the CA will not have an accurate picture of a Node’s resource utilization — including whether an underutilized Node has fallen below --scale-down-utilization-threshold. A single Pod without resource requests will throw it off, since its resources are not reflected in the CA’s total of resource requests on that Node.
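A minimal sketch of a container spec with requests set (the container name, image, and values are placeholders; size the requests to your workload):

```yaml
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        cpu: 250m        # counted by the CA toward Node utilization
        memory: 256Mi
```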

Wrapping Up

In this post we’ve shared our most important and hard-won tips for running the Cluster Autoscaler. We’ve also provided links to OCI’s two implementations of autoscaling node-groups (OKE node pools, OCI instance pools) for both hosted and non-hosted Kubernetes options. We hope you’ve found at least some of the tips helpful. Please don’t hesitate to reach out by leaving a comment or by filing an issue in one of the related GitHub projects.

Happy Pi Day!

Join the conversation!

If you’re curious about the goings-on of Oracle Developers in their natural habitat, come join us on our public Slack channel! We don’t mind being your fish bowl 🐠



Jesse is a software developer at Oracle focused on Kubernetes.