Nova Just-In-Time K8s Clusters

Operation & Example Use Case for Low-Risk Multi-Cluster K8s Version Upgrades

Anne Holler
Elotl blog
8 min read · Apr 12, 2023


ABSTRACT

Elotl Nova simplifies the management and usage of multiple K8s workload clusters. Nova introduces a single K8s control plane cluster that automatically schedules K8s objects onto workload clusters in accordance with placement policies.

Nova has recently been updated to optionally support Just-In-Time cloud workload clusters. When the JIT workload clusters option is enabled, Nova puts cloud clusters that have been idle for a configurable amount of time into standby state. When Nova schedules objects to a standby cluster, Nova brings the cluster out of standby state. Nova can also optionally create a cloud workload cluster, by cloning an existing cluster, if needed to satisfy a placement policy.

This blog describes the operation of Nova’s JIT workload clusters feature. The blog includes an example Nova JIT use case, which is to facilitate multi-cluster K8s version upgrades. Such upgrades present performance risks to K8s workloads, and mitigating those risks involves the operational complexity of sequencing the upgrades, maintaining cluster capacity during upgrade, and ensuring smooth version rollback if needed. Nova JIT reduces the toil of performing low-risk K8s version upgrades, by handling key steps in the process.

INTRODUCTION

Kubernetes is a popular system for deploying, scaling, and managing containerized applications in the cloud. Organizations often deploy separate cloud K8s clusters for their various workload categories, tailoring the clusters to each category’s resource needs, usage burstiness, data access requirements, and business criticality. For example, CI/CD workloads may be run on-demand on commodity x86_64 nodes, machine learning (ML) training may be run as scheduled batch jobs on expensive GPU-enabled nodes with large dataset access, and production ML serving may be run on closely-monitored nodes tuned for efficient ML prediction.

Providing multiple K8s clusters introduces operational complexity, both for the users of the infrastructure and for the team managing it. Elotl Nova reduces the operational complexity for both groups by introducing a single K8s control plane cluster that automatically schedules K8s objects onto the workload K8s clusters. The K8s infrastructure users interact with the control plane cluster for scheduling their workloads, and the team managing the K8s infrastructure defines Nova scheduling policies for placing workloads on the appropriate target cluster.

Nova has recently been updated to optionally support Just-In-Time cloud workload clusters. When the JIT workload clusters option is enabled, Nova puts clusters that have been idle for a configurable period into standby state. When Nova schedules objects to a standby cluster, Nova brings that cluster out of standby state. Nova can also optionally create a cloud workload cluster, by cloning an existing cluster, if needed to satisfy policy-based placement.

This blog describes the operation of Nova’s JIT workload clusters feature. The blog includes an example Nova JIT use case, which is to facilitate multi-cluster K8s version upgrades. Such upgrades present performance risks to K8s workloads, and mitigating those risks involves the operational complexity of sequencing the upgrades, maintaining cluster capacity during upgrade, and ensuring smooth version rollback if needed. Nova JIT reduces the toil of performing low-risk K8s version upgrades, by handling key steps in the process.

DESCRIPTION

Elotl Nova is a multi-cluster multi-cloud control plane that provides policy-driven placement onto workload Kubernetes clusters. Nova presents a single K8s control plane to the user for K8s object placement, and the Nova control plane places those objects on workload K8s clusters, according to the configured policies. Figure 1 depicts Nova’s operation.

Figure 1: Elotl Nova Operation

When the JIT workload clusters option is enabled, Nova puts clusters that have been idle, i.e., have not hosted any Nova-scheduled resource-consuming objects, for a configurable period (default: 3600 seconds, i.e., 60 minutes) into standby state. When Nova schedules objects to a standby cluster, Nova brings the cluster out of standby state.

By default, standby state is implemented by scaling the cluster’s node groups/pools to size 0, while leaving the cluster’s cloud control plane running. Exit from standby restores the node groups’/pools’ original sizes. This suspend/resume mode of operation allows reasonably quick standby enter/exit (~2 minutes or less each), with the standby cost being the cluster control plane fee ($0.10/hour on both EKS and GKE).

Alternatively, standby state can be implemented by deleting the cluster from the cloud. Exit from standby then involves recreating the cluster and reinstalling the Nova agent software. The delete/recreate alternative reduces the standby cost to $0, but incurs significant latency entering (3–10 minutes) and exiting (3–15 minutes) standby. Delete/recreate standby can optionally include creating workload clusters by cloning an existing cluster, if needed to satisfy policy-based placement. Clone creation can be bounded by a configured cluster count limit.
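The tradeoff between the two standby modes can be made concrete with a little arithmetic on the figures above; the sketch below is illustrative only, using the quoted $0.10/hour control plane fee.

```python
# Rough cost comparison of the two standby modes described above.
# Suspend/resume keeps the cloud control plane running ($0.10/hour on
# EKS and GKE); delete/recreate costs nothing while in standby but has
# higher standby enter/exit latency.

CONTROL_PLANE_RATE = 0.10  # USD/hour for an idle EKS/GKE control plane


def suspend_standby_cost(idle_hours: float) -> float:
    """Standby cost of suspend/resume mode for a given idle period."""
    return CONTROL_PLANE_RATE * idle_hours


# A cluster left in suspend/resume standby for a 30-day month (720 hours):
print(f"suspend/resume, 30 days idle: ${suspend_standby_cost(720):.2f}")
# Delete/recreate standby would cost $0 for the same period, at the price
# of roughly 3-10 minutes to enter and 3-15 minutes to exit standby.
```

At roughly $72 per idle cluster-month, suspend/resume is cheap insurance when clusters wake often; delete/recreate pays off for clusters that stay idle for long stretches and can tolerate minutes of resume latency.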

We note that Nova workload clusters may include a cluster autoscaler. Cluster autoscaling is useful when a cluster’s usage is bursty. Cluster autoscaling is complementary to JIT clusters, and the two technologies work together, with Nova placement for resource availability taking cluster autoscaling into account. Elotl Luna is an intelligent K8s cluster autoscaler that provisions just-in-time, right-sized, cost-effective compute resources.

EXAMPLE USE CASE

We consider the use case of Nova JIT facilitating multi-cluster K8s version upgrades. In our example, Nova is managing two workload clusters, one running a development deployment of the application of interest and one running a production deployment of that application, as shown in Figure 2. For the application of interest, we use the guestbook app, deployed as shown in Appendix A1. All three K8s clusters are running on EKS, as shown in Figure 3. Nova JIT standby state is configured to delete the cluster, and Nova JIT cluster creation is enabled.

Figure 2: Nova JIT Example Use Case: Nova managing development and production clusters
Figure 3: Nova JIT Example Use Case: Nova deployment on EKS K8s clusters before upgrade

The operator’s goal is to upgrade the workload clusters’ K8s version to 1.25, while managing the risk that the upgrade impacts the application of interest. The strategy is to upgrade the development workload cluster first, maintaining full cluster capacity and retaining the ability to roll back easily if any problems are observed, and then, after some soak time, to upgrade the production cluster under the same conditions. This strategy is not satisfied by upgrading each existing workload cluster in place, since the in-place approach reduces cluster capacity during the upgrade and complicates rollback. Instead, the operator prefers to create a new cluster upgraded to K8s version 1.25, and to then cut the application over to the new cluster, with the ability to roll the application back to the old cluster if any issues are observed.

Using Nova with JIT clusters, the dev cluster upgrade plan can be accomplished in three steps:

  1. Operator requests that the Nova control plane schedule a dummy pod to a non-existent cluster -> Nova creates the cluster by cloning an existing one.
  2. Operator requests a K8s upgrade of the new cluster -> The upgrade is low risk, since the cluster is not yet running any user workloads of interest.
  3. Operator changes Nova’s dev placement policy to target the new cluster -> Nova reschedules the dev workloads from the old cluster to the new one.

Figure 4 shows execution of these steps for our example. Appendix A2 presents the files policy-create-dev1.yaml and pod-create-dev1.yaml, whose placement triggers cluster clone/creation. The file dev-policy1.yaml modifies the kubernetes.io/metadata.name field in dev-policy.yaml in Appendix A1 to target the new cluster.
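For illustration, the step-3 policy edit can be as small as retargeting one label selector. The sketch below is not the actual dev-policy1.yaml content: the apiVersion, kind, field names, and cluster names are assumptions modeled on a SchedulePolicy-style API; consult the Nova documentation for the exact schema.

```yaml
# Illustrative only: retarget the dev policy's cluster selector from the
# old dev cluster to the newly created, upgraded one. Names and field
# layout are hypothetical.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: dev-policy
spec:
  clusterSelector:
    matchLabels:
      # before: kubernetes.io/metadata.name: dev-cluster
      kubernetes.io/metadata.name: dev-cluster-v125
```

Applying the updated policy to the Nova control plane causes Nova to reschedule the dev workloads onto the new cluster; rollback is just reapplying the original policy.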

Figure 4: Nova JIT Example Use Case: Nova dev workload cluster upgrade

After running these steps, the operator monitors the user workloads running on the upgraded dev cluster. If any issues are observed, rollback is performed by changing the dev placement policy back to target the original cluster. Once the operator is satisfied with the dev upgrade, the prod cluster is upgraded by applying the same three steps to it. Again, easy rollback to the old cluster is supported if needed. After the old clusters have been idle for the configured time, Nova JIT places them in standby as shown in Figure 5 and deletes them from the cloud as shown in Figure 6.

Figure 5: Nova JIT Example Use Case: Nova cluster upgrade complete; old clusters in standby
Figure 6: Nova JIT Example Use Case: Nova deployment on EKS K8s clusters after upgrade

This example use case has shown how Nova JIT facilitates performing low-risk multi-cluster K8s version upgrades, reducing each cluster upgrade to three steps, with simple rollback if the upgrade impacts user workloads.

SUMMARY AND FUTURE WORK

We’ve described Nova and its new JIT clusters feature. We’ve presented an example use case in which Nova with Nova JIT facilitates low-risk upgrade of workload cluster K8s version, reducing the upgrade to three steps with simple rollback if the upgrade impacts user workloads.

We note that Nova with JIT clusters can handle many other use cases as well, e.g.:

  • Workload clusters used for different purposes. -> Nova JIT reduces cloud costs without increasing operational complexity.
  • Workload clusters used to separate customer trials. -> Nova JIT reduces the operational complexity of managing trial resources.
  • Workload clusters providing resources in different regions. -> Nova JIT reduces the operational complexity of handling resource shortfalls.

We plan to describe these use cases in future blog posts.

APPENDIX A1: Deployment of Guestbook App

Policy-based deployment of the guestbook app on the dev cluster is performed using these commands:
With dev-policy.yaml and dev-namespace.yaml defined as shown here:
Policy-based deployment of the guestbook app on the prod cluster is performed using these commands:
With prod-policy.yaml and prod-namespace.yaml defined as shown here:
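A minimal sketch of what the dev namespace and its placement policy could look like follows. These are not the original files: the policy apiVersion, kind, and field names are assumptions modeled on a SchedulePolicy-style API, and the namespace and cluster names are hypothetical; the guestbook manifests themselves are the standard Kubernetes example app.

```yaml
# Illustrative only: dev-namespace.yaml creates the namespace that the
# dev placement policy keys on.
apiVersion: v1
kind: Namespace
metadata:
  name: guestbook-dev
---
# Illustrative only: a dev policy routing objects in the guestbook-dev
# namespace to the dev workload cluster. Field names are assumptions.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: dev-policy
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: guestbook-dev
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: dev-workload-cluster
```

Both objects, and the guestbook manifests, would be applied with kubectl against the Nova control plane's kubeconfig; the prod namespace and policy are analogous.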

APPENDIX A2: Placement triggering cluster creation

Policy targeting a non-existent cluster, along with the associated pod placement, triggers cluster creation via cloning.
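A sketch of what such a trigger could look like follows. These are not the actual policy-create-dev1.yaml and pod-create-dev1.yaml contents: the policy apiVersion, kind, field names, cluster name, and namespace are hypothetical, modeled on a SchedulePolicy-style API.

```yaml
# Illustrative only: a policy whose clusterSelector names a cluster that
# does not yet exist. With JIT cluster creation enabled, placing a pod
# that matches this policy prompts Nova to create the cluster by cloning
# an existing one.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: create-dev1-policy
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: create-dev1
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: dev-cluster-v125   # not yet created
---
# Illustrative only: the dummy pod whose placement triggers creation.
apiVersion: v1
kind: Pod
metadata:
  name: dummy-pod
  namespace: create-dev1
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
```

Once the clone exists and the dummy pod is scheduled, the pod can be deleted; the new cluster is then ready for the upgrade step.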
