Announcing release of Terraform OKE (Kubernetes) module 5.0 — Part 2: More flexible than ever

Ali Mukadam
Published in Oracle Developers · Nov 28, 2023

I mentioned in my previous post that we re-engineered the Terraform OKE module based on the following themes:

  1. Flexibility
  2. Reusability
  3. Robustness

In this article, we’ll look at some of the flexibility features we have introduced. In particular, we’ll examine the differentiated offering of basic and enhanced clusters, their use cases, and how the Terraform module supports them.

Support for basic and enhanced clusters

A major recent enhancement to the OKE service is the notion of enhanced clusters. I touched upon this briefly when discussing the growing support we’ve added for Thanos. Here, I’ll elaborate a bit more.

OKE started as a managed Kubernetes service where the control plane is free. The control plane nodes are run in Oracle’s tenancy at Oracle’s cost and we charge you exactly $0.00 for it. You only get charged for:

  • compute of your worker nodes
  • storage of your worker nodes
  • network resources such as VCN, Load Balancers etc.
  • and network egress, which, as I mentioned previously, is also generous and considerably cheaper than with other cloud providers.

However, OKE enhanced clusters enable you to do a lot more:

  • you can run large clusters
  • you can use virtual nodes
  • you can use self-managed nodes aka BYON
  • you can use OKE workload identity
  • you can use node cycling upgrades
  • you can use OKE-managed add-ons
  • your OKE clusters have financially-backed SLAs

Let’s look at these in more detail.

Increasingly, we have users who need large clusters, and these have implications for the control plane. The more worker nodes you have, the more likely your workload is of a critical nature, and the more you need a resilient and performant control plane. Of course, it’s not just a question of throwing more nodes at the problem: a critical component of the Kubernetes control plane is etcd, and it’s not recommended to run it with more than 7 members in a cluster. Enhanced OKE clusters have financially-backed SLAs that are tied to the Kubernetes API server’s uptime and availability. Basically, if the latter’s uptime and availability degrade, you receive compensation. Thus, it’s on Oracle to ensure your control plane has the resilience necessary to run your large clusters.

While the control plane of OKE basic clusters remains free, that of enhanced clusters is priced at $0.10/hour, or roughly $70/month. This compares very favourably with other managed Kubernetes services where there is no differentiated offering and you get charged a similar amount come rain or shine.

As this has cost implications for users, the Terraform OKE module defaults to basic, and as the user, you must explicitly set it to enhanced when you need it:

cluster_type = "enhanced"

Similarly, if you started with a basic cluster, you can change this input variable to “enhanced”, run Terraform again, and the cluster type will be upgraded accordingly. I use the word “upgrade” here in the sense of an improved capability, not in the sense of a version upgrade. Note that you cannot “downgrade” back to basic after changing to enhanced, so please make sure you understand the consequences of this decision.
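For reference, here’s a minimal sketch of how this input fits into a module block, assuming you consume the module from the Terraform Registry (all other required inputs are elided here; see the module documentation for your tenancy and network settings):

module "oke" {
  source  = "oracle-terraform-modules/oke/oci"
  version = "~> 5.0"

  # ...tenancy, compartment and network inputs elided...

  cluster_type = "enhanced"
}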

Worker Pools

Until recently, the only way to run worker nodes on OKE was node pools. A node pool is essentially a group of compute instances that share the same configuration and function as worker nodes for a cluster. Compute instances in a node pool share the following attributes, among others:

  • Kubernetes version
  • Image used to provision the worker node
  • Compute shape, e.g. number of OCPUs, memory and block volume allocated, as well as whether to use virtual machines or bare metal
  • Processor architecture, e.g. Intel, AMD, Arm, or GPU shapes
  • Node labels
  • Freeform and defined tags

In an OKE cluster, you can have many node pools, each with its own attributes and size. For example, the diagram below shows 3 node pools of varying shapes and sizes to meet mixed performance workload requirements:

Multiple node pools with different shapes and sizes

Likewise, you can have node pools with mixed architectures, all within the same cluster:

Multiple node pools with different CPU architectures

However, node pools support only a subset of the dizzyingly wide array of OCI compute services. At the blazingly fast end of the spectrum sits Cluster Networks, OCI’s high-performance computing offering, built on RDMA to provide high bandwidth and ultra-low latency. At the other end, there are OCI Container Instances, which let you run containers quickly without managing a VM.

Let’s say you need to run hardware-accelerated workloads, e.g. GPU shapes or shapes designed for high-performance computing, and you need Cluster Networks’ high-bandwidth, ultra-low-latency networking, say, to train your own machine learning models or launch your own AI service. Well, look no further. OCI is the place to be, and OKE a great foundation on which to build it.

On the other hand, let’s say you need to run a mundane microservice application: you still want the developer experience of Kubernetes but want none of the infrastructure or operational headache that comes with it. Then, virtual nodes are what you are looking for. And then there are still the normal compute instances and instance pools.

In order to accommodate all of these, we created a new construct and named it “worker pool”. Note that “worker pool” is neither Kubernetes nor OKE terminology. It’s just a concept we added in the Terraform OKE module to represent the different ways of creating worker nodes, but I’m hopeful it will catch on in the OKE world.

I mentioned above that you can have multiple node pools. To use the different types of compute services, you can also have multiple types of worker pools, and you can combine all of them except for virtual-node-pool. Thus, you can set a default worker pool mode:

worker_pool_mode = "node-pool"

But you can also override this setting for some specific worker pools:

worker_pool_mode = "node-pool"

worker_pools = {
  # no mode specified, will use the default worker_pool_mode
  np1 = {
    shape              = "VM.Standard.E4.Flex",
    ocpus              = 2,
    memory             = 32,
    size               = 3,
    boot_volume_size   = 150,
    kubernetes_version = "v1.27.2"
  }
  # overridden mode, will use a single instance
  np2 = {
    shape              = "VM.Standard.E4.Flex",
    ocpus              = 2,
    memory             = 32,
    mode               = "instance",
    size               = 1,
    boot_volume_size   = 150,
    kubernetes_version = "v1.27.2"
  }
  # overridden mode, will use an instance pool of 5 nodes
  np3 = {
    shape              = "VM.Standard.E4.Flex",
    ocpus              = 2,
    memory             = 32,
    mode               = "instance-pool",
    size               = 5,
    boot_volume_size   = 150,
    kubernetes_version = "v1.27.2"
  }
  # overridden mode, will use a cluster network
  oke-bm-gpu-rdma = {
    description   = "Self-managed nodes in a Cluster Network with RDMA networking",
    mode          = "cluster-network",
    size          = 1,
    shape         = "BM.GPU.B4.8",
    placement_ads = [1],
    image_id      = "ocid1.image..."
    cloud_init = [
      {
        content = <<-EOT
        #!/usr/bin/env bash
        echo "Pool-specific cloud_init using shell script"
        EOT
      },
    ],
    secondary_vnics = {
      "vnic-display-name" = {
        nic_index = 1,
        subnet_id = "ocid1.subnet..."
      },
    },
  }
}

BYON

The modes “instance”, “instance-pool” and “cluster-network” are what is collectively (and somewhat mouthfully) known as “self-managed nodes”. Personally, I prefer BYON (Bring Your Own Nodes) because it rhymes with the names of subatomic particles (like muon, gluon, boson) and gives me a nostalgic feeling from when I was studying physics.

Why would you go to the trouble of managing your own worker nodes? Well, as I mentioned previously, the use cases keep coming thick and fast. While most users will want to use node pools, we do have a category of users who run certain workloads at scale and have special needs, and we want to be able to cater to them. Among these needs are running specific shapes or specific OSes (e.g. Ubuntu), using custom images or drivers, or attaching multiple VNICs on an ultra-low-latency cluster network. At this scale, every detail matters. Even the Coriolis effect. OK, I’m exaggerating quite a bit, but this scene from the fantastic movie Shooter illustrates the point:

Every detail matters at scale. With OKE, you can control a large number of these. You are the master of your own cluster.

Specialized worker pools

Being able to control the shape, size and labels of worker pools allows you to do some interesting stuff. I’ve previously written about how you can use labels and node selectors to ensure that specific application pods land on the types of nodes that are most suitable for them or most cost-effective for your needs. When you combine these two, you can also create specialized worker pools that are tuned for particular tasks:

These can then be tuned and configured using cloud-init.
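As a hedged sketch of what such a pool could look like (the pool name, label and sysctl tuning below are illustrative only), you can reuse the same cloud_init attribute shown earlier:

worker_pools = {
  highmem = {
    shape              = "VM.Standard.E4.Flex",
    ocpus              = 16,
    memory             = 256,
    size               = 2,
    kubernetes_version = "v1.27.2",
    node_labels        = { "workload-class" = "highmem" },   # illustrative label
    cloud_init = [
      {
        content = <<-EOT
        #!/usr/bin/env bash
        # Illustrative tuning only: raise vm.max_map_count for memory-hungry workloads
        echo "vm.max_map_count=262144" >> /etc/sysctl.d/99-tuning.conf
        sysctl --system
        EOT
      },
    ],
  }
}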

Consider also the situation where you need to expose your application so that it’s accessible via the OCI Load Balancer. The default behaviour is to add all worker nodes as backends to the Load Balancer. Thus, in the default scenario, the Load Balancer sends incoming requests to any worker node, where kube-proxy handles them. However, this approach will not work at high scale, because the Load Balancer also has resource limits, including a maximum of 512 backend servers.

Notice in the diagram above that we have a worker pool named “Edge”. We want our ingress controller pods to land on its nodes, and we want incoming traffic from the Load Balancer to be routed there. We ensure that, at provisioning time, the worker pools’ initial Kubernetes labels are set, e.g.

edge = {
  shape              = "VM.Standard.E4.Flex",
  ocpus              = 2,
  memory             = 32,
  size               = 3,
  boot_volume_size   = 150,
  kubernetes_version = "v1.27.2"
  node_labels = {
    "edge"          = "true",
    "ingress-nginx" = "true"
  },
}

workload = {
  shape              = "VM.Standard.E4.Flex",
  ocpus              = 2,
  memory             = 32,
  size               = 3,
  boot_volume_size   = 150,
  kubernetes_version = "v1.27.2"
  node_labels = {
    "edge" = "false",
    "node.kubernetes.io/exclude-from-external-load-balancers" = "true"
  },
}

We can then use node selectors to ensure our ingress controllers will land there.
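As a minimal sketch, assuming you deploy ingress-nginx with the hashicorp/helm provider (2.x block syntax), the node selector could be pinned to the label set above like this:

resource "helm_release" "ingress_nginx" {
  name             = "ingress-nginx"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress-nginx"
  create_namespace = true

  # Schedule the controller pods onto the nodes carrying the "ingress-nginx" label
  # set on the edge worker pool above.
  set {
    name  = "controller.nodeSelector.ingress-nginx"
    value = "true"
    type  = "string"
  }
}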

Similarly, we use labels to prevent other worker nodes from being added as backends to the load balancer, by adding the node.kubernetes.io/exclude-from-external-load-balancers label as in the second worker pool above. In this way, incoming traffic always lands on a known set of nodes and we avoid the backend limit.

OKE Workload Identity

OKE Workload Identity is another feature of OKE enhanced clusters. It allows a workload running in OKE to authenticate itself and use OCI services via service accounts, with access granted at the pod level. I wrote about how it works here in more detail, using the Thanos and OCI Object Storage integration as an example. Not having to store a key in a Secret improves the security posture of your cluster, and not having to plan for instance principals, dynamic groups, labels and node selectors makes infrastructure planning a less daunting exercise, especially if you are at the beginning of your journey with Kubernetes and OCI.
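To give a flavour of what this looks like on the IAM side, here is a hedged sketch of a policy granting a single service account access to Object Storage; the compartment name, namespace, service account name and cluster OCID are all placeholders:

resource "oci_identity_policy" "thanos_workload_identity" {
  compartment_id = var.compartment_id   # placeholder
  name           = "thanos-workload-identity"
  description    = "Allow the thanos service account to manage objects via workload identity"

  statements = [
    "Allow any-user to manage objects in compartment observability where all {request.principal.type = 'workload', request.principal.namespace = 'monitoring', request.principal.service_account = 'thanos', request.principal.cluster_id = 'ocid1.cluster.oc1...'}"
  ]
}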

Node Cycling upgrade

Until recently, whenever we released a new Kubernetes version in OKE, you had 2 choices for upgrades:

  • in-place
  • out-of-place

Both involve upgrading the control plane version first, followed by the worker nodes, but it is in how the worker nodes are handled that they differ. I’ve written about them here, so for the sake of completeness, I’ll more or less just copy the relevant parts.

In-place upgrades consist of keeping the existing worker nodes and upgrading them to the corresponding version of the OKE control plane after the latter has been upgraded.

In contrast, out-of-place upgrades involve provisioning new worker nodes first and waiting for them to be ready. Worker nodes provisioned after the control plane upgrade automatically use the control plane’s new version. You can then drain the pods and cordon off the worker nodes that are still on the previous version so that the pods are rescheduled onto the new worker nodes. You would then delete the older worker nodes.

There are benefits and drawbacks to both, and as with a lot of things in life, you cannot have your cake and eat it too. So, let’s take a look at them and at when you would consider each.

In-place upgrade is a good approach to consider if you are very, very conscious about costs, even at the expense of other criteria. It saves you the cost and trouble of duplicating, however temporarily, your worker nodes or part of them. It can also be a consideration if you’re running a large OKE cluster and its size will not let you duplicate the worker nodes for the purpose of the upgrade because of service limits (if you have this kind of issue, come and talk to us, we would love to help).

In other situations, you might still be running a cluster whose size allows you to double the number of worker nodes, but the shape of those nodes is relatively more expensive than standard VMs, e.g. large bare metal or GPU shapes, and doubling these and running them side by side, even for a short period, is prohibitive. In such situations, it works to your advantage that you can upgrade your cluster using the in-place method.

However, there is always the risk that you decided to upgrade at full moon or on Friday the 13th after breaking a mirror on the morning of upgrade day, and the Kubernetes upgrade on some nodes fails. While this does not affect the OKE cluster itself, it can have an impact on your application.

In contrast, the out-of-place method is considerably less risky:

  1. Provision new node pools and wait for them to be ready
  2. Drain pods from older pools and cordon them
  3. Let Kubernetes re-schedule the pods on newer pools
  4. Delete the old node pools

You would typically wait until all your new node pools are ready before starting to drain the pods so there’s a small amount of time when you need to run the older worker nodes and new worker nodes side by side. There are of course variations to this that you can consider but it requires a bit of planning and discipline.

I’ve written about using specialized node pools before, and what you could do is run different parts of your application on different node pools by using labels and nodeSelectors. Doing so allows you to move your pods to newer nodes in a phased approach. This has the further benefit of keeping costs down while making the infrastructure upgrade as minimally disruptive as possible.

Besides not being vulnerable to failed Kubernetes upgrades on the worker nodes, you would also consider the out-of-place method if dependencies exist in your application and these are reflected in the order in which you have to deploy your pods. In this way, you can plan your upgrade and pod draining to happen in a deterministic manner before ultimately retiring the nodes on the older version.

To summarize, in-place upgrades may be disruptive to your existing workloads and are safer to execute during planned downtime, whereas out-of-place upgrades are less risky but can be prohibitive for cost-conscious customers or those with large fleets.

And then there’s node cycling upgrade. Node cycling is a new feature that comes with enhanced clusters and provides the best of both worlds. It keeps the same node pool, but instead of upgrading the entire node pool in place or requiring you to run another node pool side by side during the upgrade, you specify a number or percentage of the worker nodes in that node pool that can be upgraded at any point in time. OKE then handles the gory business of cordoning and draining worker nodes and rescheduling your pods onto newer worker nodes. In this way, it’s neither too costly (especially if you are running large clusters or clusters with expensive shapes) nor disruptive to your workload.

Node Cycling Upgrade
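To illustrate the underlying knobs, here is a hedged sketch at the OCI provider level (not necessarily how the module exposes it); all the var.* references and values are placeholders:

resource "oci_containerengine_node_pool" "cycled" {
  cluster_id         = var.cluster_id          # placeholder
  compartment_id     = var.compartment_id      # placeholder
  name               = "np-cycled"
  kubernetes_version = "v1.27.2"
  node_shape         = "VM.Standard.E4.Flex"

  node_shape_config {
    ocpus         = 2
    memory_in_gbs = 32
  }

  node_config_details {
    size = 3
    placement_configs {
      availability_domain = var.availability_domain   # placeholder
      subnet_id           = var.worker_subnet_id      # placeholder
    }
  }

  node_source_details {
    source_type = "IMAGE"
    image_id    = var.worker_image_id                 # placeholder
  }

  # Cycle nodes within the same pool: surge up to 25% extra nodes while
  # keeping every existing node available until its replacement is ready.
  node_pool_cycling_details {
    is_node_cycling_enabled = true
    maximum_surge           = "25%"
    maximum_unavailable     = "0"
  }
}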

Documentation

With so many features added (and I’ve not even mentioned reusability and robustness yet), we also had to significantly improve our documentation. In the past, we only published to GitHub in Markdown/Asciidoc, but sharing those pages would sometimes show their source instead of the rendered output. As much as I like Asciidoc for its structured format, mdbook made publishing so much easier, and it integrates very easily with GitHub Pages, which is where the new documentation is now published. It still needs a bit more meat and more use cases described, but we’ve got a good base to build on.

Summary

In this article, we looked at the support added for enhanced clusters, the various use cases it unlocks, how we added support for these in the 5.0 release and how this makes OKE deployment more flexible. In particular, we looked at:

  • worker pools, including specialized pools
  • self-managed nodes aka BYON
  • workload identity
  • node cycling upgrade
  • documentation

I would like to take this opportunity to thank my colleague Devon Crouse who worked impossible hours as well as my other colleagues Karthik and Andrei for taking the module through the testing grinder. Also a huge thanks to our users and OKE customers: you continue to challenge us and this helps us make both the OKE service and the Terraform module better.
