Announcing release of Terraform OKE (Kubernetes) module 5.0 — Part 4

Ali Mukadam
Oracle Developers
Dec 14, 2023

In the previous parts of this series on the Terraform OKE module, we have discussed the following:

  • Part 1: The rationale and themes, namely flexibility, reusability and robustness, behind the release of a major new version of the Terraform module
  • Part 2: Its increased flexibility to support more use cases, with a particular focus on the differentiated basic and enhanced OKE service offerings. We looked at some of the funky new features in enhanced clusters, such as OKE Workload Identity and node cycling, and also examined in greater detail how to deploy a wider variety of workloads, ranging from serverless to High Performance Computing.
  • Part 3: Its increased reusability of existing infrastructure resources, the reorganization of the submodules, and additional utilities and extensions, as well as how the increased reusability helps us deploy multiple, connected OKE clusters, especially if you’re looking for High Availability.

All of this is driven by the use cases our users have brought to us, and we sometimes encounter situations where some users want almost the opposite of what others do. For example, most users want the module to create the entire infrastructure for them, whereas others want to pick and choose the parts of the infrastructure they need. As such, we needed to ensure the module is robust enough to handle these differing scenarios.

In this final installment of the series, we’ll discuss some of the ways we’ve made the Terraform OKE module more robust.

Re-factoring with moved

In Terraform 1.1, the moved statement was introduced to facilitate refactoring, such as splitting a single module into multiple modules.

In the 1.3 release, the moved statement was further enhanced with the ability to refactor resources into differently sourced modules. This made it easier to split up modules such as extensions into extensions and utilities. We now refer to extensions as additional software, such as Helm charts, that we wish to install into the cluster after provisioning, whereas utilities are nifty tools that you always like to have handy. I wrote about this split in the previous article. Likewise, node pools are now in a sub-module of their own instead of being co-located with the cluster. We used the moved statement whenever we had to move resources around between submodules, to save users the trouble of having to recreate their clusters, workers and other resources when upgrading to v5.
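For illustration, a moved block recording that node pools now live in their own sub-module might look like the sketch below; the resource addresses are hypothetical and will differ from the module’s actual layout:

moved {
  # The node pool resource used to live in the cluster sub-module...
  from = module.cluster.oci_containerengine_node_pool.nodepools

  # ...and is now managed by a dedicated workers sub-module, so Terraform
  # re-addresses the existing state instead of destroying and recreating it.
  to = module.workers.oci_containerengine_node_pool.nodepools
}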

Exception Handling

In the 1.2 release, Terraform improved its exception handling capability with preconditions and postconditions. As an example, in earlier releases of the module, a frequent source of complaints from users was an invalid image ID. This used to cause the creation of the operator and/or certain node pools to fail. With the improved exception handling (example below), we can now let users know whether their input values are valid:

precondition {
  condition     = coalesce(each.value.image_id, "none") != "none"
  error_message = <<-EOT
  Missing image_id; check provided value if image_type is 'custom', or image_os/image_os_version if image_type is 'oke' or 'platform'.
    pool: ${each.key}
    image_type: ${coalesce(each.value.image_type, "none")}
    image_id: ${coalesce(each.value.image_id, "none")}
  EOT
}

Similarly, the try function is used to catch errors when handling complex data types.
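For instance, here is a minimal sketch of try guarding against missing attributes in a loosely typed object; the variable and attribute names are illustrative, not the module’s actual inputs:

# Illustrative only: a loosely typed worker pool object whose attributes may be absent.
variable "worker_pool" {
  type    = any
  default = { shape = "VM.Standard.E4.Flex" }
}

locals {
  # Fall back to a default when the optional attribute is missing from the object.
  boot_volume_size = try(var.worker_pool.boot_volume_size, 50)

  # Return an empty list when the pool defines no extra NSGs.
  extra_nsg_ids = try(var.worker_pool.nsg_ids, [])
}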

Functions

Terraform has also greatly increased the number of functions available, especially those dealing with types and collections such as:

  • try
  • coalesce
  • anytrue and alltrue
  • one
  • tobool

Let’s take an example. In v4, to decide whether to create the internet gateway in the VCN, we would check whether any of the following conditions held:

  1. the load balancers are public
  2. the bastion host will be created
  3. the control plane should be public

The code to check these conditions used to look like this:

create_internet_gateway  = var.load_balancers == "internal" && var.create_bastion_host == false && var.control_plane_type == "private" ? false : true

Instead, in v5 we simply compute this in a local variable using the alltrue and anytrue functions:

create_internet_gateway = alltrue([
  var.vcn_create_internet_gateway != "never",
  anytrue([
    var.vcn_create_internet_gateway == "always",
    var.create_bastion && var.bastion_is_public,
    var.control_plane_is_public,
    var.load_balancers != "internal",
  ])
])

and we can then use it in our resources. Similarly, instead of relying on difficult-to-maintain code conventions, we use functions such as try, one and coalesce to improve input variable handling.
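As a sketch of the kind of input normalization this enables (the variable names below are illustrative, not the module’s exact inputs):

variable "kubernetes_version" {
  type    = string
  default = "v1.28.2"
}

variable "worker_pool_k8s_version" {
  type    = string
  default = null
}

variable "operator_nsg_ids" {
  type    = list(string)
  default = []
}

locals {
  # coalesce picks the first non-null value, so an unset pool version
  # falls back to the cluster-wide Kubernetes version.
  worker_k8s_version = coalesce(var.worker_pool_k8s_version, var.kubernetes_version)

  # one() returns the single element of a 0- or 1-element list (or null when empty),
  # avoiding index errors when a resource or input is optional.
  operator_nsg_id = one(var.operator_nsg_ids)
}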

Improving Network & Security robustness via NSG chaining

In the v4 release, we changed the Terraform OKE module to use Network Security Groups (NSGs) instead of security lists. Why did we make that change? There are 3 main components in OKE:

  • the control plane aka the API endpoint
  • the worker nodes
  • the service load balancers

And they communicate with each other. Any networking you do in OCI uses a Virtual Cloud Network (VCN) as the foundation. If you have an AWS or Azure background, that’s the equivalent of your VPC or VNet respectively. Initially, the only way you could secure the VCN was by using security lists. I say “secure”, but I should really say “open up”, because by default the VCN is locked down. Thus, security lists work like an allow list where you specify a list of rules for the network traffic you want to allow in:

An ingress security rule in OCI security list

In security lists, you specify a number of security rules:

  • ingress vs egress
  • the source type, either a CIDR range or services from the Oracle Services Network (OSN)
  • the protocol
  • source and destination port ranges

As you can see, there is no allow or deny like, say, iptables’ ACCEPT, DROP or REJECT. Everything is “dropped” by default until you explicitly create a rule for a particular type of traffic, creating the equivalent of an ACCEPT for it. This makes the implementation considerably simpler and easier to reason about. Now, all we need to worry about are the 4 parameters above for each rule and, from a Terraform implementation perspective, we can then put these in a list and iterate over them, e.g.:

workers_ingress = [
  {
    description = "Allow ingress for all traffic to allow pods to communicate between each other on different worker nodes on the worker subnet",
    protocol    = local.all_protocols,
    port        = -1,
    source      = local.worker_subnet,
    stateless   = false
  },
  {
    description = "Allow control plane to communicate with worker nodes",
    protocol    = local.tcp_protocol,
    port        = -1,
    source      = local.cp_subnet,
    stateless   = false
  },
  {
    description = "Allow path discovery from worker nodes",
    protocol    = local.icmp_protocol,
    port        = -1,
    source      = local.anywhere,
    stateless   = false
  }
]
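The list above only declares the rules; a sketch of how such a list could then be iterated over with a dynamic block inside a security list resource follows (the resource and variable names are illustrative, not the module’s exact implementation):

# Illustrative sketch: one ingress rule per entry in local.workers_ingress.
resource "oci_core_security_list" "workers" {
  compartment_id = var.compartment_id
  vcn_id         = var.vcn_id
  display_name   = "workers"

  dynamic "ingress_security_rules" {
    for_each = local.workers_ingress
    content {
      description = ingress_security_rules.value.description
      protocol    = ingress_security_rules.value.protocol
      source      = ingress_security_rules.value.source
      stateless   = ingress_security_rules.value.stateless
      # Port ranges (tcp_options/udp_options) omitted here; -1 above means all ports.
    }
  }
}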

Before explaining the rationale behind the NSG switch, let’s take a look at another component in Kubernetes: the Cloud Controller Manager (CCM). Most Kubernetes clusters have one, and its role is to let your Kubernetes cluster interact with your cloud provider’s API in a transparent manner. OKE implements one too, and a component of the cloud controller manager is the service controller, which handles the provisioning of infrastructure resources such as OCI’s load balancers, both L7 and L4 (aka NLB). Remember that OCI Load Balancer that was magically created for you when you created a Service of type LoadBalancer? That’s no magic, that’s the CCM doing its work.

To make life easier for most initial users, OCI’s CCM implements some defaults but allows you to override this default behavior using annotations. One of the defaults is:

service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: "All"

This will configure the security rules of the security list attached to the load balancer subnet and add the necessary egress rules to ensure the load balancer can add the worker nodes as backends and reach the necessary NodePort service when receiving incoming requests. As helpful as these defaults are to get users started, they can also cause problems:

  1. Running terraform apply again will remove the rules that were added by the CCM to the security list of the load balancer subnet.
  2. The maximum number of rules in a security list is 200. If the number of rules reaches 200, any new rule will not get added and the load balancer will fail when trying to reach the service. In a distributed system such as Kubernetes, which can sometimes run into a few thousand nodes, it’s very easy to breach that limit, and this will result in a failed service.
  3. If the service load balancer subnet does not have a security list attached, the security rules will then get added to the VCN’s Default Security List. This can ultimately lead to (1) as well.

Until v3, we tried to manage issue 1 above by adding a lifecycle ignore_changes rule on the load balancers’ security list:

...
lifecycle {
  ignore_changes = [
    egress_security_rules, ingress_security_rules, defined_tags
  ]
}
...

However, this has the unfortunate outcome that, once the security list is created from Terraform, you cannot change its security rules from Terraform again. For example, let’s say you need to open more ports. You would have to do that manually through the console or the CLI, and if you run terraform apply again, say for an upgrade or to add more node pools, these rules would need to be added again. Clearly, that was not sustainable. Hence the switch to NSGs.

Rules in security lists and NSGs have a few major differences:

  • in addition to a CIDR range or OCI services, the source or destination type in an NSG rule can also be another NSG. Basically, you’re saying that if a packet originates from a host whose VNIC is bound to NSG X, then it can ingress if the protocol is Y and the port is Z.
  • security lists apply to all hosts in the subnet, whereas NSGs operate at the VNIC level regardless of the subnet, as long as the VNICs are within the same VCN.
  • when implementing security rules for NSGs in Terraform, they can be specified separately from the NSG, whereas security list rules have to be embedded in the security list itself, e.g.:
resource "oci_core_network_security_group" "pub_lb" {
compartment_id = var.compartment_id
...
...
}

resource "oci_core_network_security_group_security_rule" "pub_lb_egress" {
network_security_group_id = oci_core_network_security_group.pub_lb[0].id
...
...
}

resource "oci_core_network_security_group_security_rule" "pub_lb_ingress" {
network_security_group_id = oci_core_network_security_group.pub_lb[0].id
...
...
}

resource "oci_core_network_security_group_security_rule" "pub_lb_egress_health_check_to_workers" {
...
...
}

Since the rules can be implemented separately, we can also evolve and update the security posture of our OKE clusters. In v5, we further enhanced this by using NSG chaining. How does NSG chaining work? Consider the diagram below:

NSGs in OCI

We have 2 hosts (VM1, VM2) whose VNICs use NSG A, and VM3, whose VNIC uses NSG B. You then add the security rules you need to each NSG. You should be familiar with the format of rules r1 and r4 by now. Let’s consider the bottom 2 rules in each. Under these rules:

  • VM1 can receive traffic from VM2 over port 80 and vice-versa (r2)
  • Both VM1 and VM2 can receive traffic from VM3 but over port 443 only (r3)
  • VM3 can also receive traffic from VM1 and 2 but over port 443 only (r6)

Because there’s no other host that also uses NSG B, rule r5 doesn’t kick in. What you’ll also notice is that there are no CIDR blocks specified anymore.
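In Terraform terms, an NSG-chained rule simply references another NSG as its source instead of a CIDR block. Here is a sketch of rule r3 from the diagram; the resource names are illustrative:

# Rule r3: members of NSG A accept TCP 443 only from VNICs that are members of NSG B.
resource "oci_core_network_security_group_security_rule" "a_ingress_443_from_b" {
  network_security_group_id = oci_core_network_security_group.nsg_a.id
  direction                 = "INGRESS"
  protocol                  = "6" # TCP
  source                    = oci_core_network_security_group.nsg_b.id
  source_type               = "NETWORK_SECURITY_GROUP"

  tcp_options {
    destination_port_range {
      min = 443
      max = 443
    }
  }
}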

By using NSG chaining, we can accomplish a few things:

  1. a simpler, less bulky implementation of the required security rules for all OKE components (control plane, workers, load balancers) to work
  2. reusing existing NSGs in the VCN when necessary
  3. custom security rules for different OKE workloads
  4. updating the security posture when changes are required

Let’s look at each of these. You can compare the v4 implementation vs v5 (below) for the default public load balancer security rules:

pub_lb_rules = local.pub_lb_nsg_enabled ? merge(
  {
    "Allow TCP egress from public load balancers to workers nodes for NodePort traffic" : {
      protocol = local.tcp_protocol, port_min = local.node_port_min, port_max = local.node_port_max, destination = local.worker_nsg_id, destination_type = local.rule_type_nsg,
    },
    "Allow TCP egress from public load balancers to worker nodes for health checks" : {
      protocol = local.tcp_protocol, port = local.health_check_port, destination = local.worker_nsg_id, destination_type = local.rule_type_nsg,
    },
    "Allow ICMP egress from public load balancers to worker nodes for path discovery" : {
      protocol = local.icmp_protocol, port = local.all_ports, destination = local.worker_nsg_id, destination_type = local.rule_type_nsg,
    },
  },
  var.enable_waf ? local.waf_rules : {},
  var.allow_rules_public_lb,
) : {}

With NSG chaining, the security rule definitions are roughly 10x smaller and thus easier to implement and audit.

I mentioned in the previous article that one of our goals was also to improve reusability. In some cases, we have users who want to create the basic infrastructure (VCN, subnets, NSGs, etc.) separately from the cluster and then reuse the existing VCN, subnets and NSGs to create other clusters:

nsgs = {
  bastion  = { id = "ocid1.networksecuritygroup..." }
  operator = { id = "ocid1.networksecuritygroup..." }
  cp       = { id = "ocid1.networksecuritygroup..." }
  int_lb   = { id = "ocid1.networksecuritygroup..." }
  pub_lb   = { id = "ocid1.networksecuritygroup..." }
  workers  = { id = "ocid1.networksecuritygroup..." }
  pods     = { id = "ocid1.networksecuritygroup..." }
}

Reusing NSGs for different clusters/worker nodes allows users to build on an agreed security posture, but we also make it possible for users to override the default and supply their own, and that on a per-worker-pool basis:

worker_pools = {
  worker-pool-1 = {
    nsg_ids = ["ocid1.networksecuritygroup..."]
  },

  worker-pool-2 = {
    nsg_ids     = ["ocid1.networksecuritygroup..."]
    pod_nsg_ids = ["ocid1.networksecuritygroup..."] // when cni_type = "npn"
  },
}

Updates to NSGs can also be parameterized and implemented separately, by creating the rule and attaching it to the NSG, which is faster than updating an existing security list.
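For example, opening an extra port later can be as simple as adding one more entry to the allow-rules input seen earlier; the entry below follows the same attribute pattern as the previous snippets, and the port and CIDR are placeholders, so treat it as an illustration rather than the module’s exact schema:

allow_rules_public_lb = {
  "Allow TCP ingress to public load balancers from a partner network" : {
    protocol = 6, port = 8443, source = "192.168.0.0/16", source_type = "CIDR_BLOCK",
  }
}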

Improvements in v5: Default security lists for each subnet

Earlier, I mentioned the security list management mode annotation, its default value (“All”) and what it does. Since we switched to NSGs in v4, we also removed all the associated security lists, and we recommended users turn off this default behavior by adding this annotation to their LoadBalancer services:

service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: "None"

However, if you forget to do so, the CCM will add those rules and, in the absence of a security list on the subnet, it will pick the VCN’s Default Security List to add those rules to. Typically, those rules allow egress from the load balancer subnet to the worker node subnet, so I want to reassure you that neither your infrastructure nor your data is exposed. That being said, we don’t want inadvertent changes, particularly those related to security posture. With the v5 release, we therefore added a default, separate and empty security list (remember, a security list is an allow list and an empty list means nothing is allowed) to each subnet. The idea is to guard against such inadvertent security posture changes in 2 ways:

  1. by adding a default security list to the load balancer’s subnet, even if a user fails to turn off security list management mode, the addition of those rules will be limited to the load balancer’s security list only and will not impact other resources’ security posture.
  2. even if the rules were somehow inadvertently added to the VCN’s default security list, this does not affect other resources’ security posture, as the other subnets now also use their own default security lists.
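A sketch of what this guard rail boils down to: an empty security list attached to each subnet, so any fallback writes by the CCM land somewhere harmless. The resource names, CIDR and variables below are illustrative, not the module’s exact implementation:

# An empty security list permits nothing; its only job is to exist so the CCM
# never falls back to the VCN's Default Security List.
resource "oci_core_security_list" "pub_lb" {
  compartment_id = var.compartment_id
  vcn_id         = var.vcn_id
  display_name   = "pub-lb-default"
}

resource "oci_core_subnet" "pub_lb" {
  cidr_block        = "10.0.2.0/24"
  compartment_id    = var.compartment_id
  vcn_id            = var.vcn_id
  display_name      = "pub-lb"
  security_list_ids = [oci_core_security_list.pub_lb.id]
}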

I’m explaining the internal mechanics so users understand how the various pieces work, how you can alter the behavior of your OKE cluster to suit your needs and, even if you are not using the Terraform OKE module, how you can improve your security posture.

The bottom line: use NSGs and turn off the default security list management mode when creating LoadBalancer services using service annotations.

Improvements in v5: Custom rules

I mentioned in the previous article that our users are now deploying clusters in various topologies.

For single, isolated clusters, you are ready to go. But for multiple clusters, depending on how you are using them, you may need to modify and add custom security rules. In v4, you would supply the ranges of CIDRs and ports, and we would then build a Cartesian product of the two and create the rules from it.

public_lb_allowed_cidrs = ["0.0.0.0/0"]
public_lb_allowed_ports = [443, "9001-9002"]

It worked well enough, but it was not very flexible, e.g. you couldn’t change the protocol if you wanted to. In v5, this is now possible. For example, let’s say you are deploying multi-cluster Istio, more specifically multi-primary on multiple networks, in 2 different regions and therefore 2 different VCNs. You want the inter-cluster communication to happen securely, preferably over a Remote Peering Connection:

Multi-cluster Istio deployment

Each cluster will set up its own gateway in the form of an internal Load Balancer. Since they are internal load balancers, it’s a good idea to deploy them in a private subnet by overriding the default subnet:

service.beta.kubernetes.io/oci-load-balancer-subnet1: "ocid..."

define a list of custom ports on which the associated NSGs will accept ingress:

service_mesh_ports = [80, 443, 15012, 15017, 15021, 15443]

and additional custom security rules to be implemented by the internal load balancer NSG:

allow_rules_internal_lb = {
  for p in local.service_mesh_ports :
  format("Allow ingress to port %v", p) => {
    protocol = local.tcp_protocol, port = p, source = "10.0.0.0/16", source_type = local.rule_type_cidr,
  }
}

Finally, override the defaults in the Istio operator manifests:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: australis
      multiCluster:
        clusterName: sydney
      network: sydney
  components:
    egressGateways:
      - name: istio-egressgateway
        enabled: true
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
        k8s:
          serviceAnnotations:
            service.beta.kubernetes.io/oci-load-balancer-internal: "false"
            service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
            service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "100"
            service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
            service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: "None"
            oci.oraclecloud.com/oci-network-security-groups: "ocid..."
      - name: istio-eastwestgateway
        enabled: true
        k8s:
          serviceAnnotations:
            service.beta.kubernetes.io/oci-load-balancer-internal: "true"
            service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
            service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "50"
            service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "50"
            service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: "None"
            service.beta.kubernetes.io/oci-load-balancer-subnet1: "ocid....."
            oci.oraclecloud.com/oci-network-security-groups: "ocid...."
          env:
            - name: ISTIO_META_REQUESTED_NETWORK_VIEW
              value: admin
            - name: ISTIO_META_ROUTER_MODE
              value: "sni-dnat"
          service:
            ports:
              - name: status-port
                port: 15021
                targetPort: 15021
              - name: tls
                port: 15443
                targetPort: 15443
              - name: tls-istiod
                port: 15012
                targetPort: 15012
              - name: tls-webhook
                port: 15017
                targetPort: 15017
        label:
          app: istio-eastwestgateway
          istio: eastwestgateway
          topology.istio.io/network: sydney

A little extra: Cilium CNI

We also recently released v5.1.0. This release allows you to deploy the Cilium CNI to replace flannel, the default CNI in OKE. To enable Cilium, set the following:

cilium_install           = true
cilium_reapply           = true
cilium_namespace         = "network"
cilium_helm_version      = "1.14.4"
cilium_helm_values       = {}
cilium_helm_values_files = []

You can override the default Cilium settings with your own, based on its Helm chart values.
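For example, a couple of values from the upstream Cilium chart could be passed through cilium_helm_values; the keys below come from the Cilium chart itself and are shown purely as an illustration:

cilium_helm_values = {
  # Enable Hubble relay and UI for network observability (upstream Cilium chart values).
  hubble = {
    relay = { enabled = true }
    ui    = { enabled = true }
  }
}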

Summary

In this article, we described some of the ways, programmatic, infrastructural and security-related, in which we have made the Terraform OKE module more robust. We also outlined why we switched from using security lists to NSGs. We then looked at how NSG chaining is well-suited to a dynamic system like Kubernetes. Finally, we briefly looked at the latest release, v5.1.0, which adds Cilium as a CNI option.

I hope you find this article useful.

Acknowledgement

  • NSG diagram provided by Shaun Levey.
  • Several of the robustness improvements mentioned here were implemented by Devon Crouse.
  • The extensive refactoring has been tested by some of our customers. We are very grateful to you for bringing your use cases to our attention and for spending time testing and validating the implementation.
  • Review of this article by Mickey Boxell
