Cilium, Azure, Rancher & Terraform: let’s call it CART

Amit Gupta
11 min read · Mar 11, 2024


☸ ️Introduction

Rancher is a container management platform built for organizations that deploy containers in production. Rancher makes it easy to run Kubernetes everywhere, meet IT requirements, and empower DevOps teams. Add to that Cilium, which isn't just another Container Network Interface (CNI): it's a versatile solution that offers a wide range of features, from load balancing to egress gateways.

🎯Goals & Objectives

This blog shows how to quickly deploy a Rancher server on Azure (using Terraform) on a single-node K3s Kubernetes cluster, and then attach a single-node downstream Kubernetes cluster running Cilium as its default CNI.

Note: this installs the Cilium build that RKE has customized from upstream Cilium. It does not include all the features available in the Enterprise version of Cilium (from Isovalent).

Pre-Requisites

  • A (virtual) machine with SSH access keys, referred to below as rke-node
    Minimum: 2GB RAM / 1 vCPU
    Recommendation: 4GB RAM / 2 vCPU
  • To operate properly, Rancher requires a number of ports to be open on Rancher nodes and on downstream Kubernetes cluster nodes.
  • You should have an Azure Subscription
  • Install Terraform CLI
  • Install kubectl
  • Install Cilium CLI
  • You will need the following permissions in the Resource Group where you are creating the resources.
Microsoft.Resources/subscriptions/resourcegroups/write
Microsoft.Resources/subscriptions/resourcegroups/read
Microsoft.Network/publicIPAddresses/write
Microsoft.Network/virtualNetworks/write
Microsoft.Compute/virtualMachines/write
  • Ensure you have enough quota to create the virtual machines. Go to the Subscription blade, navigate to “Usage + Quotas”, and make sure you have enough quota for the following resources:
    -Regional vCPUs
    -Standard Dv4 Family vCPUs
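To confirm the quota before you start, you can query current vCPU usage with the Azure CLI; a quick sketch, assuming the CLI is installed and eastus is the target region:

az vm list-usage --location eastus --output table | grep -i "vcpu"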

Let’s get going

Prepare the Virtual Machine

  • Create a Virtual Machine in Azure. This machine will be added as a Kubernetes node in the RKE cluster and will run Cilium as its CNI.
  • The virtual machine in this example runs Ubuntu 22.04.
  • Install kubectl (a quick sketch follows this list).
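A minimal sketch for installing kubectl on the Ubuntu rke-node, following the upstream release download (assumes an amd64 VM):

# Download the latest stable kubectl release and install it into the PATH
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client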

Create the Rancher server in Azure using Terraform

You can deploy this on AWS, GCP and other cloud providers as well.

  • Clone Rancher Quickstart to a folder using git clone https://github.com/rancher/quickstart.
  • Go into the Azure folder containing the Terraform files by executing cd quickstart/rancher/azure.
  • Rename the terraform.tfvars.example file to terraform.tfvars.
  • Edit terraform.tfvars and customize the following variables (a sample terraform.tfvars is sketched after this list):
    azure_subscription_id - Microsoft Azure Subscription ID
    azure_client_id - Microsoft Azure Client ID
    azure_client_secret - Microsoft Azure Client Secret
    azure_tenant_id - Microsoft Azure Tenant ID
    rancher_server_admin_password - Admin password for created Rancher server (minimum 12 characters)
  • Optional: Modify optional variables within terraform.tfvars:
    azure_location - Microsoft Azure region; choose the closest region instead of the default (East US)
    prefix - Prefix for all created resources
    instance_type - Compute instance size used; the minimum is Standard_DS2_v2, but Standard_DS2_v3 or Standard_DS3_v2 could be used if within budget.
  • Run terraform init.
  • Run terraform fmt to rewrite Terraform configuration files to a canonical format and style.
  • Run terraform validate to validate the configuration files in the directory, referring only to the configuration.
  • To initiate the creation of the environment, run terraform apply --auto-approve. Then wait for output similar to the following:
Apply complete! Resources: 18 added, 0 changed, 0 destroyed.

Outputs:

rancher_node_ip = 20.163.x.x
rancher_server_url = https://rancher.20.163.x.x.sslip.io
workload_node_ip = yy.yy.yy.yy
  • Paste the rancher_server_url from the output above into the browser. Log in when prompted (default username is admin, use the password set in rancher_server_admin_password).
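For reference, a sample terraform.tfvars might look like the following; every value below is a placeholder that must be replaced with your own:

# Write a sample terraform.tfvars (placeholder values only)
cat > terraform.tfvars <<'EOF'
azure_subscription_id         = "00000000-0000-0000-0000-000000000000"
azure_client_id               = "00000000-0000-0000-0000-000000000000"
azure_client_secret           = "<service-principal-secret>"
azure_tenant_id               = "00000000-0000-0000-0000-000000000000"
rancher_server_admin_password = "<at-least-12-characters>"

# Optional overrides
azure_location = "westeurope"
prefix         = "rancher-demo"
EOF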

Create the RKE cluster

  • Login to the Rancher UI.
  • In Rancher UI, navigate to the Clusters page. In the top right, click on the Add Cluster box to create a new cluster.
  • On the Add Cluster page select to create a new cluster from Existing Nodes:
  • On the Add Cluster page that opens, provide a name for the cluster.
  • Select the Kubernetes Version as the latest from the dropdown.
    For this blog, let’s proceed with 1.27.11
  • Select the Container Network as Cilium
  • Click on Create
  • To register the rke-node with the Rancher server, copy the registration command from the Rancher UI and execute it on the rke-node
  • On the rke-node
curl https://rancher.20.163.x.x.sslip.io/system-agent-install.sh | sudo  sh -s - --server https://rancher.20.163.x.x.sslip.io --label 'cattle.io/os=linux' --token ####################################################### --ca-checksum ####################################################### --etcd --controlplane --worker
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 30887 0 30887 0 0 21040 0 --:--:-- 0:00:01 --:--:-- 21040
[INFO] Label: cattle.io/os=linux
[INFO] Role requested: etcd
[INFO] Role requested: controlplane
[INFO] Role requested: worker
[INFO] Using default agent configuration directory /etc/rancher/agent
[INFO] Using default agent var directory /var/lib/rancher/agent
[INFO] Determined CA is necessary to connect to Rancher
[INFO] Successfully downloaded CA certificate
[INFO] Value from https://rancher.20.163.x.x.sslip.io/cacerts is an x509 certificate
[INFO] Successfully tested Rancher connection
[INFO] Rancher System Agent was detected on this host. Ensuring the rancher-system-agent is stopped.
[INFO] Downloading rancher-system-agent binary from https://rancher.20.163.x.x.sslip.io/assets/rancher-system-agent-amd64
[INFO] Successfully downloaded the rancher-system-agent binary.
[INFO] Downloading rancher-system-agent-uninstall.sh script from https://rancher.20.163.x.x.sslip.io/assets/system-agent-uninstall.sh
[INFO] Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO] Generating Cattle ID
[INFO] Cattle ID was already detected as #######################################################.
[INFO] Successfully downloaded Rancher connection information
[INFO] systemd: Creating service file
[INFO] Creating environment file /etc/systemd/system/rancher-system-agent.env
[INFO] Enabling rancher-system-agent.service
[INFO] Starting/restarting rancher-system-agent.service
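While waiting for the node to register, you can confirm that the agent service came up cleanly on the rke-node; a quick check, assuming systemd:

# Check the agent service state and follow its logs
sudo systemctl status rancher-system-agent
sudo journalctl -u rancher-system-agent -f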
  • The rke-node will take at least 5–10 minutes to register with the Rancher server.
  • Copy or download the kubeconfig file so you can check the status of the nodes and pods on the rke-node (a sketch for copying it on the node follows).
  • On the rke-node
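A minimal sketch for making the kubeconfig available to your shell, assuming the default RKE2 location (/etc/rancher/rke2/rke2.yaml) and a login user named rkenode:

# Copy the admin kubeconfig generated by RKE2 into the user's home directory
sudo cp /etc/rancher/rke2/rke2.yaml /home/rkenode/config.yml
sudo chown rkenode:rkenode /home/rkenode/config.yml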
export KUBECONFIG=/home/rkenode/config.yml

kubectl get nodes -A -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
rke-node Ready control-plane,etcd,master,worker 20h v1.27.11+rke2r1 10.0.0.5 <none> Ubuntu 20.04.6 LTS 5.15.0-1053-azure containerd://1.7.11-k3s2

kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cattle-fleet-system fleet-agent-7f9ccfb8b-7hd9g 1/1 Running 0 20h 10.42.0.143 rke-node <none> <none>
cattle-system cattle-cluster-agent-975dccfbc-5plgf 1/1 Running 0 20h 10.42.0.253 rke-node <none> <none>
cattle-system rancher-webhook-7dc6679459-mmdgc 1/1 Running 0 20h 10.42.0.170 rke-node <none> <none>
cattle-system system-upgrade-controller-78cfb99bb7-2gm2h 1/1 Running 0 20h 10.42.0.148 rke-node <none> <none>
kube-system cilium-26mkk 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system cilium-operator-86d6785dd8-pbmb4 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system cloud-controller-manager-rke-node 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system etcd-rke-node 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system helm-install-rke2-cilium-gjhtf 0/1 Completed 0 20h 10.0.0.5 rke-node <none> <none>
kube-system helm-install-rke2-coredns-cnn2k 0/1 Completed 0 20h 10.0.0.5 rke-node <none> <none>
kube-system helm-install-rke2-ingress-nginx-8q4w7 0/1 Completed 0 20h 10.42.0.231 rke-node <none> <none>
kube-system helm-install-rke2-metrics-server-jrsnd 0/1 Completed 0 20h 10.42.0.207 rke-node <none> <none>
kube-system helm-install-rke2-snapshot-controller-crd-bxf89 0/1 Completed 0 20h 10.42.0.96 rke-node <none> <none>
kube-system helm-install-rke2-snapshot-controller-n8m8v 0/1 Completed 1 20h 10.42.0.110 rke-node <none> <none>
kube-system helm-install-rke2-snapshot-validation-webhook-mqzlb 0/1 Completed 0 20h 10.42.0.64 rke-node <none> <none>
kube-system kube-apiserver-rke-node 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system kube-controller-manager-rke-node 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system kube-proxy-rke-node 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system kube-scheduler-rke-node 1/1 Running 0 20h 10.0.0.5 rke-node <none> <none>
kube-system rke2-coredns-rke2-coredns-94745d-cqr2g 1/1 Running 0 20h 10.42.0.175 rke-node <none> <none>
kube-system rke2-coredns-rke2-coredns-autoscaler-d8587b89c-5v8q5 1/1 Running 0 20h 10.42.0.194 rke-node <none> <none>
kube-system rke2-ingress-nginx-controller-xgkgs 1/1 Running 0 20h 10.42.0.69 rke-node <none> <none>
kube-system rke2-metrics-server-5c9768ff67-jsj58 1/1 Running 0 20h 10.42.0.206 rke-node <none> <none>
kube-system rke2-snapshot-controller-7d6476d7cb-cbzzc 1/1 Running 0 20h 10.42.0.199 rke-node <none> <none>
kube-system rke2-snapshot-validation-webhook-5649fbd66c-f9fdl 1/1 Running 0 20h 10.42.0.112 rke-node <none> <none>
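If the Cilium pods are still starting, one way to block until the DaemonSet rollout finishes before continuing is sketched below:

kubectl -n kube-system rollout status daemonset/cilium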

Validating the Cilium version

  • Using the Cilium CLI we can validate the Cilium version that has been installed.
cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: disabled
\__/ ClusterMesh: disabled

DaemonSet cilium Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 1, Ready: 1/1, Available: 1/1
Containers: cilium Running: 1
cilium-operator Running: 1
Cluster Pods: 10/10 managed by Cilium
Image versions cilium rancher/mirrored-cilium-cilium:v1.15.1: 1
cilium-operator rancher/mirrored-cilium-operator-generic:v1.15.1: 1

Note: a closer look at the image repositories (rancher/mirrored-cilium-*) shows that the images are mirrored and packaged by the Rancher team, which confirms that the expected, Rancher-packaged Cilium version has been installed.
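The same information can also be read straight from the DaemonSet spec with kubectl; a quick sketch:

# Print the image used by the first container (the agent) of the cilium DaemonSet
kubectl -n kube-system get daemonset cilium -o jsonpath='{.spec.template.spec.containers[0].image}'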

Cluster and Cilium Health Check

  • Let’s check the health of the nodes
kubectl get nodes -A -o wide

NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
rke-node Ready control-plane,etcd,master,worker 22h v1.27.11+rke2r1 10.0.0.5 <none> Ubuntu 20.04.6 LTS 5.15.0-1053-azure containerd://1.7.11-k3s2
  • Let’s also check the node-to-node health with cilium-health status
kubectl exec -ti ds/cilium -n kube-system -- cilium-health status

Defaulted container "cilium-agent" out of: cilium-agent, install-portmap-cni-plugin (init), config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), install-cni-binaries (init)
Probe time: 2024-03-11T10:24:03Z
Nodes:
rke-node (localhost):
Host connectivity to 10.0.0.5:
ICMP to stack: OK, RTT=139.702µs
HTTP to agent: OK, RTT=253.903µs
Endpoint connectivity to 10.42.0.213:
ICMP to stack: OK, RTT=115.201µs
HTTP to agent: OK, RTT=235.003µs
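The output above reports the most recent cached probe; the in-pod cilium-health tool also accepts a --probe flag to run a fresh probe instead, as in this quick sketch:

kubectl exec -ti ds/cilium -n kube-system -- cilium-health status --probe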

Validate the installation

Let’s run a cilium connectivity test (an automated test that checks that Cilium has been deployed correctly and tests intra-node connectivity, inter-node connectivity and network policies) to verify that everything is working as expected.

root@rkenode:~# cilium connectivity test
ℹ️ Single-node environment detected, enabling single-node connectivity test
ℹ️ Monitor aggregation detected, will skip some flow validation steps
✨ [rke-node] Creating namespace cilium-test for connectivity check...
✨ [rke-node] Deploying echo-same-node service...
✨ [rke-node] Deploying DNS test server configmap...
✨ [rke-node] Deploying same-node deployment...
✨ [rke-node] Deploying client deployment...
✨ [rke-node] Deploying client2 deployment...
⌛ [rke-node] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [rke-node] Waiting for CiliumEndpoint for pod cilium-test/client-6f6788d7cc-fshpx to appear...
⌛ [rke-node] Waiting for CiliumEndpoint for pod cilium-test/client2-bc59f56d5-dszgk to appear...
⌛ [rke-node] Waiting for pod cilium-test/client-6f6788d7cc-fshpx to reach DNS server on cilium-test/echo-same-node-58f99d79f4-w7psc pod...
⌛ [rke-node] Waiting for pod cilium-test/client2-bc59f56d5-dszgk to reach DNS server on cilium-test/echo-same-node-58f99d79f4-w7psc pod...
⌛ [rke-node] Waiting for pod cilium-test/client-6f6788d7cc-fshpx to reach default/kubernetes service...
⌛ [rke-node] Waiting for pod cilium-test/client2-bc59f56d5-dszgk to reach default/kubernetes service...
⌛ [rke-node] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-58f99d79f4-w7psc to appear...
⌛ [rke-node] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [rke-node] Waiting for NodePort 10.0.0.4:32312 (cilium-test/echo-same-node) to become ready...
ℹ️ Skipping IPCache check
🔭 Enabling Hubble telescope...
⚠️ Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4245: connect: connection refused"
ℹ️ Expose Relay locally with:
cilium hubble enable
cilium hubble port-forward&
ℹ️ Cilium version: 1.14.2
🏃 Running tests...
[=] Test [no-policies]
....................
[=] Test [no-policies-extra]
..
[=] Test [allow-all-except-world]
........
[=] Test [client-ingress]
..
[=] Test [client-ingress-knp]
..
[=] Test [allow-all-with-metrics-check]
..
[=] Test [all-ingress-deny]
......
[=] Test [all-ingress-deny-knp]
......
[=] Test [all-egress-deny]
........
[=] Test [all-egress-deny-knp]
........
[=] Test [all-entities-deny]
......
[=] Test [cluster-entity]
..
[=] Test [host-entity]
..
[=] Test [echo-ingress]
..
[=] Test [echo-ingress-knp]
..
[=] Test [client-ingress-icmp]
..
[=] Test [client-egress]
..
[=] Test [client-egress-knp]
..
[=] Test [client-egress-expression]
..
[=] Test [client-egress-expression-knp]
..
[=] Test [client-with-service-account-egress-to-echo]
..
[=] Test [client-egress-to-echo-service-account]
..
[=] Test [to-entities-world]
......
[=] Test [to-cidr-external]
....
[=] Test [to-cidr-external-knp]
....
[=] Test [echo-ingress-from-other-client-deny]
....
[=] Test [client-ingress-from-other-client-icmp-deny]
....
[=] Test [client-egress-to-echo-deny]
....
[=] Test [client-ingress-to-echo-named-port-deny]
..
[=] Test [client-egress-to-echo-expression-deny]
..
[=] Test [client-with-service-account-egress-to-echo-deny]
..
[=] Test [client-egress-to-echo-service-account-deny]
.
[=] Test [client-egress-to-cidr-deny]
....
[=] Test [client-egress-to-cidr-deny-default]
....
[=] Test [health]
.

[=] Skipping Test [north-south-loadbalancing]

[=] Skipping Test [pod-to-pod-encryption]

[=] Skipping Test [node-to-node-encryption]

[=] Skipping Test [egress-gateway-excluded-cidrs]

[=] Skipping Test [north-south-loadbalancing-with-l7-policy]
[=] Test [echo-ingress-l7]
......
[=] Test [echo-ingress-l7-named-port]
......
[=] Test [client-egress-l7-method]
......
[=] Test [client-egress-l7]
........
[=] Test [client-egress-l7-named-port]
........

[=] Skipping Test [client-egress-l7-tls-deny-without-headers]

[=] Skipping Test [client-egress-l7-tls-headers]

[=] Skipping Test [client-egress-l7-set-header]

[=] Skipping Test [echo-ingress-auth-always-fail]

[=] Skipping Test [echo-ingress-mutual-auth-spiffe]

[=] Skipping Test [pod-to-ingress-service]

[=] Skipping Test [pod-to-ingress-service-deny-all]

[=] Skipping Test [pod-to-ingress-service-allow-ingress-identity]
[=] Test [dns-only]
........
[=] Test [to-fqdns]
........

✅ All 42 tests (184 actions) successful, 13 tests skipped, 0 scenarios skipped.
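Once the connectivity test has finished, the cilium-test namespace it created can be removed; a quick cleanup sketch:

kubectl delete namespace cilium-test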


Troubleshooting

If the resource group in which you are creating the Rancher server doesn't have the requisite permissions, you will see this error:


│ Error: [ERROR] Timeout trying to login with admin user: invalid character 'p' after top-level value

│ with module.rancher_common.rancher2_bootstrap.admin,
│ on ../rancher-common/rancher.tf line 4, in resource "rancher2_bootstrap" "admin":
│ 4: resource "rancher2_bootstrap" "admin" {
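One way to confirm that the service principal actually holds the required permissions on the resource group is to list its role assignments with the Azure CLI; a sketch, where <azure_client_id> and <resource-group-name> are placeholders for your own values:

az role assignment list --assignee <azure_client_id> --resource-group <resource-group-name> --output table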

Try out Cilium

  • Try out Cilium to get first-hand experience of how it solves real networking, security, and observability problems and use cases in your cloud-native or on-prem environments.

🌟Conclusion 🌟

Hopefully, this post gave you a good overview of how to deploy a Rancher server on Azure (using Terraform) on a single-node K3s Kubernetes cluster and then attach a single-node downstream Kubernetes cluster running Cilium as its default CNI.

Thank you for Reading !! 🙌🏻😁📃, see you in the next blog.

🚀 Feel free to connect/follow with me/on :

LinkedIn: linkedin.com/in/agamitgupta
