Build a managed Kubernetes cluster from scratch — part 3

Tony Norlin
11 min read · May 18, 2022


As the cluster now has a solid foundation, with an external Certificate Authority and the Control Plane in place, it is time for the fun (and perhaps challenging) pieces. To keep the material readable and digestible, I will not write one big article that takes weeks or many evenings to finish; waiting for some kind of end state in the land (or sea) of Kubernetes isn't really desirable. The saga will change shape many times along the path.

The basic cluster after this part is done.

Fellow Kubernauts! We shall now build the cluster, with a Control Plane and a Data Plane that live in separate network segments, where the only means of communication between the planes is a limited set of API endpoints.

Some of the features I've managed to achieve (although the ingress controller arrives in the coming major version of Cilium and is currently only available in the release candidate) are:

  • Control Plane and Data Plane on separate VLAN
  • Service LoadBalancer with integrated IPAM
  • Ingress Controller
  • External etcd
  • Persistent Storage delivered by Longhorn
  • Services such as Loki, Prometheus, Grafana, Jaeger etc. deployed
  • Services and Ingress resources register A records in the internal DNS through annotations picked up by ExternalDNS
  • Automatic Let's Encrypt certificates through annotations to cert-manager, which talks to the external DNS for DNS-01 challenges

A feature I'm still struggling a bit with is having BPF networking with native acceleration (XDP) in place. As I read about XDP being supported in the virtio_net driver (here, here and here), I tried to implement it for the sake of performance tuning (and spoof protection, more on that in a later part). For various reasons I failed to get it up and running: it became a couple of hours long session, from testing acceptable flags in the cilium-agent, to trial and error with bhyve and ethtool trying to disable LRO, to replacing the virtio_net driver with pass-through of the physical NIC, only to realize that the beta version of the agent I was running didn't appear to be able to compile the drivers in the first place. I expect to have XDP up and running later on.
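For reference, the kind of checks I was fiddling with inside a guest looked roughly like the sketch below (the interface name ens3 is an assumption, adjust it to whatever the guest actually exposes):

# Identify the driver behind the NIC (a virtio guest NIC should report virtio_net).
ethtool -i ens3

# XDP in virtio_net requires LRO to be off; check the current state and disable it.
ethtool -k ens3 | grep large-receive-offload
ethtool -K ens3 lro off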

Currently I'm running my bhyve guests with the physical NIC in pass-through mode, but I am looking into the highly popular crowdfunded Turing Pi 2 for my worker nodes in the future. Not that I'm out of CPU cores or memory in my own homelab (data center?), but I believe the Turing Pi is a way to reduce power consumption and fan noise.

In this first setup the focus is simply to get a basic cluster up and running, without all the bells and whistles (and with no focus on hardening), just to warm up and get comfortable with the environment.

Data Plane components

Worker nodes on illumos

Worker nodes running illumos are out of scope here. In the beginning that was kind of a dream goal, and that is probably what has made efforts to run Kubernetes on illumos (or Solaris) fruitless in the public channels, at least to my knowledge. If anyone is already running Kubernetes on illumos, on either the control plane, the data plane or both, please add a comment, as it would be really interesting to hear about other people's experiences.

So, illumos Kubernetes workers? At the moment it wouldn't make sense to me. I was thinking about what would be needed, but what would the gain really be?

  • It is possible to run Alpine as an LX zone (or even Docker containers) through an emulated Linux ABI, but that framework is extensive compared to the cgroups/namespaces concept, and LX zones are incompatible with Linux kernel space.
  • A win could be on the network side with the advanced Crossbow project integrated into illumos, but that would still be incompatible with the current CNI implementations.
  • A storage backend natively on ZFS (with snapshot possibilities) is great as it is, but worker-local storage tends to be distributed, and there is no open-source model for scaling out ZFS storage horizontally.

With Linux, on the other hand, we already have the concepts in place. It is then just a matter of having the control plane and the data plane able to talk to each other, which isn't always without challenges, as we will find out later on.

Choice of method to run worker nodes

It doesn't really matter how the worker nodes are provisioned, as long as they have a compatible kubelet version (remember, we bootstrapped the cluster at v1.24.0, so the same version goes here), a compatible CRI (I chose CRI-O, but containerd will do as well) and a compatible CNI. The CNI is the tricky part, as it wants the Control Plane and Data Plane to talk to each other over a shared network. As Kubernetes classically began as an orchestrator controlling Docker instances with iptables rulesets, it became natural to establish an "overlay" network speaking the same VXLAN. We will instead do completely without kube-proxy and focus on having the worker nodes talk to each other over eBPF.

In other words, use bare metal, Raspberry Pi, KVM, VMware or whatever hypervisor suits you best; it will probably work just as well. My environment, however, is based on illumos, so that is what I will base this guide on.
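Whichever route you take, the node just has to line up with the rest of the cluster. A rough sanity check on a provisioned worker (assuming kubelet and CRI-O are already installed there, as they will be further down) might be:

# The kubelet should match the control plane version we bootstrapped (v1.24.0).
kubelet --version

# The CRI should track the same Kubernetes minor release (CRI-O 1.24.x here).
crio --version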

But first, a quick recap of a method to bootstrap bhyve nodes with cloud-init in OmniOS (illumos), with the help of the zadm command.

Bootstrap of worker nodes in illumos

Download a cloud-init compatible image that fits Cilium's compatibility matrix (https://docs.cilium.io/en/stable/operations/system_requirements/#linux-distribution-compatibility-matrix) and convert the image to raw format:

curl -L -o /var/tmp/jammy-server-cloudimg-amd64.img https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
qemu-img convert -O raw /var/tmp/jammy-server-cloudimg-amd64.{img,raw} && rm /var/tmp/jammy-server-cloudimg-amd64.img

Generate the password hash, for instance with openssl:

PWDHASH=$(openssl passwd -6)

Define the CRI-O specifics:

OS=xUbuntu_22.04
VERSION=1.24
K8SVERSION=1.24.0-00

Copy the certificates from the CA to /var/tmp as nodename.pem, nodename-key.pem and nodename.kubeconfig, and generate the cloud-init(s):

First, open a terminal (terminal1), go to the directory where the certificates were created, and type the following command to create the payload; copy the output.

(tar --transform=s,kubernetes-ca/kubernetes-ca,ca, -czf - \
kubernetes-ca/kubernetes-ca.pem $(for instance in {1..3}; do \
echo worker${instance}.pem worker${instance}-key.pem \
worker${instance}.kubeconfig; done |xargs) |base64)

Then, on the node where the instances will be created, type the following and paste in the output.

(cd /var/tmp; base64 -d | gtar -xzf -)
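Before rendering the cloud-init files it can be worth a quick sanity check that the payload actually landed and that the certificates are the ones you expect; a small sketch, with worker1 as the example:

# All files extracted from the base64 payload should be in place.
ls -l /var/tmp/ca.pem /var/tmp/worker*.pem /var/tmp/worker*.kubeconfig

# Inspect subject and expiry of one of the node certificates.
openssl x509 -in /var/tmp/worker1.pem -noout -subject -enddate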

This will have created ca.pem, nodename.pem, nodename-key.pem and nodename.kubeconfig in /var/tmp, which will be added to each respective cloud-init:

KUBERNETES_CA=$(cat /var/tmp/ca.pem|sed "s/^/      /g")

for instance in {1..3}; do
NODENAME=worker${instance}
NODECERT=$(cat /var/tmp/${NODENAME}.pem|sed "s/^/      /g")
NODEKEY=$(cat /var/tmp/${NODENAME}-key.pem|sed "s/^/      /g")
KUBECONFIG=$(cat /var/tmp/${NODENAME}.kubeconfig|sed "s/^/      /g")

cat << EOF > /var/tmp/${NODENAME}-init
#cloud-config
users:
  - name: kubernaut
    gecos: Captain Kube
    primary_group: users
    groups: users
    shell: /bin/bash
    expiredate: '2029-12-31'
    lock_passwd: false
    sudo: ALL=(ALL) NOPASSWD:ALL
    passwd: $PWDHASH
bootcmd:
  - systemctl disable --now systemd-networkd-wait-online
ntp:
  enabled: true
timezone: Europe/Stockholm
manage_resolv_conf: true
resolv_conf:
  nameservers: ['9.9.9.9', '1.1.1.1']
  searchdomains:
    - cloud.mylocal
  domain: cloud.mylocal
  options:
    rotate: true
    timeout: 1
write_files:
  - path: /etc/sysctl.d/enabled_ipv4_forwarding.conf
    content: |
      net.ipv4.conf.all.forwarding=1
  - path: /etc/modules-load.d/crio.conf
    content: |
      overlay
      br_netfilter
  - path: /etc/sysctl.d/99-kubernetes-cri.conf
    content: |
      net.bridge.bridge-nf-call-iptables = 1
      net.ipv4.ip_forward = 1
      net.bridge.bridge-nf-call-ip6tables = 1
  - path: /etc/sysctl.d/99-override_cilium_rp_filter.conf
    content: |
      net.ipv4.conf.lxc*.rp_filter = 0
  - path: /root/.bashrc
    content: |
      if [ ! -f /usr/bin/resize ]; then
        resize() {
          old=\$(stty -g)
          stty -echo
          printf '\033[18t'
          IFS=';' read -d t _ rows cols _
          stty "\$old"
          stty cols "\$cols" rows "\$rows"
        }
      fi
      if [ "\$(tty)" = "/dev/ttyS0" ]; then
        resize
      fi
    append: true
  - path: /etc/hosts
    content: |
      10.100.0.1 kube-apiserver
      10.200.0.1 worker1
      10.200.0.2 worker2
      10.200.0.3 worker3
    append: true
  - path: /var/lib/kubelet/${NODENAME}.pem
    content: |
${NODECERT}
  - path: /var/lib/kubelet/${NODENAME}-key.pem
    content: |
${NODEKEY}
  - path: /var/lib/kubelet/kubeconfig
    content: |
${KUBECONFIG}
  - path: /var/lib/kubelet/ca.pem
    content: |
${KUBERNETES_CA}
  - path: /var/lib/kubelet/kubelet-config.yaml
    content: |
      kind: KubeletConfiguration
      apiVersion: kubelet.config.k8s.io/v1beta1
      cgroupDriver: systemd
      authentication:
        anonymous:
          enabled: false
        webhook:
          enabled: true
        x509:
          clientCAFile: "/var/lib/kubelet/ca.pem"
      authorization:
        mode: Webhook
      clusterDomain: "cluster.local"
      clusterDNS:
        - "10.96.0.10"
      podCIDR: "10.240.${instance}.0/24"
      resolvConf: "/run/systemd/resolve/resolv.conf"
      runtimeRequestTimeout: "15m"
      tlsCertFile: "/var/lib/kubelet/${NODENAME}.pem"
      tlsPrivateKeyFile: "/var/lib/kubelet/${NODENAME}-key.pem"
  - path: /etc/systemd/system/kubelet.service
    content: |
      [Unit]
      Description=Kubernetes Kubelet
      Documentation=https://github.com/kubernetes/kubernetes
      After=crio.service
      Requires=crio.service

      [Service]
      ExecStart=/usr/bin/kubelet \\
        --config=/var/lib/kubelet/kubelet-config.yaml \\
        --container-runtime=remote \\
        --container-runtime-endpoint=/var/run/crio/crio.sock \\
        --kubeconfig=/var/lib/kubelet/kubeconfig \\
        --register-node=true \\
        --v=2
      Restart=on-failure
      RestartSec=5

      [Install]
      WantedBy=multi-user.target
runcmd:
  - modprobe overlay
  - modprobe br_netfilter
  - sysctl --system 2>/dev/null
  - curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/Release.key | apt-key add -
  - curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/$VERSION/$OS/Release.key | apt-key add -
  - echo "deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/ /" > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
  - echo "deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/$VERSION/$OS/ /" > /etc/apt/sources.list.d/devel:kubic:libcontainers:stable:cri-o:$VERSION.list
  - curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
  - echo "deb https://baltocdn.com/helm/stable/debian/ all main" > /etc/apt/sources.list.d/helm-stable-debian.list
  - sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
  - echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
  - export DEBIAN_FRONTEND=noninteractive KUBECONFIG=/etc/kubernetes/admin.conf
  - DEBIAN_FRONTEND=noninteractive apt-get update -q -y
  - DEBIAN_FRONTEND=noninteractive apt-get install -y cri-o cri-o-runc apt-transport-https ca-certificates curl gnupg-agent software-properties-common bpfcc-tools bpftrace nfs-common helm
  - systemctl daemon-reload
  - systemctl enable --now crio
  - DEBIAN_FRONTEND=noninteractive apt-get install -q -y kubelet=$K8SVERSION kubectl=$K8SVERSION
  - DEBIAN_FRONTEND=noninteractive apt-mark hold kubelet kubectl
EOF
done
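The heredoc is picky about indentation, so take a quick look at one of the rendered files before moving on. A minimal sketch, assuming python3 with the PyYAML module is available wherever you run it:

# Eyeball the top of the rendered cloud-init for worker1.
head -n 20 /var/tmp/worker1-init

# Optionally verify that the file still parses as YAML.
python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1])); print("OK")' /var/tmp/worker1-init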

Create the worker node definition:

VLAN=50 # Set to whatever your VLAN ID is, or remove vlan-id from the json definition if no VLAN is in use.
CPU_SHARES=1 # How many shares, according to the Fair Share Scheduler
NODE_MEM=4G
NODE_VCPU=4
NODE_DNS1=1.1.1.1 # Primary resolver
NODE_DNS2=9.9.9.9 # Secondary resolver
DATADISK=20G # Later on, for Persistent Storage
GW=10.200.0.62
BITMASK=26

for instance in {1..3}; do
NODENAME=worker${instance}
INTERNAL_IP=10.200.0.${instance}
cat <<EOF > /var/tmp/${NODENAME}.json
{
  "acpi" : "on",
  "autoboot" : "true",
  "bootargs" : "",
  "bootdisk" : {
    "blocksize" : "8K",
    "path" : "dpool/bhyve/${NODENAME}/root",
    "size" : "20G",
    "sparse" : "true"
  },
  "disk" : [
    {
      "blocksize" : "8K",
      "path" : "dpool/bhyve/${NODENAME}/longhorn",
      "size" : "${DATADISK}",
      "sparse" : "true"
    }
  ],
  "bootorder" : "cd",
  "bootrom" : "BHYVE_CSM",
  "brand" : "bhyve",
  "cloud-init" : "/var/tmp/${NODENAME}-init",
  "cpu-shares" : "${CPU_SHARES}",
  "diskif" : "virtio",
  "dns-domain" : "cloud.mylocal",
  "fs-allowed" : "",
  "hostbridge" : "i440fx",
  "hostid" : "",
  "ip-type" : "exclusive",
  "limitpriv" : "default",
  "net" : [
    {
      "allowed-address" : "${INTERNAL_IP}/${BITMASK}",
      "defrouter" : "${GW}",
      "global-nic" : "aggr0",
      "physical" : "${NODENAME}",
      "vlan-id" : "${VLAN}"
    }
  ],
  "netif" : "virtio",
  "pool" : "",
  "ram" : "${NODE_MEM}",
  "resolvers" : [
    "${NODE_DNS1}",
    "${NODE_DNS2}"
  ],
  "rng" : "off",
  "scheduling-class" : "",
  "type" : "generic",
  "vcpus" : "${NODE_VCPU}",
  "vnc" : "on",
  "xhci" : "on",
  "zonename" : "${NODENAME}",
  "zonepath" : "/zones/${NODENAME}"
}
EOF
done

Create the worker nodes and write the raw image to their boot volumes:

for instance in {1..3}; do
zadm create -b bhyve worker${instance} < /var/tmp/worker${instance}.json
pv /var/tmp/jammy-server-cloudimg-amd64.raw > /dev/zvol/dsk/dpool/bhyve/worker${instance}/root
done

And finally, boot the first worker node in console mode (escape by ~~.) to check that everything is in order, then let the rest of the nodes boot up:

zadm boot -C worker1
for instance in {2..3}; do
zadm boot worker${instance}
done

Depending on the hardware (and connection) this could take between 30 and 90 seconds per node. If time is critical, it would be feasible to create a snapshot after the first boot and reuse it for new instances, replacing only the instance-specific configuration and certificates at boot time. The advantage would be that no installation needs to take place and the node boots up quickly. The (only?) disadvantage would be that the snapshot needs to be recreated after each version change.
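A rough sketch of that idea with ZFS (hedged: the dataset names follow the zone definitions above, worker4 is a hypothetical new node, and the clone would still need its own cloud-init with per-node certificates):

# Snapshot the root zvol of a fully installed worker...
zfs snapshot dpool/bhyve/worker1/root@v1.24.0-installed

# ...and clone it as the boot volume for a new worker, skipping the package installation entirely.
zfs clone dpool/bhyve/worker1/root@v1.24.0-installed dpool/bhyve/worker4/root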

Basic verification

Now that the workers have been installed, an initial test can be done by running the command below to list the registered nodes (the -w flag means that we are watching/tailing the output):

kubectl get nodes -w
NAME      STATUS     ROLES    AGE   VERSION
worker1   NotReady   <none>   10s   v1.24.0
worker2   Ready      <none>   10s   v1.24.0
worker3   NotReady   <none>   10s   v1.24.0
worker3   Ready      <none>   10s   v1.24.0
worker3   Ready      <none>   10s   v1.24.0
worker1   Ready      <none>   10s   v1.24.0
worker1   Ready      <none>   10s   v1.24.0
^C
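For a bit more detail at this stage, the wide output also shows the internal IP and which container runtime each kubelet registered with:

# INTERNAL-IP should match the 10.200.0.x addresses and CONTAINER-RUNTIME should report cri-o://1.24.x.
kubectl get nodes -o wide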

Installation of CNI (Cilium)

I am going to utilize a popular packaging system for Kubernetes: Helm. As there is no prebuilt binary for illumos, we either have to try to compile it or, as I will do to keep this guide tidy, install it on a compatible client (one of the worker nodes, another machine, or even an LX zone handles it well).

Prepare API Server

If the nodes are registered with their FQDNs in a DNS server where they are known, this shouldn't be an issue, but this guide does not expect that to be the case, so we need to populate /etc/hosts on the kube-apiserver host with all the worker nodes. In this case:

10.200.0.1        worker1
10.200.0.2        worker2
10.200.0.3        worker3

Install Helm

Inside the client, scan the script quickly and then run it.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
cat get_helm.sh
bash ./get_helm.sh

Install CoreDNS

To be able to install anything (including the CNI), we need DNS resolution in place, and to that end we will install the nowadays standard component CoreDNS:

curl -s https://raw.githubusercontent.com/coredns/deployment/master/kubernetes/coredns.yaml.sed | \
  sed '/^\s*forward . UPSTREAMNAMESERVER {$/{:a;N;/^\s*}$/M!ba;d};s/CLUSTER_DNS_IP/10.96.0.10/g;s/CLUSTER_DOMAIN REVERSE_CIDRS/cluster.local in-addr.arpa ip6.arpa/g;s/}STUBDOMAINS/}/g;s/# replicas:/replicas: 2 #/g' | \
  kubectl create -f -

Install Cilium

As we don’t (want to) have a kube-proxy in our cluster, we will need to specify to the cilium-agent how the API server should be reached.

KUBE_APISERVER=10.100.0.1
KUBE_APIPORT=6443
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.11.5 \
--namespace kube-system --set kubeProxyReplacement=strict \
--set k8sServiceHost=${KUBE_APISERVER} \
--set k8sServicePort=${KUBE_APIPORT}

Verify that Cilium has been installed and that all the pods are up and running:

kubectl get pod -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE
kube-system   cilium-fplqr                       1/1     Running   0          57s
kube-system   cilium-mvqfd                       1/1     Running   0          57s
kube-system   cilium-operator-69f9cc8f68-dcgdj   1/1     Running   0          57s
kube-system   cilium-operator-69f9cc8f68-xdshx   1/1     Running   0          57s
kube-system   cilium-tqjvn                       1/1     Running   0          57s
kube-system   coredns-6cd56d4df4-4slzg           1/1     Running   0          2m39s
kube-system   coredns-6cd56d4df4-njhln           1/1     Running   0          2m39s

Check the status of the cilium-agent:

kubectl exec -n kube-system -it $(kubectl -n kube-system get pod -l k8s-app=cilium -o name | head -1) -- cilium status
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), clean-cilium-state (init)
KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.24+ (v1.24.0-2+906e9d86543c71) [illumos/amd64]
Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    Strict   [worker2 10.200.0.2 (Direct Routing)]
Host firewall:           Disabled
Cilium:                  Ok   1.11.5 (v1.11.5-b0d3140)
NodeMonitor:             Listening for events on 4 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok
IPAM:                    IPv4: 3/254 allocated from 10.0.0.0/24,
BandwidthManager:        Disabled
Host Routing:            Legacy
Masquerading:            IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:       24/24 healthy
Proxy Status:            OK, ip 10.0.0.64, 0 redirects active on ports 10000-20000
Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 2.24   Metrics: Disabled
Encryption:              Disabled
Cluster health:          3/3 reachable   (2022-05-18T18:45:13Z)

Troubleshooting

If there are any issues, such as timeouts, check the firewall logs. At the bare minimum, the following rules need to be in place:

  • kube-apiserver -> Worker Plane 10250/TCP
  • Worker Plane -> kube-apiserver 6443/TCP
  • Worker Plane -> DNS-servers 53/UDP
  • Worker Plane -> various registries 443/TCP
  • Worker Plane -> Various repos (check the cloud-init file)

Also, the Cilium agent has many helpful command-line options.
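As a starting point, a couple of commands I tend to reach for (run inside one of the cilium-agent pods; the DaemonSet name cilium is the default from the Helm chart):

# Verbose agent status, including controller failures and datapath configuration.
kubectl -n kube-system exec ds/cilium -- cilium status --verbose

# Node-to-node and node-to-endpoint health probes, useful for spotting firewall issues between the planes.
kubectl -n kube-system exec ds/cilium -- cilium-health status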

Observability with Hubble

We can add Hubble to this cluster by running the following helm command:

helm upgrade cilium cilium/cilium --version 1.11.5 \
--namespace kube-system --reuse-values --set hubble.listenAddress=":4244" \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true

The cluster at this stage lacks a network LoadBalancer and we don't have an ingress (more on that in a later part), so we can either do a kubectl port-forward and browse through the tunnel, or patch the service type to NodePort (not a recommended way).
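The port-forward variant would look roughly like this (run from wherever kubectl and the kubeconfig live; the local port 12000 is an arbitrary choice):

# Forward a local port to the hubble-ui service and browse http://localhost:12000
kubectl -n kube-system port-forward svc/hubble-ui 12000:80

Alternatively, the NodePort patch below does the same job without keeping a tunnel open: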

kubectl -n kube-system patch svc hubble-ui -p '{"spec": {"type": "NodePort"}}'
service/hubble-ui patched
kubectl -n kube-system get svc
NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
hubble-peer    ClusterIP   10.110.25.184    <none>        4254/TCP                 41m
hubble-relay   ClusterIP   10.110.129.215   <none>        80/TCP                   3m1s
hubble-ui      NodePort    10.103.8.52      <none>        80:32606/TCP             3m1s
kube-dns       ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   43m
Hubble through NodePort

Conclusion

We now have the cluster in its most basic form. It should handle workloads that don't rely on webhooks or LoadBalancer type services without issues.

Isovalent and its Cilium project are expected to have v1.12 go GA any time soon, and I intend to showcase how we can then add the ingress controller in a streamlined way.

To be continued…
