Kubernetes on Illumos — exposing ClusterIP

A messy autumn with Longhorn

Tony Norlin
8 min read · Dec 4, 2022

It feels a bit like Pandora’s box, and it probably is, so be careful out there. While I desire control over what goes in and out of the cluster, my limited tinkering time against the pace of all the Kubernetes projects has had me playing catch-up. Fun, but also time consuming. My biggest issue lately has been storage. I chose Longhorn early on because it is a CNCF Incubating Project, a choice that was put to the test this autumn.

HubbleUI with a great overview of the traffic.

My illumos control plane/Linux data plane cluster has been up and running without any issues (albeit with some restrictions and limitations) for about 300 days, serving my home with a limited number of services. About two months ago I decided to go for v1.25.1, only to realise that Longhorn expected some API resources that were no longer present. I downgraded the cluster back to v1.24.x, restored the persistent volumes that were unhappy, and left it that way for a while.

Time for a change

With Kubernetes v1.26 expected to be released soon (6th of December), I decided to go ahead with Longhorn anyway, despite the lack of support for v1.26. Support will arrive in their v1.4 branch, which luckily happens to be their master branch, so I went ahead, living on the edge.
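For reference, installing Longhorn straight from the master branch can be done with the raw deploy manifest; a sketch that assumes the project's usual manifest path, and obviously running master is at your own risk:

$ kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml
$ kubectl -n longhorn-system get pods   # wait for the manager and csi pods to become Ready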

One of the issues that had me downgrade the cluster two months ago was that while Longhorn v1.2 behaved perfectly in my cluster, v1.3 came with both a MutatingWebhookConfiguration and a ValidatingWebhookConfiguration, which I configured in the external loadbalancer. This time, my change of the webhooks from Service: to url: in the CRD scheme did not affect the traffic, as my changes were overwritten all the time, and it appeared to be something with the deployment of the longhorn-manager. I began to clone their Git repo, but it felt awkward to try to repair something that fragile (something I would need to patch myself in every release) and I was tempted to try something else, but first I let the storage issue rest for the moment.
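For context, the kind of change I kept losing is the standard admissionregistration switch in clientConfig from an in-cluster Service reference to an external URL. The names and URL below are hypothetical, only the shape matters:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: longhorn-webhook-validator          # hypothetical name
webhooks:
- name: validator.longhorn.example          # hypothetical name
  clientConfig:
    # service:                              # what the manager keeps rendering back
    #   name: longhorn-admission-webhook
    #   namespace: longhorn-system
    #   port: 9443
    url: https://longhorn-webhook.example.net:9443/   # external endpoint instead
  admissionReviewVersions: ["v1"]
  sideEffects: None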

Pandora’s box

The boundaries

To get back to the issue: I have a control plane network that isn’t aware of the data plane network, and vice versa. They live in separate network segments, and communication from the workers has strictly been passing through a firewall to port TCP/6443 on the kube-apiserver, while traffic the other way has gone through an external loadbalancer with defined endpoints, as the kube-apiserver hasn’t been able to speak directly with the ClusterIP CIDR. This is a great way to keep control of the external communication, but it inflicts some manual labour during deployments.
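Purely as an illustration of that boundary, a rule of roughly this shape in ipf.conf (illumos IP Filter) expresses the worker-to-apiserver direction; the interface name and addresses are hypothetical and this is not my actual ruleset:

# allow the worker (data plane) segment to reach the kube-apiserver only on TCP/6443
pass in quick on vioif0 proto tcp from 192.168.12.0/24 to 10.127.1.2 port = 6443 flags S keep state
# everything else from that segment is dropped
block in quick on vioif0 from 192.168.12.0/24 to any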

First attempt — VTEP

I went back to the network design to see if I should try anything else. Cilium implemented VTEP support earlier this year, which should allow an external VXLAN tunnel endpoint to connect to the cluster with ID 2 (which defines an external network in Cilium), and there is support for VXLAN in illumos by creating an overlay device. I had an attempt at it, but while VXLAN probably isn’t that complicated, I failed to get a link between an external illumos node and the data plane (possibly because everything sits in different VLAN segments at home).
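A rough sketch of what I was aiming for, heavily simplified and with every address hypothetical: on the Cilium side the VTEP integration is enabled through Helm values (introduced around Cilium 1.12; names may differ per version), and on the illumos side an overlay device is created towards a worker. I never got this pairing to come up, so treat it as a starting point rather than a working recipe:

# Cilium side: enable the VTEP integration
$ helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
    --set tunnel=vxlan \
    --set vtep.enabled=true \
    --set vtep.endpoint="192.168.12.5" \
    --set vtep.cidr="10.127.2.0/24" \
    --set vtep.mask="255.255.255.0" \
    --set vtep.mac="02:08:20:12:34:56"

# illumos side: create a VXLAN overlay pointing at a worker, then put an address on top
$ dladm create-overlay -e vxlan -s direct -v 2 \
    -p vxlan/listen_ip=192.168.12.5 \
    -p direct/dest_ip=192.168.12.31 -p direct/dest_port=8472 vxlan2
$ dladm create-vnic -l vxlan2 vxlan2vnic0
$ ipadm create-addr -T static -a 10.127.2.5/24 vxlan2vnic0/v4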

Second attempt — utilize BGP

As BGP is already in place and announces the LoadBalancer CIDR to clients, the missing link was mostly the ClusterIP CIDR (which, by design, is not externally reachable). With Cilium (the swiss army knife) there are possibilities to do (well, almost) anything. I decided to go ahead and expose that ClusterIP CIDR to the BGP peer (leaf) that communicates with the worker nodes and, in turn, announce that CIDR to the control plane.
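For completeness, the Cilium side of the existing announcements is the MetalLB-style bgp-config ConfigMap that pairs with bgp.enabled=true. The sketch below reuses the addressing from my FRR config further down; the peer-address in particular is hypothetical and should be whatever leaf address your nodes can actually reach:

apiVersion: v1
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system
data:
  config.yaml: |
    peers:
      - peer-address: 10.127.1.3   # the FRR leaf (hypothetical reachable address)
        peer-asn: 64803
        my-asn: 64521              # matches "neighbor K8S remote-as 64521" below
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 10.248.248.0/24        # Service LoadBalancer CIDR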

I got reminded that a couple of months ago someone wisely commented about going with BFD and null routes instead, and while I haven’t gone down that route yet, I decided to reconfigure the BGP a bit (as I have a wish to implement node autoscaling by dynamically bootstrapping nodes on demand through cloud-init) and ended up with this configuration (I have a spine at 10.127.1.1):

frr version 7.5
frr defaults traditional
hostname frr
log syslog
no ipv6 forwarding
service integrated-vtysh-config
!
ip route 10.96.0.0/12 192.168.12.31
ip route 10.96.0.0/12 192.168.12.32
ip route 10.96.0.0/12 192.168.12.33
ip route 10.96.0.0/12 K8S
!
router bgp 64803
bgp router-id 10.127.1.3
bgp log-neighbor-changes
no bgp ebgp-requires-policy
bgp bestpath as-path multipath-relax
neighbor K8S peer-group
neighbor K8S remote-as 64521

neighbor K8S capability extended-nexthop
neighbor K8S update-source 10.127.1.3
bgp listen range 192.168.12.0/24 peer-group K8S


neighbor 10.127.1.1 remote-as 64801
neighbor 10.127.1.1 description gw1.infra.ploio.net
!
address-family l2vpn evpn
neighbor K8S activate
neighbor K8S route-reflector-client
exit-address-family
!
address-family ipv4 unicast
network 10.96.0.0/12 # Service CIDR
network 10.248.248.0/24 # Service LoadBalancer CIDR
network 192.168.12.0/24 # Node CIDR
neighbor K8S next-hop-self
no neighbor K8S send-community
neighbor 10.127.1.1 soft-reconfiguration inbound
neighbor 10.127.1.1 route-map ALLOW-ALL in
neighbor 10.127.1.1 route-map ALLOW-ALL out
maximum-paths 120
exit-address-family
!
route-map ALLOW-ALL permit 100
!
line vty
!

This is by no means an optimal setup, as the integrated MetalLB does not seem to be able to expose the ClusterIP range, and as Cilium is now putting its efforts into GoBGP, let’s hope they will also allow announcing the ClusterIP CIDR. Meanwhile, ECMP is what I found to solve my issue for now. In short, the first part enables a multipath route to the worker nodes, but it seems to force me to explicitly type in each worker node for a path to be defined.
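To sanity-check that the static ECMP routes and the BGP sessions actually came up, I mostly poked around with standard FRR show commands from vtysh; something along these lines (output omitted):

$ vtysh -c 'show ip route 10.96.0.0/12'      # should list one nexthop per worker (ECMP)
$ vtysh -c 'show bgp ipv4 unicast summary'   # the K8S peer-group members should be Established
$ vtysh -c 'show bgp ipv4 unicast 10.248.248.0/24'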

I should probably revise the configuration once more, as I was working in parallel with the Longhorn issue, and I believe one of the problems was in fact that roughly 140k replica objects put strain on my kube-apiserver.

This is not scientific by any means, just an observation made while replicas were being deleted in the background by hammering kubectl with selected lists, but it might show the scale of my issue. When there were >140k replicas I couldn’t even get kubectl to finish listing them all before the kube-apiserver crashed (and I let it have a rather generous memory limit of 12G):

+----------+-----------+
| Replicas | Duration  |
+----------+-----------+
|    67030 | 1m28.356s |
|    64998 | 1m25.653s |
|    61059 | 1m21.428s |
|    52542 | 1m10.275s |
|    47254 | 1m7.148s  |
|    44383 | 0m59.895s |
|    37623 | 0m49.727s |
|    29751 | 0m39.100s |
|    25840 | 0m33.235s |
|    22715 | 0m29.888s |
|    19982 | 0m24.990s |
|    15748 | 0m19.394s |
|     9910 | 0m12.478s |
|     7990 | 0m9.568s  |
|     6230 | 0m7.770s  |
|     5603 | 0m6.995s  |
|     4673 | 0m5.859s  |
|     4323 | 0m5.437s  |
|     4093 | 0m4.595s  |
|     3065 | 0m3.663s  |
|     2418 | 0m2.871s  |
|     1397 | 0m1.634s  |
|      852 | 0m1.005s  |
|      663 | 0m0.750s  |
|      405 | 0m0.504s  |
|      279 | 0m0.380s  |
|      196 | 0m0.267s  |
|       25 | 0m0.094s  |
+----------+-----------+
(Duration = real time reported by: time kubectl -n longhorn-system get replicas | wc -l)
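The background deletion itself was nothing fancier than feeding selected lists back into kubectl in batches; roughly like this, reconstructed for illustration rather than the exact commands I ran, with the grep pattern as a placeholder:

$ kubectl -n longhorn-system get replicas -o name \
    | grep <pattern-for-the-orphaned-volume> \
    | head -n 200 \
    | xargs -r kubectl -n longhorn-system delete
# ...repeated until the selected replicas were gone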

Third attempt — for the future

One path I have left would be to utilize the l2vpn part and try to have FRR talk VXLAN to the VTEP interface in the cluster, though I see no direct reason at the moment.

Advantages and disadvantages

Security-wise, access to the data plane is way more permissive now, but there are still options to set filters in FRR, and the network policies seem to work well. In OmniOS there is an implementation of ExaBGP that I might look into to bring the routing to bare metal (right now FRR runs in a bhyve guest, which implies a longer communication path).

The latest Cilium in place

While I was already at it, I decided to go ahead and install v1.13.0-rc2 of Cilium, which now has support for gateway-api (v0.5.1) with HTTPRoute. I’ve been running the TLSRoute for a couple of weeks in one of my labs and it seemed to work well enough to have me install it back “at home”.

In order to create Gateway objects, the CRDs from the gateway-api project must be installed:

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.5.1/config/crd/standard/gateway.networking.k8s.io_gatewayclasses.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.5.1/config/crd/standard/gateway.networking.k8s.io_gateways.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.5.1/config/crd/standard/gateway.networking.k8s.io_httproutes.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v0.5.1/config/crd/experimental/gateway.networking.k8s.io_referencegrants.yaml

Cilium allows for external exposure of ClusterIP services with bpf.lbExternalClusterIP set to true.

KUBE_APISERVER=<IP of the kube-apiserver>  # to help the workers initialize
helm upgrade cilium cilium/cilium \
  --install \
  --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=${KUBE_APISERVER} \
  --set k8sServicePort=6443 \
  --set ingressController.enabled=true \
  --set bgp.enabled=true \
  --set bgp.announce.loadbalancerIP=true \
  --set bgp.announce.podCIDR=true \
  --set prometheus.enabled=true \
  --set operator.prometheus.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}" \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set bpf.lbExternalClusterIP=true \
  --set gatewayAPI.enabled=true \
  --version 1.13.0-rc2

And please, don’t be like Tony: do set hubble.enabled=true right away, as it will (trust me) be tedious to troubleshoot why :4244 won’t listen on the nodes. ;-)

To have cert-manager create a certificate automatically for the Gateway object, there is (as of this writing) a need to enable --feature-gates=ExperimentalGatewayAPISupport=true. As for external-dns, recent releases should be able to handle Gateway objects, but my external-dns is so far unable to speak with the kubernetes Service (the API), and strangely enough it is the only deployment so far that has shown issues (which matters, as my environment relies a lot on external-dns creating records in the DNS). Certainly I must have missed something trivial, but I have removed the network policy I had in place, and an Alpine pod in that namespace can reach the internal Kubernetes service endpoint.
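If cert-manager is installed with Helm, that feature gate can be passed as an extra controller argument; a sketch assuming the upstream jetstack chart (adjust to however your cert-manager is deployed):

$ helm upgrade cert-manager jetstack/cert-manager \
    --install \
    --namespace cert-manager \
    --set installCRDs=true \
    --set "extraArgs={--feature-gates=ExperimentalGatewayAPISupport=true}"

With the gate enabled, the cert-manager.io/cluster-issuer annotation on the Gateway below is what triggers the certificate order.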

To create a browseable gateway, there need to be two objects: a Gateway object and a Route object of some kind (at the moment there appears to be nothing other than HTTPRoute, but expect the set of options to grow over time).

First the gateway object:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  annotations:
    cert-manager.io/cluster-issuer: issuer-production
  name: homeassistant-gateway
  namespace: hass
spec:
  gatewayClassName: cilium
  listeners:
  - allowedRoutes:
      namespaces:
        from: Same
    hostname: hassgw.infra.ploio.net
    name: https-1
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ""
        kind: Secret
        name: hassgw-cert
      mode: Terminate
Then the HTTPRoute object; note that there can be multiple backends with different weights (see https://gateway-api.sigs.k8s.io/api-types/httproute/ for more information):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: https-hass-route-1
  namespace: hass
spec:
  hostnames:
  - hassgw.infra.ploio.net
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: homeassistant-gateway
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: my-home-assistant
      port: 8123
      weight: 1
    matches:
    - path:
        type: PathPrefix
        value: /
I had some ideas on how UDPRoute and TCPRoute could be put to work, but it looks like they are not implemented here yet, so that is for the future. And by the way, WebSockets appear to work straight away through HTTPRoute objects:

Example of Home Assistant, served through gateway-api in one lab environment.

That’s it for this time. I plan to dig deeper into the gateway-api later and check out the Cilium BGP Control Plane now that they’ve added LB IPAM; I hope it will support announcing ClusterIP as well.
