Kubernetes multi-cluster networking with Cilium and kops

Some background

Why use multiple clusters

At Sportradar we prefer multiple single-tenant Kubernetes clusters over a single multi-tenant cluster. Sportradar has a distributed development organisation, and having multiple clusters makes it easier to ensure there are no global locks on cluster operations. Our current strategy is for each business/service unit to have its own cluster that it uses for all its applications.

Why Cilium

Since the units exchange a huge amount of data, the challenge with having multiple clusters quickly becomes apparent. For many use cases, exposing services out of the cluster in common ways like Ingress or LoadBalancer works just fine. But there are also many cases where the additional hops the traffic takes add overhead for very little gain. Some services also expose their data as Kafka topics for quick processing, which cannot be exposed out of the cluster using any of these mechanisms.

As we felt exposing services between clusters should be easier than a multi-tenant cluster, we began looking at various ways of connecting pods in different clusters directly. This is no more magical than ensuring routes exist between the various cluster networks. The question is just how to configure this in the easiest way possible, given that we would have a lot of cross-cluster communication.

The most common way seems to be using BGP in some form. This shouldn’t really be a surprise to anyone. Both Kube-router and Calico CNI base their cluster networking on BGP, and the former in particular has pretty good support for configuring BGP peers to networks outside of a cluster. But even though BGP isn’t that hard, it is still far from as intuitive as we’d expect in the Kubernetes world. The amount of configuration per cluster is substantial as well.

As luck would have it, Cilium announced the ClusterMesh feature in 1.2 just before we were about to get our hands dirty with one of the other CNIs. Looking at the documentation for ClusterMesh, this was in a whole different ballpark compared to some of the examples I have seen with other CNIs. We had no doubt this was the option for us.

One thing is worth mentioning though: Cilium allows you to set up a mesh between clusters, but it cannot talk to any other networks. For that you need to combine Cilium with something using BGP.

Setting up the mesh

The Cilium documentation gives a rough overview of the process, but as we configured almost everything from scratch, I thought I would share how we set up two clusters and connected them together.

Creating a couple of Kubernetes clusters

We use kops for operating our Kubernetes clusters. For the clusters using Cilium we have a shared subnet with shared routing tables. This just makes it easier to manage routes to other VPCs and to our on-premise sites. For our ClusterMesh, this means we don’t have to do much configuration to get node connectivity. All we do is create a Security Group that allows nodes to talk to other members of the security group. The kops documentation details how to configure such a setup.

I won’t go into any details about how to use kops here, just the key elements of a cluster configuration necessary to get ClusterMesh working. I prefer creating a cluster configuration file and then adding or editing the necessary changes.

As Cilium requires etcd3, ensure that you have explicitly set an etcd3 version as follows:

spec:
  etcdClusters:
  - etcdMembers:
    ...
    name: main
    version: 3.1.11
  - etcdMembers:
    ...
    name: events
    version: 3.1.11

Naturally, we cannot have conflicting routes, so ensure that the internal cluster network CIDRs do not overlap. You can just split the default 100.64.0.0/10 into multiple CIDRs of whatever size you find appropriate. Also add the networking part stating that we want to use Cilium and that we want version 1.2.2:

spec:
  networking:
    cilium:
      version: v1.2.2
  nonMasqueradeCIDR: 100.<63 + cluster-id>.0.0/16
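To make the arithmetic in that CIDR pattern concrete, here is a minimal sketch that derives the /16 from a numeric cluster id. The function name cluster_cidr is hypothetical, and the upper limit of 64 is an assumption that follows from staying inside the default 100.64.0.0/10 (second octets 64 to 127).

```shell
# Hypothetical helper: derive a cluster's /16 from its numeric id,
# following the 100.<63 + cluster-id>.0.0/16 pattern.
cluster_cidr() (
  id="$1"
  # Accept only positive integers without leading zeros.
  case "$id" in
    ''|0*|*[!0-9]*)
      echo "cluster id must be a positive integer" >&2
      exit 1
      ;;
  esac
  # 100.64.0.0/10 covers second octets 64..127, so ids 1..64 fit.
  if [ "$id" -gt 64 ]; then
    echo "cluster id must be between 1 and 64" >&2
    exit 1
  fi
  echo "100.$((63 + id)).0.0/16"
)

cluster_cidr 1   # prints 100.64.0.0/16
cluster_cidr 2   # prints 100.65.0.0/16
```

With /16 blocks this scheme leaves room for 64 clusters; a different block size would need a different mapping.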

For every instance group, ensure you are using a Stretch image (or CoreOS if you prefer):

spec:
  image: kope.io/k8s-1.9-debian-stretch-amd64-hvm-ebs-2018-03-11

Create a couple of cluster configuration files with the changes above and then run kops create -f <cluster>.yaml for each of them.

Upgrade Cilium

Once you have a couple of clusters up, you need to upgrade the Cilium DaemonSet and RBAC resources. The DaemonSet spec currently used by kops does not set the cluster-id and cluster-name configuration, nor does it mount the ClusterMesh configuration.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/HEAD/examples/kubernetes/1.10/cilium-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/HEAD/examples/kubernetes/1.10/cilium-ds.yaml
kubectl set image daemonset/cilium -n kube-system cilium-agent=docker.io/cilium/cilium:v1.2.2

Ensure you are using at least version 1.2.2 as it contains some essential bugfixes required to get ClusterMesh working.

Configuring the Cilium agents

Next up is adding the cluster-id and cluster-name to each of the clusters. On all your clusters run kubectl -n kube-system edit cm cilium-config and add the following:

data:
  cluster-name: "cluster<id>"
  cluster-id: "<id>"

Ensure that the <id> is unique across the clusters.
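If you would rather script this than edit the ConfigMap by hand, a JSON merge patch is one option. The sketch below is an assumption-laden alternative to the interactive edit, not what we ran: the helper name mesh_patch is hypothetical, and the command is only printed so it can be reviewed before being run against the right cluster.

```shell
# Hypothetical helper: build a JSON merge patch that sets cluster-name
# and cluster-id in the ConfigMap's data section.
mesh_patch() (
  id="$1"
  printf '{"data":{"cluster-name":"cluster%s","cluster-id":"%s"}}\n' "$id" "$id"
)

# Print the kubectl command for review. Run the printed line yourself
# once you have checked that kubectl points at the intended cluster.
ID=1
printf "kubectl -n kube-system patch configmap cilium-config --type merge -p '%s'\n" \
  "$(mesh_patch "$ID")"
```

Both values are quoted strings in the patch because ConfigMap data only holds strings.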

While editing the ConfigMaps above, also take note of the etcd-cluster literal. For each cluster, create a file with the same name as the cluster-name specified above, containing that cluster's etcd-cluster literal. Then create a secret containing these etcd configs in the kube-system namespace, where the Cilium agents run: kubectl -n kube-system create secret generic cilium-cluster-mesh --from-file=./cluster-name1 --from-file=./cluster-name2 ... --from-file=./cluster-nameN
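With more than a handful of clusters, typing the --from-file flags by hand gets tedious. As a rough sketch, the flags can be generated from a list of cluster names. The helper name from_file_args is hypothetical, the secret name is copied from the command above (check that it matches what your Cilium DaemonSet mounts), and the kube-system namespace is an assumption based on where the agents run.

```shell
# Hypothetical helper: turn cluster names into --from-file flags,
# one per etcd config file in the current directory.
from_file_args() (
  args=""
  for name in "$@"; do
    args="$args --from-file=./$name"
  done
  # Strip the single leading space left by the loop.
  echo "${args# }"
)

# Print the full command for review before running it.
echo "kubectl -n kube-system create secret generic cilium-cluster-mesh $(from_file_args cluster1 cluster2)"
```

The file names must not contain spaces for this simple string-building approach to work.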

The Cilium agents will recognize and ignore references to their local etcd-cluster, so it is perfectly safe to run the identical command across all clusters.

After creating the secrets, restart the Cilium agent on all clusters: kubectl -n kube-system delete pods -l k8s-app=cilium

That’s it!

Test connectivity

You should now be able to see remote nodes coming in when running kubectl -n kube-system exec -ti cilium-12345 cilium node list. It should look something like this:

Name                     IPv4 Address  Endpoint CIDR
cluster1/ip-10-0-53-242  10.0.53.242   100.64.129.0/24
cluster1/ip-10-0-53-247  10.0.53.247   100.64.131.0/24
cluster1/ip-10-0-54-163  10.0.54.163   100.64.128.0/24
cluster1/ip-10-0-54-169  10.0.54.169   100.64.133.0/24
cluster1/ip-10-0-55-61   10.0.55.61    100.64.130.0/24
cluster1/ip-10-0-55-86   10.0.55.86    100.64.132.0/24
cluster2/ip-10-0-53-161  10.0.53.161   100.65.133.0/24
cluster2/ip-10-0-53-209  10.0.53.209   100.65.128.0/24
cluster2/ip-10-0-54-229  10.0.54.229   100.65.131.0/24
cluster2/ip-10-0-54-57   10.0.54.57    100.65.129.0/24
cluster2/ip-10-0-55-100  10.0.55.100   100.65.130.0/24
cluster2/ip-10-0-55-111  10.0.55.111   100.65.132.0/24
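A quick way to check that every cluster has shown up with the expected number of nodes is to count the rows per cluster prefix. This is a small sketch that assumes the output format shown above (one header line, names of the form <cluster>/<node>); the function name nodes_per_cluster is hypothetical.

```shell
# Count nodes per cluster from "cilium node list" output on stdin.
# Assumes a single header line and names shaped like <cluster>/<node>;
# rows without a "/" in the first column are skipped.
nodes_per_cluster() {
  awk 'NR > 1 && $1 ~ /\// { split($1, parts, "/"); count[parts[1]]++ }
       END { for (c in count) print c, count[c] }' | sort
}

# Usage (cilium-12345 is the placeholder pod name used in this post):
#   kubectl -n kube-system exec cilium-12345 cilium node list | nodes_per_cluster
```

For the listing above this would report six nodes each for cluster1 and cluster2.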

You can also see connectivity status by running kubectl -n kube-system exec -ti cilium-12345 cilium status

And of course, try running something like curl against a remote pod.
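When testing with curl it helps to confirm that the pod IP you are hitting really belongs to the other cluster. The sketch below maps a pod IP back to a cluster id, assuming the 100.<63 + cluster-id>.0.0/16 addressing scheme from the kops configuration earlier; the function name cluster_of_ip and the placeholders in the usage comment are hypothetical.

```shell
# Hypothetical helper: map a pod IP to its cluster id, assuming each
# cluster uses 100.<63 + cluster-id>.0.0/16.
cluster_of_ip() (
  ip="$1"
  case "$ip" in
    100.*.*.*) ;;
    *) echo "not in 100.64.0.0/10: $ip" >&2; exit 1 ;;
  esac
  rest="${ip#100.}"
  second="${rest%%.*}"
  # Reject empty, zero-prefixed or non-numeric second octets.
  case "$second" in
    ''|0*|*[!0-9]*) echo "bad second octet in $ip" >&2; exit 1 ;;
  esac
  if [ "$second" -lt 64 ] || [ "$second" -gt 127 ]; then
    echo "not in 100.64.0.0/10: $ip" >&2
    exit 1
  fi
  echo "$((second - 63))"
)

cluster_of_ip 100.65.130.12   # prints 2

# Then, from a pod in cluster 1 that has curl installed:
#   kubectl exec <local-pod> -- curl -s http://<remote-pod-ip>:<port>/
```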

Like what you read? Give Ole Markus With a round of applause.
