Connecting multiple Kubernetes clusters on vSphere with the Cilium Cluster Mesh

Venkat Srinivasan
FAUN — Developer Community 🐾
21 min read · Dec 5, 2019



Application deployment and high availability across multiple regions or localities have long been ubiquitous in enterprise setups. The underlying infrastructure has changed and grown over time, from bare-metal servers to virtual machines to container platforms like Kubernetes, but from an application standpoint the gist of the requirements remains the same.

One of the interesting challenges on the Kubernetes platform is setting up a multi-region cluster. While a single cluster can technically span nodes across regions as long as there is network connectivity, this setup is not recommended for various reasons, especially latency issues with etcd.

Managed container platform providers like GKE and AKS alleviate some of the issues of connecting two or more K8S clusters by providing the network infrastructure to communicate across zones and regions.

Here, I attempt to apply the same approach to a private cloud infrastructure, potentially distributed across multiple geographical regions.

Deployment Architecture

The general idea here is to -

  1. Set up multiple independent K8S clusters
  2. Install networking infrastructure that spans the clusters and makes pod IPs routable between them
  3. Install a highly available application across the clusters.

For setting up the K8S clusters I chose vSphere as my cloud platform. While this demo could be done using minikube or MicroK8s, a vSphere setup felt very close to what a real-world deployment might use.

There are multiple ways to set up Kubernetes on vSphere, either with kubeadm scripts or with tools like Rancher or Kublr, but I chose the new and upcoming Cluster API provider from Kubernetes. While still in alpha development, it worked very well. I also installed MetalLB as a load balancer provider for vSphere.

To connect the clusters at the pod level and create an overlay network, I used the Cilium network plugin, which also has a cluster mesh capability. Another way to do it might be with Istio, which I plan to try next and share in a future post.

A very good description of how Cilium cluster mesh works can be found here: Cilium Cluster Mesh. In a nutshell, Cilium exposes its etcd proxy endpoint via a load balancer in each cluster, and the agents in each cluster then read the other cluster's etcd state to learn its nodes and endpoints and set up the routing.

For the highly available application deployment, I chose CockroachDB. The CockroachDB documentation has a very good outline for setting up a multi-region cluster on GKE, and I decided to adapt it to my setup to test the same capabilities.

One thing to note here is that this demonstration only shows a functioning highly available application across clusters. It does not account for cross-region latency, or for whether data should be distributed this way without sharding. Latency can drive the ultimate application deployment design, so those considerations and optimizations must be incorporated before making this a production-ready solution.

Setup

There are a few prerequisites for this particular setup, namely

  • All nodes across the two clusters should be routable to each other.
  • The pod CIDRs across the two clusters should be unique.
  • Cilium needs to be installed using its managed etcd mode, which has its own set of requirements that must be met first.

Kubernetes Installation

I set up two clusters named east-1 and west-1 on vSphere. For testing purposes I used the same datacenter for both clusters, but this will work across data centers as well, as long as the nodes are routable.
The Cluster API for Kubernetes attempts to provide a universal and declarative way of installing Kubernetes clusters. For in-depth details of the workings of the cluster-api, check this set of sequence diagrams. The provided quick start for the cluster-api is also useful to familiarize ourselves with it. In summary, the cluster-api creates a single-node K8s management cluster which is then used to spawn the workload clusters.
For installation of our clusters on vSphere, follow these steps: Cluster API for vSphere. The K8s version used here is v1.15.5. There are a couple of minor tweaks we need to make before bootstrapping our clusters -

  1. We need to ensure that our pod CIDRs are unique across the clusters, so change the ‘envvars.txt’ file for each cluster as noted below.
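The envvars.txt embed isn't reproduced here, and its variable names depend on the cluster-api-provider-vsphere version in use, but the net effect is that the generated cluster.yaml for each cluster carries unique pod (and service) CIDR blocks. A sketch of the east-1 Cluster resource, with values chosen to match the CIDRs that show up later in this post (west-1 would use, e.g., 102.96.0.0/11 and 102.64.0.0/13):

# cluster.yaml excerpt for east-1 (sketch)
apiVersion: cluster.x-k8s.io/v1alpha2
kind: Cluster
metadata:
  name: k8s-cluster-east-1
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 101.96.0.0/11      # unique pod CIDR per cluster
    services:
      cidrBlocks:
      - 101.64.0.0/13      # unique service CIDR per cluster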

  2. After generating the yaml files, we tweak a few settings inside them -

  • One of the requirements for installing Cilium is enabling the Berkeley Packet Filter by mounting the BPF file system on each node. While this can be done after the Kubernetes installation, a better way is to make it part of the node bootstrapping configuration. Since the cluster-api drives kubeadm via cloud-init, we can pass this as part of the node boot process. So after generating the manifests for creating the cluster, edit controlplane.yaml and machinedeployment.yaml as follows to add the BPF mount commands to the preKubeadmCommands section in each file, e.g.
# controlplane.yaml
preKubeadmCommands:
- hostname "{{ ds.meta_data.hostname }}"
- echo "::1 ipv6-localhost ipv6-loopback" >/etc/hosts
- echo "127.0.0.1 localhost {{ ds.meta_data.hostname }}" >>/etc/hosts
- echo "{{ ds.meta_data.hostname }}" >/etc/hostname
- mount bpffs /sys/fs/bpf -t bpf
- echo "bpffs /sys/fs/bpf bpf defaults 0 0" >> /etc/fstab
  • Also edit the CPU/memory and the number of instances. In my case, the control plane is a single instance with 4 CPUs and 8G of memory, and I have 3 workers with 8 CPUs and 32G of memory each (a sketch of these settings follows this list).
  • Optionally add NTP settings for the cluster especially if the vSphere nodes don’t have NTP setup.
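For reference, here is a sketch of the kind of sizing edit made to the generated manifests; numCPUs and memoryMiB are the VSphereMachine fields in the cluster-api-provider-vsphere v1alpha2 API used here, so verify the field names against your provider version:

# controlplane.yaml excerpt (sketch)
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  name: k8s-cluster-east-1-controlplane-0
spec:
  numCPUs: 4        # single control plane instance: 4 CPUs
  memoryMiB: 8192   # 8G of memory

# machinedeployment.yaml excerpt (sketch): workers are sized via the machine template
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachineTemplate
metadata:
  name: k8s-cluster-east-1-md-0
spec:
  template:
    spec:
      numCPUs: 8        # 3 workers with 8 CPUs
      memoryMiB: 32768  # and 32G of memory each

The number of workers itself comes from replicas: 3 on the MachineDeployment in machinedeployment.yaml.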

The final set of configurations is as below

It is time to apply these configurations and set up all our clusters.

> kubectl apply -f out/k8s-cluster-east-1/cluster.yaml
cluster.cluster.x-k8s.io/k8s-cluster-east-1 created
vspherecluster.infrastructure.cluster.x-k8s.io/k8s-cluster-east-1 created

> kubectl get clusters
NAME                     PHASE
k8s-cluster-east-1       provisioned
k8s-management-cluster   provisioned

> kubectl apply -f out/k8s-cluster-east-1/controlplane.yaml
kubeadmconfig.bootstrap.cluster.x-k8s.io/k8s-cluster-east-1-controlplane-0 created
machine.cluster.x-k8s.io/k8s-cluster-east-1-controlplane-0 created
vspheremachine.infrastructure.cluster.x-k8s.io/k8s-cluster-east-1-controlplane-0 created

> kubectl get machines
NAME                                    PROVIDERID                                       PHASE
k8s-cluster-east-1-controlplane-0       vsphere://423bc5af-5ae9-d919-2173-2bb9fe410ac2   running
k8s-management-cluster-controlplane-0   vsphere://423b56a1-7bda-6535-3193-b85e254acaea   running

> kubectl apply -f out/k8s-cluster-east-1/machinedeployment.yaml
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/k8s-cluster-east-1-md-0 created
machinedeployment.cluster.x-k8s.io/k8s-cluster-east-1-md-0 created
vspheremachinetemplate.infrastructure.cluster.x-k8s.io/k8s-cluster-east-1-md-0 created

> kubectl get machines
NAME                                      PROVIDERID                                       PHASE
k8s-cluster-east-1-controlplane-0         vsphere://423bc5af-5ae9-d919-2173-2bb9fe410ac2   running
k8s-cluster-east-1-md-0-c94ccd8cf-f9blk   vsphere://423b476f-065c-aa41-bec2-14e0b642d02d   running
k8s-cluster-east-1-md-0-c94ccd8cf-gt469   vsphere://423b3765-964f-69b7-5172-b6649831cc29   running
k8s-cluster-east-1-md-0-c94ccd8cf-hfkgb   vsphere://423be238-0e8b-48ac-bf9e-cc29c66fabe7   running
k8s-management-cluster-controlplane-0     vsphere://423b56a1-7bda-6535-3193-b85e254acaea   running

## Repeat the same steps for the west-1 cluster and we have all the machines online.
> kubectl get machines --watch
NAME                                       PROVIDERID                                       PHASE
k8s-cluster-east-1-controlplane-0          vsphere://423bc5af-5ae9-d919-2173-2bb9fe410ac2   running
k8s-cluster-east-1-md-0-c94ccd8cf-f9blk    vsphere://423b476f-065c-aa41-bec2-14e0b642d02d   running
k8s-cluster-east-1-md-0-c94ccd8cf-gt469    vsphere://423b3765-964f-69b7-5172-b6649831cc29   running
k8s-cluster-east-1-md-0-c94ccd8cf-hfkgb    vsphere://423be238-0e8b-48ac-bf9e-cc29c66fabe7   running
k8s-cluster-west-1-controlplane-0          vsphere://423bd53f-e963-bb5e-be08-33cbcbc55ee2   running
k8s-cluster-west-1-md-0-5cc7968fbf-6mpvj   vsphere://423b4618-1978-09d0-60fe-e429a0e147c1   running
k8s-cluster-west-1-md-0-5cc7968fbf-dn68z   vsphere://423bc497-9bc0-83bf-0f39-da9f84b68e3f   running
k8s-cluster-west-1-md-0-5cc7968fbf-vsnjn   vsphere://423b9490-c352-a363-e9ce-25832d4bcd6a   running
k8s-management-cluster-controlplane-0      vsphere://423b56a1-7bda-6535-3193-b85e254acaea   running

Once the clusters are up and running, we can extract the kubeconfigs as

> kubectl get secret k8s-cluster-east-1-kubeconfig -o=jsonpath='{.data.value}' | { base64 -d 2>/dev/null || base64 -D; } > ~/.kube/k8s-cluster-east-1-kubeconfig
> kubectl get secret k8s-cluster-west-1-kubeconfig -o=jsonpath='{.data.value}' | { base64 -d 2>/dev/null || base64 -D; } > ~/.kube/k8s-cluster-west-1-kubeconfig

Both kubeconfig files contain the same user name key, so combining them for context switching doesn't work out of the box. For that reason, edit each file and make the user name, context name and current-context unique between the files. My two kubeconfig files look like this

k8s-cluster-east-1
k8s-cluster-west-1
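The two kubeconfig embeds aren't reproduced here; the relevant part is simply that the user and context names no longer collide. A minimal sketch of the east-1 file after editing (server and certificate data elided; kubeadm's default user name is typically kubernetes-admin):

apiVersion: v1
kind: Config
clusters:
- name: k8s-cluster-east-1
  cluster:
    server: https://<east-1-apiserver>:6443
    certificate-authority-data: <redacted>
contexts:
- name: k8s-cluster-east-1
  context:
    cluster: k8s-cluster-east-1
    user: k8s-cluster-east-1      # renamed so it is unique across the two files
current-context: k8s-cluster-east-1
users:
- name: k8s-cluster-east-1        # renamed to match the context above
  user:
    client-certificate-data: <redacted>
    client-key-data: <redacted>

The west-1 file mirrors this with k8s-cluster-west-1 names.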

Now combine them

>touch  ~/.kube/k8s-multi-kubeconfig
>export KUBECONFIG=~/.kube/k8s-multi-kubeconfig:~/.kube/k8s-cluster-west-1-kubeconfig:~/.kube/k8s-cluster-east-1-kubeconfig
>kubectl config get-contexts
CURRENT   NAME                 CLUSTER              AUTHINFO             NAMESPACE
*         k8s-cluster-east-1   k8s-cluster-east-1   k8s-cluster-east-1
          k8s-cluster-west-1   k8s-cluster-west-1   k8s-cluster-west-1

Let us check the nodes

> kubectl get nodes --context=k8s-cluster-east-1
NAME                                      STATUS     ROLES    AGE     VERSION
k8s-cluster-east-1-controlplane-0         NotReady   master   13m     v1.15.5
k8s-cluster-east-1-md-0-c94ccd8cf-f9blk   NotReady   <none>   5m3s    v1.15.5
k8s-cluster-east-1-md-0-c94ccd8cf-gt469   NotReady   <none>   5m3s    v1.15.5
k8s-cluster-east-1-md-0-c94ccd8cf-hfkgb   NotReady   <none>   5m4s    v1.15.5
> kubectl get nodes --context=k8s-cluster-west-1
NAME                                       STATUS     ROLES    AGE     VERSION
k8s-cluster-west-1-controlplane-0          NotReady   master   14m     v1.15.5
k8s-cluster-west-1-md-0-5cc7968fbf-6mpvj   NotReady   <none>   3m44s   v1.15.5
k8s-cluster-west-1-md-0-5cc7968fbf-dn68z   NotReady   <none>   3m44s   v1.15.5
k8s-cluster-west-1-md-0-5cc7968fbf-vsnjn   NotReady   <none>   3m44s   v1.15.5

We see that the nodes on both clusters are in NotReady state. This is expected, as the network plugin is not installed yet. Now we are ready to install Cilium and the cluster mesh.

Cilium Cluster Mesh Installation

To set up the Cilium cluster mesh, we first need to install Cilium itself using the managed etcd method. However, there is a small change for our vSphere clusters, which use containerd as the runtime instead of Docker. In the step that generates the yaml from the helm template, instead of the command noted in the documentation -

helm template cilium \
--namespace kube-system \
--set global.etcd.enabled=true \
--set global.etcd.managed=true \
> cilium.yaml

we add the option to change our container runtime to containerd and it becomes

>helm template cilium \
--namespace kube-system \
--set global.containerRuntime.integration=containerd \
--set global.etcd.enabled=true \
--set global.etcd.managed=true \
> cilium.yaml

The rest of the instructions remain the same as described in the Cilium documentation. Once we apply our configurations to both clusters, we should see all the Cilium agents up and running and our nodes in Ready state.

> kubectl get pods -n kube-system --context=k8s-cluster-east-1
NAME                                    READY   STATUS    RESTARTS   AGE
cilium-2fjqn                            1/1     Running   0          4m58s
cilium-59wzb                            1/1     Running   0          4m58s
cilium-etcd-j8452tfsc6                  1/1     Running   0          3m50s
cilium-etcd-jd7p5qntrw                  1/1     Running   0          3m10s
cilium-etcd-operator-84f8dc88d5-mc947   1/1     Running   0          4m58s
cilium-etcd-xjhk9mnqz7                  1/1     Running   0          4m22s
cilium-gvmfm                            1/1     Running   0          4m58s
cilium-nc8z4                            1/1     Running   0          4m58s
cilium-operator-79b9c5c9c9-48s5j        1/1     Running   0          4m58s
coredns-5c98db65d4-7zwdz                1/1     Running   0          49m
coredns-5c98db65d4-v76fz                1/1     Running   0          49m

> kubectl get pods -n kube-system --context=k8s-cluster-west-1
NAME                                    READY   STATUS    RESTARTS   AGE
cilium-5rhsj                            1/1     Running   0          5m5s
cilium-6gcx7                            1/1     Running   0          5m5s
cilium-etcd-6lqzz72ltx                  1/1     Running   0          3m50s
cilium-etcd-fcn57t64hh                  1/1     Running   0          4m28s
cilium-etcd-ff8j8wrgmx                  1/1     Running   0          3m26s
cilium-etcd-operator-84f8dc88d5-jsltz   1/1     Running   0          5m5s
cilium-mctf2                            1/1     Running   0          5m5s
cilium-nsbl4                            1/1     Running   0          5m5s
cilium-operator-79b9c5c9c9-8jtfv        1/1     Running   0          5m5s

> kubectl get nodes --context=k8s-cluster-east-1
NAME                                      STATUS   ROLES    AGE   VERSION
k8s-cluster-east-1-controlplane-0         Ready    master   49m   v1.15.5
k8s-cluster-east-1-md-0-c94ccd8cf-f9blk   Ready    <none>   40m   v1.15.5
k8s-cluster-east-1-md-0-c94ccd8cf-gt469   Ready    <none>   40m   v1.15.5
k8s-cluster-east-1-md-0-c94ccd8cf-hfkgb   Ready    <none>   40m   v1.15.5

> kubectl get nodes --context=k8s-cluster-west-1
NAME                                       STATUS   ROLES    AGE   VERSION
k8s-cluster-west-1-controlplane-0          Ready    master   30m   v1.15.5
k8s-cluster-west-1-md-0-5cc7968fbf-6mpvj   Ready    <none>   19m   v1.15.5
k8s-cluster-west-1-md-0-5cc7968fbf-dn68z   Ready    <none>   19m   v1.15.5
k8s-cluster-west-1-md-0-5cc7968fbf-vsnjn   Ready    <none>   19m   v1.15.5

Before we proceed to install the cluster mesh, we need to install MetalLB to provide LoadBalancer services on both clusters. This is pretty straightforward.

> kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.8.3/manifests/metallb.yaml --context=k8s-cluster-east-1
namespace/metallb-system created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
daemonset.apps/speaker created
deployment.apps/controller created
> kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.8.3/manifests/metallb.yaml --context=k8s-cluster-west-1
namespace/metallb-system created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
daemonset.apps/speaker created
deployment.apps/controller created

After installing the MetalLB manifests, configure MetalLB with the block of IPs to use for LoadBalancer-type services. My configs use the layer2 option and look like this for each cluster

metallb-east
metallb-west
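The two config embeds aren't reproduced here; a sketch of the east-1 layer2 configuration (the address range below is an assumption, picked so that it contains the LoadBalancer IPs that appear later in this post; west-1 gets its own non-overlapping range, e.g. 10.11.84.31-10.11.84.40):

# metallb-config-east.yaml (sketch)
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.11.84.20-10.11.84.30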
> kubectl apply -f metallb-config-east.yaml  --context=k8s-cluster-east-1
configmap/config created
> kubectl apply -f metallb-config-west.yaml --context=k8s-cluster-west-1
configmap/config created

Now we are ready to set up the cluster mesh. The instructions supplied at installing the cluster mesh work verbatim. For our use case, we use east-1 as the cluster name and "1" as the cluster-id for the "east" data center, and west-1 and "2" for the "west" data center.

> kubectl -n kube-system edit cm cilium-config --context=k8s-cluster-east-1
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  auto-direct-node-routes: "false"
  bpf-ct-global-any-max: "262144"
  bpf-ct-global-tcp-max: "524288"
  cluster-name: east-1
  cluster-id: "1"
  container-runtime: containerd

> kubectl -n kube-system edit cm cilium-config --context=k8s-cluster-west-1
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  auto-direct-node-routes: "false"
  bpf-ct-global-any-max: "262144"
  bpf-ct-global-tcp-max: "524288"
  cluster-name: west-1
  cluster-id: "2"
  container-runtime: containerd

The instructions also mention creating a LoadBalancer service in front of the etcd proxies, with examples specific to AWS and GKE, but the same yaml works for our setup as well. The GKE-specific annotation is simply ignored and can be deleted.
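That yaml boils down to a LoadBalancer service exposing the managed etcd pods on port 2379. A sketch of what it amounts to; the selector labels are an assumption based on the pods created by the cilium etcd operator, so verify them with kubectl get pods -n kube-system --show-labels before applying:

# cilium-external-etcd-service.yaml (sketch; check selector labels against your cilium-etcd pods)
apiVersion: v1
kind: Service
metadata:
  name: cilium-etcd-external
  namespace: kube-system
spec:
  type: LoadBalancer
  ports:
  - port: 2379
  selector:
    app: etcd
    etcd_cluster: cilium-etcd
    io.cilium/app: etcd-operator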

> kubectl apply -f  cilium-external-etcd-service.yaml  -n kube-system --context=k8s-cluster-east-1
service/cilium-etcd-external created
>kubectl apply -f cilium-external-etcd-service.yaml -n kube-system --context=k8s-cluster-west-1
service/cilium-etcd-external created

Verify that the etcd services have been exposed via a LoadBalancer

> kubectl get svc cilium-etcd-external -n kube-system --context=k8s-cluster-east-1
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
cilium-etcd-external   LoadBalancer   101.68.158.114   10.11.84.20   2379:30768/TCP   29s
> kubectl get svc cilium-etcd-external -n kube-system --context=k8s-cluster-west-1
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
cilium-etcd-external   LoadBalancer   102.66.228.135   10.11.84.31   2379:30906/TCP   25s

Follow the rest of the steps to create the cluster mesh config file and restart all the pods

> git clone https://github.com/cilium/clustermesh-tools.git
Cloning into 'clustermesh-tools'...
remote: Enumerating objects: 8, done.
remote: Total 8 (delta 0), reused 0 (delta 0), pack-reused 8
Unpacking objects: 100% (8/8), done.
> cd clustermesh-tools
clustermesh-tools> ./extract-etcd-secrets.sh
Derived cluster-name east-1 from present ConfigMap
====================================================
WARNING: The directory config contains private keys.
Delete after use.
====================================================
clustermesh-tools> kubectl config use-context k8s-cluster-west-1
Switched to context "k8s-cluster-west-1".
clustermesh-tools> ./extract-etcd-secrets.sh
Derived cluster-name west-1 from present ConfigMap
====================================================
WARNING: The directory config contains private keys.
Delete after use.
====================================================
clustermesh-tools> ./generate-secret-yaml.sh > clustermesh.yaml
clustermesh-tools> ./generate-name-mapping.sh > ds.patch
### Verify the ds.patch contains the LoadBalancer IPs for the etcd services on both clusters
clustermesh-tools> cat ds.patch
spec:
  template:
    spec:
      hostAliases:
      - ip: "10.11.84.20"
        hostnames:
        - east-1.mesh.cilium.io
      - ip: "10.11.84.31"
        hostnames:
        - west-1.mesh.cilium.io
### Apply the patch
clustermesh-tools> kubectl -n kube-system patch ds cilium -p "$(cat ds.patch)" --context=k8s-cluster-east-1
daemonset.extensions/cilium patched
clustermesh-tools> kubectl -n kube-system patch ds cilium -p "$(cat ds.patch)" --context=k8s-cluster-west-1
daemonset.extensions/cilium patched
## Apply the cluster mesh configuration
clustermesh-tools> kubectl -n kube-system apply -f clustermesh.yaml --context=k8s-cluster-east-1
secret/cilium-clustermesh created
clustermesh-tools> kubectl -n kube-system apply -f clustermesh.yaml --context=k8s-cluster-west-1
secret/cilium-clustermesh created
## Finally restart the pods to pick up the new configs
> kubectl -n kube-system delete pod -l k8s-app=cilium --context=k8s-cluster-east-1
pod "cilium-24b7d" deleted
pod "cilium-bl82g" deleted
pod "cilium-fcvph" deleted
pod "cilium-lrczs" deleted
> kubectl -n kube-system delete pod -l k8s-app=cilium --context=k8s-cluster-west-1
pod "cilium-kz8x8" deleted
pod "cilium-m5w5t" deleted
pod "cilium-qk4jz" deleted
pod "cilium-twx8b" deleted
> kubectl -n kube-system delete pod -l name=cilium-operator --context=k8s-cluster-east-1
pod "cilium-operator-79b9c5c9c9-48s5j" deleted
> kubectl -n kube-system delete pod -l name=cilium-operator --context=k8s-cluster-west-1
pod "cilium-operator-79b9c5c9c9-8jtfv" deleted

Verify the cluster mesh by dumping the node list from any Cilium pod. It should show all the nodes in both clusters.

> kubectl -n kube-system exec -ti cilium-8blgf cilium node list --context=k8s-cluster-east-1
Name                                              IPv4 Address   Endpoint CIDR   IPv6 Address   Endpoint CIDR
east-1/k8s-cluster-east-1-controlplane-0          10.11.87.185   101.96.0.0/24
east-1/k8s-cluster-east-1-md-0-c94ccd8cf-f9blk    10.11.87.184   101.96.3.0/24
east-1/k8s-cluster-east-1-md-0-c94ccd8cf-gt469    10.11.86.189   101.96.2.0/24
east-1/k8s-cluster-east-1-md-0-c94ccd8cf-hfkgb    10.11.86.178   101.96.1.0/24
k8s-cluster-east-1-controlplane-0                 10.11.87.185   101.96.0.0/24
k8s-cluster-east-1-md-0-c94ccd8cf-f9blk           10.11.87.184   101.96.3.0/24
k8s-cluster-east-1-md-0-c94ccd8cf-gt469           10.11.86.189   101.96.2.0/24
k8s-cluster-east-1-md-0-c94ccd8cf-hfkgb           10.11.86.178   101.96.1.0/24
west-1/k8s-cluster-west-1-controlplane-0          10.11.87.187   102.96.0.0/24
west-1/k8s-cluster-west-1-md-0-5cc7968fbf-6mpvj   10.11.87.164   102.96.1.0/24
west-1/k8s-cluster-west-1-md-0-5cc7968fbf-dn68z   10.11.87.165   102.96.3.0/24
west-1/k8s-cluster-west-1-md-0-5cc7968fbf-vsnjn   10.11.87.186   102.96.2.0/24

Once our mesh is set up, we can use it to deploy and test our cross-cluster application.

Setup CockroachDB Across Multiple Kubernetes Clusters

The official CockroachDB documentation outlines the steps for doing this on GKE here, and the same principles apply to our setup. In essence, this is how it works -

  • Since both clusters were installed with the same cluster domain (cluster.local), we colocate the resources to be clustered across multiple K8s clusters by using namespaces. The east-1 namespace in the east-1 cluster and the west-1 namespace in the west-1 cluster therefore contain the resources which are ‘cross-clustered’. Any number of namespaces can be used; the only catch is that the DNS configuration has to be kept up to date.
  • For DNS resolution of services across the clusters, the DNS servers of all the clusters are first chained together. This can be done in multiple ways. One is to use a CoreDNS plugin like kubernetai, which serves DNS from other Kubernetes clusters using their API server configuration. The other way is to simply expose the in-cluster CoreDNS servers via an external load balancer. In either case, once this is done, we modify the CoreDNS config maps to add stub domains that forward requests for the other cluster's domain to its load balancer, as will be detailed below.
  • While the above setup takes care of cross-cluster DNS resolution for the pods, the services themselves won't be reachable unless they are headless, so the resources to be clustered define their service with clusterIP: None. DNS lookups then resolve directly to the pod IPs (usually the first IP in the list). This is also agnostic to pod IPs changing due to restarts, since the DNS name of the service always remains the same.
  • There is also a service of type ExternalName created in the default namespace of each cluster which points to the namespace-scoped service in that cluster. This allows clients in the default namespace to connect to the cross-cluster resources as well.
DNS resolution of cross-cluster pods
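The last point is easiest to see concretely. The external-name-svc.yaml template downloaded below reduces to roughly this for the east-1 cluster once the zone placeholder is substituted (a sketch; compare with the kubectl get svc output further down):

# cockroachdb-public ExternalName service in the default namespace of east-1 (sketch)
apiVersion: v1
kind: Service
metadata:
  name: cockroachdb-public
  labels:
    app: cockroachdb
spec:
  type: ExternalName
  externalName: cockroachdb-public.east-1.svc.cluster.local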

The installation process for CockroachDB is mostly the same as outlined here from step 2 onwards, except that the provided Python script expects kube-dns rather than CoreDNS. Instead of making a quick fix to the script, I decided to walk through the steps manually so we get a better understanding of how things work.

First, download all the multi-region configuration files for CockroachDB

>curl -OOOOOOOOO https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/multiregion/{README.md,client-secure.yaml,cluster-init-secure.yaml,cockroachdb-statefulset-secure.yaml,dns-lb.yaml,example-app-secure.yaml,external-name-svc.yaml,setup.py,teardown.py}

Create the two namespaces, one on each cluster, which will contain our distributed services

> kubectl create namespace east-1 --context=k8s-cluster-east-1
namespace/east-1 created
> kubectl create namespace west-1 --context=k8s-cluster-west-1
namespace/west-1 created

Create the storage classes on each cluster and make them default. Our storage class definition looks like this

Storage Class Definition
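The storage class embed isn't reproduced here; a minimal sketch using the in-tree vSphere provisioner (the thin-disk name matches the patch commands that follow):

# storageclass.yaml (sketch)
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: thin-disk
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin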
> kubectl apply -f storageclass.yaml --context=k8s-cluster-east-1
> kubectl apply -f storageclass.yaml --context=k8s-cluster-west-1
> kubectl patch storageclass thin-disk -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' --context=k8s-cluster-east-1
> kubectl patch storageclass thin-disk -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' --context=k8s-cluster-west-1

Now, to set up DNS chaining between the two clusters, we first expose the cluster DNS in each cluster via a LoadBalancer IP. For this, apply the dns-lb.yaml from the CockroachDB files downloaded in the first step and note down the external IP of each service. The dns-lb.yaml looks like this
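The embed isn't reproduced here; roughly, dns-lb.yaml is just a LoadBalancer service in front of the cluster DNS pods (a sketch; verify against the downloaded file):

# dns-lb.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-lb
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    k8s-app: kube-dns
  ports:
  - name: dns
    port: 53
    protocol: UDP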

> kubectl apply -f dns-lb.yaml --context=k8s-cluster-east-1
> kubectl apply -f dns-lb.yaml --context=k8s-cluster-west-1
> kubectl get svc kube-dns-lb -n kube-system --context=k8s-cluster-east-1
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)
kube-dns-lb   LoadBalancer   101.66.234.68   10.11.84.21   53:31312/UDP
> kubectl get svc kube-dns-lb -n kube-system --context=k8s-cluster-west-1
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)
kube-dns-lb   LoadBalancer   102.70.23.188   10.11.84.32   53:30855/UDP

We now have the west-1 load balancer IP as 10.11.84.32 and the east-1 IP as 10.11.84.21. We use these values to update the CoreDNS configmaps. We edit the east-1 configmap for CoreDNS so it looks like this

kubectl edit cm  coredns -n kube-system --context=k8s-cluster-east-1
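The full edited ConfigMap isn't embedded here. Assuming the standard CoreDNS forward plugin, the idea is to append a server block to the Corefile that forwards the west-1 service domain to the west-1 DNS load balancer, something like:

    # appended to the Corefile in the east-1 coredns ConfigMap (sketch)
    west-1.svc.cluster.local:53 {
        errors
        cache 30
        forward . 10.11.84.32
    }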

And in reverse the west-1 configmap for the coredns looks like

kubectl edit cm  coredns -n kube-system --context=k8s-cluster-west-1
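The mirror-image block on west-1 forwards the east-1 service domain to the east-1 DNS load balancer:

    # appended to the Corefile in the west-1 coredns ConfigMap (sketch)
    east-1.svc.cluster.local:53 {
        errors
        cache 30
        forward . 10.11.84.21
    }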

Reload the coredns configuration on each cluster

>kubectl delete pod -n kube-system -l=k8s-app=kube-dns --context=k8s-cluster-east-1
>kubectl delete pod -n kube-system -l=k8s-app=kube-dns --context=k8s-cluster-west-1

Let’s do a test. We create a headless nginx service on the west-1 cluster and try to look it up from the east-1 cluster.
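The test-nginx.yaml isn't reproduced above; any headless service will do. A minimal sketch (note clusterIP: None; the image tag is arbitrary):

# test-nginx.yaml (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.17
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None    # headless, so DNS resolves directly to the pod IPs
  selector:
    app: nginx
  ports:
  - port: 80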

> kubectl apply -f test-nginx.yaml -n west-1 --context=k8s-cluster-west-1
deployment.apps/nginx created
service/nginx created
> kubectl run -it --rm --image=tianon/network-toolbox debian --context=k8s-cluster-east-1
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
root@debian-6cdfdbf577-rt72r:/# nslookup nginx.west-1
Server: 101.64.0.10
Address: 101.64.0.10#53
Name: nginx.west-1.svc.cluster.local
Address: 102.96.1.5

As we can see, we can reach the headless nginx service on west-1 from a pod on east-1 via DNS chaining between the two CoreDNS servers.
Next, we install the cockroach binary on our workstation as outlined in Install CockroachDB. On a Mac, in my case

> brew install cockroachdb/tap/cockroach

Create the CockroachDB CA and client certificates, and the appropriate secrets containing the node and client certificates, in all the clusters. The client certificate secret is created both in the region/zone namespace and in the default namespace for ease of access to the services.

> mkdir certs
> mkdir ca-keys
> cockroach cert create-ca --certs-dir certs --ca-key ca-keys/ca.key
> cockroach cert create-client root --certs-dir certs --ca-key ca-keys/ca.key
> kubectl create secret generic cockroachdb.client.root --from-file certs --context=k8s-cluster-east-1
> kubectl create secret generic cockroachdb.client.root --from-file certs --context=k8s-cluster-west-1
> kubectl create secret generic cockroachdb.client.root --from-file certs -n west-1 --context=k8s-cluster-west-1
> kubectl create secret generic cockroachdb.client.root --from-file certs -n east-1 --context=k8s-cluster-east-1
#### Create the node certs and secrets for each region/zone
> cockroach cert create-node --certs-dir certs --ca-key ca-keys/ca.key localhost 127.0.0.1 cockroachdb-public cockroachdb-public.default cockroachdb-public.west-1 cockroachdb-public.west-1.svc.cluster.local *.cockroachdb *.cockroachdb.west-1 *.cockroachdb.west-1.svc.cluster.local
> kubectl create secret generic cockroachdb.node -n west-1 --from-file certs --context=k8s-cluster-west-1
> rm -rf certs/node.*
> cockroach cert create-node --certs-dir certs --ca-key ca-keys/ca.key localhost 127.0.0.1 cockroachdb-public cockroachdb-public.default cockroachdb-public.east-1 cockroachdb-public.east-1.svc.cluster.local *.cockroachdb *.cockroachdb.east-1 *.cockroachdb.east-1.svc.cluster.local
> kubectl create secret generic cockroachdb.node -n east-1 --from-file certs --context=k8s-cluster-east-1

We create the two ExternalName services in the default namespace of each zone. Once created, you can verify that the service in the default namespace points to the one in the region namespace.

> sed 's/YOUR_ZONE_HERE/east-1/g' external-name-svc.yaml > external-name-svc-east-1.yaml
> kubectl apply -f external-name-svc-east-1.yaml --context=k8s-cluster-east-1
> sed 's/YOUR_ZONE_HERE/west-1/g' external-name-svc.yaml > external-name-svc-west-1.yaml
> kubectl apply -f external-name-svc-west-1.yaml --context=k8s-cluster-west-1
## Verify
> kubectl get svc cockroachdb-public --context=k8s-cluster-east-1
NAME                 TYPE           CLUSTER-IP   EXTERNAL-IP                                   PORT(S)   AGE
cockroachdb-public   ExternalName   <none>       cockroachdb-public.east-1.svc.cluster.local   <none>    26h
> kubectl get svc cockroachdb-public --context=k8s-cluster-west-1
NAME                 TYPE           CLUSTER-IP   EXTERNAL-IP                                   PORT(S)   AGE
cockroachdb-public   ExternalName   <none>       cockroachdb-public.west-1.svc.cluster.local   <none>    26h

We are finally ready to deploy our CockroachDB statefulsets on each cluster. First we generate the join-list string of all the CockroachDB instances/pods and apply it, along with the locality, to the provided template file.

> JOINSTR="cockroachdb-0.cockroachdb.east-1,cockroachdb-1.cockroachdb.east-1,cockroachdb-2.cockroachdb.east-1,cockroachdb-0.cockroachdb.west-1,cockroachdb-1.cockroachdb.west-1,cockroachdb-2.cockroachdb.west-1"
> sed 's/JOINLIST/'"${JOINSTR}"'/g;s/LOCALITYLIST/zone=east-1/g' cockroachdb-statefulset-secure.yaml > cockroachdb-statefulset-secure-east-1.yaml
> sed 's/JOINLIST/'"${JOINSTR}"'/g;s/LOCALITYLIST/zone=west-1/g' cockroachdb-statefulset-secure.yaml > cockroachdb-statefulset-secure-west-1.yaml

Now we deploy our statefulsets in each cluster.

> kubectl apply -f cockroachdb-statefulset-secure-east-1.yaml -n east-1 --context=k8s-cluster-east-1
serviceaccount/cockroachdb created
role.rbac.authorization.k8s.io/cockroachdb created
clusterrole.rbac.authorization.k8s.io/cockroachdb created
rolebinding.rbac.authorization.k8s.io/cockroachdb created
clusterrolebinding.rbac.authorization.k8s.io/cockroachdb created
service/cockroachdb-public created
service/cockroachdb created
poddisruptionbudget.policy/cockroachdb-budget created
statefulset.apps/cockroachdb created
> kubectl apply -f cockroachdb-statefulset-secure-west-1.yaml -n west-1 --context=k8s-cluster-west-1
serviceaccount/cockroachdb created
role.rbac.authorization.k8s.io/cockroachdb created
clusterrole.rbac.authorization.k8s.io/cockroachdb created
rolebinding.rbac.authorization.k8s.io/cockroachdb created
clusterrolebinding.rbac.authorization.k8s.io/cockroachdb created
service/cockroachdb-public created
service/cockroachdb created
poddisruptionbudget.policy/cockroachdb-budget created
statefulset.apps/cockroachdb created
### wait for the pods to be in running state
> kubectl get pods -n east-1 --context=k8s-cluster-east-1
NAME            READY   STATUS    RESTARTS   AGE
cockroachdb-0   0/1     Running   0          42s
cockroachdb-1   0/1     Running   0          42s
cockroachdb-2   0/1     Running   0          42s
> kubectl get pods -n west-1 --context=k8s-cluster-west-1
NAME            READY   STATUS    RESTARTS   AGE
cockroachdb-0   0/1     Running   0          38s
cockroachdb-1   0/1     Running   0          38s
cockroachdb-2   0/1     Running   0          38s

Once all the pods across the clusters are in running state, we need to initialize the CockroachDB cluster to move them into the Ready state. This can be done from any of our ‘regions’.

> kubectl apply -f cluster-init-secure.yaml -n east-1 --context=k8s-cluster-east-1
job.batch/cluster-init-secure created
## All pods should reach ready state
> kubectl get pods -n east-1 --context=k8s-cluster-east-1 --watch
NAME                        READY   STATUS      RESTARTS   AGE
cluster-init-secure-hlg5j   0/1     Completed   0          24s
cockroachdb-0               1/1     Running     0          3m14s
cockroachdb-1               1/1     Running     0          3m14s
cockroachdb-2               1/1     Running     0          3m14s
> kubectl get pods -n west-1 --context=k8s-cluster-west-1 --watch
NAME            READY   STATUS    RESTARTS   AGE
cockroachdb-0   1/1     Running   0          3m18s
cockroachdb-1   1/1     Running   0          3m18s
cockroachdb-2   1/1     Running   0          3m18s

Now verify the cluster by creating some test data, as described here: Test Cluster Setup.
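In short, that test comes down to starting the secure client pod from the downloaded client-secure.yaml and writing a little data; a sketch, assuming the pod name and certificate path used by the CockroachDB multi-region configs (verify both against the downloaded yaml):

> kubectl create -f client-secure.yaml -n east-1 --context=k8s-cluster-east-1
> kubectl exec -it cockroachdb-client-secure -n east-1 --context=k8s-cluster-east-1 -- ./cockroach sql --certs-dir=/cockroach-certs --host=cockroachdb-public
CREATE DATABASE bank;
CREATE TABLE bank.accounts (id INT PRIMARY KEY, balance DECIMAL);
INSERT INTO bank.accounts VALUES (1, 1000.50);
SELECT * FROM bank.accounts;

The same rows should then be visible from a client pod in the west-1 cluster.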

We can also check via the cockroachdb web console that pods across the kubernetes clusters form a unified cockroachdb cluster.

CockroachDB Cross Region Cluster

You can simulate a data center failure by scaling the replicas to zero in either cluster. However, the number of node failures CockroachDB can tolerate is (replication factor - 1)/2, which with the default replication factor of 3 is a single node. So ensure you have enough nodes, or add a third region to build a three-region cluster, before simulating a data center failure.

Conclusions and other open items

There are many possible solutions for the multi-cluster Kubernetes deployment of applications, and we just looked at one possible way. Other options include the service-mesh approach with Istio mentioned earlier, as well as the federation projects discussed below.

All these solutions are, of course, orthogonal to the actual application performance and functioning. The latency between clusters and its impact on the applications can only be determined on a case-by-case basis, e.g. when using consensus-based protocols like Raft, which etcd and databases like CockroachDB use. For stateless applications this is relatively simple, but for applications interacting across the clusters, the impact of latency on performance should be considered before committing to a production deployment.

What about federation?

While connecting clusters across regions is addressed by some of the solutions mentioned above, deploying or federating applications and resources across clusters is an additional topic for discussion. We deployed our CockroachDB application in each cluster independently and natively, but it would be ideal if we could deploy it to a single cluster and federate it across the others.

This has been a rapidly changing field of solutions, with Kubefed2 being the kubernetes-sigs project in active development. There are other options, e.g. Multicluster-scheduler from admiralty.io. We will take a deeper look at these options in a future post; almost all of them have their own little quirks, and federating stateful applications and custom operators is still a challenge in this setup, especially because the underlying ‘kind’s are themselves not multi-cluster aware.

