[K8s] ETCD — DR Solution

Abdullah Alruwayti
6 min read · Mar 24, 2024


Introduction

In this article I will discuss a solution to a problem faced by people running a Kubernetes cluster across two zones for disaster recovery purposes (a stretched cluster). I will focus on the Kubernetes component ETCD: why you might run a stretched cluster, what the issues with ETCD in stretched clusters are, and the design of the solution.

What is ETCD?

Before we delve into the DR solution, I'd like to note that this article assumes you already know what purpose ETCD serves in Kubernetes.

If not, to put it into a single sentence: ETCD is the distributed key-value store Kubernetes uses to hold all of its cluster state.

What are stretched clusters?

Long story short: stretched K8s clusters contain nodes from two different zones.

Now, if you're here for the long story: say your cluster is required to have a DR solution in case of disasters, where you would usually implement a second cluster in a different zone/city subject to specific parameters, e.g. the data centers have to be 100 km apart. You also have ArgoCD and Rancher installed in your cluster.

The default thinking upon reading the above would be to create a new cluster, replicate the installation, and run two separate clusters. You would be correct. However, once you take into account that you might have 100+ projects, with new projects added daily, this might not seem like a valid option, depending on how you add projects to ArgoCD and Rancher. If you use pipelines, it could be as simple as adding a step that also creates each project in the second cluster.

However, what happens when people want to change things? Permissions require changing, repositories get moved, and so on. Especially if you have a big team to work with, you might lose track of the changes made in the main environment and be unable to replicate them in the secondary one. Or I might be completely wrong, but either way you might find that maintaining two separate clusters, each with its own Rancher/ArgoCD, is annoying enough that this solution is not valid for you.

That's where a stretched cluster comes in: it runs a single instance of these tools, replicated across both environments. In my case, I had no choice but to work with what I was given, and some of you reading this might be put in the same spot as me.

What’s the issue?

You start seeing the issue when you think about the master nodes. Okay, I have two zones, and both of them should have master nodes, but how would I divide the masters between the two zones? Three masters in Zone 1 and one in Zone 2? Three in Zone 1 and two in Zone 2? Why not an equal number of masters in both zones? This is where the quorum issue comes in.

Before getting into it, let's first cover what a quorum is with respect to the master nodes.

As you know, the best practice is to have an odd number of master nodes, but why? This is due to the ETCD quorum: a cluster of n members needs a majority, floor(n/2) + 1 members (which for odd n equals (n+1)/2), to be healthy in order to keep serving writes.

ETCD Quorum source: etcd.io

The aim of having an odd number of master nodes (etcd members) is to maximize the failure tolerance of your cluster: how many master nodes can go down without taking the whole cluster with them?
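As a quick illustration of that formula (quorum is floor(n/2) + 1, and failure tolerance is whatever is left over), here is a small loop you can run:

```shell
# Quorum = floor(n/2) + 1; failure tolerance = n - quorum.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerance=$(( n - quorum ))"
done
```

Notice that four members tolerate no more failures than three, which is exactly why odd cluster sizes are preferred.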

After explaining the quorum, some of you might start seeing where the issue seeps in:

Two zones, odd number of master nodes.

The point of having a stretched cluster is being highly available: if one zone goes down, we won't be affected, because we have another zone. But clearly that won't work according to the ETCD quorum, and we do need an ETCD cluster. Now, we could run an external ETCD, but that's a whole other conversation: where would we put the external cluster? The same zone? A stretched etcd cluster? Same issue, different context. No matter how you look at it, there isn't a workaround for this, OR IS THERE??

The solution

The way K8s works with etcd is that when etcd is down, you can't manage your cluster, since the API server uses etcd to retrieve cluster state. That doesn't mean my applications within the cluster go down; it just means that the state of every application will stop updating. Only the functionalities of K8s are affected, not containerd. So ETCD can go down, but only for a limited time. I needed a way to implement an automation script for this to work: once ETCD goes down, it can carry out the disaster-recovery procedure suggested by etcd.io.

Now, before we look at how I went about this approach, let's assume we are working with three masters: two in Zone 1 and one in Zone 2.

The rough flow of how that would work is as follows:

When Zone 1 goes down, I use a snapshot restore to bring etcd back with the initial cluster set to only the single master node in Zone 2. I also change the cluster token (I will explain why further on). That lets etcd run with the same data it had before, but as a new etcd cluster with only one member: the master in Zone 2.
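A sketch of what that restore might look like, assuming a periodic snapshot at /backup/etcd-snapshot.db and a Zone 2 master named master-z2 at 10.0.2.10 (all names, paths, and addresses here are illustrative, not from the original setup):

```shell
# Restore the snapshot as a brand-new single-member cluster in Zone 2.
# The new --initial-cluster-token gives the restored cluster a new
# cluster ID, which is what later makes Zone 1 members get rejected.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name master-z2 \
  --data-dir /var/lib/etcd \
  --initial-cluster master-z2=https://10.0.2.10:2380 \
  --initial-advertise-peer-urls https://10.0.2.10:2380 \
  --initial-cluster-token dr-cluster-1
```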

Okay, we have the ETCD cluster working again in one zone, but what happens when Zone 1 comes back up?

When Zone 1 comes back up, the first thing you will observe is that its members try to talk to the master in Zone 2 A LOT, but fail, because Zone 2 rejects them due to the different cluster IDs. What you end up with is two etcd clusters, one in each zone, with Zone 2 being the up-to-date one. You will also notice that querying the kube API server more than once may return two different responses, because the API server may be querying either one of those clusters.
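One way to confirm the split, assuming etcdctl with the usual kubeadm certificate paths (adjust the endpoints and cert paths to your environment):

```shell
# Query every former member; the "cluster_id" field in the JSON header
# will differ between Zone 1 endpoints and the restored Zone 2 endpoint.
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.2.10:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  -w json
```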

Okay, I have two ETCD clusters. What now?

Since the up-to-date data is in Zone 2, you would want to completely wipe the etcd data in Zone 1, and then add the Zone 1 nodes as new members of the Zone 2 cluster. Now, we all know most things are easier said than done, and that applies in this case.

Due to the way ETCD is configured in most cases, there is an environment file that etcd reads on startup (in my case /etc/etcd.env); check your etcd systemd unit to see where it gets its configuration from. In some cases this file might not exist, and the flags are instead passed on the etcd command line in the systemd unit, e.g.

etcd --data-dir=/var/lib/etcd --initial-cluster=master-1=https://x.x.x.x:2380
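For reference, a minimal environment file might look like the following (the member name, IPs, and token are illustrative values, not taken from the original environment):

```shell
# /etc/etcd.env: illustrative values only
ETCD_NAME=master-z2
ETCD_DATA_DIR=/var/lib/etcd
ETCD_INITIAL_CLUSTER=master-z2=https://10.0.2.10:2380
ETCD_INITIAL_CLUSTER_TOKEN=dr-cluster-1
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_LISTEN_PEER_URLS=https://10.0.2.10:2380
ETCD_LISTEN_CLIENT_URLS=https://10.0.2.10:2379,https://127.0.0.1:2379
```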

Now the flow to how this can be achieved is as follows:

  1. Use etcdctl member add to add one master; the output will include a line stating the new initial cluster, ETCD_INITIAL_CLUSTER. Save that somewhere.
  2. Then proceed to wipe all files inside your etcd data directory (default /var/lib/etcd) on the Zone 1 node.
  3. Now modify the etcd environment file, replacing ETCD_INITIAL_CLUSTER with the saved value and setting ETCD_INITIAL_CLUSTER_TOKEN to the token you used when creating the Zone 2 cluster.
  4. Start the etcd server.
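The steps above might be sketched, per Zone 1 node, roughly as follows (the member name, IP, and paths are assumptions for illustration):

```shell
# Step 1: run from a healthy Zone 2 member to register the returning node.
ETCDCTL_API=3 etcdctl member add master-z1a \
  --peer-urls=https://10.0.1.10:2380
# The command prints an ETCD_INITIAL_CLUSTER=... line; save it.

# Steps 2-4: run on the Zone 1 node itself.
systemctl stop etcd
rm -rf /var/lib/etcd/*        # wipe the stale Zone 1 data
# Edit the env file: paste the saved ETCD_INITIAL_CLUSTER, set
# ETCD_INITIAL_CLUSTER_TOKEN to the Zone 2 token, and make sure
# ETCD_INITIAL_CLUSTER_STATE=existing, then:
systemctl start etcd
```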

Once you add all the members from Zone 1 to Zone 2, make sure to go back through each environment file and update the ETCD_INITIAL_CLUSTER to include the final list of all etcd members.

Well hot diggity, you are now done.

Workflow of implementation

Below you will find the flowchart for the script I created. It is run as ./script.sh initiate_dr or ./script.sh recover. The two subcommands separate deleting the data in Zone 1 from creating the single-member cluster in Zone 2, because we use an external method to detect whether DR should be activated, and that method determines when each command is run.

Therefore:

  1. If Zone 1 goes down, run initiate_dr to create a single etcd cluster in Zone 2
  2. If Zone 1 is back up and running after initiate_dr, run recover to join Zone 1 back to Zone 2
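As a rough, hypothetical skeleton of the script's entry point (the function bodies here are placeholders; the real work is the etcd restore and rejoin procedure described earlier):

```shell
#!/usr/bin/env bash
# Hypothetical skeleton of script.sh: dispatch between the two DR modes.
set -euo pipefail

initiate_dr() {
  # Placeholder: snapshot-restore etcd on the Zone 2 master with a new
  # cluster token, forming a single-member cluster.
  echo "initiate_dr: Zone 2 now runs a single-member etcd cluster"
}

recover() {
  # Placeholder: wipe Zone 1 data dirs, add each node back with
  # 'etcdctl member add', update the env files, and restart etcd.
  echo "recover: Zone 1 members rejoined the Zone 2 cluster"
}

# Dispatch only when an argument is given, so the functions can also
# be sourced and tested individually.
if [ "$#" -gt 0 ]; then
  case "$1" in
    initiate_dr) initiate_dr ;;
    recover)     recover ;;
    *) echo "usage: $0 {initiate_dr|recover}" >&2; exit 1 ;;
  esac
fi
```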
Flowchart to bash script Source: Me
