Bootstrapping etcd3 with Consul

For use with Kubernetes

We will use DNS SRV records and a bit of cyclic logic to bootstrap our cluster. Unfortunately, ordering is a bit of a hassle, so we will rely on the kubelet’s ability to constantly restart pods. Here is a high-level overview of the ordering.

  1. Consul Server cluster is started
  2. etcd3 sidekick pods register a healthy etcd3 service for our particular cluster.
  3. etcd3 starts up and queries Consul for the etcd3 service.

Because our Consul SRV DNS response must contain the actual ports of etcd3, we must register the real etcd3 ports and not the sidekick’s ports. However, etcd3 won’t start up unless it finds a healthy service, so we would be stuck in a cyclic loop of unregistered etcd3 containers. The ideal sequence of events would be:

  1. etcd3 container starts
  2. Service is deemed healthy, SRV records are populated
  3. etcd3 continues to bootstrap based on SRV records

In short, we are moving step 2 above step 1.

If the Consul cluster is unreachable at boot, both the registration pod and the etcd3 pod will continue to restart. Ideally, this means our infrastructure will converge rather than simply fail.

The Registration Sidekick

The sidekick has one important job: create a healthy service registration for etcd3. Later, we could update this service to include a health check once the sidekick is sure the cluster is healthy. If the etcd3 node were to become unhealthy again, the sidekick would have to replace the unhealthy record with a healthy one.

The Service registration will look like the following:

{
  "ID": "<node-hostname>-<tag-type>",
  "Name": "etcd3",
  "Service": "etcd-registration",
  "Tags": [
    "_etcd-<tag-type>._tcp.<your-cluster-name>"
  ],
  "Address": "<node-IP>",
  "Port": 2379
}

This registration should be sent to a Consul agent running locally: localhost:8500/v1/agent/service/register.

Our hostname must be unique. In AWS, we would have something like ip-1-2-3-4, which would suffice. <tag-type> is either server or client.
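As a rough illustration, the registration above could be built and sent with a few lines of Python. This is only a sketch, not the scripts used in this project: the function names build_registration and register are hypothetical, the payload simply mirrors the fields shown above, and the default port of 2379 is taken from that example. It assumes the local agent accepts an HTTP PUT on the register endpoint.

```python
import json
import urllib.request


def build_registration(hostname, tag_type, cluster_name, node_ip, port=2379):
    """Build the registration body shown above for one etcd3 node.

    The function name and arguments are hypothetical; the fields mirror
    the example payload from the article as-is.
    """
    if tag_type not in ("server", "client"):
        raise ValueError("tag_type must be 'server' or 'client'")
    return {
        "ID": f"{hostname}-{tag_type}",
        "Name": "etcd3",
        "Service": "etcd-registration",
        "Tags": [f"_etcd-{tag_type}._tcp.{cluster_name}"],
        "Address": node_ip,
        "Port": port,
    }


def register(payload, agent="http://localhost:8500"):
    """PUT the registration to the local Consul agent and return the HTTP status.

    Any connection or HTTP error is raised to the caller, so a sidekick
    built on this would exit non-zero and be restarted by the kubelet.
    """
    request = urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A sidekick would call something like register(build_registration("ip-1-2-3-4", "server", "mycluster", "172.17.8.101")) and let any exception crash the process, which fits the restart-until-converged approach described earlier.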

Consul has the following naming scheme for DNS requests: [tag.]<service>.service[.datacenter].<domain>. Because we may have many etcd3 clusters per Consul Datacenter, we will need to embed the cluster name into the tag. Also, etcd3 expects a particular prefix for SRV records. Luckily, we can prefix our tag with this and we will be set. We will be left with the following dig request to find our services:

$ dig +noall +answer @172.17.8.50 -p 8600 _etcd-server._tcp.mycluster.etcd3.service.dc1.consul.
_etcd-server._tcp.mycluster.etcd3.service.dc1.consul. 0 IN A 172.17.8.101
_etcd-server._tcp.mycluster.etcd3.service.dc1.consul. 0 IN A 172.17.8.102
_etcd-server._tcp.mycluster.etcd3.service.dc1.consul. 0 IN A 172.17.8.103
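To make the naming scheme concrete, here is a small Python sketch that assembles the name queried above from its parts. The helper name consul_dns_name is hypothetical, and the defaults (etcd3, dc1, consul) are simply the values from the dig example.

```python
def consul_dns_name(tag_type, cluster_name, service="etcd3",
                    datacenter="dc1", domain="consul"):
    """Assemble [tag.]<service>.service[.datacenter].<domain> for an etcd3 cluster.

    The tag carries both the SRV prefix etcd3 looks for and the cluster name,
    so several etcd3 clusters can share one Consul datacenter.
    """
    tag = f"_etcd-{tag_type}._tcp.{cluster_name}"
    # The trailing dot makes the name fully qualified, as in the dig example.
    return f"{tag}.{service}.service.{datacenter}.{domain}."


print(consul_dns_name("server", "mycluster"))
```

With the defaults, this prints the same name that was passed to dig above.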

So if we can register our etcd3 services whenever we want and start the etcd3 container, the containers will eventually discover each other and reach consensus, right? Wrong!

The Readiness Sidekick

If you read this awesome blog article about etcd3 clustering in AWS, you’ll find that it mentions what happens when you start the cluster with only 2 nodes, along with the error messages you’ll receive. While working on this project, I found that one of my Vagrant machines was starting up just a bit earlier than the others. It was the first to register its IP with DNS and the first to start etcd3. When this instance queried DNS for other peers, it would find only itself and become the leader. The other etcd3 containers would start soon after, also with the --initial-cluster-state=new option set. As a result, they would be advertised to the first node, but it had already reached consensus on its own and wasn’t expecting anyone to join. As a consequence, you’ll see mismatching cluster IDs and end up with 3 nodes: 1 with an etcd3 cluster of size 1 and 2 confused nodes (likely, one of size 2 and one of size 3).

One solution is to ensure there are 3 DNS records before starting the etcd3 container. We can simply run a dig request in a loop and wait until there are 3 responses. Afterwards, we can move the etcd3 pod manifest to the kubelet’s pod manifest directory and the kubelet will take it from there.
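The waiting logic could look roughly like the Python sketch below. It is an illustration under assumptions rather than the actual sidekick: all function names are hypothetical, the DNS server address and port are the ones from the earlier dig example, and /etc/kubernetes/manifests is only a common choice for the kubelet’s pod manifest directory, so adjust it to your setup.

```python
import shutil
import subprocess
import time


def count_records(dig_output):
    """Count answer lines in `dig +short` output, skipping blanks and ';;' notices."""
    return sum(
        1
        for line in dig_output.splitlines()
        if line.strip() and not line.lstrip().startswith(";")
    )


def dig_short(name, server="172.17.8.50", port=8600):
    """Run `dig +short` against Consul DNS and return its raw stdout."""
    result = subprocess.run(
        ["dig", "+short", f"@{server}", "-p", str(port), name],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def wait_for_peers(name, expected=3, interval=2.0,
                   lookup=dig_short, sleep=time.sleep):
    """Block until at least `expected` records are returned for `name`."""
    while True:
        found = count_records(lookup(name))
        if found >= expected:
            return found
        sleep(interval)


def release_manifest(staged_path, manifest_dir="/etc/kubernetes/manifests"):
    """Move the staged etcd3 pod manifest into the kubelet's manifest directory.

    The default directory is an assumption, not something the article specifies.
    """
    return shutil.move(staged_path, manifest_dir)
```

The sidekick would call wait_for_peers with the cluster’s DNS name and then release_manifest with the path of the staged etcd3 manifest. The lookup and sleep arguments exist only so the loop can be exercised without a real DNS server.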

The Final Result

Notes

I have included Kubernetes asset generation inside the scripts, which may help some of you later if you want your etcd nodes to join the Kubernetes cluster so that monitoring agents can be scheduled onto them.

I originally started with docker-compose but ran into too many problems with advertised IPs, hostnames, and ordering. I have included that file for testing.

Also, Consul won’t bind to port 53 because it runs as a non-root user. Instead, I have used socat to forward UDP port 53 to 8600. In a way, you can think of this as your DNS server delegating to Consul.