How We Load Balance Our Rancher 2.0 Cluster on vSphere

If you’re using Rancher 2.0 with the vSphere provider, you’re going to want to read this.

Te-jé Rodgers
Parkside
11 min read · Apr 1, 2019

They’re load balancing. Sort of. (Photo by Artem Bali on Unsplash)

You’re not running your Kubernetes cluster on vSphere, are you? If you are, you stop that right now and go use a proper Kubernetes provider like Google Kubernetes Engine or Elastic Kubernetes Service. Leave vSphere to those of us who are stuck in our stubborn ways.

At least, that’s the impression you get when you first spin up a cluster on vSphere. Well, maybe not first, but eventually, you will want to configure load-balancing for your cluster and then you’ll realize that you have no idea how you’re supposed to do that. Moreover, after browsing forums for hours, you’ll realize that nobody has any idea either.

With a first-class provider like GKE, your load balancers are fully automatic. They expose a single address that you can point your hostname to and it handles balancing your requests among all the nodes in your cluster. That’s how it’s supposed to work.

However, on vSphere, Rancher installs a “load balancer” in your cluster that has the same number of addresses as the number of worker nodes in your cluster. In other words, it’s not a load balancer at all. Why? Well, if you point your hostname to any one of these addresses, the corresponding node becomes responsible for proxying your entire load (take note of the conspicuous absence of balance). So instead, you’d have to balance those addresses somehow.

Rancher when you ask it to load balance your vSphere nodes (Giphy)

If you don’t have the same requirements as we do, then you can get away with using a round-robin DNS configuration on those nodes. If you don’t know what that is, then now’s the time to go read up and thank your lucky stars that you have an easy way out of this.

This wasn’t good enough for us though. Here are our two main issues with this:

  1. We don’t have stable IP addresses for our Kubernetes nodes. Not only are they using DHCP, but we have a history of creating new ones and killing old ones every once in a while. The consequence is that any time the IP addresses of our node pool change, we would have to update every single one of our DNS entries. This is awful but still moot because:
  2. We generate a lot of internal hostnames for our QA cluster dynamically, but our internal DNS doesn’t have a programmatic API to manipulate DNS entries. The best we can do is point a domain to a single IP address. That might make you think, isn’t that the exact thing you’re trying to avoid? Yes, it is — but worse; instead of only one hostname routing its load through a single node, now we would have an entire domain doing that.

Round-Robin DNS Is Still Useful

Using the DNS system to distribute requests is a pretty dope shortcut. When a client makes a request for a hostname, the DNS can return multiple addresses, rotating the order in which they appear. In theory, every subsequent client asking for the route to stuff.parkside.at chooses a different IP address when requesting content. This means that requests to stuff.parkside.at would be distributed across the number of machines whose addresses are rotated. We get that for free using the intrinsic properties of the DNS system.
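
In zone-file terms, that amounts to nothing more than several A records for the same name (the addresses below are placeholders):

stuff.parkside.at.    300    IN    A    203.0.113.10
stuff.parkside.at.    300    IN    A    203.0.113.11
stuff.parkside.at.    300    IN    A    203.0.113.12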

Of course, as far as load balancing goes it’s pretty basic. Unlike a bona fide load balancer, the DNS has no way of knowing how taxed any of the machines serving stuff.parkside.at are. There are more complex approaches which use metrics like these to choose the best machine to send requests to. They are proven, fast, expensive, and difficult to configure correctly.

In the face of that, round-robin DNS is still useful; it provides rudimentary load distribution at very little extra cost. You can probably see where I’m going with this — we ended up using round-robin DNS to distribute the load to our cluster. The rest of this article is about how we worked around the issues that prevented us from doing so.

Getting Technical

From this point on, I’ll discuss some pretty technical details of our load distribution system. To do that, I need to assume you have some prior knowledge about Kubernetes, including what Ingresses, Scheduling, Taints, Tolerations, Pods, and Deployments are. Knowledge of the DNS system will also help.

Getting Around Unstable IP Addresses

What do you do when your cluster doesn’t have stable IP addresses? You add some stable IP addresses.

That’s a bit harder than it sounds because of the way Rancher manages nodes. In Rancher, nodes are managed as part of a node pool. A node pool is a collection of nodes which are all created from the same template. The difficulty is that there are no customization options for the nodes within a pool.

Where’s the static IP address button?

Instead, nodes created from the pool get their network addresses assigned by DHCP. Now, there are ways to configure our DHCP server to assign stable IP addresses, but these rely on other properties like the node’s MAC address which we can’t configure here either.

The only way I found to give a machine a fixed IP address is by providing it as initialization configuration in the node template. Of course, this is less than ideal since all of our workers are created from the same template and they can’t all share the same fixed address.

This is to say that having stable addresses isn’t really practical for our worker nodes. However, this doesn’t mean that the whole affair is a lost cause; it’s still possible to assign a fixed IP address to a node as long as it is the only node in the node pool. You may wonder, how is that useful?

Hear me out on this one.

For us, that single node’s entire purpose is to act as a DNS server for the cluster. We made it responsible for resolving all hostnames within our cluster’s domains.

It’s reasonable to ask what the point of putting this node in a node pool of its own is, rather than just creating it manually and importing it into the cluster. We did it this way because we still want Rancher to manage it. The benefit is that Rancher tries to ensure that the number of nodes in the node pool matches what’s configured; that is to say, if this DNS node dies, Rancher will automatically create a new one for us.

Note: Our cluster is internal and not very large. This approach is good enough for our setup. If you’re thinking of using this for a seriously large cluster, then relying on a single DNS is almost comical; don’t do it.

Dynamically Registering Hostnames

Since we put a DNS inside the cluster (let’s call this one cDNS from now on), dynamically generating hostnames for our cluster becomes easier. We configured our internal DNS (iDNS) to forward all hostname lookups for our cluster domains to cDNS. Then we configured cDNS to watch ingresses inside the cluster. When it’s queried for a hostname with a corresponding ingress, it responds with the IP addresses of all nodes from which the ingress is available.
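
The exact forwarding syntax depends on what software the iDNS runs; as a sketch, if it happened to be dnsmasq, a conditional forward for one of the cluster domains would be a single line pointing at the cDNS node’s address (the same 172.16.0.161 that shows up later in this article):

server=/qa.parkside.at/172.16.0.161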

As a consequence, any ingresses we create in the cluster are automatically picked up by cDNS, making our cluster’s hostnames fully dynamic.

Here’s a diagram to help you fully understand how this works:

Example: DNS lookup for an HTTP request into the cluster

How the Cluster DNS is Configured

cDNS has two main components, both run from within the same pod:

  1. A simple dnsmasq container which responds to queries with entries in a hosts file.
  2. A sidecar which uses the Kubernetes API to watch the cluster’s ingresses and updates the hosts file with entries that associate every hostname it finds with the IP addresses of the nodes from which that host’s ingress is available.

The containers are linked together with a volume so that changes made to the hosts file in the sidecar are visible from the dnsmasq container. This makes sure that dnsmasq is able to resolve hosts from new ingresses almost instantly. A nice bonus is that, since the hosts file usually contains multiple IPs for any single hostname, dnsmasq automatically rotates the order of the IP addresses it resolves, enabling round-robin DNS.
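
For illustration, the generated file contains one line per node address for each hostname; using the node addresses that show up later in this article, an entry set for one ingress would look something like this:

172.16.2.160 dashboard-master.qa.parkside.at
172.16.2.161 dashboard-master.qa.parkside.at
172.16.2.147 dashboard-master.qa.parkside.at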

Here’s a snippet of this pod’s spec:
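In condensed form it looks roughly like this; the names, image references, and mount path are placeholders, but the two containers, the shared volume, the node selector, the tolerations, and the UDP host port are the parts that matter:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cdns
spec:
  replicas: 1                       # port 53 on the node can only be bound once
  selector:
    matchLabels:
      app: cdns
  template:
    metadata:
      labels:
        app: cdns
    spec:
      nodeSelector:
        purpose: cdns               # only the dedicated DNS node carries this label
      tolerations:                  # tolerate the taints applied to that node
      - key: purpose
        value: cdns
        effect: NoSchedule
      - key: purpose
        value: cdns
        effect: NoExecute
      volumes:
      - name: hosts                 # shared between dnsmasq and the sidecar
        emptyDir: {}
      containers:
      - name: dnsmasq
        image: registry.example.com/cdns/dnsmasq            # placeholder image
        # configured to serve records from the hosts file on the shared volume
        ports:
        - containerPort: 53
          hostPort: 53              # bind the node's DNS port directly
          protocol: UDP
        volumeMounts:
        - name: hosts
          mountPath: /data
      - name: ingress-watcher       # the sidecar described below (placeholder name)
        image: registry.example.com/cdns/ingress-watcher    # placeholder image
        ports:
        - name: liveness-port
          containerPort: 9090
        volumeMounts:
        - name: hosts
          mountPath: /data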

Scheduling the DNS Pod

You may have noticed that our podspec contains both a node selector and some tolerations. The node is labeled with purpose=cdns in the template, so the node selector makes sure that the pod ends up on this node. We also tainted the node to prevent other pods from being scheduled and executed on it. This isn’t configurable in the node template, so we did it manually¹:

kubectl taint nodes rancher-cdns-1 purpose=cdns:NoSchedule
kubectl taint nodes rancher-cdns-1 purpose=cdns:NoExecute

These features combined ensure that the DNS pod and only the DNS pod will schedule on this node.

Another thing to note is that the pod is configured to bind to the node’s port 53/udp. That’s the well-known port used by the DNS system. This will cause the pod to fail to schedule if something else is already listening on that port. That’s okay because we don’t allow anything else to run on that node, but it’s worth noting that this prevents more than one instance of this pod from running on the node. That is, when the pod is managed by a pod controller, the controller must always be set to scale: 1.

Health-checking the Sidecar

I mentioned that the DNS pod contains a sidecar that watches the cluster for ingress changes and then populates a hosts file. It’s a rudimentary Python script that connects to the cluster’s API service to receive ingress events. It’s important to periodically health-check this script because its connection to the API service can die. This would leave the script in an unrecoverable state, meaning that further ingress changes would not be noticed by dnsmasq.

To achieve this, it includes a health check HTTP endpoint which, when called, returns a status code based on whether the connection to the Kubernetes API is still alive. This endpoint is called periodically by the kubelet; if it returns a status code indicating failure, the kubelet restarts the container.

This is enabled by a liveness probe config in the container spec:

ports:
- name: liveness-port
  containerPort: 9090
livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  initialDelaySeconds: 10
  failureThreshold: 1

failureThreshold: 1 means that the container will be restarted as soon as a single health check fails. We chose that because there is no built-in way for the script to recover from a failed health check anyway.

What Happens When the cDNS Fails?

Funny you should ask.

The day after we deployed the cDNS and moved all of our QA deployments into the cluster, I woke up to find this:

A quick check revealed that not a single hostname in the cluster was being resolved at all. Sure enough, querying the cDNS directly revealed something strange:

$ nslookup dashboard-master.qa.parkside.at 172.16.0.161
;; Truncated, retrying in TCP mode.
;; Connection to 172.16.0.161#53(172.16.0.161) for dashboard-master.qa.parkside.at failed: connection refused.

Investigating the logs for dnsmasq, I found something odd:

dnsmasq[6]: query[A] dashboard-master.qa.parkside.at from 172.16.1.16
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.161
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.160
dnsmasq[6]: /data dashboard-master.qa.parkside.at is 172.16.2.147
...

The three host entries for dashboard-master.qa.parkside.at were being repeated to the point where they were too large for a UDP packet!

Each time dnsmasq noticed that we had overwritten the hosts file, it loaded all the entries inside the hosts file without flushing its cache. In other words, if the entries were already present, they became duplicated. This meant that every time the sidecar overwrote the hosts file, more duplicates could be added to dnsmasq’s cache. Eventually, those duplicates added up to too much data for UDP and caused the DNS queries to fail.

The fix was simple: whenever we overwrite the hosts file, we also signal dnsmasq to flush its cache. Still, it made me think that the cDNS needed to be more resilient to failures; if any part of it ceased to function, the cluster’s hostnames would stop resolving, rendering the cluster inaccessible.
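
In dnsmasq’s case, that flush is a SIGHUP, which makes it clear its cache and re-read its hosts files. As a rough sketch, and assuming the sidecar can reach the dnsmasq process at all (for example via shareProcessNamespace: true on the pod), the extra step after every rewrite is a one-liner along these lines:

# after writing the new hosts file, tell dnsmasq to drop its cache and reload
kill -HUP "$(pidof dnsmasq)"

As for the resilience of the rest of the system: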

Since the node is in a node pool, it is more or less managed for us. Rancher would spawn a new one if it dies. The same goes for the pod; it’s in a Deployment with scale=1, which means that Kubernetes would schedule a new pod whenever one dies.

This leaves only the containers. You saw previously that the sidecar has an HTTP endpoint that Kubernetes probes to determine whether the sidecar script is still healthy. This works great, but I did not want to add complexity to the dnsmasq container by adding such an endpoint to it. Instead, I decided to use a different type of probe on the dnsmasq container; it periodically tries to resolve a hostname in the cluster, and should that fail, Kubernetes knows that the dnsmasq container is no longer healthy:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - nslookup qa.parkside.at 127.0.0.1
  initialDelaySeconds: 10
  failureThreshold: 3

The command nslookup qa.parkside.at 127.0.0.1 makes a DNS query against the loopback interface. Since the probe runs inside the dnsmasq container, it hits the dnsmasq instance. If this command fails three times in a row (failureThreshold: 3), we assume that dnsmasq is no longer healthy, and Kubernetes restarts the container.

Wrapping up

One unaccounted-for failure point in our cDNS system is that the taints are not automatically applied to the cDNS node, meaning that it is theoretically possible for other workloads to be scheduled on a fresh node and squeeze out the cDNS pod. This would require exceptional circumstances, and if it did occur, it could be remedied by manually applying the purpose=cdns:NoExecute taint once again.

We could even do this automatically with some monitoring. For example, we could create a downtime alert for the cDNS node and trigger an internal webhook when it comes back up, which would apply that taint for us.

This approach can be extended to add more cDNS nodes, but it would require creating a separate node pool for each one so that they each have a stable IP address. Then the additional nodes can be configured as secondary name servers for the cluster domains.
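
If the internal DNS forwards per domain as sketched earlier, pointing it at an additional cDNS node is just one more forwarder line for the same domain (172.16.0.162 here is a made-up address for that hypothetical second node):

server=/qa.parkside.at/172.16.0.161
server=/qa.parkside.at/172.16.0.162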

All in all, coming up with a load distribution solution that worked for us required a bit of DIY, but in the end, we achieved something that’s stable and resilient enough for our needs.
