Creating an HA cluster using Kubespray and understanding how the control plane’s components behave

Luc Juggery
Aug 12

There are several tools out there to set up a Kubernetes Cluster. The options include, but aren’t limited to:

  • kubeadm: can be used to deploy a single master or an HA cluster
  • Kubespray: based on Ansible playbooks and uses kubeadm behind the scenes to deploy single-master or multi-master clusters
  • eksctl: dedicated to deploying a cluster on AWS infrastructure
  • Rancher: provides a great web UI to manage several clusters from one location
  • … and the list goes on

In this piece, we will see how to set up an HA cluster using Kubespray.


Set Up the Dependencies

We start by cloning the Kubespray repository. It contains all the Ansible playbooks needed to set up a cluster.

$ git clone git@github.com:kubernetes-sigs/kubespray.git

We then install the dependencies with Python’s pip:

$ cd kubespray
$ pip3 install -r requirements.txt

During this process, the following pieces are installed:

$ cat requirements.txt
ansible==2.7.12
jinja2==2.10.1
netaddr==0.7.19
pbr==5.2.0
hvac==0.8.2
jmespath==0.9.4
ruamel.yaml==0.15.96
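
If you prefer to keep these dependencies isolated from the system’s Python packages, a minimal sketch using a virtual environment (the venv module ships with Python 3) looks like this:

$ python3 -m venv kubespray-venv
$ source kubespray-venv/bin/activate
(kubespray-venv) $ pip3 install -r requirements.txt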

Provisioning the Infrastructure

In this example, we are using three nodes created on DigitalOcean. Each node has the following properties:

  • Standard type (ideal for dev/test environments)
  • Ubuntu Server 18.04
  • 4 GB / 2 CPUs
  • Located in the London datacenter
  • Configured with a predefined SSH key (used later on to automate the SSH connections from the Ansible playbooks); a sample provisioning command is shown after this list
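
Assuming you use DigitalOcean’s doctl CLI, a minimal sketch for creating three such droplets could look like the following (the SSH key ID is a placeholder to replace with your own):

$ doctl compute droplet create node1 node2 node3 \
    --region lon1 \
    --size s-2vcpu-4gb \
    --image ubuntu-18-04-x64 \
    --ssh-keys <YOUR_SSH_KEY_ID>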

Cluster Configuration

When working with Kubespray, it is first advised to copy the default sample configuration from inventory/sample:

$ cp -rfp inventory/sample inventory/mycluster

Then, we can customize the Ansible variables within the following files:

  • inventory/mycluster/group_vars/all/all.yml
  • inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

In the current example, it’s totally fine to use the default values.
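
For reference, a few of the variables you could tune in inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml look like this (the values below only illustrate the format, they are not required changes):

# inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
kube_version: v1.15.2
kube_network_plugin: calico
cluster_name: cluster.local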

Inventory file

Kubespray has a helper script to create an inventory from a list of IP addresses. The following commands set the IPs of our three nodes and create an inventory in YAML format:

$ declare -a IPS=(165.22.119.207 68.183.36.52 104.248.164.246)
$ CONFIG_FILE=inventory/mycluster/hosts.yml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

The generated inventory file is the following:

all:
  hosts:
    node1:
      ansible_host: 165.22.119.207
      ip: 165.22.119.207
      access_ip: 165.22.119.207
    node2:
      ansible_host: 68.183.36.52
      ip: 68.183.36.52
      access_ip: 68.183.36.52
    node3:
      ansible_host: 104.248.164.246
      ip: 104.248.164.246
      access_ip: 104.248.164.246
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

If you prefer playing with inventory files in an INI-like format (as I do), you can easily reformat the content and save it as a hosts.ini file:

[all]
node1 ansible_host=165.22.119.207
node2 ansible_host=68.183.36.52
node3 ansible_host=104.248.164.246
[kube-master]
node1
node2
[etcd]
node1
node2
node3
[kube-node]
node2
node3
[k8s-cluster:children]
kube-master
kube-node

Several important things to note here:

  • The three nodes are defined in the [all] section
  • The [kube-master] section contains node1 and node2, ensuring the administrative processes (API Server, scheduler, controller manager) run on both masters
  • The [etcd] section contains the three nodes, meaning an instance of etcd will run on each one of them (the minimum required for an HA etcd cluster)
  • The [kube-node] section contains node2 and node3 so user workloads can be scheduled on those nodes. The NoSchedule taint that is normally set on each master node will therefore not be set on node2; this taint is what prevents user workloads from being scheduled on a master
  • No host is defined under the calico-rr key in the YAML inventory, so we do not specify any section here

This inventory defines a cluster with a stacked etcd topology. This means the etcd instances run on the master nodes.

etcd cluster deployed on the master nodes (source: Kubernetes documentation)

Note: we could provision more VMs and dedicate three of them to running the etcd cluster so that it is external to Kubernetes. This would bring more security and resiliency to the cluster, but comes at the cost of additional hardware.
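
With such an external etcd topology, the INI inventory could look roughly like the sketch below (node4, node5, and node6 are hypothetical dedicated etcd machines; their IPs are placeholders):

[all]
node1 ansible_host=165.22.119.207
node2 ansible_host=68.183.36.52
node3 ansible_host=104.248.164.246
node4 ansible_host=<ETCD_1_IP>
node5 ansible_host=<ETCD_2_IP>
node6 ansible_host=<ETCD_3_IP>
[kube-master]
node1
node2
[etcd]
node4
node5
node6
[kube-node]
node2
node3
[k8s-cluster:children]
kube-master
kube-node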

Building the cluster

Once everything is in place, we can run the Ansible playbook to build the cluster. The following command executes the action specified in the cluster.yml file:

$ ansible-playbook -i hosts.ini -u root -b --key-file=~/.ssh/do-key.pem cluster.yml
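
Before launching the full playbook, it can be worth checking that Ansible can reach all three nodes over SSH; an ad-hoc command with the ping module, using the same inventory and key, does the trick:

$ ansible all -i hosts.ini -u root --key-file=~/.ssh/do-key.pem -m ping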

It only takes a couple of minutes for the cluster to be ready. Once it’s up and running, we can get a kube config file from the /etc/kubernetes/admin.conf location on a master. We can configure the kubectl client to use it through the KUBECONFIG environment variable:

$ scp root@MASTER_X_IP:/etc/kubernetes/admin.conf kubespray-do.conf
$ export KUBECONFIG=$PWD/kubespray-do.conf
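
To confirm the configuration is picked up, a simple check such as the following should report the cluster’s endpoints (depending on how admin.conf was generated, its server address may point at a local address; in that case, edit it so it targets one of the masters’ public IPs):

$ kubectl cluster-info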

We can then get the list of nodes:

$ kubectl get nodes
NAME    STATUS   ROLES    AGE   VERSION
node1   Ready    master   29m   v1.15.2
node2   Ready    master   28m   v1.15.2
node3   Ready    <none>   27m   v1.15.2

A closer look at the Control Plane

Let’s list all the Pods running on the cluster. As we didn’t run any workloads, all the Pods belong to the kube-system namespace and are dedicated to administrative tasks.

$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-64c..-dtnzv 1/1 Running 0 27m
calico-node-j62dh 1/1 Running 1 28m
calico-node-jtfml 1/1 Running 1 28m
calico-node-qh8rw 1/1 Running 1 28m
coredns-74c9d4d795-cp274 1/1 Running 0 27m
coredns-74c9d4d795-hrnqd 1/1 Running 0 27m
dns-autoscaler-7d95989447-t54wv 1/1 Running 0 27m
kube-apiserver-node1 1/1 Running 0 29m
kube-apiserver-node2 1/1 Running 0 28m
kube-controller-manager-node1 1/1 Running 0 29m
kube-controller-manager-node2 1/1 Running 0 28m
kube-proxy-6v5tf 1/1 Running 0 28m
kube-proxy-dbhvs 1/1 Running 0 28m
kube-proxy-tv4kg 1/1 Running 0 28m
kube-scheduler-node1 1/1 Running 0 29m
kube-scheduler-node2 1/1 Running 0 28m
kubernetes-dashboard-7c547b4c64-q2gds 1/1 Running 0 27m
nginx-proxy-node3 1/1 Running 0 28m
nodelocaldns-52rwd 1/1 Running 0 27m
nodelocaldns-dgzk2 1/1 Running 0 27m
nodelocaldns-grfsq 1/1 Running 0 27m

Listing the other resources of the cluster, we can easily see that:

  • the Pods calico-kube-controllers, coredns, dns-autoscaler and kubernetes-dashboard are each managed by a Deployment resource
  • the Pods calico-node, kube-proxy, nodelocaldns are each managed by a DaemonSet resource
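
A quick way to confirm this is to list those resources directly (output omitted here):

$ kubectl get deployments,daemonsets -n kube-system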

Things are a bit different when it comes to the Pods used within the control plane: kube-apiserver, kube-controller-manager, and kube-scheduler. These Pods are not managed by any higher-level resource (Deployment, DaemonSet, …) and their names include the master node they are running on.

As those processes are critical to the cluster, one might expect that two instances of each cannot be active at the same time. Let’s check how those processes are handled.
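
These control-plane Pods are static Pods: the kubelet on each master starts them directly from manifest files on disk rather than through a controller. Since Kubespray relies on kubeadm, the manifests should live in the default /etc/kubernetes/manifests directory on each master (typically kube-apiserver.yaml, kube-controller-manager.yaml, and kube-scheduler.yaml); a quick way to check:

$ ssh root@MASTER_X_IP ls /etc/kubernetes/manifests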

API Server

To connect to the API Server, the worker nodes go through a local load balancer. On node3, the Pod nginx-proxy-node3 is running; if we check its configuration, we can see it proxies each request to one of the API Server instances (see the upstream kube_apiserver block in the output below).

$ kubectl exec -ti pod/nginx-proxy-node3 -n kube-system -- sh
# cat /etc/nginx/nginx.conf
error_log stderr notice;
worker_processes 2;
worker_rlimit_nofile 130048;
worker_shutdown_timeout 10s;

events {
  multi_accept on;
  use epoll;
  worker_connections 16384;
}

stream {
  upstream kube_apiserver {
    least_conn;
    server 165.22.119.207:6443;
    server 68.183.36.52:6443;
  }
  server {
    listen 127.0.0.1:6443;
    proxy_pass kube_apiserver;
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
  }
}

http {
  aio threads;
  aio_write on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 75s;
  keepalive_requests 100;
  reset_timedout_connection on;
  server_tokens off;
  autoindex off;
}

Thus, if one API Server instance becomes unreachable (for example, if nginx cannot establish a connection to it within proxy_connect_timeout), requests are sent to the other one.
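
To see the proxy in action, you can query the API Server health endpoint through the local listener on node3 (the -k flag skips TLS verification, which is acceptable for a quick check); it should return ok as long as at least one API Server is reachable:

$ ssh root@104.248.164.246 curl -sk https://127.0.0.1:6443/healthz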

Controller Manager & Scheduler

As defined in the Kubernetes documentation, those components use a lease mechanism to make sure only one instance of each of them is active in the cluster. Let’s take a closer look at the scheduler.

First of all, we need to get the list of endpoints, which define a way to access other resources (endpoints are used by Service resources to load balance requests to backend Pods):

$ kubectl get endpoints -n kube-system
NAME                      ENDPOINTS                                 AGE
coredns                   10.233.90.2:53,10.233.96.1:53 + 3 more    109m
kube-controller-manager   <none>                                    112m
kube-scheduler            <none>                                    112m
kubernetes-dashboard      10.233.92.1:8443                          109m

We can inspect what is inside the kube-scheduler one:

$ kubectl get endpoints kube-scheduler -n kube-system -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"node1_e7f79dcf-ed72-43c0-902a-6fc62aac2a69","leaseDurationSeconds":15,"acquireTime":"2019-08-11T11:30:08Z","renewTime":"2019-08-11T13:24:55Z","leaderTransitions":0}'
  creationTimestamp: "2019-08-11T11:30:08Z"
  name: kube-scheduler
  namespace: kube-system
  resourceVersion: "13193"
  selfLink: /api/v1/namespaces/kube-system/endpoints/kube-scheduler
  uid: 5437cbe6-7e5d-4dd2-a491-e345dc09a73b

and then focus on the control-plane.alpha.kubernetes.io/leader annotation:

{
  "holderIdentity": "node1_e7f79dcf-ed72-43c0-902a-6fc62aac2a69",
  "leaseDurationSeconds": 15,
  "acquireTime": "2019-08-11T11:30:08Z",
  "renewTime": "2019-08-11T13:24:55Z",
  "leaderTransitions": 0
}

This annotation defines which scheduler instance is the leader, in this case, the one running on node1. The current leader holds a lease that must be regularly renewed to prove it is still alive. If it cannot renew the lease, a new leader election takes place.
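
To follow the leader annotation without dumping the whole resource, a jsonpath query such as the following can be handy (the backslashes escape the dots in the annotation key):

$ kubectl get endpoints kube-scheduler -n kube-system \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'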

From the DigitalOcean interface, we can simulate an outage and destroy node1.

If we check the content of the control-plane.alpha.kubernetes.io/leader key within the annotations of the kube-scheduler endpoint, we can see the leader is not the scheduler running on node1 anymore. The new leader is now the scheduler running on node2.

{
  "holderIdentity": "node2_a13c0374-44ea-419e-ac07-f868faddab3e",
  "leaseDurationSeconds": 15,
  "acquireTime": "2019-08-11T14:19:51Z",
  "renewTime": "2019-08-11T14:26:31Z",
  "leaderTransitions": 1
}

The new leader election took place because the previous scheduler was not able to update the lease.

The same process applies to the controller manager, so a single instance of it is active at any given time.


Summary

In this piece, we created an HA cluster using Kubespray and we saw how the control plane’s components behave in this cluster. The load balancer used to access the API Servers and the lease mechanism are important to understand when working with an HA cluster.

To get more resiliency and security, an external etcd cluster should be considered so the etcd instances are not dependent upon the nodes of the Kubernetes cluster.
