Kubernetes Cluster Bootstrap

Bartek Antoniak
Published in VirtusLab
Mar 1, 2018

Introduction

Have you ever wondered what’s happening under the hood of the Kubernetes bootstrap process?

In this blog post I will take you on a journey through the steps of provisioning a Kubernetes cluster in general, without focusing on any particular cloud provider or hardware virtualization.

It will give you an overview of how the various components fit and work together.

Kubernetes components

This section assumes that you are familiar with Kubernetes and know the basics.

Before we start digging into the actual bootstrap process let’s take a look at the Kubernetes components in order to understand what we’re going to provision.

In this example we have three nodes: two worker nodes and one master node.

The API server runs on the master node and exposes the Kubernetes API. An authenticated and authorized user or service can view or modify the Kubernetes objects described in manifests.
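For example, an authenticated client can query the API directly. A minimal sketch, assuming a hypothetical master at master.example.com:6443 and client certificates under /etc/kubernetes/ssl (the host, port and certificate paths are assumptions):

# Query the node list straight from the API server.
curl --cacert /etc/kubernetes/ssl/ca.pem \
  --cert /etc/kubernetes/ssl/admin.pem \
  --key /etc/kubernetes/ssl/admin-key.pem \
  https://master.example.com:6443/api/v1/nodes

# Or, with a kubeconfig in place, the same through kubectl:
kubectl get nodes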

The scheduler and controller manager also run on the master node and are responsible for managing cluster state based on the Kubernetes objects stored in etcd (a highly available key-value datastore).

The kubelet is the primary node agent. It is focused on running containers and does not manage containers that were not created by Kubernetes.

Kube-proxy is responsible for TCP and UDP forwarding (via iptables or ipvs rules) for Kubernetes Services.
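A quick way to see this in action on a node is to inspect the NAT rules kube-proxy maintains; a minimal sketch, assuming iptables mode:

# List the NAT chain kube-proxy maintains for Services (iptables mode).
sudo iptables -t nat -L KUBE-SERVICES -n | head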

A Pod is simply a group of containers (Docker, rkt) running on the same node with a common life cycle.
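You can see this grouping on the node itself. A small sketch, assuming the Docker runtime, where containers started by the kubelet are named with a k8s_ prefix:

# Containers managed by Kubernetes are named k8s_<container>_<pod>_<namespace>_...
docker ps --filter "name=k8s_" --format '{{.Names}}'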

Under the Kubernetes abstraction layer there is nothing more than an operating system, a container runtime, systemd services and some networking.

Communication between containers is handled by CNI (Container Network Interface) plugins, which manage virtual network interfaces (e.g. Flannel, Calico).
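On a node, the CNI setup boils down to a configuration directory and a plugin binary directory. A sketch using the paths configured later in this post (they depend on the kubelet flags):

# CNI network configuration read by the kubelet (path set via --cni-conf-dir).
ls /etc/kubernetes/cni/net.d/

# CNI plugin binaries (path set via --cni-bin-dir).
ls /var/lib/cni/bin/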

In some cases there might be a virtualization environment running underneath (e.g. in the cloud).

Bootstrap on baremetal

In order to understand the Kubernetes bootstrap process from the very beginning, let’s start with a bare-metal example.

It involves a number of technologies, such as:

  • PXE-enabled network boot
  • DHCP
  • TFTP
  • OS image (Container Linux)
  • OS configuration (Ignition from Container Linux)
  • Docker

PXE network boot

PXE (Preboot Execution Environment) is one of the possible ways to network boot.

A PXE server is nothing more than a DHCP and TFTP server which provides the kernel image and root file system.

The NIC (Network Interface Card) of the PXE client sends a DHCP request and receives network configuration: IP address, subnet mask, DNS servers and gateway.

In addition, the DHCP response provides the location of the TFTP server.

The client connects to the TFTP server in order to download the boot image.

The client executes the boot image and downloads all the files it needs (kernel image, root file system).

Optionally, we can mount an NFS (Network File System) share.
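To sanity-check this part of the flow you can fetch the boot image from the TFTP server yourself; a sketch assuming the TFTP server address and image name used in the dnsmasq example below:

# Pull the iPXE boot image over TFTP to verify the server is reachable.
curl -O tftp://192.168.1.2/undionly.kpxe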

Dnsmasq

Dnsmasq is a mature project that can be used for a number of purposes:

  • works as a DHCP server
  • can also work as DHCP in proxy mode if you already have a DHCP server on your network
  • works as a TFTP server

It can be run as a container:


sudo docker run --rm --cap-add=NET_ADMIN --net=host quay.io/coreos/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi32,option:client-arch,6 \
  --dhcp-boot=tag:efi32,ipxe.efi \
  --dhcp-match=set:efibc,option:client-arch,7 \
  --dhcp-boot=tag:efibc,ipxe.efi \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries \
  --log-dhcp
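A quick check that the container is really serving DHCP and TFTP on the host (possible thanks to --net=host); a sketch:

# dnsmasq should show up listening on UDP 67 (DHCP) and UDP 69 (TFTP).
sudo ss -ulpn | grep dnsmasq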

CoreOS and Ignition

As the operating system we’ll use CoreOS Container Linux, which is:

A minimal container operating system with basic userland utilities. It runs on nearly any platform, whether physical, virtual or private/public cloud.

It supports container runtimes (Docker and rkt) out of the box.

Ignition is a provisioning utility for Container Linux. Its configuration is typically written as a Container Linux Config in YAML, which gets transpiled into Ignition’s JSON format (matchbox does this rendering for you). It executes early in the boot process, in the initramfs, and is able to partition disks, configure networking, set up systemd services and much more.
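You can also inspect the transpilation locally. A sketch, assuming the Container Linux Config Transpiler (ct) binary is available and that worker.yaml is a template like the one shown below:

# Transpile a Container Linux Config (YAML) into an Ignition config (JSON).
ct < worker.yaml > worker.ign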

---
systemd:
  units:
    - name: docker.service
      enable: true
    - name: locksmithd.service
      mask: true
    - name: kubelet.path
      enable: true
      contents: |
        [Unit]
        Description=Watch for kubeconfig
        [Path]
        PathExists=/etc/kubernetes/kubeconfig
        [Install]
        WantedBy=multi-user.target
    - name: wait-for-dns.service
      enable: true
      contents: |
        [Unit]
        Description=Wait for DNS entries
        Wants=systemd-resolved.service
        Before=kubelet.service
        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
        [Install]
        RequiredBy=kubelet.service

In order to understand when and where Ignition is executed, let’s take a look at the high-level Linux boot process:

  1. BIOS: perform POST, load MBR
  2. MBR: load GRUB Boot Loader
  3. GRUB2: load kernel image and initramfs ← Ignition here
  4. KERNEL: load modules and start the first system process (PID 1)
  5. SYSTEMD: read unit files from /etc/systemd/

Matchbox

Matchbox comes from CoreOS. It’s essentially an HTTP and gRPC server which maps bare-metal machines, based on labels such as MAC address, to PXE boot profiles.

Matchbox is configured by files and a directory structure; by default it uses /var/lib/matchbox. It can also be configured using a Terraform module.

The directory consists of:

  • assets — just static files
  • groups — mappings between bare-metal machines and boot profiles
  • profiles — kernel/initrd images and kernel arguments
  • ignition — OS configuration such as systemd services and network configuration

It can be run as a container:

sudo docker run --rm \
  -p 8080:8080 \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  quay.io/coreos/matchbox:latest \
  -address=0.0.0.0:8080 \
  -log-level=debug
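Once it’s running, a simple smoke test is to query the HTTP endpoints (using the hostname advertised through dnsmasq earlier):

# The root endpoint should answer with a short banner from the matchbox service.
curl http://matchbox.example.com:8080

# The iPXE entry point referenced in the dnsmasq configuration.
curl http://matchbox.example.com:8080/boot.ipxe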

For more examples please take a look at coreos/matchbox/examples.

Groups

Groups are selectors for physical machines based on labels (e.g. MAC address), together with metadata and a corresponding profile.

{
  "id": "node1",
  "name": "Worker Node",
  "profile": "worker",
  "selector": {
    "mac": "52:54:00:b2:2f:86"
  },
  "metadata": {
    "domain_name": "node1.example.com",
    "k8s_dns_service_ip": "10.3.0.10",
    "pxe": "true",
    "ssh_authorized_keys": [
      "ssh-rsa XXXXXXXXXXXX fake-test-key-REMOVE-ME"
    ]
  }
}
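With the group in place you can preview what matchbox would serve to this particular machine; a sketch querying the Ignition endpoint, using the MAC address in the hex-hyphen form that the profile’s coreos.config.url (below) sends:

# Render the Ignition config matchbox would hand out to node1.
curl "http://matchbox.example.com:8080/ignition?mac=52-54-00-b2-2f-86"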

Profiles

Profiles specify the kernel and initrd, kernel arguments, iPXE config, GRUB config or other configuration values a given machine should use.

{
  "id": "worker",
  "name": "Worker",
  "boot": {
    "kernel": "/assets/coreos/1465.8.0/coreos_production_pxe.vmlinuz",
    "initrd": ["/assets/coreos/1465.8.0/coreos_production_pxe_image.cpio.gz"],
    "args": [
      "initrd=coreos_production_pxe_image.cpio.gz",
      "root=/dev/sda1",
      "coreos.config.url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}",
      "coreos.first_boot=yes",
      "console=tty0",
      "console=ttyS0",
      "coreos.autologin"
    ]
  },
  "ignition_id": "worker.yaml"
}

Assets

This directory consists of static assets, like Container Linux image.

$ tree /var/lib/matchbox/assets
/var/lib/matchbox/assets/
├── coreos
│   └── 1465.8.0
│       ├── CoreOS_Image_Signing_Key.asc
│       ├── coreos_production_image.bin.bz2
│       ├── coreos_production_image.bin.bz2.sig
│       ├── coreos_production_pxe_image.cpio.gz
│       ├── coreos_production_pxe_image.cpio.gz.sig
│       ├── coreos_production_pxe.vmlinuz
│       └── coreos_production_pxe.vmlinuz.sig
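These images can be fetched and GPG-verified with the helper script shipped in the matchbox repository; a sketch, assuming a checkout of coreos/matchbox (the channel and version may need adjusting):

# Download and verify Container Linux PXE images into the assets directory.
./scripts/get-coreos stable 1465.8.0 /var/lib/matchbox/assets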

Ignition

This directory consists of ignition templates for profiles.

---
systemd:
  units:
    - name: docker.service
      enable: true
    - name: locksmithd.service
      mask: true
    - name: kubelet.path
      enable: true
      contents: |
        [Unit]
        Description=Watch for kubeconfig
        [Path]
        PathExists=/etc/kubernetes/kubeconfig
        [Install]
        WantedBy=multi-user.target
    - name: wait-for-dns.service
      enable: true
      contents: |
        [Unit]
        Description=Wait for DNS entries
        Wants=systemd-resolved.service
        Before=kubelet.service
        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
        [Install]
        RequiredBy=kubelet.service

Kubernetes node boot process

Now that we’ve covered all the building blocks needed for the Kubernetes node boot process, let’s summarize it briefly:

  1. Power on the bare-metal machine.
  2. Receive network configuration from DHCP.
  3. PXE network boot.
  4. First OS boot with configuration applied via Ignition.
  5. Provide additional assets (TLS certificates, Kubernetes manifests).
  6. The systemd kubelet.service starts and loads the Kubernetes manifests.

kubelet.service

The kubelet is configured as a systemd service and runs on every node.

- name: kubelet.service
  command: start
  runtime: true
  content: |
    [Unit]
    Description=Kubelet via Hyperkube ACI

    [Service]
    Environment=KUBELET_IMAGE=quay.io/coreos/hyperkube:v1.9.2_coreos.0
    ExecStart=/usr/lib/coreos/kubelet-wrapper \
      --kubeconfig=/etc/kubernetes/kubelet-kubeconfig.yaml \
      --require-kubeconfig \
      --cni-conf-dir=/etc/kubernetes/cni/net.d \
      --network-plugin=cni \
      --lock-file=/var/run/lock/kubelet.lock \
      --exit-on-lock-contention \
      --pod-manifest-path=/etc/kubernetes/manifests \
      --allow-privileged \
      --node-labels="node-role.kubernetes.io/node",type=worker,cluster=baremetal \
      --cni-bin-dir=/var/lib/cni/bin \
      --minimum-container-ttl-duration=6m0s \
      --cluster_dns=10.5.0.10 \
      --cluster-domain=cluster.local \
      --client-ca-file=/etc/kubernetes/ssl/ca.pem \
      --anonymous-auth=false \
      --register-node=true

    ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid

    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
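After the node boots you can verify that the unit came up and follow its logs; a sketch:

# Check the kubelet unit and tail its logs on the node.
systemctl status kubelet.service
journalctl -u kubelet.service -f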

It loads all Kubernetes manifests from the directory specified by --pod-manifest-path (kube-apiserver, kube-proxy, kube-dns, etc.).
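These static pod manifests live on the node’s file system, so you can inspect them without going through the API server; a sketch:

# Static pod manifests the kubelet watches and runs (one file per pod).
ls /etc/kubernetes/manifests/
cat /etc/kubernetes/manifests/kube-proxy.yaml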

The kubelet is also responsible for configuring the CNI network, based on the --cni-conf-dir parameter.

Example CNI configuration (for Calico):

{
  "name": "k8s-pod-network",
  "type": "calico",
  "etcd_endpoints": "__ETCD_ENDPOINTS__",
  "etcd_key_file": "__ETCD_KEY_FILE__",
  "etcd_cert_file": "__ETCD_CERT_FILE__",
  "etcd_ca_cert_file": "__ETCD_CA_CERT_FILE__",
  "log_level": "__LOG_LEVEL__",
  "ipam": {
    "type": "calico-ipam"
  },
  "policy": {
    "type": "k8s",
    "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
    "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
    "kubeconfig": "__KUBECONFIG_FILEPATH__"
  }
}

kube-proxy

Once kubelet.service is up and running, it applies kube-proxy manifest:

- path: /etc/kubernetes/manifests/kube-proxy.yaml
  content: |
    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-proxy
      namespace: kube-system
      labels:
        k8s-app: kube-proxy
    spec:
      containers:
      - name: kube-proxy
        image: quay.io/coreos/hyperkube:v1.9.2_coreos.0
        command:
        - ./hyperkube
        - proxy
        - --kubeconfig=/etc/kubernetes/kubelet-kubeconfig.yaml
        - --proxy-mode=iptables
        - --cluster-cidr=10.123.0.0/16
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ssl-certs-host
          readOnly: true
        - name: etc-kubernetes
          mountPath: /etc/kubernetes
          readOnly: true
      hostNetwork: true
      volumes:
      - hostPath:
          path: /usr/share/ca-certificates
        name: ssl-certs-host
      - name: etc-kubernetes
        hostPath:
          path: /etc/kubernetes

As we’ve seen before, kube-proxy is responsible for TCP and UDP forwarding using iptables or ipvs rules.
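Because kube-proxy is a static pod, the kubelet runs it directly and, once the API server is reachable, registers it as a mirror pod, so it also shows up through kubectl; a sketch:

# The kube-proxy static pod appears as a mirror pod in the kube-system namespace.
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide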

That’s pretty much all that is needed to bootstrap a minimal Kubernetes cluster. We could dig deeper into kube-apiserver, kube-dns, the scheduler or even more configuration files, but that is beyond the scope of this blog post :)

Bootstrap on cloud

Bootstrapping in the cloud is not so different from bare metal. Instead of providing physical machines, we have to take care of provisioning the nodes using different tools that run at a higher level of abstraction.

For example, with AWS as the cloud provider we can take advantage of CloudFormation templates in order to create and provision EC2 instances.
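A sketch of what that looks like with the AWS CLI, assuming a hypothetical workers.json CloudFormation template describing the EC2 instances:

# Create a stack of worker instances from a hypothetical CloudFormation template.
aws cloudformation create-stack \
  --stack-name k8s-workers \
  --template-body file://workers.json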

Note that every cloud provider uses a different virtualization environment, which sometimes adds a level of complexity.

Summary

Obviously there are more ways to bootstrap a Kubernetes cluster, like kubeadm, kubespray, tectonic-installer, etc.

I decided to write this blog post because I couldn’t find anything on the internet that covers a high-level overview of the bootstrap process.
