Kubernetes Cluster Bootstrap

Bartek Antoniak
Published in VirtusLab
Mar 1, 2018

Introduction

Have you ever wondered what’s happening under the hood of the Kubernetes bootstrap process?

In this blog post I will take you on a journey through the steps of provisioning a Kubernetes cluster in general, without focusing on any particular cloud provider or hardware virtualization.

It will give you an overview of how the various components fit and work together.

Kubernetes components

This section assumes that you are familiar with Kubernetes and know the basics.

Before we start digging into the actual bootstrap process let’s take a look at the Kubernetes components in order to understand what we’re going to provision.

In this example we have three nodes: two worker nodes and one master node.

The API server runs on the master node and exposes the Kubernetes API. An authenticated and authorized user or service can view or modify the Kubernetes objects described in manifests.
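For example, an authenticated client can query the API directly. A minimal sketch, assuming a hypothetical master at master.example.com:6443 and client certificates under /etc/kubernetes/ssl (the host, port and certificate paths are assumptions):

# Query the node list straight from the API server.
curl --cacert /etc/kubernetes/ssl/ca.pem \
  --cert /etc/kubernetes/ssl/admin.pem \
  --key /etc/kubernetes/ssl/admin-key.pem \
  https://master.example.com:6443/api/v1/nodes

# Or, with a kubeconfig in place, the same through kubectl:
kubectl get nodes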

The scheduler and controller manager also run on the master node and are responsible for managing cluster state based on the Kubernetes objects stored in etcd (a highly available key-value datastore).

The kubelet is the primary node agent. It is focused on running containers and does not manage containers that were not created by Kubernetes.

Kube-proxy is responsible for TCP and UDP forwarding (via iptables or ipvs rules) for Kubernetes Services.
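A quick way to see this in action on a node is to inspect the NAT rules kube-proxy maintains; a minimal sketch, assuming iptables mode:

# List the NAT chain kube-proxy maintains for Services (iptables mode).
sudo iptables -t nat -L KUBE-SERVICES -n | head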

A Pod is simply a group of containers (Docker, rkt) running on the same node with a common life cycle.
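You can see this grouping on the node itself. A small sketch, assuming the Docker runtime, where containers started by the kubelet are named with a k8s_ prefix:

# Containers managed by Kubernetes are named k8s_<container>_<pod>_<namespace>_...
docker ps --filter "name=k8s_" --format '{{.Names}}'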

Under the Kubernetes abstraction layer there is nothing more than an operating system, a container runtime, systemd services and some networking.

Communication between containers is handled by CNI (Container Network Interface) plugins, which manage virtual network interfaces (e.g. Flannel, Calico).
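On a node, the CNI setup boils down to a configuration directory and a plugin binary directory. A sketch using the paths configured later in this post (they depend on the kubelet flags):

# CNI network configuration read by the kubelet (path set via --cni-conf-dir).
ls /etc/kubernetes/cni/net.d/

# CNI plugin binaries (path set via --cni-bin-dir).
ls /var/lib/cni/bin/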

In some cases there might be a virtualization environment running underneath (e.g. in the cloud).

Bootstrap on baremetal

In order to understand the Kubernetes bootstrap process from the very beginning, let’s start with a bare-metal example.

It involves a number of technologies, such as:

  • PXE-enabled network boot
  • DHCP
  • TFTP
  • OS image (Container Linux)
  • OS configuration (Ignition from Container Linux)
  • Docker

PXE network boot

PXE (Preboot Execution Environment) is one of the possible ways to network boot.

A PXE server is nothing more than a DHCP and TFTP server which provides the kernel image and root file system.

The NIC (Network Interface Card) of the PXE client sends a DHCP request and receives network configuration: IP address, subnet mask, DNS servers and gateway.

In addition, the DHCP response provides the location of the TFTP server.

The client connects to the TFTP server in order to download the boot image.

The client executes the boot image and downloads all the files it needs (kernel image, root file system).

Optionally, we can mount an NFS (Network File System) share.
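To sanity-check this part of the flow you can fetch the boot image from the TFTP server yourself; a sketch assuming the TFTP server address and image name used in the dnsmasq example below:

# Pull the iPXE boot image over TFTP to verify the server is reachable.
curl -O tftp://192.168.1.2/undionly.kpxe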

Dnsmasq

Dnsmasq is a mature project that can be used for a number of purposes:

  • works as a DHCP server
  • can also work as DHCP in proxy mode if you already have a DHCP server on your network
  • works as a TFTP server

It can be run as a container:


sudo docker run --rm --cap-add=NET_ADMIN --net=host quay.io/coreos/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-match=set:bios,option:client-arch,0 \
  --dhcp-boot=tag:bios,undionly.kpxe \
  --dhcp-match=set:efi32,option:client-arch,6 \
  --dhcp-boot=tag:efi32,ipxe.efi \
  --dhcp-match=set:efibc,option:client-arch,7 \
  --dhcp-boot=tag:efibc,ipxe.efi \
  --dhcp-match=set:efi64,option:client-arch,9 \
  --dhcp-boot=tag:efi64,ipxe.efi \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe \
  --address=/matchbox.example.com/192.168.1.2 \
  --log-queries \
  --log-dhcp
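A quick check that the container is really serving DHCP and TFTP on the host (possible thanks to --net=host); a sketch:

# dnsmasq should show up listening on UDP 67 (DHCP) and UDP 69 (TFTP).
sudo ss -ulpn | grep dnsmasq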

CoreOS and Ignition

As the operating system we’ll use CoreOS Container Linux, which is:

A minimal container operating system with basic userland utilities. It runs on nearly any platform, whether physical, virtual or private/public cloud.

It supports container runtimes (Docker and rkt) out of the box.

Ignition is a provisioning utility for Container Linux. Its configuration is typically written as a Container Linux Config in YAML, which gets transpiled into Ignition’s JSON format (matchbox does this rendering for you). It executes early in the boot process, in the initramfs, and is able to partition disks, configure networking, set up systemd services and much more.
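You can also inspect the transpilation locally. A sketch, assuming the Container Linux Config Transpiler (ct) binary is available and that worker.yaml is a template like the one shown below:

# Transpile a Container Linux Config (YAML) into an Ignition config (JSON).
ct < worker.yaml > worker.ign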

---
systemd:
  units:
    - name: docker.service
      enable: true
    - name: locksmithd.service
      mask: true
    - name: kubelet.path
      enable: true
      contents: |
        [Unit]
        Description=Watch for kubeconfig
        [Path]
        PathExists=/etc/kubernetes/kubeconfig
        [Install]
        WantedBy=multi-user.target
    - name: wait-for-dns.service
      enable: true
      contents: |
        [Unit]
        Description=Wait for DNS entries
        Wants=systemd-resolved.service
        Before=kubelet.service
        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
        [Install]
        RequiredBy=kubelet.service

In order to understand when and where Ignition is executed, let’s take a look at the high-level Linux boot process:

  1. BIOS: perform POST, load MBR
  2. MBR: load GRUB Boot Loader
  3. GRUB2: load kernel image and initramfs ← Ignition here
  4. KERNEL: load modules and start the first system process (PID 1)
  5. SYSTEMD: read unit files from /etc/systemd/

Matchbox

Matchbox comes from CoreOS. It’s essentially an HTTP and gRPC server which maps bare-metal machines, based on labels such as MAC address, to PXE boot profiles.

Matchbox is configured by files and a directory structure; by default it uses /var/lib/matchbox. It can also be configured using a Terraform module.

The directory consists of:

  • assets — just static files
  • groups — mappings between bare-metal machines and boot profiles
  • profiles — kernel/initrd images and kernel arguments
  • ignition — OS configuration such as systemd services and network configuration

It can be run as a container:

sudo docker run --rm \
  -p 8080:8080 \
  -v /var/lib/matchbox:/var/lib/matchbox:Z \
  quay.io/coreos/matchbox:latest \
  -address=0.0.0.0:8080 \
  -log-level=debug
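Once it’s running, a simple smoke test is to query the HTTP endpoints (using the hostname advertised through dnsmasq earlier):

# The root endpoint should answer with a short banner from the matchbox service.
curl http://matchbox.example.com:8080

# The iPXE entry point referenced in the dnsmasq configuration.
curl http://matchbox.example.com:8080/boot.ipxe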

For more examples please take a look at coreos/matchbox/examples.

Groups

Groups are selectors for physical machines based on labels (e.g. MAC address), together with metadata and a corresponding profile.

{
  "id": "node1",
  "name": "Worker Node",
  "profile": "worker",
  "selector": {
    "mac": "52:54:00:b2:2f:86"
  },
  "metadata": {
    "domain_name": "node1.example.com",
    "k8s_dns_service_ip": "10.3.0.10",
    "pxe": "true",
    "ssh_authorized_keys": [
      "ssh-rsa XXXXXXXXXXXX fake-test-key-REMOVE-ME"
    ]
  }
}
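With the group in place you can preview what matchbox would serve to this particular machine; a sketch querying the Ignition endpoint, using the MAC address in the hex-hyphen form that the profile’s coreos.config.url (below) sends:

# Render the Ignition config matchbox would hand out to node1.
curl "http://matchbox.example.com:8080/ignition?mac=52-54-00-b2-2f-86"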

Profiles

Profiles specify the kernel and initrd, kernel arguments, iPXE config, GRUB config or other configuration values a given machine should use.

{
  "id": "worker",
  "name": "Worker",
  "boot": {
    "kernel": "/assets/coreos/1465.8.0/coreos_production_pxe.vmlinuz",
    "initrd": ["/assets/coreos/1465.8.0/coreos_production_pxe_image.cpio.gz"],
    "args": [
      "initrd=coreos_production_pxe_image.cpio.gz",
      "root=/dev/sda1",
      "coreos.config.url=http://matchbox.example.com:8080/ignition?uuid=${uuid}&mac=${mac:hexhyp}",
      "coreos.first_boot=yes",
      "console=tty0",
      "console=ttyS0",
      "coreos.autologin"
    ]
  },
  "ignition_id": "worker.yaml"
}

Assets

This directory consists of static assets, like Container Linux image.

$ tree /var/lib/matchbox/assets
/var/lib/matchbox/assets/
├── coreos
│   └── 1465.8.0
│       ├── CoreOS_Image_Signing_Key.asc
│       ├── coreos_production_image.bin.bz2
│       ├── coreos_production_image.bin.bz2.sig
│       ├── coreos_production_pxe_image.cpio.gz
│       ├── coreos_production_pxe_image.cpio.gz.sig
│       ├── coreos_production_pxe.vmlinuz
│       └── coreos_production_pxe.vmlinuz.sig
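These images can be fetched and GPG-verified with the helper script shipped in the matchbox repository; a sketch, assuming a checkout of coreos/matchbox (the channel and version may need adjusting):

# Download and verify Container Linux PXE images into the assets directory.
./scripts/get-coreos stable 1465.8.0 /var/lib/matchbox/assets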

Ignition

This directory consists of ignition templates for profiles.

---
systemd:
  units:
    - name: docker.service
      enable: true
    - name: locksmithd.service
      mask: true
    - name: kubelet.path
      enable: true
      contents: |
        [Unit]
        Description=Watch for kubeconfig
        [Path]
        PathExists=/etc/kubernetes/kubeconfig
        [Install]
        WantedBy=multi-user.target
    - name: wait-for-dns.service
      enable: true
      contents: |
        [Unit]
        Description=Wait for DNS entries
        Wants=systemd-resolved.service
        Before=kubelet.service
        [Service]
        Type=oneshot
        RemainAfterExit=true
        ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
        [Install]
        RequiredBy=kubelet.service

Kubernetes node boot process

Now that we’ve covered all the building blocks needed for the Kubernetes node boot process, let’s summarize it briefly:

  1. Power on the bare-metal machine.
  2. Receive network configuration from DHCP.
  3. PXE network boot.
  4. First OS boot with configuration applied via Ignition.
  5. Provide additional assets (TLS certificates, Kubernetes manifests).
  6. The systemd kubelet.service starts and loads the Kubernetes manifests.

kubelet.service

The kubelet is configured as a systemd service and runs on every node.

- name: kubelet.service
  command: start
  runtime: true
  content: |
    [Unit]
    Description=Kubelet via Hyperkube ACI

    [Service]
    Environment=KUBELET_IMAGE=quay.io/coreos/hyperkube:v1.9.2_coreos.0
    ExecStart=/usr/lib/coreos/kubelet-wrapper \
      --kubeconfig=/etc/kubernetes/kubelet-kubeconfig.yaml \
      --require-kubeconfig \
      --cni-conf-dir=/etc/kubernetes/cni/net.d \
      --network-plugin=cni \
      --lock-file=/var/run/lock/kubelet.lock \
      --exit-on-lock-contention \
      --pod-manifest-path=/etc/kubernetes/manifests \
      --allow-privileged \
      --node-labels="node-role.kubernetes.io/node",type=worker,cluster=baremetal \
      --cni-bin-dir=/var/lib/cni/bin \
      --minimum-container-ttl-duration=6m0s \
      --cluster_dns=10.5.0.10 \
      --cluster-domain=cluster.local \
      --client-ca-file=/etc/kubernetes/ssl/ca.pem \
      --anonymous-auth=false \
      --register-node=true

    ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid

    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
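After the node boots you can verify that the unit came up and follow its logs; a sketch:

# Check the kubelet unit and tail its logs on the node.
systemctl status kubelet.service
journalctl -u kubelet.service -f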

It loads all Kubernetes manifests from the directory specified by --pod-manifest-path (kube-apiserver, kube-proxy, kube-dns, etc.).
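These static pod manifests live on the node’s file system, so you can inspect them without going through the API server; a sketch:

# Static pod manifests the kubelet watches and runs (one file per pod).
ls /etc/kubernetes/manifests/
cat /etc/kubernetes/manifests/kube-proxy.yaml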

The kubelet is also responsible for configuring the CNI network, based on the --cni-conf-dir parameter.

Example CNI configuration (for Calico):

{
  "name": "k8s-pod-network",
  "type": "calico",
  "etcd_endpoints": "__ETCD_ENDPOINTS__",
  "etcd_key_file": "__ETCD_KEY_FILE__",
  "etcd_cert_file": "__ETCD_CERT_FILE__",
  "etcd_ca_cert_file": "__ETCD_CA_CERT_FILE__",
  "log_level": "__LOG_LEVEL__",
  "ipam": {
    "type": "calico-ipam"
  },
  "policy": {
    "type": "k8s",
    "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
    "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
    "kubeconfig": "__KUBECONFIG_FILEPATH__"
  }
}

kube-proxy

Once kubelet.service is up and running, it applies kube-proxy manifest:

- path: /etc/kubernetes/manifests/kube-proxy.yaml
  content: |
    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-proxy
      namespace: kube-system
      labels:
        k8s-app: kube-proxy
    spec:
      containers:
      - name: kube-proxy
        image: quay.io/coreos/hyperkube:v1.9.2_coreos.0
        command:
        - ./hyperkube
        - proxy
        - --kubeconfig=/etc/kubernetes/kubelet-kubeconfig.yaml
        - --proxy-mode=iptables
        - --cluster-cidr=10.123.0.0/16
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ssl-certs-host
          readOnly: true
        - name: etc-kubernetes
          mountPath: /etc/kubernetes
          readOnly: true
      hostNetwork: true
      volumes:
      - hostPath:
          path: /usr/share/ca-certificates
        name: ssl-certs-host
      - name: etc-kubernetes
        hostPath:
          path: /etc/kubernetes

As we’ve seen before, kube-proxy is responsible for TCP and UDP forwarding using iptables or ipvs rules.
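Because kube-proxy is a static pod, the kubelet runs it directly and, once the API server is reachable, registers it as a mirror pod, so it also shows up through kubectl; a sketch:

# The kube-proxy static pod appears as a mirror pod in the kube-system namespace.
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide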

That’s pretty much all that is needed to bootstrap a minimal Kubernetes cluster. We could dig deeper into kube-apiserver, kube-dns, the scheduler or even more configuration files, but that is beyond the scope of this blog post :)

Bootstrap on cloud

Bootstrapping in the cloud is not so different from bare metal. Instead of providing physical machines, we have to take care of provisioning the nodes using different tools that run at a higher level of abstraction.

For example, with AWS as the cloud provider we can take advantage of CloudFormation templates in order to create and provision EC2 instances.
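A sketch of what that looks like with the AWS CLI, assuming a hypothetical workers.json CloudFormation template describing the EC2 instances:

# Create a stack of worker instances from a hypothetical CloudFormation template.
aws cloudformation create-stack \
  --stack-name k8s-workers \
  --template-body file://workers.json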

Note that every cloud provider uses a different virtualization environment, which sometimes adds a level of complexity.

Summary

Obviously there are more ways to bootstrap a Kubernetes cluster, like kubeadm, kubespray, tectonic-installer, etc.

I decided to write this blog post because I couldn’t find anything on the internet that covers a high-level overview of the bootstrap process.
