Demystifying K8s

Abhishek Mitra
8 min read · Nov 14, 2023


What is Kubernetes?

Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

The above definition is taken as-is from the Kubernetes documentation.

I have previously shared a post on how the Kubernetes packet path works. Stemming from that, this is another post which is a humble attempt to articulate the sequence of events that take place when a simple operation is performed on K8s.

Most of the article is based on my working knowledge of Kubernetes, along with some really great references. The article is pretty detailed and a bit lengthy as well.

With that, let's begin our journey of understanding the sequence of events that unfold on executing one of the following commands:

kubectl create deployment test-nginx --image=nginx:alpine --replicas=3

OR

kubectl apply -f deployment.yaml
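
For reference, a minimal deployment.yaml that is roughly equivalent to the imperative command above could look like the following (the app label is just an illustrative choice):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
  labels:
    app: test-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-nginx
  template:
    metadata:
      labels:
        app: test-nginx
    spec:
      containers:
      - name: nginx
        image: nginx:alpine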

A lot of the terms here, like generators and informers, are used with an assumption of some exposure to building microservices with Golang.

Client Interactions (Kubectl)

At this point the client, kubectl, performs various client-side validations to implement a fail-fast mechanism, so that incorrect requests (for example, an invalid image name) are gated here and not sent to the API server, reducing its load.

kubectl then uses generators to serialize the input and construct an HTTP request to send to the API server. The type of resource is inferred by the K8s client from the various command-line parameters or the YAML file inputs.

Eventually, in the case above, a Deployment is created using the DeploymentAppsV1 generator to create a runtime object. Runtime object is a generic term for a resource.

Next, the client discovers the available API paths on the remote API server by querying the /api and /apis paths. These are the OpenAPI (Swagger) endpoints, and the client caches them under ~/.kube/cache/discovery, creating a folder for each new API server.
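
If you want to see what the client has discovered, the standard discovery commands read from the same endpoints (and use the same local cache):

kubectl api-versions    # group/version paths discovered under /api and /apis
kubectl api-resources   # resource types and the kinds they map to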

Before sending the request, the client needs to authenticate itself with the API server. It parses the kubeconfig file, found either at the default location, at the path specified by the KUBECONFIG environment variable, or via the --kubeconfig command-line parameter, and decorates the HTTP request accordingly:

  • x509 client certs are sent using tls.TLSConfig.
  • bearer tokens are sent using the HTTP Authorization header.
  • username and password are sent via HTTP basic authentication.
  • the OpenID auth process is handled manually by the user beforehand, producing a token which is sent like a bearer token.
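
As a rough sketch, here is how those credential types map onto a kubeconfig (the cluster name, user name, server address, and file paths below are hypothetical):

apiVersion: v1
kind: Config
clusters:
- name: dev-cluster
  cluster:
    server: https://10.0.0.1:6443
    certificate-authority: /etc/kubernetes/pki/ca.crt
users:
- name: dev-user
  user:
    client-certificate: /home/dev/.kube/dev-user.crt   # x509 client cert
    client-key: /home/dev/.kube/dev-user.key
    # token: <bearer token>                            # or a bearer token instead
contexts:
- name: dev
  context:
    cluster: dev-cluster
    user: dev-user
current-context: dev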

The final step is to actually send the HTTP request. Once kubectl does so and gets back a successful response, it prints a success message based on the desired output format.
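
For readers coming from the Golang side, here is a minimal client-go sketch of this same path (load the kubeconfig, build the runtime object, POST it). This is an approximation of what kubectl's generator and HTTP layer end up doing, not kubectl's actual code:

package main

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
    // Load credentials from the default kubeconfig (~/.kube/config).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // The runtime object equivalent to the flags passed on the CLI.
    deploy := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: "test-nginx"},
        Spec: appsv1.DeploymentSpec{
            Replicas: int32Ptr(3),
            Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "test-nginx"}},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "test-nginx"}},
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{Name: "nginx", Image: "nginx:alpine"}},
                },
            },
        },
    }

    // Serializes the object and POSTs it to /apis/apps/v1/namespaces/default/deployments.
    if _, err := clientset.AppsV1().Deployments("default").Create(context.TODO(), deploy, metav1.CreateOptions{}); err != nil {
        panic(err)
    }
}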

Client Interactions

API Server Interactions

When the request reaches the API server, the following sequence of events takes place:

  • API server aggregation: The request first hits the aggregation layer, which authenticates and authorizes it. The aggregation layer extends the K8s API server to allow additional APIs beyond what is available out of the box.
  • authN: Authentication is done by the API server to verify that the request was sent by a user who is who they say they are. The API server does this by looking at its CLI options and building a list of suitable authenticators.

If every authenticator fails, the request fails and an aggregate error is returned.

If authentication succeeds, the Authorization header is removed from the request, and user information is added to its context.

This gives future steps (such as authorization and admission controllers) the ability to access the previously established identity of the user.

  • authZ: Authorization is done to determine whether the authenticated user has the rights to perform the specific request. K8s supports the following authorization modes at the time of writing this article: Node, RBAC, ABAC and Webhook (a minimal RBAC example follows this list).
  • Admission Control: The focus of this layer is to validate that the request meets broader system requirements before the object is stored in the database (etcd). Admission control is stricter than authorization: the request is sent through a chain of admission controllers, and if even one of them rejects the request, it fails and is not persisted. For more info, refer to DAC.
  • Resource Handling: This is basically how the API server knows what to invoke. A typical HTTP server implements handlers that map to API paths like /api/xx, which tells the server which method to invoke when the path is hit via a GET or POST call. Similarly, the K8s API server installs REST mappings for every HTTP route, and that's how it knows to invoke the create-resource handler when the kubectl create command is executed.
  • As part of resource handling, the resource is created and persisted in etcd.
  • Finally, a response object is created, the persisted object's status is put into the response, and it is sent back.
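
To make the authZ step a little more concrete, here is a minimal, hypothetical RBAC Role and RoleBinding that would allow a user to create Deployments in the default namespace (the user and object names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-creator
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["create", "get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-creator-binding
  namespace: default
subjects:
- kind: User
  name: dev-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-creator
  apiGroup: rbac.authorization.k8s.io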
API Server

Controller Interactions

At this point our resource/runtime object, i.e. the Deployment, exists only in etcd and is not visible to other components until some initializers are run. These are basically bootstrap operations to be run on the saved etcd objects. The feature is enabled by default, and currently these are achieved as part of custom resources.

  • Now the object is made visible via the kube-api server.
  • Once it is visible, the Deployment controller detects this object and springs into action.
  • The above detection happens through a concept called informers: a pattern in which a controller subscribes to the API server and receives a notification whenever an object it cares about becomes available. The informer takes care of resource caching to reduce unnecessary API server connections and also allows multiple controllers to interact with the API server in a thread-safe manner (a client-go sketch of this pattern is shown after this list).
  • The Deployment controller then scans the system to find an existing ReplicaSet or Pod that matches the labels it has received; if none exists, it adds a create request to an internal queue to create a new ReplicaSet resource whose pod spec is copied over from the Deployment.
  • Once the ReplicaSet is created, the ReplicaSet controller steps in (again based on events it receives). The whole job of this controller is to ensure consistency between what exists and what is required, so it starts creating Pods.
  • Pod creation is governed by SlowStartInitialBatchSize, so that pod creations can be batched effectively to handle scale requests.
  • At this point we have a Pod object, and it is stuck in the Pending state since it has not been scheduled to a node.
  • This is where the Kubernetes scheduler (yet another controller) steps in. The scheduler filters for pods that have an empty node name. It then filters and scores the available nodes based on resources like CPU and RAM, calculating whether the pod's requirements can be satisfied on a specific node.
  • Once the node is finalized, the scheduler creates a binding object and sends it to the API server (an example is shown just after this list).
  • The API server, on receiving the binding object, updates the Pod with the desired nodeName and sets the PodScheduled condition to True.
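
The binding object itself is small. Conceptually it looks something like the following (the pod and node names here are hypothetical); the scheduler sends it to the pod's binding subresource:

apiVersion: v1
kind: Binding
metadata:
  name: test-nginx-5b6f8d7c9d-abcde
  namespace: default
target:
  apiVersion: v1
  kind: Node
  name: worker-node-1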
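For the Golang-inclined, here is a minimal sketch of the informer pattern mentioned above, using client-go. The resync period and the handler body are illustrative, not what the real Deployment controller runs:

package main

import (
    "fmt"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // The shared informer factory caches objects locally and fans events out
    // to every registered handler, so controllers don't hammer the API server.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
    deployInformer := factory.Apps().V1().Deployments().Informer()

    deployInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            d := obj.(*appsv1.Deployment)
            fmt.Printf("deployment added: %s/%s\n", d.Namespace, d.Name)
        },
    })

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // run until the process is killed
}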
Controller Interactions

Agent Interactions (Kubelet)

Now that the scheduler has scheduled the pods on specific nodes, the kubelet steps in. This is the starting point of the creation of an actual Pod; until now it existed only as an object in etcd.

  • the kubelet behaves as a Kubernetes agent whose job is to translate the specifics of a Pod into actual containers or microservice engines (these could be VMs as well).
  • the kubelet is also a controller, and once it receives the list of pods scheduled on the node it is running on, it performs various actions like creating data and pod directories and retrieving image pull secrets before handing control over to the Container Runtime Interface (CRI).
  • A CRI implementation such as Docker or CRI-O launches the containers required for a pod to exist.
  • The first container created per pod is a pause container that hosts the namespaces for that pod. In the Linux kernel, namespaces allow the host OS to carve out a dedicated set of resources to offer to a process; cgroups are used to govern these resource allocations.
  • The pause container provides a way to host these namespaces so that the child containers of a Pod can share them. This provides two advantages: 1) the child containers can all use localhost to address each other within the same pod, and 2) since pause acts as the init process, it takes on the responsibility of reaping dead processes within the same namespace.
  • Next comes the Container Network Interface (CNI), whose job is to allow different network providers to use different networking implementations for containers.
  • The kubelet interacts with the CNI plugins using config files located in /etc/cni/net.d, for example:
{
  "cniVersion": "0.3.1",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cnio0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "ranges": [
      [{"subnet": "${POD_CIDR}"}]
    ],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}
  • Next, the CNI plugin will set up a local Linux bridge (switch) and create interfaces in each container's network namespace.
    More details on how container networking works can be found here.
  • The kubelet then assigns an IP to the pause container's interface and sets up the routes. This results in the Pod having its own IP address. IP assignment is delegated to the IPAM provider specified in the JSON configuration.
  • For DNS, the kubelet will specify the internal DNS server IP address to the CNI plugin, which will ensure that the container's resolv.conf file is set appropriately (an example is shown after this list).
  • Once the process is complete, the plugin will return JSON data back to the kubelet indicating the result of the operation.
  • I have explained inter-host networking in depth over multiple articles, here and here.
  • Now the container runtime will pull the image (using image pull secrets if specified) and start creating the container.
  • This also involves lifecycle hooks that are passed down from the Pod spec, deserialized by the container runtime, and converted into the container runtime spec.
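
As a rough illustration of the DNS point above, the resolv.conf written for a container in the default namespace typically ends up looking like this; the nameserver IP and cluster domain depend entirely on how the cluster was provisioned:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5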

References
