Kubernetes Series: Part 1

Hiteshwar Sharma
Published in priceline labs · Nov 4, 2019

Kubernetes Architecture Deep-dive Walkthrough

Kubernetes is becoming the de facto standard for orchestrating containers. It is being widely used and adopted across the tech industry.

Over the next few weeks as a part of this series, I will post deep-dive tutorials and walkthroughs exploring the internals of different components of Kubernetes and customizing them for different use cases.

In the first of the series, we will explore the Kubernetes architecture and how this distributed set of components comes together to form a system that is powering everything from ecommerce sites with billions of users, to trading applications and IoT farms.

Image source: https://commons.wikimedia.org/wiki/File:Kubernetes.png

Kubernetes Architecture and Functioning

A Kubernetes cluster consists of master and worker nodes, where the master nodes run the controlling processes and the worker nodes run the workloads (though you can run workloads on master nodes as well by running a kubelet process there).

Kubernetes master nodes consist of the following:

  • API-Server
  • Controller Manager
  • Scheduler
  • etcd*

*For larger deployments it’s recommended to run etcd outside of master nodes.

Kubernetes worker nodes consist of the following:

  • Kubelet
  • Kube-proxy
  • Container runtime (Docker/containerd/CRI-O)

The best way to understand what each of these does and how they work together is to follow how an operation in Kubernetes is performed. For simplicity, we will take an example where we run a command on the CLI to provision and run an application, such as applying the Deployment manifest sketched below.
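To make the walkthrough concrete, here is a minimal sketch of the kind of manifest we might apply with kubectl; the name, labels and image are placeholders chosen for illustration, not anything this example requires.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.17    # placeholder image
        ports:
        - containerPort: 80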

Kubectl

The CLI command to interact with Kubernetes is kubectl. It is a client-side program responsible for sending commands and displaying the results to the user. The first thing kubectl does is parse the request and then perform syntax validation. This works as a failsafe mechanism, blocking malformed requests before they ever hit the cluster and thereby sparing the masters from processing incompatible requests.

Once validation is complete, kubectl creates an HTTP (REST) request using generators, which take care of the order in which the commands should go to the API-Server. The API-Server services REST operations and is the way to interact with the whole Kubernetes cluster.

The first line of almost any YAML file that we use to configure or interact with Kubernetes starts with something like apiVersion: apps/v1, because Kubernetes uses a versioned API. The API-Server exposes an OpenAPI-format schema, which is accessible at the /apis path, and kubectl also maintains a cached copy of the schema on the client side. Using this schema, the request object is generated. Kubectl then authenticates with the API-Server using credentials that are provided in one of the following ways (a sample kubeconfig is sketched after this list):

  1. Specified on the CLI using the --kubeconfig flag
  2. Stored in the environment variable $KUBECONFIG
  3. The file ~/.kube/config
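Whichever of these sources is used, it points to a kubeconfig file. Here is a minimal sketch of one, assuming certificate-based authentication; the cluster name, server address and file paths are placeholders.

apiVersion: v1
kind: Config
current-context: dev
clusters:
- name: dev-cluster
  cluster:
    server: https://203.0.113.10:6443         # placeholder API-Server endpoint
    certificate-authority: /path/to/ca.crt    # CA used to verify the API-Server
contexts:
- name: dev
  context:
    cluster: dev-cluster
    user: dev-user
users:
- name: dev-user
  user:
    client-certificate: /path/to/client.crt   # client cert for x509 authentication
    client-key: /path/to/client.key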

Kubectl goes on to parse this file and determine the context it will use to authenticate with the API-Server. Depending on the config, it also embeds all or some of the following values into the request:

  1. UserAgent
  2. Transport
  3. WrapTransport
  4. TLSConfig (includes CA Cert, Client Cert, Key file)
  5. Username
  6. Password
  7. Bearer token etc.

If the user passes any CLI argument such as --username or --client-certificate, it will take precedence.

So once kubectl has auth in place, it generates the final request, encapsulating the auth info with the generated REST request, and sends it over to the API-Server.

API-Server

At this point the API-Server receives the request; it is the first component of the Kubernetes cluster to get involved. The first operation it performs is authentication. Remember that our request carries all the credential information, like username, certificate, token, etc. The API-Server parses the request and attaches an authenticator for each of these, such as the basic-auth handler for username and password, the bearer-token handler for --token-auth-file, the x509 handler for TLS certificates signed by the CA root cert, and the OIDC handler for the OpenID Connect protocol.

The request passes through each of these until one succeeds; if none do, the request is rejected and an aggregate error is returned to the client. If one succeeds, user information is added to the request, which is then passed on to authorization and admission controllers.

Once the request is authenticated, the API-Server then checks whether the client making the request is authorized to perform the operation specified in the request. The authorization process proceeds in a similar manner, with the API-Server attaching multiple authorizers based upon the modes it has been configured with. If all authorizers deny the request, it is rejected with an error, and if any one of them approves, it is allowed. The available authorizers are:

  • ABAC, which enforces policies based on policy files provided for each user/group.
  • RBAC, which enforces policies based upon roles added by the cluster admin as Kubernetes resources (a sketch follows this list).
  • Node, which allows operations on a specific set of resources based on the node they are bound to.
  • Webhook, which delegates the authorization decision to an external service via HTTP POST calls.
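As an illustration of the RBAC case, here is a hedged sketch of a Role and RoleBinding that would allow a user to manage Deployments in a single namespace; the names, namespace and username are placeholders.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-editor
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-editor-binding
  namespace: default
subjects:
- kind: User
  name: dev-user              # must match the authenticated username
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-editor
  apiGroup: rbac.authorization.k8s.io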

After the request is authenticated and authorized, it is passed on to the admission controllers. Here the request is parsed and checked to see if it obeys all the rules and regulations of the cluster. This is the last check before the object is persisted to etcd (more on that later). Admission works by passing the request through various controllers, and if any of these fails the request is rejected with an error.

The admission controllers check the request for security, resilience, resources and taints. The admission controllers that are enabled by default can be viewed by running kube-apiserver -h | grep enable-admission-plugins.
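As one concrete example of a resource-related check, the LimitRanger plugin (typically enabled by default) enforces LimitRange objects such as the following sketch, applying defaults and caps to containers created in the namespace; the values here are placeholders, not recommendations.

apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:              # limit applied when a container specifies none
      cpu: 500m
      memory: 256Mi
    defaultRequest:       # request applied when a container specifies none
      cpu: 250m
      memory: 128Mi
    max:                  # hard cap on container limits
      cpu: "1"
      memory: 512Mi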

ETCD

After the admission controllers have approved the request, it is persisted in our datastore, which is etcd. This is done via the following steps:

  1. For every API object, the API-Server has a route for every version of the object with a specific handler for each (for example, v1 Deployments have a separate route to map the incoming request, and Services a different one, similar to directories on a website).
  2. When the request is parsed, the API-Server iterates over each part of the request. It matches the API version with the objects and invokes a handler for each match; if no match exists, a 404 is returned.
  3. Once the handler is found, the API request is decoded, parsed, and a final validation is performed to check that it matches the specified API's spec.
  4. If validation passes, the resource is committed to etcd by the storage provider and a success HTTP response is sent back.

This ensures that our object is in etcd, but it doesn’t mean it actually exists as a resource in Kubernetes.

Initializer

To make it visible in Kubernetes, a series of initializers is run. An initializer is a controller that runs on the resource committed to etcd before it is made available externally. If there are no initializers configured, the resource goes straight on to be deployed. Initializers are used for functions such as:

  • Injecting secrets into pods
  • Performing validation on secrets
  • Injecting sidecar proxies, etc.

We can have a custom pod initializer by creating an InitializerConfiguration object that specifies the types of resources it applies to. This enables us to enforce certain policies for certain types of pods; for example, we can have PCI- or HIPAA-related custom configuration applied to certain types of pods before they are sent for scheduling.
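A sketch of such an InitializerConfiguration is below. Note that this is the old alpha API (admissionregistration.k8s.io/v1alpha1), which has since been removed from newer Kubernetes releases, and the initializer name and rules here are purely illustrative.

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: InitializerConfiguration
metadata:
  name: sidecar-injector
initializers:
- name: sidecar.example.com       # hypothetical initializer name
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]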

At this point we have a record in etcd that is initialized and ready to run. Turning it into running workloads is achieved by running control loops for each level of the hierarchy. The standard Kubernetes hierarchy for Deployments consists of ReplicaSets, which consist of Pods, the smallest unit.

As soon as the newly initialized record is made available by the API-Server, kube-controller-manager comes into action. Its job is to run various controllers specific to the resources being created. Kubernetes goes top down, so the first one to come into effect is the Deployment controller.

Kube-controller-manager

The Deployment controller looks at the Deployment object just made visible by the API-Server. It parses the object and adds it to its workqueue to be processed. As it continues parsing, it finds the Deployment should be backed by a ReplicaSet, looks at the API-Server, and realises one doesn't exist, so it triggers a scaling process to create the ReplicaSet and match the spec of the Deployment. After the ReplicaSet is created and registered to the current Deployment, the Deployment controller's job is done. If, in the meantime, the state of the Deployment is updated, the same process executes in a loop to match the state.
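The "registration" is visible in the ReplicaSet's metadata: the Deployment controller sets an ownerReference pointing back at the Deployment. Here is a trimmed, illustrative sketch; the name suffix and uid are placeholders.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-5d4c8f7b9d             # Deployment name plus a pod-template hash
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: web
    uid: 7f3c1a2e-0000-0000-0000-000000000000
    controller: true
    blockOwnerDeletion: true
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.17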

Once the Deployment controller is done, we have a Deployment with a ReplicaSet. At this point the ReplicaSet controller kicks in and inspects the newly created ReplicaSet. It realizes this ReplicaSet currently doesn't have a single pod and starts creating the pods. The pods are created in batches using a slow-start mechanism, which acts as a failsafe in the event that a number of pods end up in an error state.

At this point we have all the resources added in etcd and the API-Server has made them visible. The pods, however, are still in the Pending state because they have not been assigned to a node. The final control process that acts on them is the scheduler, which will push them to a particular worker node.

Scheduler

The scheduler runs as an independent process within the control plane, and as with every other control process, it attempts to move the current state toward the desired state recorded in etcd. The scheduler finds all pods that do not have any node assigned to them; that is, pods with an empty NodeName field.

Scheduling is an optimization problem, and it starts with finding suitable nodes that have the resources to host a pod. For example, if a pod has a specific set of CPU and memory requirements passed in the request, then only those nodes that can satisfy them will be selected. This is handled by a set of predicates that execute in a chain, filtering nodes on each given parameter, like ports, hostname, resources, and node pressure in terms of processes and CPU usage. As nodes get evaluated against the parameters, each node gets a ranking showing its suitability, with the highest-ranking node finally getting selected. Once a node is selected, a binding object is created which has the namespace, pod name and UID, with a reference to the selected node. This binding object is then sent to the API-Server via a POST request. Once the API-Server receives this request, it updates the etcd entry for the pod object with the given node name and sets the PodScheduled condition to true. Once this update is done, it's time to get the pod running on a worker node.
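The binding object itself is a small API object. A hedged sketch of what the scheduler POSTs, with placeholder pod and node names:

apiVersion: v1
kind: Binding
metadata:
  name: web-5d4c8f7b9d-abcde      # the pod being scheduled
  namespace: default
target:
  apiVersion: v1
  kind: Node
  name: worker-node-1             # the node chosen by the scheduler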

At this point the object exists in etcd with no actual physical resources assigned. Next, the object has to reach the node referenced in its etcd entry. This is done by a pull mechanism executed by the kubelet.

Kubelet

The kubelet is an agent that runs on each worker node. It polls the API-Server for pods bound to the node it is running on, by default every 20 seconds. If it detects a change compared to its own state, it begins to synchronize to the new state. It works through the following steps:

  1. If it's a new pod, register it and publish startup metrics.
  2. Generate a pod status object with possible values like Pending, Running, Succeeded, Failed and Unknown, as these represent the states a pod can be in. It is determined by running the pod spec through a chain of PodSyncHandlers. Each handler checks whether the pod should run on the node; if any one of these fails, the pod transitions to the Evicted state.
  3. Upon generation of the pod status, it is sent to the API-Server to be updated in etcd. The pod is then run through a set of node-level admission handlers, like AppArmor profiles and privilege evaluations.
  4. If the pod has specific cgroup requirements, these are enforced and attached to the pod.
  5. Data directories are then created for pod data, volumes and related plugins.
  6. Any volumes required are created and attached.
  7. Secrets, if needed, are pulled from the API-Server and made available for injection into the pod.
  8. Image-related info, such as pull secrets and the image URL, is gathered and made available for the container runtime to execute.
  9. Finally, all this info is passed to the container runtime to actually run the container (a pod spec fragment illustrating several of these items follows this list).
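Several of the items above map directly to fields in the pod spec the kubelet is acting on. Here is an illustrative fragment; the names, image and secret are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: web-5d4c8f7b9d-abcde
spec:
  containers:
  - name: web
    image: registry.example.com/web:1.0      # placeholder image URL (step 8)
    resources:                               # cgroup requirements (step 4)
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
    volumeMounts:
    - name: api-credentials
      mountPath: /etc/secrets
      readOnly: true
  volumes:
  - name: api-credentials                    # secret pulled from the API-Server (step 7)
    secret:
      secretName: api-credentials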

The time has now come to make our container live by actually running it on the physical resources. This is achieved by invoking the CRI (Container Runtime Interface), which is an interface layer between Kubernetes and container runtimes such as Docker, containerd, rkt and more.

CRI

The kubelet invokes a remote procedure call on the CRI called RunPodSandbox. This creates the sandbox that holds the set of containers specified in the pod as one object. In the case of Docker, this is a pause container, which serves as the parent/binding container for all other containers in the pod. The pause container reserves Linux resources such as the PID, IPC and network namespaces. These are made available to all of the containers that are part of the pod the pause container anchors. Linux namespaces allow the kernel to assign a slice of resources to a particular set of processes, and cgroups help manage those resources; Docker uses both to carve out containers with independent sets of resources. So our pause container carves out these resources, all child containers end up sharing them, and the child containers are controlled as part of one namespace.
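One visible consequence of this shared sandbox is that containers in the same pod can reach each other over localhost. A minimal sketch, with placeholder names and images:

apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
  - name: web
    image: nginx:1.17
    ports:
    - containerPort: 80
  - name: log-shipper
    image: busybox:1.31
    # reaches the web container via the shared network namespace
    command: ["sh", "-c", "while true; do wget -qO- http://localhost:80 > /dev/null; sleep 30; done"]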

CNI

We now have a container with its share of physical resources, ready to be run. It needs to go through two more steps: getting a network and getting started by running the image. The task of assigning and attaching the network falls to a CNI plugin, where CNI stands for "Container Network Interface." This works as follows:

  1. The kubelet sends the container config, like namespace, name and type of network, to the CNI plugin.
  2. The plugin begins setting up a bridge network in the root (host) network namespace.
  3. It then creates a veth interface pair, with one end in the container and the other plugged into the bridge created previously.
  4. It then allocates an IP to the pod, updates the IPAM with a pod entry that marks the IP as reserved, and updates the routes.
  5. Finally, it injects the DNS config, which is written into the container's resolv.conf (a related pod spec sketch follows this list).
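The DNS settings that end up in the pod's resolv.conf can also be seen, and overridden, from the pod spec via the dnsConfig field. A hedged sketch; the nameserver address and search domains are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: dns-example
spec:
  dnsPolicy: "None"           # use only the values below
  dnsConfig:
    nameservers:
    - 10.96.0.10              # placeholder cluster DNS service IP
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    options:
    - name: ndots
      value: "5"
  containers:
  - name: app
    image: busybox:1.31
    command: ["sleep", "3600"]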

This process enables the pod to communicate with the host and with other pods on the same host. For inter-host networking, an overlay network provider is usually used; some of the popular ones are Flannel, Kube-OVN and Jaguar. Inter-host networking is achieved by running a network that spans all hosts, encapsulating packets in UDP datagrams and assigning destination addresses based on the route table maintained by the overlay network provider.

Now we are at the final step in the container lifecycle, running the workload. This happens via the following steps:

  1. Pull Image — this step downloads the image from the image registry.
  2. Pull secrets — this step makes the secrets defined at initialization available at run time.
  3. Create the container — this step binds all the resources and specs like image, label, command, volumes and environment variables, and passes it on to CRI to execute a run container sequence.
  4. Finally, the container is started, post-start lifecycle hooks are run, and their results are evaluated. If these fail, the container is marked as failed (a related sketch follows this list).
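Steps 1 and 4 map to fields we can set in the pod spec: an imagePullSecret used for the registry pull, and lifecycle hooks that run around container start and stop. An illustrative fragment, with placeholder names, image and commands:

apiVersion: v1
kind: Pod
metadata:
  name: web-with-hooks
spec:
  imagePullSecrets:
  - name: registry-credentials          # used during the "pull image" step
  containers:
  - name: web
    image: registry.example.com/web:1.0
    lifecycle:
      postStart:
        exec:
          command: ["sh", "-c", "echo started > /tmp/ready"]
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]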

At this point we have a container running with all its resources and reported as Running on the API-Server.
