Why is the architecture of Kubernetes like this now?

Mitch Shao
Apr 27, 2019 · 13 min read


In this article, I will introduce the architecture of Kubernetes in detail. Before reading on, you should already know what Kubernetes is and be familiar with its basic concepts.

This is the second article in a series on the architecture of scheduling systems. The first article, which you can read here, covers the evolution of cluster scheduling systems by walking through the architectures of Hadoop MRv1, YARN, and Mesos. So what does the architecture of the latest and most popular system, Kubernetes, look like? This article introduces the overall structure of Kubernetes and then digs into two deeper questions.

Architecture of Kubernetes

First, the official Kubernetes architecture diagram looks like this:

This architecture diagram may be a little confusing. That's fine: below is a simplified version of the official diagram that is easier to understand:

  • ETCD stores all of Kubernetes' state. Beyond plain state storage, it also provides event subscription and leader election. For event subscription, the components do not communicate by calling each other's APIs directly; instead, a component writes its state into ETCD (effectively publishing a message), and the other components watch ETCD for state changes and perform the follow-up processing (see the sketch after this list). For leader election, components such as the Scheduler run multiple (usually 3) instances for high availability; one of them is elected through ETCD to be the master and the rest stand by.
  • API Server: since ETCD is the core of the whole system, all communication among components has to go through it. In fact, the components do not access ETCD directly; they go through a proxy that wraps the ETCD calls behind a standard RESTful API. The proxy also implements additional features such as authentication and caching. This proxy is the API Server.
  • Controller Manager implements task scheduling (see the previous article for what task scheduling means). In short, anything that directly asks Kubernetes for scheduling is a task, such as a Deployment, DaemonSet, or Job. After a task request reaches Kubernetes, it is handled by the Controller Manager, and each task type has its own controller: a Deployment is handled by the Deployment Controller, a DaemonSet by the DaemonSet Controller, and so on.
  • Scheduler performs resource scheduling (again, see the previous article for what resource scheduling means). The Controller Manager writes resource requirements, which are in fact Pods, into ETCD. When the Scheduler sees a new Pod that needs to be scheduled, it assigns the Pod to a specific node based on the state of the whole cluster.
  • Kubelet is an agent that runs on every node. It watches the Pod information in ETCD, and when a Pod is assigned to its node, it runs the corresponding Pod there and writes the status back to ETCD.
  • Kubectl is a command-line tool that calls the API Server to send requests, writing state into ETCD or querying state from it.
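
To make the "write state, watch for changes" communication pattern more concrete, here is a minimal sketch in Python. It is a toy model, not the real Kubernetes or etcd client API: the Store class stands in for ETCD behind the API Server, and the controller simply reacts to change events instead of calling other components directly.

```python
import queue
import threading
import time

class Store:
    """Toy stand-in for ETCD as exposed through the API Server (hypothetical API)."""
    def __init__(self):
        self.objects = {}
        self.events = queue.Queue()

    def write(self, key, obj):
        event = "MODIFIED" if key in self.objects else "ADDED"
        self.objects[key] = obj
        self.events.put((event, key, obj))   # publish the state change

def controller_loop(store, reconcile):
    """Each component watches for events and reconciles, rather than calling its peers."""
    while True:
        event, key, obj = store.events.get()
        reconcile(event, key, obj)

store = Store()
threading.Thread(
    target=controller_loop,
    args=(store, lambda e, k, o: print(f"{e} {k}: reconciling to {o}")),
    daemon=True,
).start()

store.write("deployments/nginx", {"replicas": 3})   # another component writes state
time.sleep(0.1)                                      # give the watcher time to react
```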

Isn't that much clearer? If it is still unclear, let's walk through the whole process of deploying a service. Suppose you want to run Nginx with multiple instances. Inside Kubernetes, the whole process looks like this:

  • From the kubectl command line, you create a Deployment object containing Nginx. Kubectl calls the API Server to write the Deployment object into ETCD. When the Deployment Controller sees the new Deployment object, it reads the object information, performs task scheduling, and creates the corresponding ReplicaSet object.
  • When the ReplicaSet Controller sees the new ReplicaSet object, it reads the object information, performs task scheduling, and creates the corresponding Pods.
  • When the Scheduler sees the new Pods, it reads the Pod object information, schedules each Pod to a node based on the cluster state, and updates the Pod (internally, this binds the Pod to the Node).
  • When the Kubelet on that node sees that a new Pod has been assigned to it, it runs the Pod according to the object information.

That is how a Deployment is processed inside Kubernetes. The walkthrough is meant to explain the responsibilities of each component and how they work together; many details are omitted.
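
If you prefer code to a CLI, here is a hedged sketch of the first step using the official Kubernetes Python client (the `kubernetes` package). The Deployment name `nginx-demo`, the replica count, and the image tag are illustrative; everything after the create call happens asynchronously inside the cluster, exactly as the steps above describe.

```python
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config, the same credentials kubectl uses

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "nginx-demo"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "nginx-demo"}},
        "template": {
            "metadata": {"labels": {"app": "nginx-demo"}},
            "spec": {"containers": [{"name": "nginx", "image": "nginx:1.25"}]},
        },
    },
}

# The API Server validates the object and writes it into ETCD; the Deployment
# Controller, ReplicaSet Controller, Scheduler, and kubelets take over from there.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```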

Up to now, we have studied several very typical scheduling systems: Hadoop MRv1, YARN, Mesos, and Kubernetes. After learning the architectures of these scheduling systems, I was left with two big questions:

  1. Is Kubernetes a two-level scheduling architecture like Mesos, and how does its extensibility compare with Mesos?
  2. Why can't any of these scheduling systems scale horizontally?

We will discuss these two issues in depth later.

Is Kubernetes a two-level scheduler like Mesos, and how does its extensibility compare with Mesos?

In Google's paper about its internal Omega scheduling system, scheduling systems are divided into three categories: Monolithic, Two-level scheduling, and Shared state. By that classification, Google Borg is usually put into the Monolithic category, Mesos is considered two-level scheduling, and Google's own Omega belongs to the third category, "Shared state". The author of the paper was also one of the designers of Mesos, who later joined Google to design the new Omega system and published the paper. The main purpose of the paper is to propose a new "Shared State" model to solve the performance and scalability problems of scheduling systems. But I actually think the Shared State model is too idealistic: the Omega system built on it does not seem to be widely used inside Google, and no large-scale scheduling system uses the Shared State model.

Because most of Kubernetes' design is inherited from Borg, its core components (Controller Manager and Scheduler) are bundled and deployed together, and all state is stored in ETCD, people usually regard Kubernetes as a "Monolithic" scheduling system. I actually disagree with that.

In my opinion, Kubernetes' scheduling model is also two-level scheduling, just like Mesos. Task scheduling and resource scheduling are completely separated: the Controller Manager is responsible for task scheduling, while the Scheduler is responsible for resource scheduling.

In fact, the biggest difference between Kubernetes and Mesos scheduling is how resource scheduling is requested: push or pull.

  • Proactive push is the approach Mesos adopts: the resource scheduling component (the Mesos Master) proactively pushes resource Offers to the Framework. The Framework cannot proactively request resources; it can only decide to accept or reject each Offer based on the information it contains.
  • Passive pull is the approach Kubernetes adopts: the resource scheduling component (the Scheduler) passively responds to the Controller Manager's resource requests. The two styles are contrasted in the sketch below.
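
The following sketch only contrasts the two interaction styles; none of the function names are real Mesos or Kubernetes APIs, the point is simply who initiates the resource scheduling.

```python
# Mesos-style push: the resource scheduler offers resources, the framework reacts.
def mesos_style(master, framework):
    for offer in master.make_offers():            # master pushes offers
        if framework.wants(offer):
            master.launch(framework.tasks_for(offer), offer)   # accept the offer
        else:
            master.decline(offer)                 # or reject it, nothing else

# Kubernetes-style pull: the task scheduler (Controller Manager) creates Pods,
# and the resource scheduler picks them up and binds them to nodes.
def kubernetes_style(api_server, scheduler):
    for pod in api_server.unscheduled_pods():     # scheduler pulls pending work
        node = scheduler.pick_node(pod, api_server.all_nodes())  # global view
        api_server.bind(pod, node)
```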

I will compare the two approaches from the following five aspects. Note that all the advantages and disadvantages below are theoretical: engineering implementations will differ, and I have not measured some of these indicators in practice myself.

1) Resource utilization: Kubernetes wins

Theoretically, Kubernetes should be able to achieve more efficient cluster resource utilization, because the responsibility for resource scheduling lies entirely with one component, the Scheduler, which has enough information to allocate resources from a global perspective. Mesos cannot do that, because its resource scheduling responsibility is split between two components, the Framework and the Mesos Master. When a Framework selects an Offer, it has no workload information from the other Frameworks, so it cannot make a globally optimal decision. For example, suppose we want to schedule a CPU-heavy workload and a memory-heavy workload onto the same host: that is hard to achieve in Mesos when the two workloads belong to different Frameworks.
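
As a toy illustration of why the global view matters, here is a made-up scoring function of the kind a single scheduler could use: because it sees every node's current CPU and memory usage, it can prefer placements that keep the two resources balanced instead of stranding one of them. This is only a sketch of the idea, not the real kube-scheduler scoring code.

```python
def score(node, pod):
    """Higher is better; -1 means the pod does not fit on this node."""
    cpu_left = node["cpu_total"] - node["cpu_used"] - pod["cpu"]
    mem_left = node["mem_total"] - node["mem_used"] - pod["mem"]
    if cpu_left < 0 or mem_left < 0:
        return -1
    # Reward nodes where the leftover CPU and memory fractions stay balanced,
    # so a memory-heavy pod naturally lands next to a CPU-heavy workload.
    return 1 - abs(cpu_left / node["cpu_total"] - mem_left / node["mem_total"])

def pick_node(nodes, pod):
    best = max(nodes, key=lambda n: score(n, pod))
    return best if score(best, pod) >= 0 else None   # None: nothing fits

nodes = [{"cpu_total": 8, "cpu_used": 6, "mem_total": 32, "mem_used": 4},
         {"cpu_total": 8, "cpu_used": 1, "mem_total": 32, "mem_used": 28}]
print(pick_node(nodes, {"cpu": 1, "mem": 16}))   # memory-heavy pod lands on the CPU-busy node
```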

2) Extensibility: Mesos wins

Extensibility here means how easy it is to support a new type of workload scheduling. In theory, Mesos is more extensible, because its approach to resource scheduling makes it easier to migrate an existing task scheduling system. For example, suppose you already have a task scheduling system such as Spark and now want to move it onto a cluster scheduling platform: in theory it is easier to migrate it to Mesos than to Kubernetes.

When migrating to Mesos, the original workflow and logic barely change. The original logic is: a job request arrives, the scheduling system splits the job into small tasks, selects a node from the resource pool to run each task, and records the selected node's IP and port so it can track the task's status. After migrating to Mesos, the logic stays the same; the only thing that changes is the resource pool, which used to be a self-managed pool and now becomes the list of Offers provided by Mesos.
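
As a sketch of that point: the framework's original "pick a node from my pool" loop survives almost unchanged, and only the origin of the pool changes. The data structures below are made up for illustration and are not the real Mesos API.

```python
def run_job(job_tasks, offers):
    """job_tasks: list of {"cpu", "mem"}; offers: list of {"host", "cpu", "mem"}."""
    running = []
    for task in job_tasks:
        # Was: pick a node from a self-managed resource pool. Now: pick from the Offers.
        offer = next(o for o in offers
                     if o["cpu"] >= task["cpu"] and o["mem"] >= task["mem"])
        offers.remove(offer)                                    # accept (consume) this Offer
        running.append({"task": task, "host": offer["host"]})   # record where it runs
    return running   # track status by the recorded host, exactly as before

# Usage: the tracking logic downstream is unchanged; only the pool's origin changed.
print(run_job([{"cpu": 1, "mem": 2}], [{"host": "10.0.0.5", "cpu": 4, "mem": 8}]))
```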

When migrating to Kubernetes, the original logic has to be modified to fit Kubernetes: resource scheduling must be done entirely by calling an external component, and the process becomes asynchronous.
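
Here is a hedged sketch of what that asynchronous hand-off looks like with the official Kubernetes Python client: the migrated framework no longer picks a node itself; it creates a Pod and then waits for the Scheduler to bind it to a node. The pod name `spark-task-0` and its image are illustrative.

```python
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "spark-task-0"},
    "spec": {"containers": [{"name": "task", "image": "busybox",
                             "command": ["sleep", "3600"]}]},
}
core.create_namespaced_pod(namespace="default", body=pod)

# Resource scheduling now happens elsewhere and asynchronously: poll (or watch)
# until the Scheduler has assigned a node to the Pod.
while True:
    status = core.read_namespaced_pod(name="spark-task-0", namespace="default")
    if status.spec.node_name:
        print("scheduled onto", status.spec.node_name)
        break
    time.sleep(1)
```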

3) Flexibility of task scheduling policies: Mesos wins

Mesos can also support more flexible scheduling policies for different kinds of tasks. For example, if a job requires an "All or Nothing" policy, Mesos can make it happen but Kubernetes cannot. Here "All or Nothing" means that if the whole job consists of 10 tasks, all 10 tasks must be able to start with their resources at the same time; otherwise none of them starts.
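
A framework on Mesos could implement this on its own, roughly as sketched below: it only launches the tasks if every one of them can be placed on the Offers in hand, and otherwise declines everything and waits for the next round. The names and structures are made up; this is not the Mesos API.

```python
def try_gang_launch(tasks, offers):
    placement = []
    remaining = list(offers)
    for task in tasks:
        fit = next((o for o in remaining
                    if o["cpu"] >= task["cpu"] and o["mem"] >= task["mem"]), None)
        if fit is None:
            return None          # decline all Offers and wait for the next round
        remaining.remove(fit)
        placement.append((task, fit))
    return placement             # launch all tasks together, or none at all

tasks = [{"cpu": 1, "mem": 1}] * 3
print(try_gang_launch(tasks, [{"cpu": 2, "mem": 2}] * 3))
```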

4) Performance: Mesos wins

Mesos should perform better, because its resource scheduling component, the Mesos Master, hands part of the resource scheduling responsibility over to the Framework: the Framework is responsible for selecting resources from the pool of Offers the Mesos Master provides, which makes the Mesos Master's own job easier. Performance testing data also bears this out. Years ago, Twitter's own Mesos cluster could already manage over 80,000 nodes, while Kubernetes 1.13 could only support 5,000 nodes.

5) Scheduling delay: Kubernetes wins

Kubernetes should have better scheduling latency, because Mesos's scheduling mechanism provides Offers to Frameworks in turn, and it wastes a lot of time providing Offers to Frameworks that do not actually need resources.

Why does no scheduling system support a scale-out architecture?

You may have noticed that almost no cluster scheduling system can scale out. For example, the management node of early Hadoop MRv1 was a single node, and a cluster topped out at around 5,000 machines. In YARN, only one instance of the resource manager is active and the rest are backups, with an upper limit of about 10,000 nodes per cluster. With optimization, a Mesos cluster can manage 80,000 nodes. The current Kubernetes version, 1.13, supports at most 5,000 nodes per cluster.

No cluster scheduling system has a scale-out architecture, and the only way to manage more servers is to create multiple clusters. The general architecture of a cluster scheduling system looks like this:

The central resource scheduler is the most central component. Although it is often deployed as multiple (usually three) instances, they are single-active, meaning that only one instance is working while the others are on standby. Why is that? It does not seem to follow the architectural design principles of Internet applications. Nowadays most Internet applications, such as e-commerce systems, can easily scale horizontally with distributed techniques: during a big promotion, you can improve throughput simply by adding servers to the cluster. If the scheduler followed the architecture of an Internet application, it would look like this:

The Scheduler would be multi-active, with any number of instances providing service together. When a consumer or a provider of resources accesses the Scheduler, the request would go through a load-balancing component or device that routes it to one of the Scheduler instances. Why does this architecture not work for cluster scheduling systems? To understand this, let's discuss the prerequisites for horizontal scaling through an example of an Internet application architecture.

The prerequisites for a horizontally scaling architecture

Suppose we want to implement such an e-commerce system:

  1. This is a second-hand book trading platform. Many sellers offer second-hand books on the platform; let's call each used book an Inventory item.
  2. Using the barcode of each second-hand book in a seller's inventory, the book can be found in the book catalog. Let's call this catalog entry a Product.
  3. When a seller registers a second-hand book as Inventory, in addition to specifying which Product it belongs to, they also need to enter other information, such as its condition, price, shipping address, and so on.
  4. Buyers browse the Product catalog, select a book, and place an order. According to the buyer's requirements (price preference, delivery address, etc.), an algorithm matches the order against all the Inventory behind that Product and picks an item that meets the requirements. We can call this process order matching.

Looking at the model, there is little difference between this e-commerce system and a cluster scheduling system. On the platform there is a resource provider (the seller) who provides resources (Inventory) that make up a resource pool (all the Inventory), and there are resource consumers (buyers) who present demands for resources. The resource scheduler (the order system) then matches a resource (a second-hand book from the Inventory) according to some algorithm. Clearly, the e-commerce system can be designed as a scale-out architecture. Why? What is the difference between the e-commerce system and the cluster scheduling system? Before answering that, let me answer another question first: is there an upper limit on how far this e-commerce system can scale horizontally? What is that limit, and what factors determine it?

The maximum number of concurrent requests a system can handle determines the maximum number of scale-out instances

How should we understand this? Assume the system's architecture has no physical constraints (machine resources, bandwidth, etc.) and can handle 1,000 concurrent requests. Then obviously the upper limit on the number of scale-out instances is 1,000: even if you deploy 1,001 instances, at any given moment at least one instance is idle, and deploying more instances cannot improve the performance of the system at all. The question then becomes: what determines the number of concurrent requests the system can theoretically handle?

The maximum number of concurrent requests a system can handle is determined by the number of its "independent resource pools".

"Independent resource pool" is a term I made up because I couldn't think of a better one. Taking the e-commerce system above as an example, what determines the maximum number of concurrent requests (purchase order requests) that the order system can theoretically handle? Take a look at the picture below:

When the order system performs matching, it effectively runs like this: when an order request comes in, it is queued by the product being purchased, so requests to buy the same product end up in the same queue. The order scheduling system then processes the requests in each queue one by one. Whenever an order needs to be matched, the system must select the best-matching Inventory item from all the Inventory of that product.
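
Here is a toy sketch of that per-product serialization, using one in-process lock per Product as the "queue": orders for the same book are processed strictly one at a time, while orders for different books can proceed in parallel. All names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

products = ["Steve Jobs", "SICP"]
product_locks = {p: Lock() for p in products}   # one serialization point per Product

def process_order(order):
    with product_locks[order["product"]]:
        # Inside the lock: read the product's whole inventory, pick the best match,
        # and mark that Inventory item as sold (details omitted).
        return f'matched an inventory item for "{order["product"]}"'

orders = [{"product": "Steve Jobs"}, {"product": "Steve Jobs"}, {"product": "SICP"}]
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(process_order, orders)))
```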

When implementing the system, the queue may not be a message queue; it could be a relational database lock. For example, for an order to buy a copy of "Steve Jobs", the system first has to find all the Inventory records of that book and lock those records in the database. The lock can only be released after the order match is done and the inventory is updated (the matched Inventory item is set to the "unavailable" state).

During that time, all subsequent order requests for "Steve Jobs" wait in the queue. Some systems use an "optimistic lock" for better performance: the inventory is not locked at the beginning of processing an order; it is only locked at the last step, when the inventory is updated. If two orders happen to match the same Inventory item, one of them has to give up and retry. The two implementations differ, but the principle is the same.
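
And a toy sketch of the optimistic variant: the inventory is read without any lock, a match is chosen, and the update only succeeds if the chosen item is still available; a losing order simply retries. In a real system the commit step would be a conditional UPDATE or compare-and-swap; here it is simplified to a plain check, and all names are illustrative.

```python
def match_order(order, inventory):
    while True:
        candidates = [item for item in inventory
                      if item["product"] == order["product"] and item["state"] == "available"]
        if not candidates:
            return None                                           # nothing left to match
        best = min(candidates, key=lambda item: item["price"])    # stand-in "best match" rule
        # Commit: in a real system this is a conditional UPDATE / compare-and-swap.
        if best["state"] == "available":
            best["state"] = "sold"
            return best
        # A concurrent order won the race for this item: loop and retry with fresh data.

books = [{"product": "Steve Jobs", "price": 12, "state": "available"},
         {"product": "Steve Jobs", "price": 9,  "state": "available"}]
print(match_order({"product": "Steve Jobs"}, books))    # matches the cheaper copy
```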

So why, in the story above, do all purchase orders for the same product need to be processed in a queue? Because every time we do an order match, we need all the Inventory information of that product ("Steve Jobs" in the example), and we end up modifying the state of part of that information. In this order-matching scenario, we call all the Inventory information of "Steve Jobs" an "independent resource pool". The maximum number of orders the "scheduling system" can process concurrently depends entirely on the number of independent resource pools, that is, on the number of Products. If this used-book platform sold only one Product, "Steve Jobs", then all requests would end up in one queue, and the system would be almost impossible to scale out.

The number of “independent resource pools” in a cluster scheduling system is ONE

Now let's look back at the cluster scheduling system. Each server node is a resource. Whenever a resource consumer requests a resource, what is the "independent resource pool" that the scheduling algorithm operates on? The answer is the entire cluster's resources, and there is no way to split it further, for two reasons:

  1. The responsibility of the scheduling system is to find the globally optimal resource match, so the scheduling algorithm needs the resources of the entire cluster.
  2. Even if a globally optimal match is not required, the resource scheduler still has no way to decide in advance which subset of the cluster's resources it should choose from.

For that very reason, the number of "independent resource pools" is 1, and so the cluster scheduling system cannot scale out.
