Kubernetes fault tolerance mechanisms

Understanding Kubernetes architecture through its fault tolerance mechanisms

Oleksandr Pochapskyy
Mar 29, 2023 · 6 min read

In this article, I want to describe the essential components of the Kubernetes architecture and how they work together to make a Kubernetes cluster fault tolerant. This knowledge might be useful for DevOps engineers who manage on-premise Kubernetes clusters, but also for anyone interested in fault tolerance mechanisms in distributed systems.

Kubernetes components overview

Control plane

API Server — the gateway to the control plane; it exposes a REST API for managing Kubernetes cluster resources, watching for changes in the cluster state, and extending the standard set of resources
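
To illustrate that API, here is a minimal sketch in Go using client-go that connects through a kubeconfig and streams pod change events from the API Server. The kubeconfig path and the default namespace are placeholder assumptions, not something prescribed by Kubernetes.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (the path is an assumption).
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pod events in the "default" namespace via the API Server's watch API.
	watcher, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range watcher.ResultChan() {
		fmt.Printf("event: %s\n", event.Type)
	}
}
```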

etcd — a strongly consistent key-value store that holds the current cluster state and the desired state expressed by the resources the engineer has provided. It also has a useful watch feature that streams change notifications for a key or prefix, which the API Server uses to stream events to clients
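
To make the watch feature concrete, here is a minimal sketch using the official Go client go.etcd.io/etcd/client/v3: it writes a key and then streams change notifications for a prefix, roughly the pattern the API Server relies on. The endpoint and key names are illustrative assumptions, not the keys Kubernetes actually uses.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a local etcd instance (the endpoint is an assumption).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Write a key and remember the revision of the write.
	putResp, err := cli.Put(ctx, "/demo/pods/nginx", "spec-v1")
	if err != nil {
		panic(err)
	}

	// Stream change notifications for everything under the prefix, starting
	// from the revision of the write above so that it is included.
	for resp := range cli.Watch(ctx, "/demo/pods/", clientv3.WithPrefix(), clientv3.WithRev(putResp.Header.Revision)) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```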

Scheduler — the service responsible for assigning pods to nodes based on the resources each pod requests and the current load on the worker nodes. The strategy for this assignment can be plugged in by the engineer

Controller Manager — the component that runs the control-loop controllers that keep different cluster resources in their desired state. It contains core controllers such as the ReplicaSet, Deployment, and Node controllers, but it can also be extended with custom controllers relevant to your operational environment
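
The control-loop idea itself is simple: observe the actual state, compare it with the desired state, and act on the difference. Below is a minimal, illustrative Go sketch of the pattern; the fakeCluster type, the my-api name and the replica counts are made up, and real controllers react to watch events from the API Server rather than polling on a timer.

```go
package main

import (
	"fmt"
	"time"
)

// fakeCluster is an in-memory stand-in for the cluster state kept by the API Server.
type fakeCluster struct{ replicas map[string]int }

func (c *fakeCluster) RunningReplicas(app string) int { return c.replicas[app] }
func (c *fakeCluster) StartReplica(app string)        { c.replicas[app]++; fmt.Println("started", app) }
func (c *fakeCluster) StopReplica(app string)         { c.replicas[app]--; fmt.Println("stopped", app) }

// reconcile nudges the actual replica count toward the desired one.
func reconcile(c *fakeCluster, app string, desired int) {
	switch actual := c.RunningReplicas(app); {
	case actual < desired:
		c.StartReplica(app)
	case actual > desired:
		c.StopReplica(app)
	}
}

func main() {
	c := &fakeCluster{replicas: map[string]int{"my-api": 1}}
	for i := 0; i < 5; i++ {
		reconcile(c, "my-api", 3)          // desired state: 3 replicas
		time.Sleep(100 * time.Millisecond) // real controllers are event-driven, not polling
	}
}
```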

Data plane

kubelet — an agent running on every data plane node that acts as a bridge between the data plane and the control plane components. It is responsible for worker node registration, image pulling, volume mounting, pod health monitoring, and reporting pod status to the API Server

kube-proxy — a component running on every data plane node that ensures proper network routing to pods (by changing iptables rules) and enforces traffic policies

container runtime — a typical container management runtime such as Docker or containerd that uses the cgroups and namespaces Linux kernel features to isolate and constrain every application we want to run
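
To give a feel for the namespace part of that isolation, here is a minimal, Linux-only Go sketch that starts a shell in new UTS, PID and mount namespaces, which is conceptually what a container runtime does (among many other things). It requires root privileges and is a toy illustration, not how Docker or containerd are actually implemented.

```go
//go:build linux

package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run a shell in new UTS, PID and mount namespaces (requires root).
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```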

What is fault tolerance in distributed systems

Fault tolerance is the ability of a system to continue operating when part of the system fails. In the book “Designing Data-Intensive Applications”, Martin Kleppmann describes various types of faults that are associated with hardware, software, or human errors.

There are different strategies for overcoming faults in distributed systems, including:

  • process redundancy — several instances of a process run on different machines, so when one fails the others can still process requests
  • data replication — keeping a copy of the data on several machines, so that when one machine fails the others can continue serving clients
  • checkpointing — storing the state of the system after every unit of work, so that it is easier to recover the state after a failure (a small sketch follows below)

These strategies might be combined to increase the overall fault tolerance of the system.
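
To make the checkpointing strategy concrete, here is a minimal Go sketch that persists the index of the last processed work item after each unit of work, so a restarted process can resume where it left off. The checkpoint file path and the work items are made up.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const checkpointFile = "/tmp/checkpoint" // assumed location

// loadCheckpoint returns the index of the next item to process.
func loadCheckpoint() int {
	data, err := os.ReadFile(checkpointFile)
	if err != nil {
		return 0 // no checkpoint yet: start from the beginning
	}
	n, _ := strconv.Atoi(string(data))
	return n
}

// saveCheckpoint records progress so a restart can skip finished work.
func saveCheckpoint(i int) {
	_ = os.WriteFile(checkpointFile, []byte(strconv.Itoa(i)), 0o644)
}

func main() {
	work := []string{"job-a", "job-b", "job-c", "job-d"}
	for i := loadCheckpoint(); i < len(work); i++ {
		fmt.Println("processing", work[i])
		saveCheckpoint(i + 1) // checkpoint after every unit of work
	}
}
```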

Kubernetes relies on process redundancy and data replication for fault tolerance, and we will review both below.

How Kubernetes manages fault tolerance

Let’s see how Kubernetes tolerates faults in the control plane and the data plane.

Control plane faults

First of all, let’s set aside the scenario where Kubernetes runs on a single node. That is not a fault tolerant deployment mode and should be used only for testing or experimenting. So here we consider only the HA mode. HA mode means running a cluster with a minimum of 3 nodes (depending on the topology we choose), each running its own instance of the control plane components. Let’s see how every component of the control plane supports fault tolerance in this scenario:

  • API Server is a stateless REST service that doesn’t need any internal coordination to process requests. The usual way to make such services fault tolerant is process redundancy with a load balancer in front of them. The load balancer monitors the instances and routes requests only to live API Server replicas. When some replicas fail, we only lose overall request throughput, but the service remains operational as long as at least one API Server replica is alive.
  • etcd uses a combination of leader-based data replication and process redundancy with the Raft distributed consensus protocol to make sure the data stays consistent even when instances fail. Raft requires a quorum, i.e. a majority of nodes, to be available for its strong consistency properties to hold. For example, before confirming to a client that a write succeeded, etcd must replicate the data to a majority of nodes: a 3-node cluster needs 2 nodes to acknowledge a write and can tolerate 1 failure, while a 5-node cluster tolerates 2. This helps because when a network partition causes a split-brain situation, the side of the partition that holds the majority of nodes always contains a node with the latest committed data. Likewise, if a minority of nodes fail, there is always a node that contains the last committed data, so etcd can recover from both situations using the Raft protocol routines.
    The situation where the majority of nodes have failed is clear: etcd cannot commit any write request, and so the Kubernetes control plane cannot operate properly.
    What happens when an etcd follower node fails? In theory, there should be no interruption of service.
    Finally, when the leader node fails, etcd also cannot accept writes for a short period of time, because the leader handles all write requests. When followers stop receiving the leader’s heartbeats, they elect a new leader among themselves using the Raft protocol, and etcd starts accepting write requests again.
  • The Scheduler and the Controller Manager both use process redundancy with one active leader and the other replicas on standby. Having only one active process is required to avoid conflicts and to keep the cluster state consistent. Kubernetes has an internal leader election mechanism implemented on top of the API Server, based on the concept of leases: the first replica that acquires the lock on the lease object becomes the leader and keeps renewing the lease to signal its healthy status to the other replicas. The other instances stay on standby, doing nothing except watching whether the leader has failed (a minimal leader-election sketch follows this list).
    So when a non-leader process fails, the Scheduler or Controller Manager simply loses some of its redundant replicas, and the risk of a complete failure increases.
    When the leader instance fails, no work can be done until a new leader is elected from the remaining replicas.
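
Below is the leader-election sketch referenced above, written in Go with client-go’s leaderelection package, which uses the same Lease-based mechanism the Scheduler and Controller Manager rely on. The kubeconfig path, lease name, namespace and timing values are placeholder assumptions.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config") // assumed path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Each replica identifies itself; only one of them holds the Lease at a time.
	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "demo-controller", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // how long a lease is valid without renewal
		RenewDeadline:   10 * time.Second, // the leader must renew before this elapses
		RetryPeriod:     2 * time.Second,  // how often standbys retry acquiring the lock
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				fmt.Println("became leader, starting work")
			},
			OnStoppedLeading: func() {
				fmt.Println("lost leadership, stopping")
			},
			OnNewLeader: func(identity string) {
				fmt.Println("current leader:", identity)
			},
		},
	})
}
```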

Data plane faults

We consider the case of a stateless application deployed as a ReplicaSet or Deployment with a couple of replicas, with a Service in front of it to balance the load. Process redundancy is a native feature of ReplicaSet/Deployment, so we will look at how the components collaborate during pod or node failures to make sure requests are not routed to failed pods:

  • Pod failures — when a pod fails, the kubelet agent that monitors the state of every pod on the same machine reports the state change to the API Server, which stores this information in etcd. The Endpoints controller, which watches pod state changes via the API Server, then removes the pod from the Service’s Endpoints object, and that update is stored in etcd. kube-proxy watches the Endpoints object via the API Server, and when it receives the update it changes the iptables rules on its node to reflect the routing change. That is how Kubernetes makes sure traffic is not routed to the failed pod. The procedure ends by kicking off the failure recovery mechanism that replaces the failed pod.
  • Node failures — when a node fails, the API Server stops receiving heartbeats from that node’s kubelet and marks the node as unhealthy. The Node controller then initiates the pod eviction mechanism, which is followed by the procedure outlined in the pod failures section, while the replicas of the failed node’s pods continue to operate on other worker nodes.
    There might be a situation where all pod replicas are scheduled onto the same worker node, so that node’s failure means application failure. To avoid this, Kubernetes has a feature called pod anti-affinity rules, which lets you spread pods across nodes and avoid application failure (a minimal example follows this list).
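
Here is the anti-affinity example referenced above, expressed with the Go types from k8s.io/api/core/v1 and printed as the YAML you would normally place under spec.template.spec.affinity in a Deployment manifest. The app=my-api label is a placeholder; kubernetes.io/hostname is the standard per-node topology key.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Require that no two pods labelled app=my-api land on the same node
	// (the topology key kubernetes.io/hostname identifies a node).
	affinity := corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "my-api"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}

	// Print the equivalent YAML for a Deployment's pod template.
	out, err := yaml.Marshal(affinity)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```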

Summary

In this article, we reviewed the control plane and data plane components of a Kubernetes cluster and how they allow the cluster to keep operating when it experiences physical or virtual node failures.
