Leader Election Architecture — Kubernetes

Akash Chandra
Hybrid Cloud Hobbyist
7 min read · Apr 23, 2021

Leader election is an important pattern in distributed application architectures. It involves selecting a leader from the healthy candidates eligible for the election. The process then keeps running a heartbeat in a loop so it can detect when the leader fails a criterion that makes it unfit to hold the position. Once the leader's health check fails, the candidates start a re-election to choose the next leader among themselves. This is critical for services where the leader is the single point of contact orchestrating service delivery. ZooKeeper and etcd are examples of services that rely on a leader election process.

There are some informative guides available on how to implement leader election, but all of them point to the base idea mentioned in the Kubernetes blog https://kubernetes.io/blog/2016/01/simple-leader-election-with-kubernetes/

After going through the above blog post in detail, the following implementation can be inferred:

  • The instance that becomes ready first and is eligible for the election creates an Endpoints or ConfigMap Kubernetes object and adds an annotation (control-plane.alpha.kubernetes.io/leader) to expose information about its elected role to the other candidates (a rough example of this record follows after this list).
  • The leader instance constantly renews its lease to indicate that it is still fit to hold the position. It must update the renew time before the lease duration expires.
  • The other candidates constantly check the Endpoints or ConfigMap Kubernetes object. If it has already been created by the leader, they additionally compare the lease renew time (renewTime) against the current time. If renewTime is older than now, the leader failed to renew within its lease duration, which implies it has crashed or something else has gone wrong. In that case, a new leader election is started until a follower successfully claims leadership by updating the annotation with its own identity and lease duration.
  • This Endpoints or ConfigMap Kubernetes object is used as a lock, preventing any other candidate from creating the same object in the same namespace.
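As a rough illustration, the leader record kept in that annotation looks something like the following sketch. The field names follow the record used by client-go's leader election; the object name, holder identity, and timestamps are placeholders.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: my-leader-election          # placeholder name
  namespace: default
  annotations:
    # JSON record identifying the current leader and its lease
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"my-app-6d4f9c7b8d-abcde","leaseDurationSeconds":15,"acquireTime":"2021-04-23T10:00:00Z","renewTime":"2021-04-23T10:05:00Z","leaderTransitions":2}'
```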

Points to ponder on this implementation

  • Multiple Kube API calls are required to verify and ensure that the leader is healthy
  • Polling the Endpoints or ConfigMap Kubernetes object to discover the leader's identity adds a delay before a re-election for a new leader can start
  • There is a possibility of two leaders existing among the eligible candidates at the same time
  • Kube API calls increase significantly as the size of the replica set grows; the number of Deployments running such applications is another multiplying factor

Looking for an alternative approach

The election result and leader identity are maintained on the Endpoints/ConfigMap Kubernetes object in the form of an annotation. That object is queried and updated by the candidates and the leader during each heartbeat of the election loop. A sidecar container alongside each main container runs this logic, and the main container can query the sidecar to check whether leadership has shifted to itself. This passive way of learning the result is what causes the delay in the whole approach.

What if the leader and election state were held among the candidates themselves rather than in a common lock? Instead of querying the lock, the candidates can check among themselves who has joined the flock and is healthy. This brings the consensus and election logic inside the candidates, making them independent of any external locking mechanism. Kube API consumption is another factor to consider: multiple Deployments with large replica sets consume a lot of Kube API server resources just to query and update the Endpoints/ConfigMap object.

To our help, we found a library called democracy.js, written in Node.js. It provides an object that is constructed with the network addresses of its peers and its own network address. Each democracy object in the network is assigned a random weight.

The democracy object starts sending ping requests to its peers over UDP unicast and checks for their acknowledgements. When a new object is created, it signals its presence to all peers and an added event is emitted. Every new democracy node therefore starts with the added event, through which it discovers the peers present in the network. Each node adds every newly discovered peer to its local cache and keeps checking its health through the acknowledgements. If a peer stops responding to ping requests, it is removed from the local peer cache and a removed event is emitted. The object provides listeners for events such as:

  • elected
  • resigned
  • removed
  • added

Every candidate in the democracy network checks for the presence of a leader. If no leader is present, the peer with the highest weight is selected as the leader, and an elected event is emitted on the chosen node. If two leaders ever appear in the peer network, the leader with the lower weight resigns, emitting a resigned event.
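For illustration, wiring this up looks roughly like the following sketch. It assumes democracy.js's documented constructor options (source, peers) and the events listed above; the addresses are placeholders.

```javascript
const Democracy = require('democracy');

// Construct the node with its own address and the addresses of its peers.
// The host:port values are placeholders for illustration only.
const dem = new Democracy({
  source: '10.0.0.5:12345',                      // this pod's address
  peers: ['10.0.0.6:12345', '10.0.0.7:12345']    // sibling pods' addresses
});

dem.on('added', (node) => console.log('peer joined:', node));
dem.on('removed', (node) => console.log('peer lost:', node));
dem.on('elected', () => console.log('this node is now the leader'));
dem.on('resigned', () => console.log('this node gave up leadership'));
```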

These are proactive events, and the application can take decisions based on them. To implement this on Kubernetes, we need to construct the object with the source address and the peer addresses; the trick is where to get them from. Again to our help, the Endpoints object maintains the pod names and the network addresses of the pods between which the election has to happen. An Endpoints object is created automatically when we add the pod labels as a selector and create a Kubernetes Service object.

The following is an example of an Endpoints object YAML:
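This is a sketch of what such an object looks like; the Service name, pod names, IPs, and port are placeholders.

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: leader-election-svc        # same name as the Service
  namespace: default
subsets:
  - addresses:
      - ip: 10.244.1.12
        targetRef:
          kind: Pod
          name: leader-election-app-6d4f9c7b8d-abcde
      - ip: 10.244.2.7
        targetRef:
          kind: Pod
          name: leader-election-app-6d4f9c7b8d-fghij
    ports:
      - port: 12345
        protocol: UDP
```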

Each pod that comes up with labels matching the Kubernetes Service selector can discover some information about itself, such as the pod name assigned to it. This can be done by exposing the metadata name in the Deployment YAML.

The following YAML depicts how to read the pod's name; the environment variable MY_POD_NAME will contain the pod name.
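A sketch of the relevant part of the container spec, using the Kubernetes downward API; the container name and image are placeholders.

```yaml
containers:
  - name: leader-election
    image: leader-election:latest   # placeholder image
    env:
      - name: MY_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name   # injects the pod's own name
```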

Once the pod is ready and associated with a Service, it gets a network address assigned to it. The Endpoints object contains an array in which each element holds a pod name and the network address associated with it. This array can be parsed to construct the democracy object, as sketched below.
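A rough sketch of how this discovery and construction could look. It assumes the official @kubernetes/client-node package (which this article does not prescribe), the MY_POD_NAME environment variable from above, and the placeholder Service name and port used earlier.

```javascript
const k8s = require('@kubernetes/client-node');
const Democracy = require('democracy');

const PORT = 12345;                          // placeholder UDP port
const myPodName = process.env.MY_POD_NAME;   // injected via the downward API

const kc = new k8s.KubeConfig();
kc.loadFromCluster();                        // the pod runs inside the cluster
const core = kc.makeApiClient(k8s.CoreV1Api);

async function buildDemocracy() {
  // Read the Endpoints object created by the Service.
  const res = await core.readNamespacedEndpoints('leader-election-svc', 'default');
  const addresses = (res.body.subsets || []).flatMap((s) => s.addresses || []);

  // Our own IP is the entry whose targetRef.name matches our pod name.
  const self = addresses.find((a) => a.targetRef && a.targetRef.name === myPodName);
  if (!self) {
    throw new Error('own IP not registered in the Endpoints object yet, retry later');
  }

  // Every other address becomes a peer for the election.
  const peers = addresses
    .filter((a) => a.ip !== self.ip)
    .map((a) => `${a.ip}:${PORT}`);

  return new Democracy({ source: `${self.ip}:${PORT}`, peers });
}
```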

Advantages over the earlier implementation:

  1. This approach consumes the Kube APIs only once, during the startup phase of each pod's lifecycle. Once startup is over, the heartbeat runs over the Service network among the replicas, whereas in the earlier implementation the heartbeat is done against the Endpoints object. Thus, multiple such Deployments on a cluster will not overload the Kube API.
  2. The democracy.js implementation only requires read-only access to the Endpoints Kubernetes object, whereas the earlier implementation needs create, read, and write access.
  3. In the earlier approach, the Endpoints or ConfigMap object maintains both the lock and the leader's identity. The democracy.js-based implementation maintains the leader information in each pod.
  4. In the earlier approach, the leader has to write and update its identity on the Endpoints or ConfigMap Kubernetes object, which makes the heartbeat mechanism heavyweight.
  5. The chance of split brain is much higher in the earlier implementation because of the limit on the number of requests per second the Kube API can handle. Such scenarios are negligible in the new implementation.
  6. In the democracy.js implementation, leader selection is an event, so the switch-over is instantaneous; in the earlier implementation it is polling-based, so every candidate has to poll to learn the leader's identity.
  7. The heartbeat mechanism in the Node.js implementation uses the Service network to transmit its packets, so it is isolated from other Deployments. The earlier implementation uses the Kube APIs, which affects cluster performance globally and may slow down other leader-election pods.

We implemented this using Node.js:

  1. Once a pod comes up, the leader election container accesses the Endpoints object to discover its siblings and their registered IP addresses, and to look up its own IP address using the pod name. It retries on the Endpoints object until an IP is assigned to it (this may take a few seconds after the pod comes up).
  2. Once the siblings are discovered, it starts a heartbeat mechanism, sending UDP packets to each sibling, and looks for a leader.
  3. If no leader is available in the system, the members listed in the Endpoints Kubernetes object vote and elect a new leader.
  4. This surfaces as an event from the democracy.js library, which in turn triggers the leader's responsibilities.
  5. Whenever the replica set is scaled, or new pods are added with labels matching the Kubernetes Service selector and running the leader election container, the discovery call made by the new pod ensures that a leader is present in the network.
  6. democracy.js does not decide on a new leader until the leader pod dies and there is no active leader present.
  7. Various heartbeat-frequency settings control the switch-over time once a leader dies (see the sketch after this list).
  8. Based on the heartbeat frequency, when the leader dies a new leader is selected almost immediately.
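A small sketch of how the fail-over time can be tuned, assuming democracy.js's interval and timeout options (in milliseconds); selfIp, peers, and PORT come from the discovery sketch above, and the values are illustrative.

```javascript
// Tighter heartbeat settings mean a faster fail-over at the cost of more UDP traffic.
const dem = new Democracy({
  source: `${selfIp}:${PORT}`,
  peers,              // sibling addresses discovered from the Endpoints object
  interval: 1000,     // send a heartbeat every second
  timeout: 3000       // consider a silent peer (or leader) dead after 3 seconds
});

dem.on('elected', () => {
  // This pod has just become the leader: start the leader-only work here.
});

dem.on('resigned', () => {
  // Another leader with a higher weight appeared: stop the leader-only work.
});
```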

Implementation of Leader Election using democracy.js

The actual code implementation and the handling of corner cases will be addressed in part 2 of this blog.
