Deep dive into Kubernetes Simple Leader Election

Michael Bi
Jan 16 · 4 min read

Ever since I learned that the Kubernetes Controller Manager and Scheduler components should each have a leader in HA mode, I had always assumed they leveraged etcd to perform leader election.

But recently, when I reviewed the Kubernetes Controller Manager configuration (config.yaml), I suddenly noticed that there is actually no command flag for an etcd connection string.

I decided to search for information about the leader election mechanism of the K8s control plane components. There is good material online, such as Simple leader election with Kubernetes and Docker, but the leader election mechanism that Kubernetes performs was still confusing to me.

The following statements are copied from the aforementioned blog.

## Implementing leader election in Kubernetes

The first requirement in leader election is the specification of the set of candidates for becoming the leader. Kubernetes already uses Endpoints to represent a replicated set of pods that comprise a service, so we will re-use this same object. (aside: You might have thought that we would use ReplicationControllers, but they are tied to a specific binary, and generally you want to have a single leader even if you are in the process of performing a rolling update)

To perform leader election, we use two properties of all Kubernetes API objects:

  1. ResourceVersions - Every API object has a unique ResourceVersion, and you can use these versions to perform compare-and-swap on Kubernetes objects
  2. Annotations - Every API object can be annotated with arbitrary key/value pairs to be used by clients.
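To see why ResourceVersion-based compare-and-swap is enough to keep two candidates from both winning, here is a minimal, self-contained Go sketch. `Store` and `Object` are toy in-memory stand-ins of my own (not client-go types) for the API server and an API object; a real update goes through the Kubernetes API, but the conflict rule is the same.

```go
package main

import (
	"errors"
	"fmt"
)

// Object mimics a Kubernetes API object: every successful write bumps
// its ResourceVersion, which clients can use for compare-and-swap.
type Object struct {
	ResourceVersion int
	Leader          string
}

// Store is a toy, in-memory stand-in for the API server.
type Store struct {
	obj Object
}

var ErrConflict = errors.New("conflict: resourceVersion mismatch")

func (s *Store) Get() Object { return s.obj }

// Update succeeds only if the caller's copy carries the current
// ResourceVersion -- the same optimistic-concurrency rule the real
// API server enforces.
func (s *Store) Update(updated Object) error {
	if updated.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	updated.ResourceVersion++
	s.obj = updated
	return nil
}

func main() {
	store := &Store{obj: Object{ResourceVersion: 1}}

	// Two candidates read the same version of the object.
	a := store.Get()
	b := store.Get()

	// Candidate A writes first and wins.
	a.Leader = "candidate-a"
	fmt.Println("A update:", store.Update(a)) // <nil>

	// Candidate B's copy is now stale, so its write is rejected.
	b.Leader = "candidate-b"
	fmt.Println("B update:", store.Update(b)) // conflict: resourceVersion mismatch

	fmt.Println("leader:", store.Get().Leader) // candidate-a
}
```

The losing candidate simply re-reads the object and retries later, which is exactly how followers behave in the election loop.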

Who is the leader of K8s controller manager?

Before jumping into the code, the blog taught me how to inspect who the leader of a leader-election-enabled service, such as the k8s controller manager, currently is.

In Kubernetes, any leader-election-enabled service generates an Endpoints object with an annotation that indicates leadership of the service. Take the k8s controller manager as an example.

> kubectl describe ep -n kube-system kube-controller-manager
Name:         kube-controller-manager
Namespace:    kube-system
Labels:       <none>
Annotations:  control-plane.alpha.kubernetes.io/leader:
                {"holderIdentity":"a3e0b5e2-e869-488d-9c14-49a60f3878df_a69decf6-192d-11e9-8a88-e6202bae2e50","leaseDurationSeconds":15,"acquireTime":"201...
Subsets:
Events:       <none>

The endpoint annotation (control-plane.alpha.kubernetes.io/leader) indicates that the instance whose identity (or hostname) is a3e0b5e2-e869… is currently the leader.

{
"holderIdentity": "a3e0b5e2-e869-488d-9c14-49a60f3878df_a69decf6-192d-11e9-8a88-e6202bae2e50",
"leaseDurationSeconds": 15,
"acquireTime": "2019-01-16T01:25:47Z",
"renewTime": "2019-01-16T07:30:31Z",
"leaderTransitions": 0
}

That JSON object gives us some clues about the k8s leader election mechanism.


K8S leaderelection.go

The K8s leader election package is hosted in leaderelection.go.

// Keep in mind: this struct will be marshaled into the service endpoint
// as the value of the annotation
// control-plane.alpha.kubernetes.io/leader.
leaderElectionRecord := LeaderElectionRecord{
	HolderIdentity:       le.config.Identity,
	LeaseDurationSeconds: int(le.config.LeaseDuration / time.Second),
	RenewTime:            now,
	AcquireTime:          now,
}
....
// The instance checks whether an endpoint already exists in the
// given Namespace.
e, err := le.config.Client.Endpoints(le.config.EndpointsMeta.Namespace).Get(le.config.EndpointsMeta.Name)
if err != nil {
	if !errors.IsNotFound(err) {
		glog.Errorf("error retrieving endpoint: %v", err)
		return false
	}
	leaderElectionRecordBytes, err := json.Marshal(leaderElectionRecord)
	if err != nil {
		return false
	}
	_, err = le.config.Client.Endpoints(le.config.EndpointsMeta.Namespace).Create(&api.Endpoints{
		ObjectMeta: api.ObjectMeta{
			Name:      le.config.EndpointsMeta.Name,
			Namespace: le.config.EndpointsMeta.Namespace,
			Annotations: map[string]string{
				LeaderElectionRecordAnnotationKey: string(leaderElectionRecordBytes),
			},
		},
	})
	if err != nil {
		glog.Errorf("error initially creating endpoints: %v", err)
		return false
	}
	le.observedRecord = leaderElectionRecord
	le.observedTime = time.Now()
	return true
}
...
leaderElectionRecordBytes, err := json.Marshal(leaderElectionRecord)
if err != nil {
	glog.Errorf("err marshaling leader election record: %v", err)
	return false
}
// const (
//     LeaderElectionRecordAnnotationKey = "control-plane.alpha.kubernetes.io/leader"
// )
e.Annotations[LeaderElectionRecordAnnotationKey] = string(leaderElectionRecordBytes)
// The instance creates or updates the Endpoints object to claim the
// leadership role.
_, err = le.config.Client.Endpoints(le.config.EndpointsMeta.Namespace).Update(e)
if err != nil {
	glog.Errorf("err: %v", err)
	return false
}
le.observedRecord = leaderElectionRecord
le.observedTime = time.Now()
...

This code snippet gives me a clue about how k8s implements leader election: it leverages the k8s Endpoints resource as a kind of LOCK primitive.

If a service in k8s needs to elect a leader, the instances in that service compete to LOCK (via Create/Update) the k8s Endpoints resource of that service.

// LeaderElectionRecordAnnotationKey = "control-plane.alpha.kubernetes.io/leader"
e.Annotations[LeaderElectionRecordAnnotationKey] = string(leaderElectionRecordBytes)
le.config.Client.Endpoints(le.config.EndpointsMeta.Namespace).Update(e)
le.observedRecord = leaderElectionRecord
le.observedTime = time.Now()
...

So that makes sense to me now: an instance in a k8s service claims the leadership role by creating an endpoint and adding the annotation control-plane.alpha.kubernetes.io/leader to it.

This service endpoint is used as a lock that prevents any follower from creating the same endpoint in the same Namespace.
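The "endpoint as lock" idea boils down to the fact that Create is atomic: only one instance can create an Endpoints object with a given name. Here is a toy in-memory sketch of that property; `EndpointStore` is my own stand-in, not a Kubernetes type.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrAlreadyExists = errors.New("endpoints already exists")

// EndpointStore is a toy stand-in for the API server's Endpoints
// storage. Create is atomic, so only one caller can create a given
// name -- exactly the property the leader election code relies on.
type EndpointStore struct {
	mu        sync.Mutex
	endpoints map[string]string // name -> leader annotation value
}

func (s *EndpointStore) Create(name, leaderAnnotation string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.endpoints[name]; ok {
		return ErrAlreadyExists
	}
	s.endpoints[name] = leaderAnnotation
	return nil
}

func main() {
	store := &EndpointStore{endpoints: map[string]string{}}

	// Two instances race to create the same Endpoints object.
	err1 := store.Create("kube-controller-manager", `{"holderIdentity":"instance-1"}`)
	err2 := store.Create("kube-controller-manager", `{"holderIdentity":"instance-2"}`)

	fmt.Println("instance-1:", err1) // <nil> -- became the leader
	fmt.Println("instance-2:", err2) // endpoints already exists -- stays a follower
}
```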

That is awesome!!


Summary of K8S simple leader election

The implementation details are boring. Simply put, K8s simple leader election follows the steps below.

  1. The instance that first creates the service endpoint becomes the leader of the service at the very beginning. This instance then adds the control-plane.alpha.kubernetes.io/leader annotation to expose leadership information to followers and other applications in the cluster.
  2. The leader instance must constantly renew its lease to indicate its liveness. In the record below, leaseDurationSeconds is 15 seconds, and the leader updates its lease time (renewTime) before the lease duration expires.
  3. Followers constantly check the service endpoint. If it has already been created by a leader, the followers additionally compare the lease renewal time (renewTime) against the current time. If renewTime plus the lease duration is older than now, the leader failed to renew its lease, which suggests it has crashed or something bad has happened. In that case, a new leader election starts and runs until a follower successfully claims leadership by updating the endpoint with its own identity and lease.
{
"holderIdentity": "a3e0b5e2-e869-488d-9c14-49a60f3878df_a69decf6-192d-11e9-8a88-e6202bae2e50",
"leaseDurationSeconds": 15,
"acquireTime": "2019-01-16T01:25:47Z",
"renewTime": "2019-01-16T07:30:31Z",
"leaderTransitions": 0
}

There are probably other edge cases in the implementation details, but they don't concern me just now.

If you think there is something wrong in this article, please let me know.

@michaelbi_22303

A solution architect at Mesosphere. Loves distributed technologies.
