Clearing the Obstacles Around Kubernetes Stateful Applications

Tal Asulin
AppsFlyer Engineering
8 min read · Nov 29, 2023

Redesigning legacy infrastructure is always considered an opportunity for growth and improvement, especially when Kubernetes comes into play.

A crucial aspect in the redesign process involves eliminating any human interventions and substituting them with automated flows.

When you choose to run a stateful application on Kubernetes while maintaining local storage with persistent volumes, you may still need to take manual action to recover your infrastructure from internal failures or unexpected third-party behavior, such as a node failure.

In this blog post, I will cover how to overcome some common obstacles when running stateful applications — all while leveraging Kubernetes’ native components to deal with local persistent volumes’ lifecycles.

Kafka as a Stateful Application

When I joined the AppsFlyer Platform group, our team’s first mission was to redesign the legacy infrastructure of our Kafka deployment running on self-hosted instances over AWS.

Our team’s main challenge was to design an infrastructure with maximum performance at the lowest cost — a trade-off most of us still face today, and one that high-scale, data-driven companies like AppsFlyer cannot compromise on, especially when it comes to a backbone component like Kafka.

When we started our exploration, choosing Kubernetes as the orchestration engine for our application was a natural choice. But beyond that, the decision-making process became more complex. Throughout our journey, we had to benchmark multiple storage solutions. One of them involved evaluating local versus network-based storage, where we found a significant performance difference — particularly in terms of IOPS and fetch times. That comparison eventually paved the way for us to adopt storage-optimized instances with local SSD drives for our Kafka deployment.

Defining and Managing Local Instance Stores in Kubernetes

As we began our exploration and defined our Kafka deployment over Kubernetes, one of the first tasks we had was to mount the Kubernetes local storage to our pod so that it could be used as the data disk.

When mounting a local storage device in Kubernetes, you first need to define your volume as a Kubernetes resource called a PersistentVolume (PV). Defining your local volume as a PV allows your pods to mount it through PersistentVolumeClaim (PVC) requests. As stated in the official Kubernetes documentation:

“A PV is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like volumes, but have a lifecycle independent of any individual pod that uses the PV. This API object captures the details of the implementation of the storage — be that NFS, iSCSI, or a cloud provider-specific storage system.

“A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and memory). Claims can request specific size and access modes.”

When a PVC is created for a pod, it binds exclusively to a matching PV and consumes it as a resource, as long as the two share a common Kubernetes storage class. Pods running in the cluster use PVCs to request this kind of physical storage.
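
To make this concrete, here is a minimal sketch of what such a claim could look like. The claim name, namespace, and size are illustrative (chosen to line up with the PV example shown later in this post), and the storage class is the one served by the local volume provisioner discussed in the next section:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-cluster-1-kafka-0   # illustrative claim name, matching the later PV example
  namespace: strimzi
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path   # storage class handled by the local volume provisioner
  resources:
    requests:
      storage: 3412Gi

Once this claim is bound to a PV, the pod simply references it by name in its volumes section.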

Generally, maintaining large-scale environments such as AppsFlyer’s requires you to adopt fully automated solutions that eliminate human intervention.

In order to dynamically create Kubernetes PV resources, we deployed an open-source tool called “local-path-provisioner,” developed by Rancher. The local-path-provisioner serves as a Kubernetes controller that listens for resource requests (PVCs) and creates new PVs on demand, according to the specified storage class profile.

There are multiple solutions in the Kubernetes community that enable the dynamic creation of local-storage PVs. We chose the “local-path-provisioner” mainly because of how well it works with a Kubernetes cluster autoscaler that utilizes AWS auto-scaling groups. I will not go into too much detail here about the different existing solutions, but I do recommend investing some time to explore them.
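
For reference, the storage class profile that the local-path-provisioner ships with looks roughly like the following. The provisioner name and the WaitForFirstConsumer binding mode reflect the project’s defaults, so treat this as a sketch rather than our exact configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path      # handled by the local-path-provisioner controller
volumeBindingMode: WaitForFirstConsumer # delay binding until a pod is scheduled onto a node
reclaimPolicy: Delete                   # remove the backing data when the claim is released

The WaitForFirstConsumer mode is what makes node-local disks workable here: binding is delayed until the scheduler has picked a node for the pod, so the PV ends up on the same instance as its consumer.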

Maintaining PVs

Running stateful applications may not always sound like a challenging task, but running distributed systems like Apache Kafka at such a high scale always requires you to automate the recovery procedure for when a node in the cluster fails.

This might sound obvious when running stateless applications over Kubernetes, where the Kubernetes scheduler will simply deploy your pod on a healthy node. But how can the same be achieved when running stateful applications over a node’s local storage in Kubernetes? And what will happen to our PVC when the underlying storage device associated with our PV becomes unavailable?

Our team has had to deal with this many times before, when third-party interference — such as a hardware failure that results in node termination — takes place. In such cases, where we have a failed node, the PV is no longer reachable. The corresponding Kubernetes PV and PVC objects will still exist, but the underlying hardware will not.

In such cases, our pod moves to the “Pending” state, and we encounter the “volume node affinity conflict” event generated by the Kubernetes scheduler, which keeps trying to schedule the pod onto the same node and persistent volume in case they become available again. Since node failures usually end with node termination, recovery is not an option.
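
On the pending pod, this usually surfaces as a FailedScheduling event from the scheduler, with a message along the lines of the following (the node counts here are illustrative):

Warning  FailedScheduling  default-scheduler  0/6 nodes are available: 6 node(s) had volume node affinity conflict.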

By design, Kubernetes does not remove any storage-related resources without explicit instructions, so we cannot expect automatic recovery. To overcome this constraint, the stale PVC must be cleaned up so that, once a new PVC is created, the Kubernetes scheduler can place the pod on a fresh instance.

This may sound like an effortless procedure, but we don’t rush to do it for every instance failure. We might encounter, for example, a planned instance restart that completes successfully — making both the host and the storage available again. So we must first ensure that the instance has actually been terminated. Only then do we manually delete the PVC to allow a new one to be created in its place. When the local-path-provisioner identifies that new resource request (PVC), it will automatically create a new PV, as long as available nodes exist in the cluster.

Once we understood that this semi-manual flow of actions was not maintainable at scale and constantly involved human intervention, we decided to develop a more Kubernetes-native solution: a dedicated Kubernetes controller — the missing piece of the puzzle that would manage the lifecycle of our PVCs.

Our Solution

To help overcome these challenges, we developed a controller that listens to the events generated by the “node-controller” service (part of the Kubernetes control-plane), and responds to cases where the service is removing a node from the cluster. We named it “Local PVC Releaser”.

For example, the following Kubernetes event was generated by the “node-controller” service, reporting a node removal and specifying the exact node name:

Name:             kafka-1.my.domain.1752aaf5df1c3a15
Namespace:        default
API Version:      v1
Count:            1
Involved Object:
  API Version:    v1
  Kind:           Node
  Name:           my-instance-id.eu-west-1.my.domain
  UID:            b88e65c9-4bb3-4406-99d0-4200b3a780f4
Kind:             Event
Message:          Node my-instance-id.eu-west-1.my.domain event: Removing Node my-instance-id.eu-west-1.my.domain from Controller
Metadata:
  Managed Fields:
    API Version:  v1
    Manager:      kube-controller-manager
    Operation:    Update
Reason:           RemovingNode
Source:
  Component:      node-controller

Selecting the appropriate event type for your controller is vital, as sometimes multiple services can respond and generate events as the result of an incident. Ensuring that the controller can be reused for different purposes — even if they don’t necessarily support the same product or use case — was a mandatory requirement for us. Therefore, we determined that we should focus on events generated by one of the Kubernetes control-plane components (in this case, the node-controller).

It is important to mention that these kinds of events can be triggered by intentional node removal by the user or the auto-scaler, or by unintentional removal caused by unexpected behavior such as degraded node hardware.

Once we identified the type and kind of event we wanted to act upon, we started developing our controller. To kickstart the project, we used the “Kubebuilder” SDK (developed as part of the Kubernetes SIGs), which is built on top of the native Kubernetes controller-runtime libraries.

The first step we took was to set up the controller with its manager, registering a predicate method (“onEventCreationPredicate”) that filters the incoming Kubernetes Event objects down to the relevant node-removal events.

import (
    v1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// The controller watches Kubernetes Event objects and only reconciles those
// that pass the predicate below.
func (r *PVCReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&v1.Event{}).WithEventFilter(onEventCreationPredicate()).
        Complete(r)
}

// onEventCreationPredicate accepts only newly created events that were emitted
// by the node-controller with the "RemovingNode" reason.
func onEventCreationPredicate() predicate.Predicate {
    return predicate.Funcs{
        CreateFunc: func(e event.CreateEvent) bool {
            obj := e.Object.(*v1.Event)
            return obj.Source.Component == "node-controller" && obj.Reason == "RemovingNode"
        },
    }
}

This event confirmed that the Kubernetes node-controller had decided to remove a node from the cluster. Afterwards, we iterated over all existing PV objects in the cluster to identify whether one (or more) of them represented a local volume on that node.

The following is an example of a PV object that represents a local volume on a Kubernetes worker node:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rancher.io/local-path
  finalizers:
    - kubernetes.io/pv-protection
  name: pvc-c34d74df-4337-4337-4337-aeb225020289
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 3412Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-cluster-1-kafka-0
    namespace: strimzi
  local:
    path: /mnt/data/pvc-c34d74df-e65a-4337-877b-aeb225020289_strimzi_data-cluster-1-kafka-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - ip-10-0-0-0.eu-west-1.compute.internal
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-path
  volumeMode: Filesystem
status:
  phase: Bound

To filter the PV objects, we used the “NodeAffinity” rule, matching it against the faulty worker node’s name. This ensured that we collected only the PV objects that had been allocated from that node’s local instance store.

import (
    "fmt"
    "slices"

    v1 "k8s.io/api/core/v1"
)

func (r *PVCReconciler) FilterPVListByNodeName(pvList *v1.PersistentVolumeList, nodeName string) []*v1.PersistentVolume {
    var relatedPVs []*v1.PersistentVolume

    for i := 0; i < len(pvList.Items); i++ {
        pv := &pvList.Items[i]
        // Ignore PVs without node affinity rules and PVs that are no longer bound
        if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil || pv.Status.Phase != v1.VolumeBound {
            continue
        }

        // A local PV pins itself to its node via a hostname match expression
        for _, nst := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
            for _, matchEx := range nst.MatchExpressions {
                if slices.Contains(matchEx.Values, nodeName) {
                    r.Logger.Info(fmt.Sprintf("pv - %s is bound to node - %s. will be marked for pvc cleanup", pv.Name, nodeName))
                    relatedPVs = append(relatedPVs, pv)

                    break
                }
            }
        }
    }

    return relatedPVs
}

For each PV that met the conditions above, we extracted the name of its claim (the claimRef) from the PV object and initiated the deletion of the corresponding PVC, which marked the completion of the reconcile process.
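
The cleanup step itself boils down to deleting the PVC referenced by each matched PV’s claimRef. The following is only a minimal sketch of that step, not our exact production code: the DeleteRelatedPVCs name is hypothetical, and it assumes the reconciler embeds a controller-runtime client, as Kubebuilder scaffolds by default.

import (
    "context"
    "fmt"

    v1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DeleteRelatedPVCs deletes the PVC bound to each of the given PVs so that a
// fresh claim can be created and satisfied on a healthy node.
// It assumes PVCReconciler embeds client.Client (the Kubebuilder default).
func (r *PVCReconciler) DeleteRelatedPVCs(ctx context.Context, pvs []*v1.PersistentVolume) error {
    for _, pv := range pvs {
        claim := pv.Spec.ClaimRef
        if claim == nil {
            continue
        }
        pvc := &v1.PersistentVolumeClaim{
            ObjectMeta: metav1.ObjectMeta{Name: claim.Name, Namespace: claim.Namespace},
        }
        // Ignore "not found" errors - the PVC may already be gone
        if err := r.Delete(ctx, pvc); err != nil && !apierrors.IsNotFound(err) {
            return err
        }
        r.Logger.Info(fmt.Sprintf("deleted pvc %s/%s for pv %s", claim.Namespace, claim.Name, pv.Name))
    }
    return nil
}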

Once the PVC cleanup was done, the job of our controller was complete. The pod/operator can now raise a new resource request (PVC), which triggers our storage provisioner to create new PV objects (as long as available nodes and volumes exist in the cluster). When the new PVC is bound to the newly created PV, the pod can be successfully scheduled again.

Making Kubernetes Work for You

Throughout this process, we were able to automate recovery from instance failures and make our stateful deployments more resilient to unexpected behavior. Choosing a Kubernetes-native approach — developing a controller — allowed us to integrate more easily with internal Kubernetes components, make wiser decisions, and turn Kubernetes into a true orchestra.

By implementing this solution, we reduced operational overhead and improved our Kafka clusters’ stability. We manage more than 60 Kubernetes clusters at AppsFlyer, and our day-to-day operations have become much more robust and efficient, especially during Kubernetes version and instance upgrades.
