Managing Stateful Workloads in Kubernetes

Amarjit Singh · Published in branch-engineering
5 min read · Apr 5, 2019

At Branch we process and store vast amounts of data, generated from over 15 billion user events every day. Doing so successfully, while maintaining a near-100% SLA, requires hundreds of stateless microservices, which we run in Kubernetes at scale. On any given work day at Branch, you can find an open terminal interacting with Kubernetes on an engineer's laptop.

All of these stateless services communicate with multiple persistent, low-latency, redundant, and highly available databases and data-processing workloads, such as our self-managed Aerospike NoSQL databases, Kafka, and Elasticsearch. You can find out more about how we use DynamoDB and Kafka to efficiently support our clients at scale.

Challenges Scaling Stateful Workloads

Our stateful workloads, on the other hand, are orchestrated with a variety of other tools and often require specialized, dedicated interaction to properly handle growth and emergency situations. For a company like Branch, which is growing extremely fast and generating a tremendous amount of data, this fragmentation in orchestration and configuration management appears very quickly, and it makes it extremely difficult to maintain standard operating procedures for ensuring site reliability.

If the performance and availability requirements of such stateful workloads can be met under Kubernetes' orchestration and configuration management, we can solve some of the pain points above. Enter Kubernetes StatefulSets. Let's discuss where we started and how we plan to scale these workloads by properly selecting the underlying storage layer in the AWS cloud.

Choosing Persistent Volumes

Persistent Volumes attach to pods orchestrated by StatefulSets in Kubernetes. When a pod is restarted (because of deletion or scaling), the Kubernetes scheduler has to reattach the same volume the pod was using previously. In AWS, Kubernetes can leverage either EBS or ephemeral instance storage (as local volumes) to provide Persistent Volumes to pods.

While pods using EBS persistent volumes will restart and survive an instance failure or termination, pods using ephemeral/instance store as local volumes will not. Such failures require manually deleting the PVCs before the affected StatefulSet pods will restart.
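That recovery is a manual kubectl sequence. A rough sketch, with hypothetical label and resource names, looks like this:

# List claims for the affected workload (label and names are hypothetical)
kubectl get pvc -l app=my-stateful-app

# Remove the claim that was bound to the lost node's local volume
kubectl delete pvc data-my-stateful-app-2

# Delete the stuck pod; the StatefulSet controller recreates it and binds a fresh volume
kubectl delete pod my-stateful-app-2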

Provisioning StatefulSets Based on Workload Profile

The following decision tree provides a starting point for choosing the type of volume for StatefulSet workloads in the AWS cloud.

Ref: https://www.slideshare.net/AmazonWebServices/stg306deep-dive-on-amazon-ebs/14

Average Performance Workloads

For average-performance workloads we decided to use AWS EBS gp2 volumes as the StatefulSet persistent data store. Our Elasticsearch and Kafka workloads store stateful data in EBS volumes. Since an EBS volume is network-based and can be attached to any instance (or Kubernetes node), losing a node does not mean losing data: the volume is simply detached from the failed node and attached to a new node, where the impacted pod is restarted.

At a high level, the common steps to create StatefulSets using EBS persistent volumes are:

  1. Create a StorageClass specifying the type of EBS volume, i.e. gp2, io1, sc1, or st1 (refer to this doc for help).
  2. Create a Persistent Volume Claim, which binds to a Persistent Volume created either manually or dynamically (more information here).
  3. Consume the PVC and create a StatefulSet workload (example and more information here). A sketch of all three steps follows this list.
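Put together, a rough sketch of these three steps might look like the following; the workload name example-db, its image, and the storage size are placeholder values, not our production manifests:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db                 # hypothetical workload name
spec:
  serviceName: example-db
  replicas: 3
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      containers:
      - name: example-db
        image: your_registry/example-db:latest   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/example-db
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp2-storage
      resources:
        requests:
          storage: 100Gi

The volumeClaimTemplates section makes Kubernetes dynamically provision one gp2 EBS volume per replica and keep it bound to the same pod identity across restarts.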

High Performance Workloads

High-performance StatefulSet workloads require instance-based storage, e.g. NVMe storage on AWS i3 instances. This requires a special type of Persistent Volume called a local volume in Kubernetes. In other words, local persistent volumes in Kubernetes map to instance-based storage, also known as ephemeral storage.

Ephemeral storage is fast but comes with a disadvantage: data is lost when an instance is terminated or lost (for example, with spot instances or scale-down activities). In such an event, the StatefulSet pod is lost and must be restarted on a new node, which may require deleting the old PVC associated with the terminated instance. A new StatefulSet pod with a fresh volume means the underlying application or workload has to rebalance and recreate the lost copies of data on the new instance (with its new volume).

StatefulSets with local persistent volumes can be created with the same steps described above, but a few additional steps must be performed first. Most importantly, the local ephemeral storage disks need to be partitioned, formatted, and mounted before they can be used by Kubernetes nodes.

At a high level, the steps to create StatefulSets based on local persistent volumes are:

  1. Partition the NVMe disks when a Kubernetes node is added to the cluster (see below for details).
  2. Format and mount the partitions at the node level (see below for details).
  3. Create a StorageClass specifying local persistent volumes (a minimal sketch follows this list).
  4. Run an external dynamic provisioner. Kubernetes does not natively support dynamic provisioning for local persistent volumes, but it can be achieved with an external provisioner. The external dynamic provisioner looks for volumes mounted under a specific mount point, creates PVs from them, and makes them ready to be consumed by the StorageClass above. We use the one listed here. Using Helm, you would generate a YAML manifest that creates the ConfigMap, RBAC objects, and DaemonSet which runs the external provisioner; refer to the example shown at the end of this article.
  5. Create the StatefulSet workload as needed.
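For step 3, a minimal StorageClass sketch looks like the following. The name fast-disks matches the storageClassMap key in the provisioner config at the end of this article, and WaitForFirstConsumer delays volume binding until the pod is scheduled, so the pod lands on the node that actually holds the disk:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
# Local volumes have no in-tree dynamic provisioner; the PVs are created by
# the external provisioner described in step 4
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer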

Steps 1 and 2 can be automated with a user-data script that is executed when a new instance is provisioned. If Kubernetes is provisioned via kops, the following YAML can be used to create an instance group whose nodes provide persistent volumes backed by local instance storage.

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: your_cluster_name
  name: your_node_name
spec:
  additionalUserData:
  - name: mountvolumes.sh
    type: text/x-shellscript
    content: |
      #!/bin/bash
      cp /etc/fstab /etc/fstab.orig
      # Partition, format and mount every local NVMe disk on the node
      lsblk | grep "^nv" | awk '{print $1}' | while read A
      do
        # Three 400G primary partitions plus a fourth from the remaining space;
        # blank lines accept the fdisk defaults for partition number/sectors
        fdisk /dev/$A << EOF
      n
      p
      1

      +400G
      n
      p
      2

      +400G
      n
      p
      3

      +400G
      n
      p



      w
      EOF
        mkfs.ext4 /dev/${A}p1
        mkfs.ext4 /dev/${A}p2
        mkfs.ext4 /dev/${A}p3
        mkfs.ext4 /dev/${A}p4
        mkdir /mnt/${A}p1
        mkdir /mnt/${A}p2
        mkdir /mnt/${A}p3
        mkdir /mnt/${A}p4
        echo "/dev/${A}p1 /mnt/${A}p1 ext4 defaults 1 2" >> /etc/fstab
        echo "/dev/${A}p2 /mnt/${A}p2 ext4 defaults 1 2" >> /etc/fstab
        echo "/dev/${A}p3 /mnt/${A}p3 ext4 defaults 1 2" >> /etc/fstab
        echo "/dev/${A}p4 /mnt/${A}p4 ext4 defaults 1 2" >> /etc/fstab
      done
      mount -a
  cloudLabels:
    TAG1: Value1
    TAG2: Value2
  image: ac_number/path/to/image
  machineType: i3.2xlarge
  nodeLabels:
    app: your_node_label
    instance_type: i3
    kops.k8s.io/instancegroup: your_instance_group_name
  role: Node
  subnets:
  - us-west-1a
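Assuming the manifest is saved as nodes-i3.yaml (a hypothetical filename), the instance group can then be created and rolled out with the usual kops workflow:

kops create -f nodes-i3.yaml
kops update cluster --yes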

Dynamic Provisioner

---
# Source: provisioner/templates/provisioner.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: your_namespace
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
data:
  storageClassMap: |
    fast-disks:
      hostDir: /mnt
      mountDir: /mnt
      blockCleanerCommand:
        - "/scripts/shred.sh"
        - "2"
      volumeMode: Filesystem
      fsType: ext4
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: local-volume-provisioner
  namespace: your_namespace
  labels:
    app: local-volume-provisioner
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
spec:
  selector:
    matchLabels:
      app: local-volume-provisioner
  template:
    metadata:
      labels:
        app: local-volume-provisioner
    spec:
      serviceAccountName: local-storage-admin
      nodeSelector:
        instance_type: i3
      tolerations:
      - effect: NoSchedule
        key: app
        operator: Equal
        value: your_namespace
      priorityClassName: system-node-critical
      containers:
      - image: "quay.io/external_storage/local-volume-provisioner:v2.3.0"
        name: provisioner
        securityContext:
          privileged: true
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: JOB_CONTAINER_IMAGE
          value: "quay.io/external_storage/local-volume-provisioner:v2.3.0"
        volumeMounts:
        - mountPath: /etc/provisioner/config
          name: provisioner-config
          readOnly: true
        - mountPath: /dev
          name: provisioner-dev
        - mountPath: /mnt
          name: fast-disks
          mountPropagation: "HostToContainer"
      volumes:
      - name: provisioner-config
        configMap:
          name: local-provisioner-config
      - name: provisioner-dev
        hostPath:
          path: /dev
      - name: fast-disks
        hostPath:
          path: /mnt
---
# Source: provisioner/templates/provisioner-service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: local-storage-admin
  namespace: your_namespace
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
---
# Source: provisioner/templates/provisioner-cluster-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: local-storage-provisioner-pv-binding
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: your_namespace
roleRef:
  kind: ClusterRole
  name: system:persistent-volume-provisioner
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: local-storage-provisioner-node-clusterrole
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: local-storage-provisioner-node-binding
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: your_namespace
roleRef:
  kind: ClusterRole
  name: local-storage-provisioner-node-clusterrole
  apiGroup: rbac.authorization.k8s.io

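Assuming the manifests above are saved to a file such as local-volume-provisioner.yaml (a hypothetical name), applying them and checking that the provisioner has turned the mounted partitions into PVs looks like this:

kubectl apply -f local-volume-provisioner.yaml
kubectl -n your_namespace get pods -l app=local-volume-provisioner
kubectl get pv    # one local PV should appear for each partition mounted under /mnt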