Managing Stateful Workloads in Kubernetes
At Branch we process and store vast amounts of data, generated by more than 15 billion user events every day. Doing so while maintaining a near-100% SLA requires running hundreds of stateless microservices, which we operate in Kubernetes at scale. On any given workday at Branch, you can find an open terminal interacting with Kubernetes on an engineer's laptop.
All of these stateless services communicate with multiple persistent, low-latency, redundant, and highly available databases and data-processing workloads, such as our self-managed Aerospike NoSQL databases, Kafka, and Elasticsearch. Find out more about how we use DynamoDB and Kafka to efficiently support our clients at scale.
Challenges Scaling Stateful Workloads
Our stateful workloads, on the other hand, are orchestrated with a variety of other tools and often require specialized, dedicated interaction to properly handle growth and emergencies. For a company like Branch, which is growing extremely fast and generating a tremendous amount of data, fragmentation in orchestration and configuration management appears very quickly, and it makes maintaining standard operating procedures for site reliability extremely difficult.
If the performance and availability requirements of such stateful workloads can be met with Kubernetes' orchestration and configuration management, we can solve some of the above pain points. Enter Kubernetes StatefulSets. Let's discuss where we started and how we plan to scale them with the proper selection of the underlying storage layer in the AWS cloud.
Choosing Persistent Volumes
Persistent Volumes attach to pods orchestrated by the StatefulSet controller in Kubernetes. When a pod is restarted (because of deletion or scaling), Kubernetes has to reattach the same volume the pod was using previously. In AWS, Kubernetes can leverage EBS or ephemeral instance storage (as local volumes in StatefulSets) to provide Persistent Volumes to pods.
While pods backed by EBS persistent volumes can survive an instance failure or termination, pods using ephemeral/instance-store local volumes cannot. Such failures require manually deleting the PVC before the affected StatefulSet pod can be rescheduled.
Provisioning Statefulsets Based on Workload Profile
The following decision tree provides a starting point when choosing a volume type for StatefulSet workloads in the AWS cloud.
Ref: https://www.slideshare.net/AmazonWebServices/stg306deep-dive-on-amazon-ebs/14
Average Performance Workloads
For average-performance workloads we decided to use AWS EBS gp2 volumes as the StatefulSet persistent data store. We run our Elasticsearch and Kafka workloads with their stateful data in EBS volumes. Since an EBS volume is network-based and can be attached to any instance (Kubernetes node), losing a node does not mean losing data. In such an event, the volume is detached from the failed node and attached to a new node, where the impacted pod is started.
At a high level, the common steps to create StatefulSets using EBS persistent volumes are:
- Create a StorageClass specifying the type of EBS volume, i.e. gp2, io1, sc1, or st1 (ref this doc for help).
- Create a PersistentVolumeClaim, which needs a Persistent Volume created either manually or dynamically (more information here).
- Consume the PVC in a StatefulSet workload (example and more information here).
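The steps above can be sketched roughly as follows. This is a minimal illustration, not our production config: the names `gp2-storage` and `my-kafka`, the image, the replica count, and the 100Gi claim size are all placeholders.

```yaml
# Hypothetical sketch: a gp2 StorageClass plus a StatefulSet that
# requests one EBS-backed volume per replica via volumeClaimTemplates.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-kafka
spec:
  serviceName: my-kafka
  replicas: 3
  selector:
    matchLabels:
      app: my-kafka
  template:
    metadata:
      labels:
        app: my-kafka
    spec:
      containers:
      - name: kafka
        image: your_kafka_image
        volumeMounts:
        - name: data
          mountPath: /var/lib/kafka
  # Each replica gets its own PVC (data-my-kafka-0, data-my-kafka-1, ...);
  # on pod restart the same claim, and hence the same EBS volume, is reused.
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp2-storage
      resources:
        requests:
          storage: 100Gi
```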
High Performance Workloads
High-performance StatefulSet workloads require instance-based storage, e.g. NVMe storage on AWS i3 instances. This requires a special type of Persistent Volume called a Local Volume. In other words, local persistent volumes in Kubernetes map to instance-based storage, also known as ephemeral storage.
Ephemeral storage is fast, but comes with a disadvantage: data is lost when an instance is terminated (in the case of spot instances or scale-down activity). In such an event, the StatefulSet pod is lost and is restarted on a new node, which may require deleting the old PVC associated with the terminated instance. A new StatefulSet pod with a fresh volume means the underlying application/workload has to rebalance and recreate the lost copies of data on the new instance (with its new volume).
StatefulSets with Local Persistent Volumes can be created with the same steps as described above; however, one important set of steps must be performed first: the local ephemeral storage disks need to be partitioned and mounted before Kubernetes nodes can use them.
At a high level, the steps to create Local Persistent Volume based StatefulSets are:
- Partition the NVMe disks when a Kubernetes node is added to the cluster (see below for details).
- Format and mount the partitions at the node level (see below for details).
- Create a StorageClass specifying local persistent volumes.
- Run an external dynamic provisioner. Kubernetes does not support dynamic provisioning for local persistent volumes natively, but it can be achieved with an external provisioner, which looks for volumes mounted under a specific mount point, creates PVs from them, and makes them ready to be consumed by the above StorageClass. We use the one listed here. Using helm, you would generate the YAML to create the ConfigMap, RBAC, and DaemonSet that runs the external provisioner. Refer to the example shown at the end of this article.
- Create the StatefulSet workload as needed.
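As a rough sketch of step 3, the StorageClass for local volumes uses `kubernetes.io/no-provisioner` with `WaitForFirstConsumer` binding, so a claim is bound only once its pod is scheduled onto the node that physically owns the disk. The StorageClass name `fast-disks`, the PVC name, and the 400Gi size here are illustrative placeholders.

```yaml
# Hypothetical sketch: StorageClass for local persistent volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-disks
provisioner: kubernetes.io/no-provisioner
# Delay binding until the pod is scheduled, so pod and disk land on the same node.
volumeBindingMode: WaitForFirstConsumer
---
# A claim bound to one of the local PVs that the external provisioner
# creates from the partitions mounted under /mnt.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-myapp-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-disks
  resources:
    requests:
      storage: 400Gi
```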
Steps #1 and #2 can be automated with a user-data script that is executed when a new instance is provisioned. If Kubernetes is provisioned via kops, the following YAML can be used to create an instance group that provides local instance-storage-based persistent volumes.
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: your_cluster_name
  name: your_node_name
spec:
  additionalUserData:
  - name: mountvolumes.sh
    type: text/x-shellscript
    content: |
      #!/bin/bash
      # Keep a backup of fstab before appending new mount entries
      cp /etc/fstab /etc/fstab.orig
      # For each NVMe instance-store device, create three 400G partitions
      # plus a fourth from the remaining space. Blank lines in the heredoc
      # accept fdisk defaults (partition number, first/last sector).
      lsblk | grep "^nv" | awk '{print $1}' | while read A
      do
      fdisk /dev/$A << EOF
      n
      p
      1

      +400G
      n
      p
      2

      +400G
      n
      p
      3

      +400G
      n
      p



      w
      EOF
      # Format each partition and mount it under /mnt, persisted via fstab
      mkfs.ext4 /dev/${A}p1
      mkfs.ext4 /dev/${A}p2
      mkfs.ext4 /dev/${A}p3
      mkfs.ext4 /dev/${A}p4
      mkdir /mnt/${A}p1
      mkdir /mnt/${A}p2
      mkdir /mnt/${A}p3
      mkdir /mnt/${A}p4
      echo "/dev/${A}p1 /mnt/${A}p1 ext4 defaults 1 2" >> /etc/fstab
      echo "/dev/${A}p2 /mnt/${A}p2 ext4 defaults 1 2" >> /etc/fstab
      echo "/dev/${A}p3 /mnt/${A}p3 ext4 defaults 1 2" >> /etc/fstab
      echo "/dev/${A}p4 /mnt/${A}p4 ext4 defaults 1 2" >> /etc/fstab
      done
      mount -a
  cloudLabels:
    TAG1: Value1
    TAG2: Value2
  image: ac_number/path/to/image
  machineType: i3.2xlarge
  nodeLabels:
    app: your_node_label
    instance_type: i3
    kops.k8s.io/instancegroup: your_instance_group_name
  role: Node
  subnets:
  - us-west-1a
Dynamic Provisioner
---
# Source: provisioner/templates/provisioner.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: your_namespace
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
data:
  storageClassMap: |
    fast-disks:
      hostDir: /mnt
      mountDir: /mnt
      blockCleanerCommand:
      - "/scripts/shred.sh"
      - "2"
      volumeMode: Filesystem
      fsType: ext4
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: local-volume-provisioner
  namespace: your_namespace
  labels:
    app: local-volume-provisioner
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
spec:
  selector:
    matchLabels:
      app: local-volume-provisioner
  template:
    metadata:
      labels:
        app: local-volume-provisioner
    spec:
      serviceAccountName: local-storage-admin
      nodeSelector:
        instance_type: i3
      tolerations:
      - effect: NoSchedule
        key: app
        operator: Equal
        value: your_namespace
      priorityClassName: system-node-critical
      containers:
      - image: "quay.io/external_storage/local-volume-provisioner:v2.3.0"
        name: provisioner
        securityContext:
          privileged: true
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: JOB_CONTAINER_IMAGE
          value: "quay.io/external_storage/local-volume-provisioner:v2.3.0"
        volumeMounts:
        - mountPath: /etc/provisioner/config
          name: provisioner-config
          readOnly: true
        - mountPath: /dev
          name: provisioner-dev
        - mountPath: /mnt
          name: fast-disks
          mountPropagation: "HostToContainer"
      volumes:
      - name: provisioner-config
        configMap:
          name: local-provisioner-config
      - name: provisioner-dev
        hostPath:
          path: /dev
      - name: fast-disks
        hostPath:
          path: /mnt
---
# Source: provisioner/templates/provisioner-service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: local-storage-admin
  namespace: your_namespace
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
---
# Source: provisioner/templates/provisioner-cluster-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: local-storage-provisioner-pv-binding
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: your_namespace
roleRef:
  kind: ClusterRole
  name: system:persistent-volume-provisioner
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: local-storage-provisioner-node-clusterrole
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: local-storage-provisioner-node-binding
  labels:
    heritage: "Tiller"
    release: "RELEASE-NAME"
    chart: provisioner-2.3.0
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: your_namespace
roleRef:
  kind: ClusterRole
  name: local-storage-provisioner-node-clusterrole
  apiGroup: rbac.authorization.k8s.io
---
# Source: provisioner/templates/namespace.yaml