Stateful Services on Preemptible Nodes with Google Kubernetes Engine
There are numerous benefits to running a Kubernetes workload on Google Kubernetes Engine (GKE). Between Google’s significant contributions to the Kubernetes open source project and its long history of running containers internally on Borg (the system that inspired Kubernetes), Google is at the forefront of running and orchestrating containers in general, and especially in the public cloud.
Among the advantages of running Kubernetes workloads via GKE is the integration between microservices running on GKE and the other services offered by Google Cloud Platform (GCP). It’s relatively simple to leverage Google’s data warehousing technologies such as BigQuery from GKE without worrying about the details of data throughput or authentication, since much of that is fully managed by Google. Spinning up clusters with GPUs, and soon TPUs, is also just a few clicks away.
With Kubernetes, and more specifically GKE, managing the orchestration of containers and their underlying nodes, the appeal of using Google’s preemptible nodes becomes more apparent. Though preemptible nodes can die at any time (once notified by Google via SIGTERM), they can represent dramatic savings, in the ballpark of 70% off the cost of a traditional long-running node. Since Google manages the scaling of the containers and the nodes, the death of a node shouldn’t matter to our microservices or our application overall.
This is great for most microservices, but in some cases, microservices need to maintain state. Even if statefulness is transient and state is eventually offloaded, this can be troublesome for workloads with preemptible instances.
Due to the nature of GKE and preemptible nodes, once a node is scheduled for deletion it receives a SIGTERM, but the underlying pods never know they’re going to die until they are actually terminated. Again, for stateless services this causes no concern, since GKE simply spins up new nodes onto which the pods can be scheduled.
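As an aside, the preemption notice is also visible from inside the VM itself: the GCE metadata server exposes a preempted flag that a process on the node can watch. We won’t rely on that here (we’ll use a shutdown script instead), but a minimal sketch of checking it looks like this:

# Returns TRUE once the instance has been preempted; wait_for_change blocks until the value flips
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true" \
  -H "Metadata-Flavor: Google"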
In the case of stateful services, we need to account for the termination of the nodes and manage their shutdown in a graceful manner.
To solve this, we had a few requirements to consider.
- A node must notify the pods running on it that it is going to die, with enough time for them to save or offload state
- We want to adhere to the principle of least privilege (part of an overall secure cloud) to ensure that a node only has the ability to kubectl drain, and not any of the other undesirable things which might happen if we grant too much permission to kubectl
- The permissions required for the kubectl agent to communicate with the GKE master must be granted to each GKE node at startup (so that it can run kubectl drain `hostname`)
- The nodes come with a secure CoreOS operating system. Therefore, gcloud is not installed and cannot be installed. In fact, almost nothing can actually be installed on the node itself for security reasons
- We want to keep things as centrally managed and simple as possible
We determined that the best way forward was to deliver the required config files to each GKE node via a Google Compute Engine (GCE) startup script, and to ensure graceful pod shutdown by running kubectl drain `hostname` on each GKE node via a GCE shutdown script.
To make this solution work, there is some prep work required. We first must create a Kubernetes role with the permissions required to drain nodes and nothing else.
vim drain-role.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: system:node-drainer
rules:
- apiGroups:
  - ""
  resources:
  - pods/eviction
  verbs:
  - create
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - get
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - replicasets
  verbs:
  - get
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - patch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
Next, we need to establish a Kubernetes service account user in drain-user.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: drain-user
  namespace: default

We then need to make sure to bind this role to this user via drain-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: drain-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-drainer
subjects:
- kind: ServiceAccount
  name: drain-user
  namespace: default

We then need to create our GKE cluster and gain access credentials to it.
gcloud container clusters create [CLUSTER_NAME]
gcloud container clusters get-credentials [CLUSTER_NAME]

This will create a config file at ~/.kube/config with which we can use kubectl. Note that this config should be protected, as it contains the credentials to communicate with the GKE master as an administrator.
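As a small precaution, we can at least lock the file down to our own user and confirm kubectl is pointed at the new cluster:

chmod 600 ~/.kube/config        # restrict the admin config to our own user
kubectl config current-context  # confirm kubectl now targets the new cluster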
Now that we have permission to do whatever we want with our GKE master, we’ll apply the YAMLs we previously created and get our new service account user established and configured.
kubectl create -f drain-role.yaml
kubectl create -f drain-user.yaml
kubectl create -f drain-binding.yaml

Now we need to get the token for our new service account, "drain-user".
kubectl describe serviceAccounts drain-user
Which yields:
Name: drain-user
Namespace: default
Labels: <none>
Annotations:         kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"ServiceAccount","metadata":{"annotations":{},"name":"drain-user","namespace":"default"}}
Image pull secrets:  <none>
Mountable secrets:   drain-user-token-z9kr6
Tokens:              drain-user-token-z9kr6
Events:              <none>
Note the token name, drain-user-token-z9kr6. We need it for the next command, which fetches the token itself. Again, it must be noted that this token is what grants the permissions of the service account; you’ll want to make sure it is kept secure.
kubectl describe secrets drain-user-token-z9kr6
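If you’d rather script this than read the token name out of the describe output, something along these lines should work (the secret name will differ in your cluster; the token is stored base64-encoded in the secret):

TOKEN_SECRET=$(kubectl get serviceaccount drain-user -o jsonpath='{.secrets[0].name}')
kubectl get secret "$TOKEN_SECRET" -o jsonpath='{.data.token}' | base64 --decode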
We now need to take a look at the file created by gcloud under ~/.kube/config. We should start by making a copy (since we don’t want to damage this file and risk losing access to the GKE master). We then want to trim out the users section and replace it with our newly obtained service account user token.
Our final config file should look like the following (secrets redacted):
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED # from the original config file
    server: https://REDACTED # from the original config file
  name: CLUSTER_NAME # from the original config file
contexts:
- context:
    cluster: CLUSTER_NAME # from the original config file
    user: CLUSTER_NAME # from the original config file
  name: CLUSTER_NAME # from the original config file
current-context: CLUSTER_NAME # from the original config file
kind: Config
preferences: {}
users:
- name: CLUSTER_NAME # from the original config file
  user:
    token: REDACTED # from our previous command

Make sure this file is placed at ~/.kube/config, where the home directory is that of the user who will run kubectl.
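At this point it’s worth sanity-checking that the new config really is limited to what the drain role allows. With the drain-user config active, kubectl auth can-i should report something like:

kubectl auth can-i get nodes        # yes
kubectl auth can-i patch nodes      # yes
kubectl auth can-i list pods        # yes
kubectl auth can-i create pods      # no, not granted by the role
kubectl auth can-i delete nodes     # no, not granted by the role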
We now need to create a new GCE instance group based on the initial cluster created by GKE and ensure the startup script is amended to also put this config file in place during startup.
First, we need to get the instance template that is being used by the GKE cluster.
gcloud container clusters describe [CLUSTER_NAME] --zone [CLUSTER_ZONE]
We need to extract the instance group name from this output and use it in the following command to find the instance template, whose metadata contains the startup and shutdown scripts. These scripts are critical to copy as they lay the foundation required for the node to talk to its GKE master.
...
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/my-gcp-project/zones/us-central1-a/instanceGroupManagers/my-gke-cluster-group
...

gcloud compute instance-groups managed describe my-gke-cluster-group --zone us-central1-a
This now gives us the instance template used for this group.
...
instanceGroup: https://www.googleapis.com/compute/v1/projects/my-gcp-project/zones/us-central1-a/instanceGroups/my-gke-cluster-group
instanceTemplate: https://www.googleapis.com/compute/v1/projects/my-gcp-project/global/instanceTemplates/my-gke-cluster-instance-template
kind: compute#instanceGroupManager
...

And now that we finally have the instance template, we can fetch the metadata, which includes the startup and shutdown scripts, among other important pieces of data. Again, it’s worth noting that there are lots of very valuable secrets in these scripts, and as such, they must be handled like passwords or any other piece of sensitive data.
gcloud compute instance-templates describe my-gke-cluster-instance-template

Which gives us:
...
metadata:
  fingerprint: REDACTED
  items:
  - key: cluster-location
    value: us-central1-a
  - key: kube-env
    value: REDACTED
  - key: google-compute-enable-pcid
    value: 'true'
  - key: user-data
    value: REDACTED
  - key: gci-update-strategy
    value: update_disabled
  - key: gci-ensure-gke-docker
    value: 'true'
  - key: configure-sh
    value: REDACTED
  - key: cluster-name
    value: REDACTED
kind: compute#metadata
...

Now that we have all of the information we need, we can create our new GCE instance group which will include preemptible instances, all of the configuration information required to communicate with the GKE master, all of the required scaling logic (illustrated below as an example), and the ability to drain pods from a preemptible node upon SIGTERM.
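As an aside, the lookup chain we just walked through (cluster, to instance group, to instance template, to metadata) can also be scripted with gcloud’s --format flag. A rough sketch, assuming a single node pool and the example names and zone used above:

GROUP_URL=$(gcloud container clusters describe [CLUSTER_NAME] --zone us-central1-a \
  --format='value(instanceGroupUrls[0])')
GROUP_NAME=$(basename "$GROUP_URL")
TEMPLATE_URL=$(gcloud compute instance-groups managed describe "$GROUP_NAME" \
  --zone us-central1-a --format='value(instanceTemplate)')
gcloud compute instance-templates describe "$(basename "$TEMPLATE_URL")"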
Note, you can also specify metadata as a file, which may make sense for extremely large or specially formatted values (such as startup scripts).
We’ll want to take the output from the configure-sh key and create a new file, startup.sh, with the contents. We then need to append the following to the end of that file.
mkdir ~/.kube
touch ~/.kube/config
echo "CONTENTS_OF_KUBE_CONFIG" > ~/.kube/config # this should be the actual config file

We also need to create our shutdown script, which we'll save as shutdown.sh.
kubectl drain `hostname` --force --ignore-daemonsets
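One caveat: a preempted VM gets roughly 30 seconds before it is terminated, and the shutdown script runs as root, so the kubeconfig path must match wherever startup.sh wrote it. A slightly more defensive variant of shutdown.sh (paths and timings here are illustrative, not prescriptive) might be:

export KUBECONFIG=/root/.kube/config  # assumes startup.sh ran as root and wrote ~/.kube/config there
kubectl drain `hostname` --force --ignore-daemonsets --grace-period=10 --timeout=25s  # stay inside the ~30s preemption window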
Now we can put it all together in an instance template.
gcloud compute instance-templates create new-gke-node-template \
  --machine-type=n1-standard-1 \
  --preemptible \
  --image-project=coreos-cloud \
  --image-family=coreos-stable \
  --metadata cluster-location=us-central1-a,kube-env=REDACTED,google-compute-enable-pcid='true',user-data=REDACTED,gci-update-strategy=update_disabled,gci-ensure-gke-docker='true',cluster-name=REDACTED \
  --metadata-from-file=configure-sh=~/path/to/startup.sh,shutdown-script=~/path/to/shutdown.sh

Now we have an instance template which will create instances which are able to communicate with the GKE master and will drain themselves when preempted by Google. Our last step is to create the managed instance group based on this instance template so that we get the scaling and self-healing that we’re after.
gcloud compute instance-groups managed create my-new-gke-instance-group \
  --size 5 \
  --zone us-central1-a \
  --template new-gke-node-template

gcloud compute instance-groups managed set-autoscaling my-new-gke-instance-group \
  --max-num-replicas 40 \
  --zone us-central1-a
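Before trusting this in production, it’s worth testing that the drain actually fires. One way to do that (instance names will differ; this is just a sketch) is to simulate a maintenance event on one of the preemptible instances, which causes it to be preempted, and watch the corresponding node get cordoned and drained:

gcloud compute instance-groups managed list-instances my-new-gke-instance-group --zone us-central1-a
gcloud compute instances simulate-maintenance-event [INSTANCE_NAME] --zone us-central1-a
kubectl get nodes -w        # in another terminal, watch the node get cordoned
kubectl get pods -o wide -w # and watch its pods get evicted and rescheduled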
We’ve now met our initial requirements. We have an autoscaling instance group of preemptible nodes communicating with our Google-managed GKE master. The nodes drain themselves upon a SIGTERM (manual or automated), and each node only has permission to drain, not to perform any other Kubernetes action. Our secrets are stored only within the instance template (which is protected by the same security that protects the rest of our GCP data), and everything is centrally managed by GCP constructs (though it would be relatively simple to automate this via something like Terraform).
We’re now able to take advantage of all that GKE offers while still allowing for the best cost optimization possible under preemptible nodes, all without sacrificing statefulness!

