Updating Google Kubernetes Engine VM scopes with zero downtime
I often hang out on the Google Cloud Platform Slack, which is a great community for learning and discussing GCP. Here is a common situation I’ve seen multiple people run into:
“I want to use Cloud SQL / Datastore / PubSub / etc. from my pods running in Kubernetes Engine. I want to use the automatic VM credentials, but my VM doesn’t have the right scopes or permissions and I can’t change it! How do I fix this?”
All Google Compute Engine VMs come with a built-in OAuth2 service account that can be used to automatically authenticate to various GCP services. You choose which services this service account has permission to access by assigning different “scopes” to the account.
For example, you might give the service account the “devstorage.read_only” scope so it can only read data from Google Cloud Storage. You might give another service account the “devstorage.read_write” scope so it can read and write data. An account can have any mix of scopes, so you can give it exactly the permissions it needs to get the job done. Google publishes a full list of OAuth 2.0 scopes in its documentation.
This service account is often referred to as “Application Default Credentials.”
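If you want to see which scopes a node currently has, you can ask the VM’s metadata server from inside the node. (This endpoint only resolves on Compute Engine / Kubernetes Engine VMs, so run it from an SSH session on the node or from a pod.)

```shell
# Query the metadata server for the scopes granted to the
# default service account on this VM. Prints one scope URL per line.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
```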
The problem with Application Default Credentials
Once you created the instance, you couldn’t change the scopes. There was no way to swap out the Application Default Credentials for another set. (This is no longer the case; Compute Engine now supports changing scopes on running instances. However, I still recommend following this guide. If the VM crashes and respawns, or you change the cluster size, the new VMs will not have the updated scopes.)
There are two solutions:
- Create a Service Account with the right scope, and use it directly in the app instead of using the Application Default Credentials.
- Create a new instance with the right scopes, and move everything over.
The first method is definitely more robust, as each pod can have its own service account if needed, but has the additional overhead of forcing you to manage these accounts. I won’t be covering this method in this post, but it is a great option if you want more control.
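For reference, the first method looks roughly like this. It is only a sketch: the account name, role, and secret name are all illustrative, and you would pick a role that matches the services your app actually uses.

```shell
# Create a dedicated service account (name is illustrative).
gcloud iam service-accounts create my-app-sa

# Grant it a role on the project, e.g. Pub/Sub editor.
gcloud projects add-iam-policy-binding <YOUR_PROJECT> \
  --member serviceAccount:my-app-sa@<YOUR_PROJECT>.iam.gserviceaccount.com \
  --role roles/pubsub.editor

# Download a key and store it as a Kubernetes secret the pods can mount.
gcloud iam service-accounts keys create key.json \
  --iam-account my-app-sa@<YOUR_PROJECT>.iam.gserviceaccount.com
kubectl create secret generic my-app-key --from-file=key.json
```

Your pods would then mount the secret and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the key file.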
The second method means you don’t have to make any code changes or manage accounts, but you risk downtime while the new VMs start up.
With Google Kubernetes Engine, you can avoid this downtime if you follow some simple steps. Let’s take a look!
The initial setup
For this blog post, we are going to have a small 3-node Kubernetes cluster on Google Kubernetes Engine running a service backed by a deployment. The deployment will have 6 replicas.
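If you want to follow along, a setup like this can be created with something along these lines (the cluster name, zone, and sample image are illustrative):

```shell
# A small 3-node cluster.
gcloud container clusters create cluster-1 --num-nodes 3 --zone <YOUR_ZONE>

# A deployment with 6 replicas, exposed as a service.
kubectl create deployment hello --image=gcr.io/google-samples/hello-app:1.0
kubectl scale deployment hello --replicas=6
kubectl expose deployment hello --port 80 --target-port 8080
```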
Here are the nodes:
$ kubectl get nodes
NAME STATUS AGE
gke-cluster-1-default-pool-7d6b79ce-0s6z Ready 2m
gke-cluster-1-default-pool-7d6b79ce-9kkm Ready 2m
gke-cluster-1-default-pool-7d6b79ce-j6ch Ready 2m
Here are the pods (modified to fit the screen):
$ kubectl get pods -o wide
hello-1959708372-c13v2 gke-cluster-1-default-pool-7d6b79ce-9kkm
hello-1959708372-fdx7z gke-cluster-1-default-pool-7d6b79ce-j6ch
hello-1959708372-n510f gke-cluster-1-default-pool-7d6b79ce-0s6z
hello-1959708372-xhz0h gke-cluster-1-default-pool-7d6b79ce-9kkm
hello-1959708372-zdmvb gke-cluster-1-default-pool-7d6b79ce-0s6z
You can see that the pods are distributed across the nodes.
Oh nooooo. The nodes don’t have the right permissions!
The first thing to do is create new nodes with the correct permissions. You can do this by creating a new node pool the same size as the old pool. This new node pool will sit alongside the old one, and new pods can be scheduled onto it.
Let’s say that your code needs the “devstorage.read_write” and “pubsub” scopes.
To create the new node pool, run the following command:
$ gcloud container node-pools create adjust-scope \
  --cluster <YOUR_CLUSTER_NAME> --zone <YOUR_ZONE> \
  --num-nodes 3 \
  --scopes https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/pubsub
As always, you can customize this command to fit your needs.
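To double-check that the new pool really has the scopes you asked for, you can describe it. (The `config.oauthScopes` field path comes from the GKE node pool API; verify it against your gcloud version.)

```shell
# Print the OAuth scopes configured on the new node pool.
gcloud container node-pools describe adjust-scope \
  --cluster <YOUR_CLUSTER_NAME> --zone <YOUR_ZONE> \
  --format="value(config.oauthScopes)"
```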
Now if you check the nodes, you will notice there are three more with the new pool name:
$ kubectl get nodes
NAME STATUS AGE
gke-cluster-1-adjust-scope-9ca78aa9-5gmk Ready 9m
gke-cluster-1-adjust-scope-9ca78aa9-5w6w Ready 9m
gke-cluster-1-adjust-scope-9ca78aa9-v88c Ready 9m
gke-cluster-1-default-pool-7d6b79ce-0s6z Ready 3h
gke-cluster-1-default-pool-7d6b79ce-9kkm Ready 3h
gke-cluster-1-default-pool-7d6b79ce-j6ch Ready 3h
However, the pods are still on the old nodes!
Time to drain
At this point, we could simply delete the old node pool. Kubernetes will detect that the pods are no longer running, and will reschedule them to the new nodes.
However, this will introduce some downtime to the application, since it will take time for Kubernetes to detect that the nodes are down and start the containers on the new hosts. This may only be a few seconds or minutes, but that might be unacceptable!
The better way is to remove the pods from the old nodes one node at a time, and then remove each node from the cluster. Thankfully, Kubernetes has built-in commands to do this.
First, cordon each of the old nodes. This will prevent new pods from being scheduled onto them.
$ kubectl cordon <NODE_NAME>
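On Kubernetes Engine, nodes carry a `cloud.google.com/gke-nodepool` label, so you can cordon every node in the old pool in one shot rather than typing each name (this assumes the old pool is called `default-pool`):

```shell
# Cordon all nodes belonging to the old pool.
kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o name \
  | xargs -I {} kubectl cordon {}
```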
Then, drain each node. This will delete all the pods on that node.
Warning: Make sure your pods are managed by a ReplicaSet, Deployment, StatefulSet, or something similar. Standalone pods won’t be rescheduled!
$ kubectl drain <NODE_NAME> --force --ignore-daemonsets
After you drain a node, make sure the new pods are up and running before moving on to the next one.
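The drain-and-wait cycle can be scripted. This is a sketch that assumes the deployment is named `hello` and the old pool is `default-pool`; `kubectl rollout status` blocks until the deployment is back at its desired number of available replicas.

```shell
# Drain each old node, waiting for the deployment to recover in between.
# kubectl drain cordons the node, so a separate cordon step isn't needed here.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o name); do
  kubectl drain "$node" --force --ignore-daemonsets
  kubectl rollout status deployment/hello
done
```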
Once that is done, you can see that all the pods are running on the new nodes!
$ kubectl get pods -o wide
hello-1959708372-vvjd8 gke-cluster-1-adjust-scope-9ca78aa9-v88c
hello-1959708372-cn28s gke-cluster-1-adjust-scope-9ca78aa9-5gmk
hello-1959708372-cm9sd gke-cluster-1-adjust-scope-9ca78aa9-5w6w
hello-1959708372-d92jh gke-cluster-1-adjust-scope-9ca78aa9-v88c
hello-1959708372-b039s gke-cluster-1-adjust-scope-9ca78aa9-5w6w
Delete the old pool
Now that all the pods are safely rescheduled, it is time to delete the old pool.
Replace “default-pool” with the pool you want to delete.
$ gcloud container node-pools delete default-pool \
--cluster <YOUR_CLUSTER_NAME> --zone <YOUR_ZONE>
That’s it! You have updated your cluster with new scopes, and did it with zero downtime!