Updating Google Kubernetes Engine VM scopes with zero downtime

Published in Google Cloud - Community

Feb 1, 2017

I often hang out on the Google Cloud Platform Slack, which is a great community for learning and discussing GCP. Here is a common situation I’ve seen multiple people run into:

“I want to use Cloud SQL / Datastore / PubSub / etc. from my pods running in Kubernetes Engine. I want to use the automatic VM credentials, but my VM doesn’t have the right scopes or permissions and I can’t change it! How do I fix this?”

All Google Compute Engine VMs come with a built-in OAuth2 Service Account that can be used to automatically authenticate to various GCP services. You can choose which services this service account has permission to access by assigning different “scopes” to the account.

For example, you might give the service account the “devstorage.read_only” scope so it can only read data from Google Cloud Storage. You might give another service account the “devstorage.read_write” scope so it can read and write data. An account can have any mix of scopes, so you can give it the exact permissions it needs to get the job done. You can find a list of all Google scopes here.

This service account is often referred to as “Application Default Credentials”.
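If you are not sure which scopes a VM currently has, you can ask the metadata server from inside the VM. Here is a minimal sketch: `has_scope` is a hypothetical helper name of my own, and the `curl` call is left as a comment because it only works on a Compute Engine VM or GKE node.

```shell
# Hypothetical helper: succeeds if the scope URL passed as $1 appears
# as a whole line in the scope list read from stdin.
has_scope() {
  grep -qxF "$1"
}

# On a Compute Engine VM (or a GKE node), the metadata server lists the
# scopes of the default service account, one per line:
#
#   curl -s -H "Metadata-Flavor: Google" \
#     "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes" \
#     | has_scope "https://www.googleapis.com/auth/pubsub" \
#     && echo "pubsub scope present"
```

The helper matches whole lines only, so “devstorage.read_only” will not be mistaken for a longer scope that happens to start with the same text.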

The problem with Application Default Credentials

̶O̶n̶c̶e̶ ̶y̶o̶u̶ ̶c̶r̶e̶a̶t̶e̶ ̶t̶h̶e̶ ̶i̶n̶s̶t̶a̶n̶c̶e̶,̶ ̶y̶o̶u̶ ̶c̶a̶n̶’̶t̶ ̶c̶h̶a̶n̶g̶e̶ ̶t̶h̶e̶ ̶s̶c̶o̶p̶e̶s̶!̶ ̶T̶h̶e̶r̶e̶ ̶i̶s̶ ̶n̶o̶ ̶w̶a̶y̶ ̶t̶o̶ ̶f̶l̶i̶p̶ ̶o̶u̶t̶ ̶t̶h̶e̶ ̶A̶p̶p̶l̶i̶c̶a̶t̶i̶o̶n̶ ̶D̶e̶f̶a̶u̶l̶t̶ ̶C̶r̶e̶d̶e̶n̶t̶i̶a̶l̶s̶ ̶w̶i̶t̶h̶ ̶a̶n̶o̶t̶h̶e̶r̶ ̶o̶n̶e̶.̶ (This is no longer the case: Compute Engine now supports changing scopes on existing instances. However, I still recommend following this guide, because if a VM crashes and respawns or you change the cluster size, the new VMs will not have the updated scopes.)

There are two solutions:

Create a Service Account with the right scope, and use it directly in the app instead of using the Application Default Credentials.
Create a new instance with the right scopes, and move everything over.

The first method is definitely more robust, as each pod can have its own service account if needed, but has the additional overhead of forcing you to manage these accounts. I won’t be covering this method in this post, but it is a great option if you want more control.

The second method means you don’t have to make any code changes or manage accounts, but you risk downtime while the new VMs start up.

With Google Kubernetes Engine, you can avoid this downtime if you follow some simple steps. Let’s take a look!

The initial setup

For this blog post, we are going to have a small 3-node Kubernetes cluster on Google Kubernetes Engine running a service backed by a deployment. The deployment will have 6 replicas.

Here are the nodes:

$ kubectl get nodes
NAME                                      STATUS  AGE
gke-cluster-1-default-pool-7d6b79ce-0s6z  Ready   2m
gke-cluster-1-default-pool-7d6b79ce-9kkm  Ready   2m
gke-cluster-1-default-pool-7d6b79ce-j6ch  Ready   2m

Here are the pods (modified to fit the screen):

$ kubectl get pods -o wide
NAME                        NODE
hello-1959708372-25x63      gke-cluster-1-default-pool-7d6b79ce-0s6z
hello-1959708372-c13v2      gke-cluster-1-default-pool-7d6b79ce-9kkm
hello-1959708372-fdx7z      gke-cluster-1-default-pool-7d6b79ce-j6ch
hello-1959708372-n510f      gke-cluster-1-default-pool-7d6b79ce-0s6z
hello-1959708372-xhz0h      gke-cluster-1-default-pool-7d6b79ce-9kkm
hello-1959708372-zdmvb      gke-cluster-1-default-pool-7d6b79ce-0s6z

You can see that the pods are distributed across the nodes.

Disaster strikes!

Oh nooooo. The nodes don’t have the right permissions!

The first thing to do is create new nodes with the correct permissions. You can do this by creating a new node pool the same size as the old pool. This new node pool will sit alongside the old one, and new pods can be scheduled onto it.

Let’s say that your code needs the “devstorage.read_write” and “pubsub” scopes.

To create the new node pool, run the following command:

$ gcloud container node-pools create adjust-scope \
   --cluster <YOUR_CLUSTER_NAME> --zone <YOUR_ZONE> \
   --num-nodes 3 \
   --scopes https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/pubsub

As always, you can customize this command to fit your needs.
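Before moving any pods, it can be worth confirming that the new pool really has the scopes you asked for. This is a small sketch, not the only way to do it: `pool_scopes` is a hypothetical wrapper name, and the arguments in the usage line are placeholders.

```shell
# Hypothetical wrapper: print the OAuth scopes configured on a node pool.
# Arguments: $1 = node pool name, $2 = cluster name, $3 = zone.
pool_scopes() {
  gcloud container node-pools describe "$1" \
    --cluster "$2" --zone "$3" \
    --format "yaml(config.oauthScopes)"
}

# Usage: pool_scopes adjust-scope <YOUR_CLUSTER_NAME> <YOUR_ZONE>
```

The output should list the scope URLs you passed to `--scopes` when creating the pool.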

Now if you check the nodes, you will notice there are three more with the new pool name:

$ kubectl get nodes
NAME                                        STATUS  AGE
gke-cluster-1-adjust-scope-9ca78aa9-5gmk    Ready   9m
gke-cluster-1-adjust-scope-9ca78aa9-5w6w    Ready   9m
gke-cluster-1-adjust-scope-9ca78aa9-v88c    Ready   9m
gke-cluster-1-default-pool-7d6b79ce-0s6z    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-9kkm    Ready   3h
gke-cluster-1-default-pool-7d6b79ce-j6ch    Ready   3h

However, the pods are still on the old nodes!

Time to drain

At this point, we could simply delete the old node pool. Kubernetes will detect that the pods are no longer running, and will reschedule them to the new nodes.

However, this will introduce some downtime to the application, since it will take time for Kubernetes to detect that the nodes are down and start the containers on the new hosts. This may only be a few seconds or minutes, but that might be unacceptable!

The better way is to remove the pods from the old nodes one node at a time, and then remove the nodes from the cluster. Thankfully, Kubernetes has built-in commands to do this.

First, cordon each of the old nodes. This will prevent new pods from being scheduled onto them.

$ kubectl cordon <NODE_NAME>
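With more than a handful of nodes, cordoning them one by one gets tedious. Here is a rough sketch of a loop that cordons a whole pool at once. It assumes the nodes carry the `cloud.google.com/gke-nodepool` label that GKE sets on them, and `cordon_pool` is a name I made up for illustration.

```shell
# Hypothetical helper: cordon every node in one GKE node pool.
# Argument: $1 = node pool name (for example, default-pool).
cordon_pool() {
  pool="$1"
  # -o name prints one "node/<NODE_NAME>" entry per line; strip the
  # prefix so only the bare node name is passed to kubectl cordon.
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" -o name); do
    kubectl cordon "${node##*/}"
  done
}

# Usage: cordon_pool default-pool
```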

Then, drain each node. This will delete all the pods on that node.

Warning: Make sure your pods are managed by a ReplicaSet, Deployment, StatefulSet, or something similar. Standalone pods won’t be rescheduled!

$ kubectl drain <NODE_NAME> --force

After you drain a node, make sure the new pods are up and running before moving on to the next one.
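The drain-then-wait routine can be scripted too. The sketch below drains one node, waits for the Deployment to report that its replicas are available again, and only then moves on. The function name `drain_pool` is hypothetical, and the Deployment name `hello` in the usage line is my guess based on the pod names in this post.

```shell
# Hypothetical helper: drain the nodes of a pool one at a time, waiting
# for the Deployment to recover before touching the next node.
# Arguments: $1 = node pool name, $2 = Deployment name.
drain_pool() {
  pool="$1"
  deployment="$2"
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" -o name); do
    # Stop at the first failure so a stuck drain is not silently skipped.
    kubectl drain "${node##*/}" --force || return 1
    # Blocks until the Deployment reports its replicas as available.
    kubectl rollout status "deployment/${deployment}" || return 1
  done
}

# Usage: drain_pool default-pool hello
```

If a node is running DaemonSet-managed pods, `kubectl drain` will typically refuse to proceed unless you also pass `--ignore-daemonsets`, so you may need to add that flag for your cluster.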

Once that is done, you can see that all the pods are running on the new nodes!

$ kubectl get pods -o wide
NAME                        NODE
hello-1959708372-neu42      gke-cluster-1-adjust-scope-9ca78aa9-5gmk
hello-1959708372-vvjd8      gke-cluster-1-adjust-scope-9ca78aa9-v88c
hello-1959708372-cn28s      gke-cluster-1-adjust-scope-9ca78aa9-5gmk
hello-1959708372-cm9sd      gke-cluster-1-adjust-scope-9ca78aa9-5w6w
hello-1959708372-d92jh      gke-cluster-1-adjust-scope-9ca78aa9-v88c
hello-1959708372-b039s      gke-cluster-1-adjust-scope-9ca78aa9-5w6w
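On a bigger cluster, eyeballing the pod list is error prone. As a sanity check, you could list any pods in the current namespace that are still scheduled on the old pool. This is only a sketch: `pods_on_pool` is a hypothetical name, and it again relies on the `cloud.google.com/gke-nodepool` node label.

```shell
# Hypothetical helper: print the pods (current namespace only) that are
# still scheduled on any node of the given pool.
# Argument: $1 = node pool name.
pods_on_pool() {
  pool="$1"
  for node in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" -o name); do
    # spec.nodeName is the node a pod has been scheduled onto.
    kubectl get pods --field-selector "spec.nodeName=${node##*/}" -o name
  done
}

# Usage: pods_on_pool default-pool
```

If this prints nothing, none of your pods in that namespace are left on the old nodes.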

Delete the old pool

Now that all the pods are safely rescheduled, it is time to delete the old pool.

Replace “default-pool” with the pool you want to delete.

$ gcloud container node-pools delete default-pool \
   --cluster <YOUR_CLUSTER_NAME> --zone <YOUR_ZONE>

That’s it! You have updated your cluster with new scopes, and you did it with zero downtime!