How to Upgrade a Kubernetes Cluster With No Downtime

Tom Dickman
Sep 19, 2016 · 3 min read

At RetailMeNot, teams traditionally use AWS services such as EC2 Container Service and Elastic Beanstalk, but as members of the Labs group, we are tasked with rapidly prototyping and deploying apps. To that end, we have been using Google Container Engine (GKE), Google’s managed version of Kubernetes, as a low-overhead way of exploring Kubernetes.

Kubernetes operates on a three-month release cycle, so we recently planned to upgrade our clusters from version 1.2.x to 1.3.x. Thankfully, GKE provides a straightforward process for cluster upgrades, initiated via a few clicks in its web interface. Unfortunately, our testing showed some downtime during this process, but we were able to find a better way.

GKE’s Default Upgrade Procedure

The master is the control plane for Kubernetes, and it includes the Kubernetes API server, the scheduler, and the controller manager server. While the master is being upgraded, pods will continue to run, but new pods cannot be scheduled, and stopped pods will not restart.

Each worker node runs the kubelet, which manages pods and their associated images, volumes, secrets, etc., and the kube-proxy, which acts as a simple load balancer for services. During the upgrade process, Google takes each node down and upgrades the kubelet version running on that node. This process continues without waiting for the upgraded node to return to the ready state, which can result in having:

  1. Too few nodes to run all scheduled pods
  2. Not enough time to restart pods on neighboring nodes

These two side effects create a risk that there’s no pod running in a replica set, resulting in downtime for your services.

A Better Way

First, switch to your project in Google Cloud.

gcloud config set project {project_name}

Next, create a new node pool. Note: You can do this via the Google Cloud web interface, but by default the new nodes will not have any permissions (access scopes). To give the new nodes permissions, create the pool from the command line and list the scopes explicitly:

gcloud container node-pools create {node_pool_name} --num-nodes {num_nodes} --machine-type {machine_type} --cluster {cluster_name} --scopes https://www.googleapis.com/auth/userinfo.email,\
https://www.googleapis.com/auth/compute,\
https://www.googleapis.com/auth/devstorage.full_control,\
https://www.googleapis.com/auth/taskqueue,\
https://www.googleapis.com/auth/bigquery,\
https://www.googleapis.com/auth/sqlservice.admin,\
https://www.googleapis.com/auth/datastore,\
https://www.googleapis.com/auth/logging.admin,\
https://www.googleapis.com/auth/monitoring,\
https://www.googleapis.com/auth/cloud-platform,\
https://www.googleapis.com/auth/bigtable.data,\
https://www.googleapis.com/auth/bigtable.admin,\
https://www.googleapis.com/auth/pubsub,\
https://www.googleapis.com/auth/servicecontrol,\
https://www.googleapis.com/auth/service.management,\
https://www.googleapis.com/auth/logging.write \
--zone {zone}

Ensure that your new node pool is provisioned to handle all of the existing pods. Then set the cluster as the default for subsequent gcloud container commands.

gcloud config set container/cluster {cluster_name}
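
One rough way to check that the new pool is large enough is to compare the number of pods in the cluster with the pod capacity the new nodes report. The sketch below is only an illustration: it assumes GKE has labeled each node with cloud.google.com/gke-nodepool, the function names are hypothetical, and it compares pod counts only, so CPU and memory requests still need a separate look.

```shell
#!/usr/bin/env bash
# Sketch: compare the cluster's pod count with a pool's pod capacity.
# Assumes nodes carry the label cloud.google.com/gke-nodepool=<pool name>.

# Sum the allocatable pod counts advertised by the nodes in a pool.
pool_pod_capacity() {
  local pool="$1"
  local total=0
  local n
  for n in $(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" \
      -o jsonpath='{.items[*].status.allocatable.pods}'); do
    total=$((total + n))
  done
  echo "$total"
}

# Count every pod in the cluster, across all namespaces.
cluster_pod_count() {
  kubectl get pods --all-namespaces --no-headers | wc -l | tr -d ' '
}

# Succeed only if the pool could hold every existing pod.
pool_can_hold_all_pods() {
  local capacity
  local pods
  capacity="$(pool_pod_capacity "$1")"
  pods="$(cluster_pod_count)"
  echo "pods=${pods} capacity=${capacity}"
  [ "$pods" -le "$capacity" ]
}
```

Running pool_can_hold_all_pods {node_pool_name} prints both numbers and returns a non-zero status when the pool looks too small.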

Next, verify that your new nodes are running and are in the ‘ready’ state.

gcloud container clusters get-credentials {cluster_name}
kubectl get nodes
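
If you would rather not re-run kubectl get nodes by hand, a small polling loop can do the waiting. This is a minimal sketch, not part of the original procedure: the function names, retry count, and delay are hypothetical, and it relies on the second column of kubectl get nodes --no-headers being the node status.

```shell
#!/usr/bin/env bash
# Sketch: list nodes that are not yet Ready, and poll until none remain.

# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Poll until every node reports Ready, or give up after a number of tries.
# $1 = number of tries (default 30), $2 = seconds between tries (default 10).
wait_for_ready_nodes() {
  local tries="${1:-30}"
  local delay="${2:-10}"
  local pending
  while [ "$tries" -gt 0 ]; do
    pending="$(kubectl get nodes --no-headers | not_ready_nodes)"
    if [ -z "$pending" ]; then
      echo "all nodes Ready"
      return 0
    fi
    echo "waiting for: $(echo "$pending" | tr '\n' ' ')"
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

Calling wait_for_ready_nodes with no arguments checks up to 30 times, 10 seconds apart, and returns a non-zero status if some node never becomes ready.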

Mark the old nodes as unschedulable. This will force pods to only start on new, upgraded nodes, as you drain nodes in the next step.

kubectl cordon {old_node_name}  # Repeat this command for each old node
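
For pools with more than a handful of nodes, the repetition can be scripted. The sketch below assumes the old nodes carry GKE's cloud.google.com/gke-nodepool label with the old pool's name as its value; pool_nodes and cordon_pool are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: cordon every node in one node pool.
# Assumes GKE's cloud.google.com/gke-nodepool node label is present.

# Print the node names in a pool, one per line.
pool_nodes() {
  kubectl get nodes -l "cloud.google.com/gke-nodepool=$1" \
    -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n'
}

# Mark each node in the pool unschedulable, stopping at the first failure.
cordon_pool() {
  local node
  for node in $(pool_nodes "$1"); do
    kubectl cordon "$node" || return 1
  done
}
```

Pass the name of the old pool, not the new one, for example cordon_pool default-pool if your original pool has that name.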

Begin draining nodes from your old node pool, one at a time, waiting until all pods have been scheduled on other nodes before proceeding. DaemonSets run on all nodes, so you will need to pass the --ignore-daemonsets flag if you are running any on your cluster.

kubectl drain {old_node_name} --ignore-daemonsets  # Repeat for each old node, one at a time
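
The one-at-a-time rule can also be enforced with a loop. This is a rough sketch under a few assumptions: the helper names and the DRAIN_POLL_SECONDS variable are hypothetical, and "all pods have been scheduled" is approximated as "no pod shows a Pending status" in the fourth column of kubectl get pods --all-namespaces --no-headers.

```shell
#!/usr/bin/env bash
# Sketch: drain the given nodes strictly one at a time.

# Count pods that are still Pending. Reads
# `kubectl get pods --all-namespaces --no-headers` output on stdin,
# where the fourth column is the pod status.
pending_pod_count() {
  awk '$4 == "Pending" { n++ } END { print n + 0 }'
}

# Drain each node named on the command line. After each drain, wait until
# no pod is left Pending before moving on to the next node.
drain_one_at_a_time() {
  local node
  for node in "$@"; do
    kubectl drain "$node" --ignore-daemonsets || return 1
    while [ "$(kubectl get pods --all-namespaces --no-headers | pending_pod_count)" -gt 0 ]; do
      sleep "${DRAIN_POLL_SECONDS:-5}"
    done
    echo "drained ${node}"
  done
}
```

The function stops at the first drain that fails, so a node that cannot be drained never causes the next one to be taken out of service.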

Once you have completed this process, you can safely delete your old node pool, and your cluster will be running on the new version.

gcloud container node-pools delete {old_node_pool_name} --cluster {cluster_name}
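
To confirm the result, you can check which kubelet versions the remaining nodes report. The sketch below is one possible check, with hypothetical function names; the version string you pass in is whatever release you upgraded to.

```shell
#!/usr/bin/env bash
# Sketch: confirm that every remaining node runs the same kubelet version.

# Print each distinct kubelet version in the cluster, one per line.
kubelet_versions() {
  kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.kubeletVersion}' \
    | tr ' ' '\n' | sort -u
}

# Succeed only if all nodes report exactly the expected version ($1).
all_nodes_on_version() {
  [ "$(kubelet_versions)" = "$1" ]
}
```

If kubelet_versions prints more than one line, some nodes from the old pool are still part of the cluster.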

Concluding Thoughts

RetailMeNot Engineering

Saving The World Money Since ‘09
