Rethinking Live Software Updates with Kubernetes

Karla Saur
9 min read · Jan 25, 2019


Updating software often causes a disruption in availability, commonly referred to as downtime. I really hate software downtime. Like really, really hate it. (I wrote my entire PhD thesis on ways to avoid program restarts or downtime during updates by enabling runtime software upgrades.) Downtime is bad for the businesses providing the software or service, and it is inconvenient and annoying for their customers. What's worse is that the collective desire to avoid downtime makes everyone less likely to perform software upgrades at all, and not updating regularly usually causes even longer and nastier outages as the upgrade debt builds up.

https://xkcd.com/1328/

Many software services can avoid downtime by updating software dynamically, performing updates silently in the background without a disruption to users. Lots of systems do this today, through rolling updates in backend server programs, or in the background on your laptop, or “in-place” in the cloud.

I started working at Microsoft on the Azure Kubernetes Service (AKS) team a few months ago, and I was really excited when my first assigned task was to upgrade all of the running instances of Docker across all of our AKS Kubernetes clusters. I have spent a lot of time thinking about dynamic software upgrades at large scale, but before I joined Microsoft, I had mostly worked on long-running, somewhat stateful, C-based server programs. Upgrading something as critical as the underlying container system across a very large number of VM nodes in 15 geographic regions, with a very large number of connected production customers, was a whole new fun (scary?) challenge!

From my previous experiences, I had observed that rebooting is generally much more disruptive than finding a way to either hot-patch or dynamically upgrade, especially when the latter prevents state loss or connectivity outages. I had heard that Kubernetes was self-healing, but I didn't really believe it or understand what that meant.

TL;DR: This blog post is the story of how I learned that Kubernetes changes the game and can make rebooting the nodes actually less disruptive than updating node software without a reboot. If your Pods don’t have ephemeral state or you can save any state constantly to a persistent DB, then a cordon/drain update of a Kubernetes node can be worse in terms of recovery and speed than a simple node reboot.

Azure Kubernetes Service (AKS)

Before I jump into the update fun, let me first provide a bit of context. In AKS, we provide Azure-hosted control planes so that customers do not need to worry about managing their Master VMs and can focus on managing their actual workloads in their Agent VMs. For each customer, AKS maintains a full Kubernetes Master setup (etcd, scheduler, API server, controller, etc.), represented by the dark blue box below. We run our services on a farm of Kubernetes clusters, with multiple clusters within a region.

Each of our Kubernetes clusters has multiple agent (minion) nodes and several master nodes, where each node is an Azure VM. Also, each region has a varying number of Kubernetes clusters depending on demand. There are 15 regions as of this writing, with more planned. This adds up to a lot of VMs, which all have their own underlying software to support the system in addition to Kubernetes.

AKS became generally available in June of 2018, which means that the system is still fairly new. As it grows and expands, we must constantly keep the underlying software up-to-date. AKS is now one of the fastest growing services in the history of Azure, and we have to keep things online and updated all the time while rapidly growing capacity. This means that we must be able to upgrade our current systems without downtime. Updating by shuffling workloads onto newly updated VMs and destroying the old ones isn't an ideal solution because it is costly and takes time, so in-place updates are necessary.

The Update

We wanted to standardize on Moby for easier cross-team management, and we needed to roll out a new version of Moby to all of our clusters and replace an older version of Docker. There's a lot of confusion about the relationship between Moby and Docker, but very roughly, Moby is the upstream version of Docker. In this blog post, I'll continue to refer to Moby as Docker to avoid confusion.

To do the update cleanly, we needed to purge all of the older Docker components (e.g., docker-engine and docker-cli) and install the new Docker versions, because there were enough differences between the versions and their packaging. (Attempting something like sudo dpkg -i --force-overwrite caused significantly more problems than it solved.) The goal was to put each node in a state where the old Docker could be purged and the new Docker installed safely, without losing any state or causing any noticeable outage.
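For the curious, the per-node package swap boils down to something like the shell sketch below, assuming Debian/Ubuntu-based nodes; the Moby package names (moby-engine, moby-cli) and repository setup are illustrative, not an exact transcript of our scripts.

```bash
#!/usr/bin/env bash
# Sketch of the per-node package swap (illustrative package names).
set -euo pipefail

# Remove the old Docker packages and their configuration entirely;
# a --force-overwrite style install caused more problems than it solved.
sudo apt-get purge -y docker-engine docker-cli

# Install the new Moby-based engine and CLI from the configured repo.
sudo apt-get update
sudo apt-get install -y moby-engine moby-cli

# Sanity check: the daemon should come back and report the new version.
docker version
```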

Initial Approach

During a software upgrade, one of the first things to account for is any ephemeral or in-flight state. Since the workload on the nodes is the Azure-hosted control plane components themselves, the only real state that we need to track is in etcd, which is automatically backed up to cloud storage every few seconds anyway and also stored redundantly across the cluster. Therefore, no additional mechanisms are necessary to prevent state loss during an upgrade: all state is quickly restored through an etcd restore, which Kubernetes conveniently handles for us entirely.

There are some upgrade-specific features built into Docker and Kubernetes, but none worked for my case¹. So I experimented and came up with what I thought was a “brilliant” plan: very carefully do a rolling, quarantine-style update of each Kubernetes agent node, temporarily stopping or pausing a few critical components (ex: there are a lot of options for containers), draining some load onto other nodes, purging the older Docker, upgrading to the new Docker version, then restarting the services and un-quarantining the node, all without the dreaded rebooting. It was a bit slow and clunky, but it seemed like the safest option.
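In shell terms, the quarantine pass for a single node looked roughly like the sketch below; the node name is a placeholder and the real version had many more checks and pauses.

```bash
# Quarantine-style update of one agent node (NODE is a placeholder).
NODE=aks-agentpool-12345-0

# Stop scheduling new Pods onto the node, then evict what is already there.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-local-data --timeout=10m

# On the node itself: swap the container runtime without rebooting.
ssh "$NODE" 'sudo systemctl stop kubelet docker &&
             sudo apt-get purge -y docker-engine docker-cli &&
             sudo apt-get install -y moby-engine moby-cli &&
             sudo systemctl start docker kubelet'

# Put the node back into rotation.
kubectl uncordon "$NODE"
```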

I tested the upgrade out in our internal test cluster, and everything seemed ok after 24 hours, so I moved on to updating Docker on one of our canary (test and internal customer) clusters which had much more load than the test cluster. As expected, all of the Pods were recreated with the infamous Pod sandbox changed, it will be killed and re-created message because of the change to docker-engine. The Pods seemed to be restarting/recreating successfully for the most part. Everything looked ok…the updated version of docker-engine and docker-cli were in place, and the recovery of all of the Pods was underway.
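If you want to watch this happen on your own cluster, the sandbox-changed events and any unhappy Pods show up quickly with plain kubectl (a sketch, assuming kubectl access to the cluster):

```bash
# The sandbox-changed events appear in the event stream as Pods are
# killed and re-created after the docker-engine swap.
kubectl get events --all-namespaces --sort-by=.lastTimestamp |
  grep "Pod sandbox changed"

# Keep an eye on any Pods that do not come back cleanly.
kubectl get pods --all-namespaces | grep -vE "Running|Completed"
```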

It’s probably fine.

Long story short², much to my dismay, I ended up rebooting at least half of the nodes in the cluster before things fully recovered. The process of preparing a node for upgrade was slow and clunky, and since this would need to run across a very large number of nodes and an even larger number of Pods, streamlining the update process was important. And (obviously), ripping docker-engine out from under a bunch of running Kubernetes Pods (eek!) and having them reconnect/recreate still left a bunch of services (mostly related to networking or metrics) in a broken or hung state, which is not exactly shocking, but still a bit disappointing.

The Great Rebooting

While the thought of rebooting (we like to call it ‘bouncing’) a live production Kubernetes node still seemed counter-intuitive and scary, I decided to give it a try by updating another canary cluster with a reboot of each node after the upgrade. After each reboot, the newly upgraded node should (and does!) automatically rejoin the Kubernetes cluster and begin hosting Pods. I rewrote my Ansible script to simply purge the old Docker, install the new version, and reboot at the end; this was much simpler and faster than a cordon/pause/drain strategy. I added a 15-minute pause between reboots of each node in a cluster (to be extra safe and to avoid too much churn) and ran the script across the next canary Kubernetes cluster.
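The real version is an Ansible playbook, but the logic boils down to something like this bash sketch; the node names, package names, and readiness check are illustrative.

```bash
# Rolling "purge, install, reboot" pass over the agent nodes (illustrative names).
NODES="aks-agentpool-12345-0 aks-agentpool-12345-1 aks-agentpool-12345-2"

for node in $NODES; do
  ssh "$node" 'sudo apt-get purge -y docker-engine docker-cli &&
               sudo apt-get install -y moby-engine moby-cli &&
               sudo reboot' || true   # the reboot drops the SSH connection

  # Wait for the node to come back and rejoin the cluster before moving on.
  until kubectl get node "$node" | grep -q " Ready"; do
    sleep 10
  done

  # Extra safety margin between nodes to avoid too much churn at once.
  sleep 900   # 15 minutes
done
```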

Happily, the system recovered much more quickly this time! Each node generally took less than 30 seconds to restart and rejoin the cluster, and there were no lingering network issues. Many of the Pods were rescheduled on other nodes in the cluster, but several were rescheduled back onto the same node, depending on how quickly it restarted. (Our internal scheduling algorithms make sure that the load is balanced across cluster nodes, so any additional rebalancing that is needed may also be scheduled automatically later.)
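Watching a bounce from the outside is just a matter of keeping an eye on node and Pod status (again a sketch, assuming kubectl access):

```bash
# The node flips to NotReady during the reboot and back to Ready when it rejoins.
kubectl get nodes -w

# Afterwards, -o wide shows which Pods landed back on the rebooted node
# and which were rescheduled elsewhere.
kubectl get pods --all-namespaces -o wide
```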

About 95% of the Pods recovered automatically in ~1–30 seconds, usually <10 seconds. Only about 5% of the Pods did not automatically recover in less than 5 minutes and required manual or automated remediation. From this, we updated some of our auto-remediators (processes that constantly perform health checks and apply known fixes in the case of a problem) to fix up the majority of the remaining issues in a much faster way, particularly in terms of restarting network connectivity.
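As one concrete example of the kind of remediation involved, kicking the cluster's kube-proxy Pods so their controller recreates them with fresh network state looks roughly like this; the k8s-app label is the common default and may differ in your cluster.

```bash
# Delete the kube-proxy Pods; the DaemonSet re-creates them immediately,
# which clears up lingering network-connectivity issues on affected nodes.
kubectl -n kube-system delete pods -l k8s-app=kube-proxy
```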

Self-Healing Kubernetes: Live Software Updating Made Easy

In conclusion, trusting Kubernetes to make an “updated and rebooted” node rejoin the cluster and repopulate with workload proved a much faster solution than an update-in-place method.

This process completely changed my outlook on rebooting as part of a runtime software upgrade. Kubernetes truly is self-healing, surprisingly even at the node level, with a rebooted node quickly rejoining the cluster. You can actually “rolling-restart” all of the nodes in a cluster with minimal additional work (assuming you have a way to back up any state, such as to etcd). Live-bouncing nodes is definitely not something I ever thought I’d recommend, but when it comes to Kubernetes, you really can. Now, go update your software!

¹ Kubernetes has a nifty rolling update feature, but it applies to applications running in Kubernetes Deployments rather than to the underlying nodes’ software. Here, we are updating the Docker software running on each agent and master node in a Kubernetes cluster. Also, the main challenge of the Docker upgrade is the fact that Kubernetes and the Pods will not have access to docker-engine during the update, which is nearly impossible to do without causing some problem. When first approaching the Docker upgrade, I was hoping to use the (really awesome) “live-restore” option, which we had set (“live-restore”: true) in the /etc/docker/daemon.json file. This flag keeps containers running if the daemon becomes unavailable, such as during limited Docker daemon upgrades. Unfortunately, “live-restore” only works for certain upgrades such as patches and does not work across minor or major upgrades.
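For reference, this is roughly what that setting looks like on a node (a sketch; merge it into whatever already lives in your daemon.json rather than replacing the file):

```bash
# live-restore keeps containers running across daemon restarts,
# but only helps for patch-level engine upgrades.
cat /etc/docker/daemon.json
# {
#   "live-restore": true
# }
```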

² At the end of the day, I got on the bus to head out, but soon received an ICM message (InCident Management, which creates an incident ticket and pages us) forwarded to me from the on-call person. There was a continual network disruption on the canary cluster I had updated. We restarted all ingress and kube-proxy Pods across the cluster (yay for wifi-enabled buses!), and all seemed ok. I got off the bus and arrived at a meet-up (ironically, the meet-up was about how distributed systems are hard and was titled Everything is Broken) and saw that there was another ICM from the update. Apparently, our logging and metrics were getting dropped somewhere on the updated canary cluster. I worked with the on-call person to restart the logging and metrics Pods in the cluster. Halfway through the meet-up, another ICM. Several of the Pods had gotten stuck in CrashLoopBackOff, dropping our QoS below an acceptable threshold for too long, and the Pods needed to be manually deleted and recreated. That fixed most things, but there were still some lingering issues.
