Upgrading a large cluster on GKE
At Olark we’ve been running production workloads on kubernetes in GKE since early 2017. In the beginning our clusters were small and easily managed. When we upgraded kubernetes on the nodes, our most common cluster-wide management task, we could just run the process in the GKE console and keep an eye on things for a while. Upgrading involves tearing down and replacing nodes one at a time, and consumes about 4–5 minutes per node in the best case. When we were at 20 nodes it might take 90–120 minutes, which was in a tolerable range. It was disruptive, but all our k8s services at the time could deal with that. It was irreversible too, but we mitigated that risk by testing in staging, and by staying current enough that the previous version was still available for a replacement nodepool if needed. This approach seemed to work fine for over a year.
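For anyone who prefers the CLI to the console, the same rolling node upgrade can be kicked off with gcloud. This is only a sketch; the cluster name, zone, nodepool, and version below are placeholders:

    # start a rolling node upgrade for one nodepool (all names and versions are placeholders)
    gcloud container clusters upgrade my-cluster \
      --zone us-central1-a \
      --node-pool default-pool \
      --cluster-version 1.11.8-gke.6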
As our clusters grew and we created additional nodepools for specific purposes, a funny thing began to happen: upgrading started to become a hassle. Specifically, it began to take a long time. Not only did we have more nodes, but we also had a greater diversity of services running on them. Some of those implemented things like pod disruption budgets and termination grace periods that slow an upgrade down. Others could not be restarted without a downtime due to legacy connection management issues. As the upgrade times got longer the duration of these scheduled downtimes also grew, impacting our customers and our team. Not surprisingly, we began to fall behind the current GKE release version. Recently we received an email from Google Support letting us know that an upcoming required master update would be incompatible with our node version. We had to upgrade the nodes, or Google would do it for us.
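If you want to gauge how much of your own upgrade time comes from these settings, both are easy to inspect. A quick sketch, where the deployment name and namespace are placeholders:

    # list the pod disruption budgets that constrain evictions during a drain
    kubectl get poddisruptionbudgets --all-namespaces

    # check how long a given workload is allowed to take on shutdown
    # (deployment name and namespace are placeholders)
    kubectl -n production get deployment my-service \
      -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'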
Running the upgrades using our existing process in staging quickly drove home that we were well past the point of doing these in-place. Our largest nodepool in production at the time was 55 nodes, which would have taken at least 6 hours to fully upgrade. All told we had 105 nodes to upgrade. We needed a different approach, and decided to move forward by replacing nodepools instead of upgrading them. Before I get into how that process worked I should make it clear that this isn’t necessarily going to be a problem for everyone running a larger k8s cluster. If all your services are stateless and can be safely rescheduled, then kicking off an upgrade and letting it run for 8 or 10 hours isn’t the worst thing you can do. In our experience the process is highly reliable and there are straightforward recovery options if you have to bail out.
But I suspect many people will be in situations similar to ours, with a diversity of workloads that have different levels of tolerance to disruption. In our case we had four nodepools in the cluster, each specialized for a specific purpose. Some run stateless http and rpc services and are very resilient. Some run services that we don’t want to restart without notice. Another runs a large elasticsearch cluster ingesting 150–200 million log events per day. We decided to treat each one of these separately and upgrade them at different times, just to keep the scope of things reasonable and focused, and to allow us to tailor the approach to the workloads involved.
We began with the elasticsearch cluster, and decided to upgrade that nodepool in-place as before. There are relatively few nodes, each running one or two pods, and elasticsearch itself is pretty tolerant of node failures, even under load. Taking out one node at a time with 5 minutes in between should be something it can deal with, and in fact that process went off very smoothly. There were periods when the cluster status went to red, and there were some brief interruptions in log delivery, but as one of my colleagues likes to point out: “logging is a firehose.” If the bucket isn’t big enough it isn’t big enough, and we could live with that for a short time.
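If you want to watch elasticsearch health while a nodepool like this rolls, one simple approach is to port-forward the service and poll the health endpoint. A sketch only; the service name and namespace here are assumptions:

    # forward the elasticsearch HTTP port locally (service name and namespace are assumptions)
    kubectl -n logging port-forward svc/elasticsearch 9200:9200 &

    # poll cluster health while nodes are torn down; expect brief yellow/red periods
    watch -n 30 "curl -s localhost:9200/_cluster/health?pretty"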
Our largest nodepools run stateless http and rpc services and together comprise over 75 nodes. These nodepools autoscale, and some of them have taints and labels that are important for pod scheduling. We prepared for these by first creating new nodepools on the target version, with autoscaling disabled and a small fixed number of nodes. The new nodepools duplicated the taints and labels present on the existing ones. We then disabled autoscaling on the source nodepools. With that done we began to migrate pods by scaling down the source nodepools and scaling up the replacements ten nodes at a time. The process took a couple of hours and went pretty smoothly; I’ll talk about one or two hitches that we ran into in the conclusion below. Once we had fully migrated we enabled autoscaling on the new nodepools and removed the old ones. As a side benefit, we had earlier noticed that we were over-provisioned on disk for these nodes, and were able to cut the boot disk size in half, saving 2500 GB.
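The whole sequence boils down to a handful of gcloud commands. Here’s a rough sketch in which every name, version, count, label, and taint is a placeholder; the resize step is repeated, ten nodes at a time, until the old pool is empty:

    # create the replacement pool on the target version, duplicating taints and labels
    # (every name, version, count, label and taint below is a placeholder)
    gcloud container node-pools create web-pool-v2 \
      --cluster my-cluster --zone us-central1-a \
      --node-version 1.11.8-gke.6 --num-nodes 10 \
      --node-labels=role=web --node-taints=dedicated=web:NoSchedule

    # disable autoscaling on the old pool so it doesn't fight the migration
    gcloud container clusters update my-cluster --zone us-central1-a \
      --node-pool web-pool-v1 --no-enable-autoscaling

    # shift capacity in steps: grow the new pool, then shrink the old one
    gcloud container clusters resize my-cluster --zone us-central1-a \
      --node-pool web-pool-v2 --num-nodes 20
    gcloud container clusters resize my-cluster --zone us-central1-a \
      --node-pool web-pool-v1 --num-nodes 30
    # ...repeat until the old pool is at zero...

    # re-enable autoscaling on the new pool and remove the old one
    gcloud container clusters update my-cluster --zone us-central1-a \
      --node-pool web-pool-v2 --enable-autoscaling --min-nodes 10 --max-nodes 60
    gcloud container node-pools delete web-pool-v1 \
      --cluster my-cluster --zone us-central1-a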
Lastly we scheduled a downtime for the migration of our “no restart” services. We prepared for this as above, creating the new nodepool and duplicating taints and labels. In this case our goal was to get the pods migrated while keeping the downtime as short as possible, so we created the new nodepool at the full target size. We then scaled the existing nodepool down by half, and waited for the nodes to be deleted and the pods to be rescheduled and become ready. After validating the workload on the new nodes we removed the rest of the source nodepool and forced the remaining pods over to the new one. If the workloads had failed on the new nodes, the fallback would have been to scale the previous nodepool back up and remove the new one. Taken together all of these upgrades took around three hours, and when they were complete we had all components of our 105 node cluster running on the latest version of kubernetes.
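This one is simpler to lay out. A sketch of the shape of it, again with every name, size, and version a placeholder:

    # create the replacement pool at the full target size
    # (names, sizes and versions are placeholders)
    gcloud container node-pools create norestart-pool-v2 \
      --cluster my-cluster --zone us-central1-a \
      --node-version 1.11.8-gke.6 --num-nodes 16 \
      --node-labels=role=norestart --node-taints=dedicated=norestart:NoSchedule

    # cut the old pool in half and watch pods reschedule onto the new nodes
    gcloud container clusters resize my-cluster --zone us-central1-a \
      --node-pool norestart-pool-v1 --num-nodes 8
    kubectl get pods -n production -o wide -w

    # once the workload checks out, remove the rest of the old pool;
    # if it hadn't, the fallback is to resize the old pool back up and delete the new one
    gcloud container node-pools delete norestart-pool-v1 \
      --cluster my-cluster --zone us-central1-a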
I could have titled this post “Why we run kubernetes” or “I love nodepools,” and almost did. From my perspective as a software engineer and SRE who has been around since things were bolted into racks, the flexibility derived from going all-in on the kubernetes ecosystem is still something almost magical. Not that we didn’t run into a little trouble. The one mysterious thing that happened was that scaling down the existing nodepool hung on two occasions. The good news is that it was very easy to identify which node was causing the problem by running kubectl get nodes and examining the node status. During an upgrade nodes move from Ready/Schedulable to Ready/NotSchedulable and finally NotReady/NotSchedulable, after which they are removed. In each case one node was stuck in Ready/NotSchedulable. It was easy to identify and remove the hung pod that was gumming up the works, which in both cases was the nginx ingress controller default backend.
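Diagnosing this sort of hang is mostly a matter of finding the stuck node and then the pod that won’t leave it. A sketch, with the node name, namespace, and pod name below being placeholders:

    # find the node stuck in a not-schedulable state
    kubectl get nodes

    # list what is still running on it (node name is a placeholder)
    kubectl get pods --all-namespaces -o wide \
      --field-selector spec.nodeName=gke-my-cluster-web-pool-v1-1234abcd-wxyz

    # delete the pod holding things up so the scale-down can finish
    # (namespace and pod name are placeholders)
    kubectl -n ingress-nginx delete pod nginx-default-backend-abc123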
I want to give a shout out to my colleagues Kyle Owens, Nick Hill, Nick MacInnis and Aaron Wilson for help in brainstorming and carrying out the upgrades, and to our former colleague Brandon Dimcheff for being a great sounding board.