Engine hot swapping: upgrading Kubernetes 1.8 straight to 1.15 with zero downtime
There is no quick and easy way to achieve an in-place upgrade from 1.8 to 1.15. This article describes how we successfully jumped over several minor Kubernetes releases using what we call the “Twin Technique”.
Intro
As previously mentioned in another article, the heycar infrastructure was built in six weeks. Kubernetes was adopted from the very start and it provided a stable and reliable platform to run our workloads. It required little to no maintenance and never gave us any problems. It ran like clockwork, which meant that the focus could shift to building the product.
Because the Kubernetes community moves at a fast pace (it pushes out a new release approximately every three months), you need to keep your clusters as up-to-date as possible. Not just to get the nice new features, but also the important security fixes. The version skew support policy is very strict, so it’s not a good idea to run an unsupported version.
We didn’t have a dedicated team to perform this kind of maintenance, and eventually the day came to deal with the tech debt.
Landscape
We run a pretty common infrastructure stack: a microservices architecture running in Kubernetes, provisioned by KOPS on top of AWS, with everything managed with terraform.
Problem to solve and requirements
The task at hand is simple and so are the requirements. Perform an upgrade from 1.8 to the latest stable release without disrupting the business or the other teams’ ability to deliver value to our customers.
We also don’t want to step through every intermediate upgrade until we reach the desired version, because of the breaking changes along the way (etcd 3 and the kops API version, for example).
Last, but definitely not least, we want the ability to roll back, or forward, at any given step of the process.
Plan, ideas & strategies
The plan is simple. Create a new cluster and, one by one, migrate each service across with no downtime.
And here comes the first problem. Take the example below: service A receives requests from a cronjob and from service B, and it also needs access to service C.
If we migrate service A to the new cluster, how can it access service C, which is still running on the old cluster? And how are service B and the cronjob able to make requests to service A on the new cluster?
Cross-cluster communication and service discovery…. hmmm 🤔
Options
- Kubernetes federation
KubeFed is currently in alpha state and moving rapidly towards its initial beta release. We didn’t want to take the risk of using alpha software for a production migration.
- VPN, SSH tunnels, reverse proxy
The idea is to establish a “bridge” between two clusters and route requests to a destination (i.e. to a Kubernetes service). Existing solutions don’t look simple or appealing in terms of setup.
- Service meshes
Sounds cool and trendy, albeit not easy to integrate into the old cluster.
- Big bang approach
Creating a new cluster and just shifting traffic from A to B was not possible because of some event-driven workloads with circular dependencies on other workloads. It would also introduce service disruption, thus violating one of the requirements.
Our approach
Sometimes the simplest approach is also the most elegant and efficient, and you don’t need fancy tools or complicated processes, especially when it’s provided out of the box.
We came up with the idea of using Kubernetes Services with type ExternalName. A Kubernetes Service is an abstraction that defines a set of Pods (though not only Pods) and a way to access them. Usually the targeting of Pods happens through a selector — a set of key-value pairs, or labels, that allows a Service to match the required Pods.
Taking the same example as before, once the workload is removed from the old cluster, its service type changes from ClusterIP to ExternalName. At the same time we deploy the workload to the new cluster with a service type of LoadBalancer, which creates an Elastic Load Balancer on AWS, so the service is available for requests from outside the cluster. We use the ELB’s address as the externalName value on the old cluster and voilà! All requests to the service on the old cluster are routed to the service on the new cluster. The same principle applies to requests coming from the new cluster and sent to services still on the old cluster.
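As a rough sketch (the service name and ELB hostname here are made up), this is what the Service on the old cluster looks like once its workload has moved across:
# Old cluster: the workload is gone, but the Service name stays resolvable.
# In-cluster DNS now returns a CNAME to the ELB created by the new cluster.
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  type: ExternalName
  # hypothetical address of the ELB exposing service-a on the new cluster
  externalName: internal-service-a-1234567890.eu-central-1.elb.amazonaws.com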
Sadly, this initial approach has a small issue. ELBs take time to create, and during the change we introduce some disruption, even if a very small one. We want a zero-downtime migration, so this is something we have to avoid.
The Twin Technique
In Kubernetes, a Service name must be unique within a namespace; however, it’s totally fine to have multiple services with different names sharing the same selector and pointing to the same workload.
Eureka! We create an extra service for each existing one, with a different name, the same selector and of type LoadBalancer. We do this on both clusters and ahead of time, thus eliminating the disruption during the migration.
We called those services twins 🙂
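In practice (names below are illustrative), a twin is just a second Service manifest living next to the original one:
# Original in-cluster Service, left untouched
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  type: ClusterIP
  selector:
    app: service-a
  ports:
    - port: 80
      targetPort: 8080
---
# Its twin: same selector, but exposed through an AWS ELB, created ahead of time
apiVersion: v1
kind: Service
metadata:
  name: service-a-twin
spec:
  type: LoadBalancer
  selector:
    app: service-a
  ports:
    - port: 80
      targetPort: 8080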
Public subdomains
For the services exposed to the world via subdomains, we were using an annotation, dns.alpha.kubernetes.io/external, which allows Kubernetes to manage DNS records in Route53.
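As a small, hypothetical example of how that looks on an exposed Service (the hostname and names are made up):
# The annotated hostname gets a Route53 record pointing at this Service's load balancer
apiVersion: v1
kind: Service
metadata:
  name: service-a
  annotations:
    dns.alpha.kubernetes.io/external: "service-a.example.com"
spec:
  type: LoadBalancer
  selector:
    app: service-a
  ports:
    - port: 80
      targetPort: 8080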
To have complete control when shifting user traffic gradually from one cluster to another, and to be able to roll back in case of problems, it’s better to have weighted DNS records managed via terraform. Easily done with terraform import.
Automating the process
With the plan in mind, we started crafting a tool to automate the process and conduct rehearsals on staging. Because we wanted the option to migrate service by service, or all of them at once, rolling either forward or back to the initial state, the name Service DJ was deemed appropriate for the task.
It took care of creating the ELBs, adjusting the service types and updating the externalName values as we moved services between clusters, plus some other calls to the AWS API.
This was a very important piece, as it allowed us to quickly roll any step forward or back!
It can’t be that easy, can it?
First problem: AWS quotas for the additional resources. Simple to solve via a support request.
Then a limit on security group ingress rules. As we create the twins’ ELBs, new ingress rules are added to the cluster’s EC2 network interfaces. The “security groups per network interface” limit multiplied by the “rules per security group” limit can’t exceed 1000.
The solution is simple: all ELBs can use the same security group. This is configured using the cloudConfig key in the kops cluster yaml file.
cloudConfig:
  elbSecurityGroup: "sg-xxxxxxxxx"
  disableSecurityGroupIngress: true
The security group is configured to allow traffic only between the clusters. Later on we noticed the ELBs were still exposed to external traffic: something was changing the rules. We found an issue in Kubernetes that explained it. Waiting for a fix was not an option, nor did we want to fix it ourselves and run a custom build of Kubernetes in production.
Using the AWS CLI, we could fix the rules as part of the DJ’s playlist.
During the rehearsals some other issues were found. The option that makes nginx pass the incoming X-Forwarded-* headers to upstreams had become disabled by default. Quickly fixed by enabling the use-forwarded-headers option.
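The change is a single key in the ingress controller’s ConfigMap; the ConfigMap name and namespace below are the common defaults and may differ per installation:
# Re-enable passing the incoming X-Forwarded-* headers to upstreams
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
data:
  use-forwarded-headers: "true"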
After some more troubleshooting, we were finally in a position where everything was working fine. We performed several dress rehearsals on our staging cluster and finally set the date.
…8 …7 …6 …5 …4 …3 …2 …1.15 !!!
For the actual migration, we decided to do staging only at first and keep it that way for a couple of days to get a feeling for how things would roll. It went very smoothly and our DJ took care of the party in under two hours. We adjusted the CD pipeline and started to monitor it. In the meantime, preparations started for the migration of our production cluster. Everything was working as expected.
The production migration was a bit more “fun”, with Murphy showing up uninvited, as usual.
Several events made us start later rather than sooner, but we still wanted to go ahead that day. One of our team members had to go away the next day and, after everyone had put so much effort and time into this, we wanted the whole team to be present and celebrate the end of it.
The migration kicked off with us shifting the traffic of our subdomains to the new cluster. All requests were rerouted via the ExternalName Services to the twins on the old cluster.
Then Service DJ started doing its job and one by one we moved our workloads between clusters.
Everything was going well up until we spotted an issue with our A/B testing service. The logs were showing errors, but we couldn’t understand why, as there weren’t many details. Rolling back the service made the problem go away, so it was something on the new cluster.
After some investigation we found two issues happening at the same time: a rewrite-target configuration that was no longer compatible with the newer version of the nginx ingress controller and, on top of that, an A/B test service that wasn’t configured with a default rule for each test. A scenario that slipped through our initial tests.
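The exact configuration isn’t the point here, but as a hedged illustration of the kind of breakage: newer versions of the nginx ingress controller expect rewrite-target to reference an explicit capture group from the path, so an old-style rewrite ends up looking roughly like this (host, path and service names are made up):
# Hypothetical example of the newer rewrite-target syntax with capture groups
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: ab-testing
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
    - host: ab.example.com
      http:
        paths:
          - path: /ab(/|$)(.*)
            backend:
              serviceName: ab-testing
              servicePort: 80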
Figuring this out plus testing the fix took some time, and it was getting late. We didn’t want to risk running into another issue that would mean more time troubleshooting and fixing, so we decided to just leave the migration halfway. Waaait… WHAT?!?!
Yes, you read that right. While we were investigating the issue the migration was pretty much halted, and we had roughly half the workloads running on the new cluster and the remaining ones on the old.
The whole basis for this approach was that we could operate workloads across different clusters, and it was working fine, so why roll back?
We just spent some time double-checking that everything was in order and working within parameters, and… we went home.
Murphy probably didn’t see that coming, so he didn’t show up the following day. The DJ resumed the party where we had left it the previous day and all remaining services were migrated smoothly.
CD pipelines were adjusted, we double-checked everything again and the next day we dropped the old clusters.
Final thoughts
Key take-aways from this experience:
- Using ExternalName opens up some possibilities: you can hide an endpoint behind a Kubernetes service that uses in-cluster DNS, which is cheaper than Route53, and you can simplify names. Instead of an RDS name such as my-database.cxjojbttfabct.space-central-1.rds.amazonaws.com, you can hide it behind a service simply named my-database (see the sketch after this list).
- Using features that come by default with Kubernetes made it possible without extra tools and configuration; sometimes you don’t need fancy tools for that single use case.
- This is a good method to avoid tricky breaking changes when you’re performing a Kubernetes upgrade.
- It is also valid for migrating between cloud providers, or between managed and self-managed offerings (EKS to KOPS, for example, or vice-versa).
- We changed the engine without a pit stop.
- We’re not going to wait that long again to keep our infrastructure up-to-date.
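To make the first takeaway concrete, the RDS example above would look roughly like this:
# Hide a long RDS endpoint behind a short in-cluster name
apiVersion: v1
kind: Service
metadata:
  name: my-database
spec:
  type: ExternalName
  externalName: my-database.cxjojbttfabct.space-central-1.rds.amazonaws.com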
All in all, the ride was a lot of fun, filled with learnings, and very satisfying given that we did it mostly with features that come by default with Kubernetes and our cloud provider.
Nowadays we upgrade our clusters as new versions come out simply by making a pull request and letting our tooling do the rest.