Migrating applications between Kubernetes clusters
TL;DR: Gradually migrating applications between Kubernetes clusters without affecting intra-cluster traffic can be challenging. This article proposes an approach that relies on standard but powerful tools.
Raise your hand if you've never had to migrate applications between two Kubernetes clusters!
Easy, right? Well, not really. Not always.
It may happen for a few reasons: for example, a major platform upgrade, or a substantial resource reorganisation.
While in principle this would seem a straightforward task, the process gets more complex as dependencies between applications increase and requirements for lower downtime during the migration arise.
A first approach may consist in recreating the same environment in the new cluster and, once the new deployment is ready, switching any external DNS records. Sadly, this is not always possible.
A simple use-case will give us a better understanding of the issue and how it can be solved.
Let’s imagine a Kubernetes cluster running two applications, one depending on the other. They are owned by different teams, who may migrate them at different paces.
How would you migrate the applications gradually, avoiding downtimes?
Before moving forward, let’s point out a couple of fundamental, firm assumptions. They are not strictly related to the migration, but they will definitely lower the overall cluster maintenance effort and streamline the migration process:
- Applications should always reference both internal and external services using DNS names, rather than IP addresses
- It’s important to have a clear view of the relationships between our applications. It will help us to proactively prepare the migration and react in time to possible failures.
Now that we stated the basic rules of the game, let’s get back to our example.
How can we move one application to the new cluster so that the other doesn’t even notice its dependency has been migrated, avoiding malfunctions?
When I started scratching the surface of the problem, some terms, like multi-cluster, cluster federation and service mesh, popped up in my mind.
There are a bunch of tools out there to build multi-cluster environments, create cluster federations and efficiently manage “microservices networks”. Istio is just one out of many.
Great, it seems we made it! …Well, not really.
Most of the multi-cluster and federation tools come with an important caveat: the two clusters need to run a very similar, if not the same, version of Kubernetes, and this is definitely not always the case. Moreover, many users may find these tools overkill, and something risky to inject into their systems for the sole sake of a migration.
This article follows a more conservative approach, leveraging a set of basic but powerful tools, such as DNS, Kubernetes services and, optionally, an ingress controller.
My Environment
I’ll use Google Cloud Platform (GCP) for the demonstration. This will include a couple of Kubernetes clusters (together with some load balancers, provisioned as we create our services), and Cloud DNS.
I will also make use of Traefik as an ingress controller. Although it is not mandatory, it’s strongly recommended, as it avoids creating individual LoadBalancer services (and related internal GCP load balancers) for each application to be migrated. Things will get clearer as we go through the steps. Anyway, keep in mind that you can substitute Traefik with your favorite ingress controller.
Demo Time!
I’ve prepared two empty clusters: `cluster-old` and `cluster-new`.
We’ll start by deploying two applications in the old cluster: `app1` and `app2`. To make the experiment somewhat more intriguing, the two apps will be deployed in different namespaces: `app1-ns` and `app2-ns` respectively.
Each application is composed of a pod and a namesake ClusterIP service that references it.
App1 is a dumb HTTP server that, at its root on port 80, returns the message `Hello, I'm app1`.
Here is the Kubernetes manifest I’ve used to create the pod:
```yaml
# app1-pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: app1
  labels:
    app: app1
spec:
  containers:
  - name: app1
    image: hashicorp/http-echo:0.2.3
    args:
    - "-text=Hello, I'm app1"
    - "-listen=:80"
```
And here is the one for the service:
```yaml
# app1-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1
spec:
  selector:
    app: app1
  ports:
  - port: 80
```
`App2` is a very simple client: it queries `app1` every second using curl. Since it writes the replies it receives to standard output, we’ll be able to see them by querying the container logs.
```yaml
# app2-pod.yaml
kind: Pod
apiVersion: v1
metadata:
  name: app2
  labels:
    app: app2
spec:
  containers:
  - name: app2
    image: curlimages/curl:7.74.0
    command: ["/bin/sh", "-c"]
    args:
    - >
      while true; do
        curl -s -X GET http://app1.app1-ns
        sleep 1
      done
```
Let’s create the namespaces and deploy the applications.
```shell
kubectl create namespace app1-ns
kubectl apply -f app1-pod.yaml --namespace app1-ns
kubectl apply -f app1-svc.yaml --namespace app1-ns

kubectl create namespace app2-ns
kubectl apply -f app2-pod.yaml --namespace app2-ns
```
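At this point, a quick sanity check (not strictly part of the migration) confirms that `app2` is reaching `app1` through the ClusterIP service:

```shell
kubectl logs -f app2 -n app2-ns
```

You should see `Hello, I'm app1` printed roughly once per second.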
The goal is to migrate these applications from the old cluster to the new cluster, where only the two empty namespaces have been created.
Needless to say, the most critical part is migrating `app1`, so that `app2` doesn’t even notice that the former has been moved over.
At a high level, the idea is to create a new service in the old cluster that, instead of pointing to the local app1 application, will link, through an external DNS entry, to a clone of it living in the new cluster.
Along the way, we’ll always be able to test if the new components created work, before modifying any existing routing.
Let’s start implementing the machinery, making sure the mechanism works within the same cluster first: `app2` will soon communicate with `app1` through the external DNS.
Installing Traefik is the first step. I won’t go into details, since this is not the goal of the article. I’ve simply followed the official Helm installation guide, and added an annotation to create a GCP internal load balancer instead of the default HTTP global load balancer, which is not really needed for this experiment.
```shell
kubectl create namespace traefik
helm repo add traefik https://helm.traefik.io/traefik
helm repo update

helm install \
  --set service.annotations."cloud\.google\.com/load-balancer-type"=Internal \
  --namespace traefik \
  traefik \
  traefik/traefik
```
In a few seconds you should see the private IP address allocated for the Traefik LoadBalancer service.
```shell
kubectl get services -n traefik

NAME      TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)                      AGE
traefik   LoadBalancer   10.72.4.210   192.168.100.16   80:31679/TCP,443:31228/TCP   45s
```
Make a note of it. We’ll need it soon.
Moving to Cloud DNS, create a private DNS zone, for example `mycompany.internal`.
Then create an A record, `old.mycompany.internal`, pointing to the LoadBalancer IP just allocated.
Once finished, your DNS panel should look as follows.
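If you prefer the CLI over the Cloud Console, the zone and the record can be created with gcloud along these lines (the zone name, network, and IP address are assumptions matching this demo; recent gcloud versions provide `record-sets create`, while older ones use `record-sets transaction`):

```shell
gcloud dns managed-zones create mycompany-internal \
  --dns-name="mycompany.internal." \
  --visibility=private \
  --networks="default" \
  --description="Private zone for the migration"

gcloud dns record-sets create old.mycompany.internal. \
  --zone=mycompany-internal \
  --type=A \
  --ttl=300 \
  --rrdatas="192.168.100.16"
```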
In the old cluster:
Make a copy of your `app1` service. Call it, for example, `app1-internal`.
```yaml
# app1-internal-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1-internal
spec:
  selector:
    app: app1
  ports:
  - port: 80
```
Create a Traefik IngressRoute.
```yaml
# app1-ingress-route.yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: app1
spec:
  entryPoints:
  - web
  routes:
  - kind: Rule
    match: Host(`app1.app1-ns`) || Host(`app1.app1-ns.svc.cluster.local`)
    services:
    - kind: Service
      name: app1-internal
      namespace: app1-ns
      port: 80
```
Notice that we match on either of the destination hostnames `app1.app1-ns` or `app1.app1-ns.svc.cluster.local`. This is because, regardless of the strategy we put in place, requests will still reach `app1` carrying the original destination Host header.
Let’s deploy the two components:
```shell
kubectl apply -f app1-internal-svc.yaml --namespace app1-ns
kubectl apply -f app1-ingress-route.yaml --namespace app1-ns
```
It’s time to verify that our application is reachable through the new path. To do so, I’ve exec’d into the `app2` client and manually curled `app1` through the new address.
```shell
kubectl exec -it -n app2-ns app2 -- /bin/sh
curl -H "Host: app1.app1-ns" http://old.mycompany.internal
Hello, I'm app1
```
Notice that I specify a Host header. Without it, you would receive a `404: Page not found` reply from Traefik.
Create an ExternalName service pointing to `old.mycompany.internal`.
```yaml
# app1-ext-old-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1
spec:
  externalName: old.mycompany.internal
  type: ExternalName
```
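Under the hood, an ExternalName service is just a CNAME record served by the cluster DNS; no proxying is involved. You can verify the resolution from inside any pod (a sketch; the exact output depends on your DNS tooling):

```shell
# From a shell inside the app2 pod:
nslookup app1.app1-ns
# app1.app1-ns.svc.cluster.local should now be a CNAME for old.mycompany.internal,
# which in turn resolves to the Traefik load balancer IP
```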
Finally, replace the original `app1` service in `app1-ns` with the one just created.
```shell
kubectl replace -f app1-ext-old-svc.yaml -n app1-ns
```
During the experiment, I kept tailing app2’s logs, and I never noticed app1 stop answering.
`App1` is still living in the old cluster, but from now on client requests go through the ExternalName service, out of the cluster to Cloud DNS, back in through Traefik and the new `app1-internal` service, and finally to the pod.
The Migration
Now that the mechanism we had in mind works, we are ready for the migration.
Let’s set up the new cluster. As we did in the old one:
- Deploy `app1` and `app2` in their namespaces (same commands as above)
- Deploy Traefik (same commands as above) and get the new internal load balancer IP
- Create another A record in `mycompany.internal`. Call it `new.mycompany.internal` and point it to the new cluster’s ingress IP
- Deploy the same Traefik IngressRoute deployed in cluster-old
- Deploy the `app1-internal` service in the `app1-ns` namespace (same commands as above)
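Put together, the setup of cluster-new can be sketched as follows (assuming the kubectl context already points at the new cluster and that we reuse the manifests from above):

```shell
# Applications
kubectl create namespace app1-ns
kubectl apply -f app1-pod.yaml --namespace app1-ns
kubectl apply -f app1-svc.yaml --namespace app1-ns
kubectl create namespace app2-ns
kubectl apply -f app2-pod.yaml --namespace app2-ns

# Traefik (same helm install as in the old cluster)
kubectl create namespace traefik
helm install \
  --set service.annotations."cloud\.google\.com/load-balancer-type"=Internal \
  --namespace traefik \
  traefik \
  traefik/traefik

# Routing machinery
kubectl apply -f app1-internal-svc.yaml --namespace app1-ns
kubectl apply -f app1-ingress-route.yaml --namespace app1-ns
```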
Create a new ExternalName service pointing to the DNS name just created.
```yaml
# app1-ext-new-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: app1
spec:
  externalName: new.mycompany.internal
  type: ExternalName
```
Finally, let’s replace the old ExternalName service in the old cluster with the one just created:
```shell
kubectl replace -f app1-ext-new-svc.yaml -n app1-ns
```
With basically no downtime, your `app2` client in the old cluster will start communicating with the `app1` application living in the new cluster.
Notice that the process was completely transparent to `app2`, as we haven’t changed any reference to `app1` in it.
We can now repeat the same process for `app2`, thus completing the migration.
Once the migration is complete:
- The `app1` service can be converted back to a ClusterIP service
- The old cluster can be removed
- The Cloud DNS zone, Traefik, and the `app1-internal` service in the new cluster can be deleted
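As a sketch, the cleanup in the new cluster could look like this (file names match the manifests used earlier, and the helm release name is the one from the install step; `--force` recreates the service, since a plain replace may not change its type from ExternalName back to ClusterIP):

```shell
# Restore app1 as a regular ClusterIP service
kubectl replace -f app1-svc.yaml -n app1-ns --force

# Remove the migration machinery
kubectl delete -f app1-ingress-route.yaml -n app1-ns
kubectl delete -f app1-internal-svc.yaml -n app1-ns
helm uninstall traefik -n traefik
```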
Although this is a minimal setup, it should be fairly easy to extend the same process to larger deployments and possibly automate it to be applied at scale. But this is another topic!
Enjoy!
Thank you Ludovico Magnocavallo for sharing your ideas and helping me with this article!