Blue-Green Upgrades of Istio Control Plane

The Cloud Platform team at Snowflake runs more than 100 Kubernetes clusters on AWS, Azure, and GCP. On each cluster, we run most workloads in an Istio service mesh to provide consistent traffic management and security across clouds. Upgrading Istio is challenging because upgrades are frequent, have a wide blast radius, and must be repeated across a large fleet of clusters. This article presents the blue-green upgrade approach we devised and the lessons we learned.

Background

Istio is an open-source service mesh that offers a uniform control plane to manage microservices in hybrid-cloud and multi-cloud environments. With sidecar-proxy injection, Istio requires no changes to application code and yet provides mTLS, rate limiting, service discovery, telemetry, RBAC, traffic shifting, and more.

Istio must be upgraded often, both to address new vulnerabilities and because each release reaches end-of-life quickly. Because Istio is critical infrastructure for traffic, policy, and observability, a misconfigured or unhealthy Istio component can cause a cluster-wide outage, so upgrades are risky. Performing such upgrades on all our clusters across different clouds requires scalable tooling and automation. We want to validate the new version before shifting workloads over, and if an upgrade fails we must be able to roll back quickly rather than getting stuck between versions.

With those requirements in mind, we arrived at the blue-green upgrade approach for Istio described below.

The blue-green upgrade

At a high level, the blue-green upgrade for Istio happens in four stages:

  • deploy a new control plane
  • shift canary workloads to the new control plane to validate its functionality and interoperability
  • shift the rest of the workloads to the new control plane
  • retire the old control plane

This process is illustrated in Figures 1 through 3.

Figure 1: The new control plane is deployed while all workloads remain connected to the old control plane
Figure 2: Canary workloads run in the new mesh, interoperating with the old mesh
Figure 3: The rest of the workloads shift to the new control plane

We require all configuration changes to go through code review and to be applied through GitOps. We use the Istio Operator to manage the installation by defining the IstioOperator custom resource. The canary workloads consist of a client and a server. By separating the client and the server across the blue and green meshes, we validate the interoperability of the two control planes, a prerequisite for fast rollback to the old Istio if anything goes wrong during the migration of production workloads. By moving both the canary server and client to the new mesh, we test the functionality of the new Istio release.
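As a sketch of what this looks like in Git, a revision-scoped IstioOperator resource for the green control plane might resemble the following; the revision name, namespace, and version tag are illustrative, not our exact values:

    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: istio-green
      namespace: istio-green
    spec:
      profile: default
      # A revision-scoped install: istiod, its webhooks, and the proxies it
      # injects are all tagged "green", so it can run alongside the blue plane.
      revision: green
      namespace: istio-green
      tag: 1.17.2   # illustrative target version

Installing the green control plane this way leaves the blue control plane, and every workload attached to it, untouched.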

Because we require strict mTLS for all mesh-internal services, and istiod signs each proxy's certificate, interoperability of the two control planes requires that they share a common root of trust. This can be achieved by using the same issuing CA in both, or by signing each issuing certificate with the same root CA.
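A minimal sketch of that setup, assuming Istio's standard plugin-CA mechanism: a mesh-wide PeerAuthentication enforcing STRICT mTLS, plus a cacerts secret in each control plane namespace whose intermediate certificate chains to the shared root (PEM contents elided):

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system   # the mesh root namespace (meshConfig.rootNamespace)
    spec:
      mtls:
        mode: STRICT
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: cacerts             # the name istiod expects for a plugged-in CA
      namespace: istio-green    # an equivalent secret exists for istio-blue
    type: Opaque
    data:
      ca-cert.pem: <base64 intermediate CA certificate>
      ca-key.pem: <base64 intermediate CA private key>
      root-cert.pem: <base64 shared root CA certificate>
      cert-chain.pem: <base64 chain from intermediate to root>

With both intermediates chained to the same root, a proxy signed by the blue istiod can verify a peer signed by the green istiod, and vice versa.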

Which control plane a Pod connects to is selected by a namespace label. Note that after the namespace label is updated, Pods in that namespace must be restarted to have the right version of the proxy injected. Because Istio sidecar injection is implemented as a Kubernetes mutating admission webhook, and a Pod is the smallest atomic unit of deployment on Kubernetes, we cannot simply swap out the proxy container; we must restart the Pod. The silver lining is that this requirement promotes operational excellence among platform users, who must configure their PodDisruptionBudgets, liveness and readiness probes, graceful termination, and so on, because Pods really are ephemeral.
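Assuming Istio's standard revision labels, pointing a workload namespace at the green control plane is a one-line label change; the namespace name below is hypothetical:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: payments   # hypothetical workload namespace
      labels:
        # Selects the green control plane's sidecar injector. The legacy
        # istio-injection label must be removed, as it takes precedence.
        istio.io/rev: green

After the relabel, restarting the workloads (for example, kubectl rollout restart deployment -n payments) lets the green control plane's mutating webhook inject the new proxy.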

The Istio upgrade also affects how external traffic reaches mesh-internal services. External traffic is terminated at the Istio ingress gateway, which is provisioned by the Istio Operator. After migration to istio-green, we need to drain client traffic from istio-blue to istio-green before cleaning up istio-blue. Because the Istio ingress gateways in the two Istio namespaces are fronted by separate load balancers, and hence different IP addresses, we update DNS records to direct traffic to the new ingress gateway. We do so using external-dns, by updating the annotation on the Kubernetes Service object of the Istio ingress gateway. External-dns hides cloud-specific IAM, permissions, and Pod Identity integration behind a consistent configuration abstraction, which keeps our Istio-related tooling cloud-agnostic.
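The DNS cutover then reduces to moving a hostname annotation between the two gateway Services. A sketch of the green gateway Service, with a hypothetical hostname:

    apiVersion: v1
    kind: Service
    metadata:
      name: istio-ingressgateway
      namespace: istio-green
      annotations:
        # external-dns reconciles this record to the green load balancer
        external-dns.alpha.kubernetes.io/hostname: ingress.example.internal
    spec:
      type: LoadBalancer
      selector:
        istio: ingressgateway
      ports:
        - name: https
          port: 443
          targetPort: 8443

Once clients resolve the record to the green load balancer and traffic on the blue gateway drains, istio-blue can be retired.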

Yes, traffic draining by DNS is inefficient because we cannot control client caching behavior. In the future, we want to explore more intelligent load balancing schemes that select the right backends (Istio ingress gateway Pods) based on configuration such as labels. We also want to make the end-to-end upgrade process fully automated, for example by using an operator to drive the blue-green upgrade: an IstioOperator operator, if you will.

Come join us!

Snowflake is scaling rapidly. There is so much we want to do to make our cloud-native infrastructure more capable, resilient, and efficient. If you are as excited as we are about multi-cloud, distributed systems, container platforms, and open-source software, come join us!

Acknowledgment

Istio upgrades were a team accomplishment that would not have been possible without contributions from Brian Nutt, Mehernosh Garda, Lamyanba Yambem, Jonas-Taha El Sesiy, and Javeria Khan. We thank Shawn Zhou and Raman Hariharan for supporting all Istio initiatives and this blog.
