Kubernetes Engine (GKE) multi-cluster life cycle management series

Part IV: GKE multi-cluster lifecycle management

Ameer Abbas
Google Cloud - Community
Apr 22, 2020


Distributed Service Foo on GKE

In this blog, I discuss common GKE cluster lifecycle strategies, as well as planning and design considerations for picking the right one for you.

I’ll assume you already know the reasons for multi-cluster architectures (part I), what Distributed Services are (part II) and what constitutes a GKE upgrade (part III).

Planning and design considerations

GKE multi-cluster architecture plays a part in selecting a suitable cluster lifecycle management strategy. Before discussing these strategies, it is important to cover certain design decisions that may affect, or be affected by, the chosen strategy.

Type of clusters — If you use GKE auto upgrade as a cluster lifecycle management strategy, the type of cluster may matter. For example, regional clusters offer multiple masters, which are auto upgraded one at a time, whereas zonal clusters offer a single master. If you’re not using GKE auto upgrades and believe that all Kubernetes clusters should be treated as disposable infrastructure components, then the type of cluster may not matter when deciding on a cluster lifecycle management strategy. The strategies discussed in the next section can be applied to any type of cluster.
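As a concrete illustration, here is a minimal sketch (Python shelling out to the gcloud CLI) of creating one regional and one zonal cluster. The cluster names, locations and release channel are assumptions for illustration, not recommendations.

```python
# Hypothetical sketch: create a regional and a zonal GKE cluster via gcloud.
# Names, locations and the release channel are placeholders.
import subprocess

def create_cluster(name: str, location_flag: str, location: str) -> None:
    """Create a GKE cluster; a regional cluster gets replicated masters."""
    subprocess.run(
        [
            "gcloud", "container", "clusters", "create", name,
            location_flag, location,         # "--region" or "--zone"
            "--release-channel", "regular",  # let GKE auto upgrade within the channel
            "--num-nodes", "1",              # nodes per zone
        ],
        check=True,
    )

# Regional cluster: multiple masters, auto upgraded one at a time.
create_cluster("prod-us-central1", "--region", "us-central1")
# Zonal cluster: a single master in one zone.
create_cluster("batch-us-central1-a", "--zone", "us-central1-a")
```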

Cluster placement and footprint — There are a few factors that need to be considered when deciding on the cluster placement and footprint.

  1. The zones and regions the clusters are required to be in
  2. Number and size of clusters needed

The first question is usually easy to answer, as the zones/regions are dictated by your business and the regions in which you serve your users. The answer to the second question typically falls into two categories, each with pros and cons.

  1. Small number of large clusters — You may choose to utilize the redundancy and resiliency provided by regional clusters and place one (or two) large regional clusters per region. The benefit of this approach is the low operational overhead of managing only a few clusters. The downside is that a single cluster event may affect a large number of services at once, i.e. a large blast radius.
  2. Large number of small clusters — Another strategy is to create a large number of small clusters. This reduces the cluster blast radius, as your services are split across many clusters. This approach also works well for short-lived ephemeral clusters (for example, clusters running a batch workload). The downside is higher operational overhead, as there are more clusters to upgrade, and there may be additional cost associated with a higher number of masters. The overhead can be offset by automation, a predictable schedule and strategy, and careful coordination between the teams and services affected.

This guide does not recommend one approach over the other; they are simply options. In some cases, both design patterns can be chosen for different categories of services. The strategies discussed below work with either design choice. A guide for deciding on a strategy can be found here.

Capacity planning — When planning for capacity, it is important to take into account the chosen cluster lifecycle strategy. Capacity planning must account for normal service load as well as maintenance events. There are two types of events to consider:

  1. Planned events like cluster upgrades
  2. Unplanned events like cluster outages, for example bad config pushes, bad rollouts, etc.

When capacity planning, you must take into account both total and partial outages. If you design for planned maintenance events only, then every distributed service must have one more cluster than required, so that one cluster at a time can be taken out of rotation for upgrades without degrading the service. This approach is also referred to as “N+1 capacity planning”. If you design for planned and unplanned maintenance events, then every distributed service must have two (or more) clusters beyond what is required to serve intended capacity: one for the planned event and one for an unplanned event in case it occurs during the planned maintenance window. This approach is also referred to as “N+2 capacity planning”.
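To make the arithmetic concrete, here is a minimal sketch of N+1 versus N+2 sizing. The request rate and per-cluster capacity are made-up numbers for illustration only.

```python
# A minimal sketch of N+1 / N+2 capacity planning with illustrative numbers.
import math

def clusters_needed(peak_rps: float, rps_per_cluster: float, spare: int) -> int:
    """spare=1 gives N+1 (one planned drain); spare=2 gives N+2 (planned + unplanned)."""
    n = math.ceil(peak_rps / rps_per_cluster)  # clusters needed for normal service load
    return n + spare

print(clusters_needed(30_000, 10_000, spare=1))  # 4 clusters (N+1)
print(clusters_needed(30_000, 10_000, spare=2))  # 5 clusters (N+2)
```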

In multi-cluster architectures, the terms draining and spilling are often used. They refer to the process of removing (or draining) traffic from a cluster and redirecting (or spilling) it onto other clusters during upgrades and maintenance events. This is accomplished using networking solutions like multi-cluster ingress or other load balancing methods. Careful use of draining and spilling is at the heart of some of the cluster lifecycle management strategies discussed in the following section. During capacity planning, draining and spilling must be taken into account. For example: if a single cluster is drained, do the other clusters have enough capacity to handle the additional spilled traffic? Other considerations include whether there is sufficient capacity in the desired zone or region, or whether traffic needs to be sent to a different region (if, for example, you use a single regional cluster per region).

Draining and Spilling
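The drain question above can be expressed as a simple capacity check. The sketch below assumes hypothetical cluster names and per-cluster load and capacity figures; in practice these would come from your monitoring system.

```python
# Hypothetical numbers: can the remaining clusters absorb the drained cluster's load?

def can_drain(cluster: str, load_rps: dict, capacity_rps: dict) -> bool:
    """True if the other clusters' spare capacity covers the spilled traffic."""
    spilled = load_rps[cluster]
    spare = sum(capacity_rps[c] - load_rps[c] for c in capacity_rps if c != cluster)
    return spare >= spilled

load = {"gke-us-east-1": 6_000, "gke-us-east-2": 5_000, "gke-us-east-3": 4_000}
capacity = {"gke-us-east-1": 10_000, "gke-us-east-2": 10_000, "gke-us-east-3": 10_000}
print(can_drain("gke-us-east-1", load, capacity))  # True: 11,000 rps spare vs 6,000 rps spilled
```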

Clusters and distributed services — Service-based cluster design dictates that cluster architecture (number, size and location) is determined by the services that are required to run on the clusters. Therefore the placement of your clusters is dictated by where the distributed services are needed. There are a number of considerations to take into account when deciding on the placement of distributed services. Some of these are:

  1. Location requirement — which regions does the service need to be served out of?
  2. Criticality — how critical is the availability of the service to the business?
  3. SLO — the service level objectives for the service (typically based on criticality)
  4. Resilience — how resilient does the service need to be? Does it need to withstand cluster, zonal or even regional failures?

Clusters can be single tenant or multi-tenant. Single tenant clusters serve only a single service, or a product represented by a set of services; they do not share the cluster with other services or products. Multi-tenant clusters can run many services and products, typically partitioned into namespaces. When planning for cluster upgrades, you must consider the number of services a single cluster affects when it is drained, and account for spilling each of these services to other appropriate clusters, as sketched below.
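For illustration, a service-to-cluster map makes this blast-radius question easy to answer. The service and cluster names below are hypothetical.

```python
# Which distributed services does draining one cluster affect, and which other
# clusters can each service spill to? Names are illustrative only.

service_to_clusters = {
    "frontend": {"gke-us-1", "gke-us-2", "gke-eu-1"},
    "checkout": {"gke-us-1", "gke-us-2"},
    "batch-reports": {"gke-us-3"},  # single-tenant cluster, no spill target
}

def drain_impact(cluster: str) -> dict:
    """Map each affected service to the clusters it can spill to."""
    return {
        svc: clusters - {cluster}
        for svc, clusters in service_to_clusters.items()
        if cluster in clusters
    }

print(drain_impact("gke-us-1"))
# e.g. {'frontend': {'gke-us-2', 'gke-eu-1'}, 'checkout': {'gke-us-2'}}
```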

Impact to teams — A cluster event not only affects services but may also impact teams. For instance, the DevOps team might need to redirect or halt their CI/CD pipelines during a cluster upgrade. Likewise, support teams might get alerted for planned outages, so they need to be notified in advance. It is therefore important to realize, and plan for, the fact that cluster events impact more than just services. Automation, careful coordination and tooling must be in place to ease the impact on the teams involved and avoid surprises between operations and support teams. A cluster or cluster fleet upgrade should be routine and “uneventful”, with all teams well informed.

Timing, scheduling and coordination — Kubernetes releases a new minor version quarterly and maintains the last three releases. The timing and scheduling of cluster upgrades must be planned carefully, and there must be collective agreement on when these upgrades take place. There are a few considerations in this regard:

  1. How often do you upgrade? Do you upgrade every quarter, every half (two quarters), or on a different timeline?
  2. When do you upgrade? Do you upgrade at the beginning of the quarter when business slows down, or during other business downtime dictated by your specific industry?
  3. When do you not upgrade? Do you have clear planning around when not to upgrade, for example Black Friday, Cyber Monday, or during high-profile conferences and other industry-specific events? (A simple check for this is sketched after this list.)
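As a toy example of the “when not to upgrade” point, a blackout-window check might look like the sketch below. The dates are illustrative only.

```python
# A toy sketch: reject proposed upgrade windows that overlap blackout periods.
from datetime import date

BLACKOUTS = [
    (date(2020, 11, 27), date(2020, 11, 30)),  # Black Friday through Cyber Monday
    (date(2020, 12, 20), date(2021, 1, 2)),    # holiday change freeze
]

def upgrade_allowed(start: date, end: date) -> bool:
    """True if the proposed window does not overlap any blackout period."""
    return all(end < b_start or start > b_end for b_start, b_end in BLACKOUTS)

print(upgrade_allowed(date(2020, 11, 9), date(2020, 11, 13)))   # True
print(upgrade_allowed(date(2020, 11, 30), date(2020, 12, 4)))   # False, overlaps Cyber Monday
```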

It is important to have a strategy in place. It is equally important that it is clearly communicated to the service owners as well as the operations and support teams. There should be no surprises: everyone knows and expects when and how the clusters are upgraded. This requires clear coordination with all of the teams involved. A single service has multiple teams that interact with it; typically they can be grouped into two categories. The service developer is the persona responsible for creating and coding the business logic into a service. The service operator is responsible for safely and reliably running the service; the operators can be composed of multiple teams like policy/security admins, networking admins, support teams etc. They must all be kept in the loop during cluster upgrades so they can take proper actions during this time. One option is to treat this the same way as an outage incident: you have an incident commander, a chat room and a postmortem (even if no users were impacted).

With these design and planning considerations in mind, let’s discuss common GKE multi-cluster lifecycle management strategies.

GKE Cluster Lifecycle Strategies

This section discusses three main cluster lifecycle management strategies often used in GKE multi-cluster architectures. It is important to note that one size does not fit all; you may end up choosing multiple strategies for various categories of services and business needs.

Rolling Upgrades

Rolling Upgrade Strategy

This is the simplest and the most cost effective strategy. You start with N clusters running old_ver (the current production version). You then drain m clusters at a time, where m is less than N. You then either delete and recreate the drained clusters at the new desired version, or upgrade them in place. The choice between recreating and upgrading depends on the size of the clusters as well as your belief in immutable infrastructure: immutable infrastructure dictates that, instead of constantly upgrading a cluster, which may produce undesirable results over time, you create new clusters and avoid any unforeseen configuration drift. This is quite easily accomplished with GKE, as you can create a GKE cluster with a single command or an API call. The new-cluster strategy requires that you have the entire cluster configuration (cluster manifests) stored outside of the cluster, typically in Git, so you can apply the same configuration templates to the newly created cluster. If this is a new cluster, ensure that your CI/CD pipelines point to the correct cluster. After the cluster is properly configured, you can push traffic back onto it slowly while monitoring the services’ SLOs.

The process is repeated for all clusters. Depending on your capacity planning, you may be able to upgrade multiple clusters at a time without violating service SLOs.
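To make the flow concrete, here is a heavily simplified orchestration sketch of the recreate variant. The drain_traffic and restore_traffic helpers are placeholders for whatever multi-cluster ingress or load-balancing mechanism you use, and the cluster names, region and version are made up; a real pipeline would also re-apply cluster config from Git, repoint CI/CD and watch SLOs between steps.

```python
# Hedged sketch of the rolling upgrade strategy: drain m clusters per batch,
# recreate them at the new version, then shift traffic back.
import subprocess

def gcloud(*args: str) -> None:
    subprocess.run(["gcloud", "container", "clusters", *args], check=True)

def drain_traffic(cluster: str) -> None: ...    # placeholder: remove cluster from multi-cluster ingress / LB
def restore_traffic(cluster: str) -> None: ...  # placeholder: add cluster back while watching service SLOs

def rolling_upgrade(clusters: list, region: str, new_ver: str, m: int = 1) -> None:
    """Drain, recreate and restore m clusters per batch (m < N)."""
    for i in range(0, len(clusters), m):
        for c in clusters[i:i + m]:
            drain_traffic(c)
            gcloud("delete", c, "--region", region, "--quiet")
            gcloud("create", c, "--region", region, "--cluster-version", new_ver)
            # re-apply cluster config (manifests) from Git and repoint CI/CD here
            restore_traffic(c)

rolling_upgrade(["gke-us-1", "gke-us-2", "gke-us-3"], "us-central1", "1.16.8-gke.15", m=1)
```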

This strategy is great if you value simplicity and cost over resiliency. With this strategy, you never exceed the GKE fleet’s required capacity for all distributed services.

Rolling Upgrade Strategy Timeline

Blue/Green

Blue/Green Strategy

This is also a simple strategy; it provides some added resiliency but is a bit more costly than the previous one. This strategy is very similar to the previous one: the only difference is that instead of draining existing clusters first, you create m new clusters at the desired version first, where m is less than or equal to N. You add the new clusters to the CI/CD pipelines and then slowly spill traffic over while monitoring the service SLOs. When the new clusters are fully taking traffic, you drain and delete the clusters running the older version. This is akin to the blue/green upgrade strategy typically used for services. Creating multiple new clusters at a time increases the overall cost but speeds up the fleet upgrade; the added cost applies only for the duration of the upgrade, while the additional clusters are in use. The benefit of creating new clusters first is easy rollback in case of a failure, and the new cluster can be tested before sending production traffic to it. As these clusters co-exist with their old-version counterparts for only a short period of time, the additional cost should be minimal.
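Under the same assumptions as the rolling-upgrade sketch above (placeholder names, a placeholder traffic-shifting helper), the blue/green flow differs only in ordering: create first, shift traffic, then delete.

```python
# Hedged sketch of the blue/green strategy: "green" clusters at the new version
# are created before the "blue" clusters are drained and deleted.
import subprocess

def gcloud(*args: str) -> None:
    subprocess.run(["gcloud", "container", "clusters", *args], check=True)

def shift_traffic(from_cluster: str, to_cluster: str) -> None: ...  # placeholder: gradual spill via multi-cluster ingress

def blue_green_upgrade(pairs: list, region: str, new_ver: str) -> None:
    for old, new in pairs:
        gcloud("create", new, "--region", region, "--cluster-version", new_ver)
        # apply cluster config from Git, smoke test, and add to CI/CD before shifting traffic
        shift_traffic(old, new)                               # monitor service SLOs as traffic moves
        gcloud("delete", old, "--region", region, "--quiet")  # only once the new cluster takes full traffic

blue_green_upgrade([("gke-us-blue-1", "gke-us-green-1")], "us-central1", "1.16.8-gke.15")
```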

This strategy is great if you value simplicity and resiliency over cost. With this strategy, additional cluster(s) are added first, so you exceed the GKE fleet’s required capacity for the duration of the upgrade.

Blue/Green Strategy Timeline

Canary Clusters

Canary Clusters Strategy

This is the most resilient and complex strategy of the three. It completely abstracts cluster lifecycle management from service lifecycle management, thereby offering the lowest risk and highest resilience for your services. In the previous two strategies, you maintain your entire GKE fleet on a single version. In this strategy, you maintain two or perhaps three fleets of GKE clusters running different versions. Instead of upgrading the clusters, you migrate services from one fleet of clusters to the other over time. When the oldest GKE fleet is naturally drained (meaning all of the services have been migrated to the next versioned GKE fleet), you delete that fleet. This strategy requires you to maintain a minimum of two GKE fleets: one for the current production version and one for the next production candidate version. You can also maintain more than two GKE fleets, which gives you more flexibility, but your cost and operational overhead also go up.

It is important to note that this is not the same as having clusters in different environments, for example dev, stage and prod. Non-production environments are great for testing Kubernetes features and/or services with non-production traffic. This strategy dictates that you maintain multiple GKE fleet versions in the production environment. It is similar to the canary release strategies often used for services. With canary service deployments, the service owner can always pinpoint issues to a particular version of the service; with canary clusters, they must also take into account the GKE fleet versions their services are running on. A single distributed service version could potentially run on multiple GKE fleet versions. The migration of a service can happen gradually, so you can observe the service on the new fleet before sending all of its traffic to the new versioned clusters (as in the previous two strategies).
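The sketch below illustrates the fleet bookkeeping only; the versions, service names and the migrate_service helper are assumptions, and the actual traffic shifting would again be handled by your multi-cluster networking layer.

```python
# Illustrative sketch of the canary-cluster strategy: two production fleets at
# different GKE versions co-exist, and services migrate between them over time.

fleets = {
    "1.15": {"frontend", "checkout", "batch-reports"},  # current production fleet
    "1.16": set(),                                      # next production candidate fleet
}

def migrate_service(service: str, from_ver: str, to_ver: str) -> None:
    """Move one service to the newer fleet, e.g. by deploying it there and
    shifting its traffic gradually while watching its SLOs."""
    fleets[from_ver].discard(service)
    fleets[to_ver].add(service)

for svc in ("frontend", "checkout", "batch-reports"):
    migrate_service(svc, "1.15", "1.16")

if not fleets["1.15"]:
    print("old fleet naturally drained; the 1.15 clusters can be deleted")
```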

This strategy is great if you value resilience over everything else.

The decision tree below may be useful for determining which strategy is best for you based on the service and business needs.

GKE Lifecycle Management Decision Matrix

Up next… Part V, a hands-on lab: a step-by-step tutorial on GKE multi-cluster rolling upgrades using Ingress for Anthos.
