Kubernetes Engine (GKE) multi-cluster life cycle management series

Part II: Multi-cluster and Distributed Services

Ameer Abbas
Google Cloud - Community
5 min read · Apr 21, 2020


Distributed Services on GKE

In Part I, I talked about various reasons for multi-cluster Kubernetes architectures. In this part, I’ll introduce the concept of a distributed service. A distributed service is a Kubernetes Service that is deployed to multiple Kubernetes clusters. Distributed services are stateless and act identically across multiple clusters: a distributed service has the same Kubernetes Service name and is implemented in the same namespace on every cluster. Kubernetes Services are tied to the fate of the cluster they run on; if a Kubernetes cluster goes offline, so does the Kubernetes Service. Distributed services are abstracted from individual Kubernetes clusters: even if one (or more) clusters goes down, the distributed service can remain online and within its desired SLO (Service Level Objective).
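Concretely, making a service "distributed" can be as simple as applying the same Service manifest to every member cluster. A minimal sketch, with a hypothetical Service `foo` in a hypothetical `frontend` namespace:

```yaml
# Hypothetical Service "foo" in namespace "frontend". Applying this same
# manifest (with matching backing Deployments) to every member cluster is
# what makes foo a distributed service: same name, same namespace, everywhere.
apiVersion: v1
kind: Service
metadata:
  name: foo
  namespace: frontend
spec:
  selector:
    app: foo
  ports:
    - port: 80
      targetPort: 8080
```

For example, with hypothetical kubectl contexts `cluster-1` and `cluster-2`, you would apply this same file via `kubectl --context cluster-1 apply -f foo-service.yaml` and `kubectl --context cluster-2 apply -f foo-service.yaml`.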

In the following diagram, Foo is a distributed service running on multiple clusters (with the same Service name and namespace). As such, it is not tied to a single cluster and is hence represented conceptually as a layer above the Kubernetes cluster infrastructure layer. If any of the individual clusters running Service Foo were to go down, the distributed service Foo would remain online. Meanwhile, Services Bar and Baz each run on a single cluster; their uptime (and availability) depends on the uptime of the specific Kubernetes cluster on which they reside.

Distributed Service Foo and Kubernetes Services Bar and Baz on GKE

Resiliency is one of the reasons for multi-cluster deployments. Distributed services create resilient services on a multi-cluster architecture. Stateless services are prime candidates for distributed services in a multi-cluster environment. There are a few requirements and considerations when working with distributed services.

Multi-cluster networking — Traffic destined for a distributed service can be sent to any of the clusters running that service. This is accomplished via a multi-cluster ingress technology like Ingress for Anthos, or by rolling your own external load balancer/proxy solution. In either case, the networking solution must give the operator control over when, where (that is, to which cluster), and how much traffic is routed to a particular instance of a distributed service.
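With Ingress for Anthos, this is expressed as a pair of resources created in the config cluster. A sketch, assuming the hypothetical `foo` service from before (the API shape may differ between versions; resource names here are assumptions):

```yaml
# MultiClusterService selects the Pods backing foo in every member cluster;
# MultiClusterIngress programs a global load balancer in front of them.
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: foo-mcs
  namespace: frontend
spec:
  template:
    spec:
      selector:
        app: foo
      ports:
        - name: http
          protocol: TCP
          port: 80
          targetPort: 8080
---
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: foo-ingress
  namespace: frontend
spec:
  template:
    spec:
      backend:
        serviceName: foo-mcs
        servicePort: 80
```

Unlike a regular Service/Ingress pair, these resources are applied once (to the designated config cluster) and fan out to all registered clusters.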

Multi-cluster Ingress to Distributed Service Foo

Observability — Tools must be in place to observe at the distributed-service layer. SLOs (typically for availability and latency) should be measured collectively for a distributed service. This provides a global view of how each service is performing across multiple clusters. While a distributed service is not a well-defined resource in most observability solutions, the intended outcome can be achieved by collecting and combining the metrics of the individual Kubernetes Services. Solutions like Cloud Monitoring, or open source tools like Grafana and many others, provide Kubernetes Service metrics. Service mesh solutions like Istio and ASM provide Service metrics out of the box, without any instrumentation required.
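As a sketch of "collecting and combining" individual metrics: assuming each cluster's metrics are federated into one Prometheus server and Istio's standard `istio_requests_total` metric is available, a recording rule can aggregate availability for `foo` across all clusters (the rule and label names here are assumptions, not a prescribed setup):

```yaml
# Hypothetical Prometheus recording rule: availability of the distributed
# service foo measured across all clusters at once, because the sum()
# aggregates away any per-cluster label on the federated series.
groups:
  - name: distributed-service-slo
    rules:
      - record: foo:availability:ratio_rate5m
        expr: |
          sum(rate(istio_requests_total{destination_service_name="foo", response_code!~"5.."}[5m]))
          /
          sum(rate(istio_requests_total{destination_service_name="foo"}[5m]))
```

Adding `by (cluster)` to the sums would instead break the same SLO down per cluster, which is useful when deciding whether one cluster is dragging the distributed service out of SLO.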

Service Placement — Kubernetes Services provide node-level fault tolerance within a single Kubernetes cluster: a Kubernetes Service can withstand node outages, because during a node outage the Kubernetes control plane automatically reschedules Pods to healthy nodes. A distributed service provides cluster-level fault tolerance, meaning it can withstand cluster outages. This must be taken into account when capacity planning for a distributed service. A distributed service does not need to run on every available cluster. Which clusters a distributed service runs on may depend on the following requirements:

  1. Where, or in which regions, is the service required?
  2. What is the required SLO for the distributed service?
  3. What type of fault tolerance is required for the distributed service? Cluster, zonal or regional. For instance, multiple clusters in a single zone, or multiple clusters across zones in a single region or multiple regions.
  4. What level of outages should the distributed service withstand in the worst case, at the cluster layer? For example:
     - N+1, meaning a distributed service can withstand a single cluster failure
     - N+2, meaning it can withstand two concurrent failures, for instance a planned and an unplanned outage of a Kubernetes Service in two clusters at the same time

The placement of a distributed service depends on the above requirements.
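The N+1 requirement translates directly into per-cluster replica counts. A sketch with made-up numbers: if `foo` needs 6 replicas at peak load and must survive one cluster outage across 3 clusters, each cluster should run ceil(6 / (3 − 1)) = 3 replicas, so any 2 surviving clusters still provide the required 6.

```yaml
# N+1 sizing sketch (numbers and image are hypothetical): peak load needs
# 6 replicas total, there are 3 clusters, and the service must survive one
# cluster outage, so each cluster runs ceil(6 / (3 - 1)) = 3 replicas.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  namespace: frontend
spec:
  replicas: 3   # per cluster; this same manifest is applied to all 3 clusters
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
    spec:
      containers:
        - name: foo
          image: example.com/foo:1.0  # hypothetical image
          ports:
            - containerPort: 8080
```

Note that this over-provisions during normal operation (9 running replicas for a 6-replica load); that headroom is the cost of cluster-level fault tolerance.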

Rollouts/Rollbacks — Distributed services, like Kubernetes Services, allow for gradual rollouts and rollbacks. Unlike Kubernetes Services, distributed services add the cluster as an additional unit of deployment, as a means for gradual change. Rollouts and rollbacks also depend on the service requirements. In some cases it may be necessary to upgrade the service on all clusters at the same time (for example, a bug fix); in other cases, to slowly roll out (or stagger) the change one cluster at a time. Staggering lowers the risk to the distributed service by introducing changes gradually, though it takes longer depending on the number of clusters. Typically, there is no one-size-fits-all solution; often, multiple rollout/rollback strategies are used depending on the requirements of each distributed service. The important point here is that distributed services must allow for gradual and controlled changes in the environment.
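One common way to stagger a rollout per cluster is a kustomize overlay per cluster, so the new image tag can be promoted one cluster at a time. A sketch, with hypothetical paths and tags:

```yaml
# overlays/cluster-1/kustomization.yaml -- cluster-1 has been upgraded first.
# The overlays for cluster-2 and cluster-3 are identical except they keep
# newTag "1.0" until cluster-1's SLOs are verified, then are promoted in turn.
resources:
  - ../../base
images:
  - name: example.com/foo
    newTag: "1.1"
```

A rollback at any point is just reverting `newTag` in the affected overlays, which keeps the blast radius to the clusters already promoted.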

Business Continuity/Disaster Recovery (BCDR) — These terms are often used together. Business continuity refers to the continuation of (critical) services in the face of a major (planned or unplanned) event, whereas disaster recovery refers to the steps taken to return business operations to their normal state after such an event. There are many BCDR strategies that are beyond the scope of this guide. Suffice it to say, BCDR requires some level of redundancy in systems and services, and the key premise of distributed services is that they run in multiple locations (clusters, zones, regions).

BCDR strategies often depend on the rollout/rollback strategies discussed earlier. For example, if rollouts are performed in a staggered, controlled manner, the effect of a bug or a bad configuration push can be caught early, before it affects a large number of users. At large scale, coupled with a rapid rate of change (for example, in modern CI/CD practices), it is common that not all users are served the same version of a distributed service. BCDR planning and strategies in distributed systems and services therefore differ from traditional monolithic architectures. In traditional systems, a change is made wholesale, affecting many (or perhaps all) users, so a redundant or backup system must be in place in case the rollout has undesirable effects. In distributed systems and services, almost all change is made gradually, affecting only a small number of users at a time.

Cluster lifecycle management — Like controlled rollouts/rollbacks, distributed services allow for controlled cluster lifecycle management. As mentioned before, distributed services provide cluster level resiliency. This means clusters can be “taken out of rotation” for maintenance. This can be done in multiple ways. Some of these strategies are described later in this guide. Suffice it to say here, cluster lifecycle management is a tenet of distributed services that does not apply to Kubernetes Services.
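With Ingress for Anthos, "taking a cluster out of rotation" can be a declarative change: the MultiClusterService accepts a list of member clusters to back the service, and omitting one drains traffic away from it. A sketch, with hypothetical cluster links (field shapes may differ between versions):

```yaml
# Sketch: cluster-2 is removed from rotation for maintenance by listing only
# the remaining clusters. Omitting the clusters field entirely selects all
# member clusters; cluster links here are hypothetical.
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: foo-mcs
  namespace: frontend
spec:
  template:
    spec:
      selector:
        app: foo
      ports:
        - name: http
          protocol: TCP
          port: 80
          targetPort: 8080
  clusters:
    - link: "us-central1-a/cluster-1"
    - link: "us-central1-b/cluster-3"
```

Once maintenance (for example, a cluster upgrade or replacement) is done, adding cluster-2's link back returns it to rotation, all without the distributed service ever going offline.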

The remainder of this series focuses on the cluster lifecycle aspect of distributed services.

Up next… Part III: GKE Cluster lifecycle management

