American chimpanzee Ham in his Mercury mission capsule. Ham was the first canary to pull levers in outer space.

How Canary Deployments Work, Part 2: Developer vs. Operator Concerns

Pei-Ming Wu
Glasnostic

--

This is the second of a two-part series on canary deployments. In part one, we covered the developer pattern and how it is supported in Kubernetes, Linkerd and Istio. In this post, we explore the operational pattern and how it is implemented in Glasnostic, compare the various implementations and finally weigh the pros and cons of canary deployments.

Two Perspectives on the Canary Pattern

There are two perspectives on the canary pattern: a narrow view that developers take when they ask themselves “will this update work as expected?” and a wider one that operations teams take when they wonder “will this update cause my service landscape to fail?” As a result, canary deployments as an operational pattern require control of traffic to the production and canary (and potentially baseline) clusters, plus controls to protect the parts of the service landscape that are surrounding the canary cluster. Operators also need the ability to create canary deployments independent of (and not interfering with) other, pre-existing and potentially overlapping policies. In short, operations teams want the ability to layer policies around the canary to protect the surrounding architecture.

Glasnostic’s Operational Perspective

Glasnostic is built from the ground up to support operational patterns. Analogous to a sound engineer’s mixing board, it is designed around the concept of grouping service interactions logically in channels, each of which then acts as a point of control for the interactions it applies to. Glasnostic supports the creation of any number of channels, for arbitrary sets of interactions. Once a channel is defined, operations teams may then control its underlying interactions by applying policies and operations. Channels are independent of each other and thus can be layered arbitrarily.

Glasnostic is a control plane for operations teams that controls the complex interactions among microservice applications in order to detect and remediate issues, prevent cascading failures and avert security breaches.

Let’s look at two examples of canary deployments from the operational perspective. First, we’ll look at a simple, undifferentiated deployment without client rules and then at a more involved deployment with client differentiation by source with an additional layered channel governing the canary cluster’s upstream interactions.

Basic Canary Pattern

To implement a basic canary pattern in Glasnostic, operators can simply create a new channel for any traffic directed at the canary cluster and then apply a rate limit at whatever level is desired. While this does the job, it is often preferable to also create a channel to monitor the existing production cluster for comparison. Figure 1 illustrates this setup.

Figure 1: Basic canary deployment with Glasnostic. One channel is created to monitor the existing production cluster (1) while a second channel regulates traffic to the canary (2).

Creating a channel for the canary cluster allows for straightforward regulation of its traffic. This setup can be easily extended to one that includes a baseline cluster to compare against by creating a third channel around traffic to a subset of the production cluster that is of equal size to the canary cluster.
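
To make the idea of a rate-limited canary channel concrete, here is a minimal sketch of a token-bucket limiter. It is not Glasnostic's API, since Glasnostic is operated through its UI. The class name RateLimitedChannel and the parameters max_rps and burst are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RateLimitedChannel:
    """Hypothetical model of a channel that caps requests per second."""
    name: str
    max_rps: float                 # sustained requests per second allowed
    burst: float = 1.0             # bucket capacity, i.e. the largest burst
    _tokens: float = field(init=False, default=0.0)
    _last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self._tokens = self.burst  # start with a full bucket

    def admit(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self._last)
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.burst, self._tokens + elapsed * self.max_rps)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# Usage: let roughly one request per second reach the canary.
canary = RateLimitedChannel("canary", max_rps=1.0)
decisions = [canary.admit(t) for t in (0.0, 0.5, 2.0)]
```

The timestamp is passed in explicitly so the behavior is easy to reason about and to test. A real deployment would read a monotonic clock instead.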

Canary Pattern With Client Differentiation and Policy Layers

Figure 2 shows a more complex canary deployment around an inventory management service within an e-commerce application that makes use of four channels. As before, the first channel is created to monitor the production cluster. This time, however, the canary is set to receive traffic from a different, development environment. As a result, the second channel governs requests from the development environment to the canary. In addition, a third backstop channel limits how much load the canary is allowed to generate towards upstream services. Finally, all these policies can be instituted, adjusted or removed without affecting a blanket segmentation between users and inventory services using a fourth channel.

Figure 2: Layered canary deployment with Glasnostic. One channel routes production traffic to the existing production cluster (1) while a second channel routes requests from a development environment to the canary (2). To protect upstream services from the canary, a third, backstop channel limits requests issued by the canary (3). All the while, a separate global policy segmenting order-related services from all inventory services remains in force and unaffected by the canary deployment (4).

This example shows how layered policies give operations teams full control over how changes to the service landscape are introduced and how to not only ensure that new deployments work on their own but also to protect the overall architecture from potential fallout from such changes.
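
The layering described above can be sketched in a few lines. This is an illustrative model under stated assumptions, not Glasnostic's implementation: the Channel and Interaction types, the service names (dev, orders, inventory-prod, inventory-canary) and the per-interval budgets are all hypothetical. The point it illustrates is that every channel matching an interaction applies independently, so an interaction passes only if all of its layers allow it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Interaction:
    source: str
    destination: str


@dataclass
class Channel:
    name: str
    matches: Callable[[Interaction], bool]
    budget: Optional[int] = None  # requests left this interval; None = monitor only
    deny: bool = False            # True = block every matching interaction
    seen: int = 0                 # per-channel counter, used for monitoring


def admit(interaction: Interaction, channels: List[Channel]) -> bool:
    """Admit an interaction only if every channel layered over it allows it."""
    layers = [c for c in channels if c.matches(interaction)]
    for c in layers:
        c.seen += 1
    if any(c.deny or c.budget == 0 for c in layers):
        return False
    for c in layers:
        if c.budget is not None:
            c.budget -= 1
    return True


def figure2_channels() -> List[Channel]:
    """Hypothetical stand-ins for the four channels in Figure 2."""
    return [
        Channel("production", lambda i: i.destination == "inventory-prod"),
        Channel("dev-to-canary",
                lambda i: i.source == "dev" and i.destination == "inventory-canary",
                budget=5),
        Channel("canary-backstop",
                lambda i: i.source == "inventory-canary",
                budget=2),
        Channel("segmentation",
                lambda i: i.source == "orders" and i.destination.startswith("inventory"),
                deny=True),
    ]
```

Because each channel carries its own selector and policy, removing the two canary channels from the list leaves the production and segmentation channels untouched, which mirrors the independence of layers described in the text.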

Advantages of Glasnostic’s Operational Approach to Canary Deployments

There are several key advantages to Glasnostic’s operational approach to canary deployments:

  • Containment of canaries. Like any change introduced into a complex system, canaries can negatively impact your architecture. Rather than merely checking whether a new deployment works in isolation, operators can use Glasnostic to ring-fence canaries, thus protecting the existing architecture from any negative fallout.
  • Independent, multi-level control. Glasnostic is built around grouping arbitrary classes of traffic into logical channels and controlling them independently of each other. In the context of canary deployments, being able to define channels quickly not only provides a convenient way to establish a baseline cluster to compare a canary to, but also allows operators to further specialize individual traffic classes as needed by applying additional policies or operations. For instance, operators may use a canary deployment’s production cluster channel to backpressure against a sudden influx of bursty traffic or to ensure quality of service for tier one clients, all without affecting the canary pattern.
  • Unified operations. Because Glasnostic provides the same operational controls for canary deployments as for any other operational pattern, operations teams can work with a unified and cohesive toolset without having to contend with siloed solutions. As a result, operations teams are able to rely on a seamless operational workflow and stay in control of their service landscape.

Comparing Support for Canary Deployments

Among the three projects we compared in part one of this series, Kubernetes has the least robust support for canary deployments. While ingress traffic can be subjected to some routing rules, the routing of intra-cluster (“east-west”) traffic is based on round-robin load balancing only. As a result, the share of traffic hitting a canary can only be influenced by adjusting the number of running production instances relative to the number of canary instances.
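
The arithmetic behind this limitation is simple enough to sketch. Assuming requests are spread evenly across all instances behind one service, the canary's share of traffic is just its share of the instance count. The function names below are hypothetical helpers, not part of any Kubernetes tooling.

```python
import math
from fractions import Fraction


def canary_share(production_replicas: int, canary_replicas: int) -> float:
    """Fraction of traffic reaching the canary under even load balancing."""
    total = production_replicas + canary_replicas
    if total <= 0:
        raise ValueError("at least one replica is required")
    return canary_replicas / total


def production_replicas_for(target_percent: int, canary_replicas: int = 1) -> int:
    """Smallest production replica count that keeps the canary's share
    at or below target_percent (an integer from 1 to 99)."""
    if not 1 <= target_percent <= 99:
        raise ValueError("target_percent must be between 1 and 99")
    # Solve canary / (canary + production) <= target for production,
    # using exact fractions to avoid floating-point rounding.
    needed = Fraction(canary_replicas * (100 - target_percent), target_percent)
    return math.ceil(needed)


# Usage: a 5% canary with a single canary instance needs 19 production instances.
replicas = production_replicas_for(5)
share = canary_share(replicas, 1)
```

This shows why instance-count balancing is coarse: sending 1% of traffic to a single canary instance requires 99 production instances, whether or not the load justifies them.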

Linkerd 1.x was built on top of Finagle and as such brings significantly more flexibility to canary deployments. In particular, it supports fine-grained routing rules based on weights and HTTP headers. Istio adds support for explicit client rules, thus allowing canary deployments to be based on source differentiation.
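
The three kinds of rules mentioned here (weights, HTTP headers and client rules) can be combined into a single routing decision. The sketch below is a generic illustration and not Linkerd or Istio configuration. The header name x-canary, the function name and its parameters are assumptions made for this example.

```python
import random
from typing import Mapping, Optional


def choose_backend(
    headers: Mapping[str, str],
    source: str,
    canary_weight: float,
    canary_sources: frozenset = frozenset(),
    rng: Optional[random.Random] = None,
) -> str:
    """Return "canary" or "production" for one request.

    `headers` is expected to use lower-case keys (hypothetical convention).
    """
    # Header rule: an explicit opt-in always goes to the canary.
    if headers.get("x-canary", "").lower() == "true":
        return "canary"
    # Client rule: selected sources are always sent to the canary.
    if source in canary_sources:
        return "canary"
    # Weight rule: all remaining requests are split at random by weight.
    rng = rng or random
    return "canary" if rng.random() < canary_weight else "production"


# Usage: send 10% of ordinary traffic, plus everything from "dev", to the canary.
backend = choose_backend({}, "dev", 0.1, frozenset({"dev"}))
```

Note that every rule here decides only where a request goes. None of them limits what the canary itself may do to the services around it.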

However, none of these projects approaches canary deployments from an operational perspective. Round-robin balancing, destination rules, routing based on HTTP headers and client rules are all designed to balance traffic between production and canary clusters, not to protect the surrounding architecture from the deployment. As a result, these projects apply highly localized, YAML-based configurations instead of helping operators approximate effective policies by presenting high-level metrics based on golden signals in a UI.

Ultimately, it is this localized application of static configuration that does not lend itself to creating the set of layered policies that an operational approach to canary deployments would require. This is why Glasnostic was designed from the ground up around a UI that allows operations teams to “view and do,” to detect and remediate, with full support for policy layering.

Figure 3: Support for canary deployments in Kubernetes, Linkerd 1.x, Istio and Glasnostic. While Kubernetes’ support is limited to balancing instance counts, Linkerd 1.x, like Finagle on which it is based, allows routing rules and HTTP header-based routing. Istio expands this feature set by adding the ability to specify client rules. Glasnostic’s extensive support for policy layering allows teams to approach canaries from an operational perspective.

Pros and Cons of Canary Deployments

Canaries Are a First Step Towards Deploying to Production

Canary deployments help development and operations teams test each new deployment in production to see how it interacts with the “real world.” This is particularly useful in complex service landscapes with multiple microservice-based applications, where development teams introduce changes independent of each other and according to their own release schedules, or when upstream or third-party services over which the operator has no control are in the mix.

Fundamentally, observing the behavior of a new deployment in production (albeit with fewer users) will always be less risky than the alternative of “let’s just push to prod and see what happens, we can always roll back, right?” As a result, the main advantage of a canary deployment to operators is the ability to incrementally roll out new features and services while confining potential problems not only to a subset of users, but also to a subset of the operating environment, which includes the network, compute and storage infrastructure.

Canary Deployments Require a Readiness to “Move Fast and Break Things”

However, canary deployments are not without their challenges. A big one is that without significant upfront investments in reusable automation, monitoring, tooling and rollback mechanisms, canary deployments will require a large amount of manual setup work every time the pattern is put in place. On the monitoring side, canary deployments require some observability into KPIs like HTTP success rates to decide whether to promote the canary or to roll it back. Apart from the work of setting up such monitoring, these KPIs have to be monitored manually unless tools such as Weaveworks’ Flagger or Spinnaker’s Kayenta are used. Finally, rollbacks can be challenging if incompatibilities between deployment versions and database schema changes are not managed correctly.
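
The promote-or-roll-back decision based on HTTP success rates can be sketched as a simple comparison between the canary and a baseline. This is a deliberately naive illustration, not how Flagger or Kayenta work internally. The Window type, the decide function and the thresholds min_requests and max_drop are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed for one cluster over the same time window."""
    requests: int
    errors: int  # for example, HTTP 5xx responses

    @property
    def success_rate(self) -> float:
        if self.requests == 0:
            return 1.0
        return 1 - self.errors / self.requests


def decide(canary: Window, baseline: Window, *,
           min_requests: int = 500, max_drop: float = 0.01) -> str:
    """Return "wait", "rollback" or "promote"."""
    # Too little traffic to judge: keep observing.
    if canary.requests < min_requests:
        return "wait"
    # The canary succeeds noticeably less often than the baseline: roll back.
    if baseline.success_rate - canary.success_rate > max_drop:
        return "rollback"
    return "promote"


# Usage: the canary's success rate is 95.0% against a baseline of 99.6%.
verdict = decide(Window(1000, 50), Window(1000, 4))
```

Comparing against a baseline measured over the same window, rather than against a fixed threshold, keeps a general production incident from being blamed on the canary.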

Canary deployments are also not a good idea in scenarios where even a small number of end users would not be able to tolerate failures of any kind or where the failures they experience may cause reputational harm. For example, services that could cause bank transfers to fail or services with the potential to fail very visibly are poor candidates for canary deployments if you would rather end users not complain to your support department or on social media.

Summary

In part one of this series, we laid out the basic canary deployment pattern and summarized how some popular open-source projects support it. In this part, we discussed the differences between the developer- and operations-oriented variants of the canary deployment pattern and showed two examples of how canaries can be realized from an operational perspective with Glasnostic. Most importantly, we showed how this operational perspective requires an ability to layer policies. Among the projects discussed in this series, Glasnostic is the only product that supports such policy layering.

Canary deployments are a great first step towards deploying to production but require a not insignificant investment in automation and tooling. They also require a fundamental readiness to “move fast and break things,” which does not lend itself to critical, transactional or highly visible workloads. Nevertheless, the benefits of being able to move fast where companies can afford to do so far outweigh the cost of adopting canary deployments on a large scale.

Originally published at https://glasnostic.com on May 7, 2019.
