How Canary Deployments Work, Part 1: Kubernetes, Istio and Linkerd

Published in Glasnostic, Apr 30, 2019

This is the first of a two-part series on canary deployments. In this post, we cover the developer pattern and how it is supported in Kubernetes, Linkerd and Istio. In part two, we’ll explore the operational pattern, how it is supported in Glasnostic, a comparison of the various implementations and finally the pros and cons of canary deployments.

A canary deployment (or canary release) is a microservices pattern that should be part of every continuous delivery strategy. This pattern helps organizations deploy new releases to production gradually, to a subset of users at first, before making the changes available to all users. In the unfortunate event that things go sideways in the push to prod, canary deployments help minimize the resulting downtime, contain the negative effects to a small number of users, and make it easier to initiate a rollback if necessary. In a nutshell, think of a canary deployment as a phased or incremental rollout.

What are Canary Deployments?

The canary deployment pattern takes its name from the now-defunct practice of coal miners bringing canary birds into the mines with them to alert them when toxic gases reached dangerous levels. As you might imagine, as long as the canary sang, the air was safe to breathe. If the canary died, it meant it was time to evacuate! What does this have to do with software development? Think of the canary as a small set of end users who are exposed to new services or new capabilities before the majority of users are. The advantage of this type of rollout is that if the deployment breaks in unacceptable ways, the release can be rolled back and the adverse effects are contained to just that small set of users.

Figure 1: Diagram of a typical canary deployment. Initially, client traffic to a service is routed to the existing production cluster (blue). To test a new version of the service, a canary cluster is deployed and the governing gateway or load balancer is instructed to divert a small amount of traffic to it (green). Often, this traffic is simply a small percentage of all requests for the service in question. At times, though, operators may prefer to only route a specific segment of traffic to the canary cluster, such as requests from a particular set of users or requests from a specific geography. If the canary cluster behaves as intended, the deployment is rolled out in full and the old production cluster is removed. Sometimes, a separate cluster serves as a baseline in canary analysis (hatched blue).
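The routing decision described in the caption can be sketched in a few lines of Python. This is only an illustration, not how any particular gateway implements it: the user names, the region and the 5% share are made-up values, and hashing the user ID (rather than picking a random share of requests) is our own choice so that a given user consistently sees the same version.

```python
import hashlib

# Hypothetical settings for illustration only.
CANARY_PERCENT = 5                 # share of remaining users sent to the canary
CANARY_USERS = {"alice", "bob"}    # users always routed to the canary
CANARY_REGIONS = {"eu-west"}       # geographies always routed to the canary


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_cluster(user_id: str, region: str) -> str:
    """Return "canary" or "production" for a single request."""
    # Segment-based routing: specific users or geographies.
    if user_id in CANARY_USERS or region in CANARY_REGIONS:
        return "canary"
    # Percentage-based routing: a stable slice of everyone else.
    if bucket(user_id) < CANARY_PERCENT:
        return "canary"
    return "production"
```

Rolling the canary out further then amounts to raising CANARY_PERCENT step by step; rolling back means setting it to zero and emptying the two sets.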

What about Canary Analysis?

A refinement of the canary pattern called canary analysis involves an additional baseline cluster that runs the current production version alongside the canary cluster and receives the same amount of traffic as the canary. Comparing the canary against this freshly deployed baseline, rather than against the production cluster itself, eliminates any peculiarities of the production cluster that are due to its long-running nature.
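As a minimal sketch of such a comparison, assume that both clusters report request and error counts and that the canary passes when its error rate is at most one percentage point above the baseline's. The metric, the ClusterStats type and the tolerance are assumptions made for this example; real canary analysis usually looks at many more signals.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    """Request and error counts observed for one cluster (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero for a cluster that has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(baseline: ClusterStats, canary: ClusterStats,
                  tolerance: float = 0.01) -> bool:
    """Pass if the canary's error rate is at most `tolerance` above the baseline's."""
    return canary.error_rate <= baseline.error_rate + tolerance
```

Because baseline and canary receive equal traffic and were started at the same time, a difference in their error rates is more likely to come from the new version than from the age of the cluster.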

Canary Deployment Pattern Implementations

Kubernetes

Support for canary deployments in Kubernetes is relatively limited. The approach typically taken involves deploying canary instances in the desired proportion alongside production instances and then configuring the load balancer to distribute load across all instances as evenly as possible. Deploying a canary is somewhat easier if the governing load balancer is an ingress controller. Because ingress rules can be based on a request’s host or path, or a combination of both, this offers more criteria for how traffic can be split.

However, in the majority of cases, the only way to adjust the relative traffic volumes between canaries and production versions is to tinker with instance scaling. (See e.g. this post for a complete example of how this can be done.) In other words, if 10% of traffic should be routed to the canary, it will have to be deployed alongside nine instances of its production version. To make matters worse, this linear relationship only holds true for evenly distributed load balancing strategies such as round-robin balancing. Dynamic strategies such as least-connection balancing make specific ratios difficult to maintain.
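The arithmetic behind this is easy to sketch. Assuming evenly distributed load balancing and identical instances, the following hypothetical helper computes the smallest number of production and canary instances that yields a given canary share.

```python
from fractions import Fraction
from typing import Tuple


def instances_for_canary_share(canary_percent: int) -> Tuple[int, int]:
    """Return the smallest (production, canary) instance counts that give
    the canary `canary_percent` of traffic under even load balancing."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 1 and 99")
    share = Fraction(canary_percent, 100)   # reduced automatically, e.g. 10/100 -> 1/10
    canary = share.numerator
    production = share.denominator - share.numerator
    return production, canary


for percent in (10, 5, 1):
    production, canary = instances_for_canary_share(percent)
    print(f"{percent}% canary: {production} production + {canary} canary instances")
```

For a 10% canary this yields nine production instances and one canary instance, as described above. A 1% canary already requires 99 production instances, regardless of how much capacity the service actually needs.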

On the whole, Kubernetes is not particularly well suited to canary deployments. It does not support canary routing based on request source criteria such as geography or demographics. Kubernetes also tends to waste resources when specific canary routing ratios are required.

Linkerd

Because it is based on Twitter’s Finagle library, Buoyant’s original Linkerd, now commonly referred to as Linkerd 1.x, provides extensive support for more generic, dynamic request routing, which operators can use to implement what Buoyant calls “traffic shifting.” Dynamic request routing is based on sets of routing rules called “delegation tables” (dtabs for short) that are stored globally in namerd and can be changed at runtime without restarting Linkerd proxies. As a service mesh, Linkerd 1.x can apply routing rules to any traffic, north-south or east-west, not just ingress traffic.

Linkerd 1.x support for routing is extensive. When Linkerd 1.x first accepts a request, it assigns the request a logical “destination path.” For instance, a request to http://users-service/lookup might be assigned /svc/users as its destination path. This path may then undergo a series of rule-based transformations. For example, a dtab rule

/svc => /env/prod

would rewrite the previous destination path /svc/users to /env/prod/users.
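To make the prefix rewrite concrete, here is a deliberately simplified model in Python. It applies a single rule once and is not Linkerd code; real dtab resolution evaluates a whole table of rules and is considerably richer.

```python
from typing import List, Optional


def split(path: str) -> List[str]:
    """Split a path such as "/svc/users" into its components."""
    return [part for part in path.split("/") if part]


def rewrite(path: str, prefix: str, destination: str) -> Optional[str]:
    """Apply one rule `prefix => destination` to `path`.

    Returns the rewritten path, or None if the prefix does not match.
    Matching is done per path component, so "/svc" does not match "/svcx".
    """
    path_parts, prefix_parts = split(path), split(prefix)
    if path_parts[:len(prefix_parts)] != prefix_parts:
        return None
    remainder = path_parts[len(prefix_parts):]
    return "/" + "/".join(split(destination) + remainder)


print(rewrite("/svc/users", "/svc", "/env/prod"))  # prints /env/prod/users
```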

Routing rules can be quite expressive. To implement a canary pattern, for instance, operators could specify a rule like

/svc/users => 99 * /env/prod/users & 1 * /env/prod/users-v2

to divert 1% of traffic to a users-v2 canary. In addition, routing rules may be overridden on a per-request basis via the Linkerd-specific l5d-dtab HTTP header. This allows canaries to be tested by explicitly requesting them.
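The effect of such a weighted rule can be simulated in Python. This sketch only models the weighted choice between the two destinations; the sample size and the seed are arbitrary, and nothing here reflects how Linkerd implements the split internally.

```python
import random
from collections import Counter

# The two weighted branches of the rule above: (weight, destination path).
BRANCHES = [(99, "/env/prod/users"), (1, "/env/prod/users-v2")]


def pick(branches, rng: random.Random) -> str:
    """Choose one destination with probability proportional to its weight."""
    weights = [weight for weight, _ in branches]
    paths = [path for _, path in branches]
    return rng.choices(paths, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so that repeated runs are comparable
counts = Counter(pick(BRANCHES, rng) for _ in range(100_000))
for path, count in counts.most_common():
    print(f"{path}: {count / 100_000:.2%}")
```

Over a large number of requests, roughly one in a hundred ends up at users-v2, independent of how many instances of either version are running.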

The more recent Linkerd 2.x is a rewrite of Linkerd in Go and Rust and thus does not include the rich Finagle-based routing capabilities. As a result, canary deployments are not supported out of the box, leaving Linkerd 2.x users to rely on Kubernetes’ limited support for routing. The feature request for routing support in Linkerd 2.x is being tracked here.

Istio

As a service mesh, Istio allows routing rules to be applied to all services in the mesh, not just to ingress traffic. Similar to Linkerd 1.x, these routing rules allow for a fair amount of control over how traffic is directed. Unlike Kubernetes, canary deployments in Istio can be implemented without requiring a specific number of instances.

Canary deployments in Istio are configured in two steps. First, a destination rule is created to define subsets of the target service based on version labels (figure 2). A virtual service rule is then used to specify relative weights between these subsets (figure 3).

Figure 2: Istio destination rule defining a “v1” and a “v2” subset of a users.prod.svc.cluster.local service based on the version label of the service’s instances.

Figure 3: Istio virtual service rule specifying that 95% of traffic to users.prod.svc.cluster.local should be routed to its “v1” subset and 5% to its “v2” subset.
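As a rough sketch of what these two rules could look like, the following Python script builds them as dictionaries and prints them as a JSON list, a format kubectl also accepts. The field names follow Istio's networking.istio.io/v1alpha3 API. The metadata names and the label values "v1" and "v2" are assumptions made for this example.

```python
import json

HOST = "users.prod.svc.cluster.local"
API_VERSION = "networking.istio.io/v1alpha3"

# Step 1: destination rule defining the "v1" and "v2" subsets by version label.
destination_rule = {
    "apiVersion": API_VERSION,
    "kind": "DestinationRule",
    "metadata": {"name": "users"},  # hypothetical resource name
    "spec": {
        "host": HOST,
        "subsets": [
            {"name": "v1", "labels": {"version": "v1"}},  # assumed label value
            {"name": "v2", "labels": {"version": "v2"}},  # assumed label value
        ],
    },
}

# Step 2: virtual service splitting traffic 95/5 between the two subsets.
virtual_service = {
    "apiVersion": API_VERSION,
    "kind": "VirtualService",
    "metadata": {"name": "users"},  # hypothetical resource name
    "spec": {
        "hosts": [HOST],
        "http": [
            {
                "route": [
                    {"destination": {"host": HOST, "subset": "v1"}, "weight": 95},
                    {"destination": {"host": HOST, "subset": "v2"}, "weight": 5},
                ]
            }
        ],
    },
}

manifest = {"apiVersion": "v1", "kind": "List",
            "items": [destination_rule, virtual_service]}
print(json.dumps(manifest, indent=2))
```

Shifting more traffic to the canary only requires changing the two weight values, which should add up to 100.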

Once these rules are applied (via kubectl apply), the canary deployment takes immediate effect.

The example given above merely scratches the surface of what Istio’s routing rules can do. For instance, the virtual service definition could include a regular expression match against a user’s cookie to implement source routing rules, among others.
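A cookie-based rule of this kind might look like the sketch below. The http section uses Istio's header match with a regex. The cookie name canary=always and the pattern itself are invented for this example, and the subset_for helper merely approximates the match locally with Python's re module, whose regex dialect is not identical to the one Istio uses.

```python
import re

HOST = "users.prod.svc.cluster.local"

# Hypothetical opt-in cookie: requests carrying "canary=always" go to v2.
COOKIE_PATTERN = r"^(.*;\s*)?canary=always(;.*)?$"

# "http" section of a virtual service: the first matching entry wins,
# the last entry without a match clause acts as the default route.
http_routes = [
    {
        "match": [{"headers": {"cookie": {"regex": COOKIE_PATTERN}}}],
        "route": [{"destination": {"host": HOST, "subset": "v2"}}],
    },
    {
        "route": [{"destination": {"host": HOST, "subset": "v1"}}],
    },
]


def subset_for(cookie: str) -> str:
    """Approximate locally which subset a request with this Cookie header gets."""
    return "v2" if re.fullmatch(COOKIE_PATTERN, cookie) else "v1"


print(subset_for("session=abc; canary=always"))  # prints v2
print(subset_for("session=abc"))                 # prints v1
```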

Read part two of this post here.

Originally published at https://glasnostic.com on April 30, 2019.