Moving canary deployments from AWS ELBs to Kubernetes with Traefik

Tasdik Rahman
25 Oct 2018

The canary deployment pattern is very similar to blue-green deployments: you deploy a certain version of your application to a subset of your application servers. Once you have verified that everything works, you route a certain percentage of your users to those servers and gradually keep increasing the traffic until a full rollout is achieved.

One of the many reasons to do this is to test a certain feature out with a percentage of the users of your service. This can be further extended to enabling a feature only for users of a particular demographic.

Canary deployments on AWS

In our use case at Razorpay, canary was used for one of the APIs we served and gave out for consumption. Before we were on Kubernetes, the method for canary deployments was to have two separate Auto Scaling Groups: a primary ASG serving the particular API, and another ASG with a smaller desired count, which we'll call the canary ASG for now.

Each ASG had its own individual ELB attached to it, with both ELBs sitting in our public subnet. Both ELBs would have a CNAME DNS record pointing to the public FQDN given out by AWS.

For simplicity of the diagram, I have not shown the canary and main service ASGs spanning 2 separate AZs, but that is the recommended way to go: in case of an AZ failure, the ELB (with cross-zone load balancing enabled) can still route to the instances in the other AZ.

The canary ASG would be attached to both the

  • main ELB for the service
  • canary’s separate ELB

The capacity (min and desired) of the main service's ASG is higher than that of the canary ASG, and the canary ASG's max is set equal to its desired. The reasoning for this is that the canary can never scale out, so any regression wouldn't propagate to a larger number of users if autoscaling kicks in.
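Below is a rough CloudFormation-style sketch of the two ASGs. The resource names, capacities, launch configurations and subnet references are illustrative placeholders, not our actual values; the point is the max == desired cap on the canary and the dual ELB attachment.

```yaml
# Hypothetical sketch; all names, sizes and references are placeholders.
MainServiceASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "9"
    DesiredCapacity: "9"
    MaxSize: "18"                          # the main ASG is free to scale out
    LaunchConfigurationName: !Ref MainServiceLaunchConfig
    VPCZoneIdentifier: [!Ref PublicSubnetA, !Ref PublicSubnetB]
    LoadBalancerNames:
      - !Ref MainServiceELB                # attached only to the main ELB

CanaryASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "1"
    DesiredCapacity: "1"
    MaxSize: "1"                           # max == desired, so the canary never scales out
    LaunchConfigurationName: !Ref CanaryLaunchConfig
    VPCZoneIdentifier: [!Ref PublicSubnetA, !Ref PublicSubnetB]
    LoadBalancerNames:
      - !Ref MainServiceELB                # gets its share of regular traffic
      - !Ref CanaryELB                     # plus its own dedicated canary ELB
```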

Since our ELB is an Internet-facing load balancer, it gets public IP addresses (one for each AZ). The DNS name of an Internet-facing load balancer is publicly resolvable to the public IP addresses of the nodes of the ELB. Therefore, Internet-facing load balancers can route requests from clients over the Internet.

The load balancer node that receives the request selects an attached instance using the round robin routing algorithm for TCP listeners and the least outstanding requests routing algorithm for HTTP and HTTPS listeners.

So, for example, let's say the main ASG has 9 instances (desired: 9) and the canary ASG has 1 instance attached to it. Since all 10 instances are attached to the main ELB, each one receives roughly 1/10th of the traffic, so the canary instance serves about 10% of requests.

Hence, the canary instances would also get traffic in a round robin fashion.

Replicating the same in Kubernetes

Traefik runs as our L7 load balancer, i.e. as our ingress controller inside Kubernetes, routing traffic to the Kubernetes services of the various microservices running inside our cluster.

Traefik would be running with hostNetwork: true as a DaemonSet.

These pods use the host network directly and not the “pod network” (the term “pod network” is a little misleading, as there is no such thing; it basically comes down to routing network packets and namespaces). So we can bind Traefik to port 80 on the host interface. That also means, of course, that no other pod and no other service on those worker nodes can use this port. But that's exactly what we want here, as Traefik is basically our “external” load balancer for our “internal” services, our tunnel to the rest of the Internet, so to say.

Below is a sample configuration which you can use to deploy Traefik.
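Since the original gist is not reproduced here, what follows is a minimal sketch of such a DaemonSet for Traefik 1.x. The namespace, image tag, edge-node label and the ServiceAccount (whose RBAC is assumed to already exist) are assumptions, not the exact manifest we ran.

```yaml
# Minimal Traefik 1.x DaemonSet sketch on the host network (values are assumptions).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: traefik-ingress-controller
  template:
    metadata:
      labels:
        app: traefik-ingress-controller
    spec:
      hostNetwork: true                      # bind directly on the node's interface
      nodeSelector:
        node-role.kubernetes.io/edge: "true" # assumed label on the edge nodes
      serviceAccountName: traefik-ingress-controller  # RBAC assumed to exist
      containers:
        - name: traefik
          image: traefik:1.7
          args:
            - --api
            - --kubernetes                   # watch Ingress objects in the cluster
            - --logLevel=INFO
          ports:
            - name: http
              containerPort: 80              # Traefik's default http entrypoint on :80
```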

The diagram above shows two ASGs for edge nodes, which host the Traefik DaemonSet pods.

There would be a CNAME DNS record for myapp.example.com pointing to the public FQDN of the common ELB to which both edge ASGs are attached. Traffic would be routed to the attached edge VMs in a round robin fashion. In addition, the security groups attached to the ASGs can be configured to only allow TCP connections on port 80 (everything else is blocked automatically, since the default is deny).
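As a rough illustration of that security group rule, a hypothetical CloudFormation snippet could look like this; the VPC and the ELB's security group references are placeholders.

```yaml
# Hypothetical security group for the edge nodes; references are placeholders.
EdgeNodeSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Edge nodes, allow inbound TCP 80 only
    VpcId: !Ref EdgeVpc                                   # assumed VPC reference
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 80                                      # only TCP 80 is allowed in;
        ToPort: 80                                        # everything else falls to the default deny
        SourceSecurityGroupId: !Ref EdgeElbSecurityGroup  # assumed ELB security group
```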

Similarly, there would be a DNS record for the canary.
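Expressed as hypothetical Route 53 records, the two CNAMEs could look roughly like this; the hosted zone, the canary hostname (myapp-canary.example.com) and the ELB FQDN are placeholders.

```yaml
# Hypothetical Route 53 records; zone, hostnames and ELB FQDN are placeholders.
MyAppRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: myapp.example.com.
    Type: CNAME
    TTL: "300"
    ResourceRecords:
      - edge-elb-1234567890.ap-south-1.elb.amazonaws.com   # the common edge ELB

MyAppCanaryRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    HostedZoneName: example.com.
    Name: myapp-canary.example.com.                        # assumed canary hostname
    Type: CNAME
    TTL: "300"
    ResourceRecords:
      - edge-elb-1234567890.ap-south-1.elb.amazonaws.com   # same ELB; Traefik routes by hostname
```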

Traefik would be listening on port 80 on the host's network for incoming requests, and there would be an Ingress object in the app's namespace defining which service the traffic is routed to, based on the hostname.

We can have an Ingress object like the following in the namespace myapp, for the services

  • myapp
  • myapp-canary
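Here is a sketch of what that Ingress could look like, using the Traefik 1.x service-weights annotation described in the traffic-splitting guide linked further below; the service port is an assumption.

```yaml
# Weighted Ingress sketch for Traefik 1.x; servicePort is assumed to be 80.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myapp
  namespace: myapp
  annotations:
    kubernetes.io/ingress.class: traefik
    traefik.ingress.kubernetes.io/service-weights: |
      myapp: 90%
      myapp-canary: 10%
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: myapp          # gets 90% of the traffic
              servicePort: 80
          - path: /
            backend:
              serviceName: myapp-canary   # gets 10% of the traffic
              servicePort: 80
```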

This way, Traefik would route the traffic coming to myapp.example.com to the services

  • myapp : 90% of the traffic would be routed here.
  • myapp-canary: 10% of the traffic would be routed here.

You can have multiple services, each with its own weight; rather than repeat what has already been written, see https://docs.traefik.io/user-guide/kubernetes/#traffic-splitting

Another thing to note here is that the canary service you are doing weighted routing to should be in the same namespace as the other service (which is myapp in this case). This was asked on their issue tracker and they pointed out the same: https://github.com/containous/traefik/issues/4043.

So you have seen how we can do canary deployments on AWS using traditional ELBs and ASGs, as well as on Kubernetes with an ingress controller.

References

The network diagrams were made using draw.io.

Originally published at tasdikrahman.me on October 25, 2018.


Tasdik Rahman

Systems team @gojek, Former SRE @ Razorpay(YC W15). Contributor to @ovirt. Past intern @ciscosecurity @wingify. https://tasdikrahman.me/. Opinions my own.