Deploying software today can be challenging, given the complexity of systems built on a microservices architecture. Fortunately, a few deployment strategies let you release with much more confidence.
In this article, I’ll show you how to use the canary deployment pattern. Despite the heartbreaking stories that made this little bird notorious for being a type of "informant", in software engineering the canary symbolizes substantial risk reduction. Let's find out why!
What are canary deployments?
They are a way to release updates to a gradually increasing percentage of the total user base. The appeal is that when new code isn’t really ready for production, because of bugs that went undetected during testing, performance issues, or other factors, it only affects a small percentage of users before the problem is detected and the release is rolled back.
There are certainly many ways to do canary releases, especially since there are many ways to deploy and run applications. Since we’re heavy Kubernetes users at Wildlife, I’ll show you two approaches for software running on Kubernetes. Let's dig in!
A simplistic approach
A typical request flow
Typically, to make an HTTP/HTTPS API available to external users we set up an ingress controller and define routing rules to send traffic to specific services.
So, the bare minimum we need to match the diagram above is three Kubernetes resources: an ingress, a service, and a deployment.
All resources are named “your-api”, and the ingress will route all requests for “api.example.com” to the “http” port in the service also called “your-api”.
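A minimal sketch of the ingress and service described here could look like the manifests below. The ingress class ("nginx") and the service port (80) are assumptions for illustration, not values taken from the article.

```yaml
# Ingress: routes every request for api.example.com to the "http" port
# of the service named "your-api".
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: your-api
spec:
  ingressClassName: nginx   # assumption: an NGINX ingress controller is installed
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: your-api
                port:
                  name: http
---
# Service: selects pods by the "app" label and exposes the named "http" port.
apiVersion: v1
kind: Service
metadata:
  name: your-api
spec:
  selector:
    app: your-api
  ports:
    - name: http
      port: 80           # assumption: any service port works
      targetPort: http   # the container port named "http" in the pods
```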
The only thing the deployment needs so that the service can point to the expected pods is the “app” label.
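A deployment matching that description might look like the sketch below. The image, replica count, and container port are hypothetical placeholders; the part that matters is the “app” label on the pod template.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-api
spec:
  replicas: 3                # assumption: any fixed number of replicas
  selector:
    matchLabels:
      app: your-api
  template:
    metadata:
      labels:
        app: your-api        # the label the service selector matches
    spec:
      containers:
        - name: your-api
          image: registry.example.com/your-api:1.0.0   # hypothetical image
          ports:
            - name: http
              containerPort: 8080   # assumption: the port the app listens on
```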
With manifests like these, we get a backend running with a fixed number of replicas. A natural next step is adding a Horizontal Pod Autoscaler, so that the replica count follows the load.
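As a rough illustration, a Horizontal Pod Autoscaler for the “your-api” deployment could be declared as follows. The replica bounds and the 70% CPU target are assumed values, chosen only to make the example concrete.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-api
  minReplicas: 2             # assumption
  maxReplicas: 10            # assumption
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumption: scale out above 70% average CPU
```

CPU-based scaling like this relies on the containers declaring CPU requests and on a metrics source such as metrics-server being available in the cluster.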
When we deploy using the flow shown above, we put ourselves in an all-or-nothing situation: we only have a mix of versions while Kubernetes is rolling out the new pods, and that window can be very short.
So, if we find a problem in production after that transitional period, we’ll have to roll back after identifying the issue, having potentially affected 100% of the live user base.
The canary ingress
To prevent exactly this situation when shipping code changes, we can expose a new version to a steadily increasing percentage of requests until it serves all of them.
We’ll call the flow above “stable” and create a new one for “canary” releases. They’ll be incredibly similar but point to different resources in the chain. In our case, “stable” will keep naming all resources “your-api”, and “canary” will call them “your-api-canary”.
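Following that naming, the canary ingress could look like the sketch below. It assumes the NGINX ingress controller, an ingress class named "nginx", and a “your-api-canary” service and deployment that mirror the stable ones under the canary name and label.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: your-api-canary
  annotations:
    # Marks this ingress as the canary counterpart of the stable one.
    nginx.ingress.kubernetes.io/canary: "true"
    # Percentage of requests sent to the canary backend.
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  ingressClassName: nginx    # assumption: same class as the stable ingress
  rules:
    - host: api.example.com  # must match the host of the stable ingress
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: your-api-canary
                port:
                  name: http
```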
The differences between the canary ingress manifest and the stable one are in the annotations. In this case, we’re using the NGINX ingress controller, and the annotations tell it that this ingress is a canary and that 5% of the requests should go to the service “your-api-canary”.
That’s all that’s needed to route a specific percentage of requests to another Kubernetes service, which can in turn point to another deployment running, for example, a different version of the same application.
You’ll hardly be satisfied with just that. Here are some questions one might ask:
- How do we automatically shift percentages until all traffic reaches the new version?
- How do we automatically cancel a canary release if it’s not production-ready?
One way to tackle both issues is by creating new stages in your continuous delivery pipelines. A first step can be defining targets like 5%, 25%, 50%, and 100%, and manually triggering some or all of them.
Once canary receives 100% of the traffic, the stable flow should be updated to the canary version and canary set back to 0%.
There’s also Flagger, a Kubernetes operator that helps with canary deployments.
Instead of creating the manifests we’ve already seen, we could achieve the same result by installing Flagger on the cluster and creating a “canary” manifest.
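As a hedged sketch, such a manifest might look like the one below. It assumes Flagger is installed with the NGINX provider, that Prometheus is collecting the ingress controller's metrics, and that the “your-api” deployment and ingress already exist. The ports, interval, threshold, and weights are illustrative values only.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: your-api
spec:
  provider: nginx
  # The deployment Flagger watches for new versions.
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-api
  # The stable ingress; Flagger manages the canary counterpart.
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: your-api
  service:
    port: 80            # assumption
    targetPort: 8080    # assumption: the port the app listens on
  analysis:
    interval: 1m        # how often the analysis runs
    threshold: 5        # failed checks tolerated before rolling back
    stepWeight: 5       # traffic percentage added at each step
    maxWeight: 50       # canary weight at which the release is promoted
    metrics:
      - name: request-success-rate   # built-in Flagger metric
        thresholdRange:
          min: 99       # require at least 99% successful requests
        interval: 1m
```

With a resource like this, the percentage shift and the rollback decision move out of the delivery pipeline and into the operator.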
Here’s the official documentation on how to use Flagger with NGINX. The cool thing about using this operator instead of rolling your own, as in the previous section, is that it has several integrations out of the box. You can set up, for example:
- Acceptance tests;
- Load tests;
- Prometheus metrics to drive deployment decisions;
- A strategy for automatic percentage shifts;
- A rollback strategy.
I hope that at this point you’re aware of how simple it is to avoid all-or-nothing situations when releasing new code to production, whether you use a community-supported operator or go the “DIY” route.