As a Platform Engineer, I am in close contact with our Product Teams to improve and evolve our Kubernetes (K8s) environment. When reviewing last month’s incoming requests, Service Mesh turned out to be one of the most requested technologies, if not the most requested. Digging deeper into this demand, the main features people are looking for are the ability to easily introduce retry logic, circuit breakers, and, last but not least, canary releases. All three concepts aim to improve the quality of software provided to internal and external customers. The first two, retries and circuit breakers, keep unwanted error messages off the screen by mitigating issues like network fluctuations, whereas the latter focuses on releasing new software versions seamlessly and, in doing so, well tested.
What are canary releases?
Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users, before rolling it out to the entire infrastructure and making it available to everybody.
This approach allows you to test new versions under real-life conditions by promoting them to production. It is fair to say that even extensively tested software that runs smoothly in earlier environments can behave unexpectedly in others. In such cases, a rollback or isolation of the malfunctioning resources can be achieved easily and in a timely manner. Furthermore, it enables DevOps teams to run automated verification and to detect atypical patterns and metrics.
Do I need Service Mesh / Istio to do canary deployments on Kubernetes?
Surprisingly, the answer is no, at least not necessarily. Service Mesh adds many features through its abstraction, and with them complexity, which is worthwhile in selected cases. Using Service Mesh technologies solely to solve canary release requirements does not seem right. So how can you achieve that with a vanilla Kubernetes installation? You could create several deployments with the same label but different versions of your application and thus build a kind of canary deployment. For example, having four replicas of the production release and one replica of the new release running within your namespace would result in roughly 20% of the traffic hitting the canary deployment. This obviously doesn’t work well and results in a heavy waste of resources if you want to start with 1% only, since that would take 99 production replicas for a single canary replica.
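The replica-based approach described above could look roughly like the following sketch. The names, labels (`app`, `track`), and image references are hypothetical and only serve as an illustration; the point is that the Service selects on the shared `app` label, so traffic is spread across all five pods and about one in five requests reaches the canary.

```yaml
# Sketch only: names, labels, and images are assumptions, not taken from the example repository.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-stable
spec:
  replicas: 4                # four pods of the production release
  selector:
    matchLabels:
      app: demo
      track: stable
  template:
    metadata:
      labels:
        app: demo            # shared label, matched by the Service below
        track: stable
    spec:
      containers:
        - name: demo
          image: example/demo:1.0.0   # hypothetical image
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-canary
spec:
  replicas: 1                # one pod of the new release, roughly 20% of traffic
  selector:
    matchLabels:
      app: demo
      track: canary
  template:
    metadata:
      labels:
        app: demo
        track: canary
    spec:
      containers:
        - name: demo
          image: example/demo:1.0.1   # hypothetical image
---
apiVersion: v1
kind: Service
metadata:
  name: demo
spec:
  selector:
    app: demo                # matches both deployments
  ports:
    - port: 80
      targetPort: 8080
```

With this setup the traffic share can only be tuned by changing replica counts, which is exactly why small percentages become expensive.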
But wait, while searching for a bug fix in the change logs of NGINX Ingress Controller I stumbled across “#3341 Add canary annotation and alternative backends for traffic shaping” which ultimately brings exactly that functionality and even more. It just takes a couple of annotations in your ingress resource and you are ready to go.
To be more precise, there are two different kinds of canary releases.
- A weight-based canary release that routes a certain percentage of the traffic to the new release
- A user-based canary release, as I will call it, where a certain request header or cookie value decides which version is being addressed
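In terms of configuration, both variants come down to annotations on the canary ingress. The fragment below is a sketch of how they might be set; the annotation keys are the ones I understand the NGINX Ingress Controller to support, while the header name `X-Canary` and the cookie name `canary` are made-up examples.

```yaml
# Sketch only: header and cookie names are hypothetical.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # Option 1: weight-based, a percentage of the traffic goes to the canary
    nginx.ingress.kubernetes.io/canary-weight: "20"
    # Option 2: user-based, a request header or cookie selects the canary
    # nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    # nginx.ingress.kubernetes.io/canary-by-cookie: "canary"
```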
Getting Started with canary rollouts on K8s
In the following section, I will explain the test case created to try the new canary feature of NGINX Ingress Controller. It is a simulation of a very basic weight-based canary release (Option 1). I also created a git repository that contains all resources that you might require to reproduce the case. You will find further instructions and a list of prerequisites there as well.
The app used for the scenario is a simple Go HTTP server with three handlers.
- /version returns the version of the app that actually processed the request, to differentiate between both releases, production and canary.
- /metrics shows the number of calls that have been processed by the container on path /version.
- /reset, as the name suggests, resets the request counter to zero.
1. Create the status quo
Everything starts with a stable version running in production. The example follows the semantic versioning approach, with the current stable version 1.0.0 running in the namespace “demo-prod”. As there is no canary release deployed to the cluster yet, the canary weight X equals “0”, resulting in 100% of the traffic being served by the production release.
First of all, deploy the namespace “demo-prod” as it is required for the rest of the resources. Continue with creating the deployment, service, and ingress for the production environment. At this point, there is nothing special about the ingress resource, so the status quo can be simulated with the following ingress manifest:
- host: canary.example.com
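A minimal sketch of what the production namespace, service, and ingress could look like is shown below. Only the host, the namespace, and port 8080 come from this article; the resource names, the `app` label, the `ingressClassName`, and the use of the `networking.k8s.io/v1` API are assumptions, so adjust them to your cluster and to the manifests in the repository.

```yaml
# Sketch only: resource names, labels, and ingress class are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: demo-prod
---
apiVersion: v1
kind: Service
metadata:
  name: demo-prod
  namespace: demo-prod
spec:
  selector:
    app: demo-prod           # hypothetical pod label
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo                 # the canary ingress will reuse this name
  namespace: demo-prod
spec:
  ingressClassName: nginx    # assumes the controller's class is called "nginx"
  rules:
    - host: canary.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-prod
                port:
                  number: 8080
```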
With Apache Benchmark (ab) you can easily send a predefined number of requests to prove that all requests are being served by the current release.
$ ab -n 1000 -c 100 -s 60 -m GET http://canary.example.com/version
If everything ran smoothly, the /metrics endpoint should show the same number of calls that have been sent to the endpoint. You can use jq to process the output of your curl command:
$ curl -s "http://canary.example.com/metrics" | jq '.calls'
As expected, the request count incremented to a total value of 1000.
2. Roll out the canary release
Now it is time to do the actual canary deployment. For this, a second namespace called “demo-canary” is mandatory. Why is that? Eventually, we will create a second ingress resource with the exact same name but including the canary annotations. If we deployed it to the same namespace, it would change the existing resource, which is not desired. Once the namespace has been created, we can push the deployment with the new software version 1.0.1, the service, and the ingress to the cluster. In the sample ingress below we define X="20" and thus route 80% of the workload to the production release, which is considered to be stable, and the remaining 20% to our freshly deployed canary release.
To do so, we have to add two annotations. The first one, nginx.ingress.kubernetes.io/canary: "true", enables the canary functionality for the ingress. The second one, nginx.ingress.kubernetes.io/canary-weight: "20", defines the share of traffic that we want to be served by the canary deployment.
- host: canary.example.com
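As a sketch, a canary ingress carrying these two annotations might look like this. The annotations, host, and namespace are taken from the text above; the ingress name `demo`, the service name `demo-canary`, the `ingressClassName`, and the `networking.k8s.io/v1` API version are assumptions for illustration.

```yaml
# Sketch only: resource names and ingress class are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo                       # same name as the production ingress
  namespace: demo-canary           # but deployed to the canary namespace
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"        # enable canary handling
    nginx.ingress.kubernetes.io/canary-weight: "20"   # 20% of traffic, quoted
spec:
  ingressClassName: nginx
  rules:
    - host: canary.example.com     # same host as the production ingress
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-canary  # hypothetical service in front of version 1.0.1
                port:
                  number: 8080
```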
Let’s see if it works. Use Apache Benchmark as described above to generate load against the /version endpoint. One way to verify that NGINX splits the traffic according to the configuration is to port-forward to both pods, canary and production, and curl their /metrics endpoints.
$ kubectl -n demo-prod port-forward demo-prod-6cc6dfd7c6-ttvkm 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
$ curl localhost:8080/metrics -s | jq '.calls'
$ kubectl -n demo-canary port-forward demo-canary-657998b785-xzvb6 8081:8080
Forwarding from 127.0.0.1:8081 -> 8080
Forwarding from [::1]:8081 -> 8080
$ curl localhost:8081/metrics -s | jq '.calls'
Looking at these figures, the weight split involves a slight deviation of roughly 1% compared to the initial 80/20 split. For me, tiny enough to call it a success!
3. Take action on the results
Of course, the final step is to analyse the application metrics, usage patterns, and feedback from the test users to decide whether to increase the canary weight, switch completely to the new software release, or even remove the canary deployment.
If you are looking for a way to do canary releases in your Kubernetes cluster, I would definitely recommend playing with this feature of NGINX Ingress Controller. However, when it comes to more complex scenarios where teams want to do multiple canary rollouts for one application at the same time, NGINX IC will probably show its limits sooner rather than later.
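Switching completely to the new release could, for example, mean rolling the production deployment forward to version 1.0.1 and removing the canary resources afterwards. The following is only a sketch under assumptions: the deployment name is derived from the pod name used earlier, while the image reference and the `app` label are hypothetical.

```yaml
# Sketch only: image reference and labels are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-prod
  namespace: demo-prod
spec:
  selector:
    matchLabels:
      app: demo-prod
  template:
    metadata:
      labels:
        app: demo-prod
    spec:
      containers:
        - name: demo
          image: example/canary-demo:1.0.1   # promoted from 1.0.0 to the canary version
          ports:
            - containerPort: 8080
```

Increasing the canary share instead only requires raising the value of the canary-weight annotation on the canary ingress.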
All that remains to outline is a summary of
…the problems that I ran into during my exercise:
- Even though the documentation says that “canary-weight” expects a number, it only works if you wrap the value in quotes.
- It is important that the canary ingress has the exact same name as the production ingress to ensure that the controller puts the configuration in the right section of the NGINX configuration. That means that you will need a second namespace, as it is not possible to create a second resource with the same name in a single namespace.
- The canary-weight annotation might result in a slight deviation from the percentage that you put as value.
…the known limitations mentioned in the docs:
- Only one canary per ingress is supported by the NGINX IC
- Non-canary annotations on the canary ingress might be ignored
In this article, I did not go into the details of the user-based canary option. Just leave a message if you want me to write another post about that.