Practical Canary Releases in Kubernetes with Argo Rollouts

Sari Alalem · Soluto by asurion · Sep 8, 2020 · 9 min read
Canary Releases in Soluto

At Soluto, our microservices-based infrastructure, combined with our CI/CD tooling, allows us to move fast, shipping multiple releases a day to deliver features and fixes to our customers.

In this storm of releases, issues are sometimes found in production only after a release has gone out. When this happens, we want to protect our customers from being exposed to these issues, and at the same time we want to know about them early on. This is where Argo Rollouts comes into the picture, with its support for canary release strategies.

Disclaimer

There are a lot of guides out there for setting up a canary release process in your K8s cluster, and this post is not a “how to”: it assumes you already know how Argo Rollouts works. Instead, it's about the choices, the issues you might face, and the optimizations you will need to make your canary release process efficient. You can read the whole thing for a better understanding, or skip to the summary and lessons-learned sections for a quick read.

What did we want from canary releases?

Our mission when we decided to implement canary was:

“On each new deployment, keep the old version running and give time for the new version to prove it works in production with minimum exposure to customers”

This means that we have to:

  • Run two versions of the service at the same time in the production K8s cluster
  • Split production traffic between the two versions in a controlled manner
  • Automatically analyse how well the new version is performing
  • Automatically replace versions or rollback based on the result of the analysis

Argo Rollouts

There are solutions out there for doing canary releases, like Flagger (which is planned to merge with Argo Rollouts, yay!), but our choice was Argo because:

  • Support for traffic splitting without a mesh provider, even for internal traffic that doesn't go through an Ingress
  • We already had Argo CD, which integrates easily with Argo Rollouts (more on that later)

In practice, if you don't have a mesh provider (such as Istio), Argo Rollouts splits traffic between versions by creating a new ReplicaSet that uses the same Service object; the Service still splits traffic evenly across all pods (new and old). In other words, controlling the number of pods controls the traffic percentage.

Traffic splitting by pods count

Not ideal, but it does the job. And if you do have Istio, you can control traffic more precisely, as mentioned here.
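To make that concrete, here is a minimal sketch (with illustrative names) of why pod counts control the split: the canary and stable ReplicaSets created by the Rollout both match the same Service selector, so the Service load-balances evenly across every pod.

apiVersion: v1
kind: Service
metadata:
  name: my-service              # hypothetical service name
spec:
  selector:
    app: my-service             # matches pods from BOTH the stable and canary ReplicaSets
  ports:
    - port: 80
      targetPort: 8080

For example, 9 stable pods plus 1 canary pod behind this Service means roughly 10% of requests reach the new version; Argo Rollouts approximates the requested canary weight by adjusting the sizes of the two ReplicaSets.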

The implementation

The canary lifecycle

When thinking about our mission, we started shaping the desired behaviour of our canary process (a sketch of how these steps map to a Rollout follows the list):

  1. We have version N already deployed on production
  2. Version N+1 is deployed as a canary rollout
  3. Now both versions N and N+1 exist, and N+1 is assigned a percentage of the traffic
  4. Wait for a while
  5. Measure how well N+1 is performing
  6. If performing well, increase traffic and go to step 4
  7. If not performing well, delete N+1 and move all traffic back to N
  8. Repeat steps 4–7 as needed until all analysis completes successfully, then switch all traffic to N+1 and delete N
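In Argo Rollouts terms, this lifecycle maps onto the steps of a Rollout's canary strategy. Below is a minimal sketch; the weight, duration and template name are placeholders (the values we actually settled on are discussed in the next section):

strategy:
  canary:
    steps:
      - setWeight: 20                    # step 3: give N+1 a share of the traffic (placeholder weight)
      - pause: {duration: 10m}           # step 4: wait for a while (placeholder duration)
      - analysis:                        # step 5: measure how well N+1 is performing
          templates:
            - templateName: success-rate # hypothetical AnalysisTemplate, sketched below
      # ...further setWeight / pause / analysis steps repeat 4-7 as needed

If an analysis fails, the rollout is aborted: the canary ReplicaSet is scaled down and all traffic returns to N (step 7). If every step passes, the canary is promoted to stable and the old version is scaled down (step 8).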

Writing a real and practical canary analysis

The combination of Rollout and AnalysisTemplate in Argo Rollouts is enough to give us the flexibility to configure a canary release strategy like the one above, in terms of steps, traffic control and analysis. But what is a good strategy? There's no single correct answer; it depends on the use case. Below is how we shaped a strategy that fits our use case; it may also fit yours, or at least serve as inspiration.
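As a rough sketch of what such a template looks like (the Prometheus address, interval, count and failure limit here are assumptions, and the query is the success-rate query discussed later in this post):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate                     # hypothetical name, referenced from the Rollout's analysis steps
spec:
  metrics:
    - name: success-rate
      interval: 1m                       # how often to take a measurement
      count: 5                           # how many measurements before the analysis completes
      successCondition: result[0] >= 0.95
      failureLimit: 3                    # how many failed measurements abort the canary
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed Prometheus address
          query: |
            sum(increase(http_request_duration_seconds_count{status_code=~"2.*"}[15m]))
            /
            sum(increase(http_request_duration_seconds_count[15m]))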

Starting from scratch, we asked ourselves some questions:

What is a “good traffic percentage” to give a canary?

If it’s too low, the canary will not get enough traffic, hence an unreliable analysis, and too much traffic will affect more customers if an issue happens. In other words, a canary should handle between 5% to 10% of your total traffic.

How long should we run the canary analysis?

For the statistical analysis to be reliable, we need at least 50 data points, meaning that we need to pause for a duration that allows the monitoring system (Prometheus, DataDog, etc.) to collect metrics at least 50 times. The time can vary between setups; in our case the monitoring system collects metrics every 15 seconds, so the minimum pause time is around 12.5 minutes (50 × 15 s = 750 s), assuming active users are generating traffic.

How many steps should we include?

There’s no correct answer to this question, but our developers didn’t want to wait a lot to know if something is wrong, and we needed to protect our customers:

First Step (Fast Fail): If there's an obvious issue in the new release, developers did not want to wait for the whole analysis to finish in order to see it. We wanted something that could tell us “Hey! Your code doesn't work!” very quickly, so this is the analysis for the first step:

  • 5% traffic
  • 13-minute pause (again, 50 data points)
  • Measure the metrics, then either kill the canary or move to the next step

Steps 2 and 3 are identical: 10% traffic with a 30-minute pause each
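Put together, a sketch of how those steps could look in the Rollout's canary strategy (reusing the hypothetical success-rate template from earlier):

steps:
  - setWeight: 5
  - pause: {duration: 13m}               # fast-fail window: ~50 samples at a 15-second scrape interval
  - analysis:
      templates:
        - templateName: success-rate     # hypothetical template name
  - setWeight: 10
  - pause: {duration: 30m}               # step 2
  - analysis:
      templates:
        - templateName: success-rate
  - pause: {duration: 30m}               # step 3, identical to step 2
  - analysis:
      templates:
        - templateName: success-rate
  # total pause time: 13m + 30m + 30m ≈ 1.25 hours, plus analysis time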

This seemed good: a total of roughly 1.25 hours of canary run time, so we didn't have to wait long… But as it turned out, it wasn't good enough. See why in the “lessons learned” section.

How do I know what’s happening?

When a canary rollout is running, we need to see what's happening. You can use any of these:

  • Argo Rollouts kubectl plugin — provides a nicely formatted output for the status of a canary
  • Argo CD — if you use Argo CD, you already know it has a nice UI for showing you the status of the application inside your cluster, and when you combine it with rollouts, you can see what’s happening in real time with a beautiful UI.
  • Metrics — yes, Argo Rollouts exposes metrics that you can use to build dashboards showing the status of all your rollouts

Which metrics to measure?

We have two types of services:

  • API (HTTP) based services: expose APIs that are called over HTTP
  • Worker services: consume from a queuing system (Pub/Sub, Kafka, etc.)

In both of them, we need to measure:

  • Success rate: the percentage of successful operations, which should be above 95%
      • In APIs, it's the percentage of 2xx responses out of total responses
      • In a worker, it's more complicated: there are no HTTP calls, so how can we measure success?
  • Latency: how long it takes to complete an operation
      • In APIs, it's the total time from the start of the request until the response is returned
      • In a worker, it's the total time from when a message is consumed until it's processed and acknowledged

To unify the metrics as much as possible between APIs and workers, we took advantage of the fact that our workers use the sidecar pattern: a sidecar consumes from the queue and calls the main service over HTTP:

Sidecar pattern and http based metrics

This turns the main service into an HTTP-based service, where we can use the same metrics as for an API… good for us.
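That also lets both kinds of services share a latency check. Assuming http_request_duration_seconds is exported as a Prometheus histogram, a hypothetical p95 latency metric added to the AnalysisTemplate sketched earlier could look like this (the 300 ms threshold is only an illustration):

- name: latency-p95
  interval: 1m
  count: 5
  successCondition: result[0] <= 0.3                  # assumed threshold: 300 ms
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090      # assumed Prometheus address
      query: |
        histogram_quantile(
          0.95,
          sum(rate(http_request_duration_seconds_bucket[15m])) by (le)
        )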

Running everything

We let the developers use it for a few months, and we observed. Canary saved the day more than a few times, but failed to save it in some situations… Yes, the short running duration was the culprit, along with some other factors explained below.

Lessons learned from using canaries

Canaries need traffic

Oh yes, the more traffic, the more operations, and more operations mean a more accurate analysis. In our case, our customers are in a different timezone from the one our releases are scheduled in, meaning that when a canary runs, it doesn't get the regular traffic volume it should. We learned the hard way that issues are more likely to be discovered if the analysis runs while our customers are most active.

To overcome this, the canary analysis should span the high-traffic times, so an analysis that runs for more than 24 hours is more likely to detect issues safely.

Provide the ability to skip canary analysis

When there are issues in production that need an urgent fix, we can't afford to wait 24 hours for the fix to be released, so we added an option to the CD pipelines to skip the canary analysis when releasing. This has proven to be a good addition in exactly those cases.

Weighing operations in the analysis

For example, if we want a success rate of at least 95%, this Prometheus query calculates it:

sum(increase(http_request_duration_seconds_count{status_code=~"2.*"}[15m])) / sum(increase(http_request_duration_seconds_count[15m]))

As you can see, it sums up all requests from all endpoints, but some endpoints are called more than others. That doesn't make them more or less important, yet in that query, endpoints that receive little traffic have almost no effect on the final percentage.

This introduces the need to change the query into a weighted sum per endpoint. For example, with two endpoints:

(
  sum(
    increase(
      http_request_duration_seconds_count{
        status_code=~"2.*",
        path="/api/v1/getSomething/"
      }[15m]
    )
  )
  /
  sum(
    increase(
      http_request_duration_seconds_count{
        path="/api/v1/getSomething/"
      }[15m]
    )
  )
) * 0.2
+
(
  sum(
    increase(
      http_request_duration_seconds_count{
        status_code=~"2.*",
        path="/api/v1/getAnotherThing/"
      }[15m]
    )
  )
  /
  sum(
    increase(
      http_request_duration_seconds_count{
        path="/api/v1/getAnotherThing/"
      }[15m]
    )
  )
) * 0.8

Since getAnotherThing gets a lot less traffic than getSomething, we increased its impact on the final result. This is good in terms of numbers, but the resulting query can be a nightmare to maintain, so think carefully about whether you really need to do this.

To summarise things

After running a canary release process in our Kubernetes clusters using Argo Rollouts for a few months, and after observing and collecting feedback, we reached the following optimisations for a more practical and efficient canary release process:

  1. Load routed to your canary should be no less than 5% and no more than 10% of the traffic.
  2. Using a sidecar pattern helps unify the metrics used in your services.
  3. Make the canary analysis run for longer periods that span high-traffic times for a better analysis; in our case this means more than 24 hours.
  4. To reduce developers' frustration, the first step of the analysis can be a “Fail Fast” step, where the duration is the minimum time the monitoring system needs to collect 50 data points from the metrics your service exposes (usually 50 minutes, or about 13 minutes if it scrapes every 15 seconds).
  5. The more traffic the better. You should try to release during your peak usage hours; if that's not possible, either simulate more traffic when you release or run the canary for longer periods, as mentioned in bullet #3.
  6. If you can, make your analysis (Prometheus query, Datadog query, or other) a weighted sum over each API/operation instead of a plain total average, since some endpoints are not called that often but are just as important as the others.
  7. Make your developers aware of how their release is doing by creating Grafana dashboards that show the status of their canary and/or by combining them with notifications.
  8. Finally, provide the option for your developers to skip canary analysis on new releases. This is particularly useful for urgent fixes to production bugs that cannot afford to wait days for the release to be rolled out.

It's recommended to follow this formula to get the most out of your canary release process, and of course, you should tweak and modify it to better fit your use cases, since with canaries there's no “one ring to rule them all”.
