Practical Canary Releases in Kubernetes with Argo Rollouts

Sari Alalem
Sep 8, 2020
Canary Releases in Soluto

At Soluto, our microservices-based infrastructure, combined with all the CI/CD tools, allows us to move fast, with multiple releases a day delivering features and fixes to our customers.

In this storm of releases, issues are sometimes found in production only after a release has gone out. When this happens, we want to protect our customers from being exposed to these issues, and at the same time we want to know about them early on. This is where Argo Rollouts comes into the picture, with its support for canary release strategies.

Disclaimer

There are a lot of guides out there for setting up a canary release process in your K8s cluster, and this post is not a "how to": it assumes you already know how Argo Rollouts works. Instead, it is about the choices, the issues you might face, and the optimizations you will need to make your canary release process efficient. You can read the whole thing for a better understanding, or skip to the summary and lessons-learned sections for a quick read.

What did we want from canary releases?

Our mission when we decided to implement canary was:

“On each new deployment, keep the old version running and give time for the new version to prove it works in production with minimum exposure to customers”

This means that we have to:

- keep the old (stable) version running while the new one is being released;
- route only a small share of production traffic to the new version;
- analyse the new version's behaviour in production for a while;
- roll the new version back if problems are detected, and promote it otherwise.

Argo Rollouts

There are solutions out there for doing canary releases, like Flagger (which is planned to merge with Argo Rollouts, yay!), but our choice was Argo, not least because it can split traffic without a service mesh. If you don't have a mesh provider (such as Istio), Argo Rollouts splits traffic between versions by creating a new replica set that uses the same service object, and the service still splits traffic evenly across all pods, new and old. In other words, controlling the number of canary pods controls the traffic percentage.

Traffic splitting by pod count

Not ideal, but it does the job, and if you do have Istio, you can control traffic more precisely, as mentioned here.
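To make this concrete, here is a minimal sketch of a Rollout that splits traffic by pod count. With no traffic router configured, setWeight is approximated by scaling the canary and stable replica sets, so the shared service keeps routing evenly across all pods. The name, replica count and durations are illustrative, and the selector and pod template are omitted for brevity:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service              # illustrative name
spec:
  replicas: 10
  # selector and pod template omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 10            # with 10 replicas and no mesh, roughly 1 pod runs the new version
        - pause: {duration: 30m}   # the shared service keeps routing evenly across all pods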

The implementation

The canary lifecycle

Thinking about our mission, we started shaping the desired behaviour of our canary process:

- every new deployment spins up a canary version that runs alongside the stable one;
- the canary receives only a small percentage of production traffic;
- an automated analysis checks the canary's metrics against our success criteria, in several steps;
- if any step fails, the rollout is aborted and the stable version keeps serving all traffic;
- if all steps pass, the canary is promoted and becomes the new stable version.

Writing a real and practical canary analysis

The combination of Rollout and AnalysisTemplate in Argo Rollouts is enough to give us the flexibility to configure a canary release strategy like the one above, in terms of steps, traffic control and analysis. But what is a good strategy? There is no one correct answer; it depends on the use case. Below is how we shaped a good fit for ours. It may also be a good fit for you, or at least an inspiration.

Starting from scratch, we asked ourselves some questions:

What is a “good traffic percentage” to give a canary?

If it's too low, the canary won't get enough traffic and the analysis will be unreliable; too much traffic means more customers are affected if an issue happens. We landed on a canary handling between 5% and 10% of total traffic.

How long should we run the canary analysis?

For the statistical analysis to be reliable, we need at least 50 data points, which means pausing long enough for the monitoring system (Prometheus, DataDog, etc.) to collect metrics at least 50 times. The collection interval varies between setups; in our case it was tuned to every 15 seconds, so the minimum pause time we could have is around 12.5 minutes with active users generating traffic.
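As a rough sketch, an AnalysisTemplate along these lines would collect those 50 data points at that 15-second interval. The template name, Prometheus address, success threshold and failure limit here are placeholders for your own setup:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate                  # placeholder name
spec:
  metrics:
    - name: success-rate
      interval: 15s                   # matches the metrics collection interval
      count: 50                       # at least 50 data points, about 12.5 minutes
      successCondition: result[0] >= 0.95
      failureLimit: 3                 # tolerate a few noisy measurements
      provider:
        prometheus:
          address: http://prometheus.example:9090   # placeholder address
          query: |
            sum(increase(http_request_duration_seconds_count{status_code=~"2.*"}[15m]))
            /
            sum(increase(http_request_duration_seconds_count[15m]))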

How many steps should we include?

There's no correct answer to this question, but our developers didn't want to wait long to find out if something was wrong, and we needed to protect our customers:

First Step (Fast Fail): If there's an obvious issue in the new release, developers did not want to wait for the whole analysis to finish in order to see it. We wanted something that can tell us "Hey! Your code doesn't work!" very quickly, so the first step is a short one: about 15 minutes, just long enough for the analysis to collect its minimum number of data points and catch obvious failures.

Steps 2 and 3 are identical: 10% traffic with 30 minutes duration

This seemed good: a total of 1.25 hours of canary run time, so we didn't have to wait long… But as it turned out, it wasn't good enough. See why in the "lessons learned" section.
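Expressed as the strategy section of the Rollout, the three steps could look roughly like this, assuming a background analysis based on a template like the one above. The template name, weight and the first step's 15-minute duration are illustrative:

strategy:
  canary:
    analysis:                        # background analysis, runs for the whole canary
      templates:
        - templateName: success-rate
    steps:
      - setWeight: 10                # roughly 10% of traffic, via pod counts
      - pause: {duration: 15m}       # step 1: fast fail on obvious breakage
      - pause: {duration: 30m}       # step 2
      - pause: {duration: 30m}       # step 3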

How do I know what’s happening?

When a canary rollout is running, we need to see what's happening. You can use any of these:

- the Argo Rollouts kubectl plugin: kubectl argo rollouts get rollout <name> --watch shows the current step, weights and analysis status live;
- plain kubectl: describing the Rollout and its AnalysisRun objects shows which step the rollout is on and the measurements taken so far;
- your monitoring dashboards (Prometheus/Grafana, DataDog, etc.), which show the same metrics the analysis queries.

Which metrics to measure?

We have two types of services:

- APIs that serve http requests coming from clients;
- workers that consume messages from a queue.

In both of them, we need to measure:

- the success rate of operations (requests or messages handled without errors);
- how long operations take (latency).

To unify the metrics as much as possible between an API and a worker, we took advantage of the fact that our workers use the sidecar pattern: a sidecar consumes from the queue and calls the main service over http:

Sidecar pattern and http based metrics

This converted the main service into an http-based service, where we can use the same metrics as for an API… good for us.
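In pod terms, the worker looks something like the sketch below, where the container names, image names, port and endpoint are hypothetical:

# worker pod template: the sidecar consumes from the queue and forwards
# each message to the main container as an http request on localhost
containers:
  - name: worker                     # the main service, now a plain http server
    image: registry.example/worker:1.2.3
    ports:
      - containerPort: 8080
  - name: queue-consumer             # sidecar: pulls messages and POSTs them to the worker
    image: registry.example/queue-consumer:1.0.0
    env:
      - name: TARGET_URL
        value: http://localhost:8080/handle   # hypothetical endpoint

Because the worker now answers http requests, the same http_request_duration_seconds metrics used for APIs apply to it unchanged.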

Running everything

We let the developers use it for a few months, and we observed. Canary saved the day more than a few times, but failed to save it in some situations… Yes, the short running duration was the main culprit, along with some other factors explained below.

Lessons learned from using canaries

Canaries need traffic

Oh yes: more traffic means more operations, and more operations mean a more accurate analysis. In our case, our customers are in a different timezone from the one our releases are scheduled in, so when a canary runs it doesn't see the traffic volume it normally would. We learned the hard way that issues are far more likely to be discovered when the analysis runs while our customers are active.

To overcome this, the canary analysis should span the high-traffic hours, so an analysis that runs for more than 24 hours is much more likely to catch issues safely.
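In Rollout terms that mostly means longer pauses; for example, a step that spans a full day. The weight and duration here are just an illustration of the idea:

steps:
  - setWeight: 10
  - pause: {duration: 24h}   # long enough to cover the hours when customers are most active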

Provide the ability to skip canary analysis

When there are issues in production that need an urgent quick fix, we can't afford to wait 24 hours for the fix to be released, so in the CD pipelines we added the option to skip the canary analysis when releasing. This option proved to be a good addition for such cases.

Weighing operations in the analysis

For example, if we want a success rate of at least 95%, this Prometheus query calculates it:

sum(increase(http_request_duration_seconds_count{status_code=~"2.*"}[15m])) / sum(increase(http_request_duration_seconds_count[15m]))

As you can see, it sums up requests from all endpoints, but some endpoints are called far more than others. That doesn't make the quieter ones any less important, yet in this query the endpoints that are called less have almost no effect on the final percentage.

This introduces the need to change the query into a weighted sum over the endpoints. For example, considering two endpoints:

(
  sum(increase(http_request_duration_seconds_count{status_code=~"2.*", path="/api/v1/getSomething/"}[15m]))
  /
  sum(increase(http_request_duration_seconds_count{path="/api/v1/getSomething/"}[15m]))
) * 0.2
+
(
  sum(increase(http_request_duration_seconds_count{status_code=~"2.*", path="/api/v1/getAnotherThing/"}[15m]))
  /
  sum(increase(http_request_duration_seconds_count{path="/api/v1/getAnotherThing/"}[15m]))
) * 0.8

Since getAnotherThing gets a lot less traffic than getSomething, we increased its weight in the final result. This is good in terms of numbers, but the resulting query can be a nightmare to maintain, so consider carefully whether you really need it.

To summarise things

After running a canary release process in our Kubernetes clusters using Argo Rollouts for a few months, and after observing and collecting feedback, we reached the following optimisations for a more practical and efficient canary release process:

- give the canary between 5% and 10% of total traffic, and make sure each analysis collects at least 50 data points;
- start with a short fast-fail step, so obvious breakage surfaces within minutes;
- let the whole analysis span more than 24 hours, so it covers the hours when customers are most active;
- keep an option in the CD pipelines to skip the canary analysis for urgent fixes;
- weight the metrics of low-traffic endpoints only where it really matters, since the queries become hard to maintain.

Following this formula should get the most out of your canary release process, and of course you should tweak and modify it to better fit your use cases, since with canaries there's no "one ring to rule them all".

