Blue-Green or Canary? Why not both?

Our way of Continuous Deployment

Emre Tanriverdi
Trendyol Tech
5 min read · May 18, 2022

Canary icons created by Freepik — Flaticon

As the Homepage & Recommendation Team, we have recently been working on improving our deployment experience at Trendyol.

We wanted to share our story and tell you why and how we did it.

Start of the journey

We were already using Argo as a GitOps solution, and as time passed we also wanted to use its Continuous Deployment features.

As a team, we were already prepared to adopt a Continuous Deployment mindset: we do trunk-based development with a well-defined pipeline, and we had automated tests running before production.

But since we are a high-scale organization with lots of users, we always aim for zero downtime, and we wanted these deployments to be as safe as possible while reducing the amount of manual human interaction.

What choices did we have?

  1. Blue-Green Deployment
  2. Canary Deployment

Let’s talk about what they are and what concerns our team had about them.

Blue-Green Deployment

“Blue-Green Deployment (sometimes referred to as Red-Black) has both the new and old version of the application deployed at the same time. During this time, only the old version of the application will receive production traffic.

This allows the developers to run tests against the new version before switching the live traffic to the new version.”

- Argo Project

To simplify:

Blue Deployment: Current deployment
Green Deployment: Idle deployment

Smoke tests run against the Green Deployment, which receives no production traffic.

→ If tests pass, Green replaces Blue.
→ If tests fail, Blue stays as-is.
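
The decision above can be sketched in a few lines of Python. This is only an illustration of the rule, not how Argo implements it; the function name and the `smoke_tests_pass` callback are hypothetical.

```python
from typing import Callable


def blue_green_deploy(
    blue: str,
    green: str,
    smoke_tests_pass: Callable[[str], bool],
) -> str:
    """Return the version that should serve production traffic.

    `blue` is the current deployment and `green` is the idle one.
    `smoke_tests_pass` is a hypothetical callback that tests a version
    without sending it any production traffic.
    """
    if smoke_tests_pass(green):
        return green  # Green replaces Blue.
    return blue  # Blue stays as-is.
```

Note that the switch is all-or-nothing: once the callback returns True, every user is served by the new version.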

Blue-Green Deployment visualized by Argo Project

Our concern: There is no gradual increase in production traffic during which we can check metrics.
Once the smoke tests pass, the switch is complete.

This was a problem for us because if there were ever an issue that the smoke tests could not catch, we would not see it before real users did.

Canary Deployment

“A Canary Deployment exposes a subset of users to the new version of the application while serving the rest of the traffic to the old version. Once the new version is verified to be correct, the new version can gradually replace the old version.

Ingress controllers and service meshes such as Nginx and Istio, enable more sophisticated traffic shaping patterns for canarying than what is natively available (e.g. achieving very fine-grained traffic splitting, or splitting based on HTTP headers).”

- Argo Project

To simplify:

Current deployment: 100% production traffic
Idle deployment: 0% production traffic

The idle deployment’s share of traffic increases in steps, for example 10% → 20% → 30% → 60% → 75% → 90% → 100%.

→ Custom metrics such as error rate are checked at each step.
If the metrics show an anomaly (let’s say the error rate is higher than a certain threshold), the rollout rolls back to the previous deployment.
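
To make the stepping and rollback rule concrete, here is a minimal Python sketch. The weight steps come from the example above; the 5% threshold, the function name, and the `error_rate_at` callback are assumptions made for illustration only.

```python
from typing import Callable, Sequence, Tuple

WEIGHT_STEPS = (10, 20, 30, 60, 75, 90, 100)
ERROR_RATE_THRESHOLD = 0.05  # hypothetical threshold, not from the article


def run_canary(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = WEIGHT_STEPS,
    threshold: float = ERROR_RATE_THRESHOLD,
) -> Tuple[str, int]:
    """Walk through the weight steps and stop at the first anomaly.

    `error_rate_at(weight)` is a hypothetical callback returning the error
    rate observed while the new version receives `weight` percent of traffic.
    Returns ("rolled_back", weight_at_failure) or ("promoted", final_weight).
    """
    for weight in steps:
        if error_rate_at(weight) > threshold:
            # Anomaly detected: all traffic goes back to the old version.
            return ("rolled_back", weight)
    return ("promoted", steps[-1])
```

In this sketch the first check only happens once 10% of real users are already on the new version.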

Canary Deployment visualized by Argo Project

Our concern: Even though only a small percentage of the user base is affected,
no tests run before the new code is exposed to real users.

There is no stopping point for issues that smoke tests could have caught.

What did we want?

  1. Smoke tests running on production before real users get affected by the new deployment. (Blue-Green Deployment)
  2. Increase the new deployment’s weight percentage gradually while checking error rate & response time metrics for a possible anomaly.
    (Canary Deployment)
  3. If a certain percentage of real-life requests encounter an anomaly, automatically roll back to the previously working deployment.
    (Canary Deployment)
  4. If no issue occurs, safely finish the deployment process.
    (Canary Deployment)

Basically, a zero-human-interaction process after the branch is merged.

We saw that we could achieve this simply by adding a smoke test step as another custom metric to the Canary Deployment, similar to the error rate or response time metrics.

So what we want is exactly this: Canary Deployment with smoke tests.

But while the smoke tests run, the new deployment should receive no real production traffic.

Our way

Rollout & virtual service definitions

We are using Istio for traffic management in Trendyol as Gökhan Karadaş mentioned here.

As you can see in our virtual service definition, we are using Istio to adjust our application traffic between canary and stable services the way we want.

The flow above is a usual Canary Deployment flow with only the addition of the smoke tests step, which helps us achieve our goal of combining the testability of Blue-Green Deployment with the anomaly detection of Canary Deployment.

For this purpose we used the Dynamic Canary Scale feature provided by Argo to create 1 pod that receives no real-life traffic and to run our smoke tests on that pod. After the tests succeed (giving us the Blue-Green behavior we wanted), we use the matchTrafficWeight flag to continue with the default Canary behavior, as seen between lines 16–22.
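
The order of events can be modeled with a small Python simulation. This is a sketch of the flow as described, not Argo's implementation: the function name, the callbacks, and the phase labels are hypothetical, and rounding the canary replica count up to follow the traffic weight is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, int, int]  # (phase, canary replicas, canary traffic weight %)


def combined_rollout(
    total_replicas: int,
    smoke_tests_pass: Callable[[], bool],
    metrics_healthy: Callable[[int], bool],
    weights: Sequence[int] = (10, 20, 30, 60, 75, 90, 100),
) -> List[State]:
    """Return the states visited by a canary rollout that starts with smoke tests."""
    # Blue-Green part: 1 canary pod, 0% traffic, smoke tests only.
    history: List[State] = [("smoke-test", 1, 0)]
    if not smoke_tests_pass():
        history.append(("aborted", 0, 0))
        return history

    # Canary part: replicas follow the traffic weight from here on.
    for weight in weights:
        # Integer ceiling of total_replicas * weight / 100 (assumed rounding).
        replicas = (total_replicas * weight + 99) // 100
        history.append(("canary", replicas, weight))
        if not metrics_healthy(weight):
            history.append(("aborted", 0, 0))
            return history

    history.append(("promoted", total_replicas, 100))
    return history
```

With failing smoke tests, the history ends after the first state, so no real user ever reaches the new code.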

Stable and canary deployments’ service definitions

Smoke tests

We kept our smoke tests as simple as possible so that they neither extend our deployment time nor need maintenance with every new feature (we already use automated tests for that purpose).

Our main goal for smoke tests is simply to send basic requests to ensure that the API responds correctly, which makes them a stopping point for incidents.

Smoke tests as analysis template

We have a separate project where we write our smoke tests and define our analysis template as above, where we can run the mentioned smoke tests.

The cluster can reach lorem-ipsum-api-canary via service discovery, so we can be sure that the requests always go to the new code.

Abdulkadir Karakoç and I wrote this story to lend a helping hand to those who want to do something similar in their projects.
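
As an illustration of how simple such a check can be, here is a minimal smoke test in Python. It is not our actual test project: only the service name comes from the text above, while the URL scheme, the `/health` path, and the function name are assumptions.

```python
import urllib.request
from typing import Callable, Sequence

# Service name from the article; scheme and default port are assumed.
CANARY_BASE_URL = "http://lorem-ipsum-api-canary"
SMOKE_PATHS = ("/health",)  # hypothetical endpoint


def smoke_test(
    base_url: str = CANARY_BASE_URL,
    paths: Sequence[str] = SMOKE_PATHS,
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Send one basic GET request per path and require HTTP 200 for each."""
    for path in paths:
        try:
            with opener(base_url + path, timeout=timeout) as response:
                if response.status != 200:
                    return False
        except OSError:
            # Covers connection failures, timeouts, and urllib's HTTP errors.
            return False
    return True
```

The `opener` parameter exists only so the function can be exercised without a network. A wrapper script could turn the boolean into a process exit code so that a failed check stops the rollout.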

We hope it was helpful. :)

Special thanks to Hüseyin Celal Öner and Serhat Yılmaz for all their support in the process.

Thank you for reading! ❤️

Thanks to our colleagues in the Homepage & Recommendation Team. 🤟
