EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Flagger - Canary Deployments on Kubernetes

Flagger enables automated canary deployments on Kubernetes. In this part, I'll run through some deployments using Flagger.

Fabian Piau
Expedia Group Technology

--

This article is the second one of the series dedicated to Flagger. In a nutshell, Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics and running conformance tests.


Make sure you have a local Kubernetes cluster running with the Istio service mesh. If you don’t, read the first article: Flagger - Get Started with Istio and Kubernetes.

In this second guide, we will focus on the installation of Flagger and run multiple canary deployments of the application Mirror HTTP Server (MHS). Remember that this dummy application can simulate valid and invalid responses based on the request. This is exactly what we need to test the capabilities of Flagger. We will cover both happy (rollout) and unhappy (rollback) scenarios.

This is a hands-on guide and can be followed step by step on macOS; it will require some adjustments if you are using a Windows or Linux PC. Note that this article does not go into details and only touches on the concepts & technologies, so if you are not familiar with Docker, Kubernetes, Helm or Istio, I strongly advise you to check their documentation before reading on.

Traffic lights on Hampton Beach
Photo by Shane on Unsplash

Installing Flagger

Let’s install Flagger by running these commands.

kubectl create ns flagger-system

We install Flagger in its own namespace flagger-system.

helm repo add flagger https://flagger.app

kubectl apply -f https://raw.githubusercontent.com/weaveworks/flagger/master/artifacts/flagger/crd.yaml
helm upgrade -i flagger flagger/flagger \
--namespace=flagger-system \
--set crd.create=false \
--set meshProvider=istio \
--set metricsServer=http://prometheus.istio-system:9090

Reference: Flagger Install on Kubernetes
Flagger depends on Istio telemetry and Prometheus (here we assume Istio is installed in the istio-system namespace).
All parameters are available on the Flagger readme file on GitHub.
We don't specify a version for Flagger, which means it will use the latest available in the repo (1.2.0 at the time of writing).
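If you prefer a reproducible install, the chart version can be pinned explicitly with Helm's --version flag; the other parameters stay the same:

helm upgrade -i flagger flagger/flagger \
--version 1.2.0 \
--namespace=flagger-system \
--set crd.create=false \
--set meshProvider=istio \
--set metricsServer=http://prometheus.istio-system:9090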

After a few seconds, you should get a message confirming that Flagger has been installed. From the Kube dashboard, verify that the new flagger-system namespace has been created and the Flagger pod is running.

Using the Kube dashboard, select the flagger-system namespace and see that a Flagger pod is present
Flagger is deployed in your cluster
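If you prefer the terminal over the dashboard, the same check can be done with kubectl:

kubectl -n flagger-system get pods
# expect a single Flagger pod in the Running state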

Experiment 0 - Initialize Flagger with MHS v1.1.1

Mirror HTTP Server has multiple versions available. To play with the Flagger canary deployment feature, we will switch between versions 1.1.1, 1.1.2 and 1.1.3 of MHS (the latter being the latest version at the time of writing).

Before deploying MHS, let’s create a new namespace called application; we don't want to use the default one at the root of the cluster (this is good practice). The name is quite generic but sufficient for this tutorial; in general you would use the name of the team or of a group of features.

kubectl create ns application

Do not forget to enable Istio sidecar injection on this new namespace:

kubectl label namespace application istio-injection=enabled
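A quick way to double-check the label from the terminal:

kubectl get namespace application --show-labels
# the istio-injection=enabled label must appear here, otherwise the Envoy
# sidecar will not be injected into the MHS pods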

To deploy MHS via Flagger, I created a Helm chart.

This “canary flavored” chart is based on the previous chart without Flagger, which itself was created with the helm create mhs-chart command and then adapted. In this canary flavored chart, I made a few extra adaptations: it uses 2 replicas instead of 1 to make it more realistic, it pins the image version to 1.1.1, and it adds the canary resource where the magic happens (see the sketch below).
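To give you an idea before you open the repo, here is a simplified sketch of that canary resource. The port and gateway values below are assumptions for illustration; refer to the actual canary.yaml in the chart for the real values.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: mhs
  namespace: application
spec:
  # the deployment Flagger will manage and clone into mhs-primary
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mhs
  # the Kubernetes service and Istio virtual service Flagger generates for traffic shifting
  service:
    port: 80                 # assumption: the port exposed by the chart
    hosts:
    - mhs.example.com        # the host used by the curl commands later in this guide
    gateways:
    - public-gateway.istio-system.svc.cluster.local   # assumption: the gateway from the first article
  # the canary analysis configuration, detailed later in this article
  analysis:
    interval: 1m
    maxWeight: 50
    stepWeight: 10
    threshold: 5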

Clone the chart repo:

git clone https://github.com/ExpediaGroup/mhs-canary-chart.git

And install MHS:

cd mhs-canary-chart
helm install mhs --namespace application ./mhs

After a few moments, if you look at the dashboard, you should see 2 replicas of MHS in the namespace application.

The Kube GUI application namespace shows MHS 1.1.1 is deployed in your cluster
MHS 1.1.1 is deployed in your cluster

It is important to note that no canary analysis has been performed and the version has been automatically promoted. It was not a “real” canary release.
Why? Because Flagger needs to initialize itself the first time we do a canary deployment of the application. So make sure the version you are deploying with Flagger the first time is fully tested and works well!
You could also guess this auto-promotion happened because there was no initial version of the application in the cluster. Although that is obviously a good reason, it’s important to note that, even if a previous version had been deployed before (e.g. 1.1.0), the canary version 1.1.1 would still have been automatically promoted without analysis.

You can still check the canary events with:

kubectl -n application describe canary/mhs

You should have a similar output without a canary analysis:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Synced 2m29s flagger mhs-primary.application not ready: waiting for rollout to finish: observed deployment generation less then desired generation
Normal Synced 92s (x2 over 2m30s) flagger all the metrics providers are available!
Normal Synced 92s flagger Initialization done! mhs.application

Or you can check the Flagger logs directly:

export FLAGGER_POD_NAME=$(kubectl get pods --namespace flagger-system -l "app.kubernetes.io/name=flagger,app.kubernetes.io/instance=flagger" -o jsonpath="{.items[0].metadata.name}")

kubectl -n flagger-system logs $FLAGGER_POD_NAME

If you take a closer look at the Kube dashboard, you should see some mhs and mhs-primary resources:

  • mhs-primary are the primary instances (= the non-canary ones). Flagger automatically adds the -primary suffix to differentiate them from the canary instances.
  • mhs are the canary instances. They exist only during the canary deployment and disappear once the canary deployment ends; that's why, in the screenshot above, you don't see any mhs canary pods (i.e. 0/0 pods). A quick terminal check is shown below.
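You can confirm this split from a terminal as well:

kubectl -n application get deployments
# expect mhs-primary with 2/2 ready replicas serving the live traffic,
# and mhs scaled down to 0/0 outside of a canary analysis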

Why this naming convention? I asked the Flagger team directly: there is a technical constraint behind it.

Flagger is now initialized properly and MHS is deployed to your cluster. You can use the terminal to confirm MHS is accessible (thanks to the Istio Gateway):

curl -I -H Host:mhs.example.com 'http://localhost'

You should receive an HTTP 200 OK response:

HTTP/1.1 200 OK
x-powered-by: Express
date: Mon, 05 Oct 2020 16:47:33 GMT
x-envoy-upstream-service-time: 10
server: istio-envoy
transfer-encoding: chunked

And:

curl -I -H Host:mhs.example.com -H X-Mirror-Code:500 'http://localhost'

should return an HTTP 500 response:

HTTP/1.1 500 Internal Server Error
x-powered-by: Express
date: Mon, 05 Oct 2020 16:48:09 GMT
x-envoy-upstream-service-time: 12
server: istio-envoy
transfer-encoding: chunked

Experiment 1 - MHS v1.1.2 canary deployment

We are going to install a newer version 1.1.2. You need to manually edit the file mhs-canary-chart/mhs/values.yaml and replace tag: 1.1.1 with tag: 1.1.2 (this line).
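If you prefer not to edit the file by hand, a one-liner does the same (BSD/macOS sed syntax; it assumes the line is written exactly as tag: 1.1.1):

sed -i '' 's/tag: 1.1.1/tag: 1.1.2/' mhs-canary-chart/mhs/values.yaml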

Then:

cd mhs-canary-chart
helm upgrade mhs --namespace application ./mhs

While the canary deployment is in progress, it’s very important to generate some traffic to MHS. Without traffic, Flagger will consider that something went wrong with the new version and will automatically roll back to the previous one. Obviously, you don’t need this extra step in a production environment that continuously receives real traffic.

Run this loop command in another terminal to generate artificial traffic:

while (true); do curl -I -H Host:mhs.example.com 'http://localhost' ; sleep 0.5 ; done
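You can also follow the analysis from a third terminal: Flagger keeps the status and traffic weight up to date on the canary resource.

kubectl -n application get canary mhs --watch
# the STATUS column should move from Progressing to Succeeded
# while WEIGHT increases in steps of 10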

Check the Kube dashboard; you should see the canary pod with the new version 1.1.2 at some point:

The application namespace now shows MHS 1.1.1 and canary deployment of MHS 1.1.2 in progress in your cluster
Canary deployment of MHS 1.1.2 in progress in your cluster

Check the canary events with the same command as before:

kubectl -n application describe canary/mhs

After a while (about 6 minutes) you should have a similar event output:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Synced 30m flagger mhs-primary.application not ready: waiting for rollout to finish: observed deployment generation less then desired generation
Normal Synced 29m (x2 over 30m) flagger all the metrics providers are available!
Normal Synced 29m flagger Initialization done! mhs.application
Normal Synced 10m flagger New revision detected! Scaling up mhs.application
Normal Synced 9m16s flagger Starting canary analysis for mhs.application
Normal Synced 9m16s flagger Advance mhs.application canary weight 10
Normal Synced 8m16s flagger Advance mhs.application canary weight 20
Normal Synced 7m16s flagger Advance mhs.application canary weight 30
Normal Synced 6m16s flagger Advance mhs.application canary weight 40
Normal Synced 5m16s flagger Advance mhs.application canary weight 50
Normal Synced 4m16s flagger Copying mhs.application template spec to mhs-primary.application
Normal Synced 3m16s flagger Routing all traffic to primary
Normal Synced 2m16s flagger (combined from similar events): Promotion completed! Scaling down mhs.application

The canary release completed successfully. You now have version 1.1.2 installed on all the primary pods and the canary pod has been removed.

The application namespace now shows MHS 1.1.2 and no canary deployment
MHS 1.1.2 is deployed in your cluster

Why did this deployment take about 6 minutes? Because it includes a 5-minute canary analysis. During this analysis, traffic was progressively routed to the canary pod: the canary traffic increased in steps of 10% every minute until it reached 50% of the overall traffic. The analysis is configurable and defined in the canary.yaml file that was added to the chart.

Below is the analysis configuration we have used:

analysis:
  # stepper schedule interval
  interval: 1m
  # max traffic percentage routed to canary - percentage (0-100)
  maxWeight: 50
  # canary increment step - percentage (0-100)
  stepWeight: 10
  # max number of failed metric checks before rollback (global to all metrics)
  threshold: 5
  metrics:
  - name: request-success-rate
    # percentage before the request success rate metric is considered as failed (0-100)
    thresholdRange:
      min: 99
    # interval for the request success rate metric check
    interval: 30s
  - name: request-duration
    # maximum req duration P99 in milliseconds before the request duration metric is considered as failed
    thresholdRange:
      max: 500
    # interval for the request duration metric check
    interval: 30s

The canary analysis here relies on the 2 basic metrics provided out of the box by Istio / Prometheus (request success rate and request duration). It is possible to define your own custom metrics; in that case, they need to be provided by your application, which has to expose a Prometheus endpoint that includes them, and you can then update the Flagger analysis configuration to use them with your own PromQL query. This goes beyond the scope of this hands-on guide, which only uses the built-in metrics.
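For illustration only, here is a minimal sketch of what that would look like: a MetricTemplate resource holding the PromQL query, referenced from the canary analysis via templateRef. The mhs_requests_* metric names are hypothetical and not actually exposed by MHS.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-percentage
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  # hypothetical query: percentage of failed requests reported by the application
  query: |
    sum(rate(mhs_requests_errors_total{namespace="{{ namespace }}"}[{{ interval }}]))
    /
    sum(rate(mhs_requests_total{namespace="{{ namespace }}"}[{{ interval }}])) * 100

The custom metric would then be added to the analysis section alongside the built-in ones:

  metrics:
  - name: error-percentage
    templateRef:
      name: error-percentage
      namespace: flagger-system
    # fail the check if more than 1% of the requests are in error
    thresholdRange:
      max: 1
    interval: 30s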

Experiment 2 - MHS v1.1.3 faulty deployment

Again, you need to manually edit the file mhs-canary-chart/mhs/values.yaml and replace tag: 1.1.2 with tag: 1.1.3.

Then:

cd mhs-canary-chart
helm upgrade mhs --namespace application ./mhs

We generate some artificial traffic:

while (true); do curl -I -H Host:mhs.example.com 'http://localhost' ; curl -I -H Host:mhs.example.com -H X-Mirror-Code:500 'http://localhost' ; sleep 0.5 ; done

This time, we also generate invalid traffic to make sure the request success rate goes down!

Check the canary events with the same command as before:

kubectl -n application describe canary/mhs

After a while (about 6 minutes) you should have a similar event output:

Normal   Synced  8m23s (x2 over 20m)  flagger  New revision detected! Scaling up mhs.application
Normal Synced 7m23s (x2 over 19m) flagger Advance mhs.application canary weight 10
Normal Synced 7m23s (x2 over 19m) flagger Starting canary analysis for mhs.application
Warning Synced 6m23s flagger Halt mhs.application advancement success rate 57.14% < 99%
Warning Synced 5m24s flagger Halt mhs.application advancement success rate 0.00% < 99%
Warning Synced 3m24s flagger Halt mhs.application advancement success rate 71.43% < 99%
Warning Synced 2m24s flagger Halt mhs.application advancement success rate 50.00% < 99%
Warning Synced 84s flagger Halt mhs.application advancement success rate 63.64% < 99%
Warning Synced 24s flagger Rolling back mhs.application failed checks threshold reached 5
Warning Synced 24s flagger Canary failed! Scaling down mhs.application

And you are still on version 1.1.2.

Flagger decided not to go ahead and promote version 1.1.3 as it could not complete a successful analysis: the error threshold was reached, i.e. 5 failed checks (indeed, each time, about 50% of the requests were ending up in an HTTP 500 response). Flagger simply redirected all traffic back to the primary instances and removed the canary pod.
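If you want to double-check the rollback from the terminal, you can inspect the image currently running on the primary deployment (the jsonpath assumes MHS is the only container defined in the deployment spec):

kubectl -n application get deployment mhs-primary \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# the image tag should still be 1.1.2, not 1.1.3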

Congratulations, you’ve come to the end of this second tutorial!


Observations

Before we clean up the resources we’ve created, let’s wrap up with a list of observations:

  • Deleting a deployment deletes all pods (canary and primary); we don't end up with orphan resources.
  • Prometheus is required. Without it, the canary analysis won’t work.
  • It is not possible to re-trigger a canary deployment of the same version right after it has failed. This forces you to bump the version (even if the failure was caused by configuration rather than code).
  • The Flagger off-boarding process is not as simple as removing the canary resource from the chart and deploying a new version. If you delete the canary resource, Flagger won’t trigger the canary process: it will change the version in mhs and remove mhs-primary, but mhs has 0 pods, so your service becomes unavailable! You need to be careful and adopt a proper manual off-boarding process. Recently, the Flagger team added a revertOnDeletion property you can enable to avoid this issue (see the sketch after this list); you can read the documentation to learn more about this canary finalizer.
  • After multiple deployments, some events can be missing: the Kubernetes describe command aggregates them ((x<int> over <int>m)), the order is sometimes not preserved and/or some events do not show up. You can look at the phase status instead (the terminal statuses are Initialized, Succeeded and Failed). The most reliable option is to look directly at the logs of the Flagger pod, as they are always accurate and complete.
  • The canary analysis should be configured to run for a short period of time (e.g. no more than 30 minutes) to leverage continuous deployment and avoid releasing a new version while a canary deployment for the previous one is still in progress. If you want to perform canary releases over longer periods, Flagger may not be the best tool.
  • Finally, it’s important to remember that the first time you deploy with Flagger (like in experiment 0 above), the tool needs to initialize itself (Initialized status) and will not perform any analysis.
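Regarding the off-boarding point above, here is a minimal sketch of where the revertOnDeletion flag lives on the canary resource (field name taken from the Flagger documentation; the rest of the spec stays as before):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: mhs
  namespace: application
spec:
  # when the Canary object is deleted, Flagger first restores the target
  # deployment (scales mhs back up and removes mhs-primary) before
  # removing its own generated resources
  revertOnDeletion: true
  # targetRef, service and analysis sections unchanged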

Cleaning up resources

Now that the tutorial is complete, you can remove the MHS application and its namespace.

helm delete mhs --namespace application
kubectl delete namespaces application

We recommend that you leave Flagger and Istio in place to save time in the next tutorial. If however you’d like to remove everything now, then you can run the following commands.

Remove Flagger:

helm delete flagger --namespace flagger-system
kubectl delete namespaces flagger-system

Remove Istio and Prometheus:

kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.7/samples/addons/prometheus.yaml
istioctl manifest generate --set profile=demo | kubectl delete -f -
kubectl delete namespaces istio-system

What’s next?

The next article will focus on the Grafana dashboard provided out of the box with Flagger, a nice addition that means you don’t need to manually run any kubectl commands to check the result of your canary deployments. Stay tuned! In the meantime, you can stop the Kubernetes cluster by unchecking the Kubernetes box and restarting Docker Desktop. Your computer deserves another break.

Learn more about technology at Expedia Group
