Full automation with Argo Rollout blue-green deployment — automatic rollout abort and rollback

Piotr Kleban
9 min read · Jun 16, 2023

Introduction

GitOps is a methodology that uses Git as the source of truth for the desired state of your applications and infrastructure. A GitOps agent (such as Argo CD) keeps the actual state in sync with the desired state by continuously applying changes from Git to your cluster.

However, Argo CD has no way to keep the cluster in a stable state in case of application bugs or failures. For example, if you deploy a new version of your application that has an error, Argo CD will still apply it to your cluster and potentially cause downtime or disruption for your users.

Argo Rollouts is a Kubernetes controller that extends the native Kubernetes rollout capabilities with advanced strategies such as blue-green and canary deployments.

Blue-green deployment is a technique that involves creating two identical environments (blue and green) and switching traffic between them when deploying a new version of your application. This way, you can ensure that the new version is fully tested and functional before exposing it to all your users.

One of the key features of Argo Rollouts is analysis. Analysis lets you run tests or checks on your application during or after a deployment. You define an AnalysisTemplate that specifies which metrics to measure or which conditions to verify. For example, you can run smoke or system-level tests before exposing the new version to your users, and the rollout can be aborted or rolled back automatically if the new version has issues. Once the error is fixed, you can roll forward with the next revision.

We will also see how to use a Rollout with HPA (Horizontal Pod Autoscaler) to scale your pods based on metrics.

How to use Argo Rollouts for automatic rollout abort?

One of the challenges of deploying new versions of applications is how to handle failures that may occur during or after the deployment. In a continuous deployment scenario, we want to abort the rollout or roll back to a previous version automatically, without manual intervention.

Argo Rollouts provides a feature called Analysis that allows you to run various tests or checks on your application during or after the deployment. You can define an AnalysisTemplate that specifies what metrics or conditions you want to measure or verify.

For example, you can run an analysis before promoting a new version from preview to active (pre-promotion analysis) or after promoting a new version from preview to active (post-promotion analysis).

How to use Argo Rollouts for blue-green deployment?

To use Argo Rollouts for blue-green deployment, we create a Rollout manifest that specifies activeService and previewService in the strategy section. The active service routes traffic to the stable version of the application. The preview service routes traffic to the new version of the application.

kind: Service
apiVersion: v1
metadata:
  name: rollout-ngnix-active # <- active
spec:
  selector:
    app: ngnix
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: rollout-ngnix-preview # <- preview
spec:
  selector:
    app: ngnix
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:

  analysis:
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3

  selector:
    matchLabels:
      app: ngnix

  template:
    metadata:
      labels:
        app: ngnix
    spec:
      containers:
        - name: ngnix
          image: nginx:1.19.1
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
          resources:
            limits:
              cpu: 10m

  rollbackWindow:
    revisions: 3

  strategy:

    blueGreen:

      activeService: rollout-ngnix-active # <- matches Active Service metadata
      previewService: rollout-ngnix-preview # <- matches Preview Service metadata

      prePromotionAnalysis:
        templates:
          - templateName: pre-promotion-analysis
        args:
          - name: rollout-ngnix-preview

      postPromotionAnalysis:
        templates:
          - templateName: post-promotion-analysis

      previewReplicaCount: 2
      autoPromotionEnabled: true
      autoPromotionSeconds: 1
      scaleDownDelaySeconds: 30

      # 0 means not to scale down
      abortScaleDownDelaySeconds: 30

One significant disadvantage of the blue-green strategy is that you must have enough capacity to run both versions simultaneously. However, Argo Rollouts has some parameters that help optimize cluster resource usage. One of them is previewReplicaCount, which specifies how many replicas to run under the preview service before the switch, reducing resource consumption during the preview phase. Even so, the new ReplicaSet is fully scaled up before the switch happens.

There will be a brief period when both the old and the new ReplicaSets are running at the same time. The scaleDownDelaySeconds parameter specifies how long to wait before scaling down the old ReplicaSet. Decrease it with care: the default is 30 seconds, and a minimum of 30 seconds is recommended to ensure IP table propagation across the nodes in a cluster.

autoPromotionEnabled is a parameter that determines the promotion mode of the rollout. If true (the default), the rollout automatically switches the active service to the new version when it is ready. If false, the rollout pauses and requires a manual resume.
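
As a rough illustration of the manual mode, the fragment below shows only the relevant part of the Rollout spec used in this article, with autoPromotionEnabled switched to false. It is a sketch of one possible configuration, not a complete manifest; the service names are the ones defined earlier.

```yaml
# Fragment of the Rollout spec; everything else stays as in the full manifest.
spec:
  strategy:
    blueGreen:
      activeService: rollout-ngnix-active
      previewService: rollout-ngnix-preview
      # Pause after the preview ReplicaSet is ready instead of switching automatically.
      autoPromotionEnabled: false
```

With this setting the rollout should pause once the preview ReplicaSet is ready, and it can then be promoted by hand with kubectl argo rollouts promote my-app.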

Two parameters control the history of analysis runs and experiments: successfulRunHistoryLimit and unsuccessfulRunHistoryLimit. They limit how many successful and unsuccessful runs and experiments are stored, respectively. The default value for both is 5.

To examine the history of analysis runs, use:

kubectl get analysisruns.argoproj.io
kubectl describe analysisruns.argoproj.io/<analysis-name>

kubectl get jobs
kubectl describe jobs/<job-name>

kubectl logs job/<job-name>

# Check the logs of the analysis run pod
kubectl logs pod/<analysis-run-pod-name>

Automatic Rollout abort and Rollback

As stated earlier, pre-promotion analysis is a feature of Argo Rollouts that lets you run an analysis before traffic is switched to the new version of your application. This helps you validate that the new version is ready and safe to receive traffic. We use a Job analysis to run a set of tests on the new version of the application. If the analysis succeeds, the rollout automatically promotes (by default) the new version from preview to active. However, if the analysis fails, the rollout automatically aborts and keeps traffic on the old stable version. This avoids exposing your users to a faulty version of your application.
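
To make this more concrete, here is a minimal sketch of what a pre-promotion-analysis template could look like with the job provider. The arg name service-name, the metric name and the wget smoke test are assumptions made for illustration; they are not taken from the manifests in this article. For it to work, the Rollout would have to pass the arg as name: service-name with value: rollout-ngnix-preview.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: pre-promotion-analysis      # name referenced by prePromotionAnalysis in the Rollout
spec:
  args:
    - name: service-name            # hypothetical arg, expected to hold the preview service name
  metrics:
    - name: smoke-test              # hypothetical metric name
      count: 1                      # run the Job once
      failureLimit: 0               # a single failed measurement fails the analysis
      provider:
        job:
          spec:
            backoffLimit: 0         # do not retry a failed pod
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke-test
                    image: busybox:1.28
                    command: [sh, -c]
                    # Exits non-zero if the preview service does not answer successfully.
                    args: ["wget -q -O /dev/null http://{{args.service-name}}"]
```

A real test suite would replace the single wget call, but the contract stays the same: exit code 0 means the measurement succeeded, anything else means it failed.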

What is more, we can easily leverage another hook, post-promotion analysis, to make the deployment even more reliable. This hook lets you run an analysis after traffic is switched to the new version of your application. You can use it to make sure that live traffic is not causing trouble with the new version, such as high latency or errors. If the analysis finds any problems, the rollout automatically rolls back to the previous stable version. This way, we don’t have to do anything manually!

Once the issue that caused the error has been resolved, we simply roll forward by deploying the next version.

In both situations (rollback or abort), unused ReplicaSets are scaled down to free up resources.
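
A post-promotion check on live traffic could be sketched with the Prometheus provider, as below. The Prometheus address, the http_requests_total metric and its labels, and the 5% threshold are all hypothetical and would need to match your own monitoring setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: post-promotion-analysis     # name referenced by postPromotionAnalysis in the Rollout
spec:
  metrics:
    - name: error-rate              # hypothetical metric name
      interval: 5s                  # time between measurements
      count: 4                      # four measurements in total
      failureLimit: 0               # any failed measurement triggers the rollback
      # The measurement succeeds while fewer than 5% of requests return a 5xx status.
      successCondition: result[0] < 0.05
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{app="ngnix",status=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{app="ngnix"}[1m]))
```

The measurements are kept short on purpose. As far as the Argo Rollouts documentation describes it, when scaleDownDelaySeconds is set the controller cancels a post-promotion analysis that is still running once that delay has passed, so the analysis should finish within the 30 seconds configured in this article.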

How to use Rollout with HPA?

HPA (Horizontal Pod Autoscaler) is a feature that automatically scales the number of pods based on metrics such as CPU or memory utilization. This way, you can ensure that your application has enough resources to handle the load and optimize resource utilization. The crucial setting is scaleTargetRef, which has to reference the Rollout object.

# autoscaling/v2beta2 is no longer served as of Kubernetes 1.26; autoscaling/v2 uses the same schema
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Let’s look at an example that exercises the Rollout attributes discussed earlier:

previewReplicaCount: 2
autoPromotionEnabled: true
abortScaleDownDelaySeconds: 30

We need to generate some traffic to your application so that the CPU utilization or other metrics increase and trigger the scaling action.

A simple way to do that might be by:

kubectl run -i --tty load --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q http://rollout-ngnix-active -O /dev/null; done"

After a minute, ReplicaSet my-app-6dfff79cfb has been scaled up to 10 replicas (maxReplicas in the HPA manifest).

Now let’s roll forward with a new image, nginx 1.21.0 → 1.21.1. Two preview pods are started, according to the value of previewReplicaCount in the Rollout manifest.

This matches the description of the previewReplicaCount attribute in the Rollout spec: "Once the rollout is resumed the new ReplicaSet will be fully scaled up before the switch occurs"

Revision 3 is scaled up to 10 replicas once the AnalysisRun has finished successfully.

Revision 3 becomes the active application.

The old ReplicaSet is then scaled down by terminating its pods, in accordance with scaleDownDelaySeconds: 30.

What is a Job AnalysisTemplate?

We have covered the behaviour of the Rollout in the sunny-day scenario. Let’s now take a look at AnalysisTemplate.

Example of AnalysisTemplate with job provider:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <string> # required
  namespace: <string> # optional
spec:
  metrics: # required
    - name: <string> # required
      provider:
        job:
          metadata: <object> # optional
          spec: <object> # required
      initialDelay: <duration> # optional
      interval: <duration> # optional
      count: <intstr> # optional
      failureLimit: <intstr> # optional

  • interval: This attribute specifies the time between measurements for a metric. For example, if interval is 10s, the measurement will be performed every 10 seconds.
  • count: This attribute specifies the number of measurements to perform for a metric. For example, if count is 1, the measurement will be performed only once.
  • failureLimit: This attribute specifies the number of failed measurements that cause the analysis to fail for a metric. For example, if failureLimit is 0, the analysis will fail if there is any failed measurement.
  • initialDelay: specifies how long to wait before starting the first measurement. It is an optional field that defaults to zero if not specified.

Default values for these parameters are:

  • count: 1 when both count and interval are omitted (run the measurement once)
  • interval: no default; if omitted, a single measurement is performed, and it must be set when count is greater than 1
  • failureLimit: 0 (fail analysis after one failed measurement)
  • initialDelay: 0s (start the first measurement immediately)

First measurement is started immediately after the analysis run is created.

Another attribute worth mentioning is backoffLimit in the job section.

  • backoffLimit: This attribute specifies the number of retries before marking a job as failed. For example, if backoffLimit is 0, the job will not be retried if it fails.

Job analysis is based on the native Kubernetes Job resource. If you set backoffLimit to 3, the job will be run up to 4 times. If the job still fails after 3 retries, it is marked as failed and the measurement (one of the measurements requested by count) is considered failed. If the number of failed measurements exceeds failureLimit, the analysis is considered failed.

For every measurement, the Job controller recreates failed pods with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed pods appear before the Job’s next status check. Once backoffLimit is reached, the Job is marked as failed and no more pods are created. The default value of backoffLimit is 6, which means the Job controller retries up to 6 times (so up to 7 pods can be created) before it marks the Job as failed.
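
The interplay of count, interval, failureLimit and backoffLimit may be easier to see in a single template. The following is only a sketch: the template name, the metric name and the test command are placeholders, and the numbers are picked arbitrarily.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: retry-example               # hypothetical name
spec:
  metrics:
    - name: integration-tests       # hypothetical metric name
      count: 3                      # up to three measurements, each one a separate Job
      interval: 30s                 # time between measurements (needed because count > 1)
      failureLimit: 1               # one failed measurement is tolerated, a second one fails the analysis
      provider:
        job:
          spec:
            backoffLimit: 2         # each Job retries a failed pod up to 2 times (3 pods at most)
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: tests
                    image: alpine:3.8
                    command: [sh, -c]
                    args: ["echo running tests && exit 0"]   # placeholder for a real test command
```

With these values a single measurement only counts as failed after its Job has used up all three pods, and the analysis as a whole fails once two measurements have failed.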

There are multiple providers supported by AnalysisTemplate.

  • Web
  • Prometheus
  • Wavefront
  • DataDog
  • NewRelic
  • Kayenta
  • CloudWatch
  • Graphite
  • InfluxDB
  • Apache SkyWalking
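
For comparison with the job provider, here is a small sketch of the Web provider. It assumes the application exposes a JSON health endpoint at /health that returns something like {"status": "ok"}; that endpoint, the arg name service-name and the template name are assumptions for illustration (a stock nginx image does not serve such an endpoint).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-health-check            # hypothetical name
spec:
  args:
    - name: service-name            # hypothetical arg holding the service to probe
  metrics:
    - name: health
      count: 1
      failureLimit: 0
      # The value extracted by jsonPath is exposed to the condition as "result".
      successCondition: result == "ok"
      provider:
        web:
          url: "http://{{args.service-name}}/health"   # hypothetical endpoint
          timeoutSeconds: 10
          jsonPath: "{$.status}"
```

Unlike the job provider, no pod is started here; the controller itself calls the URL and evaluates the response.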

Let’s now run an analysis which will simulate failing tests. There is an exit -1 executed inside a container.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: fail
spec:
  metrics:
    - name: fail
      count: 1
      interval: 0s
      failureLimit: 0
      provider:
        job:
          spec:
            backoffLimit: 0 # run once
            template:
              spec:
                containers:
                  - name: failure
                    image: alpine:3.8
                    command: [sh, -c]
                    args: [echo "tests finished at $(date)" && exit -1]
                restartPolicy: Never

Let’s test that.

We are going to change the image to a more recent version, deploy it, and observe that the analysis does not succeed: revision:1 remains active and there is no switch to revision:2.

Next, let’s modify the analysis from always failing (exit -1) to always succeeding (exit 0), and use a little trick to retry the analysis with the command kubectl argo rollouts retry rollout my-app

We can see that another analysis was performed for revision:2 and revision:2 became the active replica set.

Summary

Argo Rollouts is a powerful and flexible tool for application delivery to the Kubernetes cluster.

Argo Rollouts is able to stop the promotion of a new version if a metric or test fails. This prevents the new version from being promoted to the active service and receiving production traffic. Argo Rollouts can also roll back the deployment of a new version after promotion. Both analyses are configured with the .spec.strategy.blueGreen.prePromotionAnalysis and .spec.strategy.blueGreen.postPromotionAnalysis fields in the Rollout spec.

With Argo Rollouts installed, we can also relatively easily move on to another strategy which is a canary deployment.

Thanks
