Unexpected HPA Scale Down of ArgoCD Rollouts

Andrei Mihai
5 min read · Jun 22, 2024


Unexpected Scale Down Issue

If you configured a Horizontal Pod Autoscaler (HPA) for a Rollout managed by ArgoCD and you notice unexpected scale-down behaviour even though your application traffic (or load) is still high, the likely reason is that ArgoCD is resetting (overwriting) the Rollout's number of replicas at every sync.

Even if you explicitly configure the scale-down policy on the HPA, it will not prevent the sudden drop of the desired replicas to the minimum.

For example, let's say you have your HPA defined like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler

metadata:
  name: my-hpa

spec:
  behavior:

    scaleDown:
      policies:
      - periodSeconds: 900
        type: Pods
        value: 1
      selectPolicy: Min

    scaleUp:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      selectPolicy: Max

  maxReplicas: 48
  minReplicas: 16

  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 70
        type: Utilization
    type: Resource

  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: my-rollout

You would expect that while CPU utilisation stays above 70% the Rollout grows by up to 4 pods every minute, and that it shrinks by at most 1 pod every 15 minutes once CPU utilisation drops below 70%.

However, what you actually observe is that every few minutes the number of desired replicas drops back to the minimum, as if the HPA had reset it. Even if you try to disable scale-down using selectPolicy: Disabled in the HPA, the Rollout is still scaled down aggressively for no apparent reason. It’s like the scale-down policy is totally ignored.
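For reference, disabling scale-down on the HPA from the example above would look roughly like the fragment below. It is only a sketch of that one setting; the rest of the HPA spec is assumed to stay unchanged.

```yaml
# Fragment of the HPA spec from the example above; only scaleDown changes.
spec:
  behavior:
    scaleDown:
      selectPolicy: Disabled   # the HPA itself never lowers the replica count
```

This setting only constrains what the HPA does. It has no effect on any other actor that writes to the Rollout's replicas field.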

Reason

This is how the HPA works with scaleTargetRef:

  1. Monitoring Metrics: The HPA queries the metrics API (backed by, for example, metrics-server or a Prometheus adapter) to fetch the current metrics for the pods associated with the target resource.
  2. Calculating Desired Replicas: The HPA compares the current metrics against the specified target metrics. If the observed metric value deviates significantly from the target, the HPA calculates the number of replicas required to bring the metric value back to the target.
  3. Updating Replica Count: The HPA then updates the replicas field of the target resource (specified in scaleTargetRef) to the desired number. This triggers Kubernetes to adjust the number of running pods to match the desired replicas.
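The calculation in step 2 follows the formula given in the Kubernetes HPA documentation. As a worked example, assume 16 current replicas and an observed average CPU utilisation of 90% (both hypothetical numbers) against the 70% target from the HPA above:

```latex
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 16 \times \frac{90}{70} \right\rceil
  = \lceil 20.57 \rceil
  = 21
```

The behavior policies then limit how quickly the HPA is allowed to move towards that number, and minReplicas and maxReplicas bound the result.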

However, ArgoCD continuously compares the desired state from the Git repository with the actual state of the Kubernetes cluster. If you have defined the replicas field in your Rollout manifest (desired state), ArgoCD overwrites the replicas value in the cluster (actual state) at every automatic sync. If this is what is happening, you should also be able to see it in the ArgoCD interface when you click on the Application, search for your Rollout and click Details:

Rollout in ArgoCD UI

and you should see the Diff tab at the bottom of the page:

Rollout Diff between the Actual State and Desired State
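To make the conflict concrete, here is a minimal sketch of the kind of Rollout manifest in Git that triggers it. The value 16 is an assumption chosen to match minReplicas in the HPA example, and the rest of the spec is left out.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-rollout
spec:
  replicas: 16   # hypothetical value; every sync writes it back over the HPA's value
  # selector, template and strategy omitted for brevity
```

With a manifest like this, the HPA and ArgoCD both write to spec.replicas, and the last writer wins until the other one acts again.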

Solution

Two possible solutions:

  1. Do not include the replicas field in the manifest of the Rollout if you are using an HPA, or
  2. Ignore the replicas field of the Rollout during the ArgoCD sync stage

Solution 1: Do not include the replicas field in the manifest of the Rollout if you are using an HPA

One of ArgoCD's roles is to manage applications declaratively, meaning the desired state is defined in code and stored in version control. The desired state is usually a set of Helm chart templates in a Git repo.
In your Rollout Helm template you should not set spec.replicas if the HPA is enabled:

# templates/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ .Release.Name }}
spec:
  {{- if not .Values.autoscaler }}
  replicas: {{ .Values.replicas }}
  {{- end }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:

and in values.yaml you can have something like this:

# values.yaml
autoscaler: true # Enable or disable autoscaler
replicas: 3 # Default number of replicas if autoscaler is disabled

# The rest of the values, like the HPA min and max replicas and so on..

⚠️ Be aware that if the autoscaler is not configured properly you might end up with a Rollout that has neither a replicas field nor an autoscaler. In that case only a single pod is created, because an absent replicas field defaults to replicas: 1, and this might not be what you want.
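One way to reduce that risk is to render the HPA from the same autoscaler flag, so that the chart always produces either spec.replicas or an HPA. The template below is a sketch under assumptions: the file name templates/hpa.yaml and the values keys hpa.minReplicas and hpa.maxReplicas are hypothetical and are not part of the values.yaml shown above.

```yaml
{{- if .Values.autoscaler }}
# templates/hpa.yaml (hypothetical file name)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Release.Name }}
spec:
  minReplicas: {{ .Values.hpa.minReplicas }}   # hypothetical values key
  maxReplicas: {{ .Values.hpa.maxReplicas }}   # hypothetical values key
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: {{ .Release.Name }}   # must match the Rollout name in templates/rollout.yaml
{{- end }}
```

Because both templates test the same flag, switching autoscaler off brings spec.replicas back and removes the HPA in the same release.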

Solution 2: Ignore the replicas field from Rollout during ArgoCD sync stage

Argo CD allows ignoring differences at a specific JSON path, using JSON patches or JQ path expressions. For example, to ignore the spec.replicas field you can use something like:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:

  ignoreDifferences:
  - group: argoproj.io
    kind: Rollout
    jqPathExpressions:
    - '.spec.replicas'

  syncPolicy:
    syncOptions:
    - RespectIgnoreDifferences=true

This can be added directly in the ArgoCD UI (you have the option to edit the manifest of your Argo Application), but it’s better to keep all your Argo CD Application manifests as code in a Git repo.

RespectIgnoreDifferences tells ArgoCD to apply the ignoreDifferences section during the sync stage as well. By default, Argo CD uses the ignoreDifferences section only for computing the diff between the live and desired state, which determines whether the application is synced or not. During the sync stage, however, the desired state is applied as-is. [See Respect Ignore Differences documentation]

The ignoreDifferences option is not taken into consideration the first time the resource is created, so there is no risk of creating a Rollout with an empty replicas field.

⚠️ However, this option has the drawback that you can no longer update the number of replicas from values.yaml after the Rollout has been created. So if you have a global ignoreDifferences configuration, be careful not to ignore the spec.replicas field for Rollouts that don't have an HPA.
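If only some Rollouts in an Application have an HPA, one option is to scope the rule with the name field of the ignoreDifferences entry. The sketch below assumes the Rollout is called my-rollout, as in the HPA example. Other Rollouts in the same Application would still have spec.replicas synced from Git.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  ignoreDifferences:
  - group: argoproj.io
    kind: Rollout
    name: my-rollout          # assumption: the only Rollout with an HPA
    jqPathExpressions:
    - '.spec.replicas'
  syncPolicy:
    syncOptions:
    - RespectIgnoreDifferences=true
```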

Conclusions

The decision to go with Solution 1 (do not set spec.replicas on the Rollout) or Solution 2 (ignore spec.replicas during the sync stage) depends on your DevOps infrastructure setup.
If you have only one global manifest for all your ArgoCD Applications, the first option of not setting spec.replicas on the Rollout might be better.
If you don't have access to the Rollout Helm template (for example, it is part of a Helm chart developed outside your team), it may be better to just add spec.replicas to the ignoreDifferences list.

Also, check the latest ArgoCD documentation for the current syntax, since the project is changing rapidly.
