Making Myself Obsolete — Implementing cluster autoscaling and node autoremediation at Hootsuite

Shelby Moore
Hootsuite Engineering
15 min read · Dec 21, 2022

It’s Wednesday at 2am and you’re startled awake by PagerDuty. You groggily log in to your laptop and check the details of the alert, which indicates pods are stuck in Pending on Kubernetes. You describe one of the pending pods and see “FailedScheduling, 0/23 nodes available”. Wonderful, you were woken up to act as a human cluster autoscaler.

Since migrating to EKS and setting up deployments using GitOps and ArgoCD, the team at Hootsuite had unblocked a project that had long been sitting on deck: cluster autoscaling and node autoremediation. With Kubernetes, it is possible to automatically size the node groups in the cluster. This would remove the need for the team to periodically increase the node count for the clusters, sometimes outside of working hours. Node autoremediation automation could also be introduced to fully automate the removal of problematic nodes. Nodes can degrade for a variety of reasons, some of which are not always visible to the Kubernetes control plane. These situations previously required manual intervention by a cluster operator to drain and terminate the degraded nodes.

Once implemented, cluster autoscaling and node autoremediation automation reduced pager load on the team and increased reliability of the compute platform at Hootsuite. This post will dive into the specifics of how these systems were introduced.

Shaping our clusters to match our traffic

The team added cluster autoscaling primarily to support dynamic resizing of our node groups. This ensured we had adequate space on the cluster when workloads horizontally scaled due to increased traffic. It also ensured that unnecessary space on the cluster was removed when traffic decreased.

Cluster autoscaling was also necessary to support node autoremediation. If a degraded node could be detected and drained, something would need to take care of terminating the problematic node and replacing it with a healthy one.

Adding the Cluster Autoscaler

What is the Kubernetes Cluster Autoscaler? To quote from the source directly:

Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when one of the following conditions is true:

  • there are pods that failed to run in the cluster due to insufficient resources.
  • there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.

Prior to introducing cluster autoscaling, the node groups for the Kubernetes clusters at Hootsuite were configured to exist in three availability zones for high availability. However, it can be useful to have the ability to scale nodes up and down for a single availability zone. This capability can make it easy to respond to an AWS outage/degradation in one availability zone by downscaling the infrastructure in that availability zone and upscaling the infrastructure in other, unaffected availability zones.

Another use case for this architecture is for workloads that use EBS persistent volumes. EBS volumes are coupled to a single availability zone. Any pod(s) using this volume must be scheduled in the same availability zone. If the node group the pod(s) run on spans multiple availability zones, new nodes may be brought up in any of these zones. To be able to guarantee the availability zone of a node, the node group must have a 1:1 mapping with an availability zone.

Our team refactored the node groups for the Kubernetes clusters such that each existing node group was replaced with three separate node groups (referred to internally as a node group set), one per availability zone. The Cluster Autoscaler was configured with the --balance-similar-node-groups flag so that sizing across the three node groups would be balanced. Finally, the Cluster Autoscaler was configured to automatically discover the node groups to scale by assigning specific tags to them.
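
On AWS, this auto-discovery works by tagging each node group’s underlying Auto Scaling Group. A minimal sketch of the tags involved is shown below; the cluster name matches the auto-discovery flag shown later in this post, and only the tag keys matter to the Cluster Autoscaler, not their values.

# Tags applied to each Auto Scaling Group so that the Cluster Autoscaler's
# auto-discovery can find it; the cluster name and values are illustrative.
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/k8s-development: "owned"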

With these changes in place cluster capacity could now be resized in an automated way depending on resourcing needs for workloads. However, these changes made the clusters far more dynamic than before. Before introducing cluster autoscaling, pods were typically only assigned to different nodes during deployments. Now, pods could be rescheduled at any time. There was work to be done to retain high availability in an increasingly dynamic cloud environment.

Pod Disruption Budgets

Before turning on the Cluster Autoscaler for production Hootsuite clusters, ensuring high availability could be maintained for all workloads was a priority. To achieve this, it was necessary to introduce Pod Disruption Budgets. Pod Disruption Budgets provide availability guarantees for replicated applications during voluntary disruptions, such as the Cluster Autoscaler gracefully draining an under-utilized node so that it can be terminated. Because each microservice at Hootsuite is deployed with an internally maintained Helm chart, this was easy to add. To roll out Pod Disruption Budgets to all microservices, all the team needed to do was publish a new version of the Hootsuite Helm chart.

In addition to internal services, it was also necessary to ensure Pod Disruption Budgets were added for all third party tooling installed on Hootsuite clusters. Many of the Helm Charts provided to install third party tooling require a boolean value be set to true to explicitly enable the creation of a Pod Disruption Budget for the tool.

When adding Pod Disruption Budgets, it was important to make sure the Pod Disruption Budget was not overly restrictive for the workload. For the majority of Hootsuite microservice deployments, allowing a single pod to be unavailable at a time was sufficient. However, for some larger deployments, this setting would make cluster draining operations take much longer than necessary. Care was needed to set the maximum number of unavailable pods to an appropriate percentage for the workload. A good starting point for setting the appropriate maxUnavailable percentage is to refer back to the update strategy for the workload to see what kind of unavailability it can tolerate.

# An example of a Pod Disruption Budget that allows a single pod belonging
# to a deployment to be unavailable at a time during a voluntary disruption.
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: foo
  namespace: bar
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: foo
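
For larger deployments, the same budget can be expressed as a percentage instead of a fixed count; the 10% value below is purely illustrative and should be chosen based on the workload’s update strategy.

# A variant of the Pod Disruption Budget above for larger deployments,
# expressing the budget as a percentage; the 10% value is illustrative.
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: foo
  namespace: bar
spec:
  maxUnavailable: "10%"
  selector:
    matchLabels:
      app: foo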

One caveat is that Pod Disruption Budgets should not be set for workloads with a single replica. If the budget requires at least one pod to remain available (for example, minAvailable: 1 or maxUnavailable: 0) and the workload only has a single replica, it becomes impossible to drain that workload from a node. For some single-replica workloads, it is still necessary to minimize the likelihood that the workload will be evicted from a node. In these scenarios, it may be appropriate to annotate the workload's pods with the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation. This annotation instructs the Cluster Autoscaler to exclude the underlying node from scale-down operations.
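
A minimal sketch of how this annotation can be applied in a single-replica Deployment's pod template is shown below; the names are illustrative.

# A minimal sketch of the safe-to-evict annotation on a single-replica
# workload's pod template; names are illustrative.
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: singleton
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"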

Topology Spread Constraints

As part of the autoscaling initiative, the team introduced another high availability improvement to the Hootsuite clusters: topology spread constraints. Topology spread constraints can be used to spread the pods belonging to an individual workload across failure domains such as availability zones and nodes. They complement affinity and anti-affinity rules, adding the ability to control how evenly pods are spread across failure domains. There is an interesting discussion in this Kubernetes Enhancement Proposal about how specifying affinity rules alone can result in pods being stacked in a single failure domain. Topology spread constraints were added to the Hootsuite microservice Helm chart and enabled in third party Helm charts where supported. For third party charts that did not support adding constraints, the team instead layered the configuration on top of the generated manifests using Kustomize.

# An example of topology spread constraints that will prefer scheduling pods
# with the label "app.kubernetes.io/name: foo" across zones and nodes.
topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: foo
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: foo
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway

Self-healing at the cluster level

As part of the work to implement cluster autoscaling, the team also planned to add enhancements to the self-healing capabilities of the cluster. Occasionally, the clusters at Hootsuite experienced issues with single nodes entering a degraded state. These issues typically required human intervention to fix: an operator would manually drain the problem node and terminate it in the AWS console. This seemed like a textbook case for automation!

Node Problem Detector and Draino

Enter Node Problem Detector and Draino. Node Problem Detector is a DaemonSet that can be used to detect node problems such as hardware issues, kernel issues, container runtime issues and system daemon issues. Node Problem Detector reports these problems by setting a condition on the node. In addition to the default conditions, it is possible to extend Node Problem Detector with custom conditions. An example of a custom condition the team implemented for Hootsuite clusters is one to detect clock drift.
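
To illustrate, a condition reported by Node Problem Detector appears in the node's status alongside the built-in conditions. The condition below is a hypothetical sketch of what a clock drift condition might look like; the type, reason and message are defined by the custom plugin configuration, not by Node Problem Detector's defaults.

# A hypothetical custom condition as it might appear in the node's status
# (e.g. via kubectl get node -o yaml); the type, reason and message are
# defined in the custom plugin configuration.
status:
  conditions:
    - type: ClockDriftProblem
      status: "True"
      reason: ClockDriftDetected
      message: "node clock has drifted beyond the allowed threshold"
      lastHeartbeatTime: "2022-12-21T02:00:00Z"
      lastTransitionTime: "2022-12-21T02:00:00Z"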

Once a condition is set on a node, a remedy system must be in place to take care of evicting all workloads on the node and terminating the node. Draino is a deployment that can be configured to watch for node conditions and automatically cordon and drain any node with one of the conditions set. Once cordoned, Kubernetes will no longer schedule workloads on the node. Once Draino has evicted all workloads from the node, the node will have low utilization. The cluster autoscaler will detect the drained node as being underutilized and will then terminate it.
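
As a rough sketch, Draino is given the node conditions it should act on as arguments to its container. The snippet below is illustrative rather than Hootsuite's actual configuration: the image tag is a placeholder, KernelDeadlock is a default Node Problem Detector condition, and ClockDriftProblem is the hypothetical custom condition from above.

# A rough sketch of a Draino container configured to drain nodes that have
# specific conditions set; the image tag and condition names are illustrative.
containers:
  - name: draino
    image: planetlabs/draino:latest
    args:
      - KernelDeadlock        # default Node Problem Detector condition
      - ClockDriftProblem     # hypothetical custom condition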

Applying Some Polish

Finally, a first pass at setting up and configuring the Cluster Autoscaler, Node Problem Detector and Draino on Hootsuite non-production clusters was complete. Of course, we all know the last 10% of the work takes 90% of the time. In this case, it became apparent that testing the end-to-end flow of the new components was challenging. The non-production clusters had very different traffic patterns than the production clusters — the scale was much smaller. This meant scaling and node failure events were far less frequent. We would need to simulate these events in order to adequately test the new flows.

Continuous Verification with Litmus Chaos Tests

Rather than developing a manual test plan, we decided to employ continuous verification to trigger the automated flows in all non-production environments repeatedly throughout the day. These flows are complex and could break for a variety of reasons, such as:

  • AWS infrastructure changes
  • Kubernetes configuration changes
  • Kubernetes version upgrades
  • Tooling version upgrades

The main advantage continuous verification had over manual testing was that it removed the possibility of human error. It was unrealistic to expect human operators to remember to test the flows every time a change was made that could impact them. Manual testing would also not scale well: if we went that route, we would be taking on an ever-increasing number of manual tests as the platform grew.

We wanted to surface issues with automated flows as early as possible, rather than find out they were broken in production when they were needed. We opted to introduce the Litmus Chaos framework to our platform, which allowed us to trigger automated, custom tests written to exercise the autoscaling and autoremediation systems.

Litmus Chaos is a framework for running chaos tests — the deliberate injection of faults or failures into a system to test the system’s response to failure. Litmus runs on Kubernetes using the operator pattern. Operators are essentially an extension of Kubernetes — they use Custom Resources to manage various cluster resources. A Custom Resource is similar to built-in Kubernetes resources, like Deployments, Services and Pods. A Custom Resource is simply another type of resource defined by a third party.

In Litmus Chaos’ case, the two key Custom Resources are ChaosEngine and ChaosExperiment. A ChaosEngine provides configuration for executing a collection of ChaosExperiments. The Litmus Operator manages the lifecycle of these resources and will run ChaosExperiments according to their definitions. Metrics for ChaosExperiments runs are exposed as Prometheus metrics. These metrics can be used to alert when tests are consistently failing.

# A promql query that can be used to alert when chaos tests are consistently failing.
rules:
  - alert: ChaosTestsNotRunningSuccessfully
    expr: (sum by (env, chaosengine_name) (increase(litmuschaos_passed_experiments{}[4h]))) < 1
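
For context, a trimmed-down ChaosEngine might look something like the sketch below. The names, namespace and service account are placeholders, and the exact fields vary between Litmus versions.

# A trimmed-down ChaosEngine sketch; names, namespace and service account are
# placeholders, and fields vary between Litmus versions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: foo-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: bar
    applabel: app=foo
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"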

The Litmus Chaos community provides a runner image with a selection of pre-built experiments that can be configured to run on a cluster. Some examples of pre-built experiments are:

  • Randomly deleting pods on a cluster
  • Randomly deleting containers on a cluster
  • Driving CPU usage on a pod very high

Check out the complete list for the full set of pre-built experiments.

If the pre-built experiments don’t cover what you’re looking for, the community-maintained runner image can easily be swapped out with a customized runner image. The customized runner image gives you the flexibility to write custom chaos tests in your language of choice.

The value of this “bring your own” customization was realized quickly. The team triggered repeated testing of the autoscaling flow using the community-maintained autoscaling experiment and discovered that rapid scaling in production environments was not all that rapid: horizontally scaled deployments would often wait a few minutes for new nodes to be provisioned. That delay was undesirable, so the clusters would need to be intentionally overprovisioned to keep some standby capacity warm.

Overprovisioning to Reduce Startup Time

The Cluster Autoscaler docs recommend using a deployment configured with low pod priority on the cluster to accomplish overprovisioning. An open source Helm Chart can be used to create an installation of the deployment on a cluster. We created an installation of this chart for each node group set on Hootsuite production clusters.

Each installation was configured to provision three replicas of the overprovisioning deployment. These replicas were spread across availability zones using topology spread constraints. This ensured there would be at least one warm, standby node in each availability zone for each node group set. The deployment was configured to provision a single replica in Hootsuite non-production clusters to keep costs down.

When a cluster is unable to accommodate cluster workloads due to limited capacity, the overprovisioning deployment pods are evicted due to their low priority. The overprovisioning deployment pods then transition to a Pending state, which causes the Cluster Autoscaler to trigger a scale out event. The overprovisioning pods are scheduled on the newly created node(s) to keep them warm for when capacity is needed in the future.
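
A minimal sketch of this pattern is shown below, assuming a dedicated negative-priority class and a deployment of pause pods; the names, replica count and resource requests are illustrative rather than Hootsuite's actual values, and the open source chart mentioned above generates similar resources.

# A minimal sketch of the overprovisioning pattern: a negative-priority class
# and a deployment of pause pods that reserve warm capacity. Names, sizes and
# replica counts are illustrative.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10              # lower than the default priority of 0, so these pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 3
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"          # sized to match the headroom to keep warm
              memory: 1Gi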

Testing Autoscaling

A custom Litmus test was written to explicitly test the overprovisioning functionality with the autoscaling flow. This test scales out a dummy deployment with pods that are sized to use the majority of resources on worker nodes. The dummy deployment scales out to three replicas and uses topology spread constraints to require each pod to be scheduled in its own Availability Zone. The test validates that:

  1. the overprovisioning deployment pods used to keep one standby node per Availability Zone warm are evicted
  2. the dummy deployment is successfully scheduled within one minute
  3. the overprovisioning deployment is successfully rescheduled within five minutes

The test fails if it is unable to validate any of the above conditions. This set of conditions ensures that standby nodes are always available when needed for rapid scaling, and that new standby nodes are created as needed.
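
A rough sketch of the kind of dummy deployment such a test could scale out is shown below; the names and resource requests are illustrative and would be tuned to fill the majority of a worker node.

# A rough sketch of a dummy deployment for the autoscaling test: three pods,
# each required to land in its own availability zone and sized to consume most
# of a worker node. Names and sizes are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaling-chaos-dummy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: autoscaling-chaos-dummy
  template:
    metadata:
      labels:
        app: autoscaling-chaos-dummy
    spec:
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: autoscaling-chaos-dummy
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule    # require one pod per zone
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "3"           # illustrative: most of a node's allocatable CPU
              memory: 12Gi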

Testing Autoremediation

With testing in place for autoscaling, the team moved on to creating a test for the autoremediation flow. The test selects a node, taints it and explicitly schedules a pod on the tainted node which should trigger the KernelDeadlock node condition. This node condition is set by Node Problem Detector. After verifying that this condition is set, the test waits for the node to be cordoned and drained by Draino. Upon detecting the node is terminated by the Cluster Autoscaler, the test completes successfully.

The tests described above were configured to be run hourly on all non-production clusters at Hootsuite. Thanks to the tests, the team was able to quickly discover shortcomings of the flows and apply fixes:

# Example of some of the configuration options that worked for us for
# the cluster autoscaler, YMMV =)
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=my-cluster-autoscaler-ns
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/k8s-development
- --balance-similar-node-groups=true
- --cluster-name=k8s-development
- --expander=least-waste
- --logtostderr=true
- --max-failing-time=5m
- --max-node-provision-time=4m0s
- --skip-nodes-with-local-storage=false
- --stderrthreshold=info
- --scale-down-unneeded-time=5m0s
- --scale-down-delay-after-add=5m0s
- --balancing-ignore-label=eks.amazonaws.com/nodegroup-image
- --balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateId
- --balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateVersion
- --v=2

Monitoring

The chaos tests were now running hourly and were exercising the new autoscaling and autoremediation flows. This surfaced an unexpected issue: the existing node monitoring approach was no longer appropriate. When we initially set up the clusters many years ago, we used the same monitoring checks for Kubernetes nodes that were already in use by the traditional VM-based system. These checks assumed that a single node going down or system daemons not running were critical problems that needed manual intervention to resolve.

With autoscaling and autoremediation in place on our clusters, these checks were outdated. A single node, or even a small subset of nodes, entering a degraded state should not page a human, as long as workloads are unaffected. If autoscaling and autoremediation are working as expected, degraded nodes will be terminated and replaced automatically. The team decided to refactor the existing monitoring to take a more holistic view of the system. Critical alerts were configured to be sent only if an issue was not self-healing or if it couldn’t wait until working hours.

The new monitoring included the following checks (among others not listed here):

  • Unschedulable pods (critical) — this may indicate that autoscaling is not working. However, it can surface other configuration issues, such as incorrect tolerations. An example alert rule for this check is sketched after this list.
  • Node having a condition set for an extended period (critical) — this may indicate that autoremediation is not working. Workloads may be impacted if they are not drained from the node.
  • Node is NotReady for an extended period (warning) — this may end up leading to cluster capacity issues if new nodes are unable to join the cluster.
  • Node having memory/disk/load pressure (warning) — these situations are typically gracefully handled with problematic workload eviction (don’t forget to configure node allocatable appropriately!). However, they may indicate that a workload’s configuration needs to be adjusted to have adequate resourcing.
  • Cluster is over/under provisioned (warning) — this may indicate autoscaling is not working, as there is too little or too much capacity available.
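
As an example of what these checks can look like in practice, below is a hedged sketch of a Prometheus alert rule for the unschedulable pods check, built on the kube-state-metrics kube_pod_status_phase metric; the duration, threshold and severity are illustrative.

# A hedged sketch of an alert rule for the "unschedulable pods" check using
# kube-state-metrics; the duration and severity are illustrative.
rules:
  - alert: PodsStuckPending
    expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Pods have been Pending for 15 minutes; cluster autoscaling may not be working."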

With the refined monitoring in place, the autoscaling and autoremediation systems were promoted to Hootsuite production clusters. As expected, this reduced pager toil. An additional benefit was that autoscaling significantly reduced the capacity of the clusters running the CI/CD systems during non-working hours. The clusters running CI/CD workloads now regularly scale between 20 and 70 nodes depending on the time of day.

Grafana dashboard panel showing cluster autoscaling throughout the day for CI/CD clusters.

Spot Instances

One subsequent addition made since introducing these systems is the usage of AWS Spot instances in Hootsuite clusters. From the AWS docs:

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud. Spot Instances are available at up to a 90% discount compared to On-Demand prices.

The major caveat with using Spot instances is that they can be reclaimed at any time, with only a two-minute warning. Fortunately, the high availability improvements made before introducing autoscaling and autoremediation made it simple and safe to introduce this change.

The team added new node groups for each availability zone consisting solely of Spot instances. The cluster autoscaler was configured to use a priority expander to prefer scaling Spot groups out before falling back to the On-Demand groups.

# An example of a priority expander configmap used to configure the
# cluster autoscaler to prefer scaling out node groups using SPOT instances.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: my-cluster-autoscaler-ns
data:
  priorities: |
    10:
      - .*
    20:
      - .*spot.*

Not all workloads are a good fit for Spot instances. If a workload’s graceful termination period is greater than two minutes, it may not have time to shut down before the underlying Spot instance is reclaimed. The Hootsuite microservice Helm chart was updated to automatically add an affinity rule to any workload with a long graceful termination period. This affinity rule ensured the workload would only be scheduled on the On-Demand node groups.

# An example of the affinity rule applied to workloads with a graceful 
# termination period that wasn't appropriate for SPOT instances. This
# affinity rule configures the workload to be scheduled on ON DEMAND
# instances only.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - ON_DEMAND

Finally, monitoring was added to track Spot versus On-Demand prices. If the Spot price goes above the On-Demand price, the team is notified so a decision can be made on whether or not to disable the Spot groups until the price goes back down.

Conclusion

Kubernetes already has great self-healing capabilities by default, but it is possible to take these up a notch using tools like Node Problem Detector, Draino and Cluster Autoscaler. With these tools and the configuration of Pod Disruption Budgets and Topology Spread Constraints, workloads can be even more resilient to failure in a dynamic cloud environment. As complexity is introduced to a cluster, continuous verification can be used to catch issues early, before they reach production environments and cause outages. Who doesn’t like increased availability and decreased pager volume?

Shelby Moore
Hootsuite Engineering

Senior Staff Software Developer on the Production Operations and Delivery team at Hootsuite