How autoscaling took down my application!

4 min read · Mar 25, 2025

Wait, what? Yes, that was clickbait (sorry!). It really should have been “How a small oversight in autoscaling configuration took down my application” (that was a mouthful).

My team and I work with Google Kubernetes Engine (GKE) and are pretty happy with it. The diagram below shows our high-level architecture.

Source: Google Docs

When deploying a new app, we create a workload and a ClusterIP service. We create Network Endpoint Groups (NEGs) by annotating the service with cloud.google.com/neg, as mentioned here. This creates NEGs in all availability zones, which we then attach as a backend to an HTTPS load balancer.
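As a minimal sketch of that step (the service name and port below are hypothetical), the standalone-NEG annotation can be applied like this:

```shell
# Ask GKE to create standalone zonal NEGs for port 80 of an existing
# ClusterIP service (service name "my-internal-app" is a placeholder).
kubectl annotate service my-internal-app \
  'cloud.google.com/neg={"exposed_ports": {"80": {}}}'
```

GKE then creates one NEG per zone that has nodes, which is exactly where the story below goes wrong.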

We deployed two internal applications a few weeks ago in our staging environment, and after creating the services, we saw only 2 NEGs instead of 3 (one per availability zone). Since these are internal applications, we deployed only one pod per deployment. Our applications were still working because the two active NEGs each had a healthy pod behind them.

We deployed the apps in the northamerica-northeast1 region. The two NEGs were created in availability zones (AZs) northamerica-northeast1-b and northamerica-northeast1-c.

The Challenge

Fast-forward a couple of weeks, and we got an alert that both applications had gone down. New pods were running on both deployments and both appeared healthy, but the Load Balancer (LB) could not detect them and reported no healthy pods in the backend. I quickly checked for pod restarts, and both workloads had brand-new pods running in them. This was good news, because now we knew what had triggered the problem.

Further debugging showed that both new pods were on the same node in northamerica-northeast1-a, the one AZ that didn’t have a NEG attached to the LB. This had never happened before, and we wanted to understand what caused it!
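The placement check above can be reproduced with a few standard commands (no names assumed beyond your own cluster context):

```shell
# Which node (and therefore which zone) each pod landed on:
kubectl get pods -o wide

# Map nodes to their availability zones via the standard topology label:
kubectl get nodes -L topology.kubernetes.io/zone

# List the NEGs GKE actually created; the output shows one row per zone,
# making a missing zone easy to spot:
gcloud compute network-endpoint-groups list
```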

Root Cause Analysis

After debugging it for a while, we created a ticket with Google and talked to one of their engineers. He mentioned there had been no active node in northamerica-northeast1-a when we created the services and assured us this was default behavior. We were advised to deploy a dummy workload with scheduling constraints to make sure there’s at least one node in each availability zone. This approach works, but it adds unnecessary overhead to an already complex infrastructure.
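For context, the suggested workaround looks roughly like this sketch: a tiny placeholder Deployment pinned to the empty zone so the autoscaler keeps a node there (all names are hypothetical, and the resource requests are deliberately minimal):

```shell
# Placeholder workload pinned to the zone that lacked a node.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-a-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zone-a-placeholder
  template:
    metadata:
      labels:
        app: zone-a-placeholder
    spec:
      # Force scheduling into the underprovisioned zone.
      nodeSelector:
        topology.kubernetes.io/zone: northamerica-northeast1-a
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 10m
            memory: 16Mi
EOF
```

This is the overhead we wanted to avoid: one placeholder per zone, per cluster, forever.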

After investigating further, we realized the issue stemmed from the autoscaling settings of the node pool. Because we expose the applications through NEGs, GKE needs to maintain a NEG in every AZ. However, the autoscaling configuration led to an uneven distribution of nodes across AZs. As a result, there was no node in the northamerica-northeast1-a AZ to host the necessary NEG.

When the autoscaler is not configured carefully, a restarted pod can be scheduled in an AZ with no NEG, and the LB backend cannot reach the healthy pod, causing downtime. This was precisely what was happening: the absence of a NEG in one AZ meant that traffic routed to that zone couldn’t reach the service, leading to partial service failure.

The Solution

The solution was to revisit and adjust the autoscaling configuration to ensure an even distribution of nodes across all availability zones. After reviewing the options, I identified two settings that had a direct impact:

1. Location Policy Options:

  • Balanced: This option spreads nodes equally across AZs, taking pod requirements and resource availability into account.
  • Any: Prioritizes the utilization of unused reservations but can result in uneven node distribution across AZs.

2. Size Limits Type:

  • Per Zone Limits: Enforces node limits on a per-zone basis, ensuring a more even distribution of resources across AZs.
  • Total Limits: Places a cap on the number of nodes without regard for distribution across zones.
Source: GCP console — GKE Nodepool config
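The same settings can be applied from the CLI. A hedged sketch (pool, cluster, and limit values are placeholders; check `gcloud container node-pools update --help` for your gcloud version):

```shell
# Switch the node pool to the Balanced location policy with per-zone limits.
# --min-nodes/--max-nodes are per zone; --total-min-nodes/--total-max-nodes
# would configure Total Limits instead.
gcloud container node-pools update my-pool \
  --cluster my-cluster \
  --region northamerica-northeast1 \
  --enable-autoscaling \
  --location-policy BALANCED \
  --min-nodes 1 \
  --max-nodes 3
```

With a per-zone minimum of 1, the autoscaler can never scale a zone down to zero nodes, which is what left northamerica-northeast1-a without a NEG.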

In my case, switching the Size Limits Type to Per Zone Limits ensured that each AZ would maintain an appropriate number of nodes, thus guaranteeing that NEGs were created in every AZ as required. This adjustment resolved the issue and restored proper traffic routing to all zones.

Mitigating Future Issues

To prevent similar issues in the future, I recommend implementing the following best practices:

1. Regularly Review Autoscaling Settings: It’s essential to ensure that your autoscaling is configured to align with your application’s architecture and resource requirements. Periodically revisiting autoscaling settings can help avoid configuration drift, which can lead to issues like uneven node distribution.

2. Monitor Node Distribution: Leverage monitoring tools to track how nodes are distributed across availability zones. GKE offers various ways to monitor cluster health and resource utilization, and catching imbalances early is essential.
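As a quick ad-hoc check (a sketch, not a substitute for real monitoring), node counts per zone can be summarized in one line:

```shell
# Count nodes per availability zone; a zone with zero nodes is a red flag
# when services are exposed via NEGs.
kubectl get nodes --no-headers \
  -o 'custom-columns=ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
  | sort | uniq -c
```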

3. Test Failover Scenarios: Conduct regular failover tests to simulate node or AZ failures. This proactive testing will help identify vulnerabilities in your deployment strategy and ensure high availability in production.

4. Use Regional Autoscaling: Consider configuring your node pools in a regional setup rather than zonal to ensure even distribution across multiple AZs by default.

Conclusion

Deploying applications in a distributed environment like GKE can be complex, especially when leveraging features like NEGs and LB. However, understanding the underlying autoscaling configuration and taking a few preventive measures can significantly reduce the chances of encountering these issues.

In my case, a slight change to the autoscaling configuration significantly improved application availability and resilience. This experience reinforced the importance of carefully managing autoscaling settings, especially when dealing with critical components like NEGs in a multi-AZ environment.

Thank you for your time!

Written by Nikhil Naidu

A DevOps Engineer who loves to solve problems. Experienced with all stages of the DevOps lifecycle.