Balancing Snowflake’s Cloud Services Across Availability Zones

Author: Ja Wattanawong on behalf of the ECS team.

Introduction

Customers want Snowflake to be available at all times, and this means designing our Cloud Services layer, which coordinates various services and schedules warehouses to run queries, to be resilient against a number of failure modes. One rare but catastrophic failure mode is an unexpected outage of a cloud service provider’s datacenter.

All three public cloud service providers that Snowflake runs on provide Availability Zones (See AWS, GCP, Azure documentation), which are isolated datacenters in a single region that we can choose to provision resources from. By keeping Cloud Services instances balanced across these availability zones, we can ensure minimal customer impact in the event of zonal failures, as requests can transparently be redirected to an instance in another zone.

Figure 1. Availability zone outage and failover for a balanced cluster

The Zone Balancing Problem

Intuitively, a set of instances is balanced when we have a roughly equal number of instances in each zone. We can quantify how balanced a set of instances is by calculating the difference between the number of instances in the most loaded zone and the least loaded zone, which we will call AZ skew.
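As a minimal sketch (zone names here are illustrative), AZ skew can be computed by counting instances per zone, including zones that currently hold no instances:

```python
from collections import Counter

def az_skew(instance_zones, all_zones):
    """AZ skew: instance count in the most loaded zone minus the count
    in the least loaded zone. Zones with no instances count as zero, so
    an entirely empty zone still contributes to the skew."""
    counts = Counter({zone: 0 for zone in all_zones})
    counts.update(instance_zones)
    return max(counts.values()) - min(counts.values())

# Example: 4 instances in zone 1a, 3 in 1b, 1 in 1c -> skew of 3
zones = ["1a"] * 4 + ["1b"] * 3 + ["1c"]
print(az_skew(zones, ["1a", "1b", "1c"]))  # 3
```

A perfectly balanced set of instances has a skew of 0 or 1 (the latter when the instance count does not divide evenly by the number of zones).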

There are several reasons why minimizing AZ skew is more involved than simply striping all of our virtual machine instances across availability zones at provisioning time. Within a single regional deployment, Snowflake’s Cloud Services implements a multi-cluster architecture where each cluster serves different groups of customers. Clusters scale independently of one another in response to current load, as described in our autoscaling blog post. To minimize the impact of an AZ outage on each cluster and on the deployment as a whole, we must zone balance at both the cluster level and the deployment (global) level.

Figure 2. Various balanced scenarios; colors represent different clusters

Not only are these goals sometimes in competition, but the free pool from which we draw instances to scale clusters may not have instances of the required type in the zone we want. Since provisioning and preparing a new cloud instance can take on the order of minutes, Cloud Services maintains a free pool to trade instances to and from clusters, allowing us to scale each cluster up or down within seconds. An active instance may fail when we have no free instance in its zone to replace it, or we may suddenly need to scale up a cluster when the only free instances are in zones that cluster already loads heavily. Even if we start off well balanced, clusters and the deployment can drift toward imbalance over time.

Furthermore, while we can rebalance a cluster by adding an instance in one zone and removing from another, this incurs a cost overhead since the instance being removed needs to finish executing queries. Therefore, we would also like to keep the number of rebalancing actions to an acceptable level.

Our objective is then to minimize AZ skew both for each cluster and for the entire deployment, constrained by an acceptable number of rebalancing changes. We also prioritize minimizing cluster skew over minimizing global skew, as a total outage of any single cluster would be catastrophic.

Zone Balanced Cluster Orchestration

The first piece of the puzzle is fairly straightforward. We modified our cluster manager background service to scale clusters in a zone-balanced manner. This means that when we scale a cluster out, we should pick the least loaded zone globally out of the set of least loaded zones for that cluster. When we scale a cluster in, we should pick the most loaded zone globally out of the set of most loaded zones for that cluster.
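A sketch of that zone-selection rule, assuming per-zone instance counts for the cluster and for the whole deployment are available (the function and variable names are illustrative, not Snowflake’s actual code):

```python
def pick_scale_out_zone(cluster_counts, global_counts):
    """Among the zones where this cluster has the fewest instances,
    choose the zone with the fewest instances globally."""
    least = min(cluster_counts.values())
    candidates = [z for z, n in cluster_counts.items() if n == least]
    return min(candidates, key=lambda z: global_counts[z])

def pick_scale_in_zone(cluster_counts, global_counts):
    """Among the zones where this cluster has the most instances,
    choose the zone with the most instances globally."""
    most = max(cluster_counts.values())
    candidates = [z for z, n in cluster_counts.items() if n == most]
    return max(candidates, key=lambda z: global_counts[z])

cluster = {"1a": 2, "1b": 1, "1c": 1}    # this cluster's instances per zone
deployment = {"1a": 10, "1b": 9, "1c": 7}  # all instances per zone
print(pick_scale_out_zone(cluster, deployment))  # 1c: tied for cluster, lightest globally
print(pick_scale_in_zone(cluster, deployment))   # 1a: heaviest for the cluster
```

Breaking cluster-level ties using global counts is what lets each scaling decision nudge the deployment toward global balance without ever sacrificing cluster-level balance.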

Figure 3. Scaling clusters in a zone-balanced way

However, there may be situations in which we cannot maintain global zone balance, or even cluster-level zone balance, as shown in the diagram below.

Figure 4. Cases where we cannot maintain zone balance

Thus, there is an additional active zone rebalancing background task, which examines the current state of the deployment and executes a series of moves to balance it. Namely, it will prioritize moves in the following order:

  1. Moves that improve both cluster level and global zone balancing
  2. Moves that improve cluster-level balancing
  3. Moves that improve global zone balancing
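A minimal sketch of how a single move (shifting one instance from a source zone to a destination zone within a cluster) might be classified under these priorities; the counts and naming are illustrative:

```python
def classify_move(cluster_counts, global_counts, src, dst):
    """Classify moving one instance from zone src to zone dst.

    Returns 0 if the move improves both cluster-level and global
    balance, 1 if it improves cluster-level balance only, 2 if it
    improves global balance only, and None if it improves neither.
    A move only helps a level of balance when src holds at least two
    more instances than dst at that level."""
    improves_cluster = cluster_counts[src] > cluster_counts[dst] + 1
    improves_global = global_counts[src] > global_counts[dst] + 1
    if improves_cluster and improves_global:
        return 0
    if improves_cluster:
        return 1
    if improves_global:
        return 2
    return None

# Cluster has 3 instances in 1a vs 1 in 1b; globally 1a is also heavier,
# so this move improves both levels (priority 0).
print(classify_move({"1a": 3, "1b": 1}, {"1a": 8, "1b": 5}, "1a", "1b"))  # 0
```

With every candidate move classified this way, the rebalancer can simply execute moves in ascending priority order.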
Figure 5. Picking ideal rebalancing moves for each cluster

To generate candidate moves, we compute the balanced threshold for a cluster: the number of instances in that cluster divided by the number of available zones. Any move from a zone with more instances than that threshold to a zone with fewer cannot make cluster-level balancing worse. We can then evaluate which criteria each candidate move improves and select the best one.
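The threshold-based move generation can be sketched as follows (zone names are illustrative):

```python
def candidate_moves(cluster_counts):
    """Generate (src, dst) zone pairs for single-instance moves that
    cannot make cluster-level balancing worse: only from zones above
    the balanced threshold to zones below it."""
    threshold = sum(cluster_counts.values()) / len(cluster_counts)
    overs = [z for z, n in cluster_counts.items() if n > threshold]
    unders = [z for z, n in cluster_counts.items() if n < threshold]
    return [(src, dst) for src in overs for dst in unders]

# Threshold is 6 / 3 = 2: only 1a is above it, and 1b and 1c are below.
print(candidate_moves({"1a": 5, "1b": 1, "1c": 0}))
```

Note that a zone sitting exactly at the threshold appears in neither list, so an already-balanced cluster yields no candidate moves.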

Each move is executed with minimal customer impact, as the old instance is allowed to finish currently running jobs while the new one accepts incoming queries. Additionally, our cluster manager is aware of how many instances are needed to reach a target skew, and will provision free instances of the correct category and zone.

After enabling this change on some small deployments, we noticed that it was difficult to maintain full global zone balance (skew ≤ 1) without making too many moves within a certain time period. For example, a cluster could scale up into a specific zone to maintain cluster level zone balancing, but if that zone was already heavily loaded globally, we may need to rebalance another cluster to maintain global zone balancing. As mentioned above, churning through too many instances incurs overhead for us as instances need to finish executing running jobs.

As a result we added a global AZ skew leniency threshold below which the zone balancer will only consider cluster-level rebalancing moves. This parameter essentially trades skew leniency for instance churn rate, and we plan to find an optimal value that can give us acceptable skew while smoothing over temporary global zone imbalances that are incurred as part of normal cluster scaling operations.
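As a rough sketch of that gating (the threshold value and names here are illustrative, not Snowflake’s actual configuration):

```python
GLOBAL_SKEW_LENIENCY = 2  # hypothetical tuning value

def allowed_move_kinds(global_skew, leniency=GLOBAL_SKEW_LENIENCY):
    """Below the leniency threshold, only cluster-level rebalancing
    moves are considered; once global skew exceeds it, global
    rebalancing moves are allowed as well."""
    if global_skew <= leniency:
        return {"cluster"}           # tolerate minor global imbalance
    return {"cluster", "global"}     # rebalance globally too
```

Raising the leniency reduces instance churn from chasing transient global imbalances; lowering it tightens the worst-case global skew.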

Findings

Cluster-level Zone Balancing

We have been running cluster-level zone balancing in production for several years, and it produces excellent results. Here is a graph from a large Snowflake deployment showing the instantaneous skew of every cluster over a single day:

Figure 6. Cluster-level AZ skew in a large Snowflake deployment

As you can see, the vast majority of clusters stay balanced, and the few that become unbalanced are rebalanced within the hour.

Global Zone Balancing

Global zone balancing was implemented this summer to support a move toward smaller clusters as part of workload isolation improvements. We initially deployed the feature with a leniency of 1, meaning we target an instance count difference of no more than one between the most loaded and least loaded zones. After the feature was deployed in a large production deployment, the active zone rebalancer corrected the large global zone imbalance and converged within the hour.

Figure 7. After enabling global zone balancing in a large Snowflake deployment

We eventually increased the leniency on global zone balancing to lower the number of instances we were churning through to rebalance the whole deployment.

Conclusions

In this blog post we have described how Snowflake keeps its Cloud Services layer zone-balanced to minimize the impact of a cloud provider’s zone outage. We also touched upon how zone balancing operates in the context of our cluster manager, which was introduced in previous blog posts (Elastic Cloud Services, Autoscaling). We showed that our method was effective at correcting availability zone imbalance in a reasonable amount of time. We are continuously advancing our service platform and hope to share some of those improvements with you in the future.

Acknowledgements

Thank you to Ioannis Papapanagiotou, Rares Radut, Samir Rehmtulla, Johan Harjono, and Hana Andersen for suggesting much-needed improvements.
