Elastic Cloud Services — Auto Scaling and Throttling

Authors: Johan Harjono, Dan Karp, Kunal Nabar, Ioannis Papapanagiotou, Rares Radut, and Arthur Shi on behalf of the ECS team.

Introduction

In our previous blog post, we introduced Snowflake's Elastic Cloud Services (ECS) layer, the brain behind delivering Snowflake as a service. We covered how we manage large and small clusters, automatic upgrades, fast rollback strategies, customer account partitioning, and the state diagram of the instance lifecycle. In this post, we discuss how we improve the elasticity of our service layer with auto-scaling and dynamic throttling.

The traffic routed to each Global Services (GS) cluster varies widely by time of day, often in unpredictable patterns. Earlier this year (version 5.12), we introduced two features to improve our elasticity in responding to these traffic fluctuations: Dynamic Throttling and Autoscaling, which together ensure a rapid response to fluctuating customer workloads. For the scope of this work, we focus on CPU load-based automatic scaling; in a future post, we will cover other resource constraints such as memory and network. We used the following success criteria:

  1. Responsiveness: The total amount of time a query runs sub-optimally because it needs more computing resources. Ideally, we want this to be as close to 0 ms as possible.
  2. Cost-efficiency: Minimize the number of instances required while maintaining a smooth experience for our customers.
  3. Cluster volatility: The number of times the algorithm resizes a cluster. We prefer a stable steady state over highly volatile clusters: resizing has a time cost (provisioning and deprovisioning compute resources in a cloud environment introduces delays) and an impact on the cache hit ratio. For example, an algorithm that scales in and then scales out again one minute later is highly undesirable; if provisioning a new VM and updating the traffic routing table takes three minutes, it would have been better to simply keep the cluster size steady.
  4. Throughput: The total number of successful queries that client(s) can submit to the cluster. Ideally, this is as high as possible for the utilized instance type.
  5. Latency: The average duration for a query to complete. Ideally, this is as low as possible. Note that there is usually an inverse relationship between latency and throughput: given a static cluster size, a single query can finish quickly, but as more and more queries pile up (i.e., as throughput increases), the p99 query latency may start to climb. The expected value-add of a good autoscaling algorithm is that as we keep adding queries and stressing the cluster, latency stays flat (or at least increases more slowly than it would without autoscaling).

Before throttling and autoscaling, we imposed fixed per-instance, per-account, and per-user limits on the number of concurrent queries entering a single instance. These limits were meant to contain unexpected or overwhelming workloads and protect the cluster. They were enforced by gateways, which exist at several layers of the request lifecycle and ensure that node resources do not suffer from excessive or unfair usage. However, static limits proved unreliable: some workloads with low query cardinality stay below the limits yet still cause load issues depending on their compilation patterns, while other high-concurrency workloads are artificially throttled even though the machine has resources available to run them right away. To solve this, we moved to an approach that does not rely on static limits.
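
To make the starting point concrete, here is a rough sketch of how such a static gateway behaves conceptually. The scope names, limit values, and class shape below are hypothetical illustrations, not Snowflake's actual implementation:

```python
import threading
from collections import defaultdict

# Hypothetical static limits; the scopes and values are illustrative only.
STATIC_LIMITS = {"instance": 200, "account": 50, "user": 20}

class StaticGateway:
    """Rejects a query if any scope (instance, account, or user) is at its fixed cap."""

    def __init__(self, limits=STATIC_LIMITS):
        self.limits = dict(limits)
        self.in_flight = defaultdict(int)   # keyed by (scope, identifier)
        self.lock = threading.Lock()

    def try_admit(self, account_id, user_id):
        keys = [("instance", "self"), ("account", account_id), ("user", user_id)]
        with self.lock:
            if any(self.in_flight[k] >= self.limits[k[0]] for k in keys):
                return False    # caller translates this into an HTTP throttle error
            for k in keys:
                self.in_flight[k] += 1
            return True

    def release(self, account_id, user_id):
        with self.lock:
            for k in (("instance", "self"), ("account", account_id), ("user", user_id)):
                self.in_flight[k] -= 1
```

The weakness is visible in the sketch itself: the caps count queries, not the CPU they consume, which is exactly the mismatch described above.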

When more work enters an instance than it can handle, the CPU load on the box (as measured by Linux /proc/loadavg) can climb very high and several undesirable knock-on effects may occur. If a query runs afoul of one of the gateway limits, an HTTP error code is returned, and the Snowflake client transparently performs backoff and retry for some amount of time. The hope is that during the retry window, our autoscaling algorithm will have provisioned enough resources and the query will succeed!
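
The client side of this contract is conceptually a capped exponential backoff. The sketch below is an assumption-laden illustration (the 429 status code, delays, and retry budget are placeholders, not the real Snowflake client logic):

```python
import random
import time

def submit_with_retry(submit_fn, max_wait_s=300):
    """Retry a throttled request with capped exponential backoff and jitter.

    submit_fn() is assumed to return an object with a status_code attribute
    (e.g. a requests.Response); 429 stands in here for "throttled, retry later".
    """
    deadline = time.monotonic() + max_wait_s
    delay = 1.0
    while True:
        resp = submit_fn()
        if resp.status_code != 429:
            return resp                       # success, or a non-retryable error
        if time.monotonic() + delay > deadline:
            return resp                       # retry budget exhausted
        # Sleep with jitter, then back off; by the next attempt the autoscaler
        # may have added capacity or the throttler may have relaxed its gateways.
        time.sleep(delay + random.uniform(0, delay))
        delay = min(delay * 2, 30.0)
```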

Dynamic Throttling

Our solution combines dynamic throttling on the foreground instances with a centralized autoscaling system that runs on a background instance. When CPU load becomes high on a node, the instance-level gateway limits, which control the concurrency limit locally on that instance, are immediately lowered and then adjusted until the CPU load returns to and remains at an acceptable level. If this instance-level throttling results in rejections (which are in most cases not directly customer-visible, thanks to the client's backoff/retry logic), that signal is transmitted to the autoscaler and triggers a cluster scale-out. If instead we are rejecting work at the instance level without hitting load issues, we recognize that the GS instance can take on more work and raise the gateways until either we stop rejecting or we fully load the instance and thus trigger a scale-out.

To complement this instance-level throttling, we run a new centralized autoscaler BG (The Background instances are described in our previous blog post). When the aggregate CPU load is high across the active instances in a cluster or if a quorum of the cluster’s instances is throttling and rejecting, we increase the number of instances in the cluster. Likewise, when no rejections are occurring and the cluster’s aggregate load is low, we reduce the cluster size. To ensure fairness, separate throttling calculations are also done per cluster at the account level and are used to increase or decrease account and user gateways for the instance, which alters the concurrency allowed for individual users and accounts.

Many instances run enough jobs that the law of large numbers comes into effect and the assumption that each job contributes a similar load becomes reasonably accurate. When instances are running into resource constraints, this assumption typically holds well enough to serve as an initial estimate of how far to throttle. We use two types of throttling coefficients: the instance coefficient, which adjusts the number of queries each individual instance can take, and the account coefficient, which adjusts the total number of concurrent queries an account can run in its cluster. We dynamically update these coefficients every 30 seconds.
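
As a simplified illustration of that 30-second update loop, the sketch below adjusts both coefficients from the normalized CPU load and the recent rejection count. The step sizes, bounds, and proportional-adjustment rule are assumptions chosen for illustration; the production logic is more nuanced:

```python
TARGET_LOAD = 1.0        # target normalized CPU load (load average per core)
UPDATE_PERIOD_S = 30     # coefficients are re-evaluated on this cadence

def update_coefficients(norm_load, rejections, instance_coef, account_coef,
                        step=0.1, floor=0.1, ceiling=2.0):
    """Return updated (instance_coef, account_coef) gateway multipliers.

    norm_load  -- current load average divided by the core count
    rejections -- queries rejected by this instance's gateways since the last update
    The effective gateway is base_limit * coefficient, so lowering a coefficient
    immediately reduces the concurrency this instance (or account) is allowed.
    """
    if norm_load > TARGET_LOAD:
        # Overloaded: throttle harder, proportionally to how far over target we are.
        factor = TARGET_LOAD / norm_load
        instance_coef = max(floor, instance_coef * factor)
        account_coef = max(floor, account_coef * factor)
    elif rejections > 0:
        # Rejecting work without load pressure: the instance can admit more,
        # so expand the gateways (possibly beyond their static defaults).
        instance_coef = min(ceiling, instance_coef + step)
        account_coef = min(ceiling, account_coef + step)
    else:
        # Healthy and not rejecting: incrementally revert toward the neutral 1.0.
        if instance_coef < 1.0:
            instance_coef = min(1.0, instance_coef + step)
        if account_coef < 1.0:
            account_coef = min(1.0, account_coef + step)
    return instance_coef, account_coef
```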

The figure below illustrates a synthetic workload running TPC-DS, a benchmark that analyzes the performance of OLAP databases (you can find more information on our sample datasets page). In this test environment, we disabled Autoscaling; only Dynamic Throttling was active, and we configured the cluster with two GS instances. The generated workload exceeded the CPU capacity of the two instances, and dynamic throttling reacted to the excessive load by reducing the gateway limits to keep the available instances healthy. The top left chart reports the CPU load of the two instances; the load exceeds the available CPU capacity and is highly correlated with the query count in the bottom left. The top right chart shows the throttle coefficient of the throttled instance being lowered to enforce a new gateway size, effectively applying a small multiplier to our gateways. Once the throttler determines the new gateway limit, fewer queries are let through: in the bottom left chart, query throughput drops to a sustainable level, which in turn reduces the load the instances face.

Figure 1. High load deflection through Dynamic Throttling

Furthermore, after the initial estimate, the coefficient in the top right chart can be seen adjusting incrementally: it first drops lower to temper the workload further, then settles into a balance that keeps the CPU load at approximately 1.0, our target.

The bottom right chart displays the coefficient used for the account- and user-specific limits. To maintain fairness, we apply this limit evenly, so all accounts are reduced by the same factor and the account with the highest job count is the first to be limited. Note that dynamic throttling is a transient limitation: when we reject these queries, we signal the autoscaling framework to add instances to the overloaded cluster. Our intent is not to slow down customers, but to temper transient load spikes and prevent them from causing downstream issues in our system until we can accommodate the workload.

Now that we have the base algorithm, we enter the optimization stage. There are a number of parameters we can tweak to optimize for the properties we want, and those properties effectively form the pillars of our throttling technique. The three pillars of throttling:

  • Lower the load as soon as possible, without overshooting;
  • Minimize oscillation;
  • Reduce throttle as soon as possible.

We reject queries as cheaply as possible and retry once more instances are available; we are effectively deferring load to a later point in time, by which the autoscaler will have increased our computing capacity. Because we throttle to keep CPU load at a healthy level, the throttler must surface metrics about its rejection behavior as a signal to the autoscaler; if we throttle effectively, the CPU load signal ideally never kicks in. These rejections are persisted in FoundationDB, our operational metadata database, and aggregated globally by our Autoscaling framework to make decisions.
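
As a rough sketch of that aggregation step (the record shape and the summary fields below are assumptions; the real pipeline persists and reads these records through FoundationDB rather than in memory):

```python
from dataclasses import dataclass

@dataclass
class InstanceSample:
    """One per-instance datapoint surfaced by the throttler for the autoscaler."""
    instance_id: str
    rejections: int        # gateway rejections since the last sample
    norm_load_1m: float    # 1-minute load average divided by core count
    norm_load_5m: float
    norm_load_15m: float

def summarize_cluster(samples):
    """Collapse per-instance samples into the cluster-level autoscaling signals."""
    n = len(samples)
    if n == 0:
        return {"rejecting_fraction": 0.0, "load_1m": 0.0, "load_5m": 0.0, "load_15m": 0.0}
    return {
        "rejecting_fraction": sum(1 for s in samples if s.rejections > 0) / n,
        "load_1m": sum(s.norm_load_1m for s in samples) / n,
        "load_5m": sum(s.norm_load_5m for s in samples) / n,
        "load_15m": sum(s.norm_load_15m for s in samples) / n,
    }
```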

Once capacity has been added, we incrementally revert the throttling coefficients to bring clusters back to a steady-state where they are neither throttling nor overloaded, satisfying our three throttling design pillars.

Autoscaling

For the first iteration of autoscaling, we borrowed some basic ideas from the Kubernetes Horizontal Pod Autoscaling algorithm. However, there were a few challenges. First, we wanted to scale on both rejection rate and load. Second, in our testing we saw oscillations in cluster sizes, and we wanted to remedy this with a new algorithm that was "stickier" and would not cause clusters to change size as frequently. In a query cluster, there is a monetary cost to removing an instance from the active topology: as we discussed earlier in the state transition diagram, a removed instance needs to quiesce and drain all of its jobs before returning to the free pool. If we oscillate too much and fill/empty clusters too often, we end up using more instances that are not taking customer load, likely causing free-pool churn, and we spend a higher percentage of time warming caches relative to execution time on warm instances.

The Horizontal Pod Autoscaling algorithm does not seem to run into this problem in most Kubernetes clusters. When applied to Snowflake query clusters, however, the window for our "optimized state" becomes tighter and tighter as the cluster size increases, so we become more likely to scale on minor deviations in the normalized load average (load per core). Below is a graph of the scaling decisions that would be made by the Kubernetes Horizontal Pod Autoscaling algorithm based on load average. The red section is the "cluster size stability" region: as long as the load stays within that window, we do not change the number of instances in the cluster. As we add more instances, the window becomes tighter.

Figure 2. Scaling decisions for the Kubernetes Horizontal Pod autoscaling for load average

We showcase some of these issues with the Kubernetes autoscaling approach in the figure above. The desired number of instances in the cluster is calculated using the following formula:

desiredReplicas = currentReplicas × ( currentMetricValue / desiredMetricValue ), rounded to an integer

This makes a lot of sense at low values of currentReplicas, but it becomes unstable at higher values. For example, suppose currentReplicas = 30 and the desiredMetricValue for load is 1.0: if currentMetricValue drops to 0.98, we will scale in by one instance. Since we are looking at load for dynamic workloads, we did not find this level of sensitivity acceptable. In addition, removing an instance is quite costly in our case, since there may be significant quiesce time. So we decided to establish a window that acts as a filter in front of the Kubernetes horizontal pod autoscaling algorithm: we evaluate currentMetricValue, and if it falls within an acceptable window (say, 0.9 to 1.0) we do nothing; only if it exits that window do we use the pod autoscaling algorithm to resize. Finally, to scale out we only check whether the 1-minute load average is high, while to scale in all three load averages (1-minute, 5-minute, and 15-minute) have to be low.
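
Putting these pieces together, a simplified version of the resulting decision logic might look like the sketch below. The window bounds, quorum fraction, and rounding choices are illustrative assumptions rather than our exact production values:

```python
import math

def desired_cluster_size(current_size, load_1m, load_5m, load_15m,
                         rejecting_fraction, target=1.0,
                         window=(0.9, 1.0), quorum=0.5):
    """Return the new desired instance count for a GS cluster.

    load_*             -- cluster-aggregate normalized load averages (per core)
    rejecting_fraction -- fraction of instances whose throttlers are rejecting work
    """
    low, high = window

    # Scale out if a quorum of instances is throttling/rejecting, or if the
    # short-term (1-minute) aggregate load is above the acceptable window.
    if rejecting_fraction >= quorum or load_1m > high:
        return max(current_size + 1,
                   math.ceil(current_size * load_1m / target))

    # Scale in only when nothing is being rejected and all three load horizons
    # (1, 5 and 15 minutes) sit below the window; this "stickiness" avoids
    # resizing on short dips.
    if rejecting_fraction == 0 and all(l < low for l in (load_1m, load_5m, load_15m)):
        return max(1, math.ceil(current_size * load_15m / target))

    # Inside the stability window (or the signals disagree): keep the size as-is.
    return current_size
```

Using the slowest-moving (15-minute) load to size a scale-in is one way to keep shrinking conservative; the exact inputs and thresholds are tuning parameters.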

Our first version of autoscaling has been very effective, but we have faced some interesting challenges: recurring bursty workloads followed by long periods of idleness, and random bursts (i.e., bursts distributed randomly and non-uniformly over a time window). The query mix also plays a big role: some queries are short-lived, while others are long-running and can occupy a GS node for hours. When the algorithm scales in and "releases" a host, the host still counts against the total cost of the system until either (a) all of its jobs complete or (b) the system does a subsequent scale-out and the host is returned to active service.

Autoscaling and Dynamic Throttling Results

Example 1: Snowflake internal analysis cluster

Autoscaling and Dynamic Throttling are synergistic and were therefore rolled out together in version 5.12. When we enabled the features on our internal data analysis account, the cluster scaled in: the figure below shows the cluster instance count in the bottom chart. The cluster was reduced to two instances once the scaling framework recognized that we did not need the compute resources of all four.

Figure 3. Instance count reduction

In figure 4 below, two of the lines, each representing an instance, drop to serving 0 queries per second as those instances are removed from the topology. The queries per second and concurrent requests of the remaining two instances increase, since the same overall cluster throughput must now be maintained with two instances.

Figure 4. Two instances take on cluster throughput

With dynamic throttling, the throttling coefficients were expanded automatically, providing additional gateway space for queries we could safely handle. This removed any manual work to scale instances or gateway limits, making our platform more resilient. Both the instance and account gateways were expanded, which allowed each individual instance to sustain higher throughput.

Figure 5. Concurrency increase

The system now recognizes when clusters are overprovisioned and reduces the number of instances to optimize resource usage. Performing this workflow manually has become infeasible as Snowflake deployments continue to grow exponentially. Furthermore, gateway limits that were previously adjusted by on-call engineers to reduce load are now self-managed on every single instance.

Example 2: Customer Query Concurrency increased

Many of our customers' clusters in production had a similar story, running into the previously static gateway limits. After the rollout, they were able to increase their total concurrency, and their rejection rate was reduced.

Figure 6. Customer Concurrency increase

The figure above shows a customer's workload. The top chart shows our account coefficient, which in this case is greater than 1 when we expand the customer's gateways. After the rollout around 3 pm, the rejection rate (bottom chart) was reduced and concurrent requests increased (center chart). You will notice that we did not raise the coefficient so high as to eliminate the rejections entirely.

Example 3: Noisy Neighbor

Oftentimes we also witness "noisy neighbor" problems: a customer in a multi-tenant cluster begins a massive workload that starves the other accounts in the cluster of resources. With autoscaling and dynamic throttling deployed, we are able to scale the cluster very quickly. The figure below shows an example of a cluster we scaled up dramatically to handle the increased workload. The cluster size chart reflects the time the scaling decision was made, not when the new VMs actually warmed their caches and joined the topology. Prior to the noisy neighbor, traffic was low enough that the autoscaling system had reduced the instance count to two or three. The successful-queries-per-cluster chart shows the effective throughput after the autoscaling system increased the number of instances: we were effectively able to quadruple the throughput of the cluster almost immediately. As a side effect of this project, we launched a number of initiatives to speed up the time required for an instance to enter the topology.

Figure 7. Scaling to mitigate Noisy Neighbor

Example 4: Deployment load

Finally, one of the major goals of the Dynamic Throttling project was to reduce the cases of high CPU load that instances encounter. After rolling out Dynamic Throttling, the 5-minute CPU load average charts dropped noticeably; in particular, almost no nodes had a 5-minute CPU load higher than our target threshold of 1.

Figure 8. Deployment Wide Load reduction

In the figure above, the rollout occurs as the coefficient spreads, around 6:30 pm. The 5-minute CPU load average chart, showing the load average for every individual node in the deployment, no longer has cases where the load exceeds our target threshold. We observed this effect across all of our deployments, with a few exceptions of particular instances that still managed to exceed the target. The end result has been a decrease in isolations based on high load average (i.e., a significant reduction in the false positives that remove instances from the topology due to suspected bad instance health).

Conclusion

Our Elastic Cloud Services infrastructure now scales to match the elasticity and volume required by our ever-growing consumption. In this blog post, we described some of our dynamic throttling and horizontal autoscaling work in response to CPU load changes, and showed experimentally how we were able to increase the utilization of our instances, increase customers' query concurrency, and reduce issues caused by noisy neighbors. In a follow-up post, we plan to discuss in more detail how we improve ECS performance with memory throttling and a combination of automatic horizontal and vertical scaling.

Come join us!

Please reach out to us if you have any questions or feedback. We are also looking for talented individuals and technical leaders to help us grow our infrastructure.
