Turbo-charge Snowflake Query Performance by Avoiding Spilling to Storage

This article will demonstrate how a simple Snowflake virtual warehouse configuration change can turbo-charge your query performance, with queries running over a hundred times faster and 40% cheaper.

Plot spoiler — it involves spilling to storage.

What is Spilling to Storage?

The diagram below illustrates the internals of a Snowflake X-Small Virtual Warehouse, which includes eight CPUs, memory, fast SSD local storage and remote cloud storage.

Snowflake Virtual Warehouse Internals

Spilling occurs when Snowflake needs to sort data, including executing an ORDER BY, but more often because of a GROUP BY operation.

Initially, Snowflake attempts to complete the sort operation in memory, but once that’s exhausted, data is temporarily spilled to local storage. For huge sorts, even the local SSD can be used up, in which case data will be spilled to remote cloud storage.

The key point is main memory access is much faster than SSD, which is considerably faster again than cloud storage. To put this in context, if a single CPU cycle took one second, access to main memory takes around 6 minutes, access to SSD around three days and access to cloud storage up to a year.

In summary, spilling to storage can have a significant impact on query performance. Worse still, a slow running query keeps your virtual warehouse running longer which in turn leads to higher Snowflake costs.

Benchmarking Spilling to Local Storage

To demonstrate the impact of spilling, I executed the same query against a range of virtual warehouse sizes from X-Small to X4-Large and recorded the results. The query below fetches and sorts over 28 billion rows, sorting around a terabyte of data.

select ss_sales_price
from snowflake_sample_data.tpcds_sf100tcl.store_sales
order by ss_sales_price;

The table below shows the improvement in execution time as we increased the warehouse size. It shows a speed improvement of over 140 times from 7.5 hours to 3 minutes while the cost remained the same, at around the $20 level.

The graph below illustrates the reduction in execution time as the warehouse size is increased. It again demonstrates run time is reduced by half each time.

This is consistent with previous tests which show that, provided the query is large enough, increasing the warehouse size produces the same results twice as fast for the same cost.

Benchmarking Spilling to Remote Storage

As described above, spilling to remote storage is even slower but on a large virtual warehouse, it needs a monster query. The SQL statement below was executed on varying warehouse sizes and the results recorded.

SELECT ss_sales_price
From snowflake_sample_data.tpcds_af100tcl.store_sales
order by SS_SOLD_DATE_SK, SS_SOLD_TIME_SK, SS_ITEM_SK, SS_CUSTOMER_SK,
SS_CDEMO_SK, SS_HDEMO_SK, SS_ADDR_SK, SS_STORE_SK, SS_PROMO_SK,
SS_TICKET_NUMBER, SS_QUANTITY;

The table below shows the results of running the above query, which sorts ten terabytes of data on varying warehouse sizes. It shows the query running over a hundred times faster on an X4LARGE while the query cost fell from over $150 to $95, a 40% reduction.

One interesting point worth noting, is that when the execution difference is greater than 100%, this indicates the query cost is cheaper at each step as the charge rate is doubled each time, but the query runs more than twice as fast. Clearly spilling to remote storage adds to both run time and cost as the query cost levels out after an X2LARGE as remote spilling is eliminated.

Best Practices to Reduce Snowflake Spilling

Effectively, there are two approaches to reduce spilling to storage on Snowflake:

  1. Reduce the volume of data sorted.
  2. Increase the warehouse size.

My article on tuning Snowflake query performance includes several techniques to reduce the volume of data fetched, which will in turn, reduce the number of rows sorted and help eliminate spilling to storage.

One of the easiest ways of improving query performance and avoiding spilling is to include a LIMIT clause in the query. To demonstrate this, the following query was executed without and then with a LIMIT clause, which returned the top 10 entries or all entries.

select * 
from store_returns
order by sr_reason_sk
limit 10;

The screenshot below shows that including a LIMIT improved query performance by over 60 times from over two hours to under two minute. That's a remarkable performance improvement from such a simple change.

It’s worth noting that the LIMIT clause was pushed down into the query, halved the partitions scanned, and eliminated both local and remote spilling. Finally, both queries were executed on an X-Small warehouse, demonstrating the potential cost saving of this simple change

How do you identify Spilling to Storage?

The query below will quickly identify each warehouse size and extent of spilling to storage. As this is based upon query-level statistics, it is possible to drill down to identify specific SQL.

select warehouse_name
, warehouse_size
, round(avg(total_elapsed_time)/1000) as elapsed_seconds
, round(avg(execution_time)/1000) as execution_seconds
, count(iff(bytes_spilled_to_local_storage/1024/1024/1024 > 1,1,null)) as count_spilled_queries
, round(sum(bytes_spilled_to_local_storage/1024/1024/1024)) as local_gb
, round(sum(bytes_spilled_to_remote_storage/1024/1024/1024)) as remote_gb
from Snowflake.account_usage.query_history
where warehouse_size is not null
group by 1, 2
having local_gb > 1
order by six desc;

Conclusion

Clearly on Snowflake, spilling to local storage can have a significant impact upon query performance and any query that spills more than a gigabyte of storage is a candidate to run on a larger warehouse.

Typically, if a query includes significant spilling to storage it will often execute twice as fast for the same cost on the next size virtual warehouse. That means you're getting faster performance at zero additional cost.

Spilling to remote storage impacts both performance and often increases query cost as Snowflake charges per second for each virtual warehouse, and spilling to remote storage extends the query elapsed time. Eliminating remote storage spilling often reduces query. While this may seem a paradox, the results shown here are consistent with my personal experience.

Be aware however, that spilling to local storage is only important when it exceeds the gigabyte level, although remote spilling should always be avoided.

Disclaimer: The opinions expressed on this site are entirely my own, and will not necessarily reflect those of my employer.

This article was first published as Snowflake Spilling to Storage in Plain English.

--

--