Faster and Cheaper: A Quick Recap of Snowflake’s Latest Automatic Clustering Improvements

Here at Snowflake, we’re on a continuous quest to improve efficiency and query performance. Earlier this year, we published a blog post about some recent optimizations to how we store and access Snowflake micro-partitions. We’re excited to share three new improvements to Automatic Clustering that build on top of these optimizations. As of June 2024, all of these improvements have been released to all Snowflake accounts.

Automatic Clustering can be used to optimize tables where queries repeatedly filter, aggregate, or join on the same columns. Users can specify a clustering key for their tables and Snowflake automatically keeps the data in a well-clustered state. With our latest improvements, Automatic Clustering is now over 10% more cost-effective!

As part of our optimization efforts, we focused on two ways to make Automatic Clustering more efficient: faster clustering job execution and a smarter clustering algorithm. We’ll break our latest optimizations into these two categories.

Faster Clustering Job Execution

Smarter Job Scheduling

With smarter job scheduling, Automatic Clustering now makes more efficient use of the underlying infrastructure. In an earlier blog post, we showed how Snowflake selects overlapping micro-partitions for clustering. These micro-partitions are organized into batches, each of which maps to a separate segment of the data that needs to be sorted. We’ve improved clustering execution by choosing more intelligently which batches run together within the same clustering job.

The result is a lower credit cost for running Automatic Clustering!

Smarter Clustering Algorithm

Improved Efficiency for Workloads With a High Volume of Concurrent Queries That Modify Data

Automatic Clustering utilizes optimistic locking. This means that when internal clustering jobs are run, they proceed optimistically and assume that none of the micro-partitions that are being sorted will be modified by any concurrent customer queries.

This approach has the advantage of not disrupting customer workloads. One potential downside is that if any of the micro-partitions being sorted are modified by a customer query within the same time window, the internal clustering job must roll back. The clustering job is retried successfully at a later time, but the discarded work makes clustering less efficient overall.
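To make the optimistic approach concrete, here is a minimal toy sketch in Python. It is not Snowflake’s implementation: the `MicroPartition`, `Table`, `start_clustering_job`, and `try_commit` names, and the per-partition `version` counter, are all assumptions made for illustration. The job records a version for each input micro-partition, sorts without taking locks, and commits only if none of those versions changed in the meantime.

```python
from dataclasses import dataclass, field


@dataclass
class MicroPartition:
    rows: list
    version: int = 0  # hypothetical counter, bumped on every modification


@dataclass
class Table:
    partitions: dict = field(default_factory=dict)  # id -> MicroPartition

    def modify(self, pid, rows):
        """Simulate a customer query rewriting one micro-partition."""
        part = self.partitions[pid]
        part.rows = rows
        part.version += 1


def start_clustering_job(table, pids):
    """Snapshot versions and compute the sorted output without taking locks."""
    snapshot = {pid: table.partitions[pid].version for pid in pids}
    merged = sorted(r for pid in pids for r in table.partitions[pid].rows)
    return snapshot, merged


def try_commit(table, snapshot, merged):
    """Commit only if no input partition changed since the snapshot."""
    if not snapshot:
        return True  # nothing to do
    for pid, version in snapshot.items():
        if table.partitions[pid].version != version:
            return False  # conflict: discard the work; the caller retries later
    pids = list(snapshot)
    size = -(-len(merged) // len(pids))  # ceiling division
    for i, pid in enumerate(pids):
        part = table.partitions[pid]
        part.rows = merged[i * size:(i + 1) * size]
        part.version += 1
    return True
```

In this toy model, a call to `table.modify(...)` between `start_clustering_job` and `try_commit` makes the commit return `False`, which mirrors the rollback-and-retry behavior described above.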

With our latest improvements, Automatic Clustering now uses advanced heuristics to intelligently predict which micro-partitions are most likely to be modified next. For example, newer data is often much more likely to be modified than older data, so deferring clustering of newer micro-partitions for a short period of time can dramatically improve efficiency for workloads with a high volume of concurrent queries that modify data.
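The deferral idea can be sketched as a simple age filter. This covers only the one heuristic named above and is not the actual prediction logic; the `Candidate` type, the `split_by_age` function, and the 600-second default threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    partition_id: str
    created_at: float  # seconds since epoch (assumed representation)


def split_by_age(candidates, now, defer_seconds=600.0):
    """Return (ready, deferred) candidates using a minimum-age threshold.

    Partitions younger than `defer_seconds` are held back on the assumption
    that recent data is more likely to be modified again soon.
    """
    ready, deferred = [], []
    for c in candidates:
        if now - c.created_at >= defer_seconds:
            ready.append(c)
        else:
            deferred.append(c)
    return ready, deferred
```

Deferred candidates are not dropped; they would simply become eligible on a later pass once they are old enough.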

Many clustered tables saw significant improvements here, and in some cases, the efficiency gains were over 50%!

Improved Efficiency for Tables Populated With Snowpipe

In the past, Snowpipe and Automatic Clustering each performed background micro-partition consolidation independently. Background micro-partition consolidation (pictured below) is a process that maintains an optimal micro-partition size without interfering with query execution.

With our latest improvements, we’ve unified background micro-partition consolidation for both services and optimized it for performance. The result: Automatic Clustering runs more efficiently on tables populated with Snowpipe!
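As a rough illustration of what consolidation means, the sketch below greedily groups consecutive small micro-partitions into merge groups that stay at or below a target size. The `plan_consolidation` function, the greedy strategy, and the size units are assumptions for illustration only; the post does not describe the actual consolidation algorithm.

```python
def plan_consolidation(sizes, target):
    """Greedily group consecutive partition sizes into groups of at most `target`.

    A single partition larger than `target` ends up in a group by itself.
    """
    groups, current, total = [], [], 0
    for size in sizes:
        # Close the current group if adding this partition would exceed the target.
        if current and total + size > target:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


# Example: six small partitions, with a hypothetical target size of 16 units.
plan = plan_consolidation([4, 6, 5, 12, 3, 2], target=16)
```

Each inner list in `plan` represents a set of small micro-partitions that would be rewritten as one larger micro-partition.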

You can get started with Automatic Clustering and improve query performance by optimizing your storage.
