A Goldilocks Story: How Snowflake automatically optimizes micro-partition size for “just right” query performance

This is the third blog in a series covering Snowflake’s workload optimization features. Read the second blog in the series, which covered the Query Acceleration Service, and our past blog on Automatic Clustering. In this blog, we’ll walk through a few of our recent optimizations for storing and accessing data in Snowflake micro-partitions. Over the past year, we have released a wide range of performance improvements in this space. Specifically, we’ve focused on fine-tuning micro-partition size and, in doing so, have improved query performance.

But first, let’s cover some background. All data in Snowflake tables is automatically divided into micro-partitions.

Unlike static partitioning in traditional data warehousing offerings, Snowflake micro-partitions do not need to be explicitly defined or maintained by users. We store metadata for each micro-partition, which includes:

  • The range of values for each of the columns in the micro-partition
  • The number of distinct values in each column
  • Additional properties used for both optimization and efficient query processing

This metadata allows Snowflake to prune micro-partitions that it doesn’t need to scan (e.g., if a micro-partition contains a range of values that is entirely filtered out by a query). Additionally, if no rows within a micro-partition are modified by incoming Data Manipulation Language (DML) statements, the micro-partition doesn’t need to be rewritten by the DML. This allows for efficient DML operations.
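To make the pruning idea concrete, here is a minimal sketch in Python. Snowflake’s internal metadata structures are not public, so the class, field names, and pruning function below are all invented for illustration; the only point is that a query’s filter range is compared against each micro-partition’s stored min/max so non-overlapping micro-partitions are skipped without being scanned.

```python
from dataclasses import dataclass

@dataclass
class MicroPartitionMeta:
    """Hypothetical per-micro-partition metadata for a single column."""
    min_val: int  # smallest value of the column in this micro-partition
    max_val: int  # largest value of the column in this micro-partition

def prune(partitions, predicate_lo, predicate_hi):
    """Keep only micro-partitions whose [min, max] range overlaps the
    query's predicate range; all others are skipped entirely."""
    return [p for p in partitions
            if p.max_val >= predicate_lo and p.min_val <= predicate_hi]

parts = [MicroPartitionMeta(0, 99),
         MicroPartitionMeta(100, 199),
         MicroPartitionMeta(200, 299)]

# A query like WHERE col BETWEEN 150 AND 160 only needs the middle
# micro-partition; the other two are pruned using metadata alone.
survivors = prune(parts, 150, 160)
```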

Naturally, there are trade-offs between micro-partition sizes. Queries that scan a lot of data typically benefit from large micro-partition sizes: the fewer the micro-partitions, the faster the scan, even if the total number of rows is the same, because a smaller percentage of time is spent on micro-partition metadata operations. Additionally, compression is generally more efficient when there are more rows in each micro-partition. On the flip side, queries that prune down to a few micro-partitions or that touch a very small range (e.g., point lookups) work best on tables with a small micro-partition size.
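The full-scan side of this trade-off can be illustrated with a toy cost model. The overhead and per-row numbers below are invented, not Snowflake measurements; the sketch only shows that with a fixed per-micro-partition metadata cost, the same row count scans faster when it is packed into fewer, larger micro-partitions.

```python
def scan_cost_ms(total_rows, rows_per_partition,
                 meta_overhead_ms=5.0, per_row_ms=0.000001):
    """Toy model: total scan time = fixed metadata cost per
    micro-partition + a per-row processing cost (invented constants)."""
    n_partitions = -(-total_rows // rows_per_partition)  # ceiling division
    return n_partitions * meta_overhead_ms + total_rows * per_row_ms

rows = 100_000_000
tiny_layout = scan_cost_ms(rows, 10_000)      # 10,000 small micro-partitions
large_layout = scan_cost_ms(rows, 1_000_000)  # 100 large micro-partitions
# Same rows either way, but the large layout pays far less metadata overhead.
```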

Snowflake’s recent optimizations around micro-partition size balance these two extremes and strive for a “Goldilocks” state — where micro-partition size balances cost and performance across all database operations. There are two ways Snowflake achieves this: foreground micro-partition consolidation and background micro-partition consolidation.

Foreground micro-partition consolidation:

In DML operations such as COPY, INSERT, and MERGE, Snowflake automatically combines new small micro-partitions into more optimally sized micro-partitions, maintaining an optimal micro-partition size in the table. Systems without foreground micro-partition consolidation generate many micro-partitions for DML statements that add or modify only a small number of rows at a time, leading to an excessively small micro-partition size.

Foreground micro-partition consolidation example:
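As a rough sketch of the idea (Snowflake’s actual implementation is not public, and the row-count target below is invented), foreground consolidation means that at write time, small incoming row batches are packed together into micro-partitions near a target size instead of each small batch becoming its own tiny micro-partition:

```python
TARGET_ROWS = 1_000  # invented target micro-partition size, in rows

def write_with_consolidation(small_batches, target=TARGET_ROWS):
    """Pack many small row batches into fewer near-target-size
    micro-partitions as part of the write path."""
    partitions, current = [], []
    for batch in small_batches:
        current.extend(batch)
        if len(current) >= target:
            partitions.append(current)  # flush a full micro-partition
            current = []
    if current:
        partitions.append(current)      # flush the remainder
    return partitions

# 100 small inserts of 50 rows each: naively that would be 100 tiny
# micro-partitions; with consolidation it becomes 5 full-size ones.
batches = [[0] * 50 for _ in range(100)]
consolidated = write_with_consolidation(batches)
```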

Background micro-partition consolidation:

In addition to foreground micro-partition consolidation, Snowflake asynchronously analyzes all micro-partitions in the table and optimizes the micro-partition size. Background micro-partition consolidation is more flexible, can be decoupled from the frequency of foreground DML statements, and is more scalable.

Background micro-partition consolidation example:
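In the same hypothetical spirit, a background pass can look at the sizes of micro-partitions already in the table and merge runs of undersized ones, independently of any incoming DML. The thresholds below are invented for illustration:

```python
def background_consolidate(partition_sizes, min_size=500, target=1_000):
    """Merge consecutive runs of small micro-partitions into larger
    ones; micro-partitions already at a healthy size are left alone."""
    merged, run = [], 0
    for size in partition_sizes:
        if size >= min_size:
            if run:                 # flush any accumulated small run
                merged.append(run)
                run = 0
            merged.append(size)     # keep the healthy partition as-is
        else:
            run += size             # accumulate small partitions
            if run >= target:
                merged.append(run)  # flush once the run reaches target
                run = 0
    if run:
        merged.append(run)
    return merged

# Seven micro-partitions, several undersized, become four.
sizes = [1_200, 100, 100, 100, 900, 50, 50]
result = background_consolidate(sizes)
```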

These optimizations occur automatically and without any specific user action. Snowflake automatically tunes the optimal micro-partition size for workloads. Best of all, foreground and background micro-partition consolidation come at no extra cost.

We are always working to make sure we are delivering the best possible price/performance for customers, and automatic micro-partition maintenance is no exception. Optimizations in the past year included expanding the coverage of foreground micro-partition consolidation and introducing background micro-partition consolidation.

In aggregate, here were the performance improvements that we saw on queries which scanned impacted tables:

Reducing query execution time can lead to customer warehouses running for a shorter period, which unlocks warehouse capacity and saves credits. These improvements are tracked over time in the Snowflake Performance Index (SPI). All of the improvements covered in this blog were taken into account in the October 2023 SPI publication.
