Snowflake Clustering Demystified

Clustering is a powerful performance tool in Snowflake, but misunderstanding how it works or choosing an inappropriate clustering key can lead to excessive costs and missed opportunities for better performance. Let’s explore a few aspects of clustering!

Goal of Clustering

Let’s start with why we’re even considering clustering. The goal of clustering is to maximize performance while keeping costs manageable. This goal raises two main questions.

The first question is: the performance of what, exactly? There are many workloads, and often many important workloads, on any given table. Selecting either the most important workloads or the most important underperforming workloads is critical, and it is particularly hard to do programmatically; it nearly always involves human judgment. In larger organizations, this may mean cooperation between different teams.

The second question is what cost trade-offs are acceptable. The cheapest option may appear to be no clustering at all, since clustering always consumes compute resources in some way and therefore has a cost. However, the savings from a good clustering key speeding up queries may outweigh the cost of the clustering itself, which is why weighing that balance is critical.

Micro-Partitions

Snowflake automatically writes data to micro-partitions. Each micro-partition is an object in cloud object storage, targeted to hold about 16 MB of compressed data. Within a micro-partition, data is stored in a columnar format, making it possible to fetch only portions of the micro-partition, depending on which columns a query uses.

The partitioning of data is handled by Snowflake, and there are few syntactic or human actions that influence how the data is partitioned. This is in contrast to partitioning in many other data systems, which require more human decision making.

A wealth of additional information is available in the Snowflake documentation on Micro-Partitions & Data Clustering.

Partition Pruning

Achieving better performance with clustering relies on achieving better partition pruning, which happens at the query level. When the Snowflake optimizer can determine that no row on a micro-partition can possibly qualify for a query, Snowflake can skip reading that micro-partition entirely. This is called partition pruning. Partition pruning can lead to outstanding query performance, even on very large data sets, and consistent query performance as data sets grow.

The Snowflake optimizer decides which micro-partitions may qualify for a query using metadata. One of the pieces of metadata that Snowflake stores and leverages for partition pruning is the minimum and maximum value of every eligible column on each micro-partition. Using these values, the optimizer can determine whether the value it is searching for falls within a micro-partition’s minimum/maximum range. If it does not, the micro-partition can be pruned, meaning it is never read at all.
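You can see this pruning decision for a specific query, without running it, by using EXPLAIN. Its output includes a partitionsTotal column (the number of micro-partitions in the table) and a partitionsAssigned column (the number the optimizer will actually scan). A minimal sketch, using the same illustrative table and column we will work with below:

EXPLAIN
SELECT *
FROM my_table
WHERE size = 16;
-- Compare partitionsAssigned to partitionsTotal in the output;
-- a large gap between them indicates effective partition pruning.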

Visualizing this can be powerful, so let’s use some images to think through the implications of this. We will use an artificially small data set for illustrative purposes.

This is how our values may look when the table is not clustered:

Here we are representing the values of one column, size. Each stripe or band represents a micro-partition. Each triangle represents a value for the column. Every value in this column has a triangle. Real micro-partitions store all of the columns for a set of rows, but for simplicity’s sake, we’re focusing on just one column.

In an unclustered state, this table has ranges for the micro-partitions that overlap. If we were to query on a specific value for this column, that query might look something like this:

SELECT *
FROM my_table
WHERE size = 16;

The orange stripe here represents the query. In order to find rows with a value of 16, we have to scan all five micro-partitions, because 16 is between the minimum and maximum for this column on every micro-partition. Five micro-partitions is a lot of data to work with in order to only return two rows.

Let us instead consider clustering this table on this size column. When we do that, our table would look something more like this:

In this representation, similar values are grouped together on the same micro-partition, and we have a lot less overlap between micro-partitions.

The same query as before, looking for a value of 16, looks something like this:

SELECT *
FROM my_table
WHERE size = 16;

In this case, we have a table with the same data, but we are able to scan only one micro-partition, greatly reducing the I/O we have to do to satisfy queries with conditions on this column. The two rows with the value we were looking for were on the same micro-partition.

The work that clustering does narrows the range between the minimum and maximum values on each micro-partition. This also has the effect of minimizing the overlaps between the micro-partitions.

This is obviously a much-simplified representation, but hopefully it is useful for understanding clustering in Snowflake. In a real scenario, there are often thousands or millions of rows on a single micro-partition. A table can consist of hundreds, thousands, millions, or more micro-partitions. Reducing what we have to read from object storage can have a huge impact on query performance.
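On a real table, the width of those ranges and the degree of overlap can be inspected with the SYSTEM$CLUSTERING_INFORMATION function, which returns a JSON summary including average_overlaps and average_depth. A minimal sketch, with an illustrative table name, checking the clustering state for the size column:

SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(size)');

Lower values for average_overlaps and average_depth generally mean better pruning potential for queries that filter on that column.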

Performance Impact

The performance impact of partition pruning can be quite drastic. Reducing the number of files and the amount of data that must be processed to fulfill a query can reduce query time from hours to minutes, especially for selective queries on large data sets. Skipping irrelevant portions of the data set makes many queries faster end to end. Reducing the time spent on a query also reduces its cost, when using the same size of warehouse.
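After the fact, the effect of pruning on individual queries can be seen in the query profile or in query history, where partitions scanned is reported alongside partitions total. A minimal sketch against the ACCOUNT_USAGE query history view (the filter on query text is illustrative, and this view has some latency):

SELECT query_id, partitions_scanned, partitions_total, total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%my_table%'
ORDER BY start_time DESC
LIMIT 20;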

Natural Clustering

There are different ways of achieving clustering in Snowflake. Because Snowflake uses the minimum/maximum range on the micro-partitions for partition pruning, it is not technically required to define a clustering key to get the benefits.

Loading data in an ordered manner can also achieve narrow minimum/maximum ranges, and therefore the performance benefits of partition pruning. This is referred to as natural clustering. It often happens along a date/time dimension, without any additional effort, when data is loaded over time. However, there are some complications.

The top problem with relying on natural clustering is that, without auto-clustering, clustering is not preserved over time. Because micro-partitions are immutable, each DML operation writes new micro-partitions; updates and deletes build and register new ones. Without a defined clustering key, these new micro-partitions are written using a methodology that values correctness and speed, which does not preserve the natural clustering.

Manual Methods of Clustering

It is common to experiment with clustering and try different clustering keys. This often involves creating a copy of a table with a CREATE TABLE AS SELECT (CTAS) statement that includes an ORDER BY clause. An ORDER BY clause can be overkill, as it sorts the result set in the exact order of the chosen columns (or functions on columns) when our goal is simply to group similar values together. Even so, ordering data in this manner narrows the minimum/maximum ranges on the micro-partitions and enables partition pruning.
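For example, a clustered copy of a table for testing a candidate key might be created like this (the table and column names are illustrative):

CREATE OR REPLACE TABLE my_table_clustered AS
SELECT *
FROM my_table
ORDER BY size;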

This can absolutely be used as the primary method of achieving clustering. The OVERWRITE keyword on the INSERT statement can be used to rewrite a table in order. The syntax looks something like this:

INSERT OVERWRITE INTO tabname
SELECT *
FROM tabname
ORDER BY cluster_key;

This has the advantage of running more quickly than auto-clustering, and it may actually be cheaper for larger tables in a poor clustering state. However, other inserts against the table will be queued while the overwrite operation is running, waiting until the INSERT OVERWRITE completes before they can complete.

For tables of significant size, it often makes sense to use the largest warehouse you’re willing to use for this operation. Because of the amount of work involved, this is one area where larger warehouses are likely to actually be cheaper than smaller ones, particularly if a smaller warehouse would cause spilling to remote storage.

Most often, this methodology is used for testing and experimentation, but it can be used for long-term clustering as well. The main problem is, again, DML against the table: updates, inserts, and deletes build and register new micro-partitions, which do not respect the order specified in the original CTAS or the last INSERT OVERWRITE.

Using Auto-Clustering

Auto-clustering is a serverless feature with a number of advantages.

Auto-clustering runs over time, reclustering data as other actions on the table degrade its clustering. If data engineering or a manual method is used to initially cluster the data, and updates and deletes then occur, it makes sense to use auto-clustering to keep the data clustered as the clustering state becomes less ideal.
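Auto-clustering operates on the clustering key defined for the table, so a key must be defined, either at table creation or afterward. A minimal sketch, with illustrative table and column names:

ALTER TABLE my_table CLUSTER BY (size);

Once a clustering key is defined, automatic clustering is enabled for the table by default.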

One of the advantages of auto-clustering is that it runs in the background and does not block other activity on the table, in contrast to manually clustering a table with INSERT OVERWRITE, where other activity must wait. There is a potential disadvantage here: auto-clustering may perform some work and then be unable to commit the work it has performed. If update and delete activity on a table is particularly high, auto-clustering may need to be limited to running only at times when activity is lower so that it can complete. This can be accomplished by manually suspending and resuming auto-clustering on the table.
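Pausing and resuming the background work on a specific table is done with SUSPEND RECLUSTER and RESUME RECLUSTER. A minimal sketch, with an illustrative table name:

-- Pause background reclustering during a heavy DML window
ALTER TABLE my_table SUSPEND RECLUSTER;

-- Resume reclustering once activity quiets down
ALTER TABLE my_table RESUME RECLUSTER;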

Use the table function AUTOMATIC_CLUSTERING_HISTORY to understand how often automatic clustering is running and how much it is costing.
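For example, credits consumed by auto-clustering on one table over the past week could be checked like this (the fully qualified table name is illustrative):

SELECT start_time, end_time, credits_used, num_rows_reclustered
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
  DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
  TABLE_NAME => 'MYDB.MYSCHEMA.MY_TABLE'));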

Summary

Clustering in Snowflake is a powerful tool to improve performance. Understanding some of the details of how it works and how to achieve clustering can help you leverage clustering at a reasonable cost.

Ember Crooks

As a Principal Architect in the Field CTO office at Snowflake, I tackle performance problems our customers find, while educating myself and others.