A Data-Driven Approach to Choosing a Clustering Key in Snowflake
The goal of this article is to walk through some of the decision points and key metrics for choosing an appropriate clustering key for a Snowflake table.
To level-set with what clustering is and is not in Snowflake, please read Snowflake Clustering Demystified first. That article contains definitions and methodologies that are very useful when selecting a clustering key.
Goals of Clustering
The primary goal of defining a clustering key for a Snowflake table is to improve performance. A secondary goal is to balance that performance improvement with the cost of achieving it. There is a cost to auto clustering and to static clustering. Choosing a clustering key with an appropriate cardinality helps to balance the goals of good performance and reasonable cost.
Clustering Key Cardinality Targets
Snowflake will let a clustering key contain a large number of columns/elements. The most I’ve seen in my time as a Field CTO specializing in performance is 12! That was far too many by every measure I have. Our official documentation recommends no more than 3–4 elements in a clustering key, but even that is usually more than necessary.
Balancing the cardinality of the clustering key with the table size in micro-partitions is critical to balancing performance and cost. Cardinality in this case refers to the number of unique values for a clustering key.

While we often simulate clustering by ordering the data, ordering is actually more work than is needed. Ordering places the data in perfect order based on the key specified. In the case of a CREATE TABLE … AS (CTAS) statement, that includes ordering the rows exactly as they are placed into micro-partitions. Snowflake does not use, or even necessarily respect, any order within a micro-partition. This means that the real goal of clustering is to group data, not to order it.

For the best balance of performance and cost, all rows on a micro-partition would have the same value of the clustering key. When a micro-partition has the same value for the minimum and maximum of the clustering key, it is considered a constant micro-partition, and auto-clustering is less likely to do further work on it. The number of constant micro-partitions is reported in the output of SYSTEM$CLUSTERING_INFORMATION. It is nearly impossible to get 100% of partitions to a constant state with real-world data, so we often simplify this to the idea that the cardinality of the clustering key should be the same as or lower than the number of micro-partitions in the table.
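As a quick check of how close a table is to that grouped state, a sketch like the one below pulls both the total and constant micro-partition counts from SYSTEM$CLUSTERING_INFORMATION. The table and key names are placeholders; substitute your own.

WITH ci AS (
  SELECT PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION(
           'MY_DB.MY_SCHEMA.MY_TABLE',    -- placeholder table name
           '(MY_CANDIDATE_KEY)'           -- placeholder candidate clustering key
         )) AS info
)
SELECT info:total_partition_count::integer          AS TOTAL_PARTITIONS,
       info:total_constant_partition_count::integer AS CONSTANT_PARTITIONS
FROM ci;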
What happens when a high-cardinality clustering key is specified? The primary negative impact is cost, particularly if using auto-clustering. If a high-cardinality key is specified, auto-clustering tends to do more work to try to achieve that perfect ordering when only grouping is needed. Since auto-clustering’s work costs credits, this work can be expensive and may not have a proportional corresponding benefit.
If a clustering key is too low in cardinality, the main impact is that we don’t see the performance impact that we would expect from clustering. Imagine an extreme scenario where every row in our table has exactly the same value in a column. If we were to use that column alone for our clustering key, it would have no opportunity to provide a performance impact, because we could not prune any micro-partitions based on that column. While that’s an extreme scenario, it is not unusual to have columns that have only a dozen or fewer unique values across a huge data set. While these columns might be very useful in a multi-part clustering key, they alone may not provide the performance impact we are looking for.
One side note — the order of columns in the clustering key matters in several ways, but that is beyond the scope of this article. When in doubt, we tend to love a column or function on a column that relates to date/time on the leading edge (first element) of a clustering key.
Calculating Clustering Key Cardinality Targets Based on Data
It is important to take a data-based approach to the choices we make in Snowflake. The vast array of different customers and different data sets makes it nearly impossible to come up with a one-size-fits-all rule. Just within the last week, I’ve worked with customers on optimizing access to a table with 160 micro-partitions and to a table with 25 million micro-partitions. These table sizes require different approaches, but we can define a target for our clustering key cardinality that applies to both.
In general, we’re balancing the desire to be able to eliminate or prune as many micro-partitions as possible when querying the table, with the desire to reduce the amount of work that auto-clustering has to do. Our goal here is generally to end up with 1–10 micro-partitions with the same actual value for the clustering key, and to minimize the number of micro-partitions with minimum/maximum ranges that overlap. Those goals work for most table structures, though there are edge cases where another goal might be appropriate. Very large tables may be one edge case where the goal needs to be adjusted, depending on the data and target workload.
Micro-Partition Count
One of the key inputs here is the number of micro-partitions in a table. This metric is not surfaced in many places, but it can be queried from the SYSTEM$CLUSTERING_INFORMATION system function. Here is syntax that works well for returning the number of micro-partitions in a table:
SELECT PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION(
  '<table_name>',
  '(null)')
):total_partition_count::integer AS PARTITION_COUNT
;

Filling in a table name from one of the SNOWFLAKE_SAMPLE_DATA tables, that might look something like this:
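(The specific sample table from the original example isn’t reproduced here, so the version below uses SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.STORE_SALES purely as a stand-in.)

SELECT PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION(
  'SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.STORE_SALES',  -- stand-in sample table
  '(null)')
):total_partition_count::integer AS PARTITION_COUNT
;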
This operation may take a bit, depending on the size of your table. In my example here, using a medium warehouse, it took about 10 seconds. This is because the SYSTEM$CLUSTERING_INFORMATION function is performing work analyzing the table.
If the number returned here is exactly two million, you might also be hitting a default limitation of SYSTEM$CLUSTERING_INFORMATION. You can alternatively see the number of micro-partitions in the table by looking at the query profile of a query that accesses the table, clicking on the table scan box for the table and looking at the value for “Partitions total”.
Keep in mind that the number of micro-partitions can change with any activity on the table or even when a background process takes action, so small fluctuations in the number of micro-partitions should not be surprising. This is particularly true if small micro-partitions are created due to small DML operations.
In the example above, the function reported 72,718 micro-partitions. Since we are aiming for roughly 1–10 micro-partitions per distinct key value, we’d like our clustering key to have a cardinality at or below the number of micro-partitions in the table, or approximately between 7,271 and 72,718 unique values.
Clustering Key Cardinality
When looking at a potential clustering key, the easiest way to understand the cardinality is often through a query grouping by the clustering key. We’re most interested in the number of groups that query returns.
While we can estimate the cardinality of a clustering key by taking the cardinality of each element (column or function on a column) and multiplying them together, data correlation can mean that the actual cardinality is lower. For example, 365 dates multiplied by 50 regions gives an upper bound of 18,250 combinations, but if some regions only have data on some dates, the actual number of groups will be smaller.
Here’s an example of what that might look like on a table in the Snowflake sample data:
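(Again, the exact table and column aren’t reproduced here; SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.STORE_SALES and its SS_TICKET_NUMBER column are assumed stand-ins, chosen to be roughly consistent with the numbers quoted below.)

-- Count rows per candidate key value; the metric we care about is how many
-- groups (result rows) this returns, not the counts themselves.
SELECT SS_TICKET_NUMBER, COUNT(*) AS ROW_COUNT
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.STORE_SALES
GROUP BY 1;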
The number that we’re looking for is not the results of this query, but the number of rows returned. In the example above, that number is 2.4 billion. Depending on the data size, the queries to do this may take a while to return, and may require a decently sized warehouse.
This step is often done a number of times to explore the cardinality of different candidate clustering keys. For example, I might have determined that the cardinality was too high based on the data above, but I still want this column to be my clustering key. I then might try different functions on this column while trying to target the desired clustering key cardinality identified earlier (7,271 to 72,718):
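(Continuing with the same assumed stand-in column, TRUNC is one such function; truncating at the 100,000s place collapses billions of distinct ticket numbers into tens of thousands of buckets.)

-- TRUNC(<number>, -5) truncates to the nearest 100,000, grouping roughly
-- 100,000 consecutive ticket numbers into a single bucket.
SELECT TRUNC(SS_TICKET_NUMBER, -5) AS BUCKETED_KEY, COUNT(*) AS ROW_COUNT
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.STORE_SALES
GROUP BY 1;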
This output shows a cardinality that is much more appropriate for the number of micro-partitions in this specific table, while still respecting the properties of clustering keys that lead to partition pruning.
Process of Selecting a Clustering Key
The process of selecting a clustering key involves human judgment and human decisions. If it were possible to automatically select the perfect clustering key for you, Snowflake would probably already be doing it. In addition to the complexities of data models and data patterns, we have the involvement of different workloads with differing priorities. Clustering is a physical arrangement of data, and without additional materializations of that data, the data can be clustered in only one way. Sometimes the workload that is most important for performance is not the one that is the most expensive, the slowest, or the highest in query volume.
The process of choosing a clustering key looks something like this for each table:
- Identify workload or queries to target
- Establish baseline performance for the target workload
- Identify filtering predicates in the workload
- Select potential clustering keys with reasonable cardinality
- Create table copies clustered on each potential key
- Test query workload against clustered test tables and record results
- Compare test results with baseline and select a clustering key
- Implement clustering
- Periodically review performance and clustering actions
Let’s dig into each of those steps in more detail.
Identify Workload
The starting point is to identify a query or set of queries where we care most about the performance. The answer cannot be “everything”, without more advanced techniques like multiple materializations of the data. The choice of a workload is usually based on business factors. The choice might be to make the queries behind the CEO’s dashboard the priority. It might be to make the queries behind a specific application the priority. The choice might be to optimize for merge statements to reduce the latency for data landing in the table. The choice might be to speed up specific statements that users are complaining about. There are a lot of factors here.
Establish Baseline Performance for Workload
In the real world, we might even use a performance testing tool to help us, but the general idea here is that we run all the queries in our workload, interleaved, in a dedicated warehouse, to establish a performance baseline. We run the queries multiple times, preferably over the course of several days at different times, to ensure that transitory factors aren’t impacting our results. The primary measure we’re interested in here is the total execution time for each query. Depending on our performance goals, we might look at this several different ways for each query — average duration, maximum duration, median duration, the 90th percentile (p90), or whatever statistical measure makes the most sense to you. We look at total duration because clustering might change the balance of what operations make up our query execution time.
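One way to collect those numbers is to query the ACCOUNT_USAGE.QUERY_HISTORY view. The sketch below assumes the baseline runs were tagged with a hypothetical query tag ('clustering_baseline', set via ALTER SESSION SET QUERY_TAG) so they can be isolated; note that ACCOUNT_USAGE views lag real time.

-- Baseline statistics per query text, in seconds (TOTAL_ELAPSED_TIME is in milliseconds).
SELECT QUERY_TEXT,
       COUNT(*)                          AS RUNS,
       AVG(TOTAL_ELAPSED_TIME) / 1000    AS AVG_SECONDS,
       MEDIAN(TOTAL_ELAPSED_TIME) / 1000 AS MEDIAN_SECONDS,
       PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY TOTAL_ELAPSED_TIME) / 1000 AS P90_SECONDS,
       MAX(TOTAL_ELAPSED_TIME) / 1000    AS MAX_SECONDS
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TAG = 'clustering_baseline'   -- hypothetical tag applied during the baseline runs
  AND START_TIME >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY QUERY_TEXT;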
Identify Filtering Predicates
Once we have our workload and our baseline performance, we need to look at the predicates in the queries in the workload. Most of the time, the predicates we are looking for are very explicit in the query, but depending on the workload, we may also need to look at join predicates or predicates in view definitions.
Predicates are expressions that evaluate to either TRUE or FALSE. Filtering predicates are often specified in the WHERE clause of the query, but can also appear in HAVING or QUALIFY clauses. Predicates inside CASE expressions that are not part of a WHERE or JOIN clause can change what the output looks like, but they do not usually filter the output. Filtering predicates can also be specified in views that are referenced in the queries.
Filtering can also occur based on join predicates. For example, imagine that we have a table with only one customer ID in it, and we join that table to a second table that has a wide variety of customer IDs. Joining the two tables on customer ID significantly limits the number of rows returned from the second table. Partition pruning can happen based on join predicates, and the same effect occurs when a predicate applied to one table limits the number of rows returned from the second table.
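A small illustration of the two kinds of filtering described above, using hypothetical ORDERS and CUSTOMERS tables:

-- Direct filtering predicates in the WHERE clause; good candidates for partition pruning.
SELECT *
FROM ORDERS
WHERE ORDER_DATE >= '2024-01-01'
  AND CUSTOMER_ID = 42;

-- A join predicate that filters: the small, already-filtered CUSTOMERS set
-- limits which ORDERS rows (and micro-partitions) need to be read.
SELECT o.*
FROM ORDERS o
JOIN CUSTOMERS c
  ON o.CUSTOMER_ID = c.CUSTOMER_ID
WHERE c.REGION = 'EMEA';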
Select Potential Clustering Keys with Reasonable Cardinality
With the filtering predicates in hand, we then want to look at what kind of impact the predicates have for the queries involved. We’ll often get the largest impact for a query by using predicates that filter out the most rows. This involves understanding the distribution of data within the table and how that might change as data is added in the future.
For example, let’s say we have two predicates that are specified in nearly every query in our target workload, and both are looking for equality to a single value. One predicate is on a date column; the data is roughly uniformly distributed, and we have 365 days’ worth of data in the table. The other predicate is on a customer ID, and we have approximately 1,000,000 different customer IDs with about 100 rows for each. In this scenario, specifying a single customer ID filters out far more data (all but roughly one millionth of the rows) than specifying a single date (all but roughly 1/365th of the rows). Rarely are the details in the real world so cut and dried, but the concepts still apply.
The columns or functions in the predicates themselves may not get us to the cardinality target we’re looking for in a clustering key. We may need to apply a function to the columns involved to balance out the cardinality. This step in the process often requires a lot of querying of the table to understand not only which predicates filter out the most data, but also using GROUP BY to understand what the cardinality for each candidate column looks like, and what it looks like with functions applied to reduce cardinality.
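That exploration might look something like the sketch below, which uses a hypothetical SALES table with SALE_DATE and CUSTOMER_ID columns; on very large tables, APPROX_COUNT_DISTINCT is a cheaper alternative to COUNT(DISTINCT ...).

-- Compare the cardinality of raw candidate columns and of functions applied to them.
SELECT COUNT(*)                                      AS TOTAL_ROWS,
       COUNT(DISTINCT SALE_DATE)                     AS DISTINCT_DATES,
       COUNT(DISTINCT DATE_TRUNC('week', SALE_DATE)) AS DISTINCT_WEEKS,
       COUNT(DISTINCT CUSTOMER_ID)                   AS DISTINCT_CUSTOMERS,
       COUNT(DISTINCT TRUNC(CUSTOMER_ID, -3))        AS DISTINCT_CUSTOMER_BUCKETS
FROM SALES;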
Test Query Workload Against Clustered Tables
Once we have several options identified, it is best to actually test them. This is best done with a full copy of the data, if the expense of having a second copy for a short time is acceptable. One good way of testing this for smaller tables is to use a CTAS (CREATE TABLE … AS …) statement with an ORDER BY on the candidate clustering key. This kind of operation usually benefits from a larger warehouse; if we can avoid the sort spilling to remote storage, the CTAS is vastly faster and therefore cheaper.
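A sketch of such a test copy, with placeholder table and column names:

-- Build a test copy physically ordered on one candidate clustering key.
-- Run on a larger warehouse to help avoid the sort spilling to remote storage.
CREATE TABLE SALES_TEST_WEEK_CUSTOMER AS
SELECT *
FROM SALES
ORDER BY DATE_TRUNC('week', SALE_DATE), TRUNC(CUSTOMER_ID, -3);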
For larger tables, the full CTAS may not be an option due to cost or timing concerns. If a subset of the data is used, it is critical that the distribution of that data over both the candidate clustering key and any additional predicates in the queries in the workloads closely resembles the full table. We want to be careful about relying too heavily on the results we see from a subset of the data. Maybe a representative subset of the data works for comparison reasons, but a final test on the full data is an excellent idea before moving on to implementation.
Compare Test Results with Baseline and Select a Clustering Key
As you’re testing, be sure to follow testing best practices. Much like establishing the baseline in the first place, use multiple runs of each query over multiple days to judge the impact. Pay attention not just to the average (or whatever metric you’ve chosen to use), but also to the standard deviation; understanding variability in performance can be important. Use the numbers collected to decide which candidate key you would like to use for the table.
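A sketch of that comparison, assuming each candidate’s test runs were tagged with their own hypothetical query tags:

-- Compare baseline and candidate runs side by side.
SELECT QUERY_TAG,
       COUNT(*)                          AS RUNS,
       AVG(TOTAL_ELAPSED_TIME) / 1000    AS AVG_SECONDS,
       STDDEV(TOTAL_ELAPSED_TIME) / 1000 AS STDDEV_SECONDS
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TAG IN ('clustering_baseline', 'candidate_week_key', 'candidate_customer_bucket')
GROUP BY QUERY_TAG;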
Implement Clustering
Implementing or changing a clustering key is different depending on the size of the table. For particularly large tables, reach out to your Sales Engineer for advice to reduce cost. Auto-clustering is enabled by defining a clustering key on the table and allowing the auto-clustering to run in the background. Clustering for a table can be suspended or resumed, so be sure that you haven’t previously suspended clustering for an existing table.
Allowing automatic clustering to do all the work can be less optimal and more expensive than giving it a good place to start. Depending on the table size, that starting point can be created with an INSERT OVERWRITE statement that includes an ORDER BY clause aligned with your clustering key. This operation usually benefits from the largest warehouse size you are willing to use; smaller warehouses can lead to remote spilling, which can lead to dramatically slower execution times. Auto-clustering may still perform some initial work afterward, but it is likely to be less.
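Putting those pieces together might look like the sketch below. INSERT OVERWRITE, ALTER TABLE ... CLUSTER BY, and ALTER TABLE ... RESUME RECLUSTER are standard Snowflake statements, while the table name and key expressions are placeholders:

-- One-time initial ordering so auto-clustering starts from a good place.
-- Use a large warehouse to reduce the chance of remote spilling during the sort.
INSERT OVERWRITE INTO SALES
SELECT * FROM SALES
ORDER BY DATE_TRUNC('week', SALE_DATE), TRUNC(CUSTOMER_ID, -3);

-- Define the clustering key so auto-clustering maintains the grouping going forward.
ALTER TABLE SALES CLUSTER BY (DATE_TRUNC('week', SALE_DATE), TRUNC(CUSTOMER_ID, -3));

-- Only needed if clustering was previously suspended on this table.
ALTER TABLE SALES RESUME RECLUSTER;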
There is a table size at which the use of INSERT OVERWRITE or CTAS with an ORDER BY clause will not perform well. At the time of this article’s creation, that line is probably somewhere between 100,000 and one million micro-partitions. Reach out to your Sales Engineer for help with strategies to efficiently implement clustering at these larger sizes.
Periodically Review Performance and Clustering Actions
Clustering is not a “set it and forget it” exercise. While Snowflake will indeed keep a table clustered over time, data patterns, query patterns, or the definition of the most important workload can change over time. It is important to periodically revisit the clustering key to understand whether those variables have changed enough to require something different.
Summary
Clustering is an incredibly powerful tool in Snowflake for improving performance: it allows Snowflake to simply not read data that does not qualify for a query. There is a cost to clustering, and balancing that cost against the performance gains is how you get the best of both worlds. Hopefully the methodical, data-driven approach presented here is helpful in a comprehensive Snowflake performance journey.

