Tips for Clustering in Snowflake

One of the most powerful tools for Snowflake users to gain performance and efficiency is clustering. Data that is well clustered can be queried faster and more affordably due to partition pruning. Partition pruning allows Snowflake to skip data that does not pertain to the query based on statistics of the micro partitions.

New users often make assumptions about clustering that are incorrect, providing no performance benefit while wasting credits. This quick write-up will hopefully help you understand what clustering is on Snowflake, why it is important, how to pick cluster keys, and what not to do.

What is Clustering on Snowflake?

Clustering keys on Snowflake are metadata on a table which is used by the Automatic Clustering service. The clustering keys are configured during create/alter of the table. Clustering in Snowflake relates to how rows are co-located with other similar rows in a micro partition. Snowflake does not shard micro partitions to only store one set of cluster key values, but rather a micro partition will often have a set of contiguous values.

When data is inserted into an existing table, the micro partitions are split based on the volume of data and in the order it was received regardless of the clustering configuration on the table. If a new table is created via a CTAS and has cluster keys, it will automatically be clustered.

When clustering is defined, the Automatic Clustering service will, in the background, use that information to rewrite micro partitions to group rows with similar values for the clustering columns in the same micro partition.

To see how this works, let's insert some data and look at what is happening. First, create the table by running the following commands, which will generate enough random data to create multiple micro partitions in a table with clustering keys defined.

CREATE DATABASE CLUSTERING_TIPS;

CREATE SCHEMA CLUSTERING_TIPS;

CREATE OR REPLACE TABLE CLUSTERING_DEMO (
ID NUMBER(10,0),
TENANT_ID NUMBER(4,0),
KEY VARCHAR(4),
VALUE VARCHAR(255)
) cluster by (TENANT_ID);

INSERT OVERWRITE INTO CLUSTERING_DEMO
SELECT
seq4() AS ID,
uniform(1, 1000, random()) AS TENANT_ID,
randstr(4, random()) AS KEY,
randstr(255, random()) AS VALUE
FROM table(generator(rowcount => 250000));

We now have some data in the table. To prove that data is inserted in the order it arrives (in this case, completely random), we should now see that querying all data for a single tenant still does a full table scan.

Run this query and open the Query Profile Statistics.

SELECT * FROM CLUSTERING_DEMO WHERE TENANT_ID = 1;

As you can see, there are 3 micro partitions in the table and all 3 were scanned to return the result. If you don't see a full scan, make sure you run this soon after the table is created, or the Automatic Clustering service will have already impacted the results.

Now you can set a timer to come back in several hours and run the same query; the Automatic Clustering service only runs every several hours. After Automatic Clustering has run on this table, you will see that the query only scans 1 of the 3 partitions.
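
If you would rather not wait blindly, you can check how well the table is clustered at any point with the SYSTEM$CLUSTERING_INFORMATION system function. A minimal sketch against the demo table (this check is an addition to the walkthrough above, not required for it):

-- Returns a JSON document describing clustering health for the given key,
-- including fields such as total_partition_count and average_depth.
SELECT SYSTEM$CLUSTERING_INFORMATION('CLUSTERING_DEMO', '(TENANT_ID)');

-- The SHOW TABLES output also includes an automatic_clustering column showing
-- whether the background service is active for the table.
SHOW TABLES LIKE 'CLUSTERING_DEMO';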

If you're impatient like me, you may want to change how the table is created to sort the data by the tenant identifier, which lands it in a way that allows for more efficient queries.

This is being done for demonstration purposes; it is something to do with care on large datasets.

CREATE OR REPLACE TABLE CLUSTERING_DEMO_SORTED (
ID NUMBER(10,0),
TENANT_ID NUMBER(4,0),
KEY VARCHAR(4),
VALUE VARCHAR(255)
) cluster by (TENANT_ID);

INSERT OVERWRITE INTO CLUSTERING_DEMO_SORTED
SELECT
seq4() AS ID,
uniform(1, 1000, random()) AS TENANT_ID,
randstr(4, random()) AS KEY,
randstr(255, random()) AS VALUE
FROM table(generator(rowcount => 250000))
ORDER BY TENANT_ID;

Now run the same query and look at the Query Profile Statistics to see how it eliminated ⅔ of the data.

SELECT * FROM CLUSTERING_DEMO_SORTED WHERE TENANT_ID = 1;

The reason this is now able to eliminate most of the data is because the data for the tenant only existed in a small number of micro partitions (1 in this case). Snowflake knows this because it calculated and stored metadata for every micro partition and that data is queried during compile time to eliminate as much data as possible. The metadata used in this query would have been the MIN and MAX values of the TENANT_ID on each micro partition.

When micro partitions contain random data spanning a large range of values in a field, there is no way to eliminate them during query execution, which means the query will need to scan more micro partitions, taking more compute and requiring larger warehouses to complete in a timely manner.

Why is Clustering Important?

Many Snowflake workloads have a set of queries which run frequently. These often have common shapes and filter and group by the same fields. For example, data that is often queried by a range of dates can be clustered by the date so queries read a smaller set of micro partitions. In this case, by clustering the data well, queries can be pruned, which allows them to run on smaller warehouses, execute faster, and save credit consumption!

Yes, we can spin up thousands of cores and query a massive dataset that’s sitting in blob store fairly quickly which is impressive. That’s just throwing money at the query. I don’t mind for research and occasional ad-hoc queries because it’s not worth the effort and time to optimize, but it shouldn’t be done for queries that are expected, frequent, AND can be optimized.

How to Pick Cluster Keys?

Cluster keys should be fields that are often filtered on. In the case above, I chose tenant_id because I see a lot of multi-tenant tables and often those could really gain an advantage with clustering. Time Series data is also often clustered by a record DATE. You’ll need to know the shape of the data and the queries pretty well to determine the best cluster keys.

If I’m looking into how to cluster a table and don’t know enough about the shape or common queries, I will often look at recent query history filtered down to those queries landing on the table. If there is a very common filter, that’s a great place to start. If there is always one filter, and occasionally another, that pair is what you want to use.
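
As a rough, hedged sketch of that approach: pull recent query text and pruning statistics for queries that touch the table from ACCOUNT_USAGE (the 14-day window and the table name filter below are placeholders, and note that ACCOUNT_USAGE views lag real time):

-- Find the heaviest recent queries against the table, then eyeball the WHERE
-- clauses for the most common filter columns.
SELECT query_text,
total_elapsed_time / 1000 AS elapsed_s,
partitions_scanned,
partitions_total
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -14, CURRENT_TIMESTAMP())
AND query_text ILIKE '%CLUSTERING_DEMO%' -- substitute your table name
ORDER BY total_elapsed_time DESC
LIMIT 100;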

At this point everyone asks: what do you do if you have two different fields that are each filtered on separately, and often? You have to pick one or the other. The keys are hierarchical, so consider the relationship of the data for similar values of the first field to determine whether this is productive. This is also related to how we store data in blob storage: for another query pattern to also prune well, the data has to be persisted differently. If you need to solve this for multiple query patterns (this is pretty rare), you'll want to look at materialization so another copy of the data can be clustered in a way that supports all query patterns. This should ONLY be done when you have a huge number of queries with these predicates. If it's only a small number, you'll end up spending more on engineering time and compute costs maintaining the materialization than you would have by throwing bigger warehouses at the queries.
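
If you do end up needing that second physical ordering, one hedged sketch (hypothetical table and column names, and note that materialized views with clustering are an Enterprise Edition feature with their own maintenance costs) is to let a materialized view carry a second copy clustered for the other pattern:

-- Base table EVENTS is assumed to be clustered by TENANT_ID for the dominant pattern.
-- The materialized view keeps a second copy clustered for date-range queries.
CREATE MATERIALIZED VIEW EVENTS_BY_DATE
CLUSTER BY (EVENT_DATE)
AS
SELECT ID, TENANT_ID, EVENT_DATE, VALUE
FROM EVENTS;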

When picking cluster keys, the cardinality of the key values should be considered. In cases where the cardinality is higher than the micro partition count, there will be inefficiencies in clustering. To avoid this, it is common to truncate dates, substring varchars, divide integers, and in some cases persist a new value and filter on that, to get the good performance and reasonable costs of clustering.
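
A rough way to sanity-check a candidate key, as a hedged sketch on the demo table: compare the approximate number of distinct key values to the number of micro partitions (total_partition_count is one of the fields in the JSON that SYSTEM$CLUSTERING_INFORMATION returns):

-- Approximate distinct values of the candidate key...
SELECT APPROX_COUNT_DISTINCT(TENANT_ID) AS distinct_key_values
FROM CLUSTERING_DEMO;

-- ...versus the number of micro partitions in the table. If the first number is
-- far larger than the second, consider a coarser expression (truncated date,
-- substring, integer division) as the cluster key instead.
SELECT PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION('CLUSTERING_DEMO', '(TENANT_ID)')):total_partition_count::NUMBER AS micro_partition_count;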

What NOT to do?

There are a few things I see frequently in clustering choices that provide no performance gains and are a waste of consumption. I'll go through the most common ones that you should avoid.

DO NOT: Cluster by a timestamp when most predicates are by DAY.

The reasoning is that the Automatic Clustering service will spend a lot of time organizing records that happened in the same second/minute/hour together. Choosing the TIMESTAMP field means work will be done at a precision that may never be needed by the predicates.

In this case you should cluster by the DATE_TRUNC to DAY, as long as most filters only go down to the day. Similarly, cluster by DATE_TRUNC to HOUR if most filters go down to the hour. This will save unnecessary reclustering costs. All existing queries will still prune, as the MIN and MAX on the column will work with the existing predicates without any modifications: there is no need to change the predicates to use DATE_TRUNC.
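
As a hedged sketch of what that looks like (EVENTS and EVENT_TS are hypothetical names), the cluster key uses the truncated value while queries keep filtering on the raw timestamp:

-- Cluster on the day, not the raw timestamp, so the Automatic Clustering service
-- doesn't spend credits ordering rows to a precision no predicate will ever use.
ALTER TABLE EVENTS CLUSTER BY (DATE_TRUNC('DAY', EVENT_TS));

-- Existing queries stay exactly as they are and still prune, because the MIN/MAX
-- metadata on EVENT_TS in each micro partition remains tight.
SELECT *
FROM EVENTS
WHERE EVENT_TS >= '2024-06-01'
AND EVENT_TS < '2024-06-02';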

DO NOT: Cluster on a VARCHAR field that has a low-cardinality prefix

Snowflake only clusters on the first 6 characters. Due to this, it often does not behave how customers expect when clustering on varchar fields. I’ve seen this several times when dates were strings starting with YYYY, or a uuid generator had a common prefix. The reason this is a bad idea is that there isn’t enough cardinality in those first few characters to allow for good pruning (micro partition elimination).

One solution to this challenge is to convert those fields to dates, which will be more efficient to compare and where clustering will work directly. If that is too much work, you can leave the dates as strings but cluster on the value as a date, e.g. cluster on to_date(<stringdate>, 'YYYY-MM-DD'). This works well because the MIN/MAX statistics stored for the micro partitions still help as long as the strings start with something like YYYY-MM-DD, which sorts alphabetically the same as a date field. In this case the predicates are all comparing strings, not date fields.
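
For example, assuming a hypothetical ORDERS table with an ORDER_DATE_STR column holding YYYY-MM-DD strings, a sketch of that second option:

-- Keep the column as a string, but cluster on its date interpretation.
ALTER TABLE ORDERS CLUSTER BY (TO_DATE(ORDER_DATE_STR, 'YYYY-MM-DD'));

-- Predicates continue to compare strings; pruning still works because
-- YYYY-MM-DD strings sort alphabetically the same way the dates do.
SELECT *
FROM ORDERS
WHERE ORDER_DATE_STR BETWEEN '2024-06-01' AND '2024-06-30';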

For cases where the beginning of the string has low cardinality, but characters later have high cardinality, cluster by a substring containing the higher cardinality. In this case the data with similar substrings will be put together with Automatic Clustering and the MIN & MAX values stored for each micro partition will be useful for pruning.
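
A minimal sketch of that pattern, assuming a hypothetical DEVICE_READINGS table whose DEVICE_CODE strings share a constant prefix and vary from the 9th character onward:

-- The first 8 characters are a near-constant prefix, so cluster on the
-- higher-cardinality portion of the string instead.
ALTER TABLE DEVICE_READINGS CLUSTER BY (SUBSTR(DEVICE_CODE, 9, 6));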

DO NOT: Try to compensate for low starting-character cardinality with a HASH function

Often Snowflake users will notice the front-edge cardinality not allowing for good clustering and will jump to using a hash function. We only store statistics on stored fields, so even if this is done AND the predicates are modified to match on the HASH, we still can't eliminate any micro partitions. The only way this will work is if the HASH value is also persisted so we can keep statistics on it, and that new field is added to the predicates where needed for elimination.

A better way to reduce cardinality when the starting characters are similar is to cluster by a substring of the field, which was discussed in the previous case.

DO NOT: Try to reduce cardinality with MODULO functions in the cluster definition without persisting the MODULO result and adding it to predicates

This can help on some occasions where there is too much cardinality in the field(s) chosen to cluster on, but it ONLY works if that value is persisted, because we need statistics to do pruning. Queries also have to be modified to add the additional predicate on the new persisted field.
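
If you do go this route, here is a hedged sketch of everything it takes (USER_EVENTS and STAGING_USER_EVENTS are hypothetical): persist the bucket as a real column at load time, cluster on it, and add it to the predicates.

-- The bucket must be a stored column so Snowflake keeps MIN/MAX statistics for it.
CREATE OR REPLACE TABLE USER_EVENTS (
USER_ID NUMBER(12,0),
USER_BUCKET NUMBER(3,0), -- persisted as MOD(USER_ID, 16) at load time
PAYLOAD VARCHAR
) CLUSTER BY (USER_BUCKET);

INSERT INTO USER_EVENTS
SELECT USER_ID, MOD(USER_ID, 16), PAYLOAD
FROM STAGING_USER_EVENTS; -- placeholder source table

-- Queries must include the bucket predicate too, or no pruning happens.
SELECT *
FROM USER_EVENTS
WHERE USER_ID = 123457
AND USER_BUCKET = MOD(123457, 16);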

In cases where the field value is an integer/numeric, it is often more useful to divide the field by a constant and cluster by that value, as it reduces the cardinality of the values while still allowing the MIN & MAX values of the field to be useful for pruning.
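
A hedged sketch of that approach (FACT_EVENTS and its high-cardinality EVENT_ID are hypothetical):

-- Group roughly 100,000 consecutive IDs per clustering bucket; the divisor is
-- purely illustrative and should be sized to your data volume.
ALTER TABLE FACT_EVENTS CLUSTER BY (FLOOR(EVENT_ID / 100000));

-- No predicate changes are needed: rows with nearby IDs land in the same micro
-- partitions, so the MIN/MAX of EVENT_ID per partition prunes this query well.
SELECT * FROM FACT_EVENTS WHERE EVENT_ID = 123456;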

DO NOT: Change clustering in production on a table w/o a manual rewrite

When you change the clustering configuration, the background service will take a lot of time and credits while trying not to impact other queries. It will likely rewrite many micro partitions multiple times to get to the best outcome.

You can do an insert overwrite in production to do a one-time recluster.

INSERT OVERWRITE INTO YOUR_TABLE 
SELECT * FROM YOUR_TABLE ORDER BY <YOUR NEW CLUSTER CHOICE>;

You can do that on a very large warehouse and get the entire table overwritten immediately. Any table up to a TB can easily be rewritten this way. The Automatic Clustering service then will not have to rewrite all the micro partitions in the table in the background to meet the new clustering configuration.

This can be an efficient way to recluster a table, but care should be taken to size the warehouse properly for the dataset size to prevent spilling. For very large tables (many TBs), it may be possible to rewrite in pieces so the work runs on warehouses without spilling.
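
One hedged way to do that piecewise rewrite (a sketch, not a prescription; the table, the TENANT_ID ranges, and the cluster choice below are placeholders, and you'd want to pause or coordinate writes while this runs): build a sorted copy range by range into a new table, then swap it in.

-- Create an empty copy and set the new cluster key on it.
CREATE TABLE YOUR_TABLE_RESORTED LIKE YOUR_TABLE;
ALTER TABLE YOUR_TABLE_RESORTED CLUSTER BY (TENANT_ID); -- your new cluster choice

-- Copy the data in slices small enough for your warehouse to sort without spilling.
INSERT INTO YOUR_TABLE_RESORTED
SELECT * FROM YOUR_TABLE WHERE TENANT_ID BETWEEN 1 AND 500 ORDER BY TENANT_ID;

INSERT INTO YOUR_TABLE_RESORTED
SELECT * FROM YOUR_TABLE WHERE TENANT_ID BETWEEN 501 AND 1000 ORDER BY TENANT_ID;

-- Atomically exchange the two tables once the copy is complete.
ALTER TABLE YOUR_TABLE_RESORTED SWAP WITH YOUR_TABLE;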

The Automatic Clustering service will work on any table if these alternatives are too complex, but it will use more credits to complete the job.

Conclusion

Snowflake clustering is an extremely important tool to have in your toolbox. Run some tests on smaller datasets/queries to understand how it works (hopefully my example gave you some ideas). When I need queries to be faster AND more efficient, this is the first thing I look at whenever I see slow TableScans in the Query Profile, to see if I can leverage it to my advantage.

Don't forget to go back and run the query on the first table, which was left to the Automatic Clustering service to recluster by tenant. You'll see that after the data is rewritten, the query only needs to scan 1 micro partition.


Brad Culberson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Data Platform Architect at Snowflake specializing in Data Applications. Designed & built many data applications in Marketing and Health & Fitness.