Optimizing Performance in Snowflake: The Power of Table Clustering

Cédric Souverain

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

5 min readJan 16, 2024

Unlocking Efficiency Through Strategic Data Segmentation and Query Acceleration

Introduction

If optimizing your Snowflake environment is part of your New Year’s resolution, you might be interested in Snowflake’s table clustering system.

Snowflake, a cloud-based data warehousing platform, provides a robust environment for storing and analyzing vast amounts of data. To fully harness its capabilities, it’s crucial to optimize the performance of your data warehouse. This article explores the importance of table clustering in Snowflake and how it can be a game-changer for enhancing query performance and overall efficiency.

Understanding Table Clustering in Snowflake

Table clustering involves dividing large tables into smaller, more manageable segments based on specific column values. Snowflake supports both time-based and column-based clustering.

Time-Based Clustering

In time-based clustering, data is segmented based on intervals of time, such as days, months, or years. This technique is particularly advantageous for tables containing date or timestamp columns. The beauty of this approach lies in its ability to significantly improve query performance when filtering data based on time. For example, if you often query data for a specific month or year, time-based clustering allows Snowflake to skip irrelevant clusters, leading to faster and more efficient queries.

Column-Based Clustering

Column-based clustering involves segmenting data based on the values in a specific column. By doing so, it enhances performance by reducing the amount of data that needs to be scanned during queries. This method is suitable for scenarios where queries frequently filter or aggregate data based on a particular attribute. For instance, if your queries often involve filtering or aggregating data based on product categories, clustering the table by the product category column can significantly speed up these operations.

Advantages of Table Clustering

Improved Query Performance

Queries that filter or aggregate data based on cluster columns can skip irrelevant clusters, resulting in faster query execution.
Reduced data scanning leads to lower costs and faster results.

Efficient Data Loading

Loading data into clustered tables is often more efficient, especially when using Snowflake’s bulk loading capabilities.
Loading data into specific clusters minimizes the need for redistributing or reorganizing the entire table.

Enhanced Maintenance

Clustering simplifies data maintenance tasks. For example, purging old data becomes more efficient when it’s confined to specific clusters.

Implementing Table Clusters in Snowflake

Creating a Clustered Table

Creating a clustered table in Snowflake involves using the CLUSTER BY clause during table creation. Let’s take the example of creating a time-based clustered table by day.

CREATE TABLE sales (     
  sale_date DATE,     
  product_id INTEGER,     
  revenue DECIMAL(10, 2) 
) CLUSTER BY (sale_date);

Altering Existing Tables

Altering existing tables to introduce clustering is possible in Snowflake, but this operation may involve redistributing existing data. It’s advisable to plan this operation during periods of low activity.

ALTER TABLE sales CLUSTER BY (sale_date);

You can then verify that your table is now clustered by using the following command and looking at the cluster_by column:

SHOW TABLES LIKE 'sales';

Best Practices for Effective Table Clustering

Choose Appropriate Cluster Columns

Select columns that are frequently used in queries and filtering conditions.
For time-based clustering, use columns with a clear chronological sequence.

Consider the Query Patterns

Analyze the typical query patterns on your tables and design the clusters accordingly.
Align clustering strategies with common filtering or aggregation requirements.

Regularly Monitor and Adjust

Keep an eye on query performance and adjust clustering strategies as data queries evolve.

Clustering Performance Use Case

Before closing this article, I wanted to show you how important it is to cluster your tables with a real world use case. For this example, we will compare insert and select query profiles between clustered and non-clustered tables.

In this use case, we will use one of Snowflake’s sample data SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM. This table contains 6 001 215 records, table size is 157.6MB, Warehouse used is sized Small.

Let’s create clustered and non-clustered tables :

CREATE TABLE DEMO_DB.PUBLIC.LINEITEM_CLUSTERED (    
    L_ORDERKEY NUMBER(38,0),
    L_PARTKEY NUMBER(38,0),
    L_SUPPKEY NUMBER(38,0),
    L_LINENUMBER NUMBER(38,0),
    L_QUANTITY NUMBER(12,2),
    L_EXTENDEDPRICE NUMBER(12,2),
    L_DISCOUNT NUMBER(12,2),
    L_TAX NUMBER(12,2),
    L_RETURNFLAG VARCHAR(1),
    L_LINESTATUS VARCHAR(1),
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT VARCHAR(25),
    L_SHIPMODE VARCHAR(10),
    L_COMMENT VARCHAR(44)
) CLUSTER BY (L_SHIPDATE)
AS SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM;

-- Query duration: 6.5s

CREATE TABLE DEMO_DB.PUBLIC.LINEITEM_NOT_CLUSTERED (    
    L_ORDERKEY NUMBER(38,0),
    L_PARTKEY NUMBER(38,0),
    L_SUPPKEY NUMBER(38,0),
    L_LINENUMBER NUMBER(38,0),
    L_QUANTITY NUMBER(12,2),
    L_EXTENDEDPRICE NUMBER(12,2),
    L_DISCOUNT NUMBER(12,2),
    L_TAX NUMBER(12,2),
    L_RETURNFLAG VARCHAR(1),
    L_LINESTATUS VARCHAR(1),
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT VARCHAR(25),
    L_SHIPMODE VARCHAR(10),
    L_COMMENT VARCHAR(44)
)
AS SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM;

-- Query duration: 9.2s

Select

With the below select queries, we can already see some differences. Clustered table respond 2 times faster than the non-clustered table.

SELECT 
  YEAR(L_SHIPDATE), 
  COUNT(*), 
  SUM(L_QUANTITY), 
  MIN(L_RECEIPTDATE), 
  MAX(L_RECEIPTDATE) 
FROM DEMO_DB.PUBLIC.LINEITEM_CLUSTERED
GROUP BY 1 ORDER BY 1;
-- Query duration: 868ms

SELECT 
  YEAR(L_SHIPDATE), 
  COUNT(*), 
  SUM(L_QUANTITY), 
  MIN(L_RECEIPTDATE), 
  MAX(L_RECEIPTDATE) 
FROM DEMO_DB.PUBLIC.LINEITEM_NOT_CLUSTERED
GROUP BY 1 ORDER BY 1;
-- Query duration: 1.8s

Insert

On the insert side, even if you can see that on the delete query the non-clustered table was quicker, it took +30% process time to insert the same amount of data.

DELETE FROM DEMO_DB.PUBLIC.LINEITEM_CLUSTERED;
-- Query duration: 721ms

INSERT INTO DEMO_DB.PUBLIC.LINEITEM_CLUSTERED (
    L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT,
    L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT,
    L_SHIPMODE, L_COMMENT)
  SELECT
    L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT,
    L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT,
    L_SHIPMODE, L_COMMENT
  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM;
-- Query duraction: 6.9s

DELETE FROM DEMO_DB.PUBLIC.LINEITEM_NOT_CLUSTERED;
-- Query duration: 340ms

INSERT INTO DEMO_DB.PUBLIC.LINEITEM_NOT_CLUSTERED (
    L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT,
    L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT,
    L_SHIPMODE, L_COMMENT)
  SELECT
    L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT,
    L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT,
    L_SHIPMODE, L_COMMENT
  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM;
-- Query duraction: 9.8s

If we really would like to compare both performances, we could take a larger dataset and run those queries multi times. The above performance are just on single queries, so the variance can be important. In the future, I may write an article dedicated to a specific use case with deep insight on performance improvement.

Conclusion

Table clustering is a powerful optimization technique in Snowflake, offering substantial benefits in terms of query performance, data loading efficiency, and maintenance simplicity. By understanding the nuances of time-based and column-based clustering and adopting best practices, you can unlock the full potential of your Snowflake data warehouse. Incorporate table clustering into your data management strategy to ensure your Snowflake environment operates at peak efficiency for both current and future analytical demands.

Hope this article helped you optimize your Snowflake data warehouse.

Thank you for reading!