Introducing the Snowflake Visual Table Clustering Explorer

Visualize how Snowflake tables are clustered

Introduction

My first exposure to the Snowflake Data Cloud came in early 2018. After working very closely with several different databases and BI products over the past two decades, I can still remember how excited I was after learning more about Snowflake’s multi-cluster, shared data architecture. “A fully managed service, simple to use but capable of powering an unlimited number of concurrent workloads”… tell me more! Every business I had ever worked with desperately needed the performance, flexibility, and near-infinite scalability to easily load, integrate, analyze, and securely share their data.

Recently I had several opportunities to help clients optimize various workloads for Snowflake. While working on these engagements, my team and I saw firsthand how Snowflake is designed to handle much of the heavy lifting, so teams can focus on what matters to their business. Occasionally, though, we came across certain analytic and data engineering workloads that needed a little boost. I spent time with data engineers, developers, and architects, reviewing Snowflake Query Profile diagrams and table pruning statistics to better understand what was happening in queries where we suspected there was room for additional improvement.

Snowflake’s single elastic performance engine delivers instant and near-unlimited scale, providing reliable, fast performance thanks to Snowflake’s multi-cluster resource isolation. Still, when running any data workload on top of Snowflake’s centralized storage layer, especially against exceedingly large tables, it is helpful to understand the physical structure of those tables.

When data is loaded, Snowflake reorganizes that data into its internal optimized, compressed, columnar format and stores this optimized data in cloud storage

All data in Snowflake tables is automatically divided into micro-partitions. Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion. This size and structure allow for extremely granular pruning of very large tables, which can be comprised of millions, or even hundreds of millions, of micro-partitions.

Typically, table data is sorted/ordered along natural dimensions (e.g. date and/or geographic regions). This “clustering” is a key factor in query performance, because table data that is not sorted (or is only partially sorted) may slow queries down, particularly on very large tables.

OK, OK — that’s a lot to process — I’ll pause here briefly and allow the reader to come up for some air :) For anyone thinking “That’s way over my head!”, “Why should I care?”, or “I know all of this already”, keep reading — these next few sections are just for you!!

“That’s Way Over My Head!”

Let’s make a fun comparison. When was the last time you spent too much time in the supermarket? Have you ever had trouble finding an item on your shopping list, walking through every grocery aisle, scanning each and every shelf? I have!! More often than I would like to admit. Most of us have some level of familiarity with the basic layout of our favorite store. We know how goods have been organized: produce, deli, butcher, dairy, frozen, dry goods, etc. All of these items have been “co-located”.

Store layout by category — how quickly could you find everything colored red? Get ready to walk every aisle

We could simplify this grocery store concept (convenience store) or make it even more complex (distribution warehouse) — you get the point. The level of effort required by you, the shopper, to find what you need likely depends upon several factors: the size of the store, the number of items (and complexity) of your shopping list, store organization, signs and labels, how familiar you are with this particular location… All of these can make or break your afternoon!

Who’s ready to shop? I’m ready for the convenience store, or the grocery store if I’ve been there before. I’m not going near that distribution warehouse without a map and 10 personal shopping assistants!!

Now let’s suppose that you have a requirement to do some shopping (or inventory analysis) based upon a less common set of criteria. How long might it take you to identify every item:

  • Colored red?
  • In cardboard packaging?
  • Priced less than $6?
  • Weighing less than 1 pound?
  • Containing wheat flour as the first ingredient?
  • Manufactured in the USA?
  • and so on…

This would be quite the undertaking!!! (Again, influenced by the size of the store you’ve been assigned). Why? Because items are usually NOT organized (or “co-located”) by color, packaging material, price, weight, ingredient, or country of origin. It’s just not how the average buyer shops.

Hopefully by now you’re identifying the similarities between the above scenario and Snowflake database tables (and their corresponding micro-partitions). Data in these micro-partitions is organized, or “clustered”, a certain way. Similar data (based upon date, geography, category, etc.) is “co-located” based upon a defined cluster key (if present), or the sorting / ingestion order used during the original load, just like a grocery store organizes goods on shelves across several aisles.
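
If you are curious whether a given table already has a cluster key defined, the output of SHOW TABLES includes a cluster_by column (empty when no key has been defined). A minimal check, assuming you have access to the sample database used later in this article, might look like this:

-- The cluster_by column in the output lists any defined clustering key
show tables like 'WEB_SALES' in schema SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL;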

“Why should I care?”

In Snowflake, most often you won’t care, because in general Snowflake produces well-clustered data in tables, and for simple queries against smaller tables it will never matter. Snowflake compute resources, or virtual warehouses, are extremely efficient when it comes to scanning the required micro-partitions in order to find and return the correct results for a given query.

A virtual warehouse provides the required resources, such as CPU, memory, and temporary storage, to execute SQL SELECT statements that retrieve rows from tables

Really large tables (with a larger number of micro-partitions), though, can take longer to scan, especially if the query joins or filters on columns other than the defined cluster key (if present), or the columns used for sorting in the original load (the “natural sort”, or “natural cluster key”).

In summary: queries against very large tables that join or filter on columns that are not well clustered (similar to searching a grocery store for goods using criteria that differs from the way those goods have been organized, e.g. “everything colored red”) may take longer to run, as the compute resources must scan many more micro-partitions, just as you might have to walk every aisle in the store if you were shopping by color.

Luckily, Snowflake provides a function, SYSTEM$CLUSTERING_INFORMATION, that can be run against any table (typically larger ones) to see how well that table is clustered; a sample call follows the list below. The function returns a JSON object containing the following:

  • cluster_by_keys: Columns in the table used to return clustering information; can be any columns in the table.
  • total_partition_count: Total number of micro-partitions that comprise the table.
  • total_constant_partition_count: Total number of micro-partitions for which the value of the specified columns has reached a constant state (these partitions gain little from reclustering).
  • average_overlaps: Average number of overlapping micro-partitions for each micro-partition in the table. A high number indicates the table is not well-clustered.
  • average_depth: Average overlap depth of each micro-partition in the table. A high number indicates the table is not well-clustered.
  • partition_depth_histogram: A histogram depicting the distribution of overlap depth for each micro-partition in the table.
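
For example, a first look at the sample table used later in this article would be a call with just the table name, which reports clustering against the table’s defined clustering key (for a table with no defined key, you must pass the columns explicitly, as shown a bit further below):

-- Returns a JSON string describing how well the table is clustered on its defined key
select system$clustering_information('SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.WEB_SALES');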

In the client engagement I mentioned earlier (focused on workload optimization), we ran this function whenever we were working to improve query performance for large tables. Often we ended up creating new clustering keys, or altering existing ones, based upon several considerations.
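
For reference, defining or changing a clustering key is a single ALTER TABLE statement; the table and column names below are purely hypothetical, and whether a key is worth its automatic reclustering cost depends on the considerations mentioned above:

-- Define (or replace) a clustering key on an existing table; Snowflake then
-- reclusters the data in the background via Automatic Clustering
alter table my_db.my_schema.big_sales_table cluster by (sale_date, region);

-- Drop the key if it no longer matches the dominant query patterns
alter table my_db.my_schema.big_sales_table drop clustering key;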

“I know all of this already!”

Over time, as I ran this function against more and more tables, passing different column names (required for tables with no defined key), I got better at interpreting the results (How many partitions? Do we have high or low numbers for “overlaps” and “depth”?). And so I started thinking: what if there were an easier way to run this, perhaps using a “Visual Table Clustering Explorer”? Many of our BI Technology Partners have already developed different flavors of Snowflake Account Usage Dashboards (see the bottom of this article). Could we extend that concept of exploring Snowflake account data to this use case?
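
In practice, those calls looked something like the sketch below, where the second argument lists candidate clustering columns (the columns shown are only illustrative choices from the WEB_SALES table):

-- Evaluate how well the table is clustered on an arbitrary set of columns,
-- whether or not a clustering key is defined
select system$clustering_information(
    'SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.WEB_SALES',
    '(WS_SHIP_DATE_SK, WS_WAREHOUSE_SK)');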

I’m excited to describe below how Snowflake collaborated with several of our Elite Snowflake Ready BI Technology Partners in designing solutions to bring this Visual Table Clustering Explorer idea to life.

Qlik

In this Qlik Sense dashboard, a list of database tables and columns from the Snowflake Information Schema TABLES view (including table sizes and information about any pre-defined cluster keys) is pre-loaded into memory. The user searches for and clicks on tables they would like to learn more about. Upon selection of a given table, the dashboard constructs a dynamic query that executes on-demand in Snowflake to retrieve statistics for the table, using Qlik Sense Dynamic Views for query pushdown.
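
I don’t have Qlik’s exact load script, but the pre-loaded metadata can be approximated with a query like the following against standard INFORMATION_SCHEMA.TABLES columns:

-- Table-level metadata, including size and any defined clustering key
select table_catalog, table_schema, table_name, row_count, bytes, clustering_key
from snowflake_sample_data.information_schema.tables
where table_type = 'BASE TABLE'
order by bytes desc;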

Sigma

Sigma is an analytics and BI platform that sits directly on top of Snowflake and extends the possibilities of the Data Cloud to business users. Inside the tool, SQL is generated in real time and passed to Snowflake. Tables, views, and functions are all accessible either through the UI or custom SQL. In this example, a Sigma workbook was created to expose some of the rich clustering information that is tracked in Snowflake’s metadata. There are tables and graphs that expose the “TABLES” and “COLUMNS” views from the “INFORMATION_SCHEMA”, as well as a section that showcases Sigma’s ability to call the multi-parameter “SYSTEM$CLUSTERING_INFORMATION” function and parse its JSON response. If you would like to experience this workbook for yourself, we have embedded it on our website! https://www.sigmacomputing.com/live-snowflake-clustering-demo/

ThoughtSpot

The Table Clustering app from ThoughtSpot provides a GUI front-end to check existing clustering statistics on database tables and to evaluate new clustering keys to tune for ideal query performance. The app is built using ThoughtSpot Everywhere, ThoughtSpot’s embedded analytics product, which allows developers to embed visualizations, Liveboards, and the search interface directly into their own websites and applications. This collection of embeddable objects and flexible APIs makes it easy to bring ThoughtSpot’s power and ease-of-use to any audience for any data-driven use case.

Tableau

Using Tableau’s Hyper API together with the Snowflake Connector for Python, an extract was built containing the current Snowflake micro-partition statistics. Visualizations were then built on top of those statistics and combined into a couple of dashboards for analysis.

The Python script used Snowflake’s Connector for Python to build three Pandas DataFrames: one listing each database in the account, another listing each table within those databases, and a third holding the result of the SYSTEM$CLUSTERING_INFORMATION function for each table that had a clustering key defined. The script then used pantab, a Python wrapper around Tableau’s Hyper API, to write the three DataFrames to three tables in a Hyper extract. The Hyper extract simply looks to Tableau Desktop like an extremely fast local database.
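
The script itself is not reproduced in this article, so treat the following as a rough sketch of the kind of SQL behind the first two DataFrames (the Account Usage views shown are one possible source; SHOW commands or the Information Schema would work as well). The third DataFrame comes from running the SYSTEM$CLUSTERING_INFORMATION call shown earlier, once per clustered table:

-- One row per database in the account (Account Usage views lag real time)
select database_name, created
from snowflake.account_usage.databases
where deleted is null;

-- One row per table, including any defined clustering key
select table_catalog, table_schema, table_name, clustering_key, row_count, bytes
from snowflake.account_usage.tables
where deleted is null;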

Putting it all together

These incredibly innovative solutions from Snowflake’s Elite BI Technology Partners enable the user to easily search for and explore tables that they would like to learn more about. For example, we could select WEB_SALES from the Snowflake Sample Database (TPCDS_SF100TCL schema). At almost 5 terabytes and 72 billion rows, this is a table worth learning more about if business users plan to query it often!!!

As I started to ideate on this concept, I tried getting a little more creative with the core SYSTEM$CLUSTERING_INFORMATION function mentioned previously. Because the function returns a JSON object, we can run it first within a common table expression (CTE), and then parse the CTE output to extract individual values to display within a dashboard.

with sys_cluster_info as
(
    -- Returns a single VARCHAR column containing the clustering details as JSON
    select system$clustering_information('SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.WEB_SALES')
)
select
    -- Parse the JSON and surface each attribute as its own column
    parse_json($1):cluster_by_keys                as cluster_by_keys,
    parse_json($1):total_partition_count          as total_partition_count,
    parse_json($1):total_constant_partition_count as total_constant_partition_count,
    parse_json($1):average_overlaps               as average_overlaps,
    parse_json($1):average_depth                  as average_depth,
    parse_json($1):partition_depth_histogram      as partition_depth_histogram
from sys_cluster_info;

As shown in several of the above examples, this table has over 300,000 micro-partitions and is clustered on “sold date” and “item sk”. Average overlaps (3.5) and average depth (2.7) are very low, so we know that this table is currently very well clustered on these columns. This also means that queries that join or filter on these columns should run extremely quickly: Snowflake will achieve some very nice pruning (eliminating unnecessary partitions) as it scans this table.

Let’s test out our theory just to be sure. Within Snowsight (the new UI), when previewing data for SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.WEB_SALES I can see several column and row values.

I’ll stick with the first value of 2451505 in WS_SOLD_DATE_SK (our cluster key), and select all rows from the table that match.

select * 
from SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.WEB_SALES
where WS_SOLD_DATE_SK = 2451505;

On my SMALL warehouse (run from cold, with no data cache), this query executes in just over one minute, returning 77.9 million rows (not too bad for a 5-terabyte, 72-billion-row table!!!).
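
If you would like to reproduce a comparable “cold” test in your own account, one approach (the warehouse name is a placeholder) is to disable the result cache for your session and bounce the warehouse so its local data cache starts empty:

-- Avoid returning a previously cached result set
alter session set use_cached_result = false;

-- Suspending and resuming the warehouse clears its local (SSD) data cache
alter warehouse my_small_wh suspend;
alter warehouse my_small_wh resume;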

Let’s take a look at our query profile to see how the table cluster key, used within a query that filters on that same column, affects overall performance: just over 300 partitions scanned out of a possible 300,000.
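
Beyond the Query Profile UI, the same pruning statistics can be pulled programmatically. For example, the Account Usage QUERY_HISTORY view exposes partitions_scanned and partitions_total for each query (this view lags real time, so a just-executed query can take a while to appear):

-- Confirm pruning for the most recent run of our filter query
select query_text, partitions_scanned, partitions_total, total_elapsed_time
from snowflake.account_usage.query_history
where query_text ilike 'select *%WS_SOLD_DATE_SK = 2451505%'
order by start_time desc
limit 1;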

Snowflake’s Cloud Services layer, the “brain” of the solution, stores metadata (including column min/max values) for all 300,112 partitions and directs query planning so that our virtual compute warehouse knows exactly which partitions are relevant for our query — the perfect (highly scalable) shopping assistant that supports hundreds of millions of queries every day. Fantastic pruning for optimal performance!!!

The cloud services layer is a collection of services that coordinate activities across Snowflake, including metadata management and query optimization

Conclusion

This article provided a brief overview of Snowflake table structures and how thoughtful clustering strategies (for very large tables) can dramatically improve query performance. Additionally, it described how Snowflake developers can run built-in system functions to learn more about defined cluster keys, and how well-clustered a table is for any given column. Finally, it highlighted how all of this information can also be visually explored with just a couple of mouse clicks using creative solutions developed by a few Snowflake BI Technology Partners.
