One of the ways we like to make sense of the world is by grouping similar things together.
In design, we group colours into shades. In sales and marketing, customers are usually segmented to accommodate their differences. And in statistics, clustering is a widely used unsupervised learning technique to help organise a dataset.
As you may have noticed from my previous stories, Snowflake can be used for way more than just retrieving or aggregating data. It can be extended to a wide range of data-related tasks at a large scale, and coexist peacefully alongside the more traditional workloads.
In this story, I’ll demonstrate k-means clustering running in Snowflake, in a manner that scales well to large datasets. Here’s a quick Tableau visualisation of the final product:
What is k-means?
I’ll leave the explanation to the experts: StatQuest has a great video describing how k-means clustering works:
First, get some weather data from the sample data we all have in our accounts:
It’s a huge table, so to save time I just took the first million rows and sampled from those.
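Something along these lines will do it. This is a sketch only: the sample database, table, and JSON paths (`SNOWFLAKE_SAMPLE_DATA.WEATHER.DAILY_14_TOTAL` and its `v` variant column) are assumptions based on Snowflake’s shared weather dataset, and may differ in your account:

```sql
-- Sketch: build a table of Australian lat/lon points from the sample weather data.
-- The source table name and JSON structure are assumptions; adjust to your account.
create or replace table AUSTRALIAN_WEATHER as
select v:city.name::string     as CITY,
       v:city.coord.lat::float as LAT,
       v:city.coord.lon::float as LON
from (
  -- take the first million rows, then sample, to save time on the huge table
  select * from SNOWFLAKE_SAMPLE_DATA.WEATHER.DAILY_14_TOTAL limit 1000000
) sample (10)
where v:city.country = 'AU';
```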
Now, we’ll create a table to track our k-means metadata for different tables and columns:
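A minimal version of that metadata table might look like this. The column names here are my assumption, not necessarily those in the original source:

```sql
-- Hypothetical structure: one row per cluster, per clustered table/column combination.
create or replace table KMEANS_CLUSTERS (
  TABLE_NAME    varchar,  -- source table, e.g. 'AUSTRALIAN_WEATHER'
  COLUMN_NAMES  varchar,  -- the columns being clustered, e.g. 'LAT,LON'
  CLUSTER_INDEX integer,  -- cluster identifier
  CENTROIDS     variant   -- current centroid coordinates for this cluster
);
```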
We’ll cover what the CENTROIDS column is for later.
We want to be able to assign each row in our source table to a particular cluster, so add an extra column for that:
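Assuming the table and column names above, that’s a one-liner:

```sql
alter table AUSTRALIAN_WEATHER add column CLUSTER_INDEX integer;
```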
The whole thing is tied together using a stored procedure. The full source code is here, but I’ll step through it and explain each part.
Calling the k-means stored procedure looks like this:
call k_means('AUSTRALIAN_WEATHER', 'LAT,LON','CLUSTER_INDEX', 5, 10)
This simply says: For each row in the AUSTRALIAN_WEATHER table, assign the LAT and LON combination to one of 5 clusters, using 10 iterations of training. Put the final cluster identifier into the CLUSTER_INDEX column.
The stored procedure does the following:
1. Removes any existing cluster metadata for ‘LAT,LON’ in ‘AUSTRALIAN_WEATHER’ from the KMEANS_CLUSTERS table.
2. Grabs a random sample of 5 rows from the table, and uses their LAT and LON values as the centre points (or ‘centroids’) of 5 initial clusters.
3. Inserts these initial cluster centroids into the KMEANS_CLUSTERS table. After this point, the KMEANS_CLUSTERS table looks something like this:
4. With clusters defined, the stored proc now uses a User Defined Function (UDF) to assign each row to the closest cluster:
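The idea behind the UDF can be sketched as follows: given a point and the array of current centroids, return the index of the nearest one. The function name, signature, and JavaScript body here are my assumptions; the real implementation may differ:

```sql
-- Sketch of a nearest-centroid UDF (names and types assumed).
create or replace function CLOSEST_CENTROID(X float, Y float, CENTROIDS variant)
returns float
language javascript
as $$
  var best = 0;
  var bestDist = Infinity;
  for (var i = 0; i < CENTROIDS.length; i++) {
    var dx = X - CENTROIDS[i][0];
    var dy = Y - CENTROIDS[i][1];
    var d = dx * dx + dy * dy;  // squared Euclidean distance is enough for comparison
    if (d < bestDist) { bestDist = d; best = i; }
  }
  return best;
$$;
```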
The stored proc generates SQL which uses this UDF as part of an UPDATE statement:
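Roughly, the generated statement would look like this, assuming the table and UDF names sketched above:

```sql
-- Sketch of the generated UPDATE: assign every row to its nearest centroid.
update AUSTRALIAN_WEATHER t
set CLUSTER_INDEX = CLOSEST_CENTROID(t.LAT, t.LON, c.ALL_CENTROIDS)
from (
  select array_agg(CENTROIDS) as ALL_CENTROIDS
  from KMEANS_CLUSTERS
  where TABLE_NAME = 'AUSTRALIAN_WEATHER'
    and COLUMN_NAMES = 'LAT,LON'
) c;
```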
5. With the cluster indexes updated on all the records, the stored proc calculates new centroids by taking the average x and y values for each cluster:
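That recalculation step might look roughly like this, again assuming the names used above:

```sql
-- Sketch: move each centroid to the mean position of its assigned rows.
update KMEANS_CLUSTERS k
set CENTROIDS = m.NEW_CENTROID
from (
  select CLUSTER_INDEX,
         array_construct(avg(LAT), avg(LON)) as NEW_CENTROID
  from AUSTRALIAN_WEATHER
  group by CLUSTER_INDEX
) m
where k.TABLE_NAME   = 'AUSTRALIAN_WEATHER'
  and k.COLUMN_NAMES = 'LAT,LON'
  and k.CLUSTER_INDEX = m.CLUSTER_INDEX;
```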
Steps 4 and 5 repeat in a cycle until all of the iterations are used up. Each time, the cluster centroids will usually shift slightly as records move from one cluster to another.
“How did you make that cool visualisation?”
Glad you asked.
Snowflake has a feature called time travel, which lets you query a table as it existed at any point in time within the defined retention window.
To leverage this, I first gave Tableau a custom query that retrieves the exact end timestamp of every query from the past hour that updated the AUSTRALIAN_WEATHER table:
select END_TIME::timestamp_ntz as END_TIME
from table(information_schema.query_history_by_user(
    USER_NAME => CURRENT_USER(),
    END_TIME_RANGE_START => TIMESTAMPADD(hour, -1, CURRENT_TIMESTAMP())))
where QUERY_TEXT like 'update "AUSTRALIAN_WEATHER"%'
order by START_TIME asc
Here’s the result:
Then when querying your actual table, you can create a parameter that takes values from this list:
and use the “at” qualifier to select the point in time from the parameter.
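Putting those together, the table query looks something like this, where the parameter placeholder name is my own invention (use whatever you called the parameter in Tableau):

```sql
-- Time travel query: <Parameters.EndTime> stands in for the Tableau parameter value.
select LAT, LON, CLUSTER_INDEX
from AUSTRALIAN_WEATHER
  at (timestamp => '<Parameters.EndTime>'::timestamp_ntz);
```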
Add this to your Tableau sheet, and you have a time-travelling device!