ML in Snowflake Part 2: k-means clustering

James Weakley
Aug 21, 2019 · 4 min read

One of the ways we like to make sense of the world is by grouping similar things together.

In design, we group colours into shades. In sales and marketing, customers are usually segmented to accommodate their differences. And in statistics, clustering is a widely used unsupervised learning technique to help organise a dataset.

As you may have noticed from my previous stories, Snowflake can be used for way more than just retrieving or aggregating data. It can be extended to a wide range of data-related tasks at a large scale, and coexist peacefully alongside the more traditional workloads.

In this story, I’ll demonstrate k-means clustering running in Snowflake, in a manner that scales well to large datasets. Here’s a quick Tableau visualisation of the final product:

[Tableau map: the large dots are the cluster centroids, the small dots are the data points]

What is k-means?

I’ll leave the detailed explanation to the experts; StatQuest has a great video describing how k-means clustering works. In short: pick k starting centroids, assign every point to its nearest centroid, move each centroid to the mean of the points assigned to it, and repeat.

Snowflake Implementation

First, get some weather data from the sample data we all have in our accounts:
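The code embed hasn’t survived in this copy, but the query would look something like the sketch below. The sample table name and JSON paths are assumptions about the SNOWFLAKE_SAMPLE_DATA share, so adjust them to whatever your account exposes:

```sql
-- Build a working table of Australian station coordinates from the
-- shared sample data. Table and JSON paths are assumed, not verbatim
-- from the original post.
create or replace table AUSTRALIAN_WEATHER as
select V:city.coord.lat::float as LAT,
       V:city.coord.lon::float as LON
from (
    -- "cut a million rows from the top and sampled them"
    select * from SNOWFLAKE_SAMPLE_DATA.WEATHER.DAILY_14_TOTAL
    limit 1000000
) sample (10)
where V:city.country::string = 'AU';
```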

It’s always good beach weather in Australia

It’s a huge table, so I just quickly cut a million rows from the top and sampled them, to save time.

Now, we’ll create a table to track our k-means metadata for different tables and columns:
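The DDL embed is also missing; a plausible shape keeps one row per cluster, keyed by source table and column list, with the centroid coordinates in a VARIANT so any number of columns can be clustered:

```sql
-- Hypothetical metadata table (the original DDL is not shown here).
create or replace table KMEANS_CLUSTERS (
    TABLE_NAME    varchar,   -- source table being clustered
    COLUMN_NAMES  varchar,   -- comma-separated list, e.g. 'LAT,LON'
    CLUSTER_INDEX integer,   -- 0 .. k-1
    CENTROIDS     variant    -- current centroid coordinates, e.g. [lat, lon]
);
```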

We’ll cover what the CENTROIDS column is for later.

We want to be able to assign each row in our source table to a particular cluster, so add an extra column for that:
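For this example that’s a single integer column; the exact DDL isn’t shown in this copy, but it amounts to:

```sql
-- Each row will hold the index of the cluster it is assigned to.
alter table AUSTRALIAN_WEATHER add column CLUSTER_INDEX integer;
```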

The whole thing is tied together using a stored procedure. The full source code is here but I’ll step through it and explain.

Calling the k-means stored procedure looks like this:

call k_means('AUSTRALIAN_WEATHER', 'LAT,LON','CLUSTER_INDEX', 5, 10)

This simply says: For each row in the AUSTRALIAN_WEATHER table, assign the LAT and LON combination to one of 5 clusters, using 10 iterations of training. Put the final cluster identifier into the CLUSTER_INDEX column.

The stored procedure does the following:

  1. Removes any existing cluster metadata for ‘LAT,LON’ in ‘AUSTRALIAN_WEATHER’ from the KMEANS_CLUSTERS table.
  2. Grabs a random sample of 5 rows from the table, and uses the LAT and LON values from these as the centre point (or ‘centroid’) of 5 initial clusters.
  3. Inserts these initial cluster centroids into the KMEANS_CLUSTERS table. After this point, the KMEANS_CLUSTERS table looks something like this:
[Table preview: one row per initial cluster, five in total]
These are just randomly chosen points; from here, the k-means algorithm takes over and moves them around.
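Steps 2 and 3 together can be sketched as follows, assuming KMEANS_CLUSTERS holds one row per cluster with the centroid coordinates in a VARIANT array:

```sql
-- Pick 5 random rows and record their coordinates as the initial
-- centroids. Column names here follow the assumed KMEANS_CLUSTERS
-- shape, not the author's exact code.
insert into KMEANS_CLUSTERS (TABLE_NAME, COLUMN_NAMES, CLUSTER_INDEX, CENTROIDS)
select 'AUSTRALIAN_WEATHER',
       'LAT,LON',
       row_number() over (order by LAT) - 1,
       array_construct(LAT, LON)
from AUSTRALIAN_WEATHER sample row (5 rows);
```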

4. With clusters defined, the stored proc now uses a User Defined Function (UDF) to assign each row to the closest cluster:
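The author’s UDF is in the gist linked below; purely to illustrate the idea, a SQL equivalent might look like this (CLOSEST_CENTROID is a hypothetical name, and this is a sketch rather than the actual implementation):

```sql
-- Given a point and an array of [x, y] centroids, return the array
-- index of the nearest centroid by squared Euclidean distance.
create or replace function CLOSEST_CENTROID(X float, Y float, CENTROIDS variant)
returns integer
as
$$
    select c.INDEX::integer
    from table(flatten(input => CENTROIDS)) c
    order by power(X - c.VALUE[0], 2) + power(Y - c.VALUE[1], 2)
    limit 1
$$;
```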

For some reason this preview is being truncated; see here for the full code: https://gist.github.com/jamesweakley/edd9edfa3d9cffe3134e8e0f83670844

The stored proc generates SQL which uses this UDF as part of an UPDATE statement:
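The generated statement isn’t shown in this copy; roughly, it gathers the current centroids and passes them to the UDF for every row. A sketch, assuming a hypothetical nearest-centroid UDF named CLOSEST_CENTROID and one KMEANS_CLUSTERS row per cluster:

```sql
-- Assign every row to its nearest current centroid.
update AUSTRALIAN_WEATHER t
set CLUSTER_INDEX = CLOSEST_CENTROID(t.LAT, t.LON, s.ALL_CENTROIDS)
from (
    select array_agg(CENTROIDS) within group (order by CLUSTER_INDEX)
           as ALL_CENTROIDS
    from KMEANS_CLUSTERS
    where TABLE_NAME = 'AUSTRALIAN_WEATHER'
      and COLUMN_NAMES = 'LAT,LON'
) s;
```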

5. With the cluster indexes updated on all the records, the stored proc calculates new centroids by taking the average x and y values for each cluster:
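Again the embed is missing; the recalculation amounts to a grouped average, sketched here assuming KMEANS_CLUSTERS holds one row per cluster with the centroid as a VARIANT array:

```sql
-- Move each centroid to the mean of the points currently assigned to it.
update KMEANS_CLUSTERS c
set CENTROIDS = s.NEW_CENTROID
from (
    select CLUSTER_INDEX,
           array_construct(avg(LAT), avg(LON)) as NEW_CENTROID
    from AUSTRALIAN_WEATHER
    group by CLUSTER_INDEX
) s
where c.TABLE_NAME = 'AUSTRALIAN_WEATHER'
  and c.COLUMN_NAMES = 'LAT,LON'
  and c.CLUSTER_INDEX = s.CLUSTER_INDEX;
```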

Steps 4 and 5 repeat in a cycle until all of the iterations are used up. Each time, the cluster centroids will usually shift slightly as records move from one cluster to another.

“How did you make that cool visualisation?”

Glad you asked.

Snowflake has a feature called Time Travel, which means you can query a table as it existed at any point in time within the defined time travel window.

To leverage this, I first gave Tableau a custom query which would retrieve the exact timestamp of every query I’ve run in the past hour, where I was updating the AUSTRALIAN_WEATHER table:

select END_TIME::timestamp_ntz as END_TIME
from table(information_schema.query_history_by_user(
         USER_NAME => CURRENT_USER(),
         END_TIME_RANGE_START => TIMESTAMPADD(hour, -1, CURRENT_TIMESTAMP())))
where QUERY_TEXT like 'update "AUSTRALIAN_WEATHER"%'
order by START_TIME asc

Here’s the result:

[Query result: a list of END_TIME timestamps, one per UPDATE statement]

Then, when querying your actual table, you can create a Tableau parameter that takes its values from this list, and use the “at” qualifier to select the point in time from the parameter.
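The underlying query can be sketched like this, with a hypothetical <END_TIME> placeholder standing in for the Tableau parameter value:

```sql
-- Read the table as it stood at the chosen moment.
select LAT, LON, CLUSTER_INDEX
from AUSTRALIAN_WEATHER
  at (timestamp => '<END_TIME>'::timestamp_ntz);
```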

Add this to your Tableau sheet, and you have a time travelling device!


