Using MATCH_RECOGNIZE in Snowflake
A clever way to collapse daily snapshot data into relative start/end date ranges.
I’ve been using SQL for over 15 years on various platforms (Oracle, SQL Server, Teradata, Postgres, and Snowflake), and today I came across MATCH_RECOGNIZE. I was blown away by what this thing can do.
This optional subclause to the FROM clause is supported by many database platforms. I’m using Snowflake these days, so here’s a link to their documentation for MATCH_RECOGNIZE.
MATCH_RECOGNIZE is much more powerful than what is shown here, but I was so excited by what I was able to accomplish in a relatively small amount of SQL that I had to share.
Ok, so here’s what I was working on…
The backstory
My team wanted to start detecting schema changes in our upstream app databases. I know that there are other solutions for schema change detection, but I wanted to see what I could build on my own, cheaply.
So I built a simple Python script that pulls data from the information_schema.tables and information_schema.columns views in each app database and uploads the results to S3. In addition to pulling the raw data, I also added an MD5 hash column based on all columns in the view, and built a primary key column (a period-separated concatenation of the table_catalog, table_schema, table_name, and column_name columns) for each row.
-- generated MD5 record hash for information_schema.tables
MD5(
COALESCE(table_catalog::text,'$null$')
|| COALESCE(table_schema::text,'$null$')
|| COALESCE(table_name::text,'$null$')
|| COALESCE(table_type::text,'$null$')
|| COALESCE(self_referencing_column_name::text,'$null$')
|| COALESCE(reference_generation::text,'$null$')
|| COALESCE(user_defined_type_catalog::text,'$null$')
|| COALESCE(user_defined_type_schema::text,'$null$')
|| COALESCE(user_defined_type_name::text,'$null$')
|| COALESCE(is_insertable_into::text,'$null$')
|| COALESCE(is_typed::text,'$null$')
|| COALESCE(commit_action::text,'$null$')
) AS md5_record_hash,
-- generated primary key column for information_schema.tables
COALESCE(table_catalog::text,'$null$')
|| '.' || COALESCE(table_schema::text,'$null$')
|| '.' || COALESCE(table_name::text,'$null$') AS pk,
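One caveat worth noting about straight concatenation: without a separator between fields, two different rows can hash to the same string (e.g. 'ab' || 'c' and 'a' || 'bc' concatenate identically). A hypothetical tweak, not from the original script, would be to put a delimiter between fields:

```sql
-- Collision-safer variant: separate fields with a delimiter that
-- shouldn't appear in the data, so adjacent values can't blend together.
MD5(
    COALESCE(table_catalog::text,'$null$')
    || '|' || COALESCE(table_schema::text,'$null$')
    || '|' || COALESCE(table_name::text,'$null$')
    -- ...remaining columns, each preceded by || '|' ||
) AS md5_record_hash,
```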
The files are then ingested from S3 via Snowpipe.
Let’s focus on the data from the information_schema.columns view. It has a grain of one row per database/schema/table/column in the source db. Pulling in the daily snapshots, we end up with one row per day, per app db, at that basic grain. That data contains a lot of noise, because it includes rows whether or not the definition of the column changed. It would be nice to eliminate the noise (days where there are no changes) and instead only see when the definition of a column actually changes. This is where the MD5 hash column, the primary key, and MATCH_RECOGNIZE come to the rescue.
Here’s a simplified example of the ingested data.
-- Create temporary table to hold test data
CREATE OR REPLACE TEMPORARY TABLE column_data (
pk VARCHAR,
md5_record_hash VARCHAR,
run_date DATE
);
-- Insert test data
INSERT INTO column_data VALUES ('app_db.public.users.id','1d89ed6a7992c77ca6b354dc740c4863','2024-01-01');
INSERT INTO column_data VALUES ('app_db.public.users.id','1d89ed6a7992c77ca6b354dc740c4863','2024-01-02');
INSERT INTO column_data VALUES ('app_db.public.users.id','1d89ed6a7992c77ca6b354dc740c4863','2024-01-03');
INSERT INTO column_data VALUES ('app_db.public.users.id','64423ab2626a19efd4c3b58ff3a81ac9','2024-01-04');
INSERT INTO column_data VALUES ('app_db.public.users.id','64423ab2626a19efd4c3b58ff3a81ac9','2024-01-05');
INSERT INTO column_data VALUES ('app_db.public.users.id','df8e14b4da70837b1e89095790988380','2024-01-06');
INSERT INTO column_data VALUES ('app_db.public.users.username','c05a71f3b6efd66f773e63ba6ce9fc3e','2024-01-05');
INSERT INTO column_data VALUES ('app_db.public.users.username','c05a71f3b6efd66f773e63ba6ce9fc3e','2024-01-05'); -- Intentional duplicate
INSERT INTO column_data VALUES ('app_db.public.users.username','c05a71f3b6efd66f773e63ba6ce9fc3e','2024-01-06');
INSERT INTO column_data VALUES ('app_db.public.users.username','271f05dc35cc5b5a33a8da89e42c02dd','2024-01-07');
INSERT INTO column_data VALUES ('app_db.public.users.username','271f05dc35cc5b5a33a8da89e42c02dd','2024-01-15'); -- Intentional gap
INSERT INTO column_data VALUES ('app_db.public.users.username','99210cdd38519cf6ea90b7eb25e27c05','2024-01-16');
-- Show inserted data
SELECT *
FROM column_data
ORDER BY pk, run_date;
Notice that for PK value app_db.public.users.id, I have daily snapshots from 2024-01-01 through 2024-01-06. Some of those daily snapshots have the same hash value while others are different. A change in the md5_record_hash value indicates that something about the source database’s column has changed.
MATCH_RECOGNIZE can collapse these common records and return the start/end dates during which the column had a specific md5_record_hash value. This eliminates the noise from days when the daily snapshot stayed the same.
The Solution
-- Collapse time frames to only show when the record changed.
SELECT *
FROM column_data
MATCH_RECOGNIZE (
PARTITION BY pk
ORDER BY run_date
MEASURES
FIRST(md5_record_hash) AS md5_record_hash,
FIRST(run_date) AS start_date,
LAST(run_date) AS end_date
ONE ROW PER MATCH
PATTERN (initial_row_or_change{1} row_no_change* )
DEFINE
initial_row_or_change AS
LAG(md5_record_hash) IS NULL
OR md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash),
row_no_change AS
md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
)
ORDER BY
pk,
start_date;
This yields exactly the output that I wanted!
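Working through the test data by hand, the collapsed result should look like this (six rows; the duplicate 2024-01-05 row folds into its range, and the 2024-01-07 to 2024-01-15 gap collapses into one range because the hash never changed):

```sql
-- Expected result for the test data above:
--
-- PK                            MD5_RECORD_HASH                   START_DATE  END_DATE
-- app_db.public.users.id        1d89ed6a7992c77ca6b354dc740c4863  2024-01-01  2024-01-03
-- app_db.public.users.id        64423ab2626a19efd4c3b58ff3a81ac9  2024-01-04  2024-01-05
-- app_db.public.users.id        df8e14b4da70837b1e89095790988380  2024-01-06  2024-01-06
-- app_db.public.users.username  c05a71f3b6efd66f773e63ba6ce9fc3e  2024-01-05  2024-01-06
-- app_db.public.users.username  271f05dc35cc5b5a33a8da89e42c02dd  2024-01-07  2024-01-15
-- app_db.public.users.username  99210cdd38519cf6ea90b7eb25e27c05  2024-01-16  2024-01-16
```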
I’ll explain the pieces of MATCH_RECOGNIZE to make it clearer what’s going on in the SQL.
Breaking down the components
First you have the PARTITION BY and ORDER BY portions. These define the scope within which you want to perform the matching and the order in which the records are placed before searching for the pattern.
The MEASURES portion defines what aggregates you want returned when the pattern is matched. In my case I wanted to return the FIRST md5_record_hash, the FIRST run_date, and the LAST run_date within the pattern. I aliased the dates to the desired output names of START_DATE and END_DATE, respectively.
Here’s where it starts to get more complex. In my case I only wanted to return one row for each matched pattern, so I chose the ONE ROW PER MATCH option. This collapses all of the “no-change” records into a single record. You can also choose ALL ROWS PER MATCH, which will not collapse the records.
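The ALL ROWS PER MATCH form is handy for debugging, because you can see which classifier each row received. Here's an untested sketch of such a variant, using Snowflake's built-in MATCH_NUMBER() and CLASSIFIER() functions in the MEASURES:

```sql
-- Debug variant: keep every row and label it with its match number
-- and the classifier it was assigned during pattern matching.
SELECT *
FROM column_data
MATCH_RECOGNIZE (
    PARTITION BY pk
    ORDER BY run_date
    MEASURES
        MATCH_NUMBER() AS match_no,
        CLASSIFIER()   AS matched_as
    ALL ROWS PER MATCH
    PATTERN (initial_row_or_change{1} row_no_change* )
    DEFINE
        initial_row_or_change AS
            LAG(md5_record_hash) IS NULL
            OR md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash),
        row_no_change AS
            md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
)
ORDER BY pk, run_date;
```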
Let’s skip over the PATTERN for now and move on to the DEFINE.
DEFINE allows you to assign a named classifier to records that meet certain criteria. Recall that I wanted to detect changes in the md5_record_hash column, so I used the LAG() window function to check whether the value changed from that of the previous record. In addition, I know from experience that the first record in a window/partition will always have a NULL value for LAG(), and I wanted to account for that.
So for the classifier that I defined, initial_row_or_change, records that evaluate to TRUE for my custom expression of
-- First record in a window/partition
LAG(md5_record_hash) IS NULL
-- The hash value on this record is different than the last record
OR md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash)
will be assigned this classifier.
The next line defines another classifier:
row_no_change AS
md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
Similar to the previous classifier, any records where the expression evaluates to TRUE will be assigned the classifier row_no_change.
Now that we’ve defined the classifiers, let’s return to the PATTERN. This describes the pattern of classifiers that you’re trying to match within the partition.
In my case, I want to find exactly one initial_row_or_change followed by zero or more row_no_change records.
PATTERN (initial_row_or_change{1} row_no_change* )
The symbols to the right of the defined classifier mimic regular expression syntax for the number of desired occurrences. The classifiers are space separated and must be contained in parentheses.
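For reference, the quantifiers follow regular-expression conventions. A few common forms (note that {1}, as used in the pattern above, is the default and could be omitted):

```sql
-- PATTERN quantifier examples (regex-style), using placeholder
-- classifiers a and b:
-- PATTERN (a)         -- exactly one a (same as a{1})
-- PATTERN (a b*)      -- one a, then zero or more b
-- PATTERN (a b+)      -- one a, then one or more b
-- PATTERN (a b{2,4})  -- one a, then two to four b
-- PATTERN (a | b)     -- either an a or a b
```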
An alternate solution
Here’s another version of the query that defines three classifiers instead of two. It uses an OR condition, implemented as a | character in the PATTERN section. This returns the same result as the original query, but separates the initial row and the change row into two separate classifiers.
SELECT *
FROM column_data
MATCH_RECOGNIZE (
PARTITION BY pk
ORDER BY run_date
MEASURES
FIRST(md5_record_hash) AS md5_record_hash,
FIRST(run_date) AS start_date,
LAST(run_date) AS end_date
ONE ROW PER MATCH
PATTERN ((initial_row{1} | change_row{1}) no_change_row* )
DEFINE
initial_row AS
LAG(md5_record_hash) IS NULL,
change_row AS
md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash),
no_change_row AS
md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
)
ORDER BY
pk,
start_date;
Which one do you like better?
Conclusions
I’ve done this type of collapsing of daily change records into start/end dates before, and it has taken multiple analytic functions and complicated CTEs. The MATCH_RECOGNIZE implementation, though complex at first glance, really simplifies things.
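For comparison, here’s an untested sketch of the kind of multi-step approach I mean: the classic “gaps and islands” pattern, which flags changes with LAG, numbers the runs with a running sum, and then aggregates each run. It should produce the same collapsed ranges:

```sql
-- Traditional approach: flag changes with LAG, number the runs with a
-- running SUM, then aggregate each run into a start/end date range.
WITH flagged AS (
    SELECT
        pk,
        md5_record_hash,
        run_date,
        CASE
            WHEN LAG(md5_record_hash)
                 OVER (PARTITION BY pk ORDER BY run_date) IS NULL
                THEN 1  -- first row in the partition starts a run
            WHEN md5_record_hash <> LAG(md5_record_hash)
                 OVER (PARTITION BY pk ORDER BY run_date)
                THEN 1  -- hash changed, so a new run starts here
            ELSE 0
        END AS change_flag
    FROM column_data
),
runs AS (
    SELECT
        *,
        -- Running total of change flags gives each run a stable id.
        SUM(change_flag) OVER (
            PARTITION BY pk
            ORDER BY run_date
            ROWS UNBOUNDED PRECEDING
        ) AS run_id
    FROM flagged
)
SELECT
    pk,
    MAX(md5_record_hash) AS md5_record_hash,  -- constant within a run
    MIN(run_date)        AS start_date,
    MAX(run_date)        AS end_date
FROM runs
GROUP BY pk, run_id
ORDER BY pk, start_date;
```

Three CTE stages and two window functions versus one MATCH_RECOGNIZE clause; the intent is also much harder to read off the code.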
Have you solved this same problem in a different way? How does it compare to the MATCH_RECOGNIZE solution? Leave me a comment; I’d love to hear about your experience.