Using MATCH_RECOGNIZE in Snowflake
A clever way to collapse daily snapshot data into relative start/end date ranges.
I’ve been using SQL for over 15 years on various platforms (Oracle, SQL Server, Teradata, Postgres, and Snowflake), and today I came across MATCH_RECOGNIZE. I was blown away by what this thing can do.
This optional subclause to the FROM clause is supported by many database platforms. I’m using Snowflake these days, so here’s a link to their documentation for MATCH_RECOGNIZE.
MATCH_RECOGNIZE is much more powerful than what is shown here, but I was so excited by what I was able to accomplish in a relatively small amount of SQL that I had to share.
Ok, so here’s what I was working on…
The backstory
My team wanted to start detecting schema changes in our upstream app databases. I know that there are other solutions for schema change detection, but I wanted to see what I could build on my own, cheaply.
So I built a simple Python script that pulls data from the information_schema.tables and information_schema.columns views in each app database and uploads the results to S3. In addition to pulling the raw data, I also added an MD5 hash column based on all columns in the view, and built a primary key column (a period-separated concatenation of the table_catalog, table_schema, table_name, and column_name columns) for each row.
-- generated MD5 record hash for information_schema.tables
MD5(
COALESCE(table_catalog::text,'$null$')
|| COALESCE(table_schema::text,'$null$')
|| COALESCE(table_name::text,'$null$')
|| COALESCE(table_type::text,'$null$')
|| COALESCE(self_referencing_column_name::text,'$null$')
|| COALESCE(reference_generation::text,'$null$')
|| COALESCE(user_defined_type_catalog::text,'$null$')
|| COALESCE(user_defined_type_schema::text,'$null$')
|| COALESCE(user_defined_type_name::text,'$null$')
|| COALESCE(is_insertable_into::text,'$null$')
|| COALESCE(is_typed::text,'$null$')
|| COALESCE(commit_action::text,'$null$')
) AS md5_record_hash,
-- generated primary key column for information_schema.tables
COALESCE(table_catalog::text,'$null$')
|| '.' || COALESCE(table_schema::text,'$null$')
|| '.' || COALESCE(table_name::text,'$null$') AS pk,
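One caveat worth noting about straight concatenation: without a separator between fields, two different rows can hash to the same string (e.g. 'ab' || 'c' and 'a' || 'bc' concatenate identically). A hypothetical tweak, not from the original script, would be to put a delimiter between fields:

```sql
-- Collision-safer variant: separate fields with a delimiter that
-- shouldn't appear in the data, so adjacent values can't blend together.
MD5(
    COALESCE(table_catalog::text,'$null$')
    || '|' || COALESCE(table_schema::text,'$null$')
    || '|' || COALESCE(table_name::text,'$null$')
    -- ...remaining columns, each preceded by || '|' ||
) AS md5_record_hash,
```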
The files are then ingested from S3 via Snowpipe.
Let’s focus on the data from the information_schema.columns view. It has a grain of one row per database/schema/table/column in the source db. Pulling in the daily snapshots, we end up with one row per day, per app db, at that basic grain. That data contains a lot of noise, because it includes rows whether or not the definition of the column changed. It would be nice to eliminate the noise (days where there are no changes) and instead only see when the definition of a column actually changes. This is where the MD5 hash column, the primary key, and MATCH_RECOGNIZE come to the rescue.
Here’s a simplified example of the ingested data.
-- Create temporary table to hold test data
CREATE OR REPLACE TEMPORARY TABLE column_data (
pk VARCHAR,
md5_record_hash VARCHAR,
run_date DATE
);
-- Insert test data
INSERT INTO column_data VALUES ('app_db.public.users.id','1d89ed6a7992c77ca6b354dc740c4863','2024-01-01');
INSERT INTO column_data VALUES ('app_db.public.users.id','1d89ed6a7992c77ca6b354dc740c4863','2024-01-02');
INSERT INTO column_data VALUES ('app_db.public.users.id','1d89ed6a7992c77ca6b354dc740c4863','2024-01-03');
INSERT INTO column_data VALUES ('app_db.public.users.id','64423ab2626a19efd4c3b58ff3a81ac9','2024-01-04');
INSERT INTO column_data VALUES ('app_db.public.users.id','64423ab2626a19efd4c3b58ff3a81ac9','2024-01-05');
INSERT INTO column_data VALUES ('app_db.public.users.id','df8e14b4da70837b1e89095790988380','2024-01-06');
INSERT INTO column_data VALUES ('app_db.public.users.username','c05a71f3b6efd66f773e63ba6ce9fc3e','2024-01-05');
INSERT INTO column_data VALUES ('app_db.public.users.username','c05a71f3b6efd66f773e63ba6ce9fc3e','2024-01-05'); -- Intentional duplicate
INSERT INTO column_data VALUES ('app_db.public.users.username','c05a71f3b6efd66f773e63ba6ce9fc3e','2024-01-06');
INSERT INTO column_data VALUES ('app_db.public.users.username','271f05dc35cc5b5a33a8da89e42c02dd','2024-01-07');
INSERT INTO column_data VALUES ('app_db.public.users.username','271f05dc35cc5b5a33a8da89e42c02dd','2024-01-15'); -- Intentional gap
INSERT INTO column_data VALUES ('app_db.public.users.username','99210cdd38519cf6ea90b7eb25e27c05','2024-01-16');
-- Show inserted data
SELECT *
FROM column_data
ORDER BY pk, run_date;
Notice that for PK value app_db.public.users.id, I have daily snapshots from 2024-01-01 through 2024-01-06. Some of those daily snapshots have the same hash value while others are different. A change in the md5_record_hash value indicates that something about the source database’s column has changed.
MATCH_RECOGNIZE can collapse these common records and return the start/end dates during which the column had a specific md5_record_hash value. This eliminates the noise from days when the daily snapshot stayed the same.
The Solution
-- Collapse time frames to only show when the record changed.
SELECT *
FROM column_data
MATCH_RECOGNIZE (
PARTITION BY pk
ORDER BY run_date
MEASURES
FIRST(md5_record_hash) AS md5_record_hash,
FIRST(run_date) AS start_date,
LAST(run_date) AS end_date
ONE ROW PER MATCH
PATTERN (initial_row_or_change{1} row_no_change* )
DEFINE
initial_row_or_change AS
LAG(md5_record_hash) IS NULL
OR md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash),
row_no_change AS
md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
)
ORDER BY
pk,
start_date;
This yields exactly the output that I wanted!
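Working through the test data by hand, the collapsed result should look like this (six rows; the duplicate 2024-01-05 row folds into its range, and the 2024-01-07 to 2024-01-15 gap collapses into one range because the hash never changed):

```sql
-- Expected result for the test data above:
--
-- PK                            MD5_RECORD_HASH                   START_DATE  END_DATE
-- app_db.public.users.id        1d89ed6a7992c77ca6b354dc740c4863  2024-01-01  2024-01-03
-- app_db.public.users.id        64423ab2626a19efd4c3b58ff3a81ac9  2024-01-04  2024-01-05
-- app_db.public.users.id        df8e14b4da70837b1e89095790988380  2024-01-06  2024-01-06
-- app_db.public.users.username  c05a71f3b6efd66f773e63ba6ce9fc3e  2024-01-05  2024-01-06
-- app_db.public.users.username  271f05dc35cc5b5a33a8da89e42c02dd  2024-01-07  2024-01-15
-- app_db.public.users.username  99210cdd38519cf6ea90b7eb25e27c05  2024-01-16  2024-01-16
```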
I’ll explain the pieces of MATCH_RECOGNIZE to make it clearer what’s going on in the SQL.
Breaking down the components
First you have the PARTITION BY and ORDER BY portions. These define the scope within which you want to perform the matching and the order in which the records are placed before searching for the pattern.
The MEASURES portion defines what aggregates you want returned when the pattern is matched. In my case I wanted to return the FIRST md5_record_hash, the FIRST run_date, and the LAST run_date within the pattern. I aliased the dates to the desired output names of START_DATE and END_DATE, respectively.
Here’s where it starts to get more complex. In my case I only wanted to return one row for each matched pattern, so I chose the ONE ROW PER MATCH option. This collapses all of the “no-change” records into a single record. You can also choose ALL ROWS PER MATCH, which will not collapse the records.
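The ALL ROWS PER MATCH form is handy for debugging, because you can see which classifier each row received. Here's an untested sketch of such a variant, using Snowflake's built-in MATCH_NUMBER() and CLASSIFIER() functions in the MEASURES:

```sql
-- Debug variant: keep every row and label it with its match number
-- and the classifier it was assigned during pattern matching.
SELECT *
FROM column_data
MATCH_RECOGNIZE (
    PARTITION BY pk
    ORDER BY run_date
    MEASURES
        MATCH_NUMBER() AS match_no,
        CLASSIFIER()   AS matched_as
    ALL ROWS PER MATCH
    PATTERN (initial_row_or_change{1} row_no_change* )
    DEFINE
        initial_row_or_change AS
            LAG(md5_record_hash) IS NULL
            OR md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash),
        row_no_change AS
            md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
)
ORDER BY pk, run_date;
```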
Let’s skip over the PATTERN for now and move on to the DEFINE.
DEFINE allows you to assign a named classifier to records that meet certain criteria. Recall that I wanted to detect changes in the md5_record_hash column, so I used the LAG() window function to check whether the value changed from that of the previous record. In addition, I know from experience that the first record in a window/partition will always have a NULL value for LAG(), and I wanted to account for that.
So for the classifier that I defined, initial_row_or_change, records that evaluate to TRUE for my custom expression of
-- First record in a window/partition
LAG(md5_record_hash) IS NULL
-- The hash value on this record is different than the last record
OR md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash)
will be assigned this classifier.
The next line defines another classifier:
row_no_change AS
md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
Similar to the previous classifier, any records where the expression evaluates to TRUE will be assigned the classifier row_no_change.
Now that we’ve defined the classifiers, let’s return to the PATTERN. This describes the pattern of classifiers that you’re trying to match within the partition.
In my case, I want to find exactly one initial_row_or_change followed by zero or more row_no_change records.
PATTERN (initial_row_or_change{1} row_no_change* )
The symbols to the right of the defined classifier mimic regular expression syntax for the number of desired occurrences. The classifiers are space separated and must be contained in parentheses.
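For reference, the quantifiers follow regular-expression conventions. A few common forms (note that {1}, as used in the pattern above, is the default and could be omitted):

```sql
-- PATTERN quantifier examples (regex-style), using placeholder
-- classifiers a and b:
-- PATTERN (a)         -- exactly one a (same as a{1})
-- PATTERN (a b*)      -- one a, then zero or more b
-- PATTERN (a b+)      -- one a, then one or more b
-- PATTERN (a b{2,4})  -- one a, then two to four b
-- PATTERN (a | b)     -- either an a or a b
```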
An alternate solution
Here’s another version of the query that defines three classifiers instead of two. It uses an OR condition, implemented as a | character in the PATTERN section. This returns the same result as the original query, but separates the initial row and the change row into two separate classifiers.
SELECT *
FROM column_data
MATCH_RECOGNIZE (
PARTITION BY pk
ORDER BY run_date
MEASURES
FIRST(md5_record_hash) AS md5_record_hash,
FIRST(run_date) AS start_date,
LAST(run_date) AS end_date
ONE ROW PER MATCH
PATTERN ((initial_row{1} | change_row{1}) no_change_row* )
DEFINE
initial_row AS
LAG(md5_record_hash) IS NULL,
change_row AS
md5_record_hash <> NVL(LAG(md5_record_hash), md5_record_hash),
no_change_row AS
md5_record_hash = NVL(LAG(md5_record_hash), md5_record_hash)
)
ORDER BY
pk,
start_date;
Which one do you like better?
Conclusions
I’ve done this type of collapsing of daily change records into start/end dates before, and it has taken multiple analytic functions and complicated CTEs. The MATCH_RECOGNIZE implementation, though complex at first glance, really simplifies things.
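For comparison, here’s an untested sketch of the kind of multi-step approach I mean: the classic “gaps and islands” pattern, which flags changes with LAG, numbers the runs with a running sum, and then aggregates each run. It should produce the same collapsed ranges:

```sql
-- Traditional approach: flag changes with LAG, number the runs with a
-- running SUM, then aggregate each run into a start/end date range.
WITH flagged AS (
    SELECT
        pk,
        md5_record_hash,
        run_date,
        CASE
            WHEN LAG(md5_record_hash)
                 OVER (PARTITION BY pk ORDER BY run_date) IS NULL
                THEN 1  -- first row in the partition starts a run
            WHEN md5_record_hash <> LAG(md5_record_hash)
                 OVER (PARTITION BY pk ORDER BY run_date)
                THEN 1  -- hash changed, so a new run starts here
            ELSE 0
        END AS change_flag
    FROM column_data
),
runs AS (
    SELECT
        *,
        -- Running total of change flags gives each run a stable id.
        SUM(change_flag) OVER (
            PARTITION BY pk
            ORDER BY run_date
            ROWS UNBOUNDED PRECEDING
        ) AS run_id
    FROM flagged
)
SELECT
    pk,
    MAX(md5_record_hash) AS md5_record_hash,  -- constant within a run
    MIN(run_date)        AS start_date,
    MAX(run_date)        AS end_date
FROM runs
GROUP BY pk, run_id
ORDER BY pk, start_date;
```

Three CTE stages and two window functions versus one MATCH_RECOGNIZE clause; the intent is also much harder to read off the code.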
Have you solved this same problem in a different way? How does it compare to the MATCH_RECOGNIZE solution? Leave me a comment; I’d love to hear about your experience.