Monitoring Data Quality using Snowflake Data Metric Functions
Data governance and data quality are top of mind in more and more organizations. Snowflake’s Data Trends 2024 report lists governance as a top industry concern, and Snowflake continues to make a commitment to governance and data quality with features such as tagging, classification, masking and more. In this post I’m going to do a brief walkthrough of the new Data Metric Functions, which have been in Public Preview in Enterprise Edition and above since March of 2024. These functions make the automatic inspection of your data for quality issues extremely easy to implement.
(A quick note on this feature: unfortunately it is not available in trial accounts at this time; you will need a paid account to explore it.)
Data Metric Functions (DMFs) are a class of functions that can be used to monitor the quality of your data. There are both “out of the box” functions provided by Snowflake and user-defined functions available. Once enabled, these functions can be used to provide regular metrics on data quality issues within the tables you specify. For example, you can define a DMF to inspect a column for invalid data (such as missing mandatory values) that does not rise to the level of a referential integrity violation but still signals a problem. And you aren’t limited to something like missing data: you can inspect any aspect of your data and generate metrics on it, such as the frequency of a particular event. Data Stewards can then take action based on the results of your DMF.
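As a taste of the “out of the box” variety, the missing-mandatory-values case above is covered by the system DMF SNOWFLAKE.CORE.NULL_COUNT, which you can call directly in a query. A minimal sketch, assuming a hypothetical ORDERS table with an ORDER_ID column that should never be NULL:

SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT ORDER_ID FROM ORDERS);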
You might be thinking “So what? I can build a proc that runs a query over a table and logs the results in some table that I can monitor, and I can also schedule the proc, so why do I need DMFs?” The answer is that DMFs provide a pre-built framework that lets you do all of this very quickly, easily and in a standardized way. Even the scheduling of DMFs is simple, and you don’t need to build your own target table to store the results. The DMFs are integrated right down to the billing layer, so you can instantly see the costs entailed in inspecting for data quality.
Consider a table called RAW.STRIPE.PAYMENT, which holds rows of payment data.
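For orientation, the shape of the table is something like the following sketch; every column name except STATUS is an assumption for this walkthrough, and STATUS holds values such as 'success' or 'fail':

CREATE TABLE RAW.STRIPE.PAYMENT (
    ID            STRING,   -- payment identifier
    ORDERID       STRING,   -- order this payment belongs to
    PAYMENTMETHOD STRING,   -- e.g. 'credit_card', 'bank_transfer'
    STATUS        STRING,   -- 'success' or 'fail'
    AMOUNT        NUMBER,   -- payment amount
    CREATED       DATE      -- when the payment was made
);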
This table holds raw payment data from a payment processor. Sometimes payments fail. It might be a good idea to know how often this happens, and to give users regular insight into the number of failures over a given period of time. DMFs make this extremely easy.
We start by declaring the DMF itself:
CREATE DATA METRIC FUNCTION IF NOT EXISTS
    fail_status (ARG_T TABLE(ARG_C1 STRING))
RETURNS NUMBER AS
    'SELECT COUNT(*) FROM ARG_T WHERE ARG_C1 = ''fail''';
This creates a DMF called “fail_status”. Something key to note here: the arguments to the DMF are a table and a column of that table, which means the function is not tied to any particular table. We can create DMFs at a level of abstraction above any specific table, and reuse a single DMF across multiple tables where appropriate.
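Before wiring the DMF up to a schedule, you can sanity-check it by calling it directly in a query, in the same way you would call a system DMF. A quick sketch against our table:

SELECT fail_status(SELECT status FROM RAW.STRIPE.PAYMENT);

This returns the current count of rows with a status of 'fail', with no scheduling involved.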
Once we have created our DMF, we have to turn to the table we want to monitor, and establish how often we want to run this function. Obviously there is a lot to consider here, but as an example let’s run it every 5 minutes:
ALTER TABLE RAW.STRIPE.PAYMENT
    SET DATA_METRIC_SCHEDULE = '5 MINUTE';
(It may seem a little out of order to schedule the table before we attach the function, but that is the sequence required — if you try to attach the function to the table first you will get an error).
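As an aside, '5 MINUTE' is not the only option: DATA_METRIC_SCHEDULE also accepts cron syntax, as well as a trigger mode that runs the DMF whenever DML changes the table. For example, to run the checks daily at 8 AM (the time zone here is just for illustration):

ALTER TABLE RAW.STRIPE.PAYMENT
    SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * America/Los_Angeles';

Or, to run on changes:

ALTER TABLE RAW.STRIPE.PAYMENT
    SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';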
Next, we need to attach the function to the table we are interested in:
ALTER TABLE RAW.STRIPE.PAYMENT
    ADD DATA METRIC FUNCTION fail_status ON (status);
That’s it! Every 5 minutes our DMF will execute, run the SELECT defined in the DMF using the column “status” and store the results.
But wait, store them where?
Simple! Everything is logged to a table in SNOWFLAKE.LOCAL called DATA_QUALITY_MONITORING_RESULTS_RAW, which is in turn accessed through a view called DATA_QUALITY_MONITORING_RESULTS.
In the same way that you don’t have to integrate your DMF into a scheduler, you don’t need to set up a target of any kind: just SELECT from this view and you will see the output of any DMFs you have running. The metadata is basically self-explanatory, but a full list of the columns and their definitions is available in the Snowflake documentation.
A simple query to find out when the measurement was taken and the results of the DMF is as follows:
SELECT MEASUREMENT_TIME,
       VALUE AS "CUMULATIVE FAILED PAYMENTS"
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE TABLE_DATABASE = 'RAW'
  AND TABLE_SCHEMA = 'STRIPE'
  AND TABLE_NAME = 'PAYMENT'
  AND METRIC_NAME = 'FAIL_STATUS'
ORDER BY MEASUREMENT_TIME;
As the DMF executes we will see subsequent rows appear in this view, in our case at 5-minute intervals.
A note on billing: you don’t have to connect your DMF to any particular warehouse. Instead, to make this as simple as possible, Snowflake runs DMFs on serverless compute that it manages on your behalf. You are billed for this consumption under a special “Data Quality Monitoring” section of your bill, so you can see immediately what the costs associated with this kind of activity are. You can also query the DATA_QUALITY_MONITORING_USAGE_HISTORY view to see your consumption.
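A quick sketch of such a query against the Account Usage view (the column list here follows the documentation, and note that Account Usage views can lag real time):

SELECT START_TIME,
       END_TIME,
       CREDITS_USED
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY
ORDER BY START_TIME DESC;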
One last thing: to avoid unnecessary resource consumption, disconnect the DMF from the table after you have finished experimenting with this feature:
ALTER TABLE RAW.STRIPE.PAYMENT
    DROP DATA METRIC FUNCTION fail_status ON (status);
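And if you are done with scheduled monitoring on the table entirely, you can clear the schedule as well:

ALTER TABLE RAW.STRIPE.PAYMENT
    UNSET DATA_METRIC_SCHEDULE;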