Monitoring Data Quality Natively in Snowflake

Using Snowflake’s data metric functions to monitor the quality of your data by measuring data freshness, duplicate counts, the number of NULL values, or any other metric you write as a custom data metric function

Photo by Ruvim Miksanskiy on Pexels

The Snowflake functionality we’ve been waiting for is finally here! Now we can measure data quality natively in Snowflake. Since Snowflake standard tables don’t enforce primary key or unique constraints, we previously had to resort to writing our own code to check uniqueness, for example.
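
As a quick illustration, such an ad-hoc uniqueness check typically looked something like the following query, written here against a hypothetical ORDERS table with an ORDER_ID column (both names are placeholders, not part of this article’s example):

-- list ORDER_ID values that appear more than once
select order_id, count(*) as duplicate_rows
from ORDERS
group by order_id
having count(*) > 1;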

Now we can add a data metric function to a table or a view, specify the schedule for how frequently we want the metric function to be evaluated, and check the results in the event table. We can then configure alerts that send notifications based on criteria we provide.

Let’s walk through an example to demonstrate how the data metric functionality works.

Data Quality and data metric functions are currently available in Snowflake Enterprise Edition or higher and are not available in Snowflake trial accounts.

Setting up the environment and sample data

Let’s create a role named DQ_ENGINEER we will use for the example and grant the role to our current user so we can use it:

use role securityadmin;
create role DQ_ENGINEER;
grant role DQ_ENGINEER to user <my_user>;

Next, grant the privileges that the DQ_ENGINEER role needs to work with data metric functions:

  • database role SNOWFLAKE.DATA_METRIC_USER — to access system defined data metric functions in the SNOWFLAKE.CORE schema
  • EXECUTE DATA METRIC FUNCTION — to execute data metric functions
  • application role SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER — to view the output of data metric functions
use role ACCOUNTADMIN;
grant database role SNOWFLAKE.DATA_METRIC_USER to role DQ_ENGINEER;
grant EXECUTE DATA METRIC FUNCTION on account to role DQ_ENGINEER;
grant application role SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER
to role DQ_ENGINEER;

The DQ_ENGINEER role also needs access to a virtual warehouse and the privilege to create a database:

use role SYSADMIN;
create warehouse DQ_WH;
grant usage on warehouse DQ_WH to role DQ_ENGINEER;
grant create database on account to role DQ_ENGINEER;

Now we are ready to use the DQ_ENGINEER role to create a database named DQ_DB, two schemas named DQ_DATA and DQ_FUNCTIONS, and a table named SALES with some sample data:

use role DQ_ENGINEER;
use warehouse DQ_WH;
create database DQ_DB;
create schema DQ_FUNCTIONS;
create schema DQ_DATA;

create or replace table DQ_DATA.SALES(
sale_id integer,
sale_date date,
customer_id varchar,
payment_method varchar);

insert into DQ_DATA.SALES values
(826108, '2024-04-08', '4276', 1),
(260982, '2024-04-26', '644A', 1),
(819485, '2024-04-29', '2344', 2),
(538988, '2044-04-09', '2610', 1),
(538988, '2024-04-26', '6317', 1),
(781004, '2024-04-30', '3306', 0),
(303412, '2024-04-27', '2664', 2),
(805451, '2024-05-02', '2884', 1),
(729482, '2024-04-25', NULL, 1),
(254959, '2024-04-19', '6343', 1);

Let’s look at the sample data. Here are some observations:

  • The SALE_ID column should be the primary key in this table and as such it should be unique. But we see that there are two rows with 538988 as the value of the SALE_ID column.
  • The CUSTOMER_ID column contains a NULL value.
  • The CUSTOMER_ID column is type VARCHAR, but the values appear as numbers, with one exception where the value is ‘644A’.

Now let’s apply data metric functions that will quantify these observations.

System Data Metric Functions

Snowflake provides the following system data metric functions:

  • FRESHNESS — returns the freshness of the data, measured as the number of seconds between the last time the data in the table was updated and the data metric function’s scheduled time
  • DATA_METRIC_SCHEDULED_TIME — returns the timestamp of the data metric function’s scheduled time
  • NULL_COUNT — returns the number of NULL values in a column
  • DUPLICATE_COUNT — returns the number of duplicate values in a column, including NULL values
  • UNIQUE_COUNT — returns the number of unique non-NULL values in a column or combination of columns
  • ROW_COUNT — returns the number of rows in a table
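
For a quick preview of what a metric would return, a system data metric function can also be called directly in a query, without any schedule; for example, to check the SALE_ID column on the spot, something like this should work:

select SNOWFLAKE.CORE.DUPLICATE_COUNT(
select sale_id from DQ_DATA.SALES);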

To use a data metric function, we add it to a table column or columns. But before we can do that, we must define the schedule based on which the data metric functions execute. We can schedule data metric functions on a table in different ways:

  • Use a time interval in minutes that executes the data metric functions repeatedly
  • Use a cron expression
  • Define a trigger event that executes the data metric functions when the data in the table changes as a result of DML operations

For testing our example we will schedule the data metric functions to execute on the SALES table every 5 minutes, like this:

alter table DQ_DATA.SALES set DATA_METRIC_SCHEDULE = '5 MINUTE';
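
If we preferred a cron-based schedule instead, the same parameter takes a USING CRON expression; for example, running once a day at 8:00 UTC would look something like this:

alter table DQ_DATA.SALES set DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * UTC';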

Now we can add system data metric functions to columns in the table. We want to check uniqueness of the SALE_ID column using the DUPLICATE_COUNT system data metric function. This function returns the number of duplicate values. If the column is unique, there are no duplicate values and the function returns 0. Otherwise, it returns a value greater than 0. We add the DUPLICATE_COUNT function to the SALE_ID column by executing the ALTER TABLE command:

alter table DQ_DATA.SALES
add data metric function SNOWFLAKE.CORE.DUPLICATE_COUNT
on (sale_id);

Similarly, we can add other system data metric functions to the same or other columns in the table. We will add the NULL_COUNT data metric function to the CUSTOMER_ID column, like this:

alter table DQ_DATA.SALES 
add data metric function SNOWFLAKE.CORE.NULL_COUNT
on (customer_id);

Now we wait for 5 minutes until the next scheduled data metric function execution occurs and check the results in the DATA_QUALITY_MONITORING_RESULTS view:

select measurement_time, table_name, metric_name, argument_names, value
from SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
order by measurement_time desc;

It’s convenient to order the results by the MEASUREMENT_TIME in descending order so that we see the latest results first.

Here is the output of the previous command:

Data quality monitoring results after adding system defined data metric functions

We see that the DUPLICATE_COUNT data metric function returns the value 1, which is correct, because we have one duplicate row based on the SALE_ID column. The NULL_COUNT data metric function also returns the value 1, which is again correct, because we have one NULL value in the CUSTOMER_ID column.

After 5 more minutes, we should see two more rows with the same results from the next scheduled execution of the data metric functions, and so on.
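
Over time this view accumulates measurements for every table we monitor, so it is often useful to filter on the table name as well, for example:

select measurement_time, metric_name, argument_names, value
from SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
where table_name = 'SALES'
order by measurement_time desc;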

User-defined Data Metric Functions

In addition to the system defined data metric functions, we can create our own functions to perform custom data checking. A user-defined data metric function follows the same syntax as a standard UDF, except the parameters are provided as a table name and column names on which the function operates.

Let’s create a user-defined data metric function that checks if there are any values in the CUSTOMER_ID column that don’t represent a number. We will use the TRY_TO_NUMBER() function for this purpose. The TRY_TO_NUMBER() function returns a number if the input is a valid number and NULL otherwise. Here is how we define the data metric function named NON_NUMERIC_CUSTOMER_ID, using the CREATE DATA METRIC FUNCTION syntax and providing the table name and the column name as the parameters:

create data metric function DQ_FUNCTIONS.NON_NUMERIC_CUSTOMER_ID(
SALES table(customer_id varchar))
returns integer
AS
$$
select count(*)
from SALES
where try_to_number(customer_id) is null
$$;

We now add this user-defined data metric function to the CUSTOMER_ID column in the SALES table just like we did earlier with the system data metric functions:

alter table DQ_DATA.SALES
add data metric function DQ_FUNCTIONS.NON_NUMERIC_CUSTOMER_ID
on (customer_id);

The table already has a schedule defined so we can wait another 5 minutes to see the results:

select measurement_time, table_name, metric_name, 
metric_schema, metric_database, argument_names, value
from SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
order by measurement_time desc;
Data quality monitoring results after adding system defined data metric functions and a user-defined data metric function

We see that the NON_NUMERIC_CUSTOMER_ID data metric function returned a value of 2, because it counted the ‘644A’ and the NULL values as non-numeric. The benefit of having a user-defined data metric function is that we can modify it as we wish, for example to not count the NULL values.
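
For example, a variation that ignores NULL values and counts only non-NULL values that fail the numeric check could look like the sketch below; note that if the function is already attached to a table, its association may need to be dropped before replacing it:

create or replace data metric function DQ_FUNCTIONS.NON_NUMERIC_CUSTOMER_ID(
SALES table(customer_id varchar))
returns integer
AS
$$
select count(*)
from SALES
where customer_id is not null
and try_to_number(customer_id) is null
$$;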

Changing the data metric function schedule

We defined the data metric schedule on the SALES table to execute every 5 minutes because it was convenient to test. Unless we have data arriving in the table at a high frequency, we probably don’t need to check data quality on the table every 5 minutes.

Let’s change the schedule so that it executes only when data is changed in the table via DML operations. To do that, we have to unset the current schedule and set a new schedule to trigger on changes:

alter table DQ_DATA.SALES unset DATA_METRIC_SCHEDULE;

alter table DQ_DATA.SALES set DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';

Now, let’s make some changes in the table, for example add a new row:

insert into DQ_DATA.SALES values
(267017, '2024-05-01', '698B', 1);

After the insert, we don’t have to wait 5 minutes like earlier, because the data metric functions should trigger soon after we execute the DML statements. Let’s see what we have in the data quality monitoring results:

Data quality monitoring results after changing the schedule to trigger on changes and inserting a new row of data

We can see that the data quality monitoring functions were triggered and that the NON_NUMERIC_CUSTOMER_ID function now identifies 3 CUSTOMER_IDs with non-numeric values, because we just inserted the third one.

Viewing data metric function information

Adding and removing data metric functions from a table and changing the schedule can become complex after a while, so Snowflake provides ways to view what we have defined to help us keep track of the data quality monitoring functions on each table.

To see the metric schedule on a table, we can use the SHOW PARAMETERS command with the table name as the parameter, like this:

show parameters like 'DATA_METRIC_SCHEDULE' in table SALES;
Value of the DATA_METRIC_SCHEDULE parameter on the SALES table

To see the data metric functions associated with a table, we can call the DATA_METRIC_FUNCTION_REFERENCES table function with the object name and object domain as parameters, like this:

select * from table(
INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
ref_entity_name => 'DQ_DB.DQ_DATA.SALES',
ref_entity_domain => 'table'
)
);
Output of the DATA_METRIC_FUNCTION_REFERENCES table function for the SALES table

The parameter REF_ENTITY_DOMAIN can take the values ‘table’ or ‘view’ because we can add data metric functions to both tables and views.
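
Once we know which data metric functions are attached to the table, we can also remove one we no longer need from a column. For example, if we no longer wanted the DUPLICATE_COUNT function on the SALE_ID column, we could drop it like this:

alter table DQ_DATA.SALES
drop data metric function SNOWFLAKE.CORE.DUPLICATE_COUNT
on (sale_id);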

After many rounds of testing, we might want to remove the data metric history so that we only see results of data quality metric functions once we start executing them for real. We can truncate the underlying table that provides data to the SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS view by executing the TRUNCATE command. However, the DQ_ENGINEER role doesn’t have the privilege to truncate this table. You can run the TRUNCATE command using the ACCOUNTADMIN role, or use the ACCOUNTADMIN role to grant the DQ_ENGINEER role the DATA_QUALITY_MONITORING_ADMIN application role, like this:

use role ACCOUNTADMIN;
grant application role SNOWFLAKE.DATA_QUALITY_MONITORING_ADMIN
to role DQ_ENGINEER;
use role DQ_ENGINEER;
truncate table SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS_RAW;

Alerting when data quality metrics exceed defined thresholds

Once you have your data metric functions configured and functioning on schedule, you don’t want to monitor them manually by selecting from the data quality monitoring results on a regular basis. Most likely, you will define thresholds that the outputs of the data metric functions should not exceed and send alerts when they do.

To set up the alerting functionality, see the Setting Up Alerts Based on Data in Snowflake section in the documentation. You might also refer to this blog that shows an example of setting up an alert: Snowflake Data Metric Functions and Alerts.
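
As a rough sketch of what such an alert could look like, assuming an email notification integration named MY_EMAIL_INT already exists and that any duplicate SALE_ID should trigger a notification, we might define and resume an alert along these lines (the alert name, recipient, and threshold are illustrative only):

-- requires CREATE ALERT and EXECUTE ALERT privileges, plus an email notification integration
create or replace alert DQ_DATA.DUPLICATE_SALE_ID_ALERT
warehouse = DQ_WH
schedule = '60 MINUTE'
if (exists (
select 1
from SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
where table_name = 'SALES'
and metric_name = 'DUPLICATE_COUNT'
and value > 0
and measurement_time > dateadd(hour, -1, current_timestamp())
))
then call SYSTEM$SEND_EMAIL(
'MY_EMAIL_INT',
'data.team@example.com',
'Data quality alert: duplicate SALE_IDs',
'The DUPLICATE_COUNT metric on DQ_DATA.SALES returned a value greater than 0.');

-- alerts are created in a suspended state and must be resumed
alter alert DQ_DATA.DUPLICATE_SALE_ID_ALERT resume;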

I’m Maja Ferle, Snowflake Data Superhero and a senior consultant at In516ht. You can get in touch with me on LinkedIn.
