Tackling Data Quality Challenges Using Data Metric Functions in Snowflake
A step in the right direction…
Many of us have experienced data quality nightmares and witnessed first-hand the impact they can have on the business, on compliance, and on the credibility of those involved. Even with the many data quality tools and home-grown solutions out there, data quality remains a major concern for customers: data privacy, lack of consistency, no centralised metrics repository, multiple technologies, and integration challenges.
In an era dominated by AI/ML, ensuring data quality is paramount for intelligent decision-making and reliable outcomes. In this article, I explore how customers can leverage Data Metric Functions in Snowflake to monitor data quality issues and take action before it is too late!
Data Metric Functions
Data Metric Functions were announced at Snowflake Summit 2023 and have generated a great deal of customer interest. Why? Because as data lands in Snowflake, customers now have the capability to leverage out-of-the-box System Data Metric Functions (freshness, plus counts that measure duplicates, NULLs, rows, and unique values to begin with) or write their own Custom Data Metric Functions to evaluate the quality of their data and take the necessary actions. This feature is now in GA, making it possible for all customers to explore it, adapt it to suit their needs, and provide feedback.
Key points to note:
- A Data Metric Function is a new schema-level object.
- A metric function can be defined once and reused many times.
- User-Defined Data Metric Functions can be used to evaluate metrics to support business logic.
- User-Defined Data Metric Functions take a new TABLE data type input argument for column-level aggregate metrics.
- User-Defined Data Metric Functions can be created in any database, e.g. a central Governance database to promote reusability.
- Out-of-the-box System Data Metric Functions are available in the SNOWFLAKE.CORE schema. They can be used to evaluate data freshness, volume, accuracy, and statistics.
- The results of calling a Data Metric Function are stored in an event table in the customer's account. This table is called SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS_RAW.
- The raw data in the event table is flattened into a view called SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS.
- Data engineers can now fail ingestion pipelines or alert supporting teams in-flight, based on the metric evaluation outcome.
- Other data observability tools can leverage the data in the event table, as it is stored in OpenTelemetry format.
- Periodic measurements for all metrics are executed serverlessly in the background by Snowflake. Schedules can be set up as DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES', at a fixed frequency such as '5 MINUTE', or with CRON-like schedules (see the sketch after this list).
- Built-in intelligence to scan only incremental data for cost and performance efficiency.
- Multiple metrics can be associated with a single column, or the same metric can be associated with multiple columns.
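To make the scheduling options concrete, here is a minimal sketch, assuming a placeholder table T that already has metric functions associated; the walkthrough below uses the fixed-frequency form.
-- Evaluate associated metrics whenever the table's content changes
alter table t set data_metric_schedule = 'TRIGGER_ON_CHANGES';
-- ...or at a fixed frequency
alter table t set data_metric_schedule = '5 MINUTE';
-- ...or on a CRON-like schedule, e.g. daily at 08:00 UTC
alter table t set data_metric_schedule = 'USING CRON 0 8 * * * UTC';
A table has a single DATA_METRIC_SCHEDULE parameter, so setting a new value replaces the previous one.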
Tracking Data Quality in Customer Loyalty Database
Using User-Defined Data Metric Functions
In this article, I demonstrate how I monitor the data quality of customer loyalty data in my Snowflake Account. The central GOVERNANCE database contains the User-Defined Data Metric functions.
The Metric evaluation results are stored in a default event table in SNOWFLAKE.LOCAL schema, while the LOYALTY_DB database contains the CUSTOMERS table with sample data.
-- Create GOVERNANCE and LOYALTY_DB databases
use role sysadmin;
create database if not exists governance;
create database if not exists loyalty_db;
-- Create schema DMFS
use database governance;
create schema if not exists dmfs;
-- create schema LOYALTY
use database loyalty_db;
create schema if not exists loyalty;
Next, I set up the account-level security roles DQ_MONITOR_ROLE and LOYALTY_ROLE and grant the required privileges.
-- Create new roles
create or replace role dq_monitor_role;
create or replace role loyalty_role;
use database governance;
use schema dmfs;
-- Grant required privileges to DQ_MONITOR_ROLE
grant usage on database governance
to role dq_monitor_role;
grant usage on schema dmfs
to role dq_monitor_role;
grant create data metric function on schema governance.dmfs
to role dq_monitor_role;
grant database role snowflake.usage_viewer
to role dq_monitor_role;
grant application role snowflake.data_quality_monitoring_viewer
to role dq_monitor_role;
use role accountadmin;
show grants to database role snowflake.data_metric_user;
show grants to database role snowflake.usage_viewer;
show grants to application role snowflake.data_quality_monitoring_viewer;
grant usage on database governance
to role loyalty_role;
grant database role snowflake.data_metric_user
to role loyalty_role;
grant usage on schema governance.dmfs
to role loyalty_role;
-- This privilege must be granted to the role that owns the
-- table to be monitored
grant execute data metric function on account
to role loyalty_role;
grant usage on warehouse compute_wh
to role dq_monitor_role;
grant usage on warehouse compute_wh
to role loyalty_role;
-- Build RBAC hierarchy
grant role dq_monitor_role
to role sysadmin;
grant role loyalty_role
to role sysadmin;
Next, I load the sample data to the CUSTOMERS table in LOYALTY_DB.
Note that there are two invalid emails in this dataset. For simplicity, I use the INSERT SQL command to load the data.
-- Create a customers table in the loyalty database
use role loyalty_role;
use database loyalty_db;
show schemas;
use schema loyalty;
create or replace TABLE CUSTOMERS (
ACCOUNT_NUMBER NUMBER(38,0),
FIRST_NAME VARCHAR(200),
LAST_NAME VARCHAR(200),
EMAIL VARCHAR(200))
;
INSERT INTO CUSTOMERS (ACCOUNT_NUMBER,FIRST_NAME, LAST_NAME, EMAIL)
VALUES (2342378, 'REGINA', 'DOLE', 'regina.dole@email.com');
INSERT INTO CUSTOMERS (ACCOUNT_NUMBER,FIRST_NAME, LAST_NAME, EMAIL)
VALUES (8989934, 'STUART', 'CAN', 'stuart.can@email.com');
INSERT INTO CUSTOMERS (ACCOUNT_NUMBER,FIRST_NAME, LAST_NAME, EMAIL)
VALUES (2340883, 'MIKEY', 'GREY', 'mikey.grey@email.com');
INSERT INTO CUSTOMERS (ACCOUNT_NUMBER,FIRST_NAME, LAST_NAME, EMAIL)
VALUES (1340884, 'SUNNY', 'CHAPMAN', 'sunny.chapman@');
INSERT INTO CUSTOMERS (ACCOUNT_NUMBER,FIRST_NAME, LAST_NAME, EMAIL)
VALUES (1340854, 'RANI', 'SUART', 'rani.stuart@r*');
Next, I create a User-Defined Data Metric Function to evaluate data quality. Its TABLE input argument (ARG_T, with a single STRING column ARG_C1) means the same function can be associated with any table column of a matching type.
-- Create a custom data metric function in the GOVERNANCE database to
-- track invalid email addresses
use role dq_monitor_role;
use database governance ;
use schema dmfs;
show functions;
CREATE DATA METRIC FUNCTION IF NOT EXISTS
invalid_email_count (ARG_T table(ARG_C1 STRING))
returns number as
'SELECT COUNT_IF(FALSE = (ARG_C1 regexp ''^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'')) FROM ARG_T'
;
-- Grant permission to use the metric function to the LOYALTY_ROLE
GRANT USAGE ON FUNCTION governance.dmfs.invalid_email_count(TABLE(STRING))
to role loyalty_role;
Next, I set up a data metric schedule, e.g., every 5 minutes for demonstration purposes. Using the LOYALTY_ROLE, I can perform data quality tests on demand. To enable automatic metric evaluation, I need to associate the data metric function with the EMAIL column of the CUSTOMERS table.
use role loyalty_role;
use database loyalty_db;
use schema loyalty;
use warehouse compute_wh;
-- Set the metric schedule on the table. A valid schedule must be
-- set up first.
alter table customers set data_metric_schedule = '5 minute';
-- Perform ad-hoc tests.
select /* show 1 */ governance.dmfs.invalid_email_count
(select email from customers);
-- Associate metric function with the column
alter table customers
add data metric function governance.dmfs.invalid_email_count
on (email);
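Since the DMF can be evaluated on demand, a data engineer could gate an ingestion pipeline on the result, as noted earlier. Below is a minimal sketch using an anonymous Snowflake Scripting block; the zero-tolerance threshold and the abort message are illustrative assumptions, not part of this setup. (In the classic console, wrap the block in EXECUTE IMMEDIATE $$ ... $$.)
declare
  bad_rows number;
begin
  -- Ad-hoc evaluation of the custom DMF created above
  select governance.dmfs.invalid_email_count(select email from loyalty_db.loyalty.customers)
    into :bad_rows;
  -- Illustrative zero-tolerance threshold
  if (bad_rows > 0) then
    return 'ABORT PIPELINE: ' || bad_rows || ' invalid emails detected';
  end if;
  return 'OK to proceed';
end;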
To find out more details about the metric function and the associated table, I can run the following queries under the LOYALTY_ROLE, which has access to the CUSTOMERS table.
-- As the loyalty_role, find out more details about the invalid_email_count
-- data metric function.
use role loyalty_role;
use database governance;
use schema information_schema;
select *
from table(information_schema.data_metric_function_references(
metric_name => 'governance.dmfs.invalid_email_count'));
-- As the loyalty_role, find out more details about the customers
-- table and its associated metric functions.
select *
from table(loyalty_db.information_schema.data_metric_function_references
(ref_entity_name => 'loyalty_db.loyalty.customers', ref_entity_domain => 'TABLE'));
Under the DQ_MONITOR_ROLE, I can run the following query to determine how many invalid emails there are in the CUSTOMERS table.
select measurement_time,
       table_database,
       table_schema,
       table_name,
       metric_database,
       metric_schema,
       metric_name,
       value
from snowflake.local.data_quality_monitoring_results
order by measurement_time desc limit 10;
Next, I can confirm the two invalid email addresses.
use role loyalty_role;
select email
from loyalty_db.loyalty.customers
where not
REGEXP_LIKE(email, '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$');
To find out how many credits have been used for the metric function evaluations, I can run the following query.
use role dq_monitor_role;
select start_time, database_name, schema_name, table_name, credits_used
from snowflake.account_usage.data_quality_monitoring_usage_history
-- note, account usage view has data latency, expect 1 - 2 hour delay
where true
and start_time >= current_timestamp - interval '3 days'
limit 100
;
Using System Data Metric Functions
I can further extend my data quality checks to leverage the out-of-the-box System Data Metric Functions.
-- Using System Data Metric Functions
use role loyalty_role;
use schema loyalty_db.loyalty;
-- Associate the system metric functions with the email column
alter table customers
add data metric function snowflake.core.null_count on (email);
alter table customers
add data metric function snowflake.core.unique_count on (email);
-- I can now use the System Data Metric Functions against my CUSTOMERS table
-- to validate the data quality on demand.
select snowflake.core.null_count(select email from customers);
select snowflake.core.unique_count(select email from customers);
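NULL_COUNT and UNIQUE_COUNT are only two of the system metrics. The duplicate and row counts mentioned earlier live in SNOWFLAKE.CORE as well and can be evaluated ad hoc in the same way; FRESHNESS requires a timestamp column, which this CUSTOMERS table does not have. A quick sketch (verify the signatures against the documentation for your account):
select snowflake.core.duplicate_count(select email from customers);
select snowflake.core.row_count(select * from customers);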
I run the following query to get more details on the System Data Metric Functions.
use role loyalty_role;
use database loyalty_db;
use schema loyalty;
select *
from table(information_schema.data_metric_function_references
(ref_entity_name => 'customers', ref_entity_domain => 'TABLE'));
Finally, I can verify the results of the metric functions. I can see there are no NULL emails, five unique email values, and two invalid email addresses.
-- Wait 5 - 10 minutes and query the data metric evaluation history
use role dq_monitor_role;
use warehouse compute_wh;
select measurement_time,
       table_database,
       table_schema,
       table_name,
       metric_database,
       metric_schema,
       metric_name,
       value
from snowflake.local.data_quality_monitoring_results
where table_name = 'CUSTOMERS'
order by measurement_time desc limit 10;
I can run the following commands to unset the metric schedule.
use role loyalty_role;
use database loyalty_db;
use schema loyalty;
alter table customers unset data_metric_schedule;
show parameters like 'DATA_METRIC_SCHEDULE'
in table loyalty_db.loyalty.customers;
Next, I disassociate the Data Metric Functions from the EMAIL column of my CUSTOMERS table.
alter table customers
drop data metric function governance.dmfs.invalid_email_count
on (email);
alter table customers
drop data metric function snowflake.core.null_count
on (email);
alter table customers
drop data metric function snowflake.core.unique_count
on (email);
Streamlit Dashboards
I can also set up a Streamlit dashboard to track the data quality of my customer emails over time.
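Such a dashboard only needs a time-series query over the monitoring results view. Here is a minimal sketch of the kind of query a Streamlit chart could plot; the hourly bucketing is an assumption for illustration.
-- Invalid email counts (and other metric values) over time for charting
select date_trunc('hour', measurement_time) as measurement_hour,
       metric_name,
       max(value) as metric_value
from snowflake.local.data_quality_monitoring_results
where table_name = 'CUSTOMERS'
group by measurement_hour, metric_name
order by measurement_hour;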
Snowflake Alerts
In addition, I can leverage the Snowflake Alerts feature to notify me when poor-quality data exceeds a certain threshold. See more details on how to set up alerts at https://docs.snowflake.com/en/guides-overview-alerts
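Here is a minimal sketch of such an alert. The notification integration name and recipient address are placeholders, the schedule and threshold are illustrative, and additional privileges (e.g. EXECUTE ALERT) are needed to create and run alerts.
create or replace alert governance.dmfs.invalid_email_alert
  warehouse = compute_wh
  schedule = '60 MINUTE'
  if (exists (
    select 1
    from snowflake.local.data_quality_monitoring_results
    where table_name = 'CUSTOMERS'
      and metric_name = 'INVALID_EMAIL_COUNT'
      and value > 0
      and measurement_time > snowflake.alert.last_successful_scheduled_time()
  ))
  then call system$send_email(
    'my_email_int',           -- assumed notification integration
    'dq.team@example.com',    -- assumed recipient
    'Data quality alert: invalid customer emails',
    'invalid_email_count returned a non-zero value for LOYALTY_DB.LOYALTY.CUSTOMERS.');
-- Alerts are created suspended; resume to activate
alter alert governance.dmfs.invalid_email_alert resume;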
Useful References on Data Metric Functions
https://docs.snowflake.com/en/user-guide/data-quality-intro
https://docs.snowflake.com/en/user-guide/data-quality-working
Wrapping Up
I barely scratched the surface with this test run of Data Metric Functions. A reusable, consistent library of metric functions (system and custom), serverless metric evaluations, zero data movement out of Snowflake, on-demand expression evaluations, a centralised event table in the local account, and the ability to trigger metric evaluations on change are some of the key capabilities that stood out for me.
The image included in this post was captured during one of my winter hikes in the South Island, New Zealand.
Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.
Update (2nd August): changed status from Public Preview to GA.