Snowflake Data Quality Metrics/Functions
In the world of data-driven decision-making, ensuring the accuracy and reliability of your information is crucial. Snowflake Data Metric Functions (DMFs) and Data Quality Metrics (DQMs) provide a robust mechanism to automate the monitoring of data quality, freeing you from the tedious task of manual checks. DMFs are user-defined functions that evaluate specific aspects of your data, like completeness, validity, uniqueness, and freshness.
Snowflake offers two types of DMFs:
- System DMFs: Built-in functions for common data quality checks. For example, NULL_COUNT to detect missing values, UNIQUE_COUNT to measure cardinality (the number of unique values), DUPLICATE_COUNT to identify duplicates, and FRESHNESS to assess how recent your data is.
- Custom DMFs: Allow you to define tailored validation rules to meet your specific business requirements. You can create DMFs to check email formats, validate data ranges, or enforce complex data integrity constraints.
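To build intuition for what the system DMFs measure, here is a small Python sketch of the same three counts. The sample data is invented, and Snowflake's exact NULL handling for DUPLICATE_COUNT may differ; this is only an analogue of the logic, not the implementation.

```python
from collections import Counter

# Hypothetical sample column: emails with one NULL and one duplicate value
emails = ["a@x.com", "b@x.com", None, "a@x.com"]

# NULL_COUNT analogue: number of missing values
null_count = sum(1 for e in emails if e is None)

# UNIQUE_COUNT analogue: number of distinct non-null values
unique_count = len({e for e in emails if e is not None})

# DUPLICATE_COUNT analogue: distinct values that appear more than once
counts = Counter(e for e in emails if e is not None)
duplicate_count = sum(1 for c in counts.values() if c > 1)

print(null_count, unique_count, duplicate_count)  # 1 2 1
```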
A Practical Example: Sales Data Quality
Let’s explore how to use DMFs in a practical scenario.
⛁ Sample Data Model: salesdb-data-model
Set up the SalesDB sample data model and proceed with the rest of the exercises. This model provides a robust foundation for tracking customers, buyers, clients, and sales opportunities. To demonstrate the power of DMFs, we’ll insert both valid and invalid records into our tables.
Data Quality Issues Introduced
Customer Table:
- Missing FirstName
- Missing HomeLocation
- Missing ZipCode
- Duplicate Email
- Invalid Email
- Stale LoadDate
Opportunities Table:
- Missing CustomerID
- Missing ExpectedCloseDate
- Missing Amount
- Negative Amount
- Duplicate Opportunity (same OpportunityID)
- Invalid SalesStage
- Stale LoadDate (for a “Closed Won” opportunity)
Explanation: The Customer table stores information about each customer: ID, name, email address, location, zip code, and a LoadDate timestamp. LoadDate defaults to the current timestamp when a row is inserted, which makes it useful for tracking the freshness of the data.
Role-Based Access Control (RBAC) for DMF Management
Data quality is a shared responsibility, but different roles within your organization might have varying levels of control over DMFs. Snowflake’s RBAC model allows you to fine-tune these permissions.
- Account Administrators: Typically have full control over creating, modifying, and deleting DMFs. They also manage the schedules for DMF execution.
- Data Engineers/Analysts: Might be granted privileges to view DMF results and potentially create or modify DMFs within their scope of responsibility.
- Sales Representatives: Might have read-only access to DMF results to understand data quality issues affecting their territories.
Let’s start by setting up the necessary permissions for a data_engineer role to manage DMFs. You can use GRANT and REVOKE statements to control access to DMFs:
-- Setting up Data Metric Functions - RBAC
USE ROLE ACCOUNTADMIN;
ALTER SESSION SET TIMEZONE = 'America/Chicago'; --Optional
GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE data_engineer;
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE data_engineer;
GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE data_engineer;
These grants allow the data_engineer role to execute DMFs and view monitoring results; creating DMFs in a schema additionally requires the appropriate CREATE privilege on that schema.
Testing System DMFs
The statements below use Snowflake’s built-in Data Metric Functions (DMFs) to directly assess the quality of the data in the Customer table. These DMFs count the number of null values, unique values, and duplicate values in the specified columns (firstname and email in this case).
-- Test System DMFs (Direct Execution)
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT firstname FROM customer);
SELECT SNOWFLAKE.CORE.UNIQUE_COUNT(SELECT firstname FROM customer);
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT email FROM customer);
SELECT SNOWFLAKE.CORE.FRESHNESS(SELECT loaddate FROM customer) / 60 AS minutes;
The last statement demonstrates the use of the FRESHNESS system DMF, which measures the age of data. It calculates the time difference in minutes between the current time and the latest timestamp in the LoadDate column. This information can be used to determine the freshness of the data in the table.
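The freshness arithmetic can be sketched in plain Python: take the newest timestamp, subtract it from "now" (Snowflake uses the scheduled run time), and convert seconds to minutes. The timestamps below are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical LoadDate values; the most recent row is 90 minutes old
now = datetime.now(timezone.utc)
load_dates = [now - timedelta(hours=3), now - timedelta(minutes=90)]

# FRESHNESS analogue: age in seconds of the newest timestamp, converted to minutes
age_minutes = (now - max(load_dates)).total_seconds() / 60
print(round(age_minutes))  # 90
```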
Creating Custom DMFs
In addition to the out-of-the-box system DMFs, custom DMFs let you define validation rules tailored to your specific business requirements, such as email-format checks, range validation, or complex data integrity constraints.
Now, let’s create several custom DMFs to demonstrate their versatility:
-- Custom DMF for Data Freshness (in hours)
CREATE OR REPLACE DATA METRIC FUNCTION data_freshness_hour(ARG_T TABLE (ARG_C TIMESTAMP_LTZ))
RETURNS NUMBER AS
'SELECT TIMEDIFF(hour, MAX(ARG_C), SNOWFLAKE.CORE.DATA_METRIC_SCHEDULED_TIME()) FROM ARG_T';
-- Custom DMF to Count Invalid Email Addresses
CREATE OR REPLACE DATA METRIC FUNCTION invalid_email_count(arg_t TABLE(arg_c1 VARCHAR))
RETURNS NUMBER
AS
$$
SELECT COUNT(*) AS invalid_email_count
FROM arg_t
WHERE NOT REGEXP_LIKE(arg_c1, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
$$;
-- Custom DMF to Count Invalid Sales Stages
CREATE OR REPLACE DATA METRIC FUNCTION invalid_stage_count(arg_t TABLE(arg_c1 VARCHAR))
RETURNS NUMBER
AS
$$
SELECT
CASE
WHEN COUNT(*) = 0 THEN NULL -- No rows in the table
ELSE
SUM(
CASE
WHEN arg_c1 IN ('Prospecting', 'Qualification', 'Proposal', 'Negotiation', 'Closed Won', 'Closed Lost') THEN 0
ELSE 1
END
)
END
FROM arg_t
$$;
-- Custom DMF to Count Composite Duplicates (e.g., First Name and Last Name)
CREATE OR REPLACE DATA METRIC FUNCTION COMPOSITE_DUPLICATE_COUNT(arg_t TABLE(arg_c1 VARCHAR, arg_c2 VARCHAR))
RETURNS NUMBER
AS
$$
SELECT COUNT(*)
FROM (
    SELECT arg_c1, arg_c2, COUNT(*) AS cnt
    FROM arg_t
    GROUP BY ALL
    HAVING COUNT(*) > 1
)
$$;
The code above creates the custom DMFs. Each one takes a table argument and returns a numeric measure:
- invalid_email_count: The number of invalid email addresses in the email column.
- invalid_stage_count: The number of rows with invalid values in the salesstage column.
- COMPOSITE_DUPLICATE_COUNT: The count of duplicate rows where both firstname and lastname are the same.
- data_freshness_hour: The freshness of the data, measured as the time difference in hours since the most recent load.
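The composite-duplicate and stage-validation logic above can be sketched in a few lines of Python (the name pairs and stage values below are invented samples, not data from the SalesDB model):

```python
from collections import Counter

# Hypothetical (firstname, lastname) rows with one duplicated pair
names = [("Ada", "Lovelace"), ("Alan", "Turing"), ("Ada", "Lovelace")]

# COMPOSITE_DUPLICATE_COUNT analogue: distinct name pairs seen more than once
composite_dupes = sum(1 for c in Counter(names).values() if c > 1)

# invalid_stage_count analogue: rows whose stage is outside the allowed set
VALID_STAGES = {"Prospecting", "Qualification", "Proposal",
                "Negotiation", "Closed Won", "Closed Lost"}
stages = ["Prospecting", "Closed Won", "Dreaming"]  # "Dreaming" is invalid
invalid_stages = sum(1 for s in stages if s not in VALID_STAGES)

print(composite_dupes, invalid_stages)  # 1 1
```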
Testing Custom DMFs
You can test these custom DMFs by executing queries like:
-- Test DMFs
SELECT invalid_email_count(SELECT email FROM customer);
SELECT invalid_stage_count(SELECT salesstage FROM opportunities);
SELECT COMPOSITE_DUPLICATE_COUNT(SELECT firstname, lastname FROM customer);
SELECT data_freshness_hour(SELECT loaddate FROM customer);
Explanation: Take the user-defined DMF invalid_email_count as an example. It accepts a table argument (here, the email column, bound to arg_c1) and counts how many values fail to match a regular-expression pattern for valid email formats.
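The same regular expression can be exercised locally in Python, which is a quick way to sanity-check the pattern before deploying the DMF (the sample addresses are invented):

```python
import re

# Same pattern as the invalid_email_count DMF above
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

emails = ["good@example.com", "bad@@example", "also.good@mail.co"]

# Count addresses that do not match the pattern
invalid_count = sum(1 for e in emails if not EMAIL_RE.match(e))
print(invalid_count)  # 1
```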
Scheduling DMFs
To automate data quality checks, you can schedule DMFs to run periodically or when changes occur in the table. You can use cron expressions or trigger-based scheduling for flexibility.
- Before attaching a DMF to a table or view, ensure that the object has a schedule defined; Snowflake requires the schedule to be set on the object before a DMF is attached. Here are the various ways to set the schedule:
-- Example: run every 5 minutes
ALTER TABLE CUSTOMER SET DATA_METRIC_SCHEDULE = '5 MINUTE';
-- Example using cron syntax: set the schedule first, then associate the data metric function.
ALTER TABLE CUSTOMER
SET DATA_METRIC_SCHEDULE = 'USING CRON 0 12 * * MON,TUE,WED,THU,FRI EST';
-- Example: run whenever the table's data changes.
ALTER TABLE CUSTOMER
SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
Attaching DMFs to Tables and Views
You can associate system and custom DMFs with tables using the ALTER TABLE statement. This allows Snowflake to automatically run these checks based on your schedule:
-- Add/Attach system DMFs to tables (these tables have schedules)
ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (email);
ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (email);
ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.UNIQUE_COUNT ON (email);
ALTER TABLE CUSTOMER
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.FRESHNESS ON (loaddate);
Visualizing Data Quality Metrics
Snowflake offers multiple avenues to visualize the results of your DMFs:
- Data Quality Monitoring UI: Snowflake’s built-in interface provides an overview of DMF execution history, results, and trends. This is a convenient way to quickly assess the health of your data.
- Snowsight Dashboards: Leverage Snowsight, Snowflake’s data exploration tool, to create custom dashboards that visually represent DMF results using charts, graphs, and other visualizations.
- Programmatic Access: You can directly query the DATA_QUALITY_MONITORING_RESULTS view using SQL to extract data for external reporting tools or custom applications.
By combining these visualization techniques, you can create a comprehensive data quality dashboard that keeps stakeholders informed and empowered to take action.
Visualizing Overall Data Quality Summary
Let’s craft some SQL queries you can use within your Streamlit app to fetch and visualize the data quality metrics generated by your Snowflake DMFs.
The queries below read the monitoring results view: the first lists raw measurements, and the second aggregates results by table and metric to provide a high-level overview of data quality issues:
-- Validate Monitoring Results
SELECT *
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
ORDER BY MEASUREMENT_TIME DESC;
-- Validate Specific Database and Schema
SELECT
table_name,
metric_name,
COUNT(*) AS num_measurements,
SUM(CASE WHEN value > 0 THEN 1 ELSE 0 END) AS num_failures
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE
TABLE_DATABASE = 'SALESDB' AND
TABLE_SCHEMA = 'CUSTS'
GROUP BY table_name, metric_name;
Query 2: Detailed View of Failed Records (per Table)
This query drills down into specific tables and DMF checks:
SELECT *
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE
TABLE_DATABASE = 'SALESDB' AND
table_schema = 'CUSTS' AND
table_name = 'CUSTOMER' AND -- Replace with the desired table
metric_name = 'INVALID_EMAIL_COUNT';
Query 3: Data Quality Trends Over Time
This query aggregates DMF results by date, allowing you to visualize trends and identify patterns:
SELECT
DATE(measurement_time) AS execution_date,
table_name,
metric_name,
SUM(value) AS measure_counts
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE
table_database = 'SALESDB' AND
table_schema = 'CUSTS'
GROUP BY execution_date, table_name, metric_name
ORDER BY execution_date;
Streamlit Implementation
Here’s a conceptual outline of how you might integrate these queries into a Streamlit app:
# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session
import pandas as pd
import plotly.express as px

# Initialize Snowflake session
session = get_active_session()

# Function to fetch data from Snowflake using the provided query
def fetch_data(session):
    query = """
        SELECT
            DATE(measurement_time) AS execution_date,
            table_name,
            metric_name,
            SUM(value) AS measure_counts
        FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
        WHERE
            table_database = 'SALESDB' AND
            table_schema = 'CUSTS'
        GROUP BY execution_date, table_name, metric_name
        ORDER BY execution_date
    """
    return session.sql(query).to_pandas()

# Streamlit app
st.title("Snowflake Data Quality Metric Visualization")

with st.sidebar:
    st.subheader("Filter Options")
    # Table names come back uppercase in Snowflake query results
    tables = st.multiselect("Select Tables", ["CUSTOMER", "OPPORTUNITIES"],
                            default=["CUSTOMER", "OPPORTUNITIES"])

# Fetch data from Snowflake
try:
    df = fetch_data(session)
    st.write("Data fetched successfully.")
except Exception as e:
    st.error(f"Error fetching data: {e}")
    df = pd.DataFrame()  # Ensure df is defined

# Filter data based on selected tables
if not df.empty:
    if "TABLE_NAME" in df.columns:
        filtered_df = df[df["TABLE_NAME"].isin(tables)]

        # Create line chart for overall trend
        fig_trend = px.line(filtered_df, x="EXECUTION_DATE", y="MEASURE_COUNTS",
                            color="TABLE_NAME", title="Data Quality Metric Trend")
        st.plotly_chart(fig_trend)

        # Create bar chart for breakdown by metric
        fig_breakdown = px.bar(filtered_df, x="METRIC_NAME", y="MEASURE_COUNTS",
                               color="TABLE_NAME", barmode="group",
                               title="Data Quality Breakdown by Metric")
        st.plotly_chart(fig_breakdown)
    else:
        st.error("The column 'TABLE_NAME' is not present in the data.")
        st.write("DataFrame columns:", df.columns)
else:
    st.write("No data available for the selected tables.")
Monitoring DMF Credit Usage
While Snowflake DMFs are a valuable tool for maintaining data quality, it’s important to be mindful of their resource consumption, as DMF execution consumes Snowflake credits. Fortunately, Snowflake provides visibility into the credits used by DMFs through the DATA_QUALITY_MONITORING_USAGE_HISTORY view in the ACCOUNT_USAGE schema.
Here’s a query to gain insights into your DMF credit usage:
-- Visibility into credits used by DMF
SELECT
DATEDIFF(second, start_time, end_time) AS duration_seconds,
*
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY
WHERE START_TIME >= CURRENT_TIMESTAMP - INTERVAL '3 days'
LIMIT 100;
By actively monitoring your DMF credit usage, you can ensure that your data quality efforts remain cost-effective while providing the insights you need to make informed decisions.
Conclusion
By proactively monitoring and addressing data quality issues with Snowflake DMFs, you build a foundation of trust in your sales data. This enables you to make informed decisions, drive revenue growth, and build a successful sales organization.
Resources
- Introduction to Data Quality and data metric functions | Snowflake Documentation
- System data metric functions | Snowflake Documentation
- Working with data metric functions | Snowflake Documentation
- Tutorial: Getting started with data metric functions | Snowflake Documentation
- ALTER TABLE | Snowflake Documentation