Native & Custom Data Quality Monitoring in Snowflake
Ensuring data quality is crucial for any data-driven organization. Snowflake provides native capabilities to monitor and maintain high data quality standards within your data warehouse. Here’s how you can leverage Snowflake’s features to monitor data quality:
Snowflake offers system Data Metric Functions (DMFs) in the CORE schema of the shared SNOWFLAKE database. These DMFs are defined and maintained by Snowflake, ensuring their consistency and reliability; users cannot modify the names or behavior of these system DMFs.
Each system DMF is tailored to measure a specific data quality attribute, enabling comprehensive monitoring and management of data quality. You can assign multiple DMFs to a table or view to achieve a more thorough assessment of data quality, addressing various governance and compliance requirements.
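If your role has been granted the SNOWFLAKE.DATA_METRIC_USER database role, you can list the system DMFs available to you; as a starting point (the scope clause can be narrowed to a database or schema):
SHOW DATA METRIC FUNCTIONS IN ACCOUNT;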
Let's get hands-on.
Access control setup: Create the dq_tutorial_role role to use throughout the tutorial:
USE ROLE ACCOUNTADMIN;
CREATE ROLE IF NOT EXISTS dq_tutorial_role;
-- Grant privileges, and grant the application role and database roles to the dq_tutorial_role:
GRANT CREATE DATABASE ON ACCOUNT TO ROLE dq_tutorial_role;
GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE dq_tutorial_role;
GRANT APPLICATION ROLE SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER TO ROLE dq_tutorial_role;
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE dq_tutorial_role;
GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE dq_tutorial_role;
--Create a warehouse to query the table that contains the data and grant the USAGE privilege on the role to the dq_tutorial_role role:
CREATE WAREHOUSE IF NOT EXISTS dq_tutorial_wh;
GRANT USAGE ON WAREHOUSE dq_tutorial_wh TO ROLE dq_tutorial_role;
-- Confirm the grants to the dq_tutorial_role role:
SHOW GRANTS TO ROLE dq_tutorial_role;
-- Establish a role hierarchy and grant the role to a user who can complete this tutorial:
GRANT ROLE dq_tutorial_role TO ROLE SYSADMIN;
GRANT ROLE dq_tutorial_role TO USER jsmith;
Data setup: To facilitate managing the data and the DMFs for this tutorial, create a dedicated database and schema to contain these objects:
USE ROLE dq_tutorial_role;
CREATE DATABASE IF NOT EXISTS dq_tutorial_db;
CREATE SCHEMA IF NOT EXISTS dq_tutorial_db.sch;
Next, create a table to hold the sample employee data:
CREATE OR REPLACE TABLE Employees (
EmpID INT PRIMARY KEY,
Name VARCHAR(100),
Age INT,
Gender CHAR(1),
EmailID VARCHAR(100),
Location VARCHAR(100),
TIME_FIELD TIMESTAMP_TZ
);
-- Note the intentional data quality issues seeded below: duplicate EmpIDs (103 and 105),
-- a malformed email ('john.doe@'), a NULL Age, and an empty-string Gender.
INSERT INTO Employees (EmpID, Name, Age, Gender, EmailID, Location, TIME_FIELD) VALUES
(101, 'John Doe', 29, 'M', 'john.doe@', 'New York', '2024-05-30 08:00:00'),
(102, 'Jane Smith', 34, 'F', 'jane.smith@example.com', 'Los Angeles', '2024-05-30 09:00:00'),
(103, 'Sam Johnson', 41, 'M', 'sam.johnson@example.com', 'Chicago', '2024-05-30 10:00:00'),
(103, 'Emily Brown', 27, 'F', 'emily.brown@example.com', 'Houston', '2024-05-30 11:00:00'),
(105, 'Michael Green', NULL, 'M', 'michael.green@example.com', 'Phoenix', '2024-05-30 12:00:00'),
(105, 'Linda White', 30, 'F', 'linda.white@example.com', 'Philadelphia', '2024-05-30 13:00:00'),
(107, 'Robert Black', 45, 'M', 'robert.black@example.com', 'San Antonio', '2024-05-30 14:00:00'),
(108, 'Patricia Harris', 33, '', 'patricia.harris@example.com', 'San Diego', '2024-05-30 15:00:00'),
(109, 'David Clark', 50, 'M', 'david.clark@example.com', 'Dallas', '2024-05-30 16:00:00'),
(110, 'Mary Lewis', 28, 'F', 'mary.lewis@example.com', 'San Jose', '2024-05-30 17:00:00');
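Before attaching any DMFs, you can sanity-check the seeded issues with plain SQL; for example, surfacing the duplicated employee IDs:
-- Find EmpID values that occur more than once (expect 103 and 105)
SELECT EmpID, COUNT(*) AS occurrences
FROM Employees
GROUP BY EmpID
HAVING COUNT(*) > 1;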
Create and work with data metric functions: In the block below, alongside the native DMFs, I have created a custom DMF that validates whether email addresses are correctly formatted and counts the ones that are not. You can write any number of custom DMFs, much as you would UDFs.
-- Scheduling: run every DMF associated with this table on a 5-minute interval
ALTER TABLE Employees SET DATA_METRIC_SCHEDULE = '5 MINUTE';
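If a fixed time of day suits your pipeline better than an interval, the schedule also accepts a CRON expression; for example:
-- Alternative: run the associated DMFs daily at 08:00 UTC
ALTER TABLE Employees SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * UTC';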
-- DUPLICATE_COUNT
ALTER TABLE Employees ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (EmpID);
-- NULL_COUNT
ALTER TABLE Employees ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (Age);
-- FRESHNESS
ALTER TABLE Employees ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.FRESHNESS ON (TIME_FIELD);
-- ROW_COUNT (a table-level DMF, so its column list is empty)
ALTER TABLE Employees ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.ROW_COUNT ON ();
-- UNIQUE_COUNT
ALTER TABLE Employees ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.UNIQUE_COUNT ON (EmpID);
-- CUSTOM DATA METRIC FUNCTION
-- The argument names (arg_t, arg_email) are placeholders; they bind to the actual
-- table and column when the DMF is assigned with ALTER TABLE ... ADD DATA METRIC FUNCTION.
CREATE DATA METRIC FUNCTION IF NOT EXISTS
  invalid_email_count (arg_t TABLE(arg_email STRING))
  RETURNS NUMBER AS
  'SELECT COUNT_IF(FALSE = (arg_email REGEXP ''^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'')) FROM arg_t';
ALTER TABLE Employees ADD DATA METRIC FUNCTION invalid_email_count ON (EmailID);
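Like the system DMFs, a custom DMF can also be invoked directly for a quick ad hoc check; on the sample data, only 'john.doe@' should fail the pattern:
SELECT invalid_email_count(SELECT EmailID FROM Employees) AS invalid_emails;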
Review your data metrics
To list all of the data metric functions currently associated with the table:
SELECT * FROM TABLE(INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
  REF_ENTITY_NAME => 'dq_tutorial_db.sch.employees',
  REF_ENTITY_DOMAIN => 'TABLE'));
1. Custom DMF for email validation: check the latest scheduled measurement in the monitoring results view:
SELECT scheduled_time, measurement_time, table_name, metric_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE table_name = 'EMPLOYEES' ORDER BY measurement_time DESC LIMIT 1;
2. DUPLICATE_COUNT on EmpID (system DMFs can also be called ad hoc, outside the schedule):
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT EmpID FROM Employees) AS no_of_dupe_empids;
3. NULL_COUNT on Age:
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT Age FROM Employees) AS no_of_null_ages;
4. UNIQUE_COUNT on EmpID:
SELECT SNOWFLAKE.CORE.UNIQUE_COUNT(SELECT EmpID FROM Employees) AS no_of_unique_empids;
Clean up or change the DMF schedule:
-- Remove the schedule entirely (the associated DMFs stop running):
ALTER TABLE Employees UNSET DATA_METRIC_SCHEDULE;
-- Or run the DMFs whenever the table's content changes, instead of on a fixed interval:
ALTER TABLE Employees SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
Conclusion:
After setting up and scheduling your data metric functions, you’ll likely aim to automate the monitoring process rather than manually checking data quality results regularly. Typically, you’ll establish thresholds for the outputs of these functions and trigger alerts if they’re exceeded.
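As a sketch of that pattern (the notification integration name my_email_int and the recipient address are placeholders you would replace with your own), a Snowflake alert can poll the monitoring results view and send an email when a threshold is breached:
-- Assumes an existing email notification integration named my_email_int (placeholder)
CREATE OR REPLACE ALERT dq_tutorial_db.sch.empid_dupe_alert
  WAREHOUSE = dq_tutorial_wh
  SCHEDULE = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE table_name = 'EMPLOYEES'
      AND metric_name = 'DUPLICATE_COUNT'
      AND value > 0
      AND measurement_time > DATEADD(hour, -1, CURRENT_TIMESTAMP())))
  THEN CALL SYSTEM$SEND_EMAIL(
    'my_email_int',
    'dq-team@example.com',
    'Data quality alert: duplicate EmpIDs',
    'DUPLICATE_COUNT on EMPLOYEES exceeded 0 in the last hour.');
-- Alerts are created suspended, so resume this one to start the checks
ALTER ALERT dq_tutorial_db.sch.empid_dupe_alert RESUME;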