Data Quality Framework in Snowflake

Divya Rajesh Surana
5 min read · Feb 9, 2023


Source: https://cdi.xoriant.com/why-does-one-need-clean-correct-and-quality-data/

Introduction:

Data quality is a critical aspect of data management and analysis, as it directly affects the accuracy and reliability of the insights derived from the data. In today’s fast-paced business environment, organizations must make decisions quickly, and accurate data is crucial for making informed choices.

To achieve high-quality data, it is important to implement robust data quality checks that ensure the data is accurate, complete, consistent, and free from errors. This can be especially challenging with large datasets, as manual checks are time-consuming and error-prone.

We can perform data quality checks in Snowflake as well. Snowflake is a cloud-based data warehousing platform that provides a flexible and efficient platform for managing and analyzing your data. With its powerful set of built-in functions and support for stored procedures, you can easily create custom data quality checks to ensure that your data is of the highest quality.

In this blog, we will explore a custom stored procedure in Snowflake that performs data quality checks on a table and stores the results in a metrics table. The stored procedure takes the database, schema, and table names as parameters and uses the information schema to fetch all the columns of the table. It then loops over each column, performing a set of data quality checks, and stores the results in a table called metrics, allowing you to easily monitor the quality of your data over time and to identify and address data quality issues. The procedure performs the below list of checks:

  1. Total Count: Total number of rows present in a table.
  2. Not Null Count: The number of rows in the column with a non-null value.
  3. Null Count: The number of rows in the column with a null value.
  4. Blank Count: The number of rows in the column with a blank value.
  5. Distinct Values Count: The number of distinct values in the column.
  6. Max Value: The maximum value in the column.
  7. Min Value: The minimum value in the column.
  8. Max Length: The maximum length of the values in the column.
  9. Min Length: The minimum length of the values in the column.
  10. Numeric Values Count: The number of rows in the column with a numeric value.
  11. Alphabetic Values Count: The number of rows in the column with an alphabetic value.
  12. Alphanumeric Values Count: The number of rows in the column with an alphanumeric value.
  13. Special Characters Values Count: The number of rows in the column which contains special characters.
  14. Top 10 Distinct Values: The top 10 most common distinct values in the column.
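Each of these checks is just an ordinary SQL query. As a rough sketch, here is what three of them look like when written by hand against a hypothetical CUSTOMERS table with an EMAIL column (both names are placeholders) — the stored procedure below simply generates and runs queries like these for every column automatically:

```sql
-- Null count: rows where the column has no value
SELECT COUNT(*) AS null_count
FROM CUSTOMERS
WHERE EMAIL IS NULL;

-- Blank count: rows where the column is an empty string after trimming
SELECT COUNT(*) AS blank_count
FROM CUSTOMERS
WHERE TRIM(TO_VARCHAR(EMAIL)) = '';

-- Numeric-only count: rows whose trimmed value is all digits
SELECT COUNT(*) AS numeric_only_count
FROM CUSTOMERS
WHERE RLIKE(TRIM(EMAIL), '^[0-9]+$');
```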

Steps to create a framework:

Step 1: Create a table named “metrics”.

CREATE OR REPLACE TABLE STG_DEV.PUBLIC.METRICS (
    TABLE_NAME                     VARCHAR,
    COLUMN_NAME                    VARCHAR,
    TOTAL_COUNT                    INTEGER,
    NOT_NULL_COUNT                 INTEGER,
    NULL_COUNT                     INTEGER,
    BLANK_COUNT                    INTEGER,
    DISTINCT_VALUES_COUNT          INTEGER,
    MAX_LENGTH                     VARCHAR,
    MIN_LENGTH                     VARCHAR,
    MAX_VALUE                      VARCHAR,
    MIN_VALUE                      VARCHAR,
    NUMERIC_ONLY_VALUES_COUNT      INTEGER,
    ALPHABETS_ONLY_VALUES_COUNT    INTEGER,
    ALPHANUMERIC_ONLY_VALUES_COUNT INTEGER,
    CONTAINS_SPECIAL_CHAR_COUNT    INTEGER,
    TOP_TEN_DISTINCT_VALUES        VARCHAR
);

Step 2: Create a Stored Procedure

CREATE OR REPLACE PROCEDURE STG_DEV.PUBLIC.DATA_QUALITY(databaseName VARCHAR, schemaName VARCHAR, tableName VARCHAR)
RETURNS VARCHAR(16777216)
LANGUAGE SQL
EXECUTE AS CALLER
AS
$$
BEGIN
    BEGIN TRANSACTION;

    LET fully_qual_table_name VARCHAR := databaseName || '.' || schemaName || '.' || tableName;
    LET info_schema_columns_view VARCHAR := databaseName || '.INFORMATION_SCHEMA.COLUMNS';

    -- Fetch every column of the target table from the information schema
    LET res RESULTSET := (SELECT COLUMN_NAME
                          FROM IDENTIFIER(:info_schema_columns_view)
                          WHERE TABLE_CATALOG = :databaseName
                            AND TABLE_SCHEMA = :schemaName
                            AND TABLE_NAME = :tableName);
    LET c1 CURSOR FOR res;

    -- Total row count is the same for every column, so compute it once
    LET TOTAL_CNT NUMBER;
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name;
    SELECT $1 INTO :TOTAL_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

    FOR i IN c1 DO
        -- Not-null count
        LET CNT NUMBER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE ' || i.COLUMN_NAME || ' IS NOT NULL';
        SELECT $1 INTO :CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Null count
        LET NULL_CNT NUMBER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE ' || i.COLUMN_NAME || ' IS NULL';
        SELECT $1 INTO :NULL_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Blank (empty-string) count
        LET BLANK_CNT NUMBER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE TRIM(TO_VARCHAR(' || i.COLUMN_NAME || ')) = ''''';
        SELECT $1 INTO :BLANK_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Distinct values count
        LET DISTINCT_VALUES_CNT NUMBER;
        EXECUTE IMMEDIATE 'SELECT COUNT(DISTINCT ' || i.COLUMN_NAME || ') FROM ' || :fully_qual_table_name;
        SELECT $1 INTO :DISTINCT_VALUES_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Maximum and minimum length
        LET MAX_LENGTH VARCHAR;
        EXECUTE IMMEDIATE 'SELECT IFF(MAX(LENGTH(' || i.COLUMN_NAME || ')) IS NULL, ''NULL'', TO_VARCHAR(MAX(LENGTH(' || i.COLUMN_NAME || ')))) FROM ' || :fully_qual_table_name;
        SELECT $1 INTO :MAX_LENGTH FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        LET MIN_LENGTH VARCHAR;
        EXECUTE IMMEDIATE 'SELECT IFF(MIN(LENGTH(' || i.COLUMN_NAME || ')) IS NULL, ''NULL'', TO_VARCHAR(MIN(LENGTH(' || i.COLUMN_NAME || ')))) FROM ' || :fully_qual_table_name;
        SELECT $1 INTO :MIN_LENGTH FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Maximum and minimum value
        LET MAX_VALUE VARCHAR;
        EXECUTE IMMEDIATE 'SELECT IFF(MAX(' || i.COLUMN_NAME || ') IS NULL, ''NULL'', TO_VARCHAR(MAX(' || i.COLUMN_NAME || '))) FROM ' || :fully_qual_table_name;
        SELECT $1 INTO :MAX_VALUE FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        LET MIN_VALUE VARCHAR;
        EXECUTE IMMEDIATE 'SELECT IFF(MIN(' || i.COLUMN_NAME || ') IS NULL, ''NULL'', TO_VARCHAR(MIN(' || i.COLUMN_NAME || '))) FROM ' || :fully_qual_table_name;
        SELECT $1 INTO :MIN_VALUE FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Numeric-only values
        LET NUMERIC_ONLY_VALUES_CNT INTEGER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE RLIKE(TRIM(' || i.COLUMN_NAME || '), ''^[0-9]+$'')';
        SELECT $1 INTO :NUMERIC_ONLY_VALUES_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Alphabetic-only values
        LET ALPHABETS_ONLY_VALUES_CNT INTEGER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE RLIKE(TRIM(' || i.COLUMN_NAME || '), ''^[a-zA-Z ]+$'')';
        SELECT $1 INTO :ALPHABETS_ONLY_VALUES_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Alphanumeric values: only letters and digits, with at least one of each
        LET ALPHANUMERIC_ONLY_VALUES_CNT INTEGER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE RLIKE(TRIM(' || i.COLUMN_NAME || '), ''^[a-zA-Z0-9]+$'') AND RLIKE(TRIM(' || i.COLUMN_NAME || '), ''.*[0-9].*'') AND RLIKE(TRIM(' || i.COLUMN_NAME || '), ''.*[a-zA-Z].*'')';
        SELECT $1 INTO :ALPHANUMERIC_ONLY_VALUES_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Values containing special characters
        LET CONTAINS_SPECIAL_CHAR_CNT INTEGER;
        EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || :fully_qual_table_name || ' WHERE RLIKE(TRIM(' || i.COLUMN_NAME || '), ''.*[^A-Za-z0-9 ].*'')';
        SELECT $1 INTO :CONTAINS_SPECIAL_CHAR_CNT FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Top 10 distinct values, comma-separated
        LET TOP_TEN_DISTINCT_VALUES VARCHAR;
        EXECUTE IMMEDIATE 'SELECT ARRAY_TO_STRING(ARRAY_AGG(DISTINCT ' || i.COLUMN_NAME || ') WITHIN GROUP (ORDER BY ' || i.COLUMN_NAME || ' ASC), '','') FROM (SELECT TOP 10 ' || i.COLUMN_NAME || ' FROM (SELECT DISTINCT ' || i.COLUMN_NAME || ' FROM ' || :fully_qual_table_name || '))';
        SELECT $1 INTO :TOP_TEN_DISTINCT_VALUES FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

        -- Persist the metrics for this column
        EXECUTE IMMEDIATE 'INSERT INTO STG_DEV.PUBLIC.METRICS
            (TABLE_NAME, COLUMN_NAME, TOTAL_COUNT, NOT_NULL_COUNT, NULL_COUNT, BLANK_COUNT, DISTINCT_VALUES_COUNT, MAX_LENGTH, MIN_LENGTH, MAX_VALUE,
             MIN_VALUE, NUMERIC_ONLY_VALUES_COUNT, ALPHABETS_ONLY_VALUES_COUNT, ALPHANUMERIC_ONLY_VALUES_COUNT, CONTAINS_SPECIAL_CHAR_COUNT, TOP_TEN_DISTINCT_VALUES)
            VALUES (''' || :fully_qual_table_name || ''', ''' || i.COLUMN_NAME || ''', ' || :TOTAL_CNT || ', ' || :CNT || ', ' || :NULL_CNT || ', ' || :BLANK_CNT || ', ' || :DISTINCT_VALUES_CNT || ', ''' || :MAX_LENGTH || ''', ''' || :MIN_LENGTH || ''', ''' || :MAX_VALUE || ''', ''' || :MIN_VALUE || ''', ' || :NUMERIC_ONLY_VALUES_CNT || ', ' || :ALPHABETS_ONLY_VALUES_CNT || ', ' || :ALPHANUMERIC_ONLY_VALUES_CNT || ', ' || :CONTAINS_SPECIAL_CHAR_CNT || ', ''' || :TOP_TEN_DISTINCT_VALUES || ''')';
    END FOR;

    COMMIT;
    RETURN 'success';
EXCEPTION
    WHEN OTHER THEN
        ROLLBACK;
        RAISE;
END;
$$;

Step 3: Call SP for the table whose data quality needs to be measured

CALL STG_DEV.PUBLIC.DATA_QUALITY('EDW_DEV', 'EDW', 'MY_TABLE');

Now to see the data quality report for the table, run the below command:

SELECT * FROM STG_DEV.PUBLIC.METRICS WHERE TABLE_NAME = 'EDW_DEV.EDW.MY_TABLE';
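Because the results land in an ordinary table, you can also query them to surface problems directly. As one illustrative example (the 10% threshold is arbitrary and should be tuned to your own requirements), this finds columns with a high proportion of NULLs:

```sql
-- Flag columns of the profiled table where more than 10% of rows are NULL
SELECT TABLE_NAME,
       COLUMN_NAME,
       NULL_COUNT / NULLIF(TOTAL_COUNT, 0) AS null_ratio
FROM STG_DEV.PUBLIC.METRICS
WHERE TABLE_NAME = 'EDW_DEV.EDW.MY_TABLE'
  AND NULL_COUNT / NULLIF(TOTAL_COUNT, 0) > 0.1
ORDER BY null_ratio DESC;
```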

The stored procedure is also highly customizable, allowing organizations to add or modify checks as needed to meet their specific data quality requirements. With a strong data quality framework in place, organizations can improve their data-driven decision-making and achieve better business outcomes.

One of the key benefits of using a stored procedure for data quality checks is that it provides a centralized and repeatable process for validating your data. This can be especially useful when dealing with large datasets, as it eliminates the need to manually perform these checks each time you load new data into Snowflake. Additionally, the use of a stored procedure ensures that your data quality checks are consistent and reliable, reducing the risk of errors or inaccuracies in your data.

Implementing a stored procedure for data quality checks in Snowflake is straightforward and requires only a basic understanding of Snowflake SQL syntax and stored procedures. With a little bit of code, you can create a powerful and efficient tool for ensuring the quality of your data.
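To make the checks repeatable without manual intervention, one option is to wrap the procedure call in a Snowflake task. A minimal sketch, assuming a warehouse named MY_WH and a daily 6:00 UTC schedule (both placeholders you would replace with your own values):

```sql
-- Run the data quality procedure on a schedule via a Snowflake task
CREATE OR REPLACE TASK STG_DEV.PUBLIC.DATA_QUALITY_TASK
  WAREHOUSE = MY_WH                          -- placeholder warehouse name
  SCHEDULE = 'USING CRON 0 6 * * * UTC'      -- daily at 06:00 UTC
AS
  CALL STG_DEV.PUBLIC.DATA_QUALITY('EDW_DEV', 'EDW', 'MY_TABLE');

-- Tasks are created suspended; resume to start the schedule
ALTER TASK STG_DEV.PUBLIC.DATA_QUALITY_TASK RESUME;
```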

In conclusion, ensuring the quality of your data is essential for making informed decisions, and Snowflake provides an ideal platform for implementing custom data quality checks. By using a stored procedure for data quality checks, you can ensure that your data is accurate, complete, consistent, and free from errors and that your insights are based on high-quality data. So why not give it a try and take your data quality to the next level?
