Performing Data Quality Checks in Snowflake — Part 2

Find out how SnowDQ uses Snowflake’s built-in stored procedures and JavaScript to ensure data quality across tables

Venkat Sekar
Hashmap, an NTT DATA Company
4 min read · Jun 22, 2021


Overview

Part 1 of this series provided a brief overview of a “continuous” data quality and data monitoring approach, along with various aspects to consider when evaluating tools or products to fulfill this need for Snowflake.

Data quality checks should not be postponed until some indefinite point in the future. A tactical solution to this problem is SnowDQ, a simple yet smart approach to performing data quality checks on data hosted in Snowflake. The code is fully open-sourced and can be found on GitLab.

Design Principles

Before we get into implementation details, here are some of the reasons why SnowDQ was developed and is being shared.

Tools such as SODA, referenced in Part 1, are developed to be adopted across multiple platforms, e.g., Snowflake, Spark, BigQuery, Synapse, Redshift, etc. Since the underlying code/implementation has to be written in a generic way, the adoption boundary is enterprise-wide.

SnowDQ was developed solely for Snowflake. To be specific, it is meant to be used by developers, analysts, and support teams managing a Snowflake environment, and, most importantly, it is designed to be simple and Snowflake-native.

Operating Environment and Language Boundaries

SnowDQ builds entirely on what Snowflake itself has to offer. Since Snowflake’s Data Platform provides a scalable compute environment, there is no need to maintain, host, or administer a separate set of resources such as Docker containers, VMs, etc.

The verification/expectation functions are native Snowflake stored procedures written in JavaScript, so teams do not have to learn anything new. Custom verifications, however, will need to be implemented as stored procedures.
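To give a feel for what that involves, here is a minimal sketch of a custom check written as a native JavaScript stored procedure. The procedure name, arguments, and result shape are illustrative assumptions for this article, not SnowDQ’s actual interface; see the repo for its real conventions.

-- Hypothetical sketch of a custom verification as a Snowflake JavaScript
-- stored procedure; names and result shape are illustrative only.
CREATE OR REPLACE PROCEDURE VERIFY_ROW_COUNT_ABOVE(
    DB_NAME VARCHAR, SCHEMA_NAME VARCHAR, TABLE_NAME VARCHAR, MIN_ROWS FLOAT)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
  // Build and run a simple count query against the target table.
  var fqn = DB_NAME + "." + SCHEMA_NAME + "." + TABLE_NAME;
  var rs = snowflake.execute({sqlText: "SELECT COUNT(*) AS CNT FROM " + fqn});
  rs.next();
  var cnt = rs.getColumnValue("CNT");

  // Return the outcome as a JSON document (VARIANT).
  return {
    "expectation": "ROW_COUNT_ABOVE",
    "table": fqn,
    "threshold": MIN_ROWS,
    "observed": cnt,
    "passed": cnt >= MIN_ROWS
  };
$$;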

Because SnowDQ is native to Snowflake, there is no need for lengthy discussions with infrastructure, security, network, cloud admin, or procurement teams. In practice, this tactical solution can be adopted as part of current deliverables.

Data security and vulnerability

JavaScript stored procedures in Snowflake cannot import custom libraries and are not allowed to reach outside of Snowflake. Hence, no data gets pushed outside of the compute environment.

The verification/expectation stored procedures can be declared to run with either “owner’s” or “caller’s” rights, so there is no need to design new security roles just to access a secured data set.
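Switching between the two rights models is a one-line choice when the procedure is created. The procedure below is a trivial stand-in used only to show the clause; it is not part of SnowDQ.

-- Trivial stand-in procedure to show the rights clause; not part of SnowDQ.
-- With EXECUTE AS OWNER, the procedure runs with the owning role's privileges,
-- so callers can run checks against data they cannot query directly.
CREATE OR REPLACE PROCEDURE DEMO_RIGHTS_CHECK()
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
EXECUTE AS OWNER   -- or EXECUTE AS CALLER to use the invoking role's privileges
AS
$$
  return "check executed";
$$;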

Since the stored procedures run entirely within Snowflake, the time spent acquiring security approvals is reduced, which shortens the adoption timeline.

Invocation of Expectation Suite

With SnowDQ, verifications can be executed with a single call.

-- Invoke a whole expectations suite
call VERIFY_EXPECTATIONS('SNOWFLAKE_SAMPLE_DATA/TPCDS_SF100TCL', true, parse_json('{ "BATCH":456 }'));
-- Invoke a specific expectation
call VERIFY_NOT_NULL(
'SNOWFLAKE_SAMPLE_DATA','TPCDS_SF100TCL','CATALOG_RETURNS',
PARSE_JSON('{
"COL": "CR_WAREHOUSE_SK"
}')
);

This allows verifications to be invoked from tools like Matillion, dbt, or custom SQL scripts, so no new tool integrations need to be developed. They can also be scheduled natively inside Snowflake, as sketched below.
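For example, the same call could be scheduled nightly with a Snowflake task, so no external orchestrator is strictly required. The task name and warehouse (DQ_WH) below are illustrative assumptions, not part of SnowDQ.

-- Hypothetical example: run the expectation suite nightly via a Snowflake task.
-- Task and warehouse names are assumptions for illustration only.
CREATE OR REPLACE TASK NIGHTLY_DQ_CHECKS
  WAREHOUSE = DQ_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL VERIFY_EXPECTATIONS('SNOWFLAKE_SAMPLE_DATA/TPCDS_SF100TCL', true, parse_json('{ "BATCH":456 }'));

-- Tasks are created suspended; resume to activate the schedule.
ALTER TASK NIGHTLY_DQ_CHECKS RESUME;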

Validation result data store and accessibility

SnowDQ’s verification stored procedures can optionally capture the result in a table as a VARIANT in JSON format.

From there, it is possible to build a view on top of this table and have it served up into business intelligence dashboards, web apps, etc.
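As a sketch, assuming the results land in a table named DQ_RESULTS with a VARIANT column RESULT and a timestamp column CAPTURED_AT (all names and JSON keys here are assumptions, not SnowDQ’s actual schema), such a view might look like this:

-- Hypothetical view flattening the captured JSON results for reporting;
-- table, column, and key names are illustrative assumptions.
CREATE OR REPLACE VIEW DQ_RESULTS_VW AS
SELECT
    RESULT:"expectation"::VARCHAR AS expectation,
    RESULT:"table"::VARCHAR       AS table_name,
    RESULT:"passed"::BOOLEAN      AS passed,
    RESULT:"observed"             AS observed,
    CAPTURED_AT
FROM DQ_RESULTS;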

Thoughts

Hopefully, these practical examples are convincing enough for a tactical adoption of SnowDQ. Currently, there are only two verifications, but more can always be added as needed, especially since the code is open-sourced.

Please access the GitLab repo for the code and further documentation.

Here is another reference on how to move forward with further improvement and adoption: A Comprehensive Framework for Data Quality Management.

Ready to Accelerate Your Digital Transformation?

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with nearly every industry to solve the toughest data challenges, whether it be cloud and data migrations, engineering, architecture, pipelining, automation, CI/CD, etc — we can help you shorten time-to-value!

We offer a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Additional Resources

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap, available on Spotify, Apple, Google, and other popular streaming apps.

Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
