Snowflake Native Data Quality

A Game Changer for Governance

Published in data.world · Jun 28, 2023

by Jon Loyens, Co-founder and Chief Data & Strategy Officer

Snowflake made many exciting new product announcements this year at Summit, in both the keynote and the breakout sessions. However, as the Chief Data Officer of data.world, the thing that really stood out to me amid all the new capabilities and tools was something more immediately practical, yet still game changing, for managing a governance program: Snowflake’s new native data quality features.

Why do I say these features are game changing for governance? At first blush, you might ask how they compare to the various data quality and observability platforms on the market today. Snowflake’s data quality features are both simpler and more flexible: they are not a full-fledged observability tool, but they bring the foundations needed to build a data quality program directly into the data cloud. The core of the new offering is user-defined Data Metric Functions (created at the schema level) that can be named and then applied to tables or columns in your Snowflake databases. Additionally, Snowflake provides a library of out-of-the-box system metrics to get you started. Once applied, these metrics are regularly calculated and written to a table as a time series that quality checks can run against, checking thresholds and changes over time.
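As an illustrative sketch only (schema, table, and column names here are hypothetical, and the exact DDL may differ from what Snowflake ships), defining a metric and attaching it to a column could look something like this:

```sql
-- Hypothetical example: a schema-level Data Metric Function that counts NULLs.
-- Names and exact syntax are illustrative, not authoritative.
CREATE DATA METRIC FUNCTION governance.metrics.null_count(
  arg_t TABLE(arg_c VARCHAR)
)
RETURNS NUMBER
AS
'SELECT COUNT_IF(arg_c IS NULL) FROM arg_t';

-- Schedule metric evaluation on the table, then attach the metric to a column.
ALTER TABLE sales.public.orders
  SET DATA_METRIC_SCHEDULE = '60 MINUTE';

ALTER TABLE sales.public.orders
  ADD DATA METRIC FUNCTION governance.metrics.null_count ON (customer_email);
```

From that point on, the platform evaluates the metric on the schedule and records each result as a row in a time series, which is what the quality checks below run against.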

This seemingly simple system can impact governance programs in some awesome ways, making quality and certification much easier to manage and deploy consistently. Here’s how I anticipate this feature being used, and why I’m excited to deploy it at data.world:

1) Quality Metrics Can Be Globally Defined and Managed for Consistency

The new global, named metric object in Snowflake is a very powerful concept. It will allow governance teams to create simple, named global functions or tests that all other data teams can use, for consistency. This promotes reuse of data quality code and helps establish best practices for the data products that teams are publishing in Snowflake. Furthermore, this concept extends beyond data quality and edges toward the basics of a semantic layer in Snowflake, since core profiling and statistical metrics could be defined once and applied across data products and domains.
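To make the reuse point concrete, here is a sketch (all names hypothetical) of a single governance-owned metric being attached to tables owned by two different domain teams:

```sql
-- One governance-owned metric, reused across domains (illustrative names).
ALTER TABLE finance.reporting.invoices
  ADD DATA METRIC FUNCTION governance.metrics.null_count ON (invoice_id);

ALTER TABLE marketing.web.sessions
  ADD DATA METRIC FUNCTION governance.metrics.null_count ON (session_id);
```

Because both teams reference the same named function, the definition of "null count" is written once and governed centrally, rather than copy-pasted into each team's pipelines.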

2) Quality (and Testing) Is Now a Core Object in the Data Warehouse

As someone who cares deeply about governance, I love that this concept is tied directly to the data and available in our data warehouse: data owners can now run checks and audits to make sure that our most important data assets actually have tests assigned to them and running. This makes it much easier to define and enforce testing requirements as part of a definition of done. For example, you could run a single query that checks the table and column definitions in your information schema and produces a report listing what percentage of columns are covered by defined, standardized quality metrics, much like a code coverage report in software engineering.
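A sketch of that coverage report might look like the following. Note that the table tracking metric assignments (`governance.audit.metric_assignments`) is a hypothetical name standing in for wherever your account exposes metric-to-column associations; the information schema side is standard.

```sql
-- Illustrative "test coverage" report: what fraction of columns in a schema
-- have a standardized quality metric attached?
WITH all_columns AS (
  SELECT table_name, column_name
  FROM my_db.information_schema.columns
  WHERE table_schema = 'PUBLIC'
),
covered AS (
  -- Hypothetical tracking table of metric-to-column assignments.
  SELECT table_name, column_name
  FROM governance.audit.metric_assignments
)
SELECT
  COUNT(DISTINCT c.table_name || '.' || c.column_name) AS total_columns,
  COUNT(DISTINCT v.table_name || '.' || v.column_name) AS covered_columns,
  ROUND(covered_columns / NULLIF(total_columns, 0) * 100, 1) AS pct_covered
FROM all_columns c
LEFT JOIN covered v
  ON c.table_name = v.table_name
 AND c.column_name = v.column_name;
```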

3) Quality Metrics, Rules and Tests can be expressed as SQL Definitions

Part of the magic of Snowflake’s platform, in my opinion, is that everything can be expressed as simple, straightforward SQL. This means the quality metrics you define can be governed and deployed using typical software lifecycle management tools and techniques: quite simply, they can be source controlled and deployed through your standard processes and tooling. You could very easily, for example, deploy your metrics as part of dbt runs.

Since the results of metric runs are simply written to a table in Snowflake, plain SQL can be used to express global tests and rules. You can test for thresholds or deltas and use the results to drive reporting, tagging, and visualization of data quality. Imagine writing a simple SQL statement that tests for thresholds in the quality metric table and then automatically adds tag values to tables when thresholds are out of bounds!
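As a sketch of that idea (the results table location, threshold, and tag names are all assumptions for illustration, not a documented interface):

```sql
-- Illustrative threshold check against the metric results time series.
-- The results table name is an assumption about where Snowflake lands results.
SELECT table_name, metric_name, value, measurement_time
FROM snowflake.local.data_quality_monitoring_results
WHERE metric_name = 'NULL_COUNT'
  AND value > 100   -- your chosen threshold
  AND measurement_time > DATEADD('day', -1, CURRENT_TIMESTAMP());

-- ...and, driven by that result, flag the offending table with a tag
-- (the tag and its value are hypothetical).
ALTER TABLE sales.public.orders
  SET TAG governance.tags.quality_status = 'THRESHOLD_BREACHED';
```

Wrapping the check in a scheduled task is the natural next step, so the tagging happens automatically whenever a metric drifts out of bounds.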

4) Certification Workflows in Governance Programs Can Be Tied to These Tests and Automatically Surfaced in a Catalog or Governance Tool

We’re excited to build native support and management for these features directly into data.world in the coming months. Imagine a world where, once appropriate metrics are configured on critical data assets, Data Stewards and Owners can simply keep an eye on the quality of those tables (freshness, completeness, validity) all from within the data.world Catalog. Of course, Stewards will also be able to integrate these checks with our just-announced Eureka Bots automations as well!

You might want to start taking advantage of these features immediately, though, and the power of our knowledge graph architecture allows for that too. Since quality metrics are delivered as a table in Snowflake, it’s possible to virtualize that table into the data.world environment and immediately build automations off the metrics using either SQL or SPARQL (for example, you could automatically set tags or statuses on data assets based on the imported metrics). We’ll be following up with an article on how to do this soon!

At data.world, we’re very excited about this practical and immediately useful new feature set in Snowflake. To that end, we’re already planning to build features into the data.world catalog that make these capabilities easier to manage as part of your governance program. Even before those features are available, however, data.world’s knowledge graph architecture means that as Snowflake introduces new capabilities and objects in its platform, you can immediately take advantage of them in your governance program and data catalog. Make sure to follow us for more how-to’s, updates, and news on these capabilities, and reach out to us if you’d like to see the extensibility of data.world in action.

If you would like to learn more, visit data.world.
