Performing Data Quality Checks in Snowflake — Part 1

My learnings from the field for data quality and data monitoring in Snowflake

Venkat Sekar
Hashmap, an NTT DATA Company
7 min read · Jun 21, 2021


Overview

The widespread adoption of Snowflake has accelerated greatly over the last few years, and even more so after its initial public offering. Honestly, this is because the product is just that good. Before Snowflake, the major clouds (AWS, Azure, & Google Cloud Platform) dominated the space.

Snowflake has been able to differentiate itself from the market by providing a fully managed SaaS service that handles all of the backend infrastructure work that typically accompanies the major cloud providers.

The ability to easily deploy Snowflake on any of the major cloud providers, scale up or down as needed, spin up new warehouses for additional workloads, and pay by the second, all through a SQL-friendly interface, makes it an ideal choice for a cloud data platform. Nonetheless, as with any cloud data platform, data engineering is vitally important.
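As a quick, hedged illustration of that pay-by-the-second, SQL-driven model, a dedicated warehouse for data quality workloads can be created and configured to suspend itself when idle entirely in SQL (the warehouse name below is a hypothetical placeholder):

    -- Hypothetical warehouse dedicated to data quality checks; the name is a placeholder.
    CREATE WAREHOUSE IF NOT EXISTS DQ_WH
      WAREHOUSE_SIZE      = 'XSMALL'
      AUTO_SUSPEND        = 60     -- suspend after 60 seconds of inactivity
      AUTO_RESUME         = TRUE   -- wake up automatically when a query arrives
      INITIALLY_SUSPENDED = TRUE;  -- do not start billing until first use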

When it comes to data engineering best practices, data quality and data monitoring should be at the forefront. However much attention data cataloging and data governance receive, having “continuous data quality” and “continuous data monitoring” is a definite need.

Defining Data Quality & Monitoring

To start, episode #178 of the “Data Engineering Podcast” provides a valuable perspective on data quality and monitoring, clearly articulating the what’s, why’s, and how’s.

In general, running unit tests and other checks on code to ensure that it will behave correctly in the production Snowflake environment should be standard procedure. Yet failures in the real production environment are typically caused by underlying data changes and other factors that those tests did not account for.

For example, a feed from an external data source could change a data type or structure, causing a break somewhere in the data processing pipeline to Snowflake. Being aware of such changes and alerting interested parties when they happen will eliminate future headaches.
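As a minimal, hedged sketch of what such a check could look like in plain Snowflake SQL (the RAW database, LANDING schema, and DQ.EXPECTED_COLUMNS snapshot table are hypothetical placeholders), schema drift can be detected by comparing the live column definitions against a snapshot taken after the last known-good load:

    -- Capture the expected structure once (or after each approved change).
    -- RAW, LANDING, and DQ.EXPECTED_COLUMNS are placeholder names.
    CREATE TABLE IF NOT EXISTS DQ.EXPECTED_COLUMNS AS
    SELECT table_name, column_name, data_type
    FROM   RAW.INFORMATION_SCHEMA.COLUMNS
    WHERE  table_schema = 'LANDING';

    -- Run on every load: any rows returned indicate added, dropped,
    -- or re-typed columns and should trigger an alert.
    SELECT COALESCE(c.table_name,  e.table_name)  AS table_name,
           COALESCE(c.column_name, e.column_name) AS column_name,
           e.data_type AS expected_type,
           c.data_type AS current_type
    FROM  (SELECT table_name, column_name, data_type
           FROM   RAW.INFORMATION_SCHEMA.COLUMNS
           WHERE  table_schema = 'LANDING') c
    FULL OUTER JOIN DQ.EXPECTED_COLUMNS e
           ON  c.table_name  = e.table_name
           AND c.column_name = e.column_name
    WHERE  c.column_name IS NULL               -- column dropped
       OR  e.column_name IS NULL               -- column added
       OR  c.data_type <> e.data_type;         -- data type changed

A scheduled task or an orchestration job could run the second query after each load and raise an alert whenever it returns rows.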

Performing “continuous” data quality/monitoring is like having access to all of the dials, controls, and levers in the cockpit of an airplane or the cab of a train. When something is wrong, a real-time alert notifies the pilot or driver and directs them to the problem so that it can be resolved.

What are some of the choices?

The good news is that there are a wide variety of players in this field that can address this need for Snowflake.

  • Monte Carlo: data observability and data reliability while minimizing data downtime
  • Great Expectations: open-source Python-based library for validating, documenting, and profiling data
  • SODA: open-source command-line tool that utilizes user-defined input to prepare SQL queries that run tests on database tables to find invalid, missing, or unexpected data (the kind of test these tools generate is sketched just after this list)
  • DataFold: ensure data quality through diffs, profiling, and anomaly detection within your existing infrastructure & CI/CD
  • Anomalo: validate and document data within a data warehouse all without writing a single piece of code
  • and many more
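To make that concrete, here is a hedged sketch of the kind of row-level test such tools typically compile down to; the ANALYTICS.PUBLIC.ORDERS table and its columns are hypothetical placeholders:

    -- Hypothetical checks against a placeholder ORDERS table:
    -- completeness, uniqueness, validity, and freshness in one pass.
    SELECT COUNT(*)                                              AS row_count,
           SUM(IFF(order_id IS NULL, 1, 0))                      AS missing_order_ids,
           COUNT(*) - COUNT(DISTINCT order_id)                   AS duplicate_order_ids,
           SUM(IFF(order_amount < 0, 1, 0))                      AS negative_amounts,
           DATEDIFF('hour', MAX(loaded_at), CURRENT_TIMESTAMP()) AS hours_since_last_load
    FROM   ANALYTICS.PUBLIC.ORDERS;

Each of the tools above essentially automates defining such checks, comparing the results against thresholds, and alerting when a threshold is breached.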

While each of the above vendors and products is well developed and has seen some industry adoption, there are still some aspects of adoption that often go unaccounted for. With that in mind, it makes sense to highlight a few of these to enable sound decision-making around “continuous” data quality/monitoring moving forward. Please note that all of these opinions are based on previous engagements and experience gained as a data migration consultant covering various sectors, team sizes, etc.

Not all Teams are Equal

Many teams are just starting out on their cloud adoption and data migration journey. Often they are still learning the ropes and complexities of developing, promoting, and executing data pipelines. For a SQL-focused data engineer, Python is a bit of a stretch, and educating and onboarding SQL data engineers in Python does not happen quickly. Hence, some clients initially shy away from Python-based tooling.

When adopting tools like Great Expectations, consider onboarding one or two Python-proficient resources into the data engineering team. Quite often, a substantial amount of training is involved in the process. The best way to shorten it is to look for expertise and resources across other departments and see what insight can be gained.

The support team is usually a separate group that manages the application in production. In most cases, the skillsets needed to properly run data quality/monitoring tools already exist there. Additionally, the support team has rich knowledge of data issues, having likely encountered and addressed many of them.

Involving this team when developing KPIs and metrics will aid in successful adoption. However, the timelines and process involved in training, mentoring, etc. with the support team can be lengthy.

Not all Companies are Equal

Working with smaller startups or medium-sized companies tends to be easier because they are agile, nimble, and open to adopting specific tools that can more quickly meet their end goal. In larger organizations, however, the adoption of new tools can be a lengthy, time-consuming process; some enterprise clients are still pushing back against “dbt”, for instance. This is largely due to enterprise standards, architecture approvals, security assessments, the need for commercial support, and the procurement lifecycle.

When looking to adopt new tools, always account for the process involved in getting a product installed in even a dev environment, and the timeline that process requires.

Not All Operating Environments are Equal

A tool like “Great Expectations” or “SODAsql” needs a Python-based environment to run. It could be wrapped into a Docker container and deployed on a VM, a Kubernetes cluster, etc. However, depending on the industry (insurance, finance, healthcare, etc.), the choice of Docker/VM image may be restricted, largely due to compliance mandates like HIPAA.

The compute needed to host such a tool is minimal, but the process involved in acquiring the right type of compute can be considerable. Many meetings and approvals with various stakeholders are required in order to get the infrastructure in place for adoption, so this route could take a substantial amount of time.

Not all Data Security/Compliance(s) are Equal

Data security policies can also affect adoption, depending on the data being verified. For example, if a certain expectation/validation query involves checking PII records, it is important to first understand what the verification process involves:

  • Does the data get extracted out of the data store (e.g. Snowflake) and sent over the wire to the hosting/compute environment where the tool resides?
  • What user role/grants are needed for this verification to be made? (a minimal role sketch follows this list)
  • Where is the temporary storage for holding the data intermediately as the verification is performed?
  • Once verification is done, how can you ensure that any intermediate copies of the data have been cleaned up?
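On the grants question specifically, a hedged sketch of a minimal, read-only role might look like the following; all object names here are hypothetical placeholders:

    -- Hypothetical minimal role: read-only access scoped to the objects being
    -- verified, so the data quality tool receives no broader privileges.
    -- Role, warehouse, database, and schema names are placeholders.
    CREATE ROLE IF NOT EXISTS DQ_MONITOR;
    GRANT USAGE  ON WAREHOUSE DQ_WH            TO ROLE DQ_MONITOR;
    GRANT USAGE  ON DATABASE  ANALYTICS        TO ROLE DQ_MONITOR;
    GRANT USAGE  ON SCHEMA    ANALYTICS.PUBLIC TO ROLE DQ_MONITOR;
    GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.PUBLIC TO ROLE DQ_MONITOR;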

These are just a few of the questions that need to be addressed. Even if the data is not PII or otherwise compliance-sensitive, capturing and documenting the answers will prevent future headaches.

Final Thoughts

Hopefully, this article was helpful and raised some awareness of the important aspects of the “continuous” data quality/monitoring adoption journey that go beyond the product itself.

Although implementing an open-source or vendor product would be a good long-term choice, until such a product is in place, these tactical considerations should aid in designing for “continuous” data quality/monitoring within Snowflake. In the coming weeks, Part 2 of this series will take this analysis to a more formal level, showcasing how it can be leveraged to address some of the above aspects and help real teams moving forward.

Ready to Accelerate Your Digital Transformation?

At Hashmap an NTT DATA company, we work with our clients to build better, together. We are partnering across industries to solve the toughest data challenges, whether it be cloud and data migrations, engineering, architecture, pipelining, automation, CI/CD, etc. — we can help you shorten time-to-value!

We offer a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our data and cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.

Additional Resources

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
