Acceldata and Snowflake

Acceldata · Published in The Data Observer · Nov 11, 2020

In case you somehow missed it, Snowflake, the leading provider of cloud-based data warehousing solutions, just completed the biggest tech IPO of 2020. Snowflake’s recent market value neared a mind-boggling $75 billion. Since that’s enough to ensure that everyone has Snowflake on the brain, we figured it might be a good time to show data junkies how they can use Acceldata Torch in conjunction with Snowflake to enjoy a huge boost in analyst productivity and data culture.

In order to fully appreciate Snowflake’s value, it’s useful to quickly review the typical enterprise’s data users and their behaviors. The diagram below shows where data engineers, and more broadly the CDO organization, fit in this world.

Where Data Engineers fit in Data Supply Chain

One of the key values of Snowflake is that it democratizes data access. Because Snowflake is backed by object storage, it lets organizations treat enterprise data like a data lake with an elastic warehouse on top. Unlike many traditional data warehouses, Snowflake does not expose user-defined partitions or indexes. Instead, Snowflake automatically divides large tables into micro-partitions and keeps statistics about the value ranges contained in each column of each micro-partition. These statistics let the query engine prune micro-partitions that cannot contain matching data, so only the relevant subsets of your data are scanned. For most data practitioners, this shift to micro-partitions shouldn’t be an issue. In fact, many people choose to migrate to Snowflake because this approach reduces query latency.
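To make the pruning idea concrete, here is a minimal Python sketch (purely conceptual, not Snowflake’s actual implementation) of how per-partition min/max statistics let a query skip micro-partitions whose value ranges cannot match its filter:

```python
# Conceptual sketch only: Snowflake's real micro-partitioning is internal and far
# more sophisticated. This just shows how min/max statistics enable pruning.

partitions = [
    {"id": 1, "min_order_date": "2020-01-01", "max_order_date": "2020-03-31"},
    {"id": 2, "min_order_date": "2020-04-01", "max_order_date": "2020-06-30"},
    {"id": 3, "min_order_date": "2020-07-01", "max_order_date": "2020-09-30"},
]

def partitions_to_scan(parts, lo, hi):
    """Keep only partitions whose [min, max] range overlaps the filter range."""
    return [p["id"] for p in parts
            if p["max_order_date"] >= lo and p["min_order_date"] <= hi]

# A query filtering on Q2 2020 only needs to read partition 2.
print(partitions_to_scan(partitions, "2020-04-01", "2020-06-30"))  # -> [2]
```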

Still, if you have partitions and indexes in your current ecosystem and are migrating to Snowflake’s clustering model, you need to be able to build complete trust in the data. Trust in data requires not only a solid understanding of data schemas and lineage, but also the ability to address seemingly minor issues like syntax errors introduced during migration (for example, Snowflake treats quoted identifiers and string comparisons as case-sensitive). Unfortunately, traditional data quality analysis does not catch migration syntax errors. Tools like Looker can be used to spot-check inconsistencies and provide ad-hoc quality tests, but deeper and more systematic tools are often needed to create comprehensive trust in data.
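As a concrete illustration of the kind of migration gotcha this refers to, here is a small, hypothetical sketch using the snowflake-connector-python driver; the connection parameters, customers table, and state column are placeholders. It highlights two case-sensitivity surprises that commonly bite teams coming from case-insensitive warehouses:

```python
# Hypothetical illustration of two common case-sensitivity surprises when moving
# to Snowflake; the connection parameters, table, and column are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="your_warehouse", database="your_db", schema="your_schema",
)
cur = conn.cursor()

# 1) Quoted identifiers preserve case: a table created as "MyTable" can only be
#    referenced as "MyTable", while an unquoted mytable resolves to MYTABLE.
# 2) String comparisons are case-sensitive by default, so filters and joins that
#    matched in a case-insensitive source warehouse may quietly drop rows.
cur.execute("SELECT COUNT(*) FROM customers WHERE state = 'ca'")
raw_count = cur.fetchone()[0]
cur.execute("SELECT COUNT(*) FROM customers WHERE UPPER(state) = 'CA'")
normalized_count = cur.fetchone()[0]

if raw_count != normalized_count:
    print("Case-sensitive comparison is dropping rows; normalize or use COLLATE.")

cur.close()
conn.close()
```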

This is where Acceldata Torch comes into play. Acceldata Torch validates data from an existing warehouse, like Teradata, as the data is moved into Snowflake. In the video below, you can see how Torch samples data and provides profiling reports during the migration, and reconciliation reports once the data has landed in Snowflake. This confirms data quality throughout the move and identifies any lingering issues.

Teradata -> Snowflake | Custom ETL validation
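Torch automates this end to end, but as a rough illustration of what even a basic reconciliation check involves, here is a minimal sketch that compares row counts and a simple column aggregate between the Teradata source and the Snowflake target. The teradatasql and snowflake-connector-python drivers are assumed, and all credentials and table names are placeholders:

```python
# Minimal reconciliation sketch: compare row counts and a simple column aggregate
# between the Teradata source and the Snowflake target. Driver usage is
# illustrative; credentials and table/column names are placeholders.
import teradatasql
import snowflake.connector

def one_value(cursor, sql):
    cursor.execute(sql)
    return cursor.fetchone()[0]

td = teradatasql.connect(host="td_host", user="td_user", password="td_password")
sf = snowflake.connector.connect(account="your_account", user="sf_user",
                                 password="sf_password", warehouse="your_wh")
td_cur = td.cursor()
sf_cur = sf.cursor()

checks = {
    "row_count": "SELECT COUNT(*) FROM sales.orders",
    "amount_sum": "SELECT SUM(order_amount) FROM sales.orders",
}

for name, sql in checks.items():
    source, target = one_value(td_cur, sql), one_value(sf_cur, sql)
    status = "OK" if source == target else "MISMATCH"
    print(f"{name}: source={source} target={target} -> {status}")

td_cur.close()
sf_cur.close()
td.close()
sf.close()
```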

Pairing Acceldata with Snowflake provides four main benefits for enterprise data users:

  • Empower Anyone to Find Trusted Data: Acceldata makes it easy for anyone, from data scientists to business users, to find and understand trusted data. Acceldata Torch discovers your data and inventories it into a catalog, identifying critical data and giving it business context as it flows through your data supply chain. Torch provides a complete, real-time, automated data map of every data asset. In addition, Torch can automatically profile the data and highlight attributes such as unique values, data types, meaning, and statistical properties. Finally, Torch is a continuous data quality solution: you define policies that are asserted against the data as it arrives, and Torch integrates with popular workflow tools, like Slack and Jira, to alert users of data downtime and other errors.
  • Go Farther, Together: With the rise of self-service analytics, more people are empowered to create reports, dashboards, and visualizations. While these assets are available in Snowflake, data consumers often don’t know they exist and needlessly recreate work. With Acceldata Torch, as data consumers find relevant data in Snowflake, Torch automatically points them to related data assets and the colleagues who created them. That means data consumers can build on existing work and collaborate more easily, increasing overall productivity around trusted enterprise data. A good example is a bank where customer date of birth is captured at account opening; typically, 5% of birth dates are inaccurate. A simple data quality check across the data pipeline, ensuring date of birth is captured correctly (within a valid range and in the expected date format), can help increase cross-sell opportunities, since regulations sometimes require products to be sold only to customers above a certain age. A minimal sketch of such a check appears after this list.
  • Find Data, No Matter Where it Resides: Acceldata also helps enterprises that are looking to move workloads to Snowflake’s cloud data platform. Acceldata provides a unified view of data, no matter where it resides. This makes it easier for data teams to determine which data should move, while providing data consumers consistent experiences through migration and beyond.
  • Planning for Data Consumers: Data modelers can determine the impact of a new release (column addition, deletion, data type change, etc.) on downstream BI applications.
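As a concrete illustration of the date-of-birth check described above (not Torch’s actual policy syntax), here is a minimal Python sketch; the field name, date format, and valid range are assumptions:

```python
# Minimal sketch of the date-of-birth check described above; the field name,
# format, and range are assumptions, and this is not Torch's policy syntax.
from datetime import date, datetime

MIN_DOB = date(1900, 1, 1)

def dob_errors(records, fmt="%Y-%m-%d"):
    """Return (record, reason) pairs for rows whose date_of_birth is invalid."""
    errors = []
    today = date.today()
    for rec in records:
        raw = rec.get("date_of_birth")
        try:
            dob = datetime.strptime(raw, fmt).date()
        except (TypeError, ValueError):
            errors.append((rec, "missing or wrong date format"))
            continue
        if not (MIN_DOB <= dob <= today):
            errors.append((rec, "date of birth outside valid range"))
    return errors

sample = [
    {"account_id": 1, "date_of_birth": "1985-06-14"},
    {"account_id": 2, "date_of_birth": "14/06/1985"},  # wrong format
    {"account_id": 3, "date_of_birth": "2150-01-01"},  # out of range
]
for rec, reason in dob_errors(sample):
    print(rec["account_id"], reason)
```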

The Acceldata Data Observability Cloud accelerates enterprise data migration to the cloud. Data reconciliation and quality issues often impede, and sometimes outright prevent, large-scale data migration to public clouds. With Acceldata Torch, enterprises can automatically reconcile large-scale one-time migrations and continuously validate incoming mutable and immutable datasets generated by on-premise and cloud applications. As enterprises move data across multiple systems, Acceldata’s Torch service enables them to maintain data quality, enforce conformance to schemas, manage sensitive data, and prevent anomalous data drift.
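To show what the simplest form of schema-conformance checking looks like, here is a bare-bones Python sketch that compares the columns and types a consumer expects against what the target table actually exposes; the column names and types are illustrative, and Torch’s policy engine is of course far richer:

```python
# Bare-bones sketch of a schema-conformance check: compare the columns and types
# a downstream consumer expects against what the target table actually exposes.
# Column names and types are illustrative; in practice the observed schema would
# come from the warehouse (e.g., DESCRIBE TABLE) rather than being hard-coded.

expected_schema = {
    "ACCOUNT_ID": "NUMBER",
    "OPEN_DATE": "DATE",
    "BALANCE": "NUMBER",
}

observed_schema = {
    "ACCOUNT_ID": "NUMBER",
    "OPEN_DATE": "VARCHAR",  # type drifted in a new release
    "CURRENCY": "VARCHAR",   # unexpected new column
}

missing = expected_schema.keys() - observed_schema.keys()
unexpected = observed_schema.keys() - expected_schema.keys()
type_drift = {col: (expected_schema[col], observed_schema[col])
              for col in expected_schema.keys() & observed_schema.keys()
              if expected_schema[col] != observed_schema[col]}

print("missing columns:", sorted(missing))
print("unexpected columns:", sorted(unexpected))
print("type drift:", type_drift)
```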

Appendix

Data Production errors: These are data quality issues that arise during the production of data in, for example, a banking system. Typical causes are incorrect data entry or insufficient technical controls to ensure data is entered into the system in the correct format.

Data Transmission errors: These are data quality errors that arise as data is transmitted from data producer to data consumer via different hops. Examples include column data getting truncated due to non-standard field lengths, or records (rows of data) getting dropped by logic written to discard non-standard data.

Data Consumption errors: This type of error occurs when data is transmitted properly from producers to consumers, but the consumer doesn’t know how to use it correctly and applies incorrect rules. This is akin to skipping the user guide for a new electronic device and destroying it with improper setup. A typical example is a data consumer assuming that the column “open date” means account open date, while the data producer actually populates account open date in a different field, “acctopn date”.
