Data Virtualization Strategies On Snowflake

Balancing Cost, Performance, and Operational Efficiency


Data virtualization is a strategy to store as little data as necessary in physical structures (i.e., tables) and instead use virtual/logical structures (i.e., views) to implement data transformation and business logic. There are pros and cons to a highly virtualized model, and Snowflake is often asked to guide customers on the practicality of virtualizing data in a multi-layered environment. This article evaluates the impact of virtualization on operational efficiency, management efficiency, cost, and performance.

Virtualization goes beyond simply having raw staging schemas and curated reporting schemas. In a modern data architecture, data passes through zones, each holding data in a different state, and each zone serves as the source for how data is transformed and presented in the next. This exercise focuses on an architecture with three zones (staging, domain, and mart): raw data lands in the staging zone, is transformed and curated into a data model in the domain zone, and is finally shaped into subject-specific collections in the mart zone.

It is not required to virtualize an entire zone, and the cost vs. gains of virtualizing one or more zones, or subsets of zones, should be considered.

1. Operational Efficiency

For operational efficiency, a highly virtualized model reduces the time to turn raw data into report-ready data, since data is not moved or physically instantiated during this "raw to report" processing. Using views to implement transformation and business logic, rather than physically storing the data in tables, promotes a design with a high degree of data agility. If an error or unexpected event is encountered in a highly virtualized design, the errant behavior can be addressed by creating or modifying views rather than re-running transformation jobs. This reduces operational overhead: data is instantly visible and accessible as soon as the views perform the required transformations, without the time needed to materialize data between zones.
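As a minimal sketch of this pattern (the schema, table, and column names are illustrative, not taken from any particular implementation), the staging zone holds the only physical copy of the data, while the domain and mart zones are implemented entirely as views:

    -- Staging zone: raw data is landed once in a physical table.
    CREATE TABLE staging.orders_raw (
        order_id       NUMBER,
        customer_id    NUMBER,
        order_ts       TIMESTAMP_NTZ,
        amount         NUMBER(12,2),
        source_payload VARIANT
    );

    -- Domain zone: transformation and business logic live in a view, so a
    -- logic change takes effect immediately without re-running a load job.
    CREATE VIEW domain.orders AS
    SELECT order_id,
           customer_id,
           order_ts,
           amount,
           IFF(amount < 0, 'REFUND', 'SALE') AS order_type
    FROM staging.orders_raw;

    -- Mart zone: a subject-specific view layered on the domain view.
    CREATE VIEW mart.daily_sales AS
    SELECT DATE_TRUNC('DAY', order_ts) AS order_date,
           SUM(amount)                 AS total_amount
    FROM domain.orders
    WHERE order_type = 'SALE'
    GROUP BY 1;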

In a Snowflake implementation, using views to virtualize data as it moves between zones may save time, but it does not necessarily provide data resiliency. Snowflake's Time Travel capability allows objects to be recovered from a point in time before an errant transformation caused inconsistencies: a physical object can be recovered to the point just before the error, allowing the error-causing condition to be corrected and processing to resume from that point. Another Snowflake feature affected by extensive virtualization is Zero-Copy Cloning, which allows tables, schemas, or an entire database to be copied by cloning those objects' metadata. While it is not possible to clone an individual view, it is possible to clone a schema or database containing views when the underlying tables that provide the data for those views are cloned along with them.
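For example (object names continue the sketch above, and the database name and query ID are placeholders), a physical table can be restored to its state before an errant statement using Time Travel, and a database can be cloned so that its views and the tables that feed them travel together:

    -- Restore a physical table to the point just before an errant load,
    -- using Time Travel (supply the query ID of the errant statement).
    CREATE OR REPLACE TABLE staging.orders_raw_restored
      CLONE staging.orders_raw
      BEFORE (STATEMENT => '<query_id_of_errant_statement>');

    -- Zero-Copy Clone an entire database: the views are cloned along with
    -- the underlying tables they read from, so the clone is fully usable.
    CREATE DATABASE analytics_dev CLONE analytics;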

2. Performance

Because performance is a critical factor in any analytics platform, it is important to understand how Snowflake delivers performance and how a virtualized model could affect it. A highly virtualized design is unlikely to perform as well as a more physicalized design, and performance can be improved by physicalizing some objects while virtualizing others; a heavily virtualized design therefore requires a highly performant analytics query processing platform. To evaluate the "to virtualize or not to virtualize" question on Snowflake's Cloud Data Platform, it is necessary to understand the principal way Snowflake's optimizer derives its performance. When Snowflake stores data (physicalized), it stores it in Snowflake's proprietary structure called a micro-partition, and the driving factor in Snowflake's ability to deliver query performance is a process called partition pruning:

https://docs.snowflake.com/en/user-guide/tables-clustering-micro-partitions.html#query-pruning

Partition pruning depends on statistics that Snowflake collects when storing data in a physical structure. Snowflake uses these statistics to determine which micro-partitions must participate in a query and which can be excluded based on the query predicates. Without statistics on physical data, the optimizer must estimate these metrics from the available data points when performing operations in views, and as views are nested, those estimations rest less on physical data statistics and may become less accurate. Also, because Snowflake does not use indexes, performance tuning is achieved through techniques such as:

  • Data Clustering
  • Materialized Views
  • Search Optimization

Data Clustering

Data Clustering (https://docs.snowflake.com/en/user-guide/tables-clustering-micro-partitions.html#what-is-data-clustering) is a process in which Snowflake stores micro-partitions in a specified order based on anticipated query predicates. It arranges data in clusters so the optimizer can scan fewer partitions and satisfy the query more quickly. Clustering is possible only for data stored in tables, and a table may be clustered by a column, by multiple columns, or by an expression based on a column in that table. Snowflake does not recommend clustering for all tables; it is typically reserved for multi-terabyte tables (https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html).
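As an illustration (table and column names are hypothetical), a clustering key can be declared when the table is created or added later, and its effectiveness can be inspected:

    -- Declare a clustering key on a large table at creation time...
    CREATE TABLE domain.orders_history (
        order_id    NUMBER,
        customer_id NUMBER,
        order_ts    TIMESTAMP_NTZ,
        amount      NUMBER(12,2)
    )
    CLUSTER BY (TO_DATE(order_ts), customer_id);

    -- ...or add/alter one on an existing table.
    ALTER TABLE domain.orders_history CLUSTER BY (TO_DATE(order_ts), customer_id);

    -- Inspect how well the table is clustered on those expressions.
    SELECT SYSTEM$CLUSTERING_INFORMATION(
        'domain.orders_history', '(TO_DATE(order_ts), customer_id)');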

Materialized Views

Snowflake's Materialized Views allow data to be physicalized for purposes such as pre-aggregating computed values or specifying a different clustering strategy. Another use of materialized views specific to data virtualization is with external tables, which are virtual relational structures created over files residing in cloud storage rather than in Snowflake's micro-partitions. Placing a materialized view over an external table enables the model either to pre-aggregate data from the underlying raw files or to restrict the data elements to those that are relevant to the overall data landscape (https://docs.snowflake.com/en/user-guide/views-materialized.html#when-to-use-materialized-views).
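A hedged sketch of that pattern (the stage, file format, and field names are assumptions): an external table exposes Parquet files sitting in a cloud-storage stage, and a materialized view physicalizes only the relevant, typed columns inside Snowflake micro-partitions:

    -- External table: a virtual structure over files in a cloud-storage stage
    -- (the stage @staging.order_stage is assumed to already exist).
    CREATE EXTERNAL TABLE staging.orders_ext
      WITH LOCATION = @staging.order_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = TRUE;

    -- Materialized view over the external table: restrict and type the raw
    -- VALUE column so queries read micro-partitioned, statistics-backed data.
    CREATE MATERIALIZED VIEW domain.orders_ext_mv AS
    SELECT value:order_id::NUMBER        AS order_id,
           value:order_ts::TIMESTAMP_NTZ AS order_ts,
           value:amount::NUMBER(12,2)    AS amount
    FROM staging.orders_ext;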

Search Optimization

Search Optimization allows access paths to be created that improve performance for point lookups (i.e., targeted/tactical queries) that do not scan large amounts of data, allowing the optimizer to make more granular selections on high-cardinality columns (https://community.snowflake.com/s/article/Search-Optimization-When-How-To-Use). Search Optimization is enabled only on tables.
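For instance (using the hypothetical table from above), search optimization is enabled at the table level and then benefits selective point-lookup queries:

    -- Enable the search optimization service on a table.
    ALTER TABLE domain.orders_history ADD SEARCH OPTIMIZATION;

    -- A selective point lookup on a high-cardinality column that the search
    -- access path can answer without scanning most micro-partitions.
    SELECT * FROM domain.orders_history WHERE order_id = 1234567;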

Snowflake has several types of tables, and all of them fully leverage Snowflake's statistics collection and partition pruning. The default, a permanent table, persists data and benefits from Snowflake's Time Travel feature for data protection and recovery, as well as Fail-safe protection beyond the Time Travel period.

Temporary Tables

A temporary table persists only for the duration of a session. A user or application may create a temporary table to hold data for the session's duration and have it implicitly dropped when the application terminates its Snowflake session. Temporary tables allow data to be materialized for a short time, even in a highly virtualized model.
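A small sketch (names are illustrative): a session-scoped table that disappears automatically when the session ends:

    -- Exists only for the current session; dropped automatically at session end.
    CREATE TEMPORARY TABLE orders_today_tmp AS
    SELECT *
    FROM staging.orders_raw
    WHERE order_ts >= CURRENT_DATE();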

Transient Tables

A transient table has the properties of a permanent table but lacks its full data protection and recovery resiliency: it has limited Time Travel protection and no Fail-safe protection. A full comparison of these structures is available in the Snowflake documentation (https://docs.snowflake.com/en/user-guide/tables-temp-transient.html#comparison-of-table-types).
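For example (continuing the illustrative names), an intermediate result that is cheap to rebuild can be held in a transient table:

    -- Persists across sessions, but with limited Time Travel and no Fail-safe,
    -- which suits intermediate data that can easily be re-created.
    CREATE TRANSIENT TABLE domain.orders_intermediate AS
    SELECT order_id, customer_id, order_ts, amount
    FROM staging.orders_raw;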

Because Snowflake's performance depends on its ability to prune micro-partitions at runtime, at some point data must be physicalized so that Snowflake can collect the statistics that enable pruning. The goal is to balance physicalizing data for performance against the overall objective of virtualizing as much data as possible. When views reference tables and queries against those views use predicates based on the physicalized columns, Snowflake will push the predicates down in the query plan and prune at the base table. However, if queries use predicates based on attributes derived in a view or a nested view rather than on an underlying table, partition pruning is not possible and the optimizer will scan all of the underlying data.
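Using the illustrative view from earlier, the difference looks like this: a filter on a physical column is pushed down and pruned at the base table, while a filter on a column derived inside the view cannot be used for pruning:

    -- Prunes: order_ts is a physical column of staging.orders_raw, so the
    -- predicate is pushed down through domain.orders to the base table.
    SELECT * FROM domain.orders WHERE order_ts >= '2023-01-01';

    -- Does not prune: order_type exists only as an expression inside the view,
    -- so the optimizer must scan the underlying micro-partitions.
    SELECT * FROM domain.orders WHERE order_type = 'REFUND';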

3. Cost

This leads to the cost impact of a highly virtualized design. Physically storing data in Snowflake is inexpensive ($23 per compressed TB per month), so the cost of storing data as it traverses each zone is unlikely to be a prohibitive or deciding factor. However, it is important to consider the cost of computational resources in a highly virtualized design.

Consider the case described earlier where pruning happens against a base table. In that scenario, it may be possible to execute queries on smaller warehouses, since the optimizer will scan only the minimum number of partitions needed to process the query. In the other scenario, where no base-table partitions are pruned, larger warehouses will be required to scan the larger number of partitions and maintain an acceptable degree of performance.

It is not necessary to size a warehouse with enough memory to hold all of the data. When Snowflake processes a query and the warehouse cannot fit the entire dataset in memory, the warehouse spills to storage to complete the query (https://community.snowflake.com/s/article/Recognizing-Disk-Spilling). Spilling to storage degrades query performance, and to avoid it Snowflake recommends modifying the query predicates to increase partition pruning, which reduces the amount of data processed and eliminates or minimizes spillage. Another option is to increase the size of the warehouse used to process the query. In a highly virtualized model, every query essentially re-incurs the cost of the transformation processing each time a view is referenced. Where views are used sparingly this is negligible, but in a substantially nested scenario the cost grows as data traverses the nested views.
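Two hedged examples (the warehouse name is hypothetical; the ACCOUNT_USAGE view is a standard Snowflake view): spilling can be spotted in query history, and the warehouse can be resized if better pruning is not achievable:

    -- Find recent queries that spilled to remote storage.
    SELECT query_id,
           warehouse_name,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE bytes_spilled_to_remote_storage > 0
    ORDER BY start_time DESC
    LIMIT 20;

    -- If predicates cannot be improved to prune more partitions, a larger
    -- warehouse reduces spilling at a higher credit cost per second.
    ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'LARGE';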

In Conclusion

There are benefits and drawbacks to a highly virtualized design. While operational efficiencies are gained by using views to transform data, the compute cost can be higher, and Snowflake's Time Travel and cloning features protect physical tables rather than views.

It is necessary to balance the gains in saving time to transform data against the implicit cost to develop and maintain scripts or jobs to transform data between zones, the implied cost of query performance, and the explicit cost of larger compute resources to support virtualization. In a practical implementation, there will likely be a combination of physicalized data to allow Snowflake’s optimizer to perform partition pruning for performance and less expensive compute resources along with data that is virtualized to deliver operational efficiency.

Because data virtualization is an evolving topic, no "one size fits all" criteria exist to determine which data zones and structures should be physical and which virtual. There will also be workload-specific considerations that can only be evaluated during a well-defined development and testing cycle, so the final decision should be based on the results of use-case-specific testing.
