Answers to 5 Common Snowflake Questions

Behind the Scenes at a Hashmap Snowflake 101 Enablement and Training Workshop

--

We’ve been delivering Snowflake 101 Enablement and Training hands-on workshops to a variety of clients for the last few years. The purpose of these sessions is to facilitate a conversation around all things Snowflake while walking through features and capabilities interactively, hands-on. In addition, we review best practices and recommendations that have been honed across 100+ Snowflake implementations.

Our hands-on labs cover a range of topics including:

  • Prepping Snowflake for data load
  • Loading structured data via different-sized warehouses
  • Running analytical queries
  • Utilizing result set caching
  • Cloning tables
  • Loading semi-structured data
  • Creating views on raw semi-structured data
  • Building queries joining structured and semi-structured data
  • Understanding and using time travel
  • Unpacking RBAC and using Snowflake Inspector

We also discuss optimizing for security, costs, and performance in Snowflake and review the overall ecosystem around Snowflake including data pipelining, analytics, and automation.

Additionally, we provide opportunities for open Q&A to help our clients build a better overall understanding of Snowflake features. With that in mind, we’ve provided answers to five questions that come up again and again in our sessions.

Does Snowflake provide a way to selectively hide the column data of a table?

Yes — Snowflake provides features to protect sensitive data at a column level (in a table or a view) from unauthorized access by means of dynamic data masking or external tokenization. With dynamic data masking, data that is available within Snowflake can be secured by the application of masking policies that obfuscate it at query time. When a query is executed by a user, Snowflake checks to see if any masking policy conditions are met to determine whether or not users see masked, partially masked, obfuscated, or tokenized data.
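
A minimal sketch of what this looks like in practice, assuming a hypothetical customers table with an email column and a hypothetical PII_ANALYST role:

```sql
-- Hypothetical policy: only a privileged role sees the full email address
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val      -- authorized roles see real data
    ELSE REGEXP_REPLACE(val, '.+@', '*****@')            -- everyone else sees a masked value
  END;

-- Attach the policy to the column; it is applied at query time
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```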

With external tokenization, data is tokenized outside of Snowflake and loaded as is; masking policies that call external functions can then de-tokenize it dynamically at query time for authorized users. This is more secure than dynamic masking alone since the data is stored at rest in tokenized form.
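
An external tokenization policy follows the same pattern, except the column stores tokens and the policy calls an external function to de-tokenize them for authorized roles. A sketch, assuming a hypothetical detokenize_ssn external function has already been wired up to a tokenization service:

```sql
-- Hypothetical: the ssn column stores tokens produced outside Snowflake
CREATE MASKING POLICY ssn_detokenize AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PAYROLL_ADMIN') THEN detokenize_ssn(val)  -- external function (hypothetical)
    ELSE val                                                           -- everyone else sees the stored token
  END;

ALTER TABLE employees MODIFY COLUMN ssn SET MASKING POLICY ssn_detokenize;
```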

Additionally, one can consider using a view, or series of views, instead of, or in conjunction with, masking to further filter columns and provide more meaningful column aliases.

More information on column-level security can be found in Snowflake’s documentation on column security.

How do Snowflake micro-partitions work?

“Micro-partition” is the term for the contiguous blocks into which Snowflake automatically divides table data, each holding roughly 50–500 MB of uncompressed data, created as DML operations (inserts, updates, deletes) are applied. Micro-partitions are always compressed at rest and store data in a columnar format for faster lookup. Metadata is collected for each micro-partition as it is created and is used for granular pruning of the data during query processing.

Additionally, micro-partitions are never modified in place. DML operations create new micro-partitions and/or mark existing micro-partitions for removal. For example, if an update statement changes 20 values in a micro-partition, a new micro-partition containing the unchanged data plus the 20 updated values is created and becomes the “current” micro-partition, while the original is marked for removal once its time travel and fail-safe periods expire.
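
To get a feel for how a table’s data is laid out across micro-partitions, and how well a given column supports pruning, you can inspect the clustering metadata (a sketch; the table and column names are hypothetical):

```sql
-- Inspect micro-partition/clustering metadata for a hypothetical orders table
SELECT SYSTEM$CLUSTERING_INFORMATION('sales.public.orders', '(order_date)');

-- A filter on a well-clustered column lets Snowflake prune micro-partitions
-- whose min/max metadata rules them out, scanning only a subset of the table
SELECT COUNT(*)
FROM sales.public.orders
WHERE order_date = '2021-06-01';
```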

More information on micro-partitioning can be found in Snowflake’s documentation on clustering and micro-partitions.

Can you explain cloning in Snowflake?

Cloning, or “zero-copy” cloning, is a Snowflake feature in which a logical copy of a database object is created from a specific snapshot of another database object without duplicating storage. Snowflake achieves this by keeping micro-partition metadata separate from the data itself: a single copy of the data exists in storage, while multiple database objects can each hold their own micro-partition references pointing at that copy.

Snowflake can create a clone at both the container level and the object level, be it a database, schema, or table, and cloning a container-level object automatically clones the objects it contains.
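
In SQL, a clone is a single statement at whichever level is needed (object names here are hypothetical):

```sql
-- Clone a single table
CREATE TABLE orders_dev CLONE orders;

-- Clone a schema or an entire database; contained objects are cloned automatically
CREATE SCHEMA analytics_dev CLONE analytics;
CREATE DATABASE sales_dev CLONE sales;
```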

An important item to note is that a cloned object is writable and completely independent of its source. DML applied to the source object and DML applied to the clone do not affect one another; changes to either copy create new micro-partitions for that object only. Please review the considerations for the handling of cloned objects created within a database, schema, or table.

More information on “zero-copy” cloning can be found in Snowflake’s documentation on cloning or this article.

What is Snowflake time travel?

At a basic level, Snowflake’s time travel feature enables historical data to be accessed after it has been modified or deleted, at any point within a defined retention period. Time travel was designed to help restore data-related objects (tables, schemas, and databases) that have been accidentally modified or dropped; it can also be used to clone data objects at or before specific points in time. Likewise, data can be queried at or before specific points in time, even if it has since been modified.

Additionally, time travel gives visibility into overall data usage and manipulation over time periods specified by the user. One important item to note is that time travel retention periods vary depending on the Snowflake edition. The default retention period is one day. However, users on the Enterprise Edition (or higher) of Snowflake can set the retention period to any value from 0 to 90 days for permanent databases, schemas, and tables.
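
A few representative statements, as a sketch (object names and offsets are illustrative, and the 90-day retention assumes Enterprise Edition or higher):

```sql
-- Query a table as it looked one hour ago
SELECT * FROM orders AT (OFFSET => -3600);

-- Query a table as it looked just before a specific statement ran
SELECT * FROM orders BEFORE (STATEMENT => '<query_id>');

-- Clone a table from a point in the past
CREATE TABLE orders_restored CLONE orders AT (OFFSET => -3600);

-- Recover a dropped table while it is still within its retention period
UNDROP TABLE orders;

-- Adjust the retention period (0-90 days requires Enterprise Edition or higher)
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;
```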

More information can be found in Snowflake’s documentation on time travel.

What is the best use case for Snowpipe?

In its simplest form, Snowpipe is a data ingestion service offered by Snowflake enabling files to be loaded into Snowflake via two mechanisms: automated and REST endpoints. The automated mechanism relies on cloud messaging where the cloud provider notifies Snowflake when a file becomes available in a Snowflake external stage (cloud storage). Snowpipe will then ingest the file into the target table. The REST endpoint mechanism is similar to the automated mechanism except a Snowflake REST endpoint is called from a client with a list of files to be ingested into the target table from a Snowflake external stage (cloud storage).
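
The automated mechanism is typically set up as a pipe with AUTO_INGEST enabled over an external stage. A sketch, assuming the stage, file format, and target table already exist and cloud event notifications have been configured (all names are hypothetical):

```sql
-- Hypothetical pipe: auto-ingest JSON files as they land in an external stage
CREATE PIPE raw.public.events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.public.events
  FROM @raw.public.events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```

For the REST mechanism, the pipe is created without AUTO_INGEST and a client calls the Snowpipe REST API with the list of staged files to ingest.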

Snowpipe was designed to be a more automated and cost-effective alternative to bulk loading with the COPY command on a designated warehouse. Therefore, Snowpipe is the better choice for small, fast-arriving files that need to be processed within a short delay of their arrival, or for ingesting small batches as a near-continuous stream.

Snowpipe is another tool in the data integration tool bag and differs from other data integration solutions like Fivetran, Matillion, Talend/Stitch, etc. Snowpipe imports data from Snowflake external stages such as AWS S3, Azure Blob Storage/ADLS Gen2/General-purpose v2, or Google Cloud Storage. The aforementioned data integration tools, by contrast, connect directly to a multitude of source systems to extract and load data into Snowflake (or other target systems).

More information can be found in Snowflake’s documentation on Snowpipe.

The actual data integration use case should be at the forefront of any decision when evaluating options for data pipelines. We do have a workshop dedicated to the topic of data integration.

Closing Thoughts

Hopefully, this information was helpful. The ultimate goal of this post was to provide quick answers to some common questions that we often get asked in our Snowflake 101 Enablement and Training sessions so that you can continue to expand your Snowflake knowledge base. All of the information in this post can be sourced directly from Snowflake’s documentation.

Ready to Accelerate Your Digital Transformation?

At Hashmap, an NTT DATA company, we work with our clients to build better, together. We are partnering across industries to solve the toughest data challenges, whether it be cloud and data migrations, new data apps, data engineering, data architecture, data pipelining, automation, CI/CD, etc. — we can help you shorten time-to-value!

We offer a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our data and cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.
