Snowflake
Published in

Snowflake

Data Vault on Snowflake: Virtual Warehouses & Charge Back

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. Cloud offers limitless scalability and by understanding how we can leverage Snowflake’s unique cloud-based architecture we can increase platform efficiency to make your Snowflake credits go further with data vault.

Episode catalogue,

1. Immutable Store, Virtual End-Dates

2. Snowsight dashboards for Data Vault

3. Point-in-Time constructs & Join Trees

4. Querying really BIG satellite tables

5. Streams & Tasks on Views

6. Conditional Multi-Table INSERT, and where to use it

7. Row Access Policies + Multi-Tenancy

8. Hub locking on Snowflake

10. Out-of-sequence data

9. Virtual Warehouses & Charge Back

11. Handling Semi-Structured Data

A reminder of the data vault table types,

Optionally we can use the query assistance tables to optimise performance and simplify access over a data vault,

Episode 9: Virtual Warehouses & Charge Back

Snowflake’s architecture is based on the Cloud Service Provider’s (CSP) architecture wrapped in automation to simplify your Data Cloud experience. Using the CSP’s global network your experience on Snowflake is the same on any of the big three’s infrastructure and you can share data between Snowflake accounts globally across CSPs or within a CSP.

For our workload in this episode, we will focus on an individual Snowflake account and how to leverage that for a Data Vault.

Simplistically this may be all you need

Let’s describe each layer

  • Cloud services — your access to Snowflake technology. Snowflake is a Software-as-a-Service (SaaS) with security guarantees for your data loaded into Cloud. In this layer Snowflake optimises your cloud experience around your data with the elastic scalability of cloud to offer a dynamic, pay-for-what-you-use service.
  • Compute services — Snowflake wrapped virtual machines labelled as virtual warehouses that are assigned by you to ingest and query data in the CSP of your choice. Essentially computations/work on your data is what Snowflake will charge you for, these are separated from storage which allows for massively parallel processing to occur concurrently and without workload contention.
  • Cloud storage — supported external formats and/or internal storage in Snowflake proprietary table types that are columnar, and row optimised micro-partitions. Snowflake’s tables can store a mix of structured and semi-structured data and you can store unstructured data in Snowflake as well.

Now let’s explore what makes Snowflake and Data Vault similar

Separation of concerns

Instead of tightly coupling architecture objects, Snowflake’s architecture offers a separation of concerns (SoC). Storage is separated from compute and metadata are mere pointers to these objects and their interactions. A single Snowflake object can do nothing, but in combination with other Snowflake objects can do work! A virtual warehouse cannot be accessed without a role, a role is not authorised to use a virtual warehouse unless granted the privilege to do so. And a role cannot use data objects unless it has privileges to do so too.

The same SoC principle is applicable to data vault and its table types, your business is focussed on:

  • Business objects — the centre points of your organisation, these drive your organisation capabilities because in essence the organisation maintains capabilities to serve these business objects with services and/or products. A business object without context is just a business object. We store the immutable business key in hub tables.
  • Unit of work (UoW) –exists between business objects, the transactions (i.e., relationships) between business objects that in essence are the business events that you will record for past or present. UoW describes the relational context and interactions between business objects that in a data vault we store as link tables.
  • State data — what are the active states of the above business objects and UoW? State data describes the nature of the above at an applicable point in time, these facts can be superseded by change but in the data vault we keep that historical context as an audit trail providing the business object and UoW provenance.

SoC is built into Data Vault just as it is built into Snowflake; both offer limitless scalability and elasticity meaning that new objects can be added or updated without impacting other objects.

Now over to understand Snowflake caching and how it can be used to optimise your data vault experience on Snowflake….

Caching

Let’s start with the first query…

Q1
  • First Query — users or services need a Snowflake role to use or manipulate data in Snowflake. A first query will spin up an allocated virtual warehouse that is configured to do so automatically. That first query must traverse from cloud services through compute services to get to the stored data. Along the way data from that query is parked in the virtual warehouse cache as well as the query result is stored in result cache for 24 hours on a continual clock for up to a month.

A simple query over an information mart stored as a view.

Every query run in Snowflake is associated with a unique auto-generated query-id at runtime and represented as a GUID. After running the above query, we can store that query-id in a custom parameter and use that as a base for our queries that need that query result.

Q2 & Q3
  • Second Query — with a role that has access to the same data and the same query result; it does not need to traverse compute services or reach into cloud storage to fetch the data it needs. The second query will access the previously fetched query result in result cache provided that no context functions were used in the result and no underlying data has changed. This optimises your queries because data does not need to traverse up the network! Subsequent queries that use a portion of the already fetched data will need to pull additional data from cloud storage and augment it with the cached content on a virtual warehouse. This latter scenario is why virtual warehouses should not be immediately suspended after use because this action will flush the cache from the virtual warehouse, and you could potentially be missing out on this performance optimisation.

Running the exact same query will produce the exact same result from result cache, but we can also use the query-id we stored earlier to fetch that same query result!

100% from Result Cache!

Q4
  • Querying with another warehouse — A role with different access and associated virtual warehouse will not share virtual warehouse cache which is a consideration for your design whether your enterprise will benefit by combining virtual warehouses or not.

Snowflake charges for the first 60 seconds when a virtual warehouse is spun up and then per-second credits are charged until the virtual warehouse is suspended. Consider this when utilising a virtual warehouse that although you may spin one up and use it for 30 seconds you will still be charged for 60 seconds worth of credits. Hence why you need to consider combining your roles to a warehouse or shorten the time to suspend the warehouse instead of using the default 10-minute suspension time.

Snowflake does support serverless execution of code too, which is charged at a per-second rate from the start of the execution. Serverless compute are those operations that you configure Snowflake to execute on your behalf, like maintaining materialised views, auto-clustering tables and using serverless tasks, to name a few…

Q5
  • Autoscaling — virtual warehouses can also be configured to auto-scale horizontally should the workload on a virtual warehouse begin to queue.

You will be charged for an XSMALL virtual warehouse and for every instance that is spun up for your workloads. Once the queue is cleared Snowflake will suspend the new virtual warehouses automatically and as needed. The question to you would be around the size of virtual warehouse needed when you start, the recommended advice is, always start with an extra small before deciding to scale up or scale out to by upsizing the t-shirt size or configuring autoscaling on your virtual warehouses respectively.

You can set auto-scaling limits by setting lower and upper bounds for Snowflake to spin up new virtual warehouses as needed, and you can set scaling policies based on how you perceive queued queries should be handled. This is an important concept, because each virtual warehouse spun up is its own unit, there is no contention between queries being run on your data on different virtual warehouses. If all your workloads are concentrated into one virtual warehouse, then there could be contention and query queuing, and this is how Snowflake achieves massively parallel processing by being as elastic as the cloud has been designed to be.

Loading
  • Loading data — compute is needed to load data into Snowflake which should be configured on its own virtual warehouse. Because it does not necessarily need as much caching as querying would, these virtual warehouses should be configured to auto-suspend sooner, typically 1–5 minutes is common. The serverless alternative is Snowpipe, however for this episode we will focus on virtual warehouse only!

Caching should be considered when designing your data vault querying and loading needs; and that is to consider,

  • How can you benefit from result cache?
  • At what frequency does the underlying data get updated?
  • What reporting periods do you need to cater for?

Because we default data vault information marts as views any change to the data vault tables supporting that view will be immediately visible in the information mart. Your strategy here could be to persist the information mart as a snapshot of the data vault tables instead, which of course can be optimised by using streams and tasks, or even deploy a stream on the view itself!

Now that we have seen how we can use Snowflake caching to optimise your querying experience, let’s explore some strategies for enabling chargeback on your Snowflake account.

Chargeback

A role is needed to do any work in Snowflake, the role itself will need access to data and need access to a virtual warehouse to perform any work on that data. Metrics for virtual warehouse utilisation is metered and accessible via

The former has a latency of up to 3 hours but is retained for a year. The latter is immediately available and is retrieved from your database’s information_schema with a lower retention period.

As a single Snowflake account owner, you can charge your platform tenants (individual business units) accurately for those using your platform. This combination of virtual warehouse and roles means you can also track the cost to ingest data into your raw and business vault. A single virtual warehouse may be shared by multiple roles too to consolidate credit spend if you find that a virtual warehouse is under-utilised. A single role may be entitled to access and use multiple virtual warehouses as well with differing configurations to allow it use and/or modify associated virtual warehouses. For a given business unit you can isolate usage to data labs and other exploratory workloads and cap virtual warehouse usage with a resource monitor.

Zones and Layers

The capping of virtual warehouse can be set to the frequency of your choosing, for example by setting a credit spend limit per week, or per month and configuring the resource monitor to send alerts and notifications when a business unit is nearing a capacity limit. You can also configure the resource monitor to terminate business unit workloads until the next spending period if you decide! Setting up resource monitors initially is recommended for the novice Snowflake users or those that are especially cautious about their credit spend.

As we have seen in a previous episode, if you have set up a Snowflake organisation and decided to allocate independent Snowflake accounts to business units and/or partners then the credit spend can be universally tracked under the organisation’s

To further analyse your credit spend Snowsight provides some out-the-box charts and graphs to visually depict warehouse usage, like in the below animation,

Billing & Virtual Warehouse graphics!

What’s more is that you can build custom Snowsight dashboards to get to the analysis you need, along with Snowflake’s partner BI tools that come with their own guidance on building these dashboards,

Again, keep in mind that compute charges are also accumulated via serverless processing on your Snowflake account, the above only discusses allocating chargeback by linking virtual warehouses and roles!

Fin!

This concludes the ten-part series on Data Vault techniques for Snowflake! I hope these articles have helped you visualise the possibilities for optimising your data vault on Snowflake experience and that you have found some of these techniques useful. Ultimately what we look for in Data Vault and with any technology or practice is the ability to execute repeatable patterns. If it is repeatable then it is scalable. As your scalability translates into savings on Snowflake you free yourself to spend that budget on new and exciting initiatives to delight your stakeholders and your customers!

Until next time!

References:

The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store