In part 3, let’s take a closer look at how to design a data sharing framework at scale that is customizable, flexible, and able to seamlessly automate the onboarding of new customers for your organization.

I initially wanted to write about the Super Provider use case, where an organization can share their data on Snowflake at scale with their partners and clients - but what if there was an even better solution?

How about a solution where an organization can share its data as a provider and/or a consumer with partners, clients, and other non-Snowflake customers? How about combining that with the ability to offer it across different clouds and regions, along with a system that can offer client subscriptions to the shared data and apply ANY row-level security, ANY column-level security, or ANY other level of data security?

Seems to me this is the ultimate solution.

Because this framework is a generic, metadata-driven design, it should work for every use case I mentioned in my previous articles.

Data is the most important part of any application system. A good, strong, accurate data structure allows engineers to design any process, UI, reporting, or statistical analysis one could ever need. The only thing more important than the quality of the data structure is the business requirements for the application. The most sophisticated applications in the world would have to be shelved if they didn’t meet business requirements. Therefore, we will start with a data model that combines the two most important components of any system: the business requirements and a quality data structure.

Step 1: Requirements

1.) Share any number of objects with a consumer, whether they are views or tables.
2.) Share data only to/from the primary/central account.
3.) Allow consumers of the shared data to subscribe based on the contract defined by the provider.
4.) Be able to share the data with any number of consumers.
5.) Be able to share the data with consumers in cloud regions other than the primary account’s region.
6.) Be able to share the data with consumers in other cloud providers, e.g. AWS, Azure, GCP.
7.) Be able to share the data with consumers who are not (yet) on Snowflake.
8.) Be able to deactivate certain shared objects for any consumer even while the contract is still active.
9.) Be able to share the data and apply any row-level security on top of any shared object for any given consumer.
10.) Be able to share the data and apply any column-level security on top of any shared object for any given consumer.
11.) Be able to automate the data sharing process and setup.
12.) Be able to automate the onboarding of new consumers.

Step 2: Reference Architecture

Sharing Data within Same Cloud/Same Region
Sharing Data Cross Region
Sharing Data Cross Cloud
Reader Account — Sharing Data with Non-Snowflake Customers

Step 3: Data Model

First, let’s take a look at a possible data model for the foundation of the data sharing framework that makes it all possible.

Data Model for Data Sharing Framework

Initial Setup/Control tables: CLIENT, DATA_SHARE_ACCOUNT, CLIENT_SUBSCRIPTION

Security Control tables:
CLIENT_SUBSCRIPTION_OBJECT, SUBSCRIPTION_OBJECT_CRITERIA, SUBSCRIPTION_OBJECT_COLUMN, CLIENT_DATA_SHARE_OBJECT

Pre-loaded reference tables: OBJECT, FIELD, OBJECT_FIELD
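To make the model a little more concrete, here is a minimal sketch of two of the security control tables. The column names and types are illustrative assumptions, not a definitive schema; your implementation may look different.

```sql
-- Illustrative sketch only: assumed columns for two of the security control tables.
CREATE TABLE IF NOT EXISTS CLIENT_SUBSCRIPTION_OBJECT (
    CLIENT_SUBSCRIPTION_ID NUMBER  NOT NULL,     -- FK to CLIENT_SUBSCRIPTION
    OBJECT_ID              NUMBER  NOT NULL,     -- FK to OBJECT
    ACTIVE_IND             BOOLEAN DEFAULT TRUE  -- supports requirement 8 (deactivation)
);

CREATE TABLE IF NOT EXISTS SUBSCRIPTION_OBJECT_CRITERIA (
    CLIENT_SUBSCRIPTION_ID NUMBER  NOT NULL,
    OBJECT_ID              NUMBER  NOT NULL,
    FIELD_ID               NUMBER  NOT NULL,     -- FK to FIELD
    CRITERIA_VALUE         VARCHAR               -- row-level filter value, e.g. a region code
);
```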

Step 4: Data Sharing Framework Setup

Client: This table holds the list of all the counter-party/consumer Snowflake and non-Snowflake accounts. It captures the client’s name, account information such as cloud and region, and other non-Snowflake details. This information would typically be maintained manually by the provider.
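A minimal sketch of what this table could look like (the columns are assumptions for illustration):

```sql
CREATE TABLE IF NOT EXISTS CLIENT (
    CLIENT_ID         NUMBER IDENTITY,
    CLIENT_NAME       VARCHAR NOT NULL,
    SNOWFLAKE_ACCOUNT VARCHAR,              -- NULL for non-Snowflake consumers (reader accounts)
    CLOUD_PROVIDER    VARCHAR,              -- e.g. AWS, AZURE, GCP
    CLOUD_REGION      VARCHAR,
    ACTIVE_IND        BOOLEAN DEFAULT TRUE
);
```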

Client Subscription: This table holds the list of all the client subscription contracts. Based on the subscription effective and expiration dates, data share access will automatically expire. This information would typically be maintained manually by the provider.
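Again as an assumed sketch, the subscription contract could be captured as:

```sql
CREATE TABLE IF NOT EXISTS CLIENT_SUBSCRIPTION (
    CLIENT_SUBSCRIPTION_ID NUMBER IDENTITY,
    CLIENT_ID              NUMBER NOT NULL,   -- FK to CLIENT
    EFFECTIVE_DATE         DATE   NOT NULL,
    EXPIRATION_DATE        DATE,              -- access is cut off after this date
    ACTIVE_IND             BOOLEAN DEFAULT TRUE
);
```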

Data Share Account: This table holds the list of all the provider’s organization accounts. Only one account will be the primary account from which the data is shared. There may be other accounts to which data is replicated in order to share it with consumers in other clouds/regions. This information would typically be maintained manually by the provider.
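A possible shape for this table (illustrative only):

```sql
CREATE TABLE IF NOT EXISTS DATA_SHARE_ACCOUNT (
    DATA_SHARE_ACCOUNT_ID NUMBER IDENTITY,
    ACCOUNT_LOCATOR       VARCHAR NOT NULL,
    CLOUD_PROVIDER        VARCHAR,
    CLOUD_REGION          VARCHAR,
    PRIMARY_IND           BOOLEAN DEFAULT FALSE  -- exactly one row should be TRUE
);
```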

Object: This table holds the list of all the shared objects: tables, views, etc. This information can be maintained manually or loaded from the information schema.
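For the automated option, here is a sketch of loading the OBJECT table from the information schema of an assumed SOURCE_DB database; the OBJECT column names are assumptions as well.

```sql
-- Hypothetical example: register all tables and views of one schema as shareable objects.
INSERT INTO OBJECT (OBJECT_DATABASE, OBJECT_SCHEMA, OBJECT_NAME, OBJECT_TYPE)
SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM   SOURCE_DB.INFORMATION_SCHEMA.TABLES
WHERE  TABLE_SCHEMA = 'PUBLIC';
```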

Step 5: Proposed Setup and Workflow

As for the workflow setup, we can use stored procedures written in JavaScript, SnowScript (SQL), or Python UDFs (Private Preview).
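As a sketch of the wrapper shape (the procedure name and body here are placeholders, not the actual framework code), a SQL stored procedure for one of the workflow steps could look like this:

```sql
CREATE OR REPLACE PROCEDURE LOAD_CLIENT_SUBSCRIPTION_OBJECT()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    -- Placeholder body: this is where the INSERT shown in workflow step 1 below would go.
    RETURN 'CLIENT_SUBSCRIPTION_OBJECT refreshed';
END;
$$;
```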

Proposed Workflow

1.) Load the CLIENT_SUBSCRIPTION_OBJECT table. Script to load the CLIENT_SUBSCRIPTION_OBJECT table based on the CLIENT_SUBSCRIPTION and OBJECT table setup.
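A naive sketch of such a load, assuming every active subscription gets access to every registered object (your contract logic will likely be more selective):

```sql
-- Illustrative only: grant every currently active subscription access to every object.
INSERT INTO CLIENT_SUBSCRIPTION_OBJECT (CLIENT_SUBSCRIPTION_ID, OBJECT_ID, ACTIVE_IND)
SELECT cs.CLIENT_SUBSCRIPTION_ID, o.OBJECT_ID, TRUE
FROM   CLIENT_SUBSCRIPTION cs
CROSS JOIN OBJECT o
WHERE  CURRENT_DATE() BETWEEN cs.EFFECTIVE_DATE AND cs.EXPIRATION_DATE;
```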

2.) Secure View Creation (One-Time or Table Stream Run)
Script to create all the secure views based on the subscription dates, client accounts, the list of objects each client subscribes to, and the column-level and row-level security defined for each object. It will also load the OBJECT table with the generated secure view information, as well as the CLIENT_DATA_SHARE_OBJECT table that will be used for GRANTs.
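To illustrate the kind of DDL that script would generate, here is an assumed secure view for one client and one object, projecting only the subscribed columns and applying the row-level criteria. The database, schema, table, and column names are placeholders.

```sql
-- Example of generated output (placeholders throughout), not the generator itself.
CREATE OR REPLACE SECURE VIEW SHARE_DB.CLIENT_001.ORDERS_V AS
SELECT ORDER_ID, ORDER_DATE, AMOUNT    -- columns allowed by SUBSCRIPTION_OBJECT_COLUMN
FROM   SOURCE_DB.PUBLIC.ORDERS
WHERE  REGION = 'EMEA';                -- filter derived from SUBSCRIPTION_OBJECT_CRITERIA
```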

3.) Data Share/Reader Creation. Script to execute Step 5 and Step 3 (used for the replication use case).

4.) Create Data Share and grant access to objects in the same Cloud/Region. Script to create a new DATA SHARE, GRANT all the necessary objects to it, and assign the ACCOUNTS for that DATA SHARE.
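A sketch of what the generated statements might look like for one consumer in the same cloud/region; the share, database, schema, view, and account names are placeholders.

```sql
CREATE SHARE IF NOT EXISTS CLIENT_001_SHARE;

GRANT USAGE  ON DATABASE SHARE_DB                     TO SHARE CLIENT_001_SHARE;
GRANT USAGE  ON SCHEMA   SHARE_DB.CLIENT_001          TO SHARE CLIENT_001_SHARE;
GRANT SELECT ON VIEW     SHARE_DB.CLIENT_001.ORDERS_V TO SHARE CLIENT_001_SHARE;

-- Needed because the shared secure views reference base tables in SOURCE_DB.
GRANT REFERENCE_USAGE ON DATABASE SOURCE_DB TO SHARE CLIENT_001_SHARE;

-- Attach the consumer's account locator (placeholder) to the share.
ALTER SHARE CLIENT_001_SHARE ADD ACCOUNTS = XY12345;
```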

5.) Create Replication process to replicate databases to the secondary Cloud/Region. This script enables the replication process for all the ACCOUNTS that are in a different REGION/CLOUD PROVIDER.
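As a sketch (organization, account, and database names are placeholders), enabling and consuming the replica might look like this:

```sql
-- On the primary account: allow the share database to replicate to a secondary account.
ALTER DATABASE SHARE_DB ENABLE REPLICATION TO ACCOUNTS MYORG.SECONDARY_ACCOUNT;

-- On the secondary account: create the local replica and refresh it.
CREATE DATABASE SHARE_DB AS REPLICA OF MYORG.PRIMARY_ACCOUNT.SHARE_DB;
ALTER DATABASE SHARE_DB REFRESH;
```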

6.) Create Reader Account. Script to create all the necessary READER accounts for each CLIENT that doesn’t have a Snowflake account.
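A minimal sketch of provisioning one reader account; the account name and admin credentials are placeholders.

```sql
CREATE MANAGED ACCOUNT CLIENT_001_READER
    ADMIN_NAME = client_001_admin,
    ADMIN_PASSWORD = '<replace-with-strong-password>',
    TYPE = READER;
```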

Step 6: Automate the Data Sharing Framework

You can create Streams on top of the CLIENT, CLIENT_SUBSCRIPTION, OBJECT, and DATA_SHARE_ACCOUNT tables, and Tasks to execute the scripts described above, to fully automate the end-to-end process.
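For example, a minimal sketch of one stream/task pair (the warehouse name, schedule, and called procedure are assumptions) could look like this:

```sql
CREATE OR REPLACE STREAM CLIENT_SUBSCRIPTION_STRM ON TABLE CLIENT_SUBSCRIPTION;

CREATE OR REPLACE TASK PROCESS_SUBSCRIPTIONS_TSK
    WAREHOUSE = SHARE_ADMIN_WH               -- placeholder warehouse
    SCHEDULE  = '60 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('CLIENT_SUBSCRIPTION_STRM')
AS
    CALL LOAD_CLIENT_SUBSCRIPTION_OBJECT();  -- procedure sketched in Step 5

ALTER TASK PROCESS_SUBSCRIPTIONS_TSK RESUME;
```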

Step 7: Scenarios and Test Use Cases

Scenario 1: Same cloud/region with column- and row-level security. This can also be used to run each of the combined scenarios separately by commenting out or skipping different steps.
Scenario 2: EXPIRED SUBSCRIPTION CONTRACT — Test an expired contract and verify that access to the shared data has been cut off.
Scenario 3: READER ACCOUNT — Check that the reader account has been appropriately provisioned.
Scenario 4: ACTIVE_IND = FALSE — Check that we can deactivate access to only a subset of the shared objects even if the contract hasn’t expired.
Scenario 5: REPLICATION — Check that replication is running appropriately in the secondary account and make sure that we can execute the scripts to create shares for the consumers in that cloud/region.
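As an example of how such a test might be driven from the control tables, here is a hypothetical check for Scenario 2 that lists subscriptions which have expired but still have active share objects:

```sql
-- Hypothetical verification query for Scenario 2 (expired contracts still marked active).
SELECT cs.CLIENT_SUBSCRIPTION_ID, cs.EXPIRATION_DATE
FROM   CLIENT_SUBSCRIPTION cs
JOIN   CLIENT_SUBSCRIPTION_OBJECT cso
  ON   cso.CLIENT_SUBSCRIPTION_ID = cs.CLIENT_SUBSCRIPTION_ID
WHERE  cs.EXPIRATION_DATE < CURRENT_DATE()
  AND  cso.ACTIVE_IND = TRUE;
```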

Day to Day Maintenance and Activities

All you need to do on a daily basis is monitor the streams and tasks and make sure that the information in the setup tables needed for onboarding consumers and secondary accounts is up to date.

Final Words

In today’s market, data is at the very center of the internet economy. About 66% of surveyed companies are willing to share internal data with third parties. “Data is the new oil,” and companies that want to advance their business need to find ways to collaborate and improve their customer experiences and business models.

For example, when insurers combine claims and policy data, they can use behavioral data and patterns to better understand and predict risk by incorporating all the available data into their analysis. This can be especially useful in cases where employee errors have had a huge impact on the decision-making process.

As digitization continues, quality data will have a huge impact on every decision made in an organization. With data sharing, decision-makers will have a more complete picture and more confidence in finding the best solution.

Ideally, every data sharing process should have controls that limit negative consequences while preserving the benefits of data sharing. In any case, data must always be shared responsibly. Existing data sharing technologies rely on robust vetting processes in order to build trust with potential data partners.

Data sharing with Snowflake is ultimately the path toward greater convenience, better products and services, and significantly cheaper operations for every industry and ecosystem out there. Making sure that companies are aware of these benefits of the Snowflake platform will be of great importance when they decide how to use or share their data globally. There is no longer any need to deal with physical paperwork and documentation, moving and copying files, or running complicated calculations just to prevent multiple parties from seeing each other’s data.

Companies will be able to effortlessly draw insights around travel and purchase patterns, consumer habits, hobbies, preferences, and behaviors; do all this with semi-structured and unstructured data; and join any other third-party data that’s available on the marketplace. The inability to combine data will become a thing of the past.

If you’re interested in implementing a customizable, flexible data sharing or data cleanroom framework for your organization, please schedule a call with the Snowflake Workload Services or Professional Services organizations and find out how you can transform and elevate your business and data strategy to another level.

Next, I will shed more light on the power of Data Cleanrooms and Data Monetization.
