Sharing and Monetising Data from your Data Lake with Snowflake

Data sharing is becoming increasingly prevalent as businesses realise the competitive edge it gives them over their peers. In fact, a Gartner survey of Chief Data Officers found that organisations that increased data sharing had data and analytics teams that were 1.7 times more effective at showing demonstrable, verifiable value to their stakeholders.

These stakeholders could be their own business units, other business units within the same organisation, or even business partners that mutually benefit from sharing data. For the latter, a reliable, trustworthy and secure sharing mechanism could translate to revenue as partners craving your data are willing to pay for the value it will provide them in turn.

Snowflake has released a series of features to assist organisations in realising business value through collaboration. Specifically, the Snowflake for Collaboration workload allows:

  • Direct access to data, without delay. Consumers of data shares can access ready-to-query data without the cost of setting up ETL processes and data pipelines, meaning value can be realised in minutes, not months.
  • Secure, governed and revocable sharing. Providers of data can monitor their data offerings and usage and control discoverability, whilst leveraging built-in Snowflake security features such as Row Access Policies and Conditional Masking (see the sketch after this list).
  • Monetisation. Snowflake streamlines the data-selling experience by providing a marketplace, flexible pricing options, and “try before you buy” experiences, reducing the time-to-market for data assets. As an example, FactSet has driven an 88% YoY increase in the total number of clients using its financial services data in the Snowflake Data Cloud.
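
To make the governance point above concrete, here is a minimal, hypothetical sketch of a Row Access Policy; the table and column names are placeholders, and the demo later in this post uses a secure view with CURRENT_ACCOUNT rather than a policy.

-- Hypothetical sketch: only return rows tagged with the consuming account's locator
create or replace row access policy account_rap
  as (account_identifier string) returns boolean ->
  account_identifier = current_account();

-- Attach the policy to a (hypothetical) transactions table before sharing it
alter table transactions add row access policy account_rap on (account_identifier);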

The success of Snowflake in providing a reliable, trustworthy and secure sharing mechanism through its collaboration features was evident in its most recent quarterly results: stable edges, defined as continuous data-sharing connections between two or more Snowflake accounts, grew 112% year over year.

Visualisation of Collaboration via the Snowflake Data Cloud. An edge between nodes represents a “Stable Edge” defined as at least 20 transactions in which compute resources are consumed and such consumption results in recognised product revenue over two successive three-week periods (with at least 20 transactions in each period).

Monetising your existing data lake

With the recent announcement of Iceberg support in Snowflake (now in Private Preview), the time-to-value for monetising assets that must live in cloud storage has been significantly reduced.

Some organisations are constrained to keeping data in external storage; for example, different teams may need to access the raw files from different engines (Spark, Trino, Flink, etc.). For these organisations, Snowflake can act as a robust SaaS consumption layer that serves (and monetises) data from your cloud storage to customers.

The following demo shows how we can use Snowflake for Collaboration and the new Iceberg support to share data directly from the cloud storage layer, with row-level access, to a reader account (for a non-Snowflake customer).

In this scenario, the data provider has a dataset in S3 that is curated by another Spark-based system. For this demo, I have used the AMEX Kaggle dataset and EMR as the Spark-based compute; Databricks could also be used.

We have identified revenue potential in sharing a dataset from our data lake with consumers. The dataset contains transactions from multiple customers, and we would like to apply governance features such as row-level access so that each customer only sees the rows they have been granted access to. This can easily be achieved through Snowflake's collaboration infrastructure.

Step one is to mimic a Spark-based system writing to an object store. For this example I used EMR to write to S3; as mentioned above, other tools like Databricks could be used for this step.

First, upload the Kaggle dataset to S3. Remember this URL, as you will need it later.

Next, configure EMR. Please note, the EMR role needs write access to your bucket. You will need to add a policy to the role found in the Security section of the Summary tab of your EMR cluster.

In an EMR notebook, I run the following code. The first cell sets up the Spark configuration for the cluster; for this example I am using a catalog named "demo".

%%configure -f
{
  "conf": {
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.demo.warehouse": "s3://<location of table>/",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }
}

# Create the database in the catalog
spark.sql("CREATE DATABASE IF NOT EXISTS demo.amex_demo_db")

# Read the Kaggle dataset uploaded to the S3 bucket earlier
df = spark.read.parquet("s3://<location of kaggle dataset>/")

# Now the dataset is in a DataFrame, write it out to an Iceberg table
df.writeTo("demo.amex_demo_db.transactions").create()

Once this is completed, we can switch to Snowflake. The next step is to create a storage integration and an external stage pointing to the S3 location where EMR has written the Iceberg files.

-- Create storage integration
create or replace storage integration s3_amex_sandbox_iceberg_ext
  type = external_stage
  storage_provider = 'S3'
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::<aws account id>:role/<role name>'
  storage_allowed_locations = ('<s3 location>');

-- Create external stage
create or replace stage s3_amex_stage_iceberg_ext
  storage_integration = s3_amex_sandbox_iceberg_ext
  url = '<s3 location>'
  file_format = (type = parquet);
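
One step worth calling out (this is the standard storage integration setup rather than anything specific to this demo): before the stage can read from the bucket, the AWS role's trust policy typically needs to reference the IAM user and external ID that Snowflake generates for the integration. You can retrieve these, and then sanity-check access, as follows:

-- Retrieve STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID for the AWS role's trust policy
desc integration s3_amex_sandbox_iceberg_ext;

-- Once the trust policy is updated, confirm the stage can list the Iceberg files
list @s3_amex_stage_iceberg_ext;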

Next, I create an external table using the following syntax. Note the use of the ‘table_format’ parameter.

create or replace external table amex_transaction_data_iceberg_ext
  table_format = 'iceberg'
  location = @<stage from above>/amex_demo_db.db/transactions/
  file_format = (type = parquet)
  snapshot_location = 'metadata/<id of file>.metadata.json';

Once the above is executed, we are able to use the power of the Snowflake platform to read directly from the data lake and share it with our customers.
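
As a quick sanity check (row counts will of course differ), we can query the Iceberg data in place before sharing anything:

-- Query the external table directly; nothing is copied into Snowflake
select count(*) as total_rows,
       count(distinct CUSTOMER_ID) as distinct_customers
from amex_transaction_data_iceberg_ext;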

Since my customer is not on Snowflake, I can set up a reader account for them. This could be done through the Snowsight UI, or by running the following SQL.

use role accountadmin;

create managed account reader_acct1
admin_name = user1 , admin_password = 'Sdfed43da!44' ,
type = reader;
+--------------------------------------------------------------------------------+
| status                                                                          |
|--------------------------------------------------------------------------------|
| {"accountName":"RE47190","loginUrl":"https://re47190.snowflakecomputing.com"}   |
+--------------------------------------------------------------------------------+

Now let's enrich the data with a mapping table I have in Snowflake to match transactions to their accounts.

-- Create a mapping table for the share. This reference table will let us apply
-- row-level access for our final sharing partners.
create or replace table customer_list (
  customer string,
  account_identifier string);

insert into customer_list values
  ('0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a', '<snowflake account locator 1 e.g. RE47190>'),
  ('00000fd6641609c6ece5454664794f0340ad84dddce9a267a310b5ae68e9d8e5', '<snowflake account locator 2>');

Next, let's create a view linking the mapping table to the data in our data lake. Since it is a view, no data leaves the S3 bucket.

-- Create a secure view for sharing to other accounts that leverages the CURRENT_ACCOUNT metadata.
-- In this case, we just want to share a subset of columns, as well as the account key.
-- No underlying data is changed, as this is just a view.
create or replace secure view customer_transaction_shared_iceberg_ext as (
  select a.CUSTOMER_ID, c.account_identifier, a.B_3, a.D_66, a.S_12, a.D_120
  from amex_transaction_data_iceberg_ext a
  left join customer_list c on a.CUSTOMER_ID = c.customer
  where c.account_identifier = current_account()
);

Now that we have created the view, let's simulate what it looks like to the consumer.

-- Now let's test what our data consumer would see
alter session set simulated_data_sharing_consumer = '<consumer account e.g. RE47190>';
select * from customer_transaction_shared_iceberg_ext limit 100;

Great! We can see the data limited to the 13 rows shared with that account. Now we finish by sharing the data to the consumer (reader account).

-- Create Share
create or replace share amex_demo;
grant usage on database test to share amex_demo;
grant usage on schema test.public to share amex_demo;
grant select on view test.public.customer_transaction_shared_iceberg_ext to share amex_demo;
alter share amex_demo add accounts = RE47190;

I can now log in to my reader account, accept the share, and query the table!
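
For completeness, this is roughly what that looks like on the reader side (the database name below is arbitrary, and <provider account locator> is the account that created the share):

-- In the reader account, mount the share as a database and query the secure view
use role accountadmin;
create database amex_shared from share <provider account locator>.amex_demo;
select * from amex_shared.public.customer_transaction_shared_iceberg_ext limit 100;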

From this demo, we can see how easy it is to share data from a data lake while keeping important governance features, such as row-level access, over the data. In this example we used Iceberg, but the same approach would also work over Delta tables and plain Parquet files.
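
For reference, a Delta-based version of the external table would look something like the sketch below. Delta support for external tables was also a preview feature at the time of writing, and the Delta table location here is hypothetical, so treat this as a sketch rather than definitive syntax.

-- Sketch: point an external table at a Delta table written to the same stage
create or replace external table amex_transaction_data_delta_ext
  location = @<stage from above>/amex_demo_db.db/transactions_delta/
  file_format = (type = parquet)
  table_format = delta
  auto_refresh = false;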
