Privacy-safe analytics with Snowflake and Skyflow

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

7 min readMay 3, 2023

Customer data is undoubtedly one of the most valuable assets for today’s businesses. Organizations are constantly collecting data to better understand their customers so that they can improve their offerings and their operations. However, increasing concerns regarding data privacy and protection, along with security and compliance requirements, have led businesses to prioritize the careful handling of PII and other sensitive data.

But what does this mean for analytics? Companies need to leverage customer information for their analytics, but how can they do this without risking PII exposure to the wrong people either internally or externally?

This is where privacy-safe analytics comes into play.

In this post, we’ll explore how Snowflake can be used alongside a data privacy vault to facilitate privacy-safe analytics. We’ll look at the technical considerations for working with tokenized PII data in Snowflake, and how to detokenize that data through Snowflake external functions and Skyflow’s data privacy vault API. By adopting this tech stack and approach, organizations can leverage the power of data analytics to drive business value — while still protecting sensitive customer data.

The problem: PII sprawl

While Snowflake has many great security and privacy features built-in, PII has a habit of finding its way into many different locations. This not only includes your analytics store, but your application database, backups, and log files. The more places that PII can end up, the bigger the challenge is to protect it.

With PII sprawl, you increase the potential attack surfaces and increase your compliance scope. As you can see in the image below, I need to not only be concerned about the PII in my warehouse, but in many other locations.

Storing sensitive data in a data privacy vault for better protection

One approach to removing PII sprawl is to store the PII within a zero-trust data privacy vault. The vault isolates, protects, and governs your customers’ most sensitive data.

Data privacy vault architecture to eliminate PII sprawl.

Rather than storing customer PII in your warehouse or any other location, it’s only stored in the vault, becoming a single source of truth.

Within each data store outside of the vault, the customer PII is replaced by tokenized data, acting like a pointer.

Tokenization replaces sensitive data with unique, opaque values. For example, you could tokenize a customer’s social security number by replacing it with a UUID like a9c2a22a-7bb7–4bca-bc80–81d01fa724da.

Tokenizing sensitive data makes it unreadable to unauthorized users. Even if someone were to gain access to the data, they would not be able to identify the individuals it belongs to.

Detokenizing data for reporting

In a previous post, I covered the process of tokenization with Skyflow. A web application would send sensitive data to a Skyflow data privacy vault so that it never touched the backend architecture. Instead, the backend would get a dataset of tokenized PII data, and I use Snowpipe to ingest that tokenized data into Snowflake.

In many cases, tokenized data is all you need for customer analytics. The tokens can be deterministic, meaning you can still execute analytical operations by aggregating values. For example, if we were grouping customer orders by zip code, all of the values for the same zip code would result in the same tokenized value.

However, there may be occasions when detokenized values are necessary — for example, if you needed to produce a report that showed aggregate statistics with actual zip code values. Without the correct tools, this can be a hard problem to solve. The solution lies in implementing a secure, just-in-time mechanism for detokenizing only the data we need for analytics.

An example: customer orders by zip code

Let’s consider a common example of an important customer analytics question. We’d like to count the number of customer orders by zip code so that we can find the top ten zip codes where orders originate from.

The actual customer zip codes are stored off-site, in a data privacy vault. The customer data was ingested by Snowflake only after Skyflow Data Privacy Vault tokenizes it. How will we answer our “top ten zip codes” question now?

The following diagram shows the Snowflake customers table, along with its mapping to the original, sensitive data stored separately in a Skyflow data privacy vault.

It’s clear that answering the “top ten zip codes” analytics question in a dashboard for a business stakeholder would only be helpful if we display the detokenized values — the actual zip codes.

This is where Snowflake external functions come into play. A Snowflake external function can be used in combination with Skyflow’s APIs to securely detokenize a limited set of data in whatever format we need — plaintext or partially redacted.

Snowflake external functions to the rescue

Snowflake external functions let you extend Snowflake with the ability to call APIs, execute external code, and invoke other processes that are not typically supported by Snowflake. This is a powerful feature that greatly enhances the functionality of Snowflake.

External functions can be written in almost any language, and they can be called from within Snowflake SQL statements! This makes them a convenient way to access data and functionality that is not ordinarily available in Snowflake.

External functions can be used for a variety of tasks, including:

Calling external APIs to retrieve data
Executing code to perform calculations or transformations
Generating reports or dashboards
Automating tasks

This is an example of calling a Snowflake external function from a SQL query to detokenize the token representation of a zip code.

select production.data_warehouse.detokenize(zip_code) from customer_orders;

Snowflake external function + AWS Lambda + Skyflow

We’ll solve our analytics problem by using a Snowflake external function to interact with Skyflow Data Privacy Vault. However, we need to build a data pipeline to facilitate this communication and detokenization. In our example case, we’ll use a Snowflake external function to trigger an AWS Lambda function that will detokenize data on the fly.

At a high level, the logic flow for detokenization looks like this:

Data application calling detokenize to reveal protected data in Snowflake

The logical steps of the data flow are as follows:

Snowflake SQL calls the Snowflake external function
The Snowflake external function sends a request to AWS Lambda with the tokenized value
AWS Lambda authenticates against the vault based on the Snowflake user to AWS IAM mapping and securely sends a detokenization request to Skyflow’s API
Skyflow returns detokenized data
AWS Lambda returns detokenized data
Snowflake External Function injects detokenized data into SQL result

Snowflake external function code

The Snowflake SQL statement invokes an external function called detokenize.

Here’s how to implement that function and the corresponding Snowflake API integration:

CREATE OR REPLACE API INTEGRATION external_api_integration_lambda
    api_provider=aws_api_gateway
    api_aws_role_arn='arn:aws:iam::123456789012:role/my_cloud_account_role'
    api_allowed_prefixes=('https://xyz.execute-api.us-west-2.amazonaws.com/production/')
    enabled=true;

CREATE OR REPLACE EXTERNAL FUNCTION detokenize(field string)
  RETURNS variant
  api_integration = external_api_integration_lambda
  AS 'https://xyz.execute-api.us-west-2.amazonaws.com/production/';

AWS Lambda code

Of course, the next step would be to write the AWS Lambda that would reach out to Skyflow’s API to do the detokenization. The example below uses Python, but this could be any language supported by AWS Lambda.

import json
import boto3
import base64
import requests

session = boto3.session.Session()
client = session.client(
    service_name='secretsmanager',
    region_name='us-east-1'
)

def lambda_handler(event, context):
    tokenized_value = event['data'][0][1]
    
    skyflow_service_account = client.get_secret_value(SecretId='SECRET_ID')
    skyflow_vault_id = client.get_secret_value(SecretId='VAULT_ID')
    bearer_token = generate_bearer_token_from_creds(
            credentials=skyflow_service_account)
    
    url = f"https://prod.api.skyflowapis.com/v1/vaults/{skyflow_vault_id}/detokenize"
    
    payload = {
        "detokenizationParameters": [{
            "token": tokenized_value
        }]
    }
    
    headers = {
        "Authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json"
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    response_json = response.json()
    
    detokenized_value = response_json['records'][0]['value']
    return detokenized_value

It’s easy to see how this strategy — which only involves a data privacy vault and a few function calls — can be used to address two often-challenging problems.

Keep PII and sensitive data protected by not storing it in Snowflake or any other location within our application infrastructure
Use Snowflake external functions and AWS Lambda in conjunction with Skyflow’s API to retrieve detokenized data for analytics when needed

Bonus benefits: security and compliance

This approach of combining Snowflake with a data privacy vault also reduces an organization’s security and compliance footprint in the following ways:

Isolating sensitive data in a data privacy vault reduces the risk of data breaches and compliance violations
Encrypting sensitive data in a data privacy vault further protects it from unauthorized access
Tokenizing sensitive data ensures that sensitive data will not be leaked even in the case of a breach
Enforcement of access control is in place. Access to the vault is controlled using zero-trust architecture and policy-based access control, all of which help to reduce the risk of unauthorized access

Using Snowflake external functions in conjunction with a data privacy vault is an attractive strategy for modern data stack analytics. You keep your security and compliance risk low, while maintaining access to important customer data and analytics.

Conclusion

The challenge of storing tokenized PII in Snowflake is that the tokenized data can take away important context, thereby reducing the usefulness of customer analytics. Skyflow Data Privacy Vault solves this customer analytics PII problem by isolating and securing the tokenized data while also providing a way to request on-the-fly detokenization.

The vault architecture also helps simplify challenges like data residency while maintaining a single global analytics store. I will cover this in detail in a future post.