Keeping sensitive customer data out of Snowflake with Skyflow and Snowpipe

Businesses need customer data to perform analytics and business intelligence. But how can companies analyze and report on detailed customer data while protecting that data, limiting their security scope, and maintaining compliance?

In this post, we’ll walk through how to solve this problem by creating a clean data pipeline of de-identified data using a combination of Skyflow Data Privacy Vault, Snowpipe, and Snowflake’s Data Cloud. We’ll highlight how the pieces work together, focusing on the technical implementation of integrating Snowpipe with Skyflow Data Privacy Vault. For our example, we’ll track how customer and order data progresses through our pipeline.

Let’s start by reviewing the problem we need to solve.

The problem: customer analytics with sensitive PII

Companies that need detailed customer information for their analytics face a common obstacle: how to handle PII (Personally Identifiable Information).

Snowflake has become the de facto tool used by modern data teams for data apps, data warehousing, and analytics. While Snowflake provides many built-in data protection features to help shield PII from unauthorized access, data often originates from multiple sources elsewhere in your stack and passes through a variety of locations in your infrastructure before landing in Snowflake.

Throughout the data lifecycle, the data passes through different downstream services. This is problematic when handling sensitive data because each service may create log files or have its own storage mechanism, so user PII ends up replicated across many locations.

When this happens, your secure storage concerns aren’t restricted to just Snowflake, but extend to every location where the data might end up. This drastically increases your compliance and security scope, enlarging the footprint for a potential data breach.

How can organizations take full advantage of customer analytics in Snowflake safely and securely, without dramatically increasing their data sprawl and compliance scope?

This is where a data privacy vault that supports polymorphic encryption, deterministic tokenization, and policy-based data governance steps in to fill the technical gap, providing a secure solution for storing sensitive data while also supporting downstream customer analytics.

De-identifying sensitive data with a data privacy vault

A data privacy vault isolates sensitive data, secures it, and tightly controls access to it. A well-designed data privacy vault combines the principles of isolation and fine-grained access control with privacy-enhancing technologies like polymorphic encryption and tokenization, all exposed through a simple, common interface such as an API.

Polymorphic encryption encrypts data in multiple forms, with multiple keys, and with specific operations supported for each encrypted representation. This makes it possible to support query operations over encrypted data.

Tokenization de-identifies sensitive data, replacing it with a unique, opaque token. Tokenization is an effective strategy for ensuring security and compliance when handling PII and other sensitive data. By using tokens, you reduce the chance of inadvertent data compromise.

When you combine polymorphic encryption with tokenization, you’re able to produce meaningful tokens that preserve data utility for analytics while maintaining user anonymity. And when combined with fine-grained access control, you have control over who and what can see the original detokenized values.

With a data privacy vault, sensitive data lives only within the vault; tokens act as references (or pointers) to the original values, and no sensitive data is stored in the data warehouse or any downstream service. When no sensitive data is stored in your data warehouse and analytics store, the potential fallout from internal data misuse or a data breach is dramatically reduced. Additionally, your compliance scope is restricted to the vault, effectively descoping the rest of your systems from that responsibility.

Example of sensitive data flowing through a data privacy vault into Snowflake

What does the process of using a data privacy vault with Snowflake look like in practice?

Instead of storing PII directly in Snowflake, the data is tokenized prior to ingestion and stored securely in a data privacy vault. Only the corresponding tokens are stored in the data warehouse.

No sensitive data makes its way to Snowflake whatsoever.

Using a combination of deterministically generated and format preserving tokens, most analytical operations can be performed directly against the tokenized data. For special forms of customer analytics, the tokens can be detokenized on the fly via API calls by users with proper permissions.

Snowpipe for automated ingestion

Snowpipe is Snowflake’s solution for automated loading of data from a stage. It lets you load data into tables via micro-batches that are triggered whenever new data appears in a staging area. This makes Snowpipe an ideal component in the automated, end-to-end process of ingesting tokenized data into the data warehouse for downstream analytics.

Snowpipe uses COPY statements that are familiar to most Snowflake users. Loading data through Snowpipe is fast and convenient, reducing complexity.

A Snowflake “stage” can point to a cloud storage location like an AWS S3 bucket or Google Cloud Storage.

AWS S3 to Snowflake via Snowpipe

Now that we’ve given a brief overview of data privacy vaults and Snowpipe, let’s step through the technical implementation of a secure data pipeline that can process highly sensitive data.

An example implementation: Skyflow + Snowpipe

Our example will deal with order purchase information gathered from a business’s website. We’ll step through how this data is tokenized and eventually loaded via Snowpipe into a Snowflake data warehouse for downstream analytics.

De-identifying sensitive data through Skyflow into Snowflake

In the above diagram, the Skyflow Data Privacy Vault sits at the entry point for sensitive customer data that is generated from a website. Immediately, the sensitive data is stored securely in the data privacy vault. That data is encrypted and de-identified through tokenization, and the tokenized data is stored as a file in S3. Snowpipe detects the presence of a new file in S3 and then loads the (now de-identified) data into Snowflake, where it can be used for analytics.

The website is just one example of an original data source; the data could come from other locations and be stored securely within the vault before moving into Snowflake.

With this high-level flow mapped out, let’s look at what the implementation code might actually look like.

Inserting PII data into Skyflow Data Privacy Vault

As the first point of entry in your data flow, the data privacy vault is immediately handed sensitive PII data — before anything is stored on the company’s systems.

With Skyflow Data Privacy Vault, the order information, information about the customer, and any other sensitive data you’re collecting is sent directly from the frontend application into the vault.

// Initialize the Skyflow client with the vault ID, vault URL, and a bearer token provider
const skyflowClient = Skyflow.init({
  vaultID: process.env.vaultID,
  vaultURL: process.env.vaultURL,
  getBearerToken: getBearerToken
});

// Insert the order record into the vault and request tokens for the sensitive fields
let response = await skyflowClient.insert({
  records: [
    {
      fields: {
        order_number: "1",
        order_date: "2023-01-01",
        user_info: {
          id: "usrijG9SC4EQlq5cm",
          email: "john.doe@gmail.com",
          name: "John Doe"
        },
        status: "Fulfilled",
        priority: "Medium",
        amount: "10000",
        country: "US",
        state: "California"
      },
      table: "orders"
    }
  ]
}, { tokens: true });

The following tokenized response is received from the Skyflow API.

{
  "records": [
    {
      "fields": {
        "skyflow_id": "ae7k43l4-596f-189a-lj45-mk3l3e79df71md",
        "order_number": 1,
        "order_date": "2023-01-01",
        "user_info": {
          "id": "usrijG9SC4EQlq5cm",
          "email": "03c1db9d3d63@4e56a7f5.com",
          "name": "bd4b0dee-d4cb-446f-a2e6-dab0e25d702f"
        },
        "status": "Fulfilled",
        "priority": "Medium",
        "amount": "10000",
        "country": "bce3fabd-8364-441c-9dc2-49139b25b75c",
        "state": "4199e308-5b22-44c1-9236-032ddff646e0"
      }
    }
  ]
}

In the example above, name, email, country, and state have been replaced by deterministic tokens, meaning the same input value will always generate the same token within this specific vault. Additionally, the email token is format preserving. Which data gets tokenized and how the tokens are generated are completely configurable and depend on your goals and use case.

Storing tokenized data in S3

The next step is simple: store this tokenized data in an S3 bucket.

import json
import uuid
from datetime import datetime, timezone

import boto3

# Define the S3 bucket name and a unique, timestamped key for the file
bucket_name = "customer-data-bucket"
file_key = f"{datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%SZ')}-{uuid.uuid4()}.json"

# Convert the data to a JSON string
# (`response` holds the tokenized record returned by the Skyflow insert call above)
json_data = json.dumps(response)

# Encode the JSON data as bytes
file_contents = json_data.encode("utf-8")

# Upload the file to S3
s3 = boto3.client("s3")
s3.put_object(Bucket=bucket_name, Key=file_key, Body=file_contents)

So far, we’ve used the Skyflow API to store and tokenize sensitive data and written the tokenized data to S3. This is the point where Snowpipe comes into play.

Snowpipe detects new data and performs ingestion

In AWS, we configure an S3 event notification to tell Snowpipe when new data arrives in the S3 bucket. When Snowpipe is triggered, it uses a COPY statement to load tokenized data from the S3 stage directly into Snowflake.

The following code in Snowflake covers these steps:

  1. Create a new table.
  2. Create a file format and a stage.
  3. Create a new pipe with a COPY INTO statement.

CREATE OR REPLACE TABLE customer_orders (
  order_number INT,
  id STRING,
  email STRING,
  name STRING,
  status STRING,
  order_date DATE,
  priority STRING,
  amount FLOAT,
  country STRING,
  state STRING
);

CREATE OR REPLACE FILE FORMAT customer_json_format
  TYPE = JSON;

-- External stage pointing at the S3 bucket that holds the tokenized files.
-- Auto-ingest requires an external stage; the storage integration name is a placeholder.
CREATE OR REPLACE STAGE customer_stage
  URL = 's3://customer-data-bucket/'
  STORAGE_INTEGRATION = s3_customer_data_integration
  FILE_FORMAT = customer_json_format;

CREATE PIPE customer_orders_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO customer_orders
FROM (
  SELECT
    $1:records[0].fields.order_number::INT,
    $1:records[0].fields.user_info.id::STRING,
    $1:records[0].fields.user_info.email::STRING,
    $1:records[0].fields.user_info.name::STRING,
    $1:records[0].fields.status::STRING,
    $1:records[0].fields.order_date::DATE,
    $1:records[0].fields.priority::STRING,
    $1:records[0].fields.amount::FLOAT,
    $1:records[0].fields.country::STRING,
    $1:records[0].fields.state::STRING
  FROM @customer_stage
)
ON_ERROR = 'continue';
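
One wiring detail worth calling out: for AUTO_INGEST to work, the S3 event notification mentioned earlier must target the SQS queue that Snowflake creates for the pipe. You can find that queue’s ARN in the notification_channel column returned by SHOW PIPES and point the bucket’s event notification at it (the pipe name below matches the sketch above).

-- Retrieve the SQS ARN (notification_channel column) to use
-- when configuring the S3 bucket's event notification
SHOW PIPES LIKE 'customer_orders_pipe';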

With the above pipeline running in Snowflake, ingestion of the de-identified data is automated. Data teams can use customer data for downstream analytics as normal — without worrying about data privacy and compliance.

Because the tokens are deterministic, standard analytical queries using joins, counts, and GROUP BYs work as if the original sensitive values were stored in Snowflake.
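
As a quick illustration, a query like the one below runs against the customer_orders table defined above. Because country is a deterministic token, grouping by it yields the same counts and sums as grouping by the original country values would; only the group labels remain opaque until detokenized.

-- Order volume and revenue per (tokenized) country
SELECT country,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM customer_orders
GROUP BY country
ORDER BY total_amount DESC;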

Detokenize as late as possible and only when necessary

There may be occasions when your data team needs the original (detokenized) data values to provide context for analytics. For example, you might detokenize the country value to display it on a report or chart.

When detokenized data is necessary, you can make calls to Skyflow’s detokenization API endpoint. With a valid token and the right policies in place, your team can access this data as needed. In Snowflake, you can make these detokenization calls on the fly with External Functions. We’ll look at how to implement this in a future post.
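
As a rough preview, the sketch below assumes an External Function (here given the hypothetical name skyflow_detokenize) has already been created and wired to Skyflow’s detokenization endpoint via an API integration; setting that up is exactly what the future post will cover.

-- skyflow_detokenize is a hypothetical external function backed by Skyflow's
-- detokenization API; its creation (API integration, remote service) is not shown here.
SELECT order_number,
       skyflow_detokenize(country) AS country,
       amount
FROM customer_orders;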

Wrap Up

The challenge of dealing with PII while still needing to perform analytics within Snowflake is not a new problem. With data breaches and security incidents becoming commonplace, organizations need to take proper measures to protect customer data. However, that doesn’t mean sacrificing the wealth of business insights that can come from customer data analytics.

By using Skyflow Data Privacy Vault alongside Snowpipe for automated ingestion of tokenized data, we have a solution for securing sensitive data to ensure customer privacy while keeping it accessible to enable customer analytics.
