Connect the Dots: External APIs with Snowflake’s External Network Access

In the not-so-distant past, specifically earlier this year, calling an external API from Snowflake entailed a somewhat cumbersome process. At that time, the go-to method was relying on external functions.

While these functions served their purpose, they took considerable effort to set up. You needed an approved API gateway such as AWS, Azure, or GCP, plus several additional steps to configure roles, grant access, link API integrations, and develop and host proxy code that relayed the external function call to the API.

Furthermore, this approach introduced unnecessary latency into the API call flow because it created multiple network hops. The journey typically involved the external function calling your chosen API gateway, and quite possibly an intermediary component like a Lambda function, before finally reaching the intended external API, as illustrated in the diagram below.

Example of calling an external API with Snowflake’s external functions.

Jump ahead to the present, and Snowflake has taken a significant leap by introducing native support for external network access, currently available in public preview. This groundbreaking feature simplifies the process of connecting to external APIs, sparing you from the overhead of extensive cloud setup and the detours caused by network hops.

In this post, I’ll guide you through the process of setting up external network access. If you’re eager to dive straight into the code, you can also find the repository on GitHub.

Note: At the time of this writing, external network access is only available for Snowflake accounts running on AWS.

Why call an external API from Snowflake?

You can extend Snowflake’s default functionality by defining user-defined functions (UDFs) that perform an operation, complete some business logic, or transform data. You can extend this even further by leveraging third-party APIs that you call from within the UDF.

For instance, by combining a UDF with an external API, you can perform tasks like language translation, sentiment analysis, text classification, or inference using an LLM model. This integration with external APIs is crucial as Snowflake evolves into a platform for not only data management but also application development.

As an example of Snowflake’s external API support, we’ll use third-party APIs available through Skyflow. We’ll show you how to use these APIs to transform sensitive data into de-identified information, suitable for privacy-preserving analytics. We’ll also cover re-identifying sensitive data when necessary. Before diving into the technical details, let’s explore Skyflow and understand the significance of de-identifying sensitive data in modern data management.

What is Skyflow?

Skyflow is a data privacy platform that provides a data privacy vault as a service. A data privacy vault isolates, protects, and governs access to sensitive customer data (e.g. PII). The vault not only secures the data, but through polymorphic encryption, actually makes data useful while keeping it fully encrypted. You can use the data in a privacy-safe way to support workflows like PII management, financial data management and use, data residency, secure data sharing, analytics, and AI and machine learning.

Traditional data security versus a data privacy vault (source: IEEE).

When data is inserted into a vault, the vault returns de-identified data in the form of vault-generated tokens. Unlike encryption, there is no mathematical connection between the original data and the tokenized data. Thus, even if you have a token, you can’t reverse-engineer it, and one token reveals nothing about how other tokens are generated.

In the image below, sensitive data collected from the frontend is stored within the vault, creating a single source of truth. De-identified data is returned and sent downstream, eventually landing in Snowflake.

Simple de-identification pipeline using Skyflow and Snowflake.

The vault design is schema-based, similar to a database. Each column in a table represents sensitive customer data like a home address, phone number, credit card, or SSN. Each column can be configured with different de-identification schemes to support different workflows, as well as different masking rules for partially redacted data.

Example of a simple vault schema (left) and configuring a redaction rule for a bank account number (right). Note: this is the default rule for a bank account number; common PII types come pre-configured with out-of-the-box de-identification and redaction rules to support common use cases.

For example, a person’s name could be consistently de-identified as a UUID, meaning that the same input always generates the same UUID. A credit card could be consistently de-identified such that it preserves the BIN and last 4 digits so you know the issuer and have the last 4 digits for customer verification.
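As a rough illustration of what consistent de-identification means (this is a toy sketch, not Skyflow’s actual algorithm, and the vault key shown is hypothetical), you can think of it as a keyed, deterministic mapping: the same input always yields the same opaque token, and a card token can preserve the BIN and last four digits:

```python
import hashlib
import hmac
import uuid

# Hypothetical key for the sketch; real tokens are generated server-side by the vault
VAULT_KEY = b"example-vault-key"

def deidentify_name(name: str) -> str:
    """Deterministically map a name to a UUID-shaped token."""
    digest = hmac.new(VAULT_KEY, name.encode(), hashlib.sha256).digest()
    return str(uuid.UUID(bytes=digest[:16]))

def deidentify_card(card: str) -> str:
    """Preserve the BIN (first 6) and last 4 digits; tokenize the middle."""
    digest = hmac.new(VAULT_KEY, card.encode(), hashlib.sha256).hexdigest()
    middle = "".join(str(int(c, 16) % 10) for c in digest[: len(card) - 10])
    return card[:6] + middle + card[-4:]

# The same input always produces the same token, so joins and counts still work
print(deidentify_name("Sean Falconer") == deidentify_name("Sean Falconer"))  # True
print(deidentify_card("4111111111111111"))  # starts with 411111, ends with 1111
```

Because the mapping is deterministic but keyed, equality is preserved for analytics while the raw value never leaves the vault.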

The de-identification options help you support analytics workflows without needing access to the raw data, effectively de-scoping Snowflake from data privacy and compliance.

The vault supports policy-based access control based around the principles of zero trust. You give explicit access to the columns and rows of data to a user or service, along with how they should see the underlying data: plain text, partially redacted, or fully redacted. This gives you control over who sees what, when, where, and even for how long.

ALLOW READ ON persons.full_name, persons.email WITH REDACTION = PLAIN_TEXT
ALLOW READ ON persons.dob WITH REDACTION = MASKED
ALLOW READ ON persons.ssn WITH REDACTION = REDACTED

In the simple example above, the policies grant read access to a person’s name and email in plain text, mask the date of birth, and fully redact the SSN.
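The effect of these redaction levels can be sketched as a simple transformation (illustrative only; the vault applies these rules server-side, and the exact mask formats are an assumption):

```python
def apply_redaction(value: str, redaction: str) -> str:
    """Illustrative redaction: PLAIN_TEXT passes through, MASKED keeps a hint,
    REDACTED removes everything."""
    if redaction == "PLAIN_TEXT":
        return value
    if redaction == "MASKED":
        # Keep the last 4 characters, mask the rest
        return "*" * (len(value) - 4) + value[-4:]
    if redaction == "REDACTED":
        return "*REDACTED*"
    raise ValueError(f"Unknown redaction level: {redaction}")

print(apply_redaction("jane@example.com", "PLAIN_TEXT"))  # jane@example.com
print(apply_redaction("1990-04-12", "MASKED"))            # ******4-12
print(apply_redaction("123-45-6789", "REDACTED"))         # *REDACTED*
```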

Re-identifying data based on different roles and policies.

De-identified data can be exchanged with the vault to re-identify it. The assigned role and policies create different representations of the sensitive data for different consumers.

Why keep sensitive data out of Snowflake?

There are a variety of reasons you or your customer may want to keep sensitive customer data out of Snowflake.

For example, by creating another copy of your customer data, you’re increasing the overall security and compliance footprint for your organization. In the world of GDPR and the right to be forgotten, creating another copy of your regulated data creates additional operational burden and amplifies your risk of losing track of what you’re storing and who has access.

Additionally, if you operate globally, as I described in a prior article, complying with data residency regulations around the world can be expensive and operationally complex, hampering global analytics on Snowflake.

These issues, along with internal company rules around sharing regulated data, may stop a company from moving more data and compute to Snowflake as it modernizes. The vault architecture helps de-scope Snowflake and other downstream services, speeding up cloud modernization.

Calling the Skyflow APIs

The architecture we’re looking to achieve is shown below. The UDF, reidentify, makes the API call directly to Skyflow from Snowflake, bypassing the AWS setup described earlier.

Simplified external API call architecture using Snowflake’s external network access feature.

Although the example is based around Skyflow’s APIs, the steps outlined below should work for any standard third-party API.

Step 1: Creating an API secret

In the old world of external functions, you may have had to rely on something like AWS Secrets Manager to store your API authentication key (as shown in the first image of this post). But a simple way to avoid having to call out to AWS is to leverage Snowflake’s native support for secrets management.

Snowflake secrets support basic authentication, OAuth with code grant flow, and OAuth with client credentials flow.

Once you create a secret, only specific Snowflake components such as API integrations and external functions can read the sensitive information.

Skyflow uses service accounts to provide programmatic access to APIs. A service account is associated with a role and a role has a group of policies that controls access and functionality to the vault data.

In Skyflow’s web-based management tool, Skyflow Studio, once you create a service account, a service account file is automatically downloaded to your machine.

For this example, I’ve copied the contents of the service account file and created a secret of type GENERIC_STRING in Snowflake as shown below.

-- Store Skyflow service account key with Snowflake Secrets Manager
CREATE OR REPLACE SECRET skyflow_vault_secret
TYPE = GENERIC_STRING
SECRET_STRING = '{"clientID":"c07a6d60d1424c41b1d052cbe3035db7",
"clientName":"Snowflake Service Account",
"tokenURI":"https://manage.skyflowapis.com/v1/auth/sa/oauth/token",
"keyID":"g59972af789d499291fd8106226ea31e",
"privateKey":"-----BEGIN PRIVATE KEY-----\nMIIEvA…}';

Step 2: Create a network rule

To access an external API, you must create a network rule defining the domains you want to allow your account to access.

For example, in the code below, I’m creating an egress rule for the vault data API domain (i.e. ebfc9bee4242.vault.skyflowapis.com) and the Skyflow authentication domain, manage.skyflowapis.com.

-- Grant access to the Skyflow API endpoints for authentication and vault APIs
CREATE OR REPLACE NETWORK RULE skyflow_apis_network_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('ebfc9bee4242.vault.skyflowapis.com', 'manage.skyflowapis.com');

Step 3: Create an external access integration

The next pre-configuration step is to aggregate the secret and network rule for use with the UDF.

-- Create an integration using the network rule and secret
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION skyflow_external_access_integration
ALLOWED_NETWORK_RULES = (skyflow_apis_network_rule)
ALLOWED_AUTHENTICATION_SECRETS = (skyflow_vault_secret)
ENABLED = true;

Step 4: Create a UDF

For this example, we’re going to create two UDFs: one that de-identifies sensitive data by storing it in the vault and returning a vault-tokenized value, and one that re-identifies vault-tokenized data.

Create a UDF to de-identify data

To use the external API, we need to create a UDF or procedure that will actually make the API call. When creating the UDF, we need to set the EXTERNAL_ACCESS_INTEGRATIONS parameter to the external access integration name, skyflow_external_access_integration, that we created previously.

To keep things simple, we’ll define a UDF that stores a single value in the vault and returns the de-identified value. Within the UDF, the generate_auth_token function uses the secret stored in Snowflake to sign a JWT and exchange it for a bearer token with Skyflow’s auth service.

-- Create a UDF to de-identify a single value
CREATE OR REPLACE FUNCTION skyflow_deidentify(vault_id text, table_name text,
column_name text, value text)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'skyflow_deidentify'
EXTERNAL_ACCESS_INTEGRATIONS = (skyflow_external_access_integration)
PACKAGES = ('pyjwt', 'cryptography', 'requests', 'simplejson')
SECRETS = ('cred' = skyflow_vault_secret)
AS
$$
import _snowflake
import simplejson as json
import jwt
import requests
import time

def generate_auth_token():
    credentials = json.loads(_snowflake.get_generic_secret_string('cred'), strict=False)

    # Create the claims object with the data in the creds object
    claims = {
        "iss": credentials["clientID"],
        "key": credentials["keyID"],
        "aud": credentials["tokenURI"],
        "exp": int(time.time()) + (3600),  # JWT expires in Now + 60 minutes
        "sub": credentials["clientID"],
    }
    # Sign the claims object with the private key contained in the creds object
    signedJWT = jwt.encode(claims, credentials["privateKey"], algorithm='RS256')

    body = {
        'grant_type': 'urn:ietf:params:oauth:grant-type:jwt-bearer',
        'assertion': signedJWT,
    }
    tokenURI = credentials["tokenURI"]

    session = requests.Session()
    r = session.post(tokenURI, json=body)
    auth = json.loads(r.text)

    return auth["accessToken"]

def skyflow_deidentify(vault_id, table_name, column_name, value):
    auth_token = generate_auth_token()

    body = {
        "tokenization": True,
        "records": [
            {
                "fields": {
                    column_name: value
                }
            }
        ]
    }

    url = "https://ebfc9bee4242.vault.skyflowapis.com/v1/vaults/" + vault_id + "/" + table_name
    headers = {
        "Authorization": "Bearer " + auth_token
    }

    session = requests.Session()
    response = session.post(url, json=body, headers=headers)

    response_as_json = json.loads(response.text)

    return response_as_json["records"][0]["tokens"][column_name]
$$;

Create a UDF to re-identify data

We’ll also create a UDF to re-identify data. This way, if a Snowflake user has the right permissions, they can retrieve the original data in memory for reporting purposes when re-identification is needed.

-- Create a UDF to re-identify a de-identified value
CREATE OR REPLACE FUNCTION skyflow_reidentify(vault_id text, value text)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'skyflow_reidentify'
EXTERNAL_ACCESS_INTEGRATIONS = (skyflow_external_access_integration)
PACKAGES = ('pyjwt', 'cryptography', 'requests', 'simplejson')
SECRETS = ('cred' = skyflow_vault_secret)
AS
$$
import _snowflake
import simplejson as json
import jwt
import requests
import time

def generate_auth_token():
    credentials = json.loads(_snowflake.get_generic_secret_string('cred'), strict=False)

    # Create the claims object with the data in the creds object
    claims = {
        "iss": credentials["clientID"],
        "key": credentials["keyID"],
        "aud": credentials["tokenURI"],
        "exp": int(time.time()) + (3600),  # JWT expires in Now + 60 minutes
        "sub": credentials["clientID"],
    }
    # Sign the claims object with the private key contained in the creds object
    signedJWT = jwt.encode(claims, credentials["privateKey"], algorithm='RS256')

    body = {
        'grant_type': 'urn:ietf:params:oauth:grant-type:jwt-bearer',
        'assertion': signedJWT,
    }
    tokenURI = credentials["tokenURI"]

    session = requests.Session()
    r = session.post(tokenURI, json=body)
    auth = json.loads(r.text)

    return auth["accessToken"]

def skyflow_reidentify(vault_id, value):
    auth_token = generate_auth_token()

    body = {
        "detokenizationParameters": [
            {
                "token": value
            }
        ]
    }

    url = "https://ebfc9bee4242.vault.skyflowapis.com/v1/vaults/" + vault_id + "/detokenize"
    headers = {
        "Authorization": "Bearer " + auth_token
    }

    session = requests.Session()
    response = session.post(url, json=body, headers=headers)

    response_as_json = json.loads(response.text)

    return response_as_json["records"][0]["value"]
$$;

Testing it out

With both UDFs in place, we can now de-identify data during insertion directly from Snowflake and re-identify when pulling data.

The query below inserts my name into the vault and the de-identified version into Snowflake.

INSERT INTO customers (
    name
)
SELECT skyflow_deidentify('i0e83ad6af494b9bac01245a52523e90',
    'persons', 'name', 'Sean Falconer') AS name;

Inserting a record into the vault and Snowflake.

Below, I check what’s stored in Snowflake. There’s a single UUID representing my name. For analytics purposes, Snowflake doesn’t really need to know my name; as long as the de-identified version is consistently generated, you can still do counts, joins, and group bys.

Querying the list of customers to show the de-identified customer names.
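Because the same raw value always maps to the same token, aggregate queries behave the same over tokens as over the raw values. A quick sketch (the tokens here are local stand-ins, not real vault output):

```python
import hashlib
from collections import Counter

def fake_token(value: str) -> str:
    # Stand-in for a consistent vault token: deterministic and opaque
    return hashlib.sha256(value.encode()).hexdigest()[:12]

orders = ["Sean Falconer", "Jane Doe", "Sean Falconer"]
tokenized = [fake_token(name) for name in orders]

# Counting distinct customers works identically on raw values and tokens
assert len(Counter(orders)) == len(Counter(tokenized))
# Group-by-style counts are preserved token-for-token
assert sorted(Counter(orders).values()) == sorted(Counter(tokenized).values())
print(Counter(tokenized))
```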

This is a simple example; to do this in bulk, you’d likely use a more sophisticated approach, similar to what I’ve written about previously.

Next, let’s test re-identifying data. In the query below, I’m re-identifying all names in the table.

-- Re-identify all names in the customer table
WITH customer_list AS (
    SELECT skyflow_reidentify('i0e83ad6af494b9bac01245a52523e90', name)
    FROM customers
)
SELECT * FROM customer_list;

Re-identify all customer names from the customers table.

You always want to re-identify data as late in the data lifecycle as possible. You’d want to do your analytics using the de-identified data and then only re-identify during the rendering of a report. Also, you could modify the UDF to do this in bulk rather than one at a time as in this example.
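Following the single-token request shape used in skyflow_reidentify above, a bulk version could send one detokenize call for many tokens by building a larger detokenizationParameters list. Here's a minimal sketch of the request builder and response parser (the single-token shape comes from the UDF; sending many tokens per call is an assumption about the API):

```python
def build_detokenize_body(tokens):
    """Build one detokenize request body for a batch of tokens."""
    return {"detokenizationParameters": [{"token": t} for t in tokens]}

def parse_detokenize_response(response_json):
    """Map each returned record back to its plain-text value, in order."""
    return [record["value"] for record in response_json["records"]]

body = build_detokenize_body(["tok_a", "tok_b", "tok_c"])
print(len(body["detokenizationParameters"]))  # 3
```

Batching like this amortizes the auth and HTTP overhead across many rows instead of paying it once per value.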

Final thoughts

In this post, we took a deep dive into Snowflake’s new support for external network access. This vastly simplifies the process of integrating with third party APIs.

In this example implementation, we looked at de-identifying and re-identifying sensitive customer data with Skyflow’s APIs. This is just one example; you could use a similar approach to integrate with any standard API, significantly extending the base functionality available in Snowflake.

Check out the repo for the codebase and happy building!

Sean Falconer
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Head of Developer Relations and Marketing @ Skyflow | Engineer & Storyteller | 100% Canadian 🇨🇦 | Snowflake Data Superhero ❄️ | AWS Community Builder