Securely Using Snowflake’s Python Connector within an Azure Function

Why and How to use Key Vault

John Aven
Hashmap, an NTT DATA Company
8 min read · Oct 31, 2019


Securing access credentials matters in every situation, and using the right approach is always the best practice. At times, though, the best practice isn't possible, and we are forced to find the next best thing while still maintaining the integrity of a secret.

At the time of this writing, it is not possible to follow the best practice of referencing a Key Vault secret within a Python Azure Function, for reasons we will discuss further down the page. The work here reflects our experience managing secrets within Azure, and in particular accessing secrets for Snowflake's Cloud Data Warehouse from within Azure Functions.

How to Get Started

So that we can use a Managed Service Identity (MSI) within an Azure Function to access secrets in an Azure Key Vault, we will go through the steps below:

# Create Temporary Environment Variables
RGN='your-resource-group-name'
LOC='eastus'
SAN='my-storage-account-name'
FPN='function_project_name'
SUB='subscription-name'
KVN='your-key-vault-name'
secret_value='12j1234hp'
AppName='test-app-23478u923'
# Below we turn off history temporarily so that the password is not recorded (since it is sensitive information)
USER='some_user'
set +o history
PASSWORD='some_password'
set -o history
ACCOUNT='your_account'
WAREHOUSE='warehouse'
DATABASE='test'
SCHEMA='test'
ROLE='some_role'
SFUSER='secretuser'
# Set the appropriate subscription
az account set --subscription "$SUB"
# Create a resource group
az group create --name "$RGN" --location "$LOC"
# Create a storage account
az storage account create --name "$SAN" --location "$LOC" --resource-group "$RGN" --sku Standard_LRS
# Create the function
az functionapp create --resource-group "$RGN" --os-type Linux --consumption-plan-location "$LOC" --runtime python --name "$AppName" --storage-account "$SAN"
# Install Azure Functions Python library (assume using python 3.6)
pip install azure-functions
# Create your azure function and navigate to its source root
func init $FPN
cd $FPN
# Create a new function
func new

At this point, we will leave the script and interactively create the function; when prompted, choose an HttpTrigger. If you want to test, you can always start the function locally with

func host start
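
With the host running, the HTTP trigger can be exercised locally with a plain HTTP call; for example (assuming the default local port and route, and whatever name you gave the function when running func new):

curl "http://localhost:7071/api/YourFunctionName"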

Before writing the code for the function, there is still some work to complete:

# Create the new keyvault
az keyvault create --name "$KVN" --resource-group "$RGN" --location "$LOC"
az keyvault secret set --vault-name "$KVN" --name "$SFUSER" --value "$USER|$PASSWORD|$ACCOUNT|$WAREHOUSE|$DATABASE|$SCHEMA|$ROLE"
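
Optionally, you can confirm that the secret landed in the vault without echoing its value; a quick check along these lines:

az keyvault secret show --vault-name "$KVN" --name "$SFUSER" --query name --output tsv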

Now assign a managed service identity to the function app in Azure and grant it the necessary permissions:

# Assign identity
az functionapp identity assign --name $AppName --resource-group $RGN
# Retrieve the principal ID
PID=$(az functionapp identity show --name $AppName --query principalId --output tsv)
# Set the get policy for azure keyvault
az keyvault set-policy --name "$KVN" --object-id "$PID" --secret-permissions get
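
If you want to double-check that the policy took effect, something like the following lists the secret permissions granted to the function's identity (a verification step, not part of the original flow):

az keyvault show --name "$KVN" --query "properties.accessPolicies[?objectId=='$PID'].permissions.secrets" --output json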

At this point, the application (within Azure) should be configured so that it can reference the Key Vault values as environment variables, as the documentation states. This last part, while documented, is not obvious on a first, second, or third reading, but it makes sense logically: even though the MSI can use Key Vault, it still needs an explicit policy granting it permission to read the secrets.
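
For reference, the best-practice approach would be an app setting holding a Key Vault reference, sketched below (the secret URI and version are illustrative). As discussed under "The Dilemma" further down, this reference does not currently resolve on a Linux consumption plan:

az functionapp config appsettings set --name "$AppName" --resource-group "$RGN" \
  --settings "secretuser=@Microsoft.KeyVault(SecretUri=https://your-key-vault-name.vault.azure.net/secrets/secretuser/<version>)"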

Let’s Build a RESTful API for Snowflake Usage Monitoring

Next is to write code that does something useful. A simple approach is to build a RESTful API. Let's assume that we are building a tool to monitor warehouse usage within Snowflake, which uses the following view:

CREATE VIEW "Database"."Schema"."USAGE_V" AS
SELECT *
FROM TABLE(Database.information_schema.WAREHOUSE_METERING_HISTORY(dateadd('days', -10, current_date())));

With this view, you would be interested in knowing the average usage over some period of time, the largest and smallest usage in a one-hour window, and maybe even a histogram (at hourly resolution) of the usage.
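
As a rough sketch, that kind of summary could also be expressed directly in SQL against the view (warehouse_name and credits_used are columns returned by the metering history function); below we will compute the equivalent in pandas instead:

SELECT warehouse_name,
       AVG(credits_used) AS mean_usage,
       MIN(credits_used) AS min_usage,
       MAX(credits_used) AS max_usage
FROM usage_v
GROUP BY warehouse_name;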

We will use the native Python connector published by Snowflake, accessing it through the snowflake-connector-python package plus pandas. The results will be packaged into a JSON document and returned.

import os
import traceback
import azure.functions as func
import pandas as pd
import snowflake.connector
from msrestazure.azure_active_directory import MSIAuthentication
from azure.keyvault import KeyVaultClient
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
#------------------------------------------------------------------#
def main(req: func.HttpRequest) -> func.HttpResponse:
    try:
        df = get_dataframe()
        aggregated_df = get_aggregated_dataframe(df=df)
        return func.HttpResponse(aggregated_df.T.to_json())
    except Exception:
        err_msg = ("Error while retrieving and processing data from Snowflake. "
                   "Please try again later. " + traceback.format_exc())
        return func.HttpResponse(err_msg, status_code=400)
#------------------------------------------------------------------#
def parse_connection_secret(secret: str) -> dict:
    # The secret is a single pipe-delimited string of connection parameters.
    split_secret = secret.split("|")
    return {
        "user": split_secret[0],
        "password": split_secret[1],
        "account": split_secret[2],
        "warehouse": split_secret[3],
        "database": split_secret[4],
        "schema": split_secret[5],
        "role": split_secret[6],
    }
#------------------------------------------------------------------#
def get_dataframe() -> pd.DataFrame:
    # Retrieve the connection parameters and pull the usage view into a DataFrame.
    sec = parse_connection_secret(get_secret('secretuser'))
    con = snowflake.connector.connect(
        user=sec['user'],
        password=sec['password'],
        account=sec['account'],
        warehouse=sec['warehouse'],
        database=sec['database'],
        schema=sec['schema'],
        role=sec['role'],
    )
    df = pd.read_sql('select * from usage_v', con)
    df.columns = [col.lower() for col in df.columns]
    return df
#------------------------------------------------------------------#
def get_aggregated_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate the credit usage per warehouse.
    aggregated_df = df.groupby('warehouse_name').agg({
        'credits_used': {'mean_over_hours_of_usage': 'mean',
                         'sum_usage': 'sum',
                         'min_usage': 'min',
                         'max_usage': 'max',
                         'hours_with_usage': 'count'}
    })
    # Drop the outer ('credits_used') level of the resulting MultiIndex columns.
    aggregated_df.columns = aggregated_df.columns.droplevel(0)
    return aggregated_df
#------------------------------------------------------------------#
def get_secret(conf_value) -> str:
    # Read the value from application settings; the full version appears further below.
    secret = os.getenv(conf_value)
    return secret

See the repo for more details (and comments) on the code; it also does a fair amount of logging. Beyond the interaction with Snowflake, it demonstrates a simple GET API as an Azure Function that executes a query, manipulates the data, and returns it as a report-ready dataset.

You will note that we use the connector maintained by Snowflake. While there is a SQLAlchemy extension for Snowflake, and it did work locally, it did not install properly when the service was published to Azure. This is likely because the Azure-maintained base Docker image does not update to the latest version of pip, which would otherwise download the appropriate version of the connector.
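
For reference, a minimal requirements.txt covering the imports used above would look something like this (unpinned here; pin versions as appropriate for your deployment):

azure-functions
snowflake-connector-python
pandas
msrestazure
azure-keyvault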

The Dilemma

Now, here is the thing. While on Windows-based containers it is possible to create an app setting that references the vaulted secret and read it as an environment variable, it is not possible to do so for Linux images running on a consumption plan. This has been a feature request for some time, and to be honest, I can't see why it hasn't been delivered yet. The effect is that this would appear to block the best practice.

However, not all is lost. Managed Service Identities are enabled on Linux consumption plan images, so calling the Key Vault REST API is still possible. To avoid dealing directly with the API, we recommend using the SDK that wraps it, as seen below. In doing so, we can create an additional wrapper, get_secret, to use in place of referencing an environment variable and help preserve some of that simplicity.

def get_secret(conf_value) -> str:
    secret = os.getenv(conf_value)
    if len(secret.split("|")) == 7:
        # Local development: the setting holds the full pipe-delimited connection string.
        return secret
    elif len(secret.split("|")) == 3:
        # Running in Azure: the setting holds "SecretBaseURI|SecretName|SecretVersion".
        return get_secret_from_keyvault(resource_packet=secret)
    else:
        raise ValueError('secret was not in application settings')
#------------------------------------------------------------------#
def get_secret_from_keyvault(resource_packet: str) -> str:
    # Authenticate with the function app's managed identity and fetch the secret.
    credentials = MSIAuthentication()
    client = KeyVaultClient(credentials)
    resource_uri, secret_id, secret_version = resource_packet.split("|")
    secret_bundle = client.get_secret(resource_uri, secret_id, secret_version)
    return secret_bundle.value

Making It Seamless and Reusable

This is great, but it doesn't provide quite the same elegance as referencing an environment variable, though it is a close second. The extra code for a seamless, reusable process does require a standardized way of constructing application settings for secrets. That consistency should be no more than a small inconvenience, and it is no different from what you would have to do to reference a Key Vault secret anyway, if that were to work.

So, by setting our local.settings.json to hold the secret as follows:

"Values": {
...
"secretuser": "user|password|account|warehouse|database|schema|role"
}

we can execute the code and test it locally as if it were running in Azure. Then, by setting the function's application setting in Azure to

secretuser="SecretBaseURI|SecretName|SecretVersion"

we can run the same code in both places seamlessly.
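
One way to set that value, sketched with the Azure CLI (the version extraction assumes the standard secret identifier format of .../secrets/<name>/<version>):

# Look up the current version of the secret and pack the app setting value
SECRET_ID=$(az keyvault secret show --vault-name "$KVN" --name "$SFUSER" --query id --output tsv)
SECRET_VERSION="${SECRET_ID##*/}"
az functionapp config appsettings set --name "$AppName" --resource-group "$RGN" \
  --settings "secretuser=https://$KVN.vault.azure.net|$SFUSER|$SECRET_VERSION"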

Note that we would not suggest doing this exactly as shown. At a minimum, it would be much better to separate the password into another secret and to obfuscate the setting names so that they don't immediately appear to belong together; a sketch of that separation follows.
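
As a rough illustration only (the setting names sfconn and sfpassword are hypothetical, and get_secret and parse_connection_secret would need their field counts adjusted accordingly):

def get_connection_params() -> dict:
    # Everything except the password lives in one packed secret...
    params = parse_connection_secret(get_secret('sfconn'))
    # ...while the password is stored and retrieved as its own secret.
    params['password'] = get_secret('sfpassword')
    return params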

To deploy the Azure Function, we would then execute the following:

func azure functionapp publish $AppName --build remote
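
Once published, the function can be invoked over HTTP; for example (the function name is whatever func new generated, and the key is required when the trigger uses the default function-level authorization):

curl "https://$AppName.azurewebsites.net/api/YourFunctionName?code=<function-key>"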

When the function is invoked, something like the following is returned:

"WH_0":{"mean_over_hours_of_usage":0.0516634981,"sum_usage":13.587500001,"min_usage":0.03,"max_usage":0.274722222,"hours_with_usage":263.0},
"WH_1":{"mean_over_hours_of_usage":0.1447286822,"sum_usage":6.223333335,"min_usage":0.019722222,"max_usage":0.592222222,"hours_with_usage":43.0}

Successfully Maintaining the Validity of Your Secrets

What's great about this is that your secrets remain secrets. You won't have to embed them in your application settings, or fake it in a worse way by embedding them as a base64-encoded string; encoding is not encryption.

Here at Hashmap, we believe it is crucial to always make sure that your software remains secure. Loose and inconsistent practices, especially around security, lead to inadequate and unsafe software.

Ready to accelerate your digital transformation?

At Hashmap, we work with our clients to build better, together.

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and would be glad to work through your specifics in this area.

How does Snowflake compare to other data warehouses? Our technical experts have implemented over 250 cloud/data projects in the last 3 years and conducted unbiased, detailed analyses across 34 business and technical dimensions, ranking each cloud data warehouse.

Other Tools and Content You Might Like

Feel free to share on other channels and be sure and keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap as well on Spotify, Apple, Google, and other popular streaming apps.

John Aven, PhD, is the Director of Engineering at Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure and connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
