Snowflake: Unlock the Power of OpenAI using Snowpark

Querying “your” data in Snowflake is now easier than ever using LLMs

Snowflake allows you to securely store nearly unlimited data, structured and unstructured, in the Snowflake Data Platform. However, deriving insight from this data can be challenging: one needs to understand the data model and know SQL or Python. So what if you could ask Snowflake questions in natural language and have it give you answers?

Snowflake has made it easier to integrate with third-party hosted models such as Azure OpenAI Service and ChatGPT using a new feature called External Network Access from UDFs and stored procedures in Snowpark (Private Preview). Compared to external functions, this feature reduces setup complexity for this use case and is also more cost-effective.

Of course, you can bring and train your own LLM in Snowflake using Snowpark Container Services; check this and this out. This article, however, shows how to integrate your Snowflake deployment with third-party hosted models. Be careful about what data you send to those hosted models.

Snowflake Data Cloud — Single Secure Platform

To get started, set up your environment (one time) by creating various Snowflake objects such as a NETWORK RULE, a SECRET, and an EXTERNAL ACCESS INTEGRATION, as shown below. Note that we use a SECRET to store the OpenAI API key, so you will need an API key. You can get an API key for ChatGPT by following this, and for Azure OpenAI by following this. It is not free!

USE ROLE ACCOUNTADMIN;
-- change following to use your own db.schema, warehouse
USE SCHEMA DEMODB.LLM;
USE WAREHOUSE DEMO_WH;

-- create a separate role for network admin as part of best practices
CREATE ROLE IF NOT EXISTS network_admin;
GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE network_admin;
GRANT CREATE NETWORK RULE ON SCHEMA demodb.llm TO ROLE network_admin;
GRANT CREATE SECRET ON SCHEMA demodb.llm TO ROLE network_admin;

GRANT USAGE on DATABASE demodb to ROLE network_admin ;
GRANT USAGE on SCHEMA demodb.llm to ROLE network_admin ;
GRANT USAGE on WAREHOUSE demo_wh to ROLE network_admin;

-- create role hierarchy
GRANT ROLE network_admin to ROLE SYSADMIN;

USE ROLE network_admin;
USE WAREHOUSE demo_wh;

-- create network rules so that only these sites are allowed (compliance)
CREATE OR REPLACE NETWORK RULE web_access_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('drive.google.com',
'raw.githubusercontent.com',
'api.exchangerate-api.com',
'api.openai.com',
'openai-southus.openai.azure.com');

USE ROLE SYSADMIN;
USE SCHEMA DEMODB.LLM;
USE WAREHOUSE DEMO_WH;


CREATE OR REPLACE SECRET sf_openapi_key
TYPE = password
USERNAME = 'gpt-3.5-turbo' -- model to use
PASSWORD = 'sk-ABC123.....'; -- API key

CREATE OR REPLACE SECRET sf_azure_openapi_key
TYPE = password
USERNAME = 'your model engine'
PASSWORD = 'your api key' ;

-- create external access integration with the network rule created above
-- along with the secrets holding the API keys

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION external_access_int
ALLOWED_NETWORK_RULES = (web_access_rule)
ALLOWED_AUTHENTICATION_SECRETS = (sf_openapi_key, sf_azure_openapi_key )
ENABLED = true;

Once you have created the above objects, you can write a Python UDF that references the external access integration and the secret to call OpenAI models. Here is the code for such a Python UDF for ChatGPT and for Azure OpenAI. As you can see, we use the SECRET object to get the key, and the INTEGRATION object to restrict network access.

USE ROLE SYSADMIN;
USE SCHEMA DEMODB.LLM;
USE WAREHOUSE DEMO_WH;

-- you can add other parameters, such as the model, to the UDF as you see fit
CREATE OR REPLACE FUNCTION chatgpt(query varchar)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'getanswer'
EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int)
SECRETS = ('openai_key' = sf_openapi_key)
PACKAGES = ('openai')
AS
$$
import _snowflake
from openai import OpenAI

def getanswer(QUERY):
    # read the API key from the Snowflake SECRET object
    sec_object = _snowflake.get_username_password('openai_key')
    messages = [
        {"role": "user", "content": QUERY}
    ]
    model = "gpt-3.5-turbo"
    client = OpenAI(api_key=sec_object.password)
    response = client.chat.completions.create(messages=messages, model=model)
    return response.choices[0].message.content.strip()
$$;

-- now send a prompt to the model to generate SQL
SELECT chatgpt('### Snowflake database SQL tables, with their properties ###\
schema SNOWFLAKE.ACCOUNT_USAGE, QUERY_HISTORY table QUERY_TEXT VARCHAR, \
WAREHOUSE_SIZE VARCHAR, START_TIME TIMESTAMP,TOTAL_ELAPSED_TIME NUMBER \
### A sql to show top ten query based on total elapsed time \
for last 24 hours');

-- OUTPUT
SELECT QUERY_TEXT, WAREHOUSE_SIZE, START_TIME, TOTAL_ELAPSED_TIME
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD(HOUR, -24, CURRENT_TIMESTAMP())
ORDER BY TOTAL_ELAPSED_TIME DESC
LIMIT 10;
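The prompts in these examples all follow the same shape: a schema description wrapped in ### markers, followed by the question. If you generate many such prompts, a small helper keeps them consistent. This is just a sketch; the function name and format string simply mirror the pattern used above, not any Snowflake API:

```python
def build_prompt(schema_desc: str, question: str) -> str:
    """Assemble a text-to-SQL prompt in the ###-delimited format used above."""
    return ("### Snowflake database SQL tables, with their properties ###\n"
            f"{schema_desc}\n"
            f"### {question}")

prompt = build_prompt(
    "schema SNOWFLAKE.ACCOUNT_USAGE, QUERY_HISTORY table QUERY_TEXT VARCHAR, "
    "WAREHOUSE_SIZE VARCHAR, START_TIME TIMESTAMP, TOTAL_ELAPSED_TIME NUMBER",
    "A sql to show top ten query based on total elapsed time for last 24 hours",
)
```

The resulting string can be passed straight into `SELECT chatgpt(...)` as in the query above.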

-- you may need to eliminate surrounding text using REGEXP_SUBSTR
select REGEXP_SUBSTR(
chatgpt('### Snowflake SQL tables, with their properties:### \
database WORKSHOP01_DB in schema WORKSHOP_SCHEMA, CAMPAIGN_SPEND table \
CAMPAIGN VARCHAR,CHANNEL VARCHAR,DATE DATE,TOTAL_CLICKS NUMBER, \
TOTAL_COST NUMBER,ADS_SERVED NUMBER \
### A query to the data so we can see total cost per year/month per channel '),
'```sql([^```]+)',1,1,'e');
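If you prefer to post-process the answer in Python rather than SQL, the same extraction can be done with the `re` module. This is a sketch under the assumption that the model wraps the statement in a ```sql fence; it falls back to the raw answer otherwise:

```python
import re

def extract_sql(answer: str) -> str:
    """Pull the SQL statement out of an LLM answer, stripping a ```sql fence if present."""
    match = re.search(r"```sql\s*(.*?)\s*```", answer, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return answer.strip()

fenced = "Here is your query:\n```sql\nSELECT 1;\n```"
sql = extract_sql(fenced)  # "SELECT 1;"
```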

So now you have a query. You could execute it right away, but you may not want to: what if the query has a bad plan or was not derived correctly? Put some guardrails in place. For example, you can run EXPLAIN on the query before allowing it to run.
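Before EXPLAIN even runs, a cheap pre-flight check can reject anything that is not a single read-only statement. The following is only an illustrative sketch (the keyword list is not exhaustive, and it is no substitute for proper role-based access control on the executing role):

```python
import re

# hypothetical guardrail: allow only a single read-only SELECT/WITH statement
FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "MERGE", "DROP",
             "ALTER", "CREATE", "TRUNCATE", "GRANT", "CALL")

def is_safe_select(sql: str) -> bool:
    stmt = sql.strip().rstrip(";")
    if not stmt or ";" in stmt:  # empty, or more than one statement
        return False
    if stmt.split(None, 1)[0].upper() not in ("SELECT", "WITH"):
        return False
    tokens = {t.upper() for t in re.findall(r"[A-Za-z_]+", stmt)}
    return not (tokens & set(FORBIDDEN))
```

Only queries that pass this check would then go on to EXPLAIN and, finally, execution.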

Similar to ChatGPT, you can also call the Azure OpenAI API from a Python UDF, shown below. Change the API base and API version in the following code to match your deployment.

CREATE OR REPLACE FUNCTION azure_openai(query varchar)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'getanswer'
EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int)
SECRETS = ('openai_key' = sf_azure_openapi_key )
PACKAGES = ('openai')
AS
$$
import _snowflake
from openai import AzureOpenAI

def getanswer(QUERY):
    # the secret stores the deployment/model name as USERNAME and the key as PASSWORD
    sec_object = _snowflake.get_username_password('openai_key')
    model = sec_object.username
    messages = [{"role": "user", "content": QUERY}]
    client = AzureOpenAI(
        api_key=sec_object.password,
        api_version="2023-05-15",
        azure_endpoint="https://openai-southus.openai.azure.com/"
    )
    response = client.chat.completions.create(messages=messages, model=model)
    return response.choices[0].message.content.strip()
$$;


-- or, using the completions endpoint instead of chat completions

CREATE OR REPLACE FUNCTION azure_openai(query varchar)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'getanswer'
EXTERNAL_ACCESS_INTEGRATIONS = (external_access_int)
SECRETS = ('openai_key' = sf_azure_openapi_key )
PACKAGES = ('openai')
AS
$$
import _snowflake
from openai import AzureOpenAI

def getanswer(QUERY):
    sec_object = _snowflake.get_username_password('openai_key')
    model = sec_object.username
    api_key = sec_object.password
    api_base = "https://openai-southus.openai.azure.com/"
    deployment_name = model
    api_version = '2023-12-01-preview'  # or 2022-12-01

    client = AzureOpenAI(
        api_key=api_key,
        api_version=api_version,
        azure_endpoint=api_base
    )
    response = client.completions.create(prompt=QUERY, model=deployment_name, max_tokens=1000)
    return response.choices[0].text
$$;


-- QUERY Examples
select azure_openai('Write a tagline for Snowflake');
select azure_openai('Classify this sentiment: Snowflake Data Platform \
is extremely cost effective!');

--QUERY
select azure_openai('### Snowflake SQL tables, with their properties:### \
SNOWFLAKE_SAMPLE_DATA database in schema TPCH_SF1 in ORDERS table \
O_TOTALPRICE NUMBER \
### What is average price rounded of orders');

-- OUTPUT
SELECT ROUND(AVG(O_TOTALPRICE), 2) FROM SAMPLE_DATA.TPCH_SF1.ORDERS;

-- QUERY
select azure_openai('### Snowflake SQL tables, with their properties:### \
SNOWFLAKE_SAMPLE_DATA database in schema TPCH_SF1 in ORDERS table \
O_CUSTKEY NUMBER, O_TOTALPRICE NUMBER\
CUSTOMER TABLE, C_CUSTKEY NUMBER, C_NAME VARCHAR \
### Who are my top ten customers?');

-- OUTPUT
SELECT c.C_NAME, SUM(o.O_TOTALPRICE) as TOTAL_ORDERS
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS o
inner join SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER c
on o.O_CUSTKEY = c.C_CUSTKEY
group by c.C_NAME order by TOTAL_ORDERS desc LIMIT 10;
Output of LLM UDF

These are just a few examples of how to use this, but there are many more possibilities; use your own creativity.

If you want to deliver this functionality to your end users in an application, you can quickly build one using Streamlit in Snowflake (Private Preview), covering the complete workflow from the user's question to execution of the SQL, with tabular and chart output. See examples here, here, and here.

LLMs are very productive for teams, as they offer another way to interact with data, but you should be mindful of security and privacy (you have to send metadata to the hosted model) and of quality (not all outputs are perfect). A company needs good structure and governance around how it manages data, and that is exactly what Snowflake provides.

Good luck with your machine-learning projects!

Disclaimer: The opinions expressed in this post are my own and not necessarily those of my employer (Snowflake).
