Snowflake External Access & OpenAI ChatGPT for Zero-Shot Classification

Michael Gorkow
3 min readAug 29, 2023

--

Images created by AI

Who would love to do something like this?

possible_categories = ['CATEGORY_1','CATEGORY_2','CATEGORY_N']
classifications_df = df.with_column('PRED', chatgpt_classify(F.col('TEXT'), F.lit(possible_categories)))

Or this?

SELECT chatgpt_classify(TEXT, ['CATEGORY_1','CATEGORY_2','CATEGORY_N']) AS PRED;

This blog article delves into how Snowflake leverages Scikit-LLM and OpenAI’s ChatGPT for zero-shot classification through its novel “External Access” feature, showcasing the synergy of Large Language Models and Snowflake.

Disclaimer: Not every use case requires using LLMs and text classification is probably one that could be solved easily with traditional techniques.

What is Zero Shot Classification?

Zero-Shot classification is a natural language processing technique where a model can categorize text into classes it hasn’t seen during training. It does this by understanding the relationship between the given text and a set of predefined class descriptions. This allows the model to make predictions for classes it hasn’t been explicitly trained on, making it a versatile approach for classifying text into various categories without the need for extensive retraining.

What is Snowflake External Access?

With Snowflake External Access it is possible to establish safe connections to external network spots beyond Snowflake and subsequently employ those connections within the handler code for custom user functions (UDFs) and stored procedures.

The setup is straight forward and involves defining the URL we want to access (api.openai.com) as well as defining a secret to safely store our OpenAI key. Afterwards we can apply the integration easily to a function.

-- Create network role
CREATE OR REPLACE NETWORK RULE OPENAI_API_NETWORK_RULE
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.openai.com')

-- Create secret with OpenAI Key
CREATE OR REPLACE SECRET OPENAI_API_KEY
TYPE = GENERIC_STRING
SECRET_STRING = '<your-openai-key>'

-- Create External Access Integration
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION OPENAI_API_EXT_INT
ALLOWED_NETWORK_RULES = (OPENAI_API_NETWORK_RULE)
ALLOWED_AUTHENTICATION_SECRETS = (OPENAI_API_KEY)
ENABLED = TRUE

-- Apply integration to a function
ALTER FUNCTION chatgpt_classify(VARCHAR, ARRAY) SET SECRETS = ('CRED' = OPENAI_API_KEY)
ALTER FUNCTION chatgpt_classify(VARCHAR, ARRAY) SET EXTERNAL_ACCESS_INTEGRATIONS = (OPENAI_API_EXT_INT)

Once External Access is in PuPr, I’ll link to the corresponding documentation.

Update: External Access is now in PuPr for AWS. You can find the documentation here.

What is ChatGPT

ChatGPT is an advanced AI language model developed by OpenAI. Powered by the GPT (Generative Pre-trained Transformer) architecture, it comprehends context and generates human-like text. It’s employed for tasks like answering questions, providing explanations, content creation, and natural conversations, demonstrating its proficiency in diverse language-related applications.

In this blog we’ll focus on it’s ability to classify texts without any prior model training.

What is Scikit-LLM?

Scikit-LLM enables the seamless integration of Large Language Models, such as ChatGPT, into scikit-learn, facilitating improved text analysis capabilities.

What does the code look like?

Most of the code is very standard (you can find it here) so we’ll focus on the most important part, the UDF, that is called to classify text:

# UDF to perform multi-class classification with ChatGPT
@udf(name='chatgpt_classify',
is_permanent=True,
stage_location='@FUNCTIONS',
replace=True,
packages=['openai','scikit-llm'])
def chatgpt_classify(text: str, possible_categories: list) -> str:
import _snowflake
from skllm.config import SKLLMConfig
from skllm import ZeroShotGPTClassifier
SKLLMConfig.set_openai_key(_snowflake.get_generic_secret_string('CRED'))
clf = ZeroShotGPTClassifier()
clf.fit(None, possible_categories)
return clf.predict([text])[0]

# snow_df is a Snowpark DataFrame that points to our text data
possible_categories = [
'BUSINESS',
'POLITICS',
'TRAVEL'
]
classifications_df = snow_df.with_column('PRED', chatgpt_classify(F.col('TEXT'), F.lit(possible_categories)))
classifications_df.show()

# Output:
# -----------------------------------------------------------------------------------
# |"TEXT" |"CATEGORY" |"PRED" |
# -----------------------------------------------------------------------------------
# |Political Hotels: A Tour Of 2012's Campaign Fun... |TRAVEL |TRAVEL |
# |State Department Protest Of Donald Trump’s Immi... |POLITICS |POLITICS |

As you can see, in this UDF we are using Scikit-LLM to “fit” a ZeroShotGPTClassifier with our provided categories on no training data. Afterwards we can simply use predict() to get our predicted category.

Our OpenAI-Key is retrieved securely from a secret that is stored and governed in Snowflake. There is no need to include credentials in your code.

I tested ChatGPTs classification capabilities on 69 random texts taken from this Kaggle dataset and the results are not that bad:

Predictions are not that bad on news headlines.

All the code is located in this Github repository.

Michael Gorkow | Field CTO Datascience @ Snowflake

--

--

Michael Gorkow

Field CTO Datascience @Snowflake Germany | Passionate about all things related to data science! | Personal Account, Contents are my own thoughts!