From Data to LLMs: Snowflake Made It Easy

AI holds tremendous promise for solving complex problems and enhancing many aspects of our lives, but adopting it has been challenging for three reasons: 1. Complexity (you need to know algorithms, statistics, computer science principles, and more), 2. Data (acquisition, training, testing, security), and 3. Computational resources (GPUs, distributed computing systems, and so on). As a result, AI is difficult to implement and use. Even when a company has all of the above, it is still not easy to use, so in practice it remains limited to a small group of “experts” within the company.

The Snowflake Data Cloud platform is taking this to the next level by making AI easily available to everyone within the data governance boundary. Following in the footsteps of data democratization, Snowflake is now embarking on a journey toward AI democratization. This helps companies solve complex problems faster, drive innovation, and create value for their stakeholders. So yes, even if someone in the organization knows nothing about AI/ML/LLMs or computer science, they can still easily use AI in Snowflake with a feature called Snowflake Cortex. Snowflake Cortex is a fully managed intelligent service that empowers users to securely tap into the power of generative AI regardless of their technical expertise. All you need is data in the Snowflake platform. Cortex functions come in two categories: ML-powered functions and LLM functions. Earlier I wrote about the ML-powered function for forecasting; you can find the article here.

There are three common use cases that you can implement right away to provide value to your company using Snowflake LLM Functions, with a choice of model (mistral-large, mixtral-8x7b, llama2-70b-chat, mistral-7b, and gemma-7b):

  1. Generative SQL or Conversational Data Analytics: Users without SQL skills can run analytics on data. End users no longer need to wait for IT to provide the reports they are looking for, which sometimes takes months (trust me on the time it takes for IT :-) )
  2. Text Analytics: Streamline analytics of textual data such as reviews, along with content generation tasks such as document summarization, sentiment analysis, and translation. This can be used to build a data engineering pipeline that extracts relevant information from LLMs.
  3. Domain-specific Chatbot: Provide a ChatGPT-like interface to employees and customers for various use cases, such as product knowledge to support customers, HR benefits to quickly answer employee questions, and sales product knowledge to help the sales team work effectively with their customers.
Snowflake Cortex — All within Customer Snowflake Account

These use cases add significant productivity and create real value for the company. Let’s see how you can implement them.

First, Snowflake Cortex is protected by a role, so you can control who can use it and thereby control cost. You need to grant access as follows before you start using it.

CREATE ROLE cortex_role;
-- grant the database role to the cortex user role (best practice)
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_role;
-- grant the cortex user role to a user
GRANT ROLE cortex_role TO USER UPATEL;
-- add the role to the role hierarchy (best practice)
GRANT ROLE cortex_role TO ROLE sysadmin;
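
As a quick smoke test (a minimal sketch, assuming the role and user above, and that cortex_role also has USAGE on a warehouse), you can switch to the new role and call a Cortex function once to confirm the grants work:

-- optional smoke test: run as the newly granted role
USE ROLE cortex_role;
SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-7b', 'Reply with the word OK.');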

Generative SQL:

How to generate and execute SQL from Snowflake easily

There are multiple ways this can be done:
1. Snowflake Copilot: As part of Snowsight, you can talk to your data using Snowflake Copilot. You can find more details at https://www.snowflake.com/blog/copilot-ai-powered-sql-assistant/ Today it is only available in the AWS East and West regions.
2. Text-to-SQL REST API: Copilot can only be used inside Snowsight, so what if you want to integrate text-to-SQL into your own application, or into reporting tools such as Sigma, Tableau, or Power BI? Snowflake is working on a Text-to-SQL REST API that can be easily integrated and will use the world's most powerful LLMs.
3. Prompt engineering using Snowflake Cortex: Let’s explore how you can do this. All you need to do is call the Cortex COMPLETE function: no API, no complex coding. Snowflake made this part very easy; see below:

-- example 1: query your database
select REGEXP_SUBSTR(
SNOWFLAKE.CORTEX.COMPLETE('mistral-7b','### Snowflake SQL tables, with their
properties:### database WORKSHOP01_DB in
schema WORKSHOP_SCHEMA, KAFKA_DB.KAFKA_SCHEMA.CAMPAIGN_SPEND
table CAMPAIGN VARCHAR,CHANNEL VARCHAR,DATE DATE,TOTAL_CLICKS NUMBER,
TOTAL_COST NUMBER,ADS_SERVED NUMBER
### A query to the data so we can see total cost per year/month per channel '),
'```sql([^```]+)',1,1,'e');

-- example 2
select REGEXP_SUBSTR(
SNOWFLAKE.CORTEX.COMPLETE('llama2-70b-chat','
### Snowflake database SQL tables, with the properties ### \
schema SNOWFLAKE.ACCOUNT_USAGE, QUERY_HISTORY table \
QUERY_TEXT VARCHAR,WAREHOUSE_SIZE VARCHAR, START_TIME TIMESTAMP,TOTAL_ELAPSED_TIME NUMBER \
### A sql to show top ten query based on total elapsed time for last day'),
'```sql([^```]+)',1,1,'e');

Once you get the SQL, you may want to add some guardrails before executing it. The first guardrail is to validate the generated SQL. You can do that easily with the EXPLAIN command: if the query is valid, EXPLAIN shows the operations Snowflake would perform to execute it; if there is a syntax error, it throws an error.

Syntax error checking
-- the query above generated SQL like the statement below, which can be validated with EXPLAIN like this
explain using json
SELECT QUERY_TEXT, WAREHOUSE_SIZE, START_TIME, TOTAL_ELAPSED_TIME,
RANK() OVER (ORDER BY TOTAL_ELAPSED_TIME DESC) AS Rank
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATE_TRUNC('day', CURRENT_TIMESTAMP) - INTERVAL '1 day'
ORDER BY TOTAL_ELAPSED_TIME DESC LIMIT 100;

/* for this SQL, EXPLAIN gives the following output

{"GlobalStats":{"partitionsTotal":0,"partitionsAssigned":0,"bytesAssigned":0},
"Operations":[[{"id":0,"operation":"Result","expressions":
["QUERY_HISTORY.QUERY_TEXT","QUERY_HISTORY.WAREHOUSE_SIZE",
"QUERY_HISTORY.START_TIME","QUERY_HISTORY.TOTAL_ELAPSED_TIME",
"RANK() OVER (ORDER BY QUERY_HISTORY.TOTAL_ELAPSED_TIME DESC NULLS FIRST)"]},
{"id":1,"operation":"Sort","expressions":
["QUERY_HISTORY.TOTAL_ELAPSED_TIME DESC NULLS FIRST"],"parentOperators":[0]},
{"id":2,"operation":"WindowFunction","expressions":
["RANK() OVER (ORDER BY QUERY_HISTORY.TOTAL_ELAPSED_TIME DESC NULLS FIRST)"],
"parentOperators":[1]},{"id":3,"operation":"SecureView","objects":
["SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY"],"parentOperators":[2]}]]}

*/

If you get output, the SQL is valid to execute in Snowflake. If not, you can try a different model (there are five choices as of today) and repeat the process.
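
Here is a minimal sketch of that retry loop in Python using Snowpark (the model order, helper name, and regex are illustrative, and it assumes an existing Snowpark session with access to SNOWFLAKE.CORTEX.COMPLETE):

import re
from typing import Optional
from snowflake.snowpark import Session

MODELS = ["mistral-7b", "mixtral-8x7b", "mistral-large"]  # order of preference

def generate_valid_sql(session: Session, prompt: str) -> Optional[str]:
    safe_prompt = prompt.replace("'", "''")  # escape quotes for the SQL string literal
    for model in MODELS:
        raw = session.sql(
            f"SELECT SNOWFLAKE.CORTEX.COMPLETE('{model}', '{safe_prompt}')"
        ).collect()[0][0]
        # pull the SQL out of a ```sql ... ``` block if the model returned one
        match = re.search(r"```sql(.*?)```", raw, re.DOTALL)
        candidate = (match.group(1) if match else raw).strip().rstrip(";")
        try:
            session.sql(f"EXPLAIN USING JSON {candidate}").collect()  # validity check
            return candidate
        except Exception:
            continue  # invalid SQL, try the next model
    return None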

The next guardrail is two-fold. The first part is deciding whether to execute the query or reject it. For that, inspect one of the fields in the output above, for example partitionsAssigned: if that number is in the hundreds of thousands, you may choose to reject the request, since it scans many micro-partitions and might take a long time. There are many other things you can look at in the explain plan to build guardrails against long-running queries. The second part is how much time you want to allocate to run the query. This can be done by setting the parameter STATEMENT_TIMEOUT_IN_SECONDS at the session level.
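
For example, a minimal sketch of both guardrails (the inner query and the 120-second timeout are illustrative; SYSTEM$EXPLAIN_PLAN_JSON returns the same JSON structure shown above):

-- guardrail 1: check how many micro-partitions the generated query would scan
SELECT PARSE_JSON(SYSTEM$EXPLAIN_PLAN_JSON(
  'SELECT QUERY_TEXT, TOTAL_ELAPSED_TIME FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY ORDER BY TOTAL_ELAPSED_TIME DESC LIMIT 10'
)):"GlobalStats"."partitionsAssigned"::NUMBER AS partitions_assigned;

-- guardrail 2: cap query runtime for this session
ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 120;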

With this, you can deploy small applications using Streamlit and give access to analysts or nontechnical users.
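
Here is a minimal Streamlit-in-Snowflake sketch of such an app (the schema hint, model choice, and error handling are illustrative; it assumes the app runs under a role that has the CORTEX_USER database role):

# ask a question, generate SQL with Cortex, validate it with EXPLAIN, then run it
import re
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()

SCHEMA_HINT = (
    "### Snowflake database SQL tables, with the properties ### "
    "schema SNOWFLAKE.ACCOUNT_USAGE, QUERY_HISTORY table "
    "QUERY_TEXT VARCHAR, WAREHOUSE_SIZE VARCHAR, START_TIME TIMESTAMP, TOTAL_ELAPSED_TIME NUMBER "
)

question = st.text_input("Ask a question about the data")
if question:
    prompt = (SCHEMA_HINT + "### " + question).replace("'", "''")
    raw = session.sql(
        f"SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', '{prompt}')"
    ).collect()[0][0]
    match = re.search(r"```sql(.*?)```", raw, re.DOTALL)
    sql_text = (match.group(1) if match else raw).strip().rstrip(";")
    st.code(sql_text, language="sql")
    try:
        session.sql(f"EXPLAIN {sql_text}").collect()      # guardrail: reject invalid SQL
        st.dataframe(session.sql(sql_text).to_pandas())   # run it and show the result
    except Exception as err:
        st.error(f"Generated SQL failed validation or execution: {err}")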

Text Analytics during Data Pipeline:

Oftentimes, when you ingest data, you may want to enrich it with related information as part of the pipeline. For example, if one of the columns in your table is a zip code, you can look up the average annual income for that city and fill in the value as the data is ingested. Your marketing team can then use this derived value to target advertisements based on income in that city.

Here is sample code showing how to achieve this:

CREATE OR REPLACE TABLE CUSTOMERS (
ACCOUNT_NUMBER NUMBER(38,0),
FIRST_NAME VARCHAR(16777216),
LAST_NAME VARCHAR(16777216),
EMAIL VARCHAR(16777216),
PHONE VARCHAR(16777216),
CITY VARCHAR(16777216),
STATE VARCHAR(16777216),
ZIP_CODE NUMBER(38,0),
avg_income number
);

INSERT INTO CUSTOMERS (ACCOUNT_NUMBER, CITY, STATE, EMAIL, FIRST_NAME, LAST_NAME, PHONE, ZIP_CODE)
VALUES (1589420, 'SAN FRANCISCO', 'CA', 'john.doe@yahoo.com', 'JOHN', 'DOE', 1234567890, 94102),
(8028387, 'New York', 'NY', 'bart.simpson@email.com', 'BART', 'SIMPSON', 1012023030, NULL),
(2834123, 'SAN MATEO', 'CA', 'jane.doe@email.com', 'JANE', 'DOE', 3641252911, 94402),
(9821802, 'Palo Alto', 'CA', 'susan.smith@email.com', 'SUSAN', 'SMITH', 1234567891, null)
;

-- leverage llm to find the average income
SELECT to_number(regexp_substr(SNOWFLAKE.CORTEX.COMPLETE('mistral-large',
' need estimate data for data analysis,
what is average income in '||concat_ws(',',city,state)),'\\$[0-9]+[,]*[0-9]*[,]*[0-9]*', 2),'$999,999,999') avgincome from customers;

-- update those values in the table; better yet, keep a separate table for city/state
-- so you do not repeat the call for the same city (see the sketch after this example)
update customers set avg_income=to_number(regexp_substr(SNOWFLAKE.CORTEX.COMPLETE('mistral-large',
' need estimate data for data analysis,
what is average income in '||concat_ws(',',city,state)),'\\$[0-9]+[,]*[0-9]*[,]*[0-9]*', 2),'$999,999,999');

select * from customers;
-- voila, now you have income data based on census figures;
-- an alternative is to use the Snowflake Marketplace
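
Here is a sketch of the separate lookup-table approach mentioned above (the CITY_INCOME table name is illustrative), so each distinct city/state pair costs only one LLM call:

-- one LLM call per distinct city/state instead of one per customer row
CREATE OR REPLACE TABLE CITY_INCOME AS
SELECT city, state,
       to_number(regexp_substr(SNOWFLAKE.CORTEX.COMPLETE('mistral-large',
         ' need estimate data for data analysis,
         what is average income in '||concat_ws(',',city,state)),
         '\\$[0-9]+[,]*[0-9]*[,]*[0-9]*', 2),'$999,999,999') AS avg_income
FROM (SELECT DISTINCT city, state FROM customers);

-- then update customers from the lookup table
UPDATE customers
SET avg_income = i.avg_income
FROM CITY_INCOME i
WHERE customers.city = i.city AND customers.state = i.state;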

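Beyond COMPLETE, Cortex also provides task-specific LLM functions that cover the sentiment analysis, translation, and summarization cases mentioned earlier; a quick sketch (the REVIEWS table and its review_text column are illustrative):

SELECT review_text,
       SNOWFLAKE.CORTEX.SENTIMENT(review_text)             AS sentiment_score,    -- roughly -1 to 1
       SNOWFLAKE.CORTEX.TRANSLATE(review_text, 'de', 'en') AS review_in_english,  -- German to English
       SNOWFLAKE.CORTEX.SUMMARIZE(review_text)             AS review_summary
FROM reviews;
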
Domain-Specific Chatbot and RAG:

Snowflake allows you to store vector data and run vector search out of the box, so you do not need a separate vector database; this breaks down yet another silo. These features are in private preview. If you want to build a nice chatbot for your sales or support organization, you can do it easily using the vector data type and vector search. There are multiple Medium posts and quickstart guides written on this to get you started.
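
Here is a minimal retrieval sketch of the idea, assuming the vector data type and the Cortex EMBED_TEXT_768 and VECTOR_COSINE_SIMILARITY functions are enabled in your account (the DOCS_CHUNKS table, embedding model, and prompt are illustrative):

-- illustrative table of pre-chunked documents with a 768-dimensional embedding column
CREATE OR REPLACE TABLE DOCS_CHUNKS (
    chunk_text VARCHAR,
    chunk_vec  VECTOR(FLOAT, 768)
);

-- embed each chunk once at load time
UPDATE DOCS_CHUNKS
SET chunk_vec = SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', chunk_text);

-- at question time: embed the question, retrieve the closest chunks,
-- and pass them to COMPLETE as context
SET question = 'What is the return policy for enterprise customers?';

WITH top_chunks AS (
    SELECT chunk_text
    FROM DOCS_CHUNKS
    ORDER BY VECTOR_COSINE_SIMILARITY(
               chunk_vec,
               SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', $question)) DESC
    LIMIT 3
)
SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large',
         'Answer the question using only this context: ' ||
         LISTAGG(chunk_text, '\n') || ' Question: ' || $question) AS answer
FROM top_chunks;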

Conclusion:

Snowflake lets you use LLM functionality easily, in a secure and cost-effective way: no upfront investment and no computer science degree are needed, and time to market is faster for delivering these capabilities to the business. You can get up and running with the above use cases in a day, if not hours. Of course, if you want to bring your own LLM and train it, you can do that in Snowflake using Snowpark Container Services, but that is for more advanced use cases, since you need to justify the cost of training a model. With Snowflake Cortex you will be able to cover most use cases, and you can do it in SQL or Python.
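
For the Python side, here is a small sketch using the Cortex Python API from the snowflake-ml-python package (the connection parameters are placeholders you would fill in yourself):

# same call as SNOWFLAKE.CORTEX.COMPLETE in SQL, from Python
from snowflake.cortex import Complete
from snowflake.snowpark import Session

# illustrative placeholders: use your own account, credentials, and warehouse
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "cortex_role", "warehouse": "<warehouse>",
}
session = Session.builder.configs(connection_parameters).create()

answer = Complete("mistral-large", "Summarize what Snowflake Cortex is in one sentence.", session=session)
print(answer)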

Here is the most important FAQ one needs to review before using LLMs: https://www.snowflake.com/en/legal/snowflake-ai-trust-and-safety/

Disclaimer: The opinions expressed in this post are my own and not necessarily those of my employer (Snowflake).
