Snowflake Brings Apps, AI, Generative AI & LLMs to Your Data — July 2023

The Data Cloud

The Snowflake Data Cloud has revolutionized the way organizations process and analyze data. But did you know that Snowflake does more than just analyze data? Thanks to the integration of generative artificial intelligence (AI), Snowflake opens up a whole new field of creative possibilities. In this blog post, we'll delve into the fascinating world of generative AI in Snowflake and explore how it can unlock innovation and improve decision-making.

Understanding Generative AI

Generative AI refers to the use of machine learning models to create new content, such as images, text, music and even entire virtual worlds. Unlike traditional AI, which focuses on analyzing existing data, generative AI goes further and generates new results based on patterns and knowledge drawn from vast datasets. This technology has immense potential for a variety of applications, including content creation, personalized recommendations and simulation.

Snowflake’s Acquisitions

Streamlit

Streamlit’s open-source framework enables developers and data scientists to create and share interactive data applications very quickly, without the need to be an expert in front-end development.

Neeva

Snowflake acquired Neeva to accelerate search in the Data Cloud through generative AI.

Applica

About 80% of the world's data is unstructured: data locked within documents, emails, web pages, images, and comments on blogs and social media sites. So Snowflake acquired Applica and launched Document AI to make extracting insights from documents simple.

Strategic Partnerships

Nvidia

Snowflake has announced a new container service and a partnership with Nvidia to facilitate building generative AI applications on all of your data and running them on GPUs using the Nvidia NeMo framework.

Microsoft

Snowflake is extending its partnership with Microsoft to bring large-scale generative AI models and machine learning capabilities to the Data Cloud and to simplify their usage.

AI & Snowpark: Bring ML to your data

As a foundation, there is central, governed access to data (of any structure) in your organization, with the flexibility to easily access data from external sources through the Snowflake Marketplace.

To reliably process features, models, and applications, Snowflake provides a flexible and elastic compute engine that multiple teams can use to process their data without moving it to ungoverned external environments, bringing AI/ML to the data.

Snowpark DataFrame API (General Availability)

What’s it?

Enables users to manipulate and process the data stored in Snowflake by leveraging the unique Snowflake Engine without the need to move the data outside.

Example (Python):

import json

from snowflake.snowpark import functions as fn
from snowflake.snowpark import version
from snowflake.snowpark.session import Session

print(f"Snowflake Snowpark version is: {version.VERSION}")

# Create a session from a local connection parameters file
session = Session.builder.configs(json.load(open("../connection.json"))).create()
print(session.sql("select current_role(), current_warehouse(), current_database(), current_schema()").collect())

# Lazily reference a Snowflake table as a Snowpark DataFrame
train_dataset_name = "TRAIN_DATASET"
train_dataset = session.table(train_dataset_name)
train_dataset.show()

# Inspect rows where the SENTIMENT label is missing
train_dataset.where(train_dataset["SENTIMENT"].is_null()).show()

# Derive a numeric flag column from the SENTIMENT label
train_dataset_flag = train_dataset.with_column(
    "SENTIMENT_FLAG",
    fn.when(train_dataset["SENTIMENT"] == "positive", 1).otherwise(2),
)
train_dataset_flag.show()

Why does it matter (added value)?

  • Simply interact with and process data stored in Snowflake
  • No need to move or transfer data outside of Snowflake
  • No data duplication
  • Leverage the unique Snowflake Engine
  • Unlimited scalability
  • Increase performance of data preparation
  • Save money

Snowpark-optimized warehouses (Public Preview)

What’s it?

It’s a Snowflake Virtual Warehouse which provide 16x memory and 10X cache per node compared to a standard Snowflake virtual warehouse.

Why does it matter (added value)?

  • Train models that have large memory requirements
  • Provide a speedup when cached artifacts (Python packages, intermediate results, JARs, etc.) are reused on subsequent runs
  • Use more memory and cache within the same product

Snowpark ML Modeling API (Public Preview)

What’s it?

Snowpark ML is a new library snowflake-ml for faster and more intuitive end-to-end ML development in Snowflake. The Snowpark ML Modeling API scales out feature engineering and simplifies ML training execution.

Example (Python) of preprocessing with snowflake.ml.modeling.preprocessing:

import snowflake.ml.modeling.preprocessing as snowmlpp

def fit_scaler(session, df):
    # Columns to scale; fill in with your numeric column names
    mm_target_columns = []
    mm_target_cols_out = []
    snowml_mms = snowmlpp.MinMaxScaler(input_cols=mm_target_columns, output_cols=mm_target_cols_out)
    snowml_mms.fit(df)
    return snowml_mms

def fit_ohe(session, df):
    # Columns to one-hot encode; fill in with your categorical column names
    target_cols = []
    output_cols = []
    snowml_ohe = snowmlpp.OneHotEncoder(input_cols=target_cols, output_cols=output_cols)
    snowml_ohe.fit(df)
    return snowml_ohe

# final_transactions is a Snowpark DataFrame prepared in earlier steps
snowml_mms = fit_scaler(session, final_transactions)
normed_df = snowml_mms.transform(final_transactions)
snowml_ohe = fit_ohe(session, normed_df)
ohe_df = snowml_ohe.transform(normed_df)

Example (Python) of training with snowflake.ml.modeling.xgboost:

from snowflake.ml.modeling.xgboost import XGBClassifier

def fit_models(session, input_df, feature_column_names, label_column_name, output_cols):
    # Split into train / test / holdout sets
    train_df, test_df, extra_df = input_df.random_split(weights=[0.5, 0.2, 0.3], seed=42)
    models = []
    # Simple grid search over two hyperparameters
    for n_estimators in [50, 100, 150]:
        for learning_rate in [0.1, 0.2]:
            model = XGBClassifier(
                input_cols=feature_column_names,
                label_cols=label_column_name,
                output_cols=output_cols,
                n_estimators=n_estimators,
                learning_rate=learning_rate,
            )
            model.fit(train_df)
            models.append(model)
    return models, train_df, test_df, extra_df

models, train_df, test_df, extra_df = fit_models(session, input_df, feature_column_names, label_column_name, output_cols)

Why does it matter (added value)?

  • Use popular ML tools with Snowpark
  • Improve performance and scalability with distributed, multi-node execution for common feature engineering functions
  • Run training for the most common scikit-learn and XGBoost models without manually creating stored procedures or UDFs

Snowpark Model Registry: MLOps (Private Preview)

What’s it?

Snowpark ML Operations for model deployment. Unified, governed repository for an organization’s ML models to help streamline and scale MLOps.

Example (Python) of using snowflake.ml.registry (MLOps):

from snowflake.ml.registry import model_registry

def create_registry(session):
    # Create the registry once, then open a reference to it
    model_registry.create_model_registry(session=session)
    registry = model_registry.ModelRegistry(session=session)
    return registry

def log_snowml_xgb_model(session, registry, model, train_df, model_version, f1, accuracy, cm):
    # Log the model with a small sample of input data for signature inference
    X = train_df.select(feature_column_names).limit(100).to_pandas()
    model_id = registry.log_model(model=model, model_name="FRAUD_DETECTION_XGB", model_version=model_version, sample_input_data=X[:10])

    # Attach metrics and a description to the registered model
    registry_model = model_registry.ModelReference(registry=registry, model_name="FRAUD_DETECTION_XGB", model_version=model_version)
    registry_model.set_metric(metric_name="f1", metric_value=f1)
    registry_model.set_metric(metric_name="accuracy", metric_value=accuracy)
    registry_model.set_metric(metric_name="cm", metric_value=cm)
    registry_model.set_model_description(description="Demo XGBClassifier trained using Snowpark ML to predict whether a transaction is fraudulent or not")
    return registry_model

registry = create_registry(session)
modelv = 0
for model in models:
    MODEL_VERSION = "v{}".format(modelv)
    # calculate_test_metrics is a helper defined elsewhere in the notebook
    accuracy, f1, cm = calculate_test_metrics(session, model, test_df)
    registry_model = log_snowml_xgb_model(session, registry, model, train_df, MODEL_VERSION, f1, accuracy, cm)
    modelv += 1

registry.list_models().to_pandas()

Why does it matter (added value)?

  • Publish, discover and share models with a central, governed view of model artifacts and metadata
  • Democratize access by exposing models and results to coders and SQL users
  • Simplify and scale MLOps approach

Snowpark Container Services (Private Preview)

What’s it?

Additional Snowpark runtime that allows developers to effortlessly register, deploy and run containerized data applications using a secure Snowflake-managed infrastructure with configurable hardware options such as compute acceleration with NVIDIA GPUs.

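As an illustration, here is a minimal sketch of creating a compute pool and a containerized service from a Snowpark session. Snowpark Container Services is in private preview, so the exact syntax may differ; the compute pool, image path and service name below are hypothetical:

# Hypothetical sketch (private preview): exact syntax and options may differ.
# Assumes an existing Snowpark `session` and an image already pushed to an
# image repository in the account.
session.sql("""
    CREATE COMPUTE POOL my_gpu_pool
      MIN_NODES = 1
      MAX_NODES = 1
      INSTANCE_FAMILY = GPU_NV_S
""").collect()

session.sql("""
    CREATE SERVICE my_llm_service
      IN COMPUTE POOL my_gpu_pool
      FROM SPECIFICATION $$
        spec:
          containers:
          - name: llm
            image: /my_db/my_schema/my_repo/llm_app:latest
      $$
""").collect()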

Why does it matter (added value)?

  • Build and deploy a container image in any programming language, with any code and packages.
  • Train and run inference for large-scale models using GPUs, with any language including Python and R.
  • Run open source LLMs, and third-party LLMs tuned to your data.
  • Deploy effortlessly with an integrated image registry, elastic compute infrastructure and Kubernetes-based managed cluster.
  • Bring customized applications to the data.

ML-Powered Functions (Public Preview)

What’s it?

SQL functions that abstract complexity of ML frameworks with algorithms for Forecasting, Anomaly Detection, and Contribution Explorer.

Forecasting

Build more reliable time series forecasts with automated handling of seasonality, missing values and more. The forecasting algorithm is powered by a gradient boosting machine (GBM). Like an ARIMA model, it uses a differencing transformation to model data with a non-stationary trend, and it uses auto-regressive lags of the historical target data as model features.

Example (SQL):

// This shows training & prediction for revenues in daily_revenue_v
create snowflake.ml.forecast revenue_projector(
    input_data => SYSTEM$REFERENCE('VIEW', 'daily_revenue_v'),
    timestamp_colname => 'ts',
    target_colname => 'revenue'
);

// The model is now ready for prediction.
call revenue_projector!forecast(
    forecasting_periods => 30, // how far out to project
    config_object => {'prediction_interval': 0.9} // optional: range of values with this probability
);

Anomaly Detection

Identify outliers and trigger alerts or help find unlikely-to-happen-again situations that should be excluded from analysis. The anomaly detection algorithm is powered by a gradient boosting machine (GBM).

Example (SQL):

-- Set up a task to train your model on a weekly basis.
create or replace task train_anomaly_detection_task
    warehouse = LARGE_WAREHOUSE
    SCHEDULE = 'USING CRON 0 0 * * 0 America/Los_Angeles' -- Run at midnight every Sunday.
as EXECUTE IMMEDIATE
$$
begin
    create or replace snowflake.ml.ANOMALY_DETECTION my_model(
        input_data => SYSTEM$REFERENCE('VIEW', 'view_of_your_input_data'),
        timestamp_colname => 'ts',
        target_colname => 'y',
        label_colname => '');
end;
$$;

-- Start your task's execution.
alter task train_anomaly_detection_task resume;

-- Create a table to store your anomaly detection results.
create or replace table anomaly_detection_results (
    ts timestamp_ntz,
    y float,
    forecast float,
    lb float,
    ub float,
    is_anomaly boolean,
    percentile float,
    distance float
);

-- Call your model to detect anomalies on a daily basis.
create or replace task detect_anomalies_task
    warehouse = LARGE_WAREHOUSE
    SCHEDULE = 'USING CRON 0 0 * * * America/Los_Angeles' -- Run at midnight, daily.
as EXECUTE IMMEDIATE
$$
begin
    call my_model!detect_anomalies(
        input_data => SYSTEM$REFERENCE('VIEW', 'view_of_your_data_to_monitor'),
        timestamp_colname => 'ts',
        target_colname => 'y',
        config_object => {'prediction_interval': 0.99});
    insert into anomaly_detection_results (ts, y, forecast, lb, ub, is_anomaly, percentile, distance)
        select * from table(result_scan(last_query_id()));
end;
$$;

-- Start your task's execution.
alter task detect_anomalies_task resume;

-- Set up an alert based on the results from anomaly detection.
CREATE OR REPLACE ALERT anomaly_detection_alert
    WAREHOUSE = LARGE_WAREHOUSE
    SCHEDULE = 'USING CRON 0 1 * * * America/Los_Angeles' -- Run at 1 am, daily.
    IF (EXISTS (select * from anomaly_detection_results where is_anomaly = True and ts > dateadd('day', -1, current_timestamp())))
    THEN
        call SYSTEM$SEND_EMAIL(
            'SNOWML_ANOMALY_DETECTION_ALERTS',
            'last.first@youremail.com',
            'Anomaly Detected in data stream',
            concat(
                'Anomaly Detected in data stream. ',
                'Value outside of confidence interval detected.'
            )
        );

-- Start your alert's execution.
alter alert anomaly_detection_alert resume;

Contribution Explorer

Quickly identify drivers contributing to the change of a given metric across user defined time intervals.

Example (SQL):

with input as (
    select
        // Select dimensions to "mine"
        {
            'country': input_table.dim_country,
            'vertical': input_table.dim_vertical
        } as categorical_dimensions,
        {
        } as continuous_dimensions, // less common but available
        // This is the metric for comparison
        input_table.kpi,
        // Label control & test periods for comparison
        iff(ds between '2020-08-01' and '2020-08-20', TRUE, FALSE) as label
    from input_table
    where (ds between '2020-05-01' and '2020-05-20')
       or (ds between '2020-08-01' and '2020-08-20')
)
// Now use the above data to compare contributions of dimension segments
select res.* from input, table(
    snowflake.ml.top_insights(
        input.categorical_dimensions,
        input.continuous_dimensions,
        CAST(input.kpi as float),
        input.label
    )
    over (partition by 0)
) res order by res.relative_change desc;

Why does it matter (added value)?

  • Abstract complexity of ML frameworks and algorithms for forecasting, anomaly detection & more with SQL functions
  • Surface insights in analytics/BI tools while keeping Snowflake's consistent data governance across model inputs and outputs
  • Open access to ML models for business analysts without ML skills
  • Scale from one to millions of ML-based insights with Snowflake’s elasticity and near-zero operations engine

Application: Bring Apps to your data

Following the same vision and mindset, Snowflake brings Apps to the data.

Snowflake Native Apps Framework (Public Preview on AWS)

What’s it?

The aim is to create, distribute, deploy, operate and monetize applications natively in the data cloud. It could be logic, intelligence, code, etc that we package in an Object called an Application, which can be created simply using SQL.

So, customers will be able to access leading LLMs directly via the Snowflake marketplace, and install them to run entirely within their Snowflake accounts.
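As a minimal sketch of the flow (the object names are illustrative, and the staged code path holding the manifest and setup script will vary per project):

# Minimal sketch of the Native Apps flow; names below are illustrative.
session.sql("CREATE APPLICATION PACKAGE hello_snowflake_pkg").collect()

# Register a version from code (manifest.yml + setup script) uploaded to a stage
session.sql("""
    ALTER APPLICATION PACKAGE hello_snowflake_pkg
      ADD VERSION v1 USING '@hello_snowflake_pkg.code.source'
""").collect()

# Consumers (or the provider, for testing) then install the app from the package
session.sql("""
    CREATE APPLICATION hello_snowflake_app
      FROM APPLICATION PACKAGE hello_snowflake_pkg
      USING VERSION v1
""").collect()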

Why does it matter (added value)?

  • Secure data and IP
  • Earn money by monetizing apps on the Snowflake Marketplace
  • Get instant access to the applications that put your data to work
  • Reduce security and provisioning hurdles by installing the application in your Snowflake account, without moving the data

Streamlit in Snowflake (Public Preview soon)

What’s it?

Benoît Dageville, co-founder and president of products at Snowflake, said that we have both the same vision — Streamlit and Snowflake — which is all about democratizing access to data. I would describe it very simply as making it super easy to interact with data.

Example (Python):

# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session

# Write directly to the app
st.title("Example Streamlit App :balloon:")
st.write(
    """Replace this example with your own code!
    **And if you're new to Streamlit,** check
    out our easy-to-follow guides at
    [docs.streamlit.io](https://docs.streamlit.io).
    """
)

# Get the current credentials
session = get_active_session()

# Use an interactive slider to get user input
hifives_val = st.slider(
    "Number of high-fives in Q3",
    min_value=0,
    max_value=90,
    value=60,
    help="Use this to enter the number of high-fives you gave in Q3",
)

# Create an example dataframe
# Note: this is just some dummy data, but you can easily connect to your Snowflake data
# It is also possible to query data using raw SQL using session.sql() e.g. session.sql("select * from table")
created_dataframe = session.create_dataframe(
    [[50, 25, "Q1"], [20, 35, "Q2"], [hifives_val, 30, "Q3"]],
    schema=["HIGH_FIVES", "FIST_BUMPS", "QUARTER"],
)

# Execute the query and convert it into a Pandas dataframe
queried_data = created_dataframe.to_pandas()

# Create a simple bar chart
# See docs.streamlit.io for more types of charts
st.subheader("Number of high-fives")
st.bar_chart(data=queried_data, x="QUARTER", y="HIGH_FIVES")
st.subheader("Underlying data")
st.dataframe(queried_data, use_container_width=True)

Why does it matter (added value)?

  • Build apps near the data
  • Fast time to insights
  • Fast Dev to Prod
  • Simplify app development: make and preview changes with a side-by-side editor

Generative AI: Bring LLMs to your data

Generative artificial intelligence, or generative AI, is a type of artificial intelligence system capable of generating new content that resembles human-generated data, such as text, images or other media, in response to prompts. Unlike traditional AI models that are trained to recognize patterns in existing data (such as image recognition or language translation), generative AI models can generate new data that was not explicitly seen during their training phase.

LLMs (large language models) are transformer-based neural networks. These foundation models use generative AI (more specifically, deep learning) for natural language processing (NLP) and natural language generation (NLG).

Snowflake provides several ways to bring LLMs to your data:

Native Embedded LLMs: Document AI (Applica)

What’s it?

Document IA is a new interface to easily analyze and extract content from Unstructured Data (documents, emails, social media, …) via a built-in Large Language Model.

Users can just ask questions in natural language and automatically get answers. It is possible to train the LLM behind as well with your own Data.
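As a purely hypothetical sketch of what querying a trained model could look like (the model name, stage and invocation below are assumptions, not a confirmed interface):

# Hypothetical: assumes a Document AI model `invoice_model` trained on invoices
# and a stage @doc_stage containing the documents to process.
session.sql("""
    SELECT invoice_model!PREDICT(GET_PRESIGNED_URL(@doc_stage, RELATIVE_PATH), 1)
    FROM DIRECTORY(@doc_stage)
""").collect()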

Why does it matter (added value)?

Easier to extract analytical value from documents without requiring machine learning expertise.

Native Embedded LLMs (In Development): Conversational Text-to-Code

What’s it?

Conversational Text-to-Code aims to make SQL query creation easier and more user friendly. It will essentially act as coding assistant in Snowflake Worksheets.

Why does it matter (added value)?

  • Open data access and processing to business users without needing to know SQL
  • Accelerate extracting insights from the data

LLM-powered Streamlit apps

Here’s is a quick example of SnowChat with Streamlit :

Snowpark Container Services

LLMs, such as those from Hugging Face, can run inside Snowpark Container Services.

LLM-Powered Marketplace Search

What’s it?

Use natural language to discover data and applications on the Snowflake marketplace based on business questions.

Why does it matter (added value)?

  • Easily search content (datasets, apps, …) on the Snowflake Marketplace using natural language.
  • Quickly find content on the Marketplace

External LLMs

What’s it?

Connect to externally hosted LLMs via Snowpark External Access
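A minimal sketch of wiring this up (the network rule, integration and function names are illustrative, and External Access syntax may evolve while in preview):

# Illustrative names; requires appropriate privileges. Authentication
# (e.g. via Snowflake secrets) is omitted from this sketch.
session.sql("""
    CREATE OR REPLACE NETWORK RULE openai_rule
      MODE = EGRESS TYPE = HOST_PORT
      VALUE_LIST = ('api.openai.com')
""").collect()

session.sql("""
    CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION openai_integration
      ALLOWED_NETWORK_RULES = (openai_rule)
      ENABLED = TRUE
""").collect()

# A Python UDF that is allowed to reach the external endpoint
session.sql("""
    CREATE OR REPLACE FUNCTION ask_gpt(prompt STRING)
      RETURNS STRING
      LANGUAGE PYTHON
      RUNTIME_VERSION = '3.10'
      HANDLER = 'ask'
      EXTERNAL_ACCESS_INTEGRATIONS = (openai_integration)
      PACKAGES = ('requests')
      AS $$
import requests

def ask(prompt):
    # Call the external LLM API (authentication omitted in this sketch)
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        json={"model": "gpt-3.5-turbo",
              "messages": [{"role": "user", "content": prompt}]},
    )
    return resp.text
$$
""").collect()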

Why does it matter (added value)?

Ability to use external LLM providers such as OpenAI, Cohere, Anthropic, etc. via their APIs.

Conclusion

Snowflake brings apps, AI and LLMs near the data. The combination of generative AI and Snowflake's data capabilities enables organizations to make data-driven decisions, uncover hidden insights and unlock new business opportunities. By continuing to harness the power of generative AI within Snowflake, the possibilities for intelligent and creative applications are limitless.
