So You Want to… Build an ML Model As Quickly As Possible in Snowflake
Addendum/Caveat: Snowpark ML makes model creation even quicker, see my colleague Mats Stellwall’s blog on it. the below is still useful, e.g. if creating pipelines or building models that aren’t covered by Snowpark ML, so read on if that’s you
There are some great resources about building some very cool things in Snowflake, but sometimes they can be intimidating. Maybe you’re not ready to parallelise a grid search over thousands of parameters, or looking for ways to transfer learn on a deep neural net. Maybe you just want to get started. If this is you then you’re in luck!
This post shows you two ways to get a model into Snowflake as quickly as possible. The actual code is minimal, in fact most of your time will be setting up Jupyter if you don’t have it, or logging in to Snowflake. There’s nothing fancy or even “proper” here to distract so no train test splits, no feature generation, no transformations etc. etc.). I’ve even synthesised the data using scikit-learn’s make_classification data, so we’re just taking an arbitrary set of input features (X) and and prediction target (y). Naturally, in the real world that would look something like loan details vs loan default, or customer details vs churn, you can use your imagination to fill in those blanks.
So without further ado, two ways to get started:
- In Jupyter : my preferred option, but assumes you already have access to a notebook (or are comfortable sorting it out)
- In a Snowflake Worksheet: the route less travelled but arguably easier if you don’t have easy access to a notebook environment
Note I assume you have a Snowflake account, if you don’t, go here!
1. In Jupyter (or any IDE really)
1.1 First, Make a JSON With Your Credentials
Go into you terminal, or file browser, or however you like to make files and create a creds.json file with your snowflake details. It should look something like this
{
"account": "your_account_name",
"user": "your_user_name",
"password": "insert_your_pwd_here",
"role": "ACCOUNTADMIN"
}
1.2 Next, Make Sure You Have Anaconda Installed
Easiest way to do this is to download Anaconda from here, install and open it and then click on the Jupyter Notebook tile
1.3 Next, Get Yourself Set Up Right
The base python environment should have most of the packages, but probably won;’t have the Snowflake specific stuff, so follow these instructions:
Open up a command line terminal and type in the following from the same folder that you created the environment.yml file
>> conda create --name pysnowpark_ml -c https://repo.anaconda.com/pkgs/snowflake python=3.9
>> conda activate pysnowpark_ml
>> conda install snowflake-snowpark-python scikit-learn pandas joblib cachetools jupyterlab
Then finally start your Jupyter lab (or notebook, or really wherever you intend to write the python code)
jupyter lab
Which should create a sheet that looks something like:
And then click on the New button like this and select a Python3 Notebook
Which gets you to here
1.4 Next, Copy and Execute These Cells In Order
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import sproc
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import pandas as pd
import json
import os
import sys
from sklearn.linear_model import LogisticRegression
from cachetools import cached
snowflake_connection_cfg = json.loads(open("creds.json").read())
session = Session.builder.configs(snowflake_connection_cfg).create()
session.sql("CREATE OR REPLACE WAREHOUSE VB_VH WITH WAREHOUSE_SIZE='X-SMALL'").collect()
session.sql("CREATE OR REPLACE DATABASE VB_DB").collect()
session.sql("CREATE OR REPLACE STAGE ML_MODELS").collect()
from sklearn.datasets import make_classification
import pandas as pd
columns = [str(i) for i in range(0,10)]
X,y = make_classification(n_samples=100000, n_features=10, n_classes=2)
df = pd.DataFrame(X, columns=columns)
feature_cols = df.columns
df['Y'] = y
session.write_pandas(df, table_name='ML_DATASET', auto_create_table=True, overwrite=True)
@sproc(session=session, name='get_scores_sf', stage_location='@ML_MODELS',
packages=['snowflake-snowpark-python', 'scikit-learn', 'numpy'],
is_permanent=True,
replace=True)
def train_model(session: Session, train_table: str, feature_cols: list, target_col: str, model_name: str) -> str:
from sklearn.linear_model import LogisticRegression
from joblib import dump
local_training_data = session.table(train_table).to_pandas()
X,y = local_training_data[feature_cols], local_training_data[target_col]
lrmodel = LogisticRegression().fit(X, y)
dump(lrmodel, '/tmp/'+model_name)
session.file.put('/tmp/'+model_name, '@ML_MODELS', auto_compress=False, overwrite=True)
return('ML Model trained and saved')
train_model("ML_DATASET", list(feature_cols), 'Y', 'lr_model.sav')
@cached(cache={})
def load_model(model_path: str) -> object:
from joblib import load
model = load(model_path)
return model
def udf_model(df: pd.DataFrame) -> pd.Series:
model = load_model(sys._xoptions["snowflake_import_directory"]+'lr_model.sav')
scored_data = pd.Series(model.predict(df))
return scored_data
udf_model = session.udf.register(session=session,func=udf_model, name="udf_score_lr_model", stage_location='@ML_MODELS',
input_types=[T.FloatType()]*len(feature_cols),
return_type=T.FloatType(),
replace=True, is_permanent=True,
max_batch_size=1000,
imports=['@ML_MODELS/lr_model.sav'],
packages=['scikit-learn==1.1.1','pandas','joblib','cachetools'])
test_sdf = session.table('ML_DATASET')
test_sdf.with_column('Y_PREDICTION', udf_model(*feature_cols)).to_pandas()
1.5 Admire Your Handiwork
If you’ve follwed all the steps it should result in something like this
In Snowflake Worksheet
This will look pretty much the same, but naturally there will be some tweaks that reflect the different IDE
First, Log in to Snowflake
Next, Create a Python Worksheet
Next, Choose a Virtual Warehouse and Database
Next Make Sure Your Dependency Libraries are Imported
In the packages drop down search for and tick scikit-learn, cachetools and joblib
Next Copy This Code into the Sheet
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import sproc
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import pandas as pd
import os
import sys
from sklearn.linear_model import LogisticRegression
from cachetools import cached
from joblib import load
from sklearn.datasets import make_classification
def main(session: snowpark.Session):
# First Create Some Synthetic Data
columns = [str(i) for i in range(0,10)]
X,y = make_classification(n_samples=100000, n_features=10, n_classes=2)
df = pd.DataFrame(X, columns=columns)
feature_cols = df.columns
df['Y'] = y
# Next, Save It To Snowflake
session.write_pandas(df, table_name='ML_DATASET',
auto_create_table=True, overwrite=True)
# Next, Create a Sproc (here we register it with a decorator @sproc)
session.sql("CREATE OR REPLACE STAGE ML_MODELS").collect()
@sproc(session=session, name='get_scores_sf', stage_location='@ML_MODELS',
packages=['snowflake-snowpark-python', 'scikit-learn', 'numpy'],
is_permanent=True,
replace=True)
def train_model(session: snowpark.Session, train_table: str, feature_cols: list,
target_col: str, model_name: str) -> str:
from sklearn.linear_model import LogisticRegression
from joblib import dump
local_training_data = session.table(train_table).to_pandas()
X,y = local_training_data[feature_cols], local_training_data[target_col]
lrmodel = LogisticRegression().fit(X, y)
dump(lrmodel, '/tmp/'+model_name)
session.file.put('/tmp/'+model_name, '@ML_MODELS', auto_compress=False, overwrite=True)
# Next, Run the Sproc (and save the model as part of the process)
train_model("ML_DATASET", list(feature_cols),
"Y", "lr_model.sav")
# Next, Create a UDF using the Model we just trained
@cached(cache={})
def load_model(model_path: str) -> object:
from joblib import load
model = load(model_path)
return model
def udf_model(df: pd.DataFrame) -> pd.Series:
model = load_model(sys._xoptions["snowflake_import_directory"]+
'lr_model.sav')
scored_data = pd.Series(model.predict(df))
return scored_data
udf_model = session.udf.register(session=session,func=udf_model,
name="udf_score_lr_model",
stage_location='@ML_MODELS',
input_types=[T.FloatType()]*len(feature_cols),
return_type=T.FloatType(),
replace=True, is_permanent=True,
max_batch_size=1000,
imports=['@ML_MODELS/lr_model.sav'],
packages=['scikit-learn==1.1.1','pandas',
'joblib','cachetools'])
# Finally, Execute the UDF and get your results
test_sdf = session.table('ML_DATASET')
results = test_sdf.with_column('Y_PREDICTION', udf_model(*feature_cols))
results.write.mode("overwrite").save_as_table('ML_DATASET_PREDICTION')
return results
Next Run (click that blue box in the top right)
Conclusion
So there we have it, two different ways to get data a machine learning model into snowflake with the minimum of fuss. Of course, the robust production grade equivalent would look quite different, with github, ml ops, more comments and docstrings and so forth in place, but this should give you a sense of just how easy it is to deploy a model to Snowflake. More importantly, once it’s deployed you have near limitless scalability at your fingertips because using a UDF to host a model makes for effortless batch processing.