So You Want to… Build an ML Model As Quickly As Possible in Snowflake


Addendum/Caveat: Snowpark ML makes model creation even quicker; see my colleague Mats Stellwall’s blog on it. The below is still useful, though, e.g. if you’re creating pipelines or building models that aren’t covered by Snowpark ML, so read on if that’s you

There are some great resources about building some very cool things in Snowflake, but sometimes they can be intimidating. Maybe you’re not ready to parallelise a grid search over thousands of parameters, or to transfer-learn on a deep neural net. Maybe you just want to get started. If this is you, then you’re in luck!

This post shows you two ways to get a model into Snowflake as quickly as possible. The actual code is minimal; in fact, most of your time will be spent setting up Jupyter if you don’t have it, or logging in to Snowflake. There’s nothing fancy or even “proper” here to distract you (no train/test splits, no feature generation, no transformations, etc.). I’ve even synthesised the data using scikit-learn’s make_classification function, so we’re just taking an arbitrary set of input features (X) and a prediction target (y). Naturally, in the real world that would look something like loan details vs loan default, or customer details vs churn; you can use your imagination to fill in those blanks.

So without further ado, two ways to get started:

  1. In Jupyter: my preferred option, but it assumes you already have access to a notebook (or are comfortable sorting one out)
  2. In a Snowflake Worksheet: the route less travelled but arguably easier if you don’t have easy access to a notebook environment

Note: I assume you have a Snowflake account; if you don’t, go here!

1. In Jupyter (or any IDE really)

1.1 First, Make a JSON With Your Credentials

Go into your terminal, file browser, or however you like to make files, and create a creds.json file with your Snowflake details. It should look something like this:

{
  "account": "your_account_name",
  "user": "your_user_name",
  "password": "insert_your_pwd_here",
  "role": "ACCOUNTADMIN"
}
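
Incidentally, once you’ve got the environment from step 1.3 set up, you can sanity-check those credentials with a couple of lines like this (a minimal sketch, assuming creds.json sits in the folder you’ll run the notebook from):

# minimal connection check (a sketch): run this once the environment from step 1.3 exists
import json
from snowflake.snowpark.session import Session

creds = json.loads(open("creds.json").read())  # assumes creds.json is in the working directory
session = Session.builder.configs(creds).create()
print(session.sql("SELECT CURRENT_VERSION()").collect())  # should print the Snowflake version
session.close()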

1.2 Next, Make Sure You Have Anaconda Installed

The easiest way to do this is to download Anaconda from here, install and open it, and then click on the Jupyter Notebook tile

1.3 Next, Get Yourself Set Up Right

The base Python environment should have most of the packages, but probably won’t have the Snowflake-specific stuff, so follow these instructions:

Open up a command line terminal in the same folder where you created the creds.json file (so the notebook can find it later) and type in the following

>> conda create --name pysnowpark_ml -c https://repo.anaconda.com/pkgs/snowflake python=3.9
>> conda activate pysnowpark_ml
>> conda install snowflake-snowpark-python scikit-learn pandas joblib cachetools jupyterlab

Then finally start JupyterLab (or Jupyter Notebook, or really wherever you intend to write the Python code)

jupyter lab

This should open JupyterLab in your browser.

Then click on the New button and select a Python 3 Notebook.

You should now be looking at an empty notebook, ready for the next step.

1.4 Next, Copy and Execute These Cells In Order

# Cell 1: imports
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import sproc
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import pandas as pd
import json
import os
import sys
from sklearn.linear_model import LogisticRegression
from cachetools import cached

# Cell 2: connect to Snowflake and create a warehouse, database and stage to work in
snowflake_connection_cfg = json.loads(open("creds.json").read())
session = Session.builder.configs(snowflake_connection_cfg).create()
session.sql("CREATE OR REPLACE WAREHOUSE VB_VH WITH WAREHOUSE_SIZE='X-SMALL'").collect()
session.sql("CREATE OR REPLACE DATABASE VB_DB").collect()
session.sql("CREATE OR REPLACE STAGE ML_MODELS").collect()

# Cell 3: synthesise a classification dataset and write it to a Snowflake table
from sklearn.datasets import make_classification
columns = [str(i) for i in range(0, 10)]
X, y = make_classification(n_samples=100000, n_features=10, n_classes=2)
df = pd.DataFrame(X, columns=columns)
feature_cols = df.columns
df['Y'] = y
session.write_pandas(df, table_name='ML_DATASET', auto_create_table=True, overwrite=True)
# Cell 4: define a stored procedure that trains the model and saves it to the stage
@sproc(session=session, name='get_scores_sf', stage_location='@ML_MODELS',
       packages=['snowflake-snowpark-python', 'scikit-learn', 'numpy'],
       is_permanent=True,
       replace=True)
def train_model(session: Session, train_table: str, feature_cols: list, target_col: str, model_name: str) -> str:
    from sklearn.linear_model import LogisticRegression
    from joblib import dump
    local_training_data = session.table(train_table).to_pandas()
    X, y = local_training_data[feature_cols], local_training_data[target_col]
    lrmodel = LogisticRegression().fit(X, y)
    dump(lrmodel, '/tmp/' + model_name)
    session.file.put('/tmp/' + model_name, '@ML_MODELS', auto_compress=False, overwrite=True)
    return 'ML Model trained and saved'

train_model("ML_DATASET", list(feature_cols), 'Y', 'lr_model.sav')
# Cell 6: define the scoring function that will become the UDF
@cached(cache={})
def load_model(model_path: str) -> object:
    # cachetools means the model file is only deserialised once per UDF instance
    from joblib import load
    model = load(model_path)
    return model

def udf_model(df: pd.DataFrame) -> pd.Series:
    # the model file is shipped to the UDF via 'imports' and appears in the import directory
    model = load_model(sys._xoptions["snowflake_import_directory"] + 'lr_model.sav')
    scored_data = pd.Series(model.predict(df))
    return scored_data

# Cell 7: register the scoring function as a permanent, vectorised UDF
udf_model = session.udf.register(func=udf_model, name="udf_score_lr_model", stage_location='@ML_MODELS',
                                 input_types=[T.FloatType()] * len(feature_cols),
                                 return_type=T.FloatType(),
                                 replace=True, is_permanent=True,
                                 max_batch_size=1000,
                                 imports=['@ML_MODELS/lr_model.sav'],
                                 packages=['scikit-learn==1.1.1', 'pandas', 'joblib', 'cachetools'])

# Cell 8: score the dataset with the UDF and pull the results back locally
test_sdf = session.table('ML_DATASET')
test_sdf.with_column('Y_PREDICTION', udf_model(*feature_cols)).to_pandas()
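
And if you’d rather keep the predictions in Snowflake instead of pulling them back into pandas, a couple of extra lines (mirroring what the Worksheet version below does; the table name ML_DATASET_PREDICTION is just an example) will save them to a table:

# persist the scored rows to a Snowflake table instead of pulling them back into pandas
# (ML_DATASET_PREDICTION is just an example table name)
results = test_sdf.with_column('Y_PREDICTION', udf_model(*feature_cols))
results.write.mode("overwrite").save_as_table('ML_DATASET_PREDICTION')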

1.5 Admire Your Handiwork

If you’ve followed all the steps, your notebook should end with a pandas DataFrame of the features (plus Y and a new Y_PREDICTION column)

Note: I added in the sub-headings using Markdown; they aren’t necessary, purely aesthetic

2. In a Snowflake Worksheet

This will look pretty much the same, but naturally there will be some tweaks that reflect the different IDE

First, Log in to Snowflake

Next, Create a Python Worksheet


Next, Choose a Virtual Warehouse and Database

Next, Make Sure Your Dependency Libraries are Imported

In the Packages drop-down, search for and tick scikit-learn, cachetools and joblib

Note some come as standard, like Snowpark, pandas and NumPy

Next, Copy This Code into the Sheet

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import sproc
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
import pandas as pd
import os
import sys
from sklearn.linear_model import LogisticRegression
from cachetools import cached
from joblib import load
from sklearn.datasets import make_classification


def main(session: snowpark.Session):

    # First Create Some Synthetic Data
    columns = [str(i) for i in range(0, 10)]
    X, y = make_classification(n_samples=100000, n_features=10, n_classes=2)
    df = pd.DataFrame(X, columns=columns)
    feature_cols = df.columns
    df['Y'] = y

    # Next, Save It To Snowflake
    session.write_pandas(df, table_name='ML_DATASET',
                         auto_create_table=True, overwrite=True)

    # Next, Create a Stage and a Sproc (here we register it with a decorator @sproc)
    session.sql("CREATE OR REPLACE STAGE ML_MODELS").collect()

    @sproc(session=session, name='get_scores_sf', stage_location='@ML_MODELS',
           packages=['snowflake-snowpark-python', 'scikit-learn', 'numpy'],
           is_permanent=True,
           replace=True)
    def train_model(session: snowpark.Session, train_table: str, feature_cols: list,
                    target_col: str, model_name: str) -> str:
        from sklearn.linear_model import LogisticRegression
        from joblib import dump
        local_training_data = session.table(train_table).to_pandas()
        X, y = local_training_data[feature_cols], local_training_data[target_col]
        lrmodel = LogisticRegression().fit(X, y)
        dump(lrmodel, '/tmp/' + model_name)
        session.file.put('/tmp/' + model_name, '@ML_MODELS', auto_compress=False, overwrite=True)
        return 'ML Model trained and saved'

    # Next, Run the Sproc (and save the model as part of the process)
    train_model("ML_DATASET", list(feature_cols),
                "Y", "lr_model.sav")

    # Next, Create a UDF using the Model we just trained
    @cached(cache={})
    def load_model(model_path: str) -> object:
        from joblib import load
        model = load(model_path)
        return model

    def udf_model(df: pd.DataFrame) -> pd.Series:
        model = load_model(sys._xoptions["snowflake_import_directory"] +
                           'lr_model.sav')
        scored_data = pd.Series(model.predict(df))
        return scored_data

    udf_model = session.udf.register(func=udf_model,
                                     name="udf_score_lr_model",
                                     stage_location='@ML_MODELS',
                                     input_types=[T.FloatType()] * len(feature_cols),
                                     return_type=T.FloatType(),
                                     replace=True, is_permanent=True,
                                     max_batch_size=1000,
                                     imports=['@ML_MODELS/lr_model.sav'],
                                     packages=['scikit-learn==1.1.1', 'pandas',
                                               'joblib', 'cachetools'])

    # Finally, Execute the UDF, save the predictions and return the results
    test_sdf = session.table('ML_DATASET')
    results = test_sdf.with_column('Y_PREDICTION', udf_model(*feature_cols))
    results.write.mode("overwrite").save_as_table('ML_DATASET_PREDICTION')
    return results

Next, Run (click that blue box in the top right)

You should see the results in the table below the code
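
Because the worksheet also saved the predictions to ML_DATASET_PREDICTION, you can read them back any time from a notebook or another Python worksheet with an active session, with something like:

# the worksheet persisted its predictions, so they can be read back whenever you like
session.table("ML_DATASET_PREDICTION").limit(10).show()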

Conclusion

So there we have it: two different ways to get a machine learning model into Snowflake with the minimum of fuss. Of course, the robust, production-grade equivalent would look quite different, with GitHub, MLOps, more comments, docstrings and so forth in place, but this should give you a sense of just how easy it is to deploy a model to Snowflake. More importantly, once it’s deployed you have near-limitless scalability at your fingertips, because using a UDF to host a model makes for effortless batch processing.
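
For example, scoring any other table with the same ten feature columns is just a couple of lines against the permanent UDF; here’s a rough sketch, where NEW_CUSTOMERS and NEW_CUSTOMERS_SCORED are made-up table names standing in for your own data:

from snowflake.snowpark.functions import call_udf, col

# batch-score a hypothetical table (NEW_CUSTOMERS is an assumed name) with the permanent UDF
feature_cols = [str(i) for i in range(0, 10)]
scored = session.table("NEW_CUSTOMERS").with_column(
    "Y_PREDICTION", call_udf("udf_score_lr_model", *[col(c) for c in feature_cols])
)
scored.write.mode("overwrite").save_as_table("NEW_CUSTOMERS_SCORED")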


Michael Taylor
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Equal parts data scientist, consultant, data privacy wonk, animal lover, basketball coach/player and cook. Thoughts are my own, not my employer’s.