Snowpark for Data Science Playbook, FAQ style — Part 1

This playbook walks through data science best practices on Snowpark in a frequently asked questions (FAQ) style, drawing on questions from our Snowflake community. There are many references available for building your own data science notebook in Snowpark; the intent behind this playbook is to simplify your understanding of how to build machine learning models in Snowpark for various use cases.

The data science workflow provides a quick reference for building machine learning models in Snowpark, starting from setup, data ingestion, data engineering, feature engineering, building the model pipeline, training models, deploying the model pipeline, monitoring model performance, and finally running inference with the deployed model.

What notebook IDEs are supported by snowpark?

Snowpark supports many notebook IDEs such as Jupyter Notebook, Spyder, and Hex. If you want a rich experience building solutions in Snowpark, we highly recommend Hex, which offers both SQL and Python kernels and makes it easy to build visualizations from either. Try it yourself here

How do I know what packages are supported by snowpark for data science?

You can easily find the packages that are available in Snowpark in two ways.

Method 1: Scan through this list of all Snowpark-supported Python packages from the Snowflake Anaconda channel:

https://repo.anaconda.com/pkgs/snowflake/

Method 2: Execute the following SQL in your Snowflake account against the Information Schema. Any package that is part of the default Anaconda installation, such as scikit-learn, should be available.
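For illustration, the query can be run from a Snowpark session (or pasted into a worksheet); it assumes an active session object named session:

# List the Python packages available in the Snowflake Anaconda channel
session.sql("""
    select package_name, version
    from information_schema.packages
    where language = 'python'
    order by package_name
""").show()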

How do I access new packages (not listed above) in snowpark for machine learning use case?

Step 1: Download the .whl or .tar.gz file of the package you want from PyPI or another distribution source.

Step 2: Unzip the archive and upload the pure-Python package (for example, re-zipped as a .zip) to the internal stage where the model files are saved.

Step 3: Add session.add_import(import_path) to the function where the package is accessed by the model.

Remember that you can only add/include pure-Python packages and not those that have or depend on C extensions.
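For illustration, assuming the package was re-zipped as my_pure_python_pkg.zip and uploaded to a hypothetical @ml_models_stage stage:

# Make the uploaded pure-Python package available to stored procedures and UDFs
session.add_import("@ml_models_stage/my_pure_python_pkg.zip")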

How do I check if my Snowpark installation is set up correctly for data science?

If you have already installed Anaconda through Navigator and installed Snowpark, then try the following:
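For example, a quick check of the installed package version (assuming the snowpark environment is active):

# Confirm that snowflake-snowpark-python is installed in this environment
import importlib.metadata
print(importlib.metadata.version("snowflake-snowpark-python"))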

You should see the installed version of snowflake-snowpark-python printed.

Then, all it takes is importing the basic Snowpark libraries. snowflake.snowpark.functions and snowflake.snowpark.types are integral to Snowpark and are primarily used for data and feature engineering activities.
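A minimal import block looks like this:

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T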

If the above throws an error, then Snowpark is not installed correctly; use this Snowpark quickstart as an installation guide.

Or, if you are just getting started, install Snowpark with these simple steps:

1. Install Anaconda for macOS or Windows from https://www.anaconda.com/products/distribution

2. After Anaconda is fully installed, open Anaconda Navigator, create a new environment, and name it snowpark

3. Once the environment is set up, click the play button to activate the environment

4. Then click Home in the left pane. On the Home screen, select the “snowpark” environment you created from the environment dropdown

5. Finally, click Launch under JupyterLab to open the notebook for model building. If this is your first time, click Install under JupyterLab before you click Launch.

6. Go back and run the Snowpark library imports above to check whether the install was successful

  • If you would rather use Miniconda with Python 3.8 instead, please follow these steps to install.

Remember, Snowpark currently supports Python 3.8.

How do I connect to Snowflake for data science from the notebook?

Step 1: Build a credentials file cred.json using the following Snowflake parameters.
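A sketch of cred.json with the standard connection parameters (all values are placeholders):

{
  "account": "<account_identifier>",
  "user": "<user_name>",
  "password": "<password>",
  "role": "<role_name>",
  "warehouse": "<warehouse_name>",
  "database": "<database_name>",
  "schema": "<schema_name>"
}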

Step 2: Now, using the above cred.json, create a new Snowpark session by connecting to Snowflake.
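A minimal sketch, assuming cred.json sits next to the notebook:

import json
from snowflake.snowpark import Session

with open("cred.json") as f:
    connection_parameters = json.load(f)

session = Session.builder.configs(connection_parameters).create()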

Can I validate my connection to snowflake?

Validate the connection using the following session calls:
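For example:

print(session.get_current_database())
print(session.get_current_schema())
print(session.get_current_warehouse())
print(session.get_current_role())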

If you are able to connect, the output should show the database, schema, and warehouse names you used in cred.json.

Can I have multiple active sessions simultaneously?

Yes, you can create and activate multiple sessions (snowflake connections) simultaneously similar to snowflake worksheets. Make sure you are NOT mixing sessions across notebooks.

How do I access data the best way in snowpark if my data is stored as files?

The best way to access ad hoc files is to ingest your data into Snowflake as a table (permanent, transient, or temporary). If the files are stored on your local drive, you can load the data using SnowSQL; see the SnowSQL quickstart to install SnowSQL and get started. Note that the PUT command works only with SnowSQL on the command prompt and not in a Snowflake worksheet.

Another way to execute the PUT command is from your notebook cell, as explained in Step 2 below.

Step 1: Create a stage location using the snowflake session created above
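For example, creating a hypothetical internal stage named ml_demo_stage:

session.sql("create or replace stage ml_demo_stage").collect()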

You should see a status message confirming that the stage was created successfully.

Step 2: Let’s load the file from a local folder to the internal stage created above using the PUT command.

With SQL call
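A sketch with a placeholder local path (PUT works here because session.sql executes through the Python connector):

session.sql(
    "put file:///path/to/local/data.csv @ml_demo_stage auto_compress=true overwrite=true"
).collect()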

With Snowpark Dataframe
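The equivalent call with the Snowpark file API (path is a placeholder):

session.file.put(
    "/path/to/local/data.csv",
    "@ml_demo_stage",
    auto_compress=True,
    overwrite=True,
)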

Step 3: Execute LIST on the internal stage.
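For example:

session.sql("list @ml_demo_stage").show()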

It should list the file that was just loaded.

Note: Snowflake compresses files on PUT by default; you can choose NOT to compress by adding AUTO_COMPRESS=FALSE (or auto_compress=False with session.file.put).

Step 4: Now let’s ingest the staged data into a Snowflake table, but first review the structure of the input file before you create the table in Snowflake.

With SQL based table creation
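A sketch with a hypothetical table name and columns:

session.sql("""
    create or replace table MY_INPUT_TABLE (
        ID         number,
        FEATURE_1  float,
        FEATURE_2  float,
        LABEL      varchar
    )
""").collect()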

This returns a status message confirming that the table was successfully created.

With Snowpark Dataframe
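Alternatively, define the schema with Snowpark types and read the staged file (stage, file name, and columns are placeholders):

from snowflake.snowpark.types import StructType, StructField, IntegerType, FloatType, StringType

input_schema = StructType([
    StructField("ID", IntegerType()),
    StructField("FEATURE_1", FloatType()),
    StructField("FEATURE_2", FloatType()),
    StructField("LABEL", StringType()),
])

staged_df = (
    session.read
    .schema(input_schema)
    .option("skip_header", 1)
    .csv("@ml_demo_stage/data.csv.gz")
)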

Step 5: Finally, let’s load the data into the Snowflake table you created.

With SQL
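For example:

session.sql("""
    copy into MY_INPUT_TABLE
    from @ml_demo_stage/data.csv.gz
    file_format = (type = csv skip_header = 1)
""").collect()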

This returns the COPY results, including the file name, load status, and row counts.

With Snowpark Dataframe
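Using the staged_df defined above:

staged_df.copy_into_table("MY_INPUT_TABLE")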

If you loaded the file into a pandas DataFrame, you can also create the table in Snowflake without having to define the table structure.
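For example, with write_pandas (local path is a placeholder):

import pandas as pd

pdf = pd.read_csv("/path/to/local/data.csv")
session.write_pandas(pdf, "MY_INPUT_TABLE_PD", auto_create_table=True)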

Step 6: Let’s check some stats on the loaded table before we wrap up the data load.
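For example:

input_df = session.table("MY_INPUT_TABLE")
print(input_df.count())   # row count
input_df.show(5)          # preview the first few rows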

How do I profile data the best way to understand the variables?

The most popular profiling package available is pandas-profiling. You can achieve something similar by using Snowpark DataFrame methods such as the following.

Snowpark Dataframe Stats
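For example, summary statistics for the numeric columns of the input_df defined above:

input_df.describe().show()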

Snowpark Dataframe find missing values
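Count missing values per column, for example:

from snowflake.snowpark.functions import col, count, when

input_df.select(
    [count(when(col(c).is_null(), c)).alias(c) for c in input_df.columns]
).show()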

Snowpark Dataframe count
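For example, overall and per-group row counts (the LABEL column is a placeholder):

print(input_df.count())
input_df.group_by("LABEL").count().show()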

For other references, use the Snowpark API Reference for Python.

I will soon share a notebook that does something similar to pandas-profiling with Streamlit; it is a work in progress.

How do I find correlations and covariance between X variables and Y in snowpark?

You can compute both directly on a Snowpark DataFrame.

Compute the correlation using the following:
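For example, on two numeric columns (names are placeholders):

print(input_df.stat.corr("FEATURE_1", "TARGET"))
# or in SQL
session.sql("select corr(FEATURE_1, TARGET) from MY_INPUT_TABLE").show()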

Snowpark reference for Correlation

Compute the covariance using the following:
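Similarly:

print(input_df.stat.cov("FEATURE_1", "TARGET"))
# or in SQL
session.sql("select covar_samp(FEATURE_1, TARGET) from MY_INPUT_TABLE").show()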

Snowpark reference for Covariance

Thereafter, you can build correlation and covariance matrices and visualize them using Hex or matplotlib. I will soon share a notebook that builds a covariance matrix, similar to sklearn, using Streamlit; it is a work in progress.

How do I impute, scale, or normalize the X variables in Snowpark?

This can be achieved by building a model pipeline in which you impute, normalize, and scale the features and define the model classifier with its parameters.
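A minimal scikit-learn pipeline sketch of this idea (the imputation strategy and classifier are assumptions):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),        # impute missing values
    ("scaler", StandardScaler()),                         # scale / normalize
    ("classifier", RandomForestClassifier(n_estimators=200)),
])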

Can we train models in snowpark and use snowflake credits?

You can train ML models in snowflake by registering the model pipeline function built above as a stored procedure and then calling the stored procedure. Refer to the diagram if you feel confused.

This will utilize the snowflake compute power and credits.

Step 1: Define the Model Function
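A rough sketch of such a function (table name, LABEL column, and stage are placeholders; the first argument of a Snowpark stored procedure must be the Session):

def train_model(session: Session, input_table: str, model_stage: str) -> str:
    import joblib
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    # Pull the training data into pandas inside the stored procedure
    df = session.table(input_table).to_pandas()
    X = df.drop("LABEL", axis=1)
    y = df["LABEL"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("classifier", RandomForestClassifier(n_estimators=200)),
    ])
    pipe.fit(X_train, y_train)
    score = pipe.score(X_test, y_test)

    # Persist the fitted pipeline to the internal stage
    joblib.dump(pipe, "/tmp/model.joblib")
    session.file.put("/tmp/model.joblib", model_stage, overwrite=True, auto_compress=False)
    return f"accuracy: {score:.4f}"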

Step 2: Register Stored Proc
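For example:

session.sproc.register(
    func=train_model,
    name="TRAIN_MODEL_SPROC",
    packages=["snowflake-snowpark-python", "scikit-learn", "joblib", "pandas"],
    is_permanent=True,
    stage_location="@ml_models_stage",
    replace=True,
)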

Step 3: Execute Stored Proc for training
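For example:

session.call("TRAIN_MODEL_SPROC", "MY_INPUT_TABLE", "@ml_models_stage/models")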

How do I confirm if the training was pushed down to snowflake?

After you execute the notebook cell which calls the stored procedure (above),

Step 1: Log in to the Snowflake account you are using for the notebook session.

Step 2: Click on the History tab and click Refresh on the right side.

Step 3: You should see a new query submitted in Snowflake where the stored procedure is running, with a query ID assigned.

The History tab shows the query ID, the stored procedure CALL, the warehouse and its size, and most importantly the duration it took to complete the training.

You can also narrow down all your training runs (stored procedure CALLs) and the duration they took with a query like the following:
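For example (the ilike filter assumes the stored procedure name used above):

session.sql("""
    select query_id, query_text, warehouse_name, warehouse_size,
           total_elapsed_time / 1000 as elapsed_seconds
    from table(information_schema.query_history())
    where query_text ilike 'call train_model%'
    order by start_time desc
""").show()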

The output lists each training CALL along with the time it took.

Can I register individual pieces in the model pipeline as separate stored proc?

Yes, you can build an independent piece from a model pipeline such as the scaler shown below and register it as a stored procedure which will be pushed down to utilize snowflake compute resources and credits.

Step 1: Define the function min_max_scaler and establish the input table and columns to scale. Most importantly, “preprocessing” here is a helper module with built-in functions for imputation, encoding, and scaling. This package needs to be imported before min_max_scaler is registered as a stored procedure. The “preprocessing” utility scripts are linked here. Here is how to use the preprocessing utility and register it as an individual stored procedure for reuse.

Step 2: Register the function as a stored procedure.

Step 3: Call the stored procedure by passing the input
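Putting the three steps together, here is a rough standalone sketch that does not use the preprocessing utility mentioned above (table, column, and stage names are placeholders):

def min_max_scaler(session: Session, input_table: str, input_cols: str) -> str:
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    cols = [c.strip() for c in input_cols.split(",")]
    pdf = session.table(input_table).to_pandas()
    pdf[cols] = MinMaxScaler().fit_transform(pdf[cols])
    session.write_pandas(pdf, input_table + "_SCALED", auto_create_table=True)
    return "scaled table written"

session.sproc.register(
    func=min_max_scaler,
    name="MIN_MAX_SCALER_SPROC",
    packages=["snowflake-snowpark-python", "scikit-learn", "pandas"],
    is_permanent=True,
    stage_location="@ml_models_stage",
    replace=True,
)

session.call("MIN_MAX_SCALER_SPROC", "MY_INPUT_TABLE", "FEATURE_1,FEATURE_2")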

Similarly, you can build independent pieces of the model pipeline, such as encoding, correlation analysis, and imputation, as separate stored procedures so that you can reuse them across different use cases.

Can I train multiple models in snowpark simultaneously?

Yes, you can train multiple models and there are multiple options to get there.

Option 1: Register multiple stored procedures and execute them independently, which generates different model files. Use a timestamp as a suffix to get unique model filenames.

Option 2: Register one stored procedure as above, then execute it with different parameters based on your model pipeline definition, saving different model files.

Can I increase compute power while the model training is in progress?

No. After training has begun, the query has already been submitted to Snowflake, and the warehouse that was assigned will be used to train the model. You can change the warehouse at any time, but this will not affect model training that is already in progress.

Can I increase compute power before I train the model?

You can always cancel the current execution and change the warehouse to a new size. Once the warehouse has been resized, you can execute the stored proc for model training.

If you want to resize the warehouse you are currently using:
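For example (warehouse name is a placeholder):

session.sql("alter warehouse my_warehouse set warehouse_size = 'MEDIUM'").collect()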

If you want to use a different warehouse that is already configured as medium size:
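For example:

session.use_warehouse("MY_MEDIUM_WH")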

Can I evaluate trained models in snowpark?

Yes, by defining a function that can write evaluation results back to Snowflake. Here is an example of deriving a classification report and calling it within the model definition above. If you are building other kinds of models, you should be able to build a similar function; this one works only for classification use cases.
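A sketch of such a function (the output table name is a placeholder):

def save_classification_report(session: Session, y_true, y_pred, model_name: str) -> None:
    import pandas as pd
    from sklearn.metrics import classification_report

    # Build the report as a pandas DataFrame and append it to a Snowflake table
    report = classification_report(y_true, y_pred, output_dict=True)
    report_pdf = pd.DataFrame(report).transpose().reset_index()
    report_pdf.insert(0, "MODEL_NAME", model_name)

    session.create_dataframe(report_pdf).write.mode("append").save_as_table("MODEL_CLASSIFICATION_REPORT")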

To save the model parameters in a Snowflake table, you can define a function that captures all your inputs to the model and saves them with the model output.
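For example (the MODEL_PARAMS table is a placeholder):

def save_model_params(session: Session, model_name: str, params: dict) -> None:
    import json
    # One row per training run; parameters stored as a JSON string
    params_df = session.create_dataframe(
        [[model_name, json.dumps(params, default=str)]],
        schema=["MODEL_NAME", "MODEL_PARAMS"],
    )
    params_df.write.mode("append").save_as_table("MODEL_PARAMS")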

How do I save the trained model?

Before we use the model for inference, the trained model should be saved to the Snowflake internal stage where all the model dependencies are kept.

Step 1 : Define the function to save the model
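A minimal sketch (joblib is one option for serializing scikit-learn pipelines):

def save_model_to_stage(session: Session, model, stage_path: str) -> None:
    import joblib
    # Dump locally (inside the stored procedure), then PUT to the internal stage
    joblib.dump(model, "/tmp/model.joblib")
    session.file.put("/tmp/model.joblib", stage_path, overwrite=True, auto_compress=False)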

Step 2 : Execute the function by passing the stage directory where you want the model saved
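For example, from inside the training function after fitting (stage path is a placeholder):

save_model_to_stage(session, pipe, "@ml_models_stage/models")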

Step 3: After training is completed, check the saved model file as below and review the results.
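For example:

session.sql("list @ml_models_stage/models").show()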

The output shows the model file name and its size in the internal stage.

How do I go back and check all the models I have trained in snowflake?

Now that we have done all the heavy lifting of building the pipeline and training the model, it is very simple to evaluate all the models you trained using SQL queries. But remember: you should have saved the model output to a Snowflake table by defining the table's input parameters and writing the model reports as part of the stored procedure.

Step 1 : Remember to create a model_output snowflake table
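A sketch of such a table (the columns are an assumption):

session.sql("""
    create table if not exists MODEL_OUTPUT (
        MODEL_NAME    varchar,
        MODEL_PARAMS  varchar,
        ACCURACY      float
    )
""").collect()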

Step 2: Define these 2 function calls while registering stored proc above. This should write the model output to the snowflake table MODEL_OUTPUT.
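For illustration, inside the training function you could append one row per run like this (names match the sketch above):

import json

session.create_dataframe(
    [["RF_CLASSIFIER_V1", json.dumps(pipe.get_params(), default=str), float(score)]],
    schema=["MODEL_NAME", "MODEL_PARAMS", "ACCURACY"],
).write.mode("append").save_as_table("MODEL_OUTPUT")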

Step 3: Use this SQL to query model output.
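For example:

session.sql("""
    select model_name, accuracy, model_params
    from MODEL_OUTPUT
    order by accuracy desc
""").show()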

Now you can check the model output, query it like a regular table, and build visualizations of model performance.

There is a Streamlit app currently in the works that will be available for evaluating trained models.

How do I perform inference on a model that was trained and saved?

Step 1 Define the inference function
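A rough sketch of a scalar inference function that loads the staged model at runtime (file and feature names are placeholders; in practice you would cache the loaded model, for example with cachetools):

def load_model(file_name: str):
    import sys
    import joblib
    # Files added via imports are unpacked into this directory at UDF runtime
    import_dir = sys._xoptions.get("snowflake_import_directory")
    return joblib.load(import_dir + file_name)

def predict_rf(feature_1: float, feature_2: float) -> str:
    import pandas as pd
    model = load_model("model.joblib")
    row = pd.DataFrame([[feature_1, feature_2]], columns=["FEATURE_1", "FEATURE_2"])
    return str(model.predict(row)[0])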

Step 2 Register the model inference
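For example:

session.add_import("@ml_models_stage/models/model.joblib")

session.udf.register(
    func=predict_rf,
    name="PREDICT_RF",
    packages=["scikit-learn", "joblib", "pandas"],
    is_permanent=True,
    stage_location="@ml_models_stage",
    replace=True,
)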

Step 3 Execute the inference udf through simple SQL statements
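For example:

session.sql("""
    select FEATURE_1, FEATURE_2,
           PREDICT_RF(FEATURE_1, FEATURE_2) as RF_PREDICTION
    from MY_INPUT_TABLE
""").show()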

The results will look something like this when you call the UDFs (user-defined functions) to predict labels from the input features, invoking two different UDFs (Random Forest and Decision Tree classifiers) simultaneously.

Can I use the inference outside of snowflake?

Yes, you can use the inference UDF from pretty much any app, microservice, lambda function, or analytics tool such as Tableau, Power BI, etc.

You can use Streamlit to host an app within Snowflake and share it across your user base. More to come in this space in the next parts of this series.

What are some of the clean ups available in snowpark?

To clear the packages or imports
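For example:

session.clear_imports()
session.clear_packages()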

To close a session that was activated for snowpark
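For example:

session.close()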

What are some readily available notebooks available to get started?

From our snowpark-python-demos repository, use these projects to get familiar with Snowpark, along with the playbook above to deepen your understanding. These are just some starter notebooks.

Regression Models to predict housing prices

Classification Model for credit card fraud detection

Customer Churn Analytics

Credit Scoring Model

Advertisement Spend ROI prediction

Sentiment Analytics

What are you waiting for? Get started with Snowpark. Share your Snowpark experience in your blog posts, and I will be very happy to learn from it.

Happy holiday everyone and I wish you all a very happy new year 2023.

Karuna Nadadur

Karuna is a Sr. Data Cloud Architect at Snowflake with rich experience in data analytics and data science, helping customers make insightful decisions for their business.