A New ML Development Experience with Amazon SageMaker Studio Lab for Snowpark Python Developers

Amazon recently announced a brand new machine learning development experience called Amazon SageMaker Studio Lab. SageMaker Studio Lab (in public preview as of June 2022) provides a seamless development experience that includes compute (CPU/GPU runtimes), persistent EBS-based storage (up to 15 GB), and GitHub integration. All of these are free for all developers for learning and experimentation.

Photo by Christin Hume on Unsplash

ML developers typically stand up a Jupyter environment locally or use one of the proprietary notebook offerings for experimentation and development. Amazon SageMaker Studio Lab is another free alternative: a hosted version of open source JupyterLab designed specifically for learning and experimentation, not for production (see the FAQs). It also aims to make developers more productive in many ways, especially if Amazon SageMaker is the AI/ML service they use to productionize their work. Amazon SageMaker is a great choice to complement the Snowflake Data Cloud for AI/ML workloads, as it integrates seamlessly with Snowflake.

I heard about the announcement of SageMaker Studio Lab in an AWS podcast and decided to try it with Snowpark Python. If you’d like to know more about Snowpark Python, please make sure to check out a recent webinar here.

I chose one of the Snowflake Data Marketplace datasets and implemented ML code to do feature extraction with NLTK packages.

How to get started with SageMaker Studio Lab

All you need to get started is a valid email address. You do not even need an AWS account to use this product. Simply go to https://studiolab.sagemaker.aws/ and sign up. You will receive a confirmation email within hours, and then you can start developing. The development experience is very similar to Amazon SageMaker Studio, and the product is intended to make the transition to SageMaker Studio very easy. I really like the neat UI:

All you need to do is select a compute type, click on the Open Project button and then you can get to the Jupyter UI.

SageMaker Studio Lab enables collaboration among developers and comes with many sample notebooks in GitHub repositories to get you started. You can also clone or integrate with any GitHub repository for source control.

Installing the Snowpark Python package

To create a new Python 3.8 conda environment, I first opened a terminal by clicking the + button at the top left and then the Terminal tile.

I created a YAML file like the one below (e.g. envsdc.yml) for the new environment:

name: sdc
dependencies:
  - python=3.8
  - pip

and ran the commands below to create a new conda environment called ‘sdc’.

(base)$ conda env create -f envsdc.yml
(base)$ conda activate sdc
(sdc)$ conda install ipykernel

After that, I installed the Python packages I needed, including the Snowpark Python package as well as NLTK, NumPy and scikit-learn, from a requirements.txt file in my sdc environment.

Any packages you install are stored in your persistent project storage in Amazon SageMaker Studio Lab, so there is no need to reinstall packages every time you return to your project.
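A quick way to confirm that the environment has what the notebook needs is to check for the modules from a notebook cell. The snippet below is only a sketch: the helper name `missing_modules` is hypothetical, and the module names are the usual import names for the packages mentioned above, which is an assumption about what your requirements.txt installs.

```python
import importlib.util

# Usual import names for the packages named in the text (assumed, adjust as needed).
REQUIRED = ["snowflake.snowpark", "nltk", "numpy", "sklearn"]


def missing_modules(module_names):
    """Return the subset of module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED))
```

An empty list suggests the sdc kernel can import everything; anything listed can be installed from the terminal in the same environment.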

Getting the Product Reviews data from the Snowflake Marketplace

Next, I logged into my Snowflake account, searched for “reviews” in the Snowflake Data Marketplace UI and found the Online Retailer Product Ratings and Reviews dataset by PowerReviews.

In one click, the dataset was ready to query in my Snowflake Data Cloud account.

Online Retailer Product Reviews Text Feature Extraction with Snowpark

As the last step, I created a notebook in Amazon SageMaker Studio Lab and selected the ‘sdc’ kernel I had created previously.

After that, I wrote some sample code to connect to my Snowflake account using the Snowpark Python package, and loaded the reviews dataset from my Snowflake table into a data frame.
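The connection step might look roughly like the sketch below. It is not the notebook's actual code: the environment variable names, the helper names `connection_parameters` and `load_reviews`, and the table name in the usage comment are all hypothetical. The Snowpark calls used (`Session.builder.configs(...).create()`, `session.table(...)`, `limit`, `to_pandas`) come from the Snowpark Python package, and `to_pandas` additionally needs pandas installed.

```python
import os

# Hypothetical environment variable names mapped to Snowpark connection keys.
ENV_TO_PARAM = {
    "SNOWFLAKE_ACCOUNT": "account",
    "SNOWFLAKE_USER": "user",
    "SNOWFLAKE_PASSWORD": "password",
    "SNOWFLAKE_ROLE": "role",
    "SNOWFLAKE_WAREHOUSE": "warehouse",
    "SNOWFLAKE_DATABASE": "database",
    "SNOWFLAKE_SCHEMA": "schema",
}


def connection_parameters(env=None):
    """Build the Snowpark connection dict from whichever variables are set."""
    env = os.environ if env is None else env
    return {param: env[var] for var, param in ENV_TO_PARAM.items() if var in env}


def load_reviews(table_name, max_rows=1000):
    """Open a session, read up to max_rows rows of a table, return a pandas DataFrame."""
    # Imported lazily so the helper above works even without the package installed.
    from snowflake.snowpark import Session

    session = Session.builder.configs(connection_parameters()).create()
    try:
        return session.table(table_name).limit(max_rows).to_pandas()
    finally:
        session.close()


# Usage (the table name is a placeholder, not the real Marketplace object name):
# reviews_df = load_reviews("MY_DB.MY_SCHEMA.PRODUCT_REVIEWS")
```

Reading credentials from the environment keeps them out of the notebook, which matters if the project is pushed to a GitHub repository.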

I used the NLTK packages to clean and vectorize reviews text as below:
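The notebook's code is not reproduced in this text, so here is a dependency-free sketch of the same idea: clean the review text, build a vocabulary, and emit binary bag-of-words features. It deliberately swaps NLTK for a regular-expression tokenizer and a tiny hand-written stop-word set, both of which are assumptions made for illustration. In the real notebook, NLTK's tokenizers and stop-word corpus would take their place.

```python
import re

# Tiny illustrative stop-word set (an assumption); NLTK's stopwords corpus
# would normally supply a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i", "was"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def binary_bag_of_words(documents):
    """Return (vocabulary, rows); rows[i][j] is 1 if vocabulary[j] occurs in documents[i]."""
    tokenized = [clean(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok in toks:
            row[index[tok]] = 1  # presence only, not counts
        rows.append(row)
    return vocabulary, rows


# Made-up sample reviews, not taken from the Marketplace dataset.
reviews = ["Great product, works great!", "The product broke after a week."]
vocabulary, features = binary_bag_of_words(reviews)
print(vocabulary)  # ['after', 'broke', 'great', 'product', 'week', 'works']
print(features)    # [[0, 0, 1, 1, 0, 1], [1, 1, 0, 1, 1, 0]]
```

Note that "great" appears twice in the first review but still maps to 1, which is what makes the features binary rather than counts.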

Finally, I ran my notebook to print the Vocabulary and Bag of Words binary features from the reviews dataset in the Snowflake Data Marketplace. Easy!

You can find the code here.

When you shut down your runtime, your notebooks will not be lost, thanks to the persistent storage of SageMaker Studio Lab. You can always come back to them. Also, the CPU/GPU runtimes can be shared by multiple notebooks in a project.

SageMaker Studio Lab is a new, free, collaborative development environment for ML developers with built-in GitHub integration. I hope you enjoy SageMaker Studio Lab for your experimentation with ML models and Snowpark Python as much as I did.


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Data Cloud Architect