Unlocking Machine Learning Potential: Running R Inside Snowpark Container Services

R Integration via Snowpark Container Services, Powered by Posit

R has a large global community, and its presence on Snowflake is growing fast. Snowpark Container Services opens new opportunities for Snowflake customers to run R natively, without data ever leaving the Data Cloud. Snowpark Container Services (SPCS) was launched at Snowflake Summit; more details can be found here. There was also some earlier work on running R inside the Hex environment here.

In this blog you will see how easy it is to spin up an instance, upload a Docker image, and run R inside Snowflake. I will build a basic end-to-end ML project for demonstration, using the classic iris dataset. Snowpark Container Services (SPCS) is now available to a subset of customers in private preview. It is a fully managed container offering that allows you to easily deploy, manage, and scale containerized services, jobs, and functions, all within the security and governance boundaries of Snowflake, with zero data movement. More-Info

How can you run this demo?

A Snowflake account that supports Snowpark Container Services: this private preview works with Snowflake accounts that are created specifically for the purpose of evaluating Snowpark Container Services.

These accounts have the Snowflake image registry enabled so that you can create one or more image repositories. In addition, these accounts have access to new Snowflake features, such as services and jobs, managed by Snowpark Container Services. Please talk to your account representative to activate this flag. Lastly, I used the enterprise version of RStudio, which requires a license key.

Steps for the demo:

  1. Create the required infrastructure
  2. Download the Docker image
  3. Create the specification YAML
  4. Create and run the service
  5. Open the endpoint and start an RStudio session
  6. Connect RStudio to Snowflake
  7. Bring in the data set
  8. Create the ML project and write back to Snowflake

1. Create the required infrastructure

Let's create some basic objects: a compute pool, warehouse, database, schema, stage, and image repository.

CREATE COMPUTE POOL yourComputepool 
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = standard_1;

CREATE OR REPLACE WAREHOUSE yourWH
WITH WAREHOUSE_SIZE='X-SMALL'
AUTO_SUSPEND = 180
AUTO_RESUME = true
INITIALLY_SUSPENDED=false;

CREATE DATABASE yourdatabase;

CREATE SCHEMA yourschema;

CREATE OR REPLACE IMAGE REPOSITORY yourimage;

CREATE STAGE yourstage DIRECTORY = ( ENABLE = true );

2. Download the Docker image

Let's download the Docker image. I am using the licensed RStudio (Posit Workbench) enterprise edition, but you can just as well use an open-source R container. To run the enterprise edition you will need a license key; it is pretty straightforward.

Now let's do some Docker work: logging in, then tagging and pushing the image to SPCS. I have tagged the image as rstudio:v1.

https://github.com/orgs/rstudio/packages/container/package/rstudio-workbench-preview

docker login youraccount.registry.snowflakecomputing.com

docker pull ghcr.io/rstudio/rstudio-workbench-preview

docker tag ghcr.io/rstudio/rstudio-workbench-preview youraccount.registry.snowflakecomputing.com/yourdatabase/yourschema/yourimage/rstudio:v1

docker push youraccount.registry.snowflakecomputing.com/yourdatabase/yourschema/yourimage/rstudio:v1

3. Create the specification YAML (R.yaml)

Create the YAML file pointing to the image repository, and upload it to the stage (for example with a PUT command from SnowSQL, or through the Snowsight UI). You can specify multiple environment variables, but this is a basic example.

spec:
  container:
  - name: rstudio
    image: youraccount.registry.snowflakecomputing.com/yourdatabase/yourschema/yourimage/rstudio:v1
    env:
      RSW_LICENSE: "XXXXXXXX"
    resources:
      limits:
        memory: 16Gi
      requests:
        memory: 4Gi
  endpoint:
  - name: e1
    port: 8080
    public: true

4. Create and run the service

We create the R_notebook service here, pointing it at the compute pool and the spec file uploaded in step 3.

CREATE SERVICE R_notebook
MIN_INSTANCES=1
MAX_INSTANCES=3
COMPUTE_POOL=yourComputepool
SPEC=@yourstage/R.yaml;
[Image: containers are running]

5. Open the endpoint and start an RStudio session

Open the public endpoint in a browser and log in with your RStudio credentials. Now we have RStudio running within Snowflake.

[Image: endpoint created]

Create a new RStudio session using Posit Workbench.

[Image: entering the credentials]
[Image: creating an RStudio <> SPCS session]

✊ Success! You can see how easy it was to launch RStudio on SPCS.

6. Connect RStudio to Snowflake

Now that we have RStudio running inside SPCS, let's connect it to Snowflake. This can be done either through the UI or in code:

### Install the tidyverse (which includes dplyr) and the database packages
install.packages(c('tidyverse', 'DBI', 'odbc'))
library(tidyverse)
library(DBI)

con <- dbConnect(odbc::odbc(),
                 Driver = "snowflake",
                 Server = "account.snowflakecomputing.com",
                 UID = "user",
                 PWD = rstudioapi::askForPassword("Database password:"),
                 Database = "yourDatabase",
                 Warehouse = "yourWarehouse",
                 Schema = "yourSchema")

Alternatively, you can enter these credentials in the UI when defining the connection:

Server: account.snowflakecomputing.com
User: user
Database: yourDatabase
Warehouse: yourWarehouse
Schema: yourSchema

[Image: the Server, Database, Connection, Warehouse, and Schema fields in the UI]
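
As a quick sanity check (a minimal sketch assuming the con object created above), you can query the current session context to confirm the connection works:

# Confirm the connection by asking Snowflake for the current session context
dbGetQuery(con, "SELECT CURRENT_USER(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")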

7. Bring in the data set

We will use the iris data and write the training and test datasets back to Snowflake. I adapted the project from Source-Credit.

# Importing libraries
library(tidyverse)
library(datasets)
library(caret)
# Importing the Iris data set
data(iris)
# Check whether there is any missing data
sum(is.na(iris))
# To achieve reproducible model; set the random seed number
set.seed(16)
# Split dataset 80/20
TrainingIndex <- createDataPartition(iris$Species, p=0.8, list = FALSE)
TrainingSet <- iris[TrainingIndex,] # Training Set
TestingSet <- iris[-TrainingIndex,] # Test Set
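
Optionally, a quick check (my addition, not part of the original tutorial) confirms that the 80/20 stratified split behaved as expected:

# Verify the split sizes and that class balance is preserved
dim(TrainingSet)            # expect 120 rows
dim(TestingSet)             # expect 30 rows
table(TrainingSet$Species)  # expect 40 observations per species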

Defining the target objects and writing back to Snowflake:

#### Defining the target objects
# Database: r_studio
# Schema:   public
# Tables:   TrainingSet, TestingSet (the data frames above)

# Writing Training / Test to Snowflake
dbWriteTable(con, SQL("r_studio.public.TrainingSet"), TrainingSet)
dbWriteTable(con, SQL("r_studio.public.TestingSet"), TestingSet)
[Image: datasets written back to Snowflake]
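
To confirm the write, you can query the new tables straight back from R; this is a minimal sketch reusing the con connection from step 6:

# Pull a few rows back to confirm the tables landed in Snowflake
dbGetQuery(con, "SELECT * FROM r_studio.public.TrainingSet LIMIT 5")
dbGetQuery(con, "SELECT COUNT(*) AS n FROM r_studio.public.TestingSet")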

8. Create the ML project and write back to Snowflake

Let's build a quick classification model predicting species on the iris dataset; for this example I used an SVM. Again, the goal of this blog is not to build the best possible model, but to show how R can run in Snowflake using SPCS.

###############################
# SVM model (polynomial kernel)

# Build the training model
Model <- train(Species ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess = c("scale", "center"),
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(degree = 1, scale = 1, C = 1))

# Build the 10-fold cross-validation model
Model.cv <- train(Species ~ ., data = TrainingSet,
                  method = "svmPoly",
                  na.action = na.omit,
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = data.frame(degree = 1, scale = 1, C = 1))

# Apply the models for prediction
Model.training <- predict(Model, TrainingSet)  # predictions on the training set
Model.testing <- predict(Model, TestingSet)    # predictions on the test set
Model.cv <- predict(Model.cv, TrainingSet)     # predictions from the CV model on the training set

# Model performance (displays confusion matrix and statistics)
Model.training.confusion <- confusionMatrix(Model.training, TrainingSet$Species)
Model.testing.confusion <- confusionMatrix(Model.testing, TestingSet$Species)
Model.cv.confusion <- confusionMatrix(Model.cv, TrainingSet$Species)

print(Model.training.confusion)
print(Model.testing.confusion)
print(Model.cv.confusion)

# Feature importance
Importance <- varImp(Model)
plot(Importance)
plot(Importance, col = "red")
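
Finally, to close the loop on "writing back to Snowflake", here is a minimal sketch (assuming the con connection from step 6; the table name TestingSetScored is just an example) that attaches the test-set predictions to the test data and writes the scored set back:

# Attach predicted labels to the test set (TestingSetScored is an example name)
Results <- TestingSet %>% mutate(Predicted = Model.testing)

# Write the scored test set back to Snowflake
dbWriteTable(con, SQL("r_studio.public.TestingSetScored"), Results)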
[Image: output of the basic ML project]

Conclusion

With Snowpark Container Services, Snowflake customers can now host and run arbitrary containerized applications within the Data Cloud. Running RStudio, powered by Posit, is an excellent example.

This feature is in limited private preview and can only be activated by request for now. For more information about SPCS, contact your Snowflake account team or sales representative.
