Unlocking Machine Learning Potential: Running R Inside Snowpark Container Services
There is a large global community for R, and within Snowflake it is growing fast. Snowpark Container Services opens new opportunities for Snowflake customers to run containerized workloads natively, without data ever leaving the Data Cloud. Snowpark Container Services (SPCS) was launched at Snowflake Summit; more details can be found here. There was also some earlier work on running R inside a Hex environment here.
In this blog you will see how easy it is to spin up an instance, upload a Docker image, and run R inside Snowflake. I will build a basic end-to-end ML project for demonstration, using the classic iris dataset. Snowpark Container Services (SPCS) is currently available to a subset of customers in Private Preview. It is a fully managed container offering that allows you to easily deploy, manage, and scale containerized services, jobs, and functions, all within the security and governance boundaries of Snowflake, with zero data movement. More-Info
How can you run this demo?
A Snowflake account that supports Snowpark Container Services: this private preview works with Snowflake accounts created specifically for evaluating Snowpark Container Services.
These accounts have the Snowflake image registry
enabled so that you can create one or more image repositories. In addition, these accounts have access to new Snowflake features, such as services and jobs
, managed by Snowpark Container Services. Please talk to your account representative to activate this flag. Lastly, I have used the enterprise edition of RStudio, which requires a license key.
Steps for the demo:
- Create the required infrastructure
- Download the Docker image
- Create the specification YAML
- Create & run the service
- Create the R project
- Connect RStudio with Snowflake
- Bring in the dataset
- Create the ML project & write back to Snowflake
1. Create the required infrastructure
Let's create some basic objects: a compute pool, warehouse, database, schema, stage, and image repository.
CREATE COMPUTE POOL yourComputepool
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = standard_1;
CREATE OR REPLACE WAREHOUSE yourWH
WITH WAREHOUSE_SIZE='X-SMALL'
AUTO_SUSPEND = 180
AUTO_RESUME = true
INITIALLY_SUSPENDED=false;
CREATE DATABASE yourdatabase;
CREATE SCHEMA yourschema;
CREATE OR REPLACE IMAGE REPOSITORY yourimage;
CREATE STAGE yourstage DIRECTORY = ( ENABLE = true );
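Before pushing anything, it can help to confirm the pool is coming up and to look up the registry URL for the repository. A minimal sketch, assuming the placeholder names above:

```sql
-- Wait for the compute pool to report a ready state
DESCRIBE COMPUTE POOL yourComputepool;

-- The repository_url column gives the registry path used when tagging and pushing the image
SHOW IMAGE REPOSITORIES;
```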
2. Download the Docker image
Let's download the Docker image. I am using the licensed RStudio (Posit Workbench) enterprise edition, but you can just as well use an open-source RStudio container. Running the enterprise edition requires a license key; it is pretty straightforward.
Now let's do some Docker work: logging in, tagging, and pushing the container to SPCS. I have named the image rstudio:v1.
https://github.com/orgs/rstudio/packages/container/package/rstudio-workbench-preview
docker login youraccount.registry.snowflakecomputing.com
docker pull <image>
docker tag <image> youraccount.registry.snowflakecomputing.com/yourdatabase/yourschema/yourimage/rstudio:v1
docker push youraccount.registry.snowflakecomputing.com/yourdatabase/yourschema/yourimage/rstudio:v1
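If your preview account supports it, you can confirm the image landed in the repository directly from SQL; a sketch using the repository name above:

```sql
-- List the images pushed to the repository; rstudio:v1 should appear here
SHOW IMAGES IN IMAGE REPOSITORY yourimage;
```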
3. Create the specification YAML (R.yaml)
Create the YAML file pointing to the image registry and upload it to the stage. You can specify multiple environment variables, but this is a basic example.
spec:
  container:
  - name: rstudio
    image: youraccount.registry.snowflakecomputing.com/yourdatabase/yourschema/yourimage/rstudio:v1
    env:
      RSW_LICENSE: "XXXXXXXX"
    resources:
      limits:
        memory: 16Gi
      requests:
        memory: 4Gi
  endpoint:
  - name: e1
    port: 8080
    public: true
4. Create the service
We are creating the R_notebook service here.
CREATE SERVICE R_notebook
MIN_INSTANCES=1
MAX_INSTANCES=3
COMPUTE_POOL=yourComputepool
SPEC=@yourstage/R.yaml;
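The service takes a short while to start. A sketch for checking on it; the exact endpoint-listing syntax may differ slightly in your preview version:

```sql
-- Poll until the container reports READY
CALL SYSTEM$GET_SERVICE_STATUS('R_notebook');

-- Retrieve the public endpoint URL for the service
SHOW ENDPOINTS IN SERVICE R_notebook;
```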
5. Endpoint / running the service
Using the endpoint URL, log in with your RStudio credentials. We now have RStudio running within Snowflake.
Create a new RStudio session using Posit Workbench.
6. Connecting RStudio with Snowflake
Now that we have RStudio running inside SPCS, let's connect it to Snowflake. This can be done either through the UI or with code:
# Download tidyverse, which includes dplyr
install.packages('tidyverse')
library(tidyverse)
library(DBI)

con <- dbConnect(odbc::odbc(), Driver = "snowflake",
                 Server = "account.snowflakecomputing.com",
                 UID = "user", PWD = rstudioapi::askForPassword("Database password:"),
                 Database = "yourDatabase", Warehouse = "yourWarehouse", Schema = "yourSchema")
Alternatively, you can enter these credentials in the UI or define them in a connection:
Server : account.snowflakecomputing.com
User : User
Database : yourDatabase
Warehouse : yourWarehouse
Schema : yourSchema
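With the connection in place, a quick sanity query confirms the session context. A minimal sketch; `con` is the DBI connection created above:

```r
library(DBI)

# Confirm which user, database, and warehouse the session is using
dbGetQuery(con, "SELECT CURRENT_USER(), CURRENT_DATABASE(), CURRENT_WAREHOUSE()")
```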
7. Bringing in the dataset
We will use the iris data and write the training and test datasets back to Snowflake. I used the project from Source-Credit
# Importing libraries
library(tidyverse)
library(datasets)
library(caret)
# Importing the Iris data set
data(iris)
# Check whether there is any missing data
sum(is.na(iris))
# To achieve reproducible model; set the random seed number
set.seed(16)
# Split dataset 80/20
TrainingIndex <- createDataPartition(iris$Species, p=0.8, list = FALSE)
TrainingSet <- iris[TrainingIndex,] # Training Set
TestingSet <- iris[-TrainingIndex,] # Test Set
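A quick check on the split: iris has 150 rows (50 per species), so a stratified 80% partition yields 120 training and 30 test rows:

```r
nrow(TrainingSet)           # 120 training rows
nrow(TestingSet)            # 30 test rows
table(TrainingSet$Species)  # 40 rows per species, since the split is stratified
```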
Defining the target objects and writing back to Snowflake:
Database: r_studio
Schema: public
Tables: TrainingSet, TestingSet (as data frames)
# Writing Training / Test to Snowflake
dbWriteTable(con, SQL("r_studio.public.TrainingSet"), TrainingSet)
dbWriteTable(con, SQL("r_studio.public.TestingSet"), TestingSet)
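To verify the write-back, you can count the rows from the same connection; a sketch using the fully qualified names above:

```r
# The training table should hold the 120-row stratified training split
dbGetQuery(con, "SELECT COUNT(*) AS n FROM r_studio.public.TrainingSet")
```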
8. Creating the ML project & writing back to Snowflake
Let's build a quick classification model on the iris dataset; for this example I have used an SVM. Again, the goal of this blog is not to build the best possible model but to show how R can run in Snowflake using SPCS.
###############################
# SVM model (polynomial kernel)
# Build Training model
Model <- train(Species ~ ., data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
# Build CV model
Model.cv <- train(Species ~ ., data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="cv", number=10),
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
# Apply model for prediction
Model.training <-predict(Model, TrainingSet) # Apply model to make prediction on Training set
Model.testing <-predict(Model, TestingSet) # Apply model to make prediction on Testing set
Model.cv <-predict(Model.cv, TrainingSet) # Perform cross-validation
# Model performance (Displays confusion matrix and statistics)
Model.training.confusion <-confusionMatrix(Model.training, TrainingSet$Species)
Model.testing.confusion <-confusionMatrix(Model.testing, TestingSet$Species)
Model.cv.confusion <-confusionMatrix(Model.cv, TrainingSet$Species)
print(Model.training.confusion)
print(Model.testing.confusion)
print(Model.cv.confusion)
# Feature importance
Importance <- varImp(Model)
plot(Importance)
plot(Importance, col = "red")
Conclusion
With Snowpark Container Services, Snowflake customers can now host and run arbitrary containerized applications within the Data Cloud. Running RStudio, powered by Posit, is an excellent example.
This feature is in limited private preview and can only be activated by request for now. For more information about SPCS, contact your Snowflake account team or sales representative.