Optimized training of a RandomForest Regressor in Snowflake

Samuel Ehrlich | Sales Engineer Snowflake Computing

Snowflake is an incredible platform built for a data-first architecture. It creates a secure and elegant experience that maximizes the data assets you curate. Everything from real-time analytics and massive data processing to building and deploying machine learning is made simple in Snowflake. Further, Snowflake allows you to create customized machines that can run your model training up to 10x faster than a standard warehouse.

In this article, we briefly describe how to use Snowpark Container Services to train a Random Forest regression algorithm. We will discuss deploying a container service, and then running the code to train the regressor. This provides a 4–10x speedup over running the same training on a regular Snowflake warehouse.

Step 1 — Create the Container Image.

For this, you will need Docker installed on your machine. Your Dockerfile will look something like this:

ARG BASE_IMAGE=python:3.11
FROM $BASE_IMAGE
RUN apt-get update && \
apt-get install -y python3-pip
WORKDIR /usr/app
COPY . /usr/app
RUN pip install -r requirements.txt
EXPOSE 8888
# Launch the Jupyter notebook server. NOTE: ENTRYPOINT (or CMD) instructions run each time a container is launched!
ENTRYPOINT ["jupyter", "lab", "--allow-root", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.password=''"]

Next, we will need to publish this image to the Snowflake image registry. You can create an image repository in Snowflake with the following command:

CREATE OR REPLACE IMAGE REPOSITORY spcs_repository;

The Snowflake image registry is a great place to store container images so you can use them to create services down the road. You can push your image like this:

docker login <registry-name>
docker tag <local_repo:latest> <registry-repo:latest>
docker push <registry-repo:latest>

This loads your image into the registry so the container service we create in a minute can use it.
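If you have not already built the image locally, note that the tag step above assumes an image with that name already exists; a typical build command, run from the directory containing the Dockerfile, would be:

docker build -t <local_repo:latest> .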

Now, one of the primary reasons we can make the training so much faster is that we can use a much larger, more powerful machine. For this we are going to create a compute pool with a high-memory large node type. This will allow training on close to 128 cores!

Step 2 — Create the Compute Pool

USE ROLE ACCOUNTADMIN;

CREATE COMPUTE POOL high_mem_training
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = HIGHMEM_X64_L;
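Once the pool is created, it is worth confirming that it reaches an ACTIVE or IDLE state before starting the service, and suspending it when you are finished training so the nodes stop accruing credits. For example:

DESCRIBE COMPUTE POOL high_mem_training;

-- When you are done training:
ALTER COMPUTE POOL high_mem_training SUSPEND;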

To make sure the container has the network access it needs, we'll go ahead and enable some network rules:

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE NETWORK RULE allow_all_rule
TYPE = 'HOST_PORT'
MODE = 'EGRESS'
VALUE_LIST = ('0.0.0.0:443','0.0.0.0:80');

CREATE EXTERNAL ACCESS INTEGRATION allow_all
ALLOWED_NETWORK_RULES = (allow_all_rule)
ENABLED = true;

Finally, we can create our service. For more details on creating services, see the Snowflake documentation. However, before we run the CREATE SERVICE command, we will need to upload the YAML specification file to a stage. The YAML file can look something like this. Note that you will also need to create a second stage to mount as the data directory (a sketch of both stages follows the spec below).

spec:
  containers:
  - name: base-jupyter
    image: sfsenorthamerica-demo-sehrlich.registry.snowflakecomputing.com/db1/schema_db/spcs_repository/base-jupyter:latest
    volumeMounts:
    - name: data-stg
      mountPath: /data
  endpoints:
  - name: base-jupyter
    port: 8888
    public: true
  volumes:
  - name: data-stg
    source: "@GPU_SSE_STAGE"
  networkPolicyConfig:
    allowInternetEgress: true
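Before creating the service, upload this spec to a stage and create the data stage referenced by the volume mount. Here is a minimal sketch; the stage names simply match the spec above and the CREATE SERVICE statement in the next step, and the data stage must use server-side encryption so the container can mount it:

CREATE STAGE IF NOT EXISTS spcs_stage;

CREATE STAGE IF NOT EXISTS gpu_sse_stage
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Run from SnowSQL (or another client that supports PUT) on the machine that has the file
PUT file://<local_path>/base-jupyter.yaml @spcs_stage AUTO_COMPRESS = FALSE OVERWRITE = TRUE;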

Step 3 — Create the Service

You can create your service like this:

USE ROLE DBA;

CREATE SERVICE base_jupyter
IN COMPUTE POOL high_mem_training
FROM @spcs_stage
SPECIFICATION_FILE='base-jupyter.yaml'
MIN_INSTANCES=1
MAX_INSTANCES=1
EXTERNAL_ACCESS_INTEGRATIONS = (ALLOW_ALL);

At this point we have everything we need to connect to our environment. If for some reason you are running into trouble, you can see the troubleshooting guide.
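A couple of built-in system functions are also handy for quick checks (the container name here matches the spec above):

-- Confirm the service and its container are running
SELECT SYSTEM$GET_SERVICE_STATUS('base_jupyter');

-- Pull recent logs from the container if it is not healthy
SELECT SYSTEM$GET_SERVICE_LOGS('base_jupyter', '0', 'base-jupyter');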

Now, you can run this command to get the URL for your container service.

SHOW ENDPOINTS IN SERVICE base_jupyter;

In the next section we are going to access our container, which is running a Jupyter notebook instance. This instance is where we will:

  • Connect to Snowflake and pull the data (a connection sketch follows this list).
  • Define the regression model and run the fitting.
  • Run the prediction on our test data to see the results.
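Inside the container, the first thing we need is a Snowpark session. Snowpark Container Services injects an OAuth token and connection details into every running container, so a minimal connection sketch following that standard pattern (adjust if your setup differs) looks like this:

import os
from snowflake.snowpark import Session

def get_login_token():
    # SPCS mounts a short-lived OAuth token at this path inside the container
    with open("/snowflake/session/token", "r") as f:
        return f.read()

# Host and account are provided to the container as environment variables
session = Session.builder.configs({
    "host": os.getenv("SNOWFLAKE_HOST"),
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "token": get_login_token(),
    "authenticator": "oauth",
}).create()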

Take a look at this GitHub repository to review the simple code we will use to complete the steps above. It also has the test and training data relevant to this example. The core of the model looks something like this.

One thing to note is that when we define our regressor we set the n_jobs value to -1. This makes sure the machine takes advantage of all of its cores during training.

Step 4 — Training the Regressor model.

from sklearn.ensemble import RandomForestRegressor

# session is the Snowpark session created above
test = session.table('DB1.SCHEMA_DB.TEST')
train = session.table('DB1.SCHEMA_DB.TRAIN')

# Pull the tables into pandas; AMOUNT is the label column
# (TEST is assumed to have the same columns as TRAIN, including AMOUNT)
y_train = train.select("AMOUNT").to_pandas()["AMOUNT"]
x_train = train.drop("AMOUNT").to_pandas()
x_test = test.drop("AMOUNT").to_pandas()

# n_jobs=-1 fans the tree building out across every available core
regressor = RandomForestRegressor(n_jobs=-1)
regressor.fit(x_train, y_train)

predictions = regressor.predict(x_test)
predictions[:5]

At its core, this model definition is very simple and allows you to easily change parameters to fit the modality of your data.
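For example, here is a sketch of tuning a few common hyperparameters and scoring the fit against the held-out TEST table. It reuses x_train, y_train, and test from the step above, assumes TEST also contains the AMOUNT label, and uses purely illustrative parameter values:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

y_test = test.select("AMOUNT").to_pandas()["AMOUNT"]
x_test = test.drop("AMOUNT").to_pandas()

regressor = RandomForestRegressor(
    n_estimators=500,    # more trees: more stable predictions, longer training
    max_depth=None,      # let trees grow fully; cap this if you see overfitting
    min_samples_leaf=2,  # smooths leaf predictions slightly
    n_jobs=-1,           # still fan out across every core on the HIGHMEM node
    random_state=42,
)
regressor.fit(x_train, y_train)

predictions = regressor.predict(x_test)
print("MAE:", mean_absolute_error(y_test, predictions))
print("R2: ", r2_score(y_test, predictions))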

Now, at some point you might find it useful to gather statistics about compute usage in case you want to find ways to minimize training time. Things like CPU, memory consumption, and I/O are all relevant to making sure your algorithm is maximizing the machine's utilization. Snowflake has a very simple way to create a Grafana view of these metrics.

Step 5 (OPTIONAL) — Monitor Compute Metrics with Grafana

Setting this up is outside the scope of this article, but it is very easy to do and use. You can follow this guide.

And there you have it! Snowflake makes it incredibly easy to spin up a safe, secure, yet powerful machine to perform any number of functions or applications. And what's more, Snowflake's built-in Snowpark API makes it extremely easy to securely work with data that is already living inside Snowflake.
