Run multi-node, multi-GPU Ray inside Snowflake

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

Table of Contents

· Summary: Running Ray in Snowpark Container Services
· Motivation behind bringing Ray into Snowpark Container Services
· Deeper Dive into the Problem Statement and Solution
· Show me the code
· What else is possible with setting up Ray on Snowpark Container Services
· Ray Architecture
· High level approach for bringing Ray within Snowpark Container Services
· Under the hood: Ray deployment into Snowpark Container Services
· Automating Ray deployment into Snowpark Container Services (4 simple steps)
    Prerequisites before setup
    Deploying Ray within Snowpark Container Services in 4 simple steps
· Conclusion

Summary: Running Ray in Snowpark Container Services

This blog post focuses on how to bring Ray, a popular open source framework, into Snowpark Container Services, which is in private preview. You can read more about Ray here. I will also demonstrate serving streaming responses from the lmsys/vicuna-13b-v1.5-16k model, which has a context window of 16K tokens and operates under the Llama 2 Community License Agreement. This model is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT. I will host this model on Ray Serve using vLLM within Snowpark Container Services. Here are the key takeaways from this blog:

  • We don't need heavy, expensive GPU machinery like GPU_10 to host GPU-intensive large language models (LLMs); we can shard them with vLLM and deploy them on smaller GPU_7 instances at half the cost using Ray Serve on Snowpark Container Services.
  • Through the automation in the last section, Automating Ray deployment into Snowpark Container Services (4 simple steps), you only need to modify a few variables to kick off the Ray deployment process within minutes.

Motivation behind bringing Ray into Snowpark Container Services

Snowflake announced Snowpark Container Services in private preview on AWS at Snowflake Summit 2023. This Snowpark runtime frees users from managing and maintaining the clusters of containers that sophisticated, data-intensive workloads such as machine learning and generative AI require. And because developers do not need to move data out of Snowflake, Snowpark Container Services reduces the risk of exposing data to external threats. With Snowpark Container Services, you can:

  • Alleviate operational burden, so you focus on development: Deliver custom applications faster by keeping data where it’s secure.
  • Protect your data and models: Your data is your most valuable asset, along with the models that you fine-tune from that data.
  • Simplify infrastructure: No need to manage GPU node groups individually in Cloud Service Providers (CSPs). You can deploy containers with LLMs and other app components securely in Snowflake.

The Ray open source community and Anyscale, the managed Ray offering, have published a plethora of blog posts on why Ray makes sense for distributing Python and AI workloads. The following table highlights a few of the posts that made me fall in love with Ray and motivated me to bring it into Snowpark Container Services.

With Ray on Snowpark Container Services, I set out to answer a question: do I really need a single, expensive instance like GPU_10 (which has 8 NVIDIA A100s with 320GB of total GPU memory) to host a model like lmsys/vicuna-13b-v1.5-16k, or can I get away with a multi-node, multi-GPU Ray cluster on Snowpark Container Services built from smaller GPU_7s (each GPU_7 has 4 NVIDIA A10Gs with 24GB of GPU memory each) and serve this model using Ray Serve at half the cost?

The concepts in this blog apply regardless of which LLM you choose. I debated between lmsys/vicuna-13b-v1.5-16k and meta-llama/Llama-2-70b-chat-hf before settling on the former. Although both models require a lot of GPU memory for inference, lmsys/vicuna-13b-v1.5-16k supports a context of up to 16K tokens, while meta-llama/Llama-2-70b-chat-hf is limited to a 4K-token context. If you want to switch to another model like meta-llama/Llama-2-70b-chat-hf, all you need to do is change the variable HF_MODEL in my public GitHub repo here. Now, let's dig deeper into the problem statement.

Deeper Dive into the Problem Statement and Solution

So, what's special about the lmsys/vicuna-13b-v1.5-16k model, and why would you consider hosting it on Ray Serve within Snowpark Container Services? Based on my testing, I could load this model using vLLM on a single GPU_7 instance (each GPU_7 has 4 NVIDIA A10Gs with 24GB of GPU memory each) on Snowpark Container Services; it only took ~24GB of GPU RAM, as observed through the nvidia-smi command. However, inference with prompts of up to 16K tokens could not be done on a single GPU_7, since the processes would simply get killed after running out of memory. Hence, I had two options:

  • Option 1: Request an expensive GPU_10 so that inference can be performed on prompts of up to 16K tokens.
  • Option 2: Create a scalable, multi-node, multi-GPU cluster with one GPU_3 head node (1 NVIDIA A10G with 24GB GPU memory) and two GPU_7 worker nodes, which could host this model for distributed GPU inference.

I went with Option 2 because GPU_10 instances are not only expensive but also hard to procure. If I could connect multiple GPU nodes together with an orchestrator, I could easily scale this out. This is where Ray came in on top of Snowpark Container Services. With this setup of Ray on Snowpark Container Services, not only was I able to serve the model for all test cases of up to 16K tokens, but it also ended up costing half as much as doing it on a single GPU_10.

Here's what the experience looks like for the end consumer. Once the model is deployed as a Ray Serve application within Snowpark Container Services, you can call it from Streamlit in Snowpark Container Services or from a Jupyter notebook running on the Ray head node; the choice is yours (a minimal client sketch follows Figure 1). Figure 1 shows this in action, with the model summarizing a 6,340-word paper while all 8 GPUs are busy.

Figure 1: Text Summarization with Ray and Streamlit in Snowpark Container Services
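
For illustration, here is a minimal sketch of how a client (a Jupyter notebook on the head node, or the Streamlit app) might consume the Ray Serve endpoint and stream the response. The host name, route, and payload shape below are assumptions for illustration only; they depend on how the Serve application is defined.

import requests

# Hypothetical Ray Serve HTTP endpoint; Serve listens on port 8000 by default.
# The host name and route are placeholders, not the exact values from the repo.
url = "http://spcs-ray-custom-head-service:8000/vicuna"
payload = {"prompt": "Summarize the following paper: ...", "max_tokens": 512}

# Stream the completion chunk by chunk instead of waiting for the full response.
with requests.post(url, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)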

Figure 2 shows the Ray cluster setup on Snowpark Container Services for this example: one Ray head node (GPU_3) and two Ray worker nodes (2 GPU_7s) serving the Vicuna 13B model.

Figure 2: Ray setup in Snowpark Container Services

Figure 3 shows the model deployed as a Ray Serve application within Snowpark Container Services.

Figure 3: lmsys/vicuna-13b-v1.5-16k model deployed on Ray Serve within Snowpark Container Services

The associated Grafana dashboard deployed within Snowpark Container Services is shown in Figure 4.

Figure 4: Grafana within Snowpark Container Services

Show me the code

If you would like to see the code and replicate it in your environment, the entire repo is available on public GitHub here.

What else is possible with setting up Ray on Snowpark Container Services

The setup in this blog is just one of the configurations possible with Ray on Snowpark Container Services. At Snowday, I also demonstrated a cluster of 13 active GPU nodes on Snowpark Container Services, where two of the nodes were GPU_7s and the other 11 were GPU_3s (each GPU_3 has 24GB of GPU memory). I used that setup to build a Retrieval Augmented Generation (RAG) pipeline within Snowpark Container Services, where text embeddings were computed in parallel across multiple GPUs using Ray's distributed processing. I also hosted a Llama2-13B model on Ray Serve on a single GPU_7. You can either watch the whole video or seek forward to 27:51. The Ray in Snowpark Container Services setup shown during Snowday is highlighted in Figure 5.

Figure 5: Ray in Snowpark Container Services setup for RAG @ Snowday

Ray Architecture

Before I delve deeper into what it took to install Ray in Snowpark Container Services, let's quickly review the Ray architecture, which is shown in Figure 6 below. A Ray cluster consists of a single head node and any number of connected worker nodes; the cluster in Figure 6 contains two worker nodes. Each node runs Ray helper processes to facilitate distributed scheduling and memory management, and the head node runs additional control processes (highlighted in blue). A minimal code sketch of this model follows Figure 6. More details about the Ray architecture and how it supports distributed processing can be found here. You can also read more about the Ray Serve architecture here.

Figure 6: Ray Architecture (source)
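
As a quick illustration of this model, here is a minimal sketch of what a driver running on the head node could do to fan work out across the cluster. The task body and resource request are placeholders, not code from the repo.

import ray

# Connect to the already-running Ray cluster; "auto" discovers the local
# cluster started by `ray start` on the head node.
ray.init(address="auto")

# The cluster's aggregate resources include the CPUs and GPUs contributed by
# the head node and every connected worker node.
print(ray.cluster_resources())

@ray.remote(num_gpus=1)  # placeholder task; the scheduler picks any node with a free GPU
def square(x: int) -> int:
    return x * x

print(ray.get([square.remote(i) for i in range(8)]))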

Now that we know what the Ray architecture looks like, let's dig into installing Ray within Snowpark Container Services.

High level approach for bringing Ray within Snowpark Container Services

Here's a quick tech primer on the relevant Snowpark Container Services concepts. To set up Ray on Snowpark Container Services:

  1. I needed a service which could act as the Ray head node. I gave this service the name SPCS_RAY_CUSTOM_HEAD_SERVICE. A service always runs on a compute pool, and I chose GPU_3 as the type of the compute pool to host the Ray head node. GPU_3 is the smallest possible GPU compute pool type available in Snowpark Container Services. I also gave that compute pool the name VICUNA13B_RAY_HEAD_POOL.
  2. I needed a service, with multiple service replicas, that could act as the Ray worker nodes. Ray allows any type of worker node to be connected to the Ray head node, so developers in Snowpark Container Services can choose whether to connect a CPU-only or a GPU-based compute pool to the Ray head node. To keep the setup flexible, I added the option in the automation script to have two types of worker resources:
    — Type 1: Ray workers tagged with “generic_label”.
    — Type 2: Special Ray workers tagged with “custom_llm_serving_label”.
    You can read more about the concept of Custom Resources in Ray here.
    In this setup, I keep the workers tagged “generic_label” reserved for generic CPU/GPU work, and use the workers tagged “custom_llm_serving_label” only to deploy Ray Serve applications.
    Since my objective in this blog was just to host the lmsys/vicuna-13b-v1.5-16k model as a Ray Serve application within Snowpark Container Services, I only needed the worker nodes with “custom_llm_serving_label”. I deployed these workers on a compute pool I called VICUNA13B_RAY_SERVE_POOL.
  3. I also needed a Snowpark Container Services job to deploy the lmsys/vicuna-13b-v1.5-16k model as a Ray Serve application within Snowpark Container Services, on the Ray worker nodes carrying the “custom_llm_serving_label” resource tag. As mentioned earlier, inference on 16K-token prompts with this model uses so much GPU memory that it only fits across 2 GPU_7s. To achieve this, vLLM was used with tensor_parallel_size set to 8 for distributed tensor-parallel inference across 8 GPUs. Since the 2 GPU_7s were part of the Ray cluster and each GPU_7 worker contributed 4 GPUs, the multi-GPU serving could span the Ray cluster as seen in Figure 7 below (a simplified code sketch follows the figure). You can read more about the tensor_parallel_size argument here.
  4. I also needed a Streamlit service in Snowpark Container Services to consume the results of the lmsys/vicuna-13b-v1.5-16k model. This Streamlit app lets the user submit a prompt to the Ray Serve application and streams the output back to the user.
Figure 7: Distributed multi-GPU inference with Ray and vLLM
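
To make this concrete, here is a minimal, simplified sketch of what such a Ray Serve deployment could look like. It is not the exact code from the repo: the class name, sampling parameters, and request/response shape are assumptions, and the real application streams tokens using vLLM's async engine rather than returning the full completion in one shot.

from starlette.requests import Request
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(
    # Assumption: pin the replica to workers that advertise the custom
    # "custom_llm_serving_label" resource (the special Ray Serve workers above).
    ray_actor_options={"resources": {"custom_llm_serving_label": 1}},
)
class VicunaDeployment:
    def __init__(self):
        # tensor_parallel_size=8 shards the model across the 8 A10G GPUs
        # contributed by the two GPU_7 worker nodes.
        self.llm = LLM(model="lmsys/vicuna-13b-v1.5-16k", tensor_parallel_size=8)

    async def __call__(self, request: Request) -> str:
        body = await request.json()
        params = SamplingParams(
            temperature=0.7, max_tokens=body.get("max_tokens", 512)
        )
        outputs = self.llm.generate([body["prompt"]], params)
        return outputs[0].outputs[0].text

# Bind and run the deployment; Serve exposes it over HTTP (port 8000 by default).
serve.run(VicunaDeployment.bind())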

The Ray head and Ray worker services communicate with each other through service-to-service communication in Snowpark Container Services, with the final architecture for this setup shown in Figure 8 (a connection sketch follows the figure).

Figure 8: Ray architecture in Snowpark Container Services
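
To make the service-to-service wiring concrete, here is a minimal sketch of how another container in the account (for example, a Jupyter service) could attach to the Ray head over the head service's internal DNS name. The DNS name below is a placeholder; 10001 is Ray's default client server port.

import socket
import ray

# Placeholder DNS name for the Ray head service inside Snowpark Container Services.
ray.init(address="ray://spcs-ray-custom-head-service:10001")

@ray.remote
def whoami() -> str:
    # Runs on whichever cluster node the Ray scheduler picks.
    return socket.gethostname()

print(ray.get([whoami.remote() for _ in range(4)]))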

Under the hood: Ray deployment into Snowpark Container Services

In order to deploy Ray in Snowpark Container Services, I needed six Docker images to be built and pushed into the Snowpark Container Services image repository: Grafana, Prometheus, the Ray head, the Ray workers (those tagged with the generic_label resource), the special Ray workers (those tagged with the custom_llm_serving_label resource), and the Snowpark Container Services job that deploys the Ray Serve application. For each of these containers, I needed to:

  1. Define a Dockerfile based on an NVIDIA development image.
  2. Build the Docker image and push it to the Snowpark Container Services image repository.
  3. Define a specification file for the Docker image.
  4. Create a compute pool and run a service and/or job on the compute pool.

All of this presented the opportunity to automate the deployment components end to end, so that users can replicate this setup in their own environment with only a few parameter changes (such as database, schema, repository_url, Hugging Face token, etc.). Hence, I built an automation for deploying the multi-node, multi-GPU setup of Ray on Snowpark Container Services, along with hosting the Vicuna 13B model on Ray Serve within Snowpark Container Services. I also posted the automation code on GitHub here so that it can be easily replicated in the reader's Snowpark Container Services environment.

Automating Ray deployment into Snowpark Container Services (4 simple steps)

With this automation, I can deploy Ray into Snowpark Container Services in 4 simple steps. Before I walk through these steps, there are a few prerequisites to cover.

Prerequisites before setup

Before kicking off the automation to deploy Ray in Snowpark Container Services, there are three important prerequisites:

  1. Docker Desktop
  2. SnowSQL
    — Instructions for installing SnowSQL are here. After installation, check that you can run snowsql -v in a new terminal. If that command doesn't work, the terminal cannot find the installed snowsql binary; in that case, add an alias for snowsql to ~/.bash_profile and run source ~/.bash_profile before proceeding with the steps below.
  3. Access to Snowpark Container Services in Private Preview
    — Note that once you are granted access to Snowpark Container Services in Private Preview, you must be able to create one GPU_3 compute pool with 1 node and one GPU_7 compute pool with 2 nodes for this setup.

Deploying Ray within Snowpark Container Services in 4 simple steps

Step 1: Create basic Snowflake objects needed for Snowpark Container Services container deployment

First, I needed to create some basic Snowflake objects: a database, a schema, a stage for hosting the container specification YAML files, and an image repository to host the images. This is shown below.

create database if not exists MYDB;
use database MYDB;
create schema if not exists vicuna13bonrayserve;
use schema vicuna13bonrayserve;
-- stage for the container specification YAML files
create stage if not exists SPEC_STAGE;
-- image repository that will hold the Docker images
create image repository if not exists LLM_REPO;
-- note the repository_url value; it is needed for docker login and the automation script
show image repositories in schema;

Step 2: SnowSQL connection configuration

Once SnowSQL is installed, a connection needs to be configured (in ~/.snowsql/config). The configuration is shown below; the SnowSQL connection is named fcto.

[connections.fcto]
accountname = XXX
username = XXX
password = XXX
warehouse = XXX
dbname = XXX
schemaname = XXX
rolename = XXX

Step 3: Docker login

Then, I performed a docker login using the image repository URL obtained in Step 1 above. I updated the image repository URL in the do_login.sh file and then executed the shell script:

sh bin/do_login.sh

Step 4: Execute the automation script

There is an automation script called configure_project.sh, in which I needed to update the few variables shown below.

#these variables definitely need to be changed
repository_url="myaccount.registry.snowflakecomputing.com/mydb/vicuna13bonrayserve/llm_repo"
database="mydb"
schema="vicuna13bonrayserve"
spec_stage="spec_stage"
hf_token="X"
snowsql_connection_name=fcto

#these variables are good enough for the Vicuna model on Ray Serve in Snowpark Container Services. No need to change
num_ray_workers=0
num_additional_special_ray_workers_for_ray_serve=2
ray_head_node_type=GPU_3
ray_worker_node_type=NA
special_ray_worker_for_ray_serve_node_type=GPU_7
default_compute_pool_keep_alive_secs=120
ray_head_compute_pool_name=VICUNA13B_RAY_HEAD_POOL
ray_worker_compute_pool_name=NA
rayserve_compute_pool_name=VICUNA13B_RAY_SERVE_POOL
streamlit_feedback_table_name=ST_FEEDBACK
job_manifest_file=ray_serve_vllm_vicuna13b_manifest_v27.yaml

Here's what these variables mean: repository_url, database, schema, and spec_stage point to the image repository, database, schema, and specification stage created in Step 1; hf_token is your Hugging Face access token; snowsql_connection_name is the SnowSQL connection configured in Step 2; the num_* and *_node_type variables control how many Ray workers of each type are created and which compute pool instance types back the head, generic workers, and Ray Serve workers; the *_compute_pool_name variables name those compute pools; default_compute_pool_keep_alive_secs sets how long idle compute pools are kept alive; streamlit_feedback_table_name is the table the Streamlit app writes user feedback to; and job_manifest_file is the specification for the Ray Serve deployment job.

Once the variables are defined, I can execute the automation script with the following actions and kick off the deployment of Ray into Snowpark Container Services within minutes:

  1. ./configure_project.sh --action=update_variables
  2. ./configure_project.sh --action=deploy_all

A full reference of the available actions with this automation script is shown below:

Conclusion

Ray in Snowpark Container Services is a powerful combination for accelerating distributed workloads as well as Gen-AI/LLM applications. With Ray on Snowpark Container Services:

  • We don't need heavy, expensive GPU machinery like GPU_10 to host GPU-intensive LLMs; we can shard them with vLLM and deploy them on smaller GPU_7 instances at half the cost.
  • Through the automation presented above, users only need to modify a few variables to kick off the Ray deployment process within minutes.

I am very excited about Ray on Snowpark Container Services and am absolutely delighted to have such a powerful platform at my fingertips.
