Using Snowflake To Train, Tune, and Run NVIDIA NeMo-Powered LLMs

Last month, at our annual Snowflake Summit 2023, we announced our collaboration with NVIDIA to provide our customers an accelerated path to create customized generative AI applications using NVIDIA AI and their own proprietary data, securely within the Snowflake Data Cloud.

With the NVIDIA NeMo framework for building, customizing, and deploying large language models (LLMs) and NVIDIA accelerated computing, Snowflake will enable enterprises to use data in their Snowflake accounts to build custom LLMs for advanced generative AI services, including chatbots, search, and summarization. The ability to customize LLMs without moving or exposing training and tuning data keeps proprietary information fully secured and governed within the Snowflake platform. To learn more about this announcement, please visit here.

NeMo is an end-to-end framework that enables customers to perform data curation, LLM development and customization, and LLM deployment. Fine-tuning models with their own first-party data allows customers to build bespoke models. NeMo supports many fine-tuning options that update the base LLM's parameters, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), as well as parameter-efficient fine-tuning (PEFT) options such as p-tuning, prompt tuning, adapters, low-rank adaptation (LoRA), and IA3. Because pretrained LLMs can be fine-tuned so efficiently, new tasks can be added to a base LLM at relatively low cost.

For more information on p-tuning and prompt-tuning an LLM with NeMo, refer to this documentation page. This blog walks you through the steps to accomplish this kind of LLM tuning in Snowflake.

In this scenario, we have built two Snowpark Container Services. The first performs the prompt-learning and hosts the p-tuned LLM checkpoint as an API endpoint; the entire prompt-tuning workflow happens within Snowpark Container Services, leveraging hardware acceleration with NVIDIA GPUs in Snowflake.

In the second service, we host a Streamlit app with a chat-based UI that calls the API service through a Snowflake external function and receives responses based on the context.

LLM Tuning Overview in Snowpark Container Services — The Big Picture

To perform prompt-tuning and p-tuning on a NeMo LLM in Snowflake, you will need access to Snowpark Container Services, currently in private preview with an initial set of Snowflake customers. If you are not part of the preview yet, please contact your Snowflake account team to join the waitlist. In the meantime, you can read on below to see how it works.

First, create a GPU-based compute pool. The NeMo framework allows developers to train and deploy LLMs at scale. NeMo currently supports three types of model architectures, and in this demo we will p-tune a GPT-style language model:

  • GPT-style models (decoder only)
  • T5/BART-style models (encoder-decoder)
  • BERT-style models (encoder only)
CREATE COMPUTE POOL nemo_pool_3
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = GPU_3;

This creates a GPU_3-sized compute pool, backed by NVIDIA GPUs, suitable for running efficient prompt-tuning on multiple tasks on top of a pretrained NeMo GPT model. The 345M-parameter model we use here is very small for an LLM, but it lets us demonstrate the end-to-end process and test prompt-tuning before applying it to larger models and larger sets of tuning data.

Preparing the Image in Your Docker Environment

docker pull nvcr.io/nvidia/nemo:22.12
docker login Snowflakeaccount-accountlocator.registry.snowflakecomputing.com
docker tag nvcr.io/nvidia/nemo:22.12 Snowflakeaccount-accountlocator.registry.snowflakecomputing.com/nemo_db/public/nemo_repo/nemo:22.12
docker push Snowflakeaccount-accountlocator.registry.snowflakecomputing.com/nemo_db/public/nemo_repo/nemo:22.12

This pulls the NeMo container image, which includes the multitask prompt-tuning example notebook, already curated and pre-developed for you to prompt-tune multiple tasks. The image is then pushed to Snowflake to create the container service with NVIDIA GPUs. Contact your Snowflake account team for instructions to set up this NeMo multitask prompt-tuning exercise on Snowpark Container Services.

Creating the YAML File for Deployment to Snowpark Container Services

Create a nemo_ptune.yaml file on your local machine for the nemo_prompt_llm endpoint as described above. Each service you create in Snowpark Container Services is defined by a YAML file. Here is the YAML file for the nemo_prompt_llm service. Fill it in and push it to your Snowflake stage using SnowSQL.

spec:
  containers:
  - name: "nemo_prompt_llm"
    image: "Snowflakeaccount-accountlocator.registry.snowflakecomputing.com/nemo_db/public/nemo_repo/nemo:22.12"
    command:
    - /workspace/nemo/start-jupyter.sh
  endpoints:
  - name: "nemo_llm"
    port: 8888
    public: true
  - name: "Streamlitendpoint"
    port: 8080
    public: true

We are creating a container service pointing to the image that was uploaded to the Snowflake image repository within the database and schema. There are two endpoints: the first (nemo_llm, port 8888) will be used for prompt-tuning the LLM, and the second (Streamlitendpoint, port 8080) will host the FastAPI app that a Streamlit app will access through a function.

In this blog, we cover the first endpoint; upcoming blogs in this series will cover the Streamlit endpoint and building a Streamlit chat UI that will be hosted in a Snowpark Container Service (nemo_Streamlit).

Creating the Snowpark Container Service

Now that we have the following, we can create the Snowpark Container Service (nemo_prompt_llm) as below:

  1. Image pushed to the Snowflake image repo,
  2. Container specification YAML file in the stage location,
  3. Compute pool created
CREATE SERVICE nemo_prompt_llm
MIN_INSTANCES = 1
MAX_INSTANCES = 1
COMPUTE_POOL = nemo_pool_3
SPEC = @yaml_stage/nemo_ptune.yaml;

Finally, run this command to get the endpoint URL of your service so that you can connect to the Jupyter notebook, which comes preloaded with the code for you to prompt-tune the NeMo LLM in Snowpark Container Services using Snowflake compute powered by NVIDIA GPUs.

DESCRIBE SERVICE nemo_prompt_llm;

You can then connect to the Jupyter notebook interface running inside Snowpark Container Services, within your Snowflake account. We will show the most interesting steps of the tuning process in the next few sections below. You can see the PEFT NVIDIA example notebook here.

Tuning the Model — Step 1: Load the Tuning Data

The first step is to assemble the tuning input data. The NeMo example tuning notebook contains a few examples of model tuning, including one used to tune a model for sentiment detection. The tuning dataset is the Financial PhraseBank dataset. We load the dataset into the notebook running inside Snowpark Container Services using wget.
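In the notebook, this is a cell along the following lines (the download URL here is a placeholder; the real link is provided in the NeMo example notebook):

# Notebook cell: fetch the Financial PhraseBank archive into the container
# (placeholder URL; use the link from the NeMo example notebook)
!wget https://example.com/FinancialPhraseBank-v1.0.zip -P data/
!unzip -o data/FinancialPhraseBank-v1.0.zip -d data/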

Then we can look at a bit of the tuning data, to get a sense of what the model will be learning to do.
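A quick look can be as simple as the following cell (the file path and latin-1 encoding follow the Financial PhraseBank v1.0 layout; adjust to wherever you unzipped the data):

# Print a few raw examples from the all-annotators-agree split
with open("data/FinancialPhraseBank-v1.0/Sentences_AllAgree.txt", encoding="latin-1") as f:
    for line in list(f)[:5]:
        print(line.strip())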

We can see a few of the lines of training data, in this case examples of neutral sentiment statements about financial topics.

Tuning the Model — Step 2: Downloading the Pretrained LLM

Then we download the pretrained 345M-parameter NeMo LLM into Snowpark Container Services.
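This is another notebook download cell; sketched below using the NGC URL pattern from the NeMo examples (double-check it against the notebook's copy):

# Notebook cell: download the 345M-parameter Megatron-GPT checkpoint from NVIDIA NGC
!wget https://api.ngc.nvidia.com/v2/models/nvidia/nemo/megatron_gpt_345m/versions/1/files/megatron_gpt_345m.nemo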

This is the base model that we will be further tuning to perform a few bespoke tasks, including the sentiment detection example.

Tuning the Model — Step 3: Configuring a Training Experiment

Then, before we start the training, we set some parameters and create an “experiment” that will let us tune the LLM and then evaluate the results.
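Condensed, that configuration cell looks roughly like this (task names and trainer values are illustrative; the notebook's megatron_gpt_prompt_learning_config.yaml supplies the defaults):

from omegaconf import OmegaConf

# Load the prompt-learning config that ships with the NeMo examples
config = OmegaConf.load("conf/megatron_gpt_prompt_learning_config.yaml")

# Point the experiment at the base model and name the new tasks to tune for
config.model.language_model_path = "megatron_gpt_345m.nemo"
config.model.existing_tasks = []
config.model.new_tasks = ["sentiment", "intent_and_slot", "squad"]

# Keep the run small: one GPU and a handful of epochs
config.trainer.devices = 1
config.trainer.max_epochs = 4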

Now we are ready to tune the model inside Snowpark Container Services, within the user's Snowflake account, on NVIDIA GPUs. The results are logged to TensorBoard and also displayed to the user inside the notebook.

Tuning the Model — Step 4: Run the Fine-Tuning Process

We are now ready to start the training process.
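In outline, the training cell builds a PyTorch Lightning trainer, attaches NeMo's experiment manager, and fits the prompt-learning model (module and class names as in NeMo 22.12; the notebook's version also configures a distributed strategy, omitted here for brevity):

import pytorch_lightning as pl
from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import MegatronGPTPromptLearningModel
from nemo.utils.exp_manager import exp_manager

# Build the trainer and wire up experiment logging and checkpointing
trainer = pl.Trainer(**config.trainer)
exp_manager(trainer, config.exp_manager)

# Wrap the frozen 345M base model in a prompt-learning model and tune it
model = MegatronGPTPromptLearningModel(cfg=config.model, trainer=trainer)
trainer.fit(model)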

It takes about 45 minutes on an A10 GPU to train this model before we can test the results.
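A quick smoke test can look like this (the taskname and field names follow the notebook's task templates):

# Ask the tuned model to classify the sentiment of a new sentence
test_examples = [
    {"taskname": "sentiment",
     "sentence": "Profit rose clearly ahead of market expectations."},
]
response = model.generate(inputs=test_examples, length_params=None)
print(response)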

We have created a new model that is tuned for our specific tasks and saved it inside Snowpark Container Services. Before deploying it more broadly, let's first try out the new model in the next section to see if it meets our needs. When you are ready, the trained model can be uploaded to a Snowflake stage and loaded into other Snowpark Container Services in the future.
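That upload could look like the following hypothetical Snowpark Python sketch (the stage name, checkpoint file name, and connection details are all placeholders):

from snowflake.snowpark import Session

# Placeholder connection details for your Snowflake account
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "database": "NEMO_DB",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

# Upload the tuned checkpoint to a stage so other services can load it later
session.file.put("p_tuned_gpt_345m.nemo", "@model_stage", auto_compress=False)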

Here is the link to the Multitask_Prompt_and_PTuning Notebook from NeMo.

Accessing the Tuned Model via a UDF in SQL

After the training is completed, the prompt-tuned LLM can be accessed through a Snowflake function. To reach the model checkpoint from that function, we host a FastAPI app in the same container, nemo_prompt_llm, on port 8080. If you want to learn more about hosting a FastAPI service on Snowpark Container Services, please reach out to your Snowflake account team for the details.
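As a rough sketch, the /intent route has to speak the batch JSON format that Snowflake service functions use, {"data": [[row_index, value], ...]} both in and out (run_inference below is a hypothetical helper standing in for the p-tuned checkpoint):

from fastapi import FastAPI, Request

app = FastAPI()

def run_inference(prompt: str) -> str:
    # Hypothetical helper: invoke the p-tuned NeMo checkpoint loaded in this container
    return "food_type(veggie sub)"  # placeholder response

@app.post("/intent")
async def intent(request: Request):
    body = await request.json()
    # Snowflake sends rows as [row_index, value] and expects the same shape back
    results = [[idx, run_inference(value)] for idx, value in body["data"]]
    return {"data": results}

With the API in place, here is how you create a Snowflake function that will access it: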

CREATE OR REPLACE FUNCTION nemo_llm_function(text VARCHAR) RETURNS VARCHAR
SERVICE = nemo_prompt_llm!Streamlitendpoint
AS '/intent';

Let’s test out this function for inference.

SELECT nemo_llm_function('{"taskname": "intent_and_slot","utterance": "i would like to pickup a veggie sub with a cookie from subway"}') as response;

RESPONSE
food_type(veggie sub)

Now that we have tested the function that accesses the API hosted on the nemo_prompt_llm container service, we can make it accessible from a Streamlit app. In the next part of this blog series, we will walk you through the steps involved in building a chat-based Streamlit app; here's an overview of how it works.

Accessing the Tuned Model via a Streamlit App

The first prompt-tuned task covers intent and slot detection: the intent captures what the user is asking for, and the slot refers to the object being talked about in the context.

The second prompt-tuned task is sentiment. In this interesting context, we ask about the sentiment behind Snowflake's announcement with NVIDIA, gauging statements about NVIDIA's relationship with Snowflake and the cloud.

Finally, for the third prompt-tuned task, SQuAD-style question answering over a context, we provide the same context with a question: “How can companies take advantage of this announcement?” The answer is “to build custom AI models using their internal data.”
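As a preview, the heart of such an app is just a call to the SQL function above. Here is a minimal, hypothetical Streamlit sketch (session bootstrapping via get_active_session is shown for brevity and will differ when the app runs in its own container):

import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()

utterance = st.text_input("Ask the model something")
if utterance:
    # Build the JSON prompt the service function expects, then call it via SQL
    prompt = '{"taskname": "intent_and_slot", "utterance": "%s"}' % utterance
    query = "SELECT nemo_llm_function('" + prompt.replace("'", "''") + "') AS response"
    st.write(session.sql(query).collect()[0]["RESPONSE"])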

Summary

We have tuned and deployed an LLM within Snowpark Container Services!

You have now learned how easily you can build and launch container services within Snowflake, where you can prompt-tune (p-tune) LLMs and access them through a Snowflake function. Snowpark Container Services is fully governed and secured within the Snowflake ecosystem. Most importantly, you don't have to move your data out of Snowflake to build these complex LLM or generative AI solutions.

In the next part of this blog series, we will show you how to build an image for a Streamlit app and host the app on a Snowpark Container Service, giving your consumers this chat UI experience.


Karuna Nadadur
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Karuna is a Sr. Data Cloud Architect at Snowflake with rich experience in data analytics and data science, helping customers make insightful decisions for their business.