Running Weaviate Vector DB in Snowflake using Snowpark Container Services

Date: September 2023

Vector databases play a crucial role in developing modern LLM applications built on the Retrieval-Augmented Generation (RAG) framework.

In this post, I will show you how to run an OSS vector database, Weaviate, inside Snowflake's security perimeter using Snowpark Container Services (currently in Private Preview). Please see this post for more info on Snowpark Container Services before you try this quick setup.

1. Let’s create a database, an image repo, and 3 stages: one for the yaml spec files, one for the Weaviate mounts, and one for the json data we will be vectorizing. We are also creating 2 CPU-based compute pools.
CREATE DATABASE IF NOT EXISTS WEAVIATE_DB;

USE DATABASE WEAVIATE_DB;

CREATE IMAGE REPOSITORY WEAVIATE_DB.PUBLIC.WEAVIATE_REPO;

-- Stage to store the service spec file
CREATE OR REPLACE STAGE YAML_STAGE;
-- Stage to store Weaviate files
CREATE OR REPLACE STAGE DATA ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
-- Stage to store input json data
CREATE OR REPLACE STAGE FILES ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

CREATE COMPUTE POOL IF NOT EXISTS WEAVIATE_CP
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = STANDARD_2
AUTO_RESUME = true;

CREATE COMPUTE POOL IF NOT EXISTS JUPYTER_CP
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = STANDARD_2
AUTO_RESUME = true;
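Before moving on, it can help to confirm that both pools were created and are provisioning (these are standard commands; the exact output columns may vary by release):

```sql
SHOW COMPUTE POOLS;
DESCRIBE COMPUTE POOL WEAVIATE_CP;
```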

2. We will create 2 services: a Weaviate instance, and a Jupyter service to interact with Weaviate using the Weaviate Python client.

Using SnowSQL, we first push the service spec yaml files below for the 2 services to our stage, @yaml_stage. We are using OpenAI’s text2vec embedding model in this simple example, but you could also use an embedding model running in another service in Snowpark Container Services.

spec.yaml:

spec:
  containers:
  - name: "weaviate"
    image: "<YOUR_SNOWFLAKE_ACCT_URL>/weaviate_db/public/weaviate_repo/weaviate"
    env:
      SNOWFLAKE_MOUNTED_STAGE_PATH: "stage"
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: '<YOUR KEYS>'
      AUTHENTICATION_APIKEY_USERS: '<YOUR USERS>'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: text2vec-openai
      CLUSTER_HOSTNAME: 'node1'
      ENABLE_MODULES: 'text2vec-openai,generative-openai'
    volumeMounts:
    - name: stage
      mountPath: /workspace/stage
    - name: data
      mountPath: /var/lib/weaviate
  endpoints:
  - name: "weaviate"
    port: 8080
    public: true
  volumes:
  - name: data
    source: "@data"
  - name: stage
    source: "@files"

spec-j.yaml:

spec:
  containers:
  - name: "jupyter"
    image: "<YOUR_SNOWFLAKE_ACCT_URL>/weaviate_db/public/weaviate_repo/jupyter"
    env:
      SNOWFLAKE_MOUNTED_STAGE_PATH: "stage"
    volumeMounts:
    - name: stage
      mountPath: /workspace/files
  endpoints:
  - name: "jupyter"
    port: 8888
    public: true
  volumes:
  - name: stage
    source: "@files"
    uid: 1000
    gid: 1000
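With both spec files saved locally, the SnowSQL upload might look like the following (the local paths are placeholders; AUTO_COMPRESS = FALSE keeps the yaml readable by the service):

```sql
PUT file:///<LOCAL_PATH>/spec.yaml @yaml_stage AUTO_COMPRESS = FALSE OVERWRITE = TRUE;
PUT file:///<LOCAL_PATH>/spec-j.yaml @yaml_stage AUTO_COMPRESS = FALSE OVERWRITE = TRUE;
```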

Next, also using SnowSQL, we push the sample json file below to the @files stage (our volume mount).

jeopardy.json:

(Source: https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json)

[
  {"Category":"SCIENCE","Question":"This organ removes excess glucose from the blood & stores it as glycogen","Answer":"Liver"},
  {"Category":"ANIMALS","Question":"It's the only living mammal in the order Proboseidea","Answer":"Elephant"},
  {"Category":"ANIMALS","Question":"The gavial looks very much like a crocodile except for this bodily feature","Answer":"the nose or snout"},
  {"Category":"ANIMALS","Question":"Weighing around a ton, the eland is the largest species of this animal in Africa","Answer":"Antelope"},
  {"Category":"ANIMALS","Question":"Heaviest of all poisonous snakes is this North American rattlesnake","Answer":"the diamondback rattler"},
  {"Category":"SCIENCE","Question":"2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification","Answer":"species"},
  {"Category":"SCIENCE","Question":"A metal that is ductile can be pulled into this while cold & under pressure","Answer":"wire"},
  {"Category":"SCIENCE","Question":"In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance","Answer":"DNA"},
  {"Category":"SCIENCE","Question":"Changes in the tropospheric layer of this are what gives us weather","Answer":"the atmosphere"},
  {"Category":"SCIENCE","Question":"In 70-degree air, a plane traveling at about 1,130 feet per second breaks it","Answer":"Sound barrier"}
]
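The upload itself is a single SnowSQL PUT (the local path is a placeholder):

```sql
PUT file:///<LOCAL_PATH>/jeopardy.json @files AUTO_COMPRESS = FALSE OVERWRITE = TRUE;
```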

3. Let’s build our Weaviate container image locally using the Dockerfile below.

Dockerfile:

FROM semitechnologies/weaviate:1.21.2-a843fb4

EXPOSE 8080

In the same directory, we run the following 3 commands to build our image, then tag it and push it to our Snowflake image repo.

docker build --rm --platform linux/amd64 -t weaviate .
docker tag weaviate <YOUR_SNOWFLAKE_ACCT>/weaviate_db/public/weaviate_repo/weaviate
docker push <YOUR_SNOWFLAKE_ACCT>/weaviate_db/public/weaviate_repo/weaviate

4. Now, let’s build our Jupyter container image locally with the Dockerfile below.

Dockerfile:

FROM jupyter/base-notebook:python-3.11

# Install the Python dependencies
RUN pip install requests weaviate-client==3.21.0

# Set the working directory
WORKDIR /workspace/

# Expose Jupyter Notebook port
EXPOSE 8888

# Copy the notebooks directory into the container's /workspace directory
RUN mkdir /workspace/.local /workspace/.cache && chmod 777 -R /workspace
COPY notebooks /workspace/notebooks

# Run Jupyter Notebook on container startup
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

We are also creating a notebook to interact with Weaviate using the Python client. (For the sample notebook I used, please check out the quickstart here.) In the notebook, we connect to Weaviate running in Snowpark Container Services with the code below, authenticating with API keys:

import weaviate
import json

auth_config = weaviate.AuthApiKey(api_key="<YOUR WEAVIATE KEY>")
client = weaviate.Client(
    url="http://weaviate:8080",
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": "<YOUR OPENAI KEY>"  # Replace with your inference API key
    },
)

To create a class in Weaviate, we use the code below:

# Delete the class if it already exists, then recreate it
if client.schema.exists("Question"):
    client.schema.delete_class("Question")

class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" module.
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {},  # Ensure the `generative-openai` module is used for generative queries
    },
}

client.schema.create_class(class_obj)

To create vectors from the json file in the Snowflake stage and add the objects, we use the following code:

# Load data from the stage mounted into the Jupyter container
filename = "../files/jeopardy.json"

with open(filename) as ifile:
    data = json.load(ifile)

# Configure a batch process
with client.batch(batch_size=100) as batch:
    # Batch import all Questions
    for i, d in enumerate(data):
        print(f"importing question: {i+1}")

        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }

        batch.add_data_object(properties, "Question")
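In isolation, the import loop is just a key-renaming step from the raw json records to Weaviate property dicts; a minimal sketch using a hypothetical two-record sample in the same shape as jeopardy.json:

```python
import json

# Hypothetical two-record sample (same keys as jeopardy.json)
raw = ('[{"Category":"SCIENCE","Question":"Q1","Answer":"A1"},'
       '{"Category":"ANIMALS","Question":"Q2","Answer":"A2"}]')

def to_properties(record):
    # Property names are lowercased versions of the JSON keys
    return {
        "answer": record["Answer"],
        "question": record["Question"],
        "category": record["Category"],
    }

objects = [to_properties(d) for d in json.loads(raw)]
print(objects[0])  # {'answer': 'A1', 'question': 'Q1', 'category': 'SCIENCE'}
```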

Now we can try a similarity search:

nearText = {"concepts": ["biology"]}

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text(nearText)
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))
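Under the hood, near-text search embeds the query and ranks stored objects by vector distance, typically cosine similarity. A pure-Python sketch of that ranking idea, with made-up 3-dimensional vectors (real OpenAI embeddings have 1,536 dimensions):

```python
import math

# Toy "embeddings" for three stored answers (values are invented)
vectors = {
    "Liver": [0.9, 0.1, 0.0],
    "Elephant": [0.1, 0.9, 0.1],
    "DNA": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # pretend this encodes the concept "biology"

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Rank stored objects by similarity to the query, most similar first
ranked = sorted(vectors, key=lambda k: cosine(query, vectors[k]), reverse=True)
print(ranked[:2])  # ['Liver', 'DNA']
```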

In the same directory, we run the following 3 commands to build our second image, then tag it and push it to our Snowflake image repo.

docker build --rm --platform linux/amd64 -t jupyter .
docker tag jupyter <YOUR_SNOWFLAKE_ACCT>/weaviate_db/public/weaviate_repo/jupyter
docker push <YOUR_SNOWFLAKE_ACCT>/weaviate_db/public/weaviate_repo/jupyter

5. Now we can create our 2 services from the 2 images we built and pushed to our Snowflake image registry in the previous steps:

CREATE SERVICE IF NOT EXISTS WEAVIATE
MIN_INSTANCES = 1
MAX_INSTANCES = 1
COMPUTE_POOL = WEAVIATE_CP
SPEC = @yaml_stage/spec.yaml;

CREATE SERVICE IF NOT EXISTS jupyter
MIN_INSTANCES = 1
MAX_INSTANCES = 1
COMPUTE_POOL = JUPYTER_CP
SPEC = @yaml_stage/spec-j.yaml;

That’s it! Now our services are up and running in Snowpark Container Services:

Services running in the database

We can display the service logs for both services:

Weaviate service logs
Jupyter service logs
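For example, using the container names from the spec files above:

```sql
CALL SYSTEM$GET_SERVICE_LOGS('WEAVIATE_DB.PUBLIC.WEAVIATE', '0', 'weaviate');
CALL SYSTEM$GET_SERVICE_LOGS('WEAVIATE_DB.PUBLIC.jupyter', '0', 'jupyter');
```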

We can get the URL for the Jupyter service by executing

DESCRIBE SERVICE jupyter;

and the token by executing

CALL SYSTEM$GET_SERVICE_LOGS('WEAVIATE_DB.PUBLIC.jupyter', '0', 'jupyter');
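The token appears in the Jupyter startup logs as part of a URL ending in `?token=…`. A quick way to pull it out in Python (the log line below is a hypothetical excerpt; the exact format is an assumption):

```python
import re

# Hypothetical line from the Jupyter service logs
log = "[I 2023-09-01 12:00:00 ServerApp] http://127.0.0.1:8888/?token=abc123def456"

# Extract the hex token that follows "token=" in the URL
match = re.search(r"token=([0-9a-f]+)", log)
token = match.group(1) if match else None
print(token)  # abc123def456
```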

Here is a quick video for running the entire notebook after we initialized the client (not shown):

In this post, we built a very simple demo that deploys Weaviate in Snowpark Container Services and uses the Weaviate Python client to create vectors and perform similarity search. You can use this as a starter to experiment with RAG-based applications in Snowflake and continue iterating toward a more enterprise-ready setup.


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

NVIDIA | AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Product Architect