Building Healthcare Applications in Snowflake Leveraging Hybrid Search with Weaviate

Date: November 2023

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer. Special thanks to Jonathan Tuite from Weaviate for his contributions.


Recently, semantic search has been a prominent topic in numerous discussions about Large Language Models (LLMs). However, I have encountered several customer use cases that could significantly benefit from hybrid search strategies. Hybrid search combines semantic search (based on meaning) with traditional search (based on keywords).

When it comes to semantic search, you have a couple of options with Snowflake. Depending on the requirements of the use case, and ordered from lowest to highest administration effort, the options are:

  1. Using Snowflake's brand-new native Cortex functions
  2. Using a vector database, like Weaviate, running in Snowpark Container Services (in Private Preview as of November 2023)
  3. Using vector search libraries (FAISS, txtai, Annoy) with Snowpark and/or Snowpark Container Services

In this blog post, we will go over the steps for implementing hybrid search to enrich context with Provider data in Healthcare LLM applications. Going with the second option above, we will use Weaviate (a leader in open-source vector databases) running in Snowpark Container Services, since both keyword search and semantic search are necessary to build a better user experience for this use case. Weaviate supports hybrid search out of the box, without relying on additional services, and its hybrid search is tunable, so you can have it lean more heavily towards the semantic or the keyword side. Here is a related blog post:
https://weaviate.io/blog/hybrid-search-explained
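
Conceptually, the tunable blend can be pictured as an alpha-weighted combination of the normalized keyword (BM25) score and the vector similarity score. Here is a minimal Python sketch of that idea (a simplification for intuition; Weaviate's actual fusion algorithms are described in the linked post):

def hybrid_score(keyword_score: float, vector_score: float, alpha: float = 0.5) -> float:
    # alpha = 0 -> pure keyword search, alpha = 1 -> pure semantic search
    # both input scores are assumed to be normalized to [0, 1]
    return (1 - alpha) * keyword_score + alpha * vector_score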

For the Healthcare Provider data, we are using the Providers dataset from Tuva Health in the Snowflake Marketplace, which you can get with one click in your Snowflake account: https://app.snowflake.com/marketplace/listing/GZT0ZS2I9BU/tuva-health-providers?search=provider

Here is a simple query to transform the tabular data into a text format that can be loaded into Weaviate:

CREATE OR REPLACE TABLE providers AS
WITH prov AS (
    SELECT OBJECT_CONSTRUCT(*) AS oc
    FROM PROVIDERS.CLAIMS_DATA_MODEL.PROVIDER
)
SELECT 'The ' || oc:ENTITY_TYPE_DESCRIPTION::varchar
    || ' with the NPI of ' || oc:NPI::varchar
    || ' is part of the location ' || oc:PRACTICE_CITY::varchar
    || ' with a specialty of ' || oc:PRIMARY_SPECIALTY_DESCRIPTION::varchar
    || ' is ' || oc:PROVIDER_NAME::varchar AS text,
    oc:PRIMARY_SPECIALTY_DESCRIPTION::varchar AS specialty,
    oc:NPI::varchar AS npi
FROM prov;

As a next step, we will insert this data into the Weaviate vector database running within the Snowflake Data Cloud.

Weaviate can run as an embedded database in a self-hosted setup or within a container, so you have some choices to make here as well. For this demo, we will run Weaviate as a long-running containerized service in Snowpark Container Services. Please see my previous blog post for detailed instructions on how to deploy Weaviate in Snowpark Container Services: https://medium.com/snowflake/running-weaviate-vector-db-in-snowflake-using-snowpark-container-services-490b1c391795

We will create three services in Snowpark Container Services in the same database, leveraging service-to-service communication:

1. Weaviate service

Dockerfile (to be pushed to an internal image registry):

FROM semitechnologies/weaviate:1.22.4

EXPOSE 8080

spec.yaml (to be pushed to an internal stage called @yaml_stage):

spec:
  containers:
  - name: "weaviate"
    image: "<acct>.registry.snowflakecomputing.com/weaviate_db/public/weaviate_repo/weaviate"
    env:
      SNOWFLAKE_MOUNTED_STAGE_PATH: "stage"
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_APIKEY_ENABLED: 'false'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      CLUSTER_HOSTNAME: 'node1'
      ENABLE_MODULES: 'text2vec-transformers'
      TRANSFORMERS_INFERENCE_API: 'http://text2vec:8080'
    volumeMounts:
    - name: stage
      mountPath: /workspace/stage
    - name: data
      mountPath: /var/lib/weaviate
  endpoints:
  - name: "weaviate"
    port: 8080
    public: true
  volumes:
  - name: data
    source: "@data"
  - name: stage
    source: "@files"
    uid: 1000 # write permissions
    gid: 1000

SQL to create the compute pool and the service:

CREATE COMPUTE POOL IF NOT EXISTS WEAVIATE_CP
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = STANDARD_2
AUTO_RESUME = true;

CREATE SERVICE IF NOT EXISTS WEAVIATE
MIN_INSTANCES = 1
MAX_INSTANCES = 1
COMPUTE_POOL = WEAVIATE_CP
SPEC = @yaml_stage/spec.yaml;

2. Embedding model service

Dockerfile (to be pushed to an internal image registry):

FROM semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1

EXPOSE 8080

spec-c.yaml (to be pushed to an internal stage called @yaml_stage):

spec:
  containers:
  - name: "text2vec"
    image: "<acct>.registry.snowflakecomputing.com/weaviate_db/public/weaviate_repo/text2vec"
    env:
      SNOWFLAKE_MOUNTED_STAGE_PATH: "stage"
      ENABLE_CUDA: 1
      NVIDIA_VISIBLE_DEVICES: all
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: stage
      mountPath: /workspace/stage
  endpoints:
  - name: "text2vec"
    port: 8080
    public: true
  volumes:
  - name: data
    source: "@data"
  - name: stage
    source: "@files"
    uid: 1000 # write permissions
    gid: 1000

SQL to create the GPU compute pool and the service:

CREATE COMPUTE POOL IF NOT EXISTS VEC_CP
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = GPU_3
AUTO_RESUME = true;

CREATE SERVICE IF NOT EXISTS text2vec
MIN_INSTANCES = 1
MAX_INSTANCES = 1
COMPUTE_POOL = VEC_CP
SPEC = @yaml_stage/spec-c.yaml;

We are using the text2vec inference container here, which requires GPU-based compute. Note that it is a best practice to explicitly define GPU requests and limits under the resources section of the service specification YAML, as shown above, whenever you use GPU-based compute pools in Snowpark Container Services.

3. Jupyter service

Dockerfile (to be pushed to an internal image registry):

FROM jupyter/base-notebook:python-3.11

# Install the dependencies from the requirements.txt file
RUN pip install requests weaviate-client==3.21.0 snowflake-snowpark-python[pandas]

# Set the working directory
WORKDIR /workspace/

# Expose Jupyter Notebook port
EXPOSE 8888

# Create writable local directories and copy the notebooks into /workspace/notebooks
RUN mkdir /workspace/.local /workspace/.cache && chmod 777 -R /workspace
COPY notebooks /workspace/notebooks

# Run Jupyter Notebook on container startup
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]

spec-j.yaml (to be pushed to an internal stage called @yaml_stage):

spec:
  containers:
  - name: "jupyter"
    image: "<acct>.registry.snowflakecomputing.com/weaviate_db/public/weaviate_repo/jupyter"
    env:
      SNOWFLAKE_MOUNTED_STAGE_PATH: "stage"
    volumeMounts:
    - name: stage
      mountPath: /workspace/files
  endpoints:
  - name: "jupyter"
    port: 8888
    public: true
  volumes:
  - name: stage
    source: "@files"
    uid: 1000 # write permissions
    gid: 1000

SQL to create the compute pool and the service:

CREATE COMPUTE POOL IF NOT EXISTS jupyter_cp
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = STANDARD_2
AUTO_RESUME = true;

CREATE SERVICE IF NOT EXISTS jupyter
MIN_INSTANCES = 1
MAX_INSTANCES = 1
COMPUTE_POOL = jupyter_cp
SPEC = @yaml_stage/spec-j.yaml;
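
Inside the Jupyter container, we can create a Snowpark session using the OAuth token that Snowpark Container Services injects into every running container. Here is a minimal sketch (the database and schema names are assumptions based on the image path used above):

import os
from snowflake.snowpark import Session

def get_session() -> Session:
    # Snowpark Container Services injects an OAuth token into each container
    # and sets the SNOWFLAKE_* environment variables used below
    with open("/snowflake/session/token") as f:
        token = f.read()
    return Session.builder.configs({
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
        "host": os.environ["SNOWFLAKE_HOST"],
        "token": token,
        "authenticator": "oauth",
        "database": os.environ.get("SNOWFLAKE_DATABASE", "WEAVIATE_DB"),  # assumed name
        "schema": os.environ.get("SNOWFLAKE_SCHEMA", "PUBLIC"),          # assumed name
    }).create()

session = get_session()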

We can grant USAGE on each of these individual services to the appropriate roles using Snowflake's native RBAC capabilities. With this implementation, none of the enterprise data leaves Snowflake's security boundary. As a result, many healthcare organizations are now in a position to explore Large Language Model (LLM) use cases within a framework that upholds enterprise-level security and compliance standards.

After confirming that all three services are up, we can demonstrate both semantic and hybrid search functionality on the provider data using the Weaviate Python client.
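
One way to confirm the services are up is to poll their status; here is a quick sketch using the session from above and the SYSTEM$GET_SERVICE_STATUS function:

# "READY" in the returned JSON means the container is up and serving
for svc in ["WEAVIATE", "TEXT2VEC", "JUPYTER"]:
    status = session.sql(f"SELECT SYSTEM$GET_SERVICE_STATUS('{svc}')").collect()[0][0]
    print(svc, status)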

Here is the sample notebook: https://github.com/edemiraydin/snowflake_provider_weaviate/blob/main/Provider.ipynb

Our source table contains the generated TEXT column along with the SPECIALTY and NPI columns.

We import the data as objects into a Weaviate PROVIDER class:
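
The notebook screenshots are not reproduced here, but the import boils down to the following sketch with the v3 Weaviate Python client (the class and property names are assumptions chosen to mirror the providers table; the service URL follows the same DNS convention as TRANSFORMERS_INFERENCE_API in the spec above):

import weaviate

# Connect to the Weaviate service over service-to-service networking
client = weaviate.Client("http://weaviate:8080")

# Define the class; text2vec-transformers vectorizes the "text" property
provider_class = {
    "class": "Provider",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "specialty", "dataType": ["text"]},
        {"name": "npi", "dataType": ["text"]},
    ],
}
client.schema.create_class(provider_class)

# Read the providers table through Snowpark and batch-import the objects
df = session.table("PROVIDERS").to_pandas()
client.batch.configure(batch_size=100)
with client.batch as batch:
    for row in df.itertuples():
        batch.add_data_object(
            data_object={"text": row.TEXT, "specialty": row.SPECIALTY, "npi": row.NPI},
            class_name="Provider",
        )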

Here is an example of semantic search:
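
A minimal sketch with the v3 client (the query phrase is illustrative):

# near_text vectorizes the query via the text2vec service and
# ranks providers purely by semantic similarity
response = (
    client.query
    .get("Provider", ["text", "specialty", "npi"])
    .with_near_text({"concepts": ["doctors who treat heart conditions"]})
    .with_limit(3)
    .do()
)
for hit in response["data"]["Get"]["Provider"]:
    print(hit["text"])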

Here is an example of hybrid search:
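
A corresponding hybrid query sketch, where alpha tunes the keyword/semantic balance discussed earlier (the query string is illustrative):

# Hybrid search blends BM25 keyword scoring with vector similarity;
# alpha=0 is pure keyword, alpha=1 is pure vector, 0.5 weighs both equally
response = (
    client.query
    .get("Provider", ["text", "specialty", "npi"])
    .with_hybrid(query="cardiology Chicago", alpha=0.5)
    .with_limit(3)
    .do()
)
for hit in response["data"]["Get"]["Provider"]:
    print(hit["text"])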

In this blog post, we demonstrated how to take Healthcare Provider data from a Snowflake Marketplace dataset and use the Weaviate vector database for both semantic and hybrid search, all within your Snowflake account. Snowpark Container Services is the key feature that enables all of this.


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

NVIDIA | AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Product Architect