Clinical Trial Assistant — a RAG-based approach on Snowflake leveraging Cortex capabilities — Part 2 of 2


Link to part 1 of 2

A Llama scientist, generated with Google Gemini

This article is a sequel to the life sciences blog linked above, in which we fine-tuned a general-purpose Llama2 7B model and used it to answer questions on clinical trial protocols, leveraging Snowpark Container Services (SPCS). We have now implemented the same use case with a RAG pattern, leveraging new Snowflake capabilities called Cortex AI (in private preview at the time of writing), which offers the llama2-70b-chat LLM as a service for completion, among other more specialized tasks such as extraction, translation, and sentiment analysis. Snowflake Cortex also offers a native vector data type, as well as functions to generate vector embeddings and perform similarity search for constructing the RAG pipeline, all of which are in private preview. More details about Cortex here.

This is a two-part series: part 1, linked above, provides the high-level overview, while this part 2 covers the specifics of the implementation.

A deeper dive into the solution architecture

As described in part 1, the aim of this application was to leverage RAG to answer questions from clinical trial protocols, chaining LLM capabilities to respond to both SQL-style and semantic queries. The overall architecture is recapped in the image below.

RAG on clinical trial protocols — overall summary

Let's now take a look at each step described above.

Step 0: Pre-processing and data preparation

Before implementing any RAG-based or fine-tuning-based LLM solution, it is vital to have good-quality data, which makes data engineering and data preparation the most important step. In this example, we downloaded clinical trial protocols from the ClinicalTrials.gov website: https://clinicaltrials.gov/

The raw dataset contained around 430,000 clinical trial records. Each clinical trial record is a complex, deeply nested JSON document with arrays of varying length, and can be up to around 5.5 MB in size. Figure 1 shows the high-level JSON structure of a clinical trial record (without the nested levels).

Since the scope of the clinical trial assistant is the clinical trial protocol, with questions about the conditions in scope of the study, the eligibility criteria, and the outcome measures, the data was cleansed down to the key fields of interest. This simplifies the overall RAG pattern, since the whole record can be retrieved to construct the prompt, and it improves the accuracy of the semantic search by scoping it to those key fields. If other fields need to be in scope, they can certainly be added, with additional prompt engineering required to construct the RAG pattern.

Figure 1: A snapshot of the JSON downloaded from ClinicalTrials.gov
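As an illustration of this preparation step, the sketch below loads the downloaded protocol JSON files into a raw Snowflake table with Snowpark for Python. The connection details, stage name, and table name are placeholders rather than the exact objects used in our implementation.

```python
# Minimal sketch: load the downloaded clinical trial JSON into a raw table.
# Connection details, stage name, and table name are placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "CLINICAL_TRIALS",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

# One VARIANT column holds each raw clinical trial record as downloaded.
session.sql("CREATE OR REPLACE TABLE RAW_CLINICAL_TRIALS (record VARIANT)").collect()

# Load the protocol JSON files from a named stage.
session.sql("""
    COPY INTO RAW_CLINICAL_TRIALS
    FROM @CLINICAL_TRIALS_STAGE
    FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
""").collect()
```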

Step 1: Building the knowledge base

The data engineering required was fairly intricate, as it involved traversal and unnesting of JSON fields, as well as traversal and slicing of arrays. Given Snowflake's ability to query and transform JSON data natively with an intuitive SQL syntax, this could be achieved with a couple of SQL statements. After extracting the necessary records for the knowledge base, a table is built within Snowflake that leverages the newly introduced vector data type, which here represents an array of 768 float values. The embeddings are generated using a Snowflake Cortex function leveraging the e5-base-v2 model, and they are computed over the entire corpus of ~400,000 records to yield better results during similarity search. The overall flow looks like the one in Figure 2, and a sketch of the corresponding transformation follows the figure.

Figure 2: Overall flow of building a vectorized clinical protocol knowledge base
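A sketch of what this step can look like, issued through the same Snowpark session, is shown below. The JSON field paths and table/column names are illustrative, and the embedding function is shown under its generally available name, SNOWFLAKE.CORTEX.EMBED_TEXT_768, which may differ slightly from the private-preview syntax we used.

```python
# Sketch of building the vectorized knowledge base. Field paths and object
# names are illustrative; the embedding function is shown with its GA name.
def build_knowledge_base(session):
    session.sql("""
        CREATE OR REPLACE TABLE TRIAL_KNOWLEDGE_BASE AS
        SELECT
            record:protocolSection:identificationModule:nctId::string            AS study_id,
            record:protocolSection:identificationModule:briefTitle::string       AS title,
            record:protocolSection:conditionsModule:conditions::string           AS conditions,
            record:protocolSection:eligibilityModule:eligibilityCriteria::string AS eligibility,
            record:protocolSection:outcomesModule::string                        AS outcomes,
            -- 768-dimensional embedding over the key fields of interest
            SNOWFLAKE.CORTEX.EMBED_TEXT_768(
                'e5-base-v2',
                title || ' ' || conditions || ' ' || eligibility
            )                                                                     AS embedding
        FROM RAW_CLINICAL_TRIALS
    """).collect()
```

Because EMBED_TEXT_768 returns the new vector data type, the CREATE TABLE ... AS SELECT above produces the embedding column as a 768-dimensional vector without an explicit cast.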

Step 2: Creating the RAG pattern

The premise for the assistant is that it should be able to process a few types of questions:

  1. General knowledge questions not requiring any Retrieval Augmented Generation.
  2. Questions requiring Retrieval Augmented Generation over either the top 1 or the top n records, which can be further restricted by filtering predicates specified by the user.
  3. Single-record lookups based on the study ID, which are more like SQL queries.

Figure 3 provides a view of the overall flow of this process. As can be seen, it takes multiple steps to achieve this.

Figure 3: Creating the prompt based on user interaction with chatbot

In step 6 of the above diagram, the user question is classified by running inference with the Llama2-70b model, which determines all of the above and generates a JSON record representing the problem classification as its output. This is done using a Snowflake Cortex AI function invoking the Llama2 model for inference, with few-shot examples in the prompt to help the model come up with the right values for some of these fields. Overall, the classification yields fairly good results. If the user question is classified as a non-RAG pattern, we simply skip to step 11: no retrieval is needed, and with minimal prompting/directives we pass the user's question as part of the prompt and perform the inference. In step 7, if the pattern is found to need RAG, we use the metadata generated previously as input to dynamically generate the required SQL using Snowpark for Python. The generated SQL can be either a simple statement with a lookup predicate, or a semantic search that is further scoped by the additional filtering predicates identified during problem classification. This approach helps with the consistency and accuracy of the responses. The SQL then runs on the table created previously and the contents are retrieved. Sketches of the classification and retrieval calls are shown below.
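The sketch below illustrates these two steps. The few-shot examples, JSON fields, and helper names are illustrative rather than the exact prompt and metadata used in our application, and the Cortex functions are shown under their generally available names.

```python
# Sketch of steps 6 and 7: classify the question with a few-shot prompt via
# SNOWFLAKE.CORTEX.COMPLETE, then retrieve context with a vector similarity
# search. Prompt wording, JSON fields, and helper names are illustrative.
import json

CLASSIFY_PROMPT = """You are a router for a clinical trial assistant.
Classify the user question and answer ONLY with a JSON object containing:
  "needs_rag": true or false,
  "study_id": the NCT study id if the user asked about one specific study, else null,
  "top_n": number of studies to retrieve,
  "filters": a SQL predicate implied by the question, else null.

Question: What is a randomized controlled trial?
Answer: {"needs_rag": false, "study_id": null, "top_n": 0, "filters": null}

Question: Summarize the eligibility criteria of study NCT01234567.
Answer: {"needs_rag": true, "study_id": "NCT01234567", "top_n": 1, "filters": null}

Question: %s
Answer:"""


def classify_question(session, question: str) -> dict:
    """Step 6: let the LLM emit a JSON problem classification."""
    row = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama2-70b-chat', ?) AS response",
        params=[CLASSIFY_PROMPT % question],
    ).collect()[0]
    return json.loads(row["RESPONSE"])


def retrieve_context(session, question: str, top_n: int = 1, predicate: str = "TRUE"):
    """Step 7: vector similarity search, optionally scoped by extra predicates."""
    return session.sql(f"""
        SELECT study_id, title, conditions, eligibility, outcomes
        FROM TRIAL_KNOWLEDGE_BASE
        WHERE {predicate}
        ORDER BY VECTOR_COSINE_SIMILARITY(
            embedding,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)
        ) DESC
        LIMIT {top_n}
    """, params=[question]).collect()
```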

Step 3: Prompt Building and Inference

Figure 4: Building the prompt and adding context for Summarization

The above diagram covers the subsequent steps, from obtaining the context via the vector search to rendering the response back to the user. In step 10, we perform some post-retrieval compaction to make sure we do not exceed the context window size limitation for inference (4,096 tokens for the Llama2-70b model). In step 11, we build the prompt dynamically around the following themes:

  1. Providing directives, such as:
  • Setting the role of the LLM as a Clinical Trial Life Sciences Assistant.
  • Defining inference rules and setting some domain knowledge (for example, the definition of a protocol design).
  • Response formatting.

  2. Adding the contents: the RAG contents, which are the results of the vector search and/or the SQL query, are appended to this directive.

  3. Appending the user question: the user question is added to ensure that summarization responses are relevant to the way the question is phrased ("what is" questions are different from "can you" kinds of questions).

  4. Providing additional context: optionally, some additional context can be added, such as the chat history.

Once the prompt is constructed, we perform the inference and return the results to the user via the front-end Streamlit application. In step 15, the user can continue the dialog with the LLM to get further details. A sketch of the prompt assembly and inference call is shown below.
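The sketch below illustrates steps 10 through 12 under simple assumptions: a characters-per-token heuristic stands in for the post-retrieval compaction, and the directive wording is illustrative rather than the exact prompt used in our application.

```python
# Sketch of steps 10-12: compact the retrieved context to respect the
# ~4,096-token window, assemble the prompt themes described above, and run
# the completion. The truncation heuristic and directive text are assumptions.
MAX_CONTEXT_CHARS = 3000 * 4  # rough budget, leaving headroom for directives and the answer

DIRECTIVE = (
    "You are a Clinical Trial Life Sciences Assistant. "
    "A protocol design describes how a clinical study is conducted. "
    "Answer only from the protocol context provided; if the context does not "
    "contain the answer, say so. Respond in concise bullet points."
)


def build_prompt(question: str, retrieved_records: list, chat_history: str = "") -> str:
    # Step 10: post-retrieval compaction by simple character truncation.
    context = "\n\n".join(str(r) for r in retrieved_records)[:MAX_CONTEXT_CHARS]
    # Step 11: directives + RAG contents + user question + optional chat history.
    return (
        f"{DIRECTIVE}\n\n"
        f"Context:\n{context}\n\n"
        f"Chat history:\n{chat_history}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_question(session, question: str, retrieved_records: list) -> str:
    # Step 12: inference with the Cortex completion function.
    row = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama2-70b-chat', ?) AS response",
        params=[build_prompt(question, retrieved_records)],
    ).collect()[0]
    return row["RESPONSE"]
```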

The Outcome

In summary, the outcomes of our approach, as detailed in part 1, demonstrate significant enhancements over the original fine-tuned model. While the results for inquiry questions remained comparable, the added advantages of providing attribution, avoiding hallucinations, and responding consistently give our solution a solid foundation in practice. By addressing both SQL and semantic question types, our solution streamlines the process, making it simpler to query study details by ID.

In our demonstration, we limited the semantic search to the top 1 result due to cost considerations, but in practical applications, providing a summarized view of the top 5 responses would be reasonable. It's also important to recognize that fine-tuning remains particularly relevant for complex scenarios like those in drug discovery, where starting with a domain-specific model and fine-tuning it on internal datasets is essential. Combining fine-tuning with RAG further enhances model performance.

The differentiators of this solution are already well delineated in part 1 of this blog.

Call to action

Snowflake offers a comprehensive suite of tools to rapidly construct large language model (LLM) applications, leveraging native security and services. Whether you bring your own LLM models and fine-tune them with Snowpark Container Services (SPCS) or use the Snowflake Cortex AI LLM models via SQL functions, the key is to choose the approach that best aligns with your use case.

Should you have any further inquiries about our methodology, do not hesitate to reach out to us for clarification.
