Leveraging LLaMA 2 for Querying/Summarizing Clinical Trial Protocols on Snowflake

with Jenllieu

The Industry Trend: What we hear from Life Sciences organizations about Gen AI & LLMs

In our interactions with customers, we have typically heard the set of use cases described in Figure 1 with respect to the applicability of Generative AI in Life Sciences. Most of them are at an exploratory stage, with companies looking to extract maximum value from existing foundational models.

Figure 1: A snapshot of the life sciences trends

As can be seen, these fall mostly in the realm of summarization or tagging, while R&D has some additional differentiated use cases around drug discovery models such as structure prediction (an example of this pattern is the BioNeMo blog mentioned earlier). From these we picked clinical trial protocol summarization as an example to demonstrate the art of the possible, for two reasons:

  1. There is a need for a model that understands biomedical verbiage, which meant fine-tuning was necessary.
  2. There was public data with which we could fine-tune, making the effort more effective.

Both of these factors set up a good target use case for seeing the effectiveness of fine-tuning LLaMA 2.

Note: Fine-tuning is a compute-intensive process and requires sophisticated data scientists. It is not necessary to fine-tune a model for every use case, and you may be able to take advantage of existing foundational models and recent Snowflake product announcements for needs like summarization and sentiment analysis. However, fine-tuning was necessary in this scenario, as we needed to create a domain-specific model that understood clinical terms. The other option would be to use private LLMs that have already created these domain-specific models from a foundational model.

Our Objective in Fine-Tuning LLaMA 2 for Life Sciences

Recently at the Summit, Snowflake announced support for Snowpark Container Services, which allows you to run your containerized apps on Snowflake and is in Private Preview at the time of writing this blog. This opens the possibility for Snowflake to support an entirely new set of business use cases: from those that require hosting open-source LLMs, to more traditional R&D ones such as running R, to compute-intensive training and inference that needs a GPU, like image processing and classification. While many solutions are now possible in Snowflake, in this blog we will focus on a Generative AI/open-source LLM example in the life sciences R&D domain: querying/summarizing clinical protocols.

Figure 2: Gen AI in life sciences: fine-tuning LLaMA 2 within Snowpark Container Services

We leveraged the seven-billion-parameter LLaMA 2 model from Hugging Face (Llama-2-7b-chat-hf), a non-domain-specific model that does not understand the context of specialized biomedical terms like "eligibility criteria." Our aim here was to provide an example of how a customer could fine-tune their own model on their corpus of information (in this case, clinical protocols) to help with content authoring and gathering insights in a specific domain that requires an understanding of this specialized context. The conceptual diagram is represented in Figure 2.
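To make this concrete, here is a minimal sketch of probing the base model before any fine-tuning. The checkpoint name is the public Hugging Face ID; the prompt is our own illustrative example, and the un-tuned model's answers to such domain questions tend to be generic.

```python
# Probe the base (non-fine-tuned) chat model with a question that relies
# on clinical-protocol vocabulary. Prompt text is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "[INST] What are the eligibility criteria in a clinical trial protocol? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```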

We also compared LLaMA 2 with BioGPT and found that LLaMA 2 with some fine-tuning provided better responses for our purpose, even though BioGPT was pre-trained on biomedical abstracts from PubMed.

This blog continues the theme of how Snowflake's container services can be leveraged in a scientific domain, using the example of clinical trial protocol summarization hosted in Snowflake's own environment for fine-tuning and inference. The previous post, on leveraging an external LLM for the protein folding problem, is here.

The business problem: Clinical trials and the relevance of LLMs

Trial protocols are documents that describe the objectives, design, methodology, statistical considerations and aspects related to the organization of clinical trials. Trial protocols provide the background and rationale for conducting a study, highlighting specific research questions that are addressed, and taking into consideration ethical issues. Trial protocols must meet a standard that adheres to the principles of Good Clinical Practice, and are used to obtain ethics approval by local Ethics Committees or Institutional Review Boards.

Cipriani A, Barbui C. What is a clinical trial protocol? Epidemiol Psichiatr Soc. 2010 Apr-Jun;19(2):116–7. PMID: 20815294

Clinical trials are expensive and time-consuming. In most cases, trial protocol authoring and study design involve preparing a submission document that delineates the study objectives, including the eligibility criteria for patients and the outcomes that will be measured. Large pharma companies have performed many trials over the years and have amassed dossiers of institutional knowledge, and hence hold a rich corpus of these documents.

Helping users query protocols and answer questions typically ends up being a good start toward productivity gains in the long process of authoring and submission, even if the documents themselves are not ready for regulatory submission. For example, a typical KPI that a clinical operations business sponsor may look to achieve by leveraging an LLM for protocol authoring might be:

  • % reduction in time spent creating the first submission-ready document, or
  • % reduction in time spent finding relevant past documents for submission

Do note, however, that our aim in this case was not 100% precision to create a regulatory-ready document, but only to demonstrate the art of the possible with Snowflake.

The Snowflake context: Fine-tuning in Snowflake with Container Services

With the above constructs in mind, we can now look at how it all comes together in Snowflake.

Figure 3: Fine-tuning in Snowpark Container Services

The architecture in Figure 3 delineates the steps involved in end-to-end model fine-tuning and inference; the subsequent sections describe each step in brief.

Steps 1 and 2: Data loading and preprocessing

To begin, we downloaded the data from ClinicalTrials.gov and loaded it into a VARIANT column in raw format, as highlighted in Step 1 above. The final data looks like the snapshot in Figure 4 below. A total of 438,125 unique protocols were available for the final fine-tuning.

Figure 4: Snapshot of clinical trials data
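As an illustration of Step 1, a loading sketch using Snowpark Python might look like the following. The connection parameters, stage, and table names are placeholders, not the names used in the actual project.

```python
# Sketch of Step 1, assuming the ClinicalTrials.gov JSON files have already
# been uploaded to a Snowflake stage. All object names are illustrative.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# One row per protocol document, stored raw in a VARIANT column.
session.sql("CREATE OR REPLACE TABLE raw_protocols (protocol VARIANT)").collect()
session.sql("""
    COPY INTO raw_protocols
    FROM @protocol_stage/clinical_trials/
    FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE)
""").collect()
```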

From this data set we wanted to answer four unique questions, two in an enquiry mode and one as a summarization pattern, as encapsulated below.

Table 1: Prompt questions used for Model fine tuning

As mentioned in Step 2 of the architecture, we fed the values associated with the four prompts into a Snowflake table so they could be leveraged for model fine-tuning. The total number of data points considered for fine-tuning was 1.5 million records, covering the fields associated with these prompts.
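A hedged sketch of this preprocessing step is below. The JSON paths follow the public ClinicalTrials.gov record structure and the prompt wording is illustrative; the actual prompt templates are the ones in Table 1. It assumes the `session` and `raw_protocols` table from the previous sketch.

```python
# Sketch of Step 2: derive prompt/completion pairs from the raw VARIANT
# documents. Paths and prompt text are assumptions, not the exact fields used.
pairs = session.sql("""
    SELECT
        'What are the eligibility criteria for trial ' ||
            protocol:protocolSection:identificationModule:nctId::STRING || '?' AS prompt,
        protocol:protocolSection:eligibilityModule:eligibilityCriteria::STRING AS completion
    FROM raw_protocols
""")
pairs.write.save_as_table("fine_tune_prompts", mode="overwrite")
```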

Steps 3 & 4: Model fine-tuning and inference

We used the LLaMA 2 7B-parameter model for this effort and fine-tuned it on a subset of the data in order to gauge accuracy. As called out earlier, we did not fine-tune on the entire data set, as the goal was to understand the art of the possible; training on more data points would likely have yielded better results. Snowflake's Snowpark Container Services hosted the model, and fine-tuning was performed on an NVIDIA A10G GPU. The total fine-tuning time was roughly 50 hours.
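For readers who want a starting point, here is a minimal LoRA-style fine-tuning sketch with Hugging Face transformers and peft. The hyperparameters, prompt format, and inline records are illustrative assumptions, not the project's actual training configuration; LoRA adapters are one common way to make a 7B model trainable on a single A10G-class GPU.

```python
# Minimal LoRA fine-tuning sketch for Llama-2-7b-chat-hf. All settings
# here are illustrative, not the configuration used in this project.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

# LoRA freezes the base weights and trains small adapter matrices, which is
# what makes fine-tuning a 7B model feasible on a single GPU.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# In practice these rows would come from the fine_tune_prompts table in
# Snowflake; two inline examples keep the sketch self-contained.
records = [
    {"prompt": "What are the eligibility criteria for trial NCT00000102?",
     "completion": "Inclusion: adults aged 18 to 65 ..."},
    {"prompt": "Summarize trial NCT00000102.",
     "completion": "A study evaluating ..."},
]

def tokenize(row):
    text = row["prompt"] + "\n" + row["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = Dataset.from_list(records).map(
    tokenize, remove_columns=["prompt", "completion"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-protocols",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```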

Step 5: Deploying a dashboard for model interaction

The final step was to deploy a Streamlit app to interact with the model. Streamlit is a simple Python-based way to create apps and interact with machine learning outcomes, and it is also hosted in container services within Snowflake. This allows us to take advantage of all the security and governance offered by Snowflake, since the data never leaves the ecosystem, while providing end users with an experience for interacting with the model.
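A minimal sketch of such an app is below. The endpoint URL is a hypothetical placeholder for the model's inference service inside Snowpark Container Services; the real wiring and authentication will differ.

```python
# streamlit_app.py: minimal sketch of the protocol Q&A front end.
import requests
import streamlit as st

ENDPOINT = "http://llm-service:8080/generate"  # hypothetical internal URL

st.title("Clinical Trial Protocol Q&A")
question = st.text_area("Ask a question about a protocol")

if st.button("Submit") and question:
    resp = requests.post(ENDPOINT, json={"prompt": question}, timeout=120)
    resp.raise_for_status()
    st.write(resp.json().get("completion", "No response"))
```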

The Outcome: What was our experience and feedback?

  1. In general, the model performed well on enquiry questions with respect to eligibility and outcome measures. Summarization also did reasonably well; however, the site-specific questions had an issue with repeatability.
  2. We initially fine-tuned the model with 50K records and gradually increased that to 200K. Using additional training data definitely improved model output; outputs for summarization, especially, were often more detailed than in the prior iteration.
  3. Our assumption is that the outputs for summarization in particular were more reasonable than those for other questions, likely because the base model is already competent at summarization. Querying for outcome measures also produced reasonable results in most cases.
  4. One remaining issue was repetition of information. Technically this appears to be due to sensitivity to max_new_tokens, a generation parameter that can be tuned at inference time. This is expected LLM behavior, as too high a value can result in repeated information, while too low a value truncates the output (see the sketch below).
Table 2 below provides a detailed summary of model performance for the different questions.
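As referenced in point 4, a generation sketch with these knobs is shown below. The parameter values are examples rather than our tuned settings; repetition_penalty and no_repeat_ngram_size are standard transformers generation options for curbing repetition alongside max_new_tokens.

```python
# Illustrative generation settings for the repetition issue in point 4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # or the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("[INST] Summarize this trial protocol: ... [/INST]",
                   return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,      # too low truncates; too high invites rambling
    repetition_penalty=1.2,  # penalizes verbatim repeats
    no_repeat_ngram_size=4,  # blocks short repeated loops
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```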

Figure 5 provides snapshots of the output for our different enquiries. Additional training (more training examples and/or more training epochs) will likely continue to improve the reasonableness of the responses, as additional training has generally improved responses to prompts in our experience so far.

Figure 5a: Snapshot of model responses to questions
Figure 5b: Snapshot of model responses to questions
Figure 5c: Snapshot of model responses to questions

Comparison to BioGPT

When we tried this with BioGPT, which has fewer parameters, it performed well with respect to summarization, given that its base model was trained on medical abstracts.
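For reference, BioGPT loads directly from Hugging Face via its dedicated transformers classes; the prompt here is an illustrative continuation-style input of the kind the model handles well.

```python
# BioGPT, for comparison: a smaller model pre-trained on PubMed abstracts.
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("The primary outcome measure of the trial was",
                   return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```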

Next steps: Why Snowflake and what next?

Support for container services is the next step in Snowflake's evolution, as we now allow you to leverage Snowflake's data, analytics, and security to build your data apps and LLM interactions without having to leave the system.

All of the data and analysis results reside within Snowflake, and the container service's interaction with Snowflake data is single-tenant. This ensures your downstream analysis is guarded by the same security and governance as any other workload within Snowflake, including building simple Streamlit apps and sharing them, as demonstrated above.

The next version of the model would reduce fine-tuning and leverage a RAG approach: creating embeddings of the prompts used in training and storing them in a vector database. Hosting the embeddings in a vector database that can serve context to the model should be a more reasonable approach than fine-tuning for most cases, given the complexity involved in fine-tuning.
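A minimal sketch of that RAG pattern is below, using sentence-transformers embeddings and brute-force cosine similarity in place of a managed vector database; the model choice and passages are illustrative.

```python
# RAG sketch: embed protocol passages, retrieve the nearest one for a
# query, and use it as context for the LLM. Contents are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Eligibility criteria: adults aged 18 to 65 with type 2 diabetes ...",
    "Primary outcome measure: change in HbA1c at 24 weeks ...",
]
passage_vecs = embedder.encode(passages, normalize_embeddings=True)

query = "Who is eligible for this trial?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]

# On unit-normalized vectors, cosine similarity is just a dot product.
scores = passage_vecs @ query_vec
context = passages[int(np.argmax(scores))]

# The retrieved passage becomes context for the (base or lightly tuned) LLM.
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```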

For more details on Snowflake and LLMs, do watch the announcements, including those at Snowday events.

Additional References

For more details about fine-tuning a model and hosting an open-source vector database, please refer to Eda Johnson's articles here and here.
