From Prototype to Production: Advancing Research Collaborations with RAG

Apoorva Srinivasan
TatonettiLab
Mar 28, 2024

We recently deployed a retrieval-augmented generation (RAG) application called Collabot at Cedars-Sinai, designed to help researchers find collaborators through their grants and abstracts data. RAG is easy to prototype, but getting it to production is a whole other story. You can bring 80% of your app to completion in a day; it's the final 20% that requires weeks of dedicated experimentation. This post outlines the strategies that were essential in refining the end product: data cleaning, strategic chunking, query transformation, retrieval optimization, LLM observability, and deployment.

Overview of the Collabot Architecture. The RAG pipeline transforms user queries to identify and summarize potential research collaborators from a database of grants and abstracts.

Data Preparation

A key element in the success of RAG, or any machine learning model for that matter, is the quality of the data. We focused on data cleaning and preprocessing to ensure the system performed well. Our methods included:

  • Entity Extraction: A key functionality of the app is to facilitate the search for potential collaborators. This requires accurately identifying and extracting the names of principal investigators from the Cedars-Sinai Cancer Center within the grant and abstract data. We automatically extracted these names, which are typically nestled within the citation column among other researchers, and indexed them for efficient retrieval.
  • Text Normalization: To further refine our data, we implemented a series of text normalization processes for the extracted names of the principal investigators. This included stripping any punctuation surrounding the names, standardizing the naming convention, and rectifying inconsistencies in name representations, ensuring uniformity across the dataset for more accurate retrieval (a minimal sketch follows below).
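
To give a flavor of this step, here is a minimal sketch of the kind of name cleanup we applied. The helper name and the "First Last" convention are illustrative, not the exact production code:

```python
import re

def normalize_pi_name(raw_name: str) -> str:
    """Illustrative cleanup for a PI name pulled from a citation string:
    strip surrounding punctuation, collapse whitespace, and standardize
    to a title-cased "First Last" convention."""
    name = raw_name.strip().strip(".,;:()[]\"'")   # strip surrounding punctuation
    name = re.sub(r"\s+", " ", name)               # collapse repeated whitespace
    if "," in name:                                # "Last, First" -> "First Last"
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name.title()                            # standardize capitalization

# Both variants map to the same canonical form
assert normalize_pi_name("SMITH, JANE") == normalize_pi_name("jane smith ") == "Jane Smith"
```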

Chunking

Chunking the context data is a core part of a RAG system. It is a technique for generating logically coherent snippets of information, usually by breaking long documents into smaller sections that fit the model's context window. Chunk size is a parameter to consider; it depends on the embedding model you use and its capacity in tokens. For example, standard transformer encoder models like BERT-based Sentence Transformers take at most 512 tokens, whereas OpenAI's ada-002 can handle longer sequences of up to 8191 tokens. The key trade-off is giving the LLM enough context to reason over versus keeping each text embedding specific enough to search efficiently. There's no universal rule of thumb, so we created our own evals for our data (see the Evaluation section below) and experimented with different chunk sizes. We found that 1000–2000 tokens is the range with the best performance; outside it, performance degrades. Fewer than 1000 tokens provides too little context, and more than 2000 tokens is too much to search over efficiently. Hence, we picked 1500 as our chunk size.

Relationship between Chunk Size and F1 Score for Text Data Retrieval
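
A minimal sketch of the chunking step, assuming LangChain's token-aware text splitter; the splitter choice and the overlap value are illustrative, and only the 1500-token chunk size reflects our experiments:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-based splitting so chunk_size is measured in tokens, not characters.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer family used by OpenAI embedding models
    chunk_size=1500,              # our best-performing chunk size (in tokens)
    chunk_overlap=150,            # small overlap so ideas aren't cut mid-thought (illustrative)
)

grant_text = "Long grant or abstract text goes here ..."  # placeholder document
chunks = splitter.split_text(grant_text)                  # list of ~1500-token strings
```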

Query Transformation

Since the query used to retrieve relevant documents is also embedded into the vector space, its phrasing can significantly impact search results. Transforming the user query so that it aligns with our prompt was therefore crucial for obtaining high-quality responses. To achieve this, we implemented a two-step query transformation process. First, we corrected typos using autocorrect to ensure accuracy in the query text. Then, we employed an LLM to rephrase the user query, enhancing its clarity and effectiveness in retrieving the desired documents. For example, if a user types a query with typos like "who si workign on bladder cancer?", the autocorrect step first resolves these errors to produce "who is working on bladder cancer?" before it is processed by the embedding model. Likewise, if a user submits a query that is simply a topic name, such as "bladder cancer", the query transformation step refines it into a full, clear inquiry like "Who is working on bladder cancer?", thereby clarifying the user's intent for the language model.
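
A sketch of this two-step transformation, assuming the `autocorrect` package and the OpenAI chat API; the prompt wording and model name are illustrative:

```python
from autocorrect import Speller
from openai import OpenAI

spell = Speller(lang="en")
client = OpenAI()

REPHRASE_PROMPT = (
    "Rewrite the user's input as a single, clear question about research "
    "collaborators. If it is only a topic (e.g. 'bladder cancer'), turn it "
    "into a question like 'Who is working on bladder cancer?'."
)

def transform_query(raw_query: str) -> str:
    corrected = spell(raw_query)                      # step 1: fix typos before embedding
    response = client.chat.completions.create(        # step 2: LLM rephrasing
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": REPHRASE_PROMPT},
            {"role": "user", "content": corrected},
        ],
    )
    return response.choices[0].message.content.strip()

# e.g. "who si workign on bladder cancer?" -> "Who is working on bladder cancer?"
```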

We also experimented with a query expansion strategy, which adds related queries to the original query to broaden the search scope within the embedding space, but ultimately decided against it because query expansion produced too many false positives.

Performance Comparison of Query Transformation Techniques in Document Retrieval

Evaluation

Evals are best done early and often. To ensure the accuracy of our system in identifying the right principal investigators for specific cancer research areas, we built our own evals in addition to the LLM-specific ones. In collaboration with the Director of the Cancer Center, we identified a list of recognized experts across various cancer subfields to serve as our benchmark for accuracy. These experts were the "ground truth" against which we measured the success of our RAG's retrieval capability. We then applied conventional machine learning evaluation metrics such as F1 score, ROC AUC, PR AUC, precision, and recall to assess how well our RAG identified these experts for the topics at hand.
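
A minimal sketch of how such a per-topic eval can be scored; the names and retrieved results are placeholders, and ROC/PR AUC would additionally use the retriever's similarity scores rather than binary labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

ground_truth_experts = {"Jane Smith", "John Doe"}   # experts named by the Cancer Center (placeholder)
retrieved_pis = {"Jane Smith", "Alex Lee"}          # PIs surfaced by the RAG pipeline (placeholder)
all_pis = sorted(ground_truth_experts | retrieved_pis | {"Maria Garcia"})

y_true = [int(pi in ground_truth_experts) for pi in all_pis]  # expert or not
y_pred = [int(pi in retrieved_pis) for pi in all_pis]         # retrieved or not

print(f"precision={precision_score(y_true, y_pred):.2f}",
      f"recall={recall_score(y_true, y_pred):.2f}",
      f"f1={f1_score(y_true, y_pred):.2f}")
```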

Retrieval

The app’s primary purpose is to facilitate the search for collaborators, which requires a RAG pipeline adept at handling specific user queries. For example, when a user asks, “Who is working on bladder cancer?”, the system needs to retrieve all principal investigators focused on bladder cancer from both the grant and abstract data sources and provide summaries of their work. For this task, we employed a self-query retriever, which can filter results based on principal-investigator metadata.

To select the top ‘K’ documents with this retriever, we relied on cosine similarity. A further boost in retrieval quality came from integrating a re-ranker, which evaluates the relevance of the passages returned by the similarity search by scoring each one against the user query. Though architecturally distinct and slower than similarity search, re-rankers offer a higher degree of accuracy. We used cross-encoders from Sentence Transformers to re-rank the top ‘K’ documents.

Hence, the key components of our RAG system include:

  • Self-Query Retriever: Indispensable since it is the only method that allows filtering on metadata at this time
  • Re-ranker: This critical component assigns relevance scores to text passages in relation to the query, improving precision despite being slower than a simple similarity search. We adopted cross-encoders from Sentence Transformers, which provided a significant boost in determining document relevance (a sketch of the retrieve-then-rerank step follows below).
Comparison of Retrieval Performance Metrics
Precision-Recall AUC by Indication for Different Retrieval Methods
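
A minimal sketch of the retrieve-then-rerank step, assuming Sentence Transformers; the self-query metadata filtering is omitted for brevity, and the model names and candidate passages are illustrative:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "Who is working on bladder cancer?"
passages = [
    "Grant abstract: PI Jane Smith studies immunotherapy response in bladder cancer ...",
    "Grant abstract: PI John Doe investigates pancreatic cancer genomics ...",
    "Abstract: PI Alex Lee develops urine-based biomarkers for bladder cancer ...",
]

# Stage 1: cosine-similarity search over embedded chunks to get the top K candidates
embedder = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = embedder.encode(query, convert_to_tensor=True)
passage_embs = embedder.encode(passages, convert_to_tensor=True)
hits = util.semantic_search(query_emb, passage_embs, top_k=3)[0]

# Stage 2: cross-encoder re-ranking of the retrieved candidates against the query
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [passages[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
```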

LLM Observability

Monitoring and analyzing both inputs and outputs is essential for the continuous improvement of LLMs. ML models and especially LLMs are full of silent and subtle failures, so it’s critical to maintain a rigorous logging system. Regular review of logged data is key to identifying and addressing these errors. In practice, this means keeping an eye on a blend of specialized LLM metrics and the traditional ones used in machine learning.

After some exploration, we settled on Arize AI for its expansive suite of observability tools. Their platform stood out with features that not only allowed us to track metrics but also provided insightful visualizations and anomaly detection, which are instrumental in troubleshooting problems. There are plenty of similar tools on the market; find one that works best for you.
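
Whatever platform you choose, the underlying habit is the same: record every query, the chunks that were retrieved, and the generated answer so that silent failures can be traced later. A vendor-agnostic sketch, with illustrative field names:

```python
import json
import time
import uuid
from datetime import datetime, timezone

def log_rag_interaction(query, retrieved_ids, response, started_at,
                        path="rag_interactions.jsonl"):
    """Append one RAG interaction to a JSONL log for later review."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,                    # raw and/or transformed user query
        "retrieved_ids": retrieved_ids,    # which chunks the retriever surfaced
        "response": response,              # the LLM's final answer
        "latency_s": round(time.time() - started_at, 3),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```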

Deployment

Deployment proved to be as complex as building the RAG component. Moving the model from localhost to a live environment was fraught with issues, and it seems to be a common problem. Two of the primary issues we faced were:

  • AI is moving quickly, and the packages and libraries that underpin it are evolving just as fast, which causes many Python dependency issues.
  • The ever-evolving packages create another problem: the Docker images for these apps are massive, often exceeding 5 GB. This makes it hard to take advantage of easy cloud deployment tools like Streamlit Cloud or Heroku, whose size limits are usually around 500 MB.

Our main lesson here: don’t delay deployment until the end. Deploy continuously, every time you update your app, so you can troubleshoot quickly and streamline the process.

Conclusion

In summary, bringing Collabot to life at Cedars-Sinai involved a series of enhancements at each step of the RAG system. Starting with data cleaning, we ensured a foundation of quality for robust retrieval. By experimenting with chunking, we optimized the context for our LLM to reason effectively. With careful query reformulation, we ensured precision in search results. With a thoughtful retrieval architecture, we ensured only the relevant information was surfaced. And our evaluations, rooted in real-world expertise, affirmed the system’s capability to pinpoint specialists with high accuracy.

The RAG landscape is evolving rapidly with tools like ColBERT, DSPy, and Deep Memory, each with unique advantages. Deep Memory, for instance, is a deep learning model that purports to boost vector search accuracy by 22%. ColBERT is a fast and accurate BERT-based retrieval model that enables scalable search in milliseconds. And DSPy, which stands for Declarative Self-Improving Language Programs (in Python), is a new framework that rethinks prompt engineering and RAG with an algorithmic approach, yielding more stable and predictable model behavior. These tools represent potential pathways for future enhancements, promising to further refine the precision and extend the capabilities of Collabot.

Code available here: https://github.com/tatonetti-lab/collabot
