Advanced retrieval for AI with Chroma

Pradeep Goel
5 min read · Jan 18, 2024


Advanced RAG (Retrieval Augmented Generation) techniques are discussed to unlock the full potential of LLMs with an enterprise knowledge base.

Introduction

Retrieval Augmented Generation has become a very popular way to provide context to an LLM from your own documents. Retrieval of the context is based solely on the user query, so the parts of the documents that actually contain the answer may not be retrieved, and the response generated by the LLM will lack that context. Advanced Retrieval for AI with Chroma introduces query expansion and fine-tuning of embeddings to make retrieval much better. A query can be expanded by appending a hypothetical answer or by adding similar questions to the query. An embedding model can be fine-tuned by providing user feedback on retrieved results.

Overview of embedding-based retrieval

Retrieval augmented generation is used to build a chat application by adding context from an organization-specific knowledge base to any LLM. This requires transforming knowledge artifacts such as PDFs, docs, or other files into smaller chunks, embedding these chunks, and storing them in a vector database for later retrieval. When a user sends a query to the chat application, the query is sent to the vector database to retrieve the relevant chunks. The user query and retrieved chunks are then sent to the LLM, which generates a response that is returned to the user.

RAG Overview — Image By Author
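Here is a minimal sketch of that flow using Chroma's Python client. The collection name, the placeholder chunks, and the sample question are illustrative assumptions, and the default embedding function is used for simplicity.

# A minimal sketch of embedding-based retrieval with Chroma.
import chromadb

chroma_client = chromadb.Client()  # in-memory client; use PersistentClient for a durable store
collection = chroma_client.get_or_create_collection(name="annual_report")  # hypothetical name

# Assume `chunks` is a list of text chunks produced from the source documents.
chunks = ["<chunk 1 text>", "<chunk 2 text>"]
collection.add(ids=[str(i) for i in range(len(chunks))], documents=chunks)

# Retrieve the chunks most relevant to the user query; these become the LLM's context.
query = "What was the total revenue?"
results = collection.query(query_texts=[query], n_results=5)
retrieved_documents = results["documents"][0]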

Challenge/pitfall in the retrieval process

In the above process, the retrieved context is based solely on the user query. As a result, it can fetch information that is not very relevant and miss the information that actually contains the answer.

Query expansion by adding a hypothetical answer

One solution to the above problem is to ask the LLM to generate a hypothetical answer before sending the query for retrieval. The hypothetical answer generated by the LLM is appended to the user query and sent to the vector database to retrieve relevant document chunks. This short course uses the example of Microsoft's 2022 annual report and starts by asking simple questions such as “What is the total revenue?” or “What is the strategy around artificial intelligence (AI)?”. All retrieved documents have some information related to the questions asked, but some of them are not very relevant to the expected answer. Let us generate a hypothetical answer using the prompt below. Once a hypothetical answer is generated, it is appended to the query and sent for retrieval to get better context. The retrieved documents are then sent to the LLM to produce the answer.

# Query sent to the LLM for generating a hypothetical answer.
messages = [
    {
        "role": "system",
        "content": "You are a helpful expert financial research assistant. "
        "Provide an example answer to the given question, "
        "that might be found in a document like an annual report.",
    },
    {"role": "user", "content": query},
]
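A rough sketch of generating the hypothetical answer and appending it to the query before retrieval is shown below; the OpenAI client and model name are assumptions, and `collection` is the Chroma collection from the earlier sketch.

# Generate the hypothetical answer, then retrieve with the joint query.
from openai import OpenAI

openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model; any chat model works here
    messages=messages,
)
hypothetical_answer = response.choices[0].message.content

# Append the hypothetical answer to the original query and query the vector store.
joint_query = f"{query} {hypothetical_answer}"
results = collection.query(query_texts=[joint_query], n_results=5)
retrieved_documents = results["documents"][0]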

Another approach is to generate additional queries by sending the original query to the LLM and asking it to suggest a number of additional related queries. Here is the prompt for that.

messages = [
    {
        "role": "system",
        "content": "You are a helpful expert financial research assistant. Your users are asking questions about an annual report. "
        "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
        "Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic. "
        "Make sure they are complete questions, and that they are related to the original question. "
        "Output one question per line. Do not number the questions.",
    },
    {"role": "user", "content": query},
]
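The LLM's output can then be split into individual questions and sent to Chroma alongside the original query. The sketch below assumes the `openai_client` and `collection` objects from the earlier snippets.

# Expand retrieval with the LLM-suggested questions.
response = openai_client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
augmented_queries = [q.strip() for q in response.choices[0].message.content.split("\n") if q.strip()]

# Chroma accepts several query texts at once and returns results per query.
results = collection.query(query_texts=[query] + augmented_queries, n_results=5)

# Flatten and de-duplicate the retrieved chunks before passing them to the LLM.
unique_documents = list(dict.fromkeys(doc for docs in results["documents"] for doc in docs))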

Both techniques expand the original query, and this expanded query enables better retrieval of documents and provides better context to the LLM for generating a far better response.

Cross-encoder reranking

Increase the number of retrieved documents, say from five to ten, and re-rank them using a BERT cross-encoder. The cross-encoder takes a pair consisting of the query and a retrieved document and scores it with a classifier. As we are now retrieving ten documents, there are ten such pairs. We score these ten pairs, pick the top five from the list, and pass them to the LLM for generating the response.
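A sketch of this re-ranking step using the sentence-transformers CrossEncoder is below; the specific model name is an assumption, and `query` and `collection` come from the earlier snippets.

# Re-rank retrieved documents with a cross-encoder.
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed re-ranking model

# Retrieve ten documents instead of five, then score each (query, document) pair.
results = collection.query(query_texts=[query], n_results=10)
retrieved_documents = results["documents"][0]
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)

# Keep the five highest-scoring documents for the LLM.
ranked = sorted(zip(scores, retrieved_documents), key=lambda pair: pair[0], reverse=True)
top_documents = [doc for _, doc in ranked[:5]]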

Embedding adapters

Embedding adapters are trained using feedback from users on retrieved results. The adapter is an intermediate step between the query embedding and the retrieval store. To begin, we develop a dataset by asking the LLM to generate multiple questions. The following prompt generates questions that a user might ask when analyzing an annual report.

# Prompt for generating multiple questions
messages = [
    {
        "role": "system",
        "content": "You are a helpful expert financial research assistant. You help users analyze financial statements to better understand companies. "
        "Suggest 10 to 15 short questions that are important to ask when analyzing an annual report. "
        "Do not output any compound questions (questions with multiple sentences or conjunctions). "
        "Output each question on a separate line divided by a newline.",
    },
]

Now we use these questions to retrieve documents and pass each retrieved document, along with its query, to the LLM, asking it to evaluate the document and answer with a simple “yes” or “no”.

# Prompt for evaluating a retrieved document for the given query
messages = [
    {
        "role": "system",
        "content": "You are a helpful expert financial research assistant. You help users analyze financial statements to better understand companies. "
        "For the given query, evaluate whether the following statement is relevant. "
        "Output only 'yes' or 'no'.",
    },
    {
        "role": "user",
        "content": f"Query: {query}, Statement: {statement}",
    },
]
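These pieces can be combined into a labelled training set of (query embedding, document embedding, relevance label) tuples. The sketch below assumes `generated_queries` holds the questions produced by the first prompt and `evaluate_with_llm` is a hypothetical helper that sends the evaluation prompt above and returns "yes" or "no".

# Build the training tuples for the adapter: query embedding, document embedding, label.
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()  # same model family as the collection
adapter_query_embeddings, adapter_doc_embeddings, adapter_labels = [], [], []

for generated_query in generated_queries:  # questions produced by the first prompt
    results = collection.query(query_texts=[generated_query], n_results=10)
    for document in results["documents"][0]:
        relevance = evaluate_with_llm(generated_query, document)  # hypothetical helper wrapping the prompt above
        adapter_query_embeddings.append(embedding_function([generated_query])[0])
        adapter_doc_embeddings.append(embedding_function([document])[0])
        adapter_labels.append(1.0 if relevance == "yes" else -1.0)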

Once the dataset is ready, it can be used to train a simple adapter model that adjusts the query embedding so that the retrieved documents better match the relevance labels for each query.
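A rough sketch of such an adapter is below, assuming it is a single learned matrix applied to the query embedding and trained with PyTorch against the labels collected above; this is a simplification, not the course's exact training loop.

# Train a single-matrix embedding adapter on the collected feedback.
import torch

query_emb = torch.tensor(adapter_query_embeddings, dtype=torch.float32)
doc_emb = torch.tensor(adapter_doc_embeddings, dtype=torch.float32)
labels = torch.tensor(adapter_labels, dtype=torch.float32)

dim = query_emb.shape[1]
adapter_matrix = torch.randn(dim, dim, requires_grad=True)
optimizer = torch.optim.Adam([adapter_matrix], lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    adapted_queries = query_emb @ adapter_matrix  # transform the query embeddings
    similarity = torch.nn.functional.cosine_similarity(adapted_queries, doc_emb, dim=1)
    loss = loss_fn(similarity, labels)  # push similarity toward the +1/-1 labels
    loss.backward()
    optimizer.step()

# At query time, multiply the query embedding by adapter_matrix before sending it to Chroma.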

Other techniques

There are many new techniques being researched and introduced; some of them are:

  • Build a better embedding adapter using a full-scale neural network or transformer layer
  • A relevancy model able to handle much more complexity than the cross-encoder introduced here
  • Evaluate different chunking strategies to improve retrieved results

In summary, RAG is becoming an important component that enables effective use of LLMs. Query expansion and embedding adapters promise responses that are much more relevant and useful.


Pradeep Goel

I am an IT professional with a keen interest in emerging technologies and their use to make things simpler and better.