Is the Langchain Multi-Vector Retriever Worth It?

Andrew Batutin
8 min read · Nov 6, 2023


TLDR;

The Multi-Vector Retriever, which employs summaries of document sections or pages to retrieve original content for final answer generation, enhances the quality of RAG, particularly for table-intensive documents such as 10-K reports.

Intro

Tables are very data-dense sources of information, and financial documents like 10-K filings are full of them.

Let’s take Tesla’s 2022 10-K report as an example.

Tables can be in many different shapes and forms starting from pretty basic and small:

Summary of the status of production of each of our announced Tesla vehicle models in production and under development

And growing to very nuanced and data-rich, like:

Consolidated Balance Sheets

And if you check the underlying HTML code, things get even messier.

Inspecting underlying table HTML

There’s a compelling inclination to feed the entire content of 10-K reports into the RAG system, entrusting the Language Model to manage the complexity of extracting information from HTML tables and providing a useful answer.

However, the issue is that basic RAG systems, with fixed-size or HTML-element chunking, tend to underperform on questions that revolve around numerical data hidden inside tables.

With fixed-size chunking, a large table may be split across several chunks, so the table name and column headers are lost for all but the first chunk.

Consolidated Balance Sheets split into two chunks
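
To see why, here is a minimal sketch of what fixed-size chunking does to a table (the toy table and the chunk size are just for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# A toy balance-sheet-like table; real 10-K tables are far larger
table_text = (
    "Consolidated Balance Sheets (in millions)\n"
    "Item | 2022 | 2021\n"
    "Cash and cash equivalents | 16,253 | 17,576\n"
    "Accounts receivable, net | 2,952 | 1,913\n"
    "Total stockholders' equity | 44,704 | 30,189\n"
)

# Fixed-size chunking has no awareness of table structure
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=0)
for i, chunk in enumerate(splitter.split_text(table_text)):
    print(f"--- chunk {i} ---\n{chunk}")

# Only chunk 0 carries the table title and column headers; the later chunks
# are bare rows of numbers with no context for semantic search to latch onto.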

Conversely, dealing with large chunks can also lead to retrieval issues, as the combination of text paragraphs and numerical table data significantly hampers the performance of semantic search.

Splitting HTML by tags can disassemble a table down to individual rows, making correct retrieval nearly impossible.

While it’s feasible to create a custom Beautiful Soup parser, it would need to be tailored for each type of financial document. Moreover, such solutions are susceptible to changes in formatting in future document versions.
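
For illustration, such a parser could look something like this minimal sketch (the row-level chunking and the header-prepending strategy are my own assumptions, not something prescribed by the 10-K format):

from bs4 import BeautifulSoup

def table_to_row_chunks(html: str) -> list[str]:
    """Split every HTML table into row-level chunks, prepending the header
    row so that each chunk stays self-describing."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        header = " | ".join(c.get_text(strip=True) for c in rows[0].find_all(["th", "td"]))
        for row in rows[1:]:
            cells = " | ".join(c.get_text(strip=True) for c in row.find_all("td"))
            chunks.append(f"{header}\n{cells}")
    return chunks

Even this simple version bakes in assumptions (a single header row, tables built from plain tr/td tags) that a filing from another issuer, or next year's version of the same report, can easily break.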

Therefore, indexing a document with tables is a complex task.

Multi-Vector Retriever

One way to improve RAG performance for table-heavy docs is the Multi-Vector Retriever.

Its implementation is described in detail in the Langchain Semi_Structured_RAG notebook.

In short:

  • It uses Unstructured to parse both text and tables from HTML pages
  • The multi-vector retriever stores the raw HTML pages alongside summaries of them that are better suited for retrieval (see the short sketch below)
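
At query time the retriever matches the question against the summaries, then returns the corresponding raw pages. A minimal sketch of that flow, assuming the retriever has already been populated as in the code further down:

# Hypothetical query; the retriever embeds it, matches it against the stored
# summaries, then follows the doc_id metadata back to the raw pages.
query = "What is the value of cash and cash equivalents in 2022?"
raw_pages = retriever.get_relevant_documents(query)
# raw_pages are the original page texts (with full tables), not the summaries,
# so the LLM gets the complete table when generating the final answer.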

But it will require:

  • additional summarization of content
  • meaning more LLM calls
  • meaning a more expensive RAG pipeline
  • adoption of the Langchain lib
  • or an implementation of the multi-vector retriever from scratch

So there’s no free lunch here. A considerable chunk of work still has to be done.

Both the indexing and retrieval parts of the RAG system have to be seriously modified to make this approach work.

Evaluation Dataset

To get an idea about RAG performance on Tesla’s 2022 10-K filing, let’s use this small question-answer dataset.

q_a_10_k = {
    "What is the value of cash and cash equivalents in 2022?": "16,253 $ millions",
    "What is the value of cash and cash equivalents in 2021?": "17,576 $ millions",
    "What is the net value of accounts receivable in 2022?": "2,952 $ millions",
    "What is the net value of accounts receivable in 2021?": "1,913 $ millions",
    "What is the total stockholders' equity? in 2022?": "44,704 $ millions",
    "What is the total stockholders' equity? in 2021?": "30,189 $ millions",
    "What are total operational expenses for research and development in 2022?": "3,075 $ millions",
    "What are total operational expenses for research and development in 2021?": "2,593 $ millions",
}

It’s nowhere near enough for production-grade system evaluation.

But it’s good enough to get a general idea of whether things are better with the multi-vector approach.

Basic Retriever

As a baseline, let’s use:

  • Unstructured-based HTML chunking in “paged” mode
  • a Chroma DB vector store with all-MiniLM-L6-v2 embeddings

Set up Chroma DB:

from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# load data
doc_path = "/content/Articles/10-K.html"
loader = UnstructuredHTMLLoader(doc_path, mode="paged")
data = loader.load()

# load it into Chroma
data_texts = [element.page_content for element in data]
db = Chroma.from_texts(
    data_texts,
    collection_name="inacc-table",
    embedding=SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2"),
)

And run the evaluation with our sample question-answer pairs:

import pandas as pd
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

basic_answers = []
for question in q_a_10_k.keys():
    query = question
    expected_answer = q_a_10_k[query]

    docs = db.similarity_search(query)

    template = """Answer the question based only on the following context, which can include text and tables:
    {context}
    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # LLM
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", openai_api_key=open_ai_key)
    chain = LLMChain(llm=model, prompt=prompt)

    full_answer = chain.run({"question": query, "context": docs})

    template = """Please validate the output of the other LLM chain and compare it to the expected answer:
    {context}
    Expected answer: {expected_answer}

    Return YES if the answer is correct, otherwise return NO.
    """
    prompt = ChatPromptTemplate.from_template(template)

    chain = LLMChain(llm=model, prompt=prompt)

    actual_answer = chain.run({"expected_answer": expected_answer, "context": full_answer})

    res = {
        "doc_id": "10-K.html",
        "enrichment_type": "None",
        "question": query,
        "question_type": "Specific",
        "actual_answer": full_answer,
        "expected_answer": expected_answer,
        "is_correct": actual_answer,
    }
    basic_answers.append(res)

df = pd.DataFrame(basic_answers)
df

Note that the evaluation is also done with an LLMChain. The reason is simple: LLM output is non-deterministic. It can be:

The value of cash and cash equivalents in 2022 is $16,253 million.

or

The cash and cash equivalents value in 2022 was $16,253 million.

Both answers are semantically the same, but a strict string comparison will say they are not.

This is why I use a simple verification chain:

template = """Please validate the output of the other LLM chain and compare it to the expected answer:
{context}
Expected answer: {expected_answer}

Return YES if the answer is correct, otherwise return NO.
"""
prompt = ChatPromptTemplate.from_template(template)

chain = LLMChain(llm=model, prompt=prompt)

actual_answer = chain.run({"expected_answer": expected_answer, "context": full_answer})

Let’s check the results:

Not good.

Only for one pair of questions, about R&D expenses, did the RAG pipeline produce correct results.

And on top of that, we got a hallucinated answer for question 2.

Multi-Vector Retriever

Building the Multi-Vector Retriever requires a bit of preparation:

Helper functions to:

  • summarize pages of the 10-K report
  • build a MultiVectorRetriever over both the summaries and the raw pages
  • assemble the RAG retrieval pipeline

# helper functions
# to save time, the summaries were pre-calculated
import os
import uuid

import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

TABLE_SUMMARIES_CSV = "/content/Articles/table_summaries_0.csv"


def summarize(texts):
    """
    This function summarizes the given texts using a GPT-3.5 model. It also checks if a CSV file with previous
    summaries exists; if it does, it loads the summaries from there instead of generating new ones.

    Args:
        texts (list): A list of texts to be summarized.

    Returns:
        list: A list of summarized texts.
    """
    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text. \
    Give a concise summary of the table or text. Table or text chunk: {element} """
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Summary chain
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k-0613", openai_api_key=open_ai_key)
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    tables = [i for i in texts]
    table_summaries = []

    # load cached summaries from the CSV file if it exists
    if os.path.exists(TABLE_SUMMARIES_CSV):
        t_frame = pd.read_csv(TABLE_SUMMARIES_CSV)
        table_summaries = [elem[1] for elem in t_frame.values.tolist()]
    else:
        for i in range(0, len(tables)):
            res = summarize_chain.invoke(tables[i])
            table_summaries.append(res)

        t_frame = pd.DataFrame(table_summaries)
        t_frame.to_csv(TABLE_SUMMARIES_CSV)

    return table_summaries


def setup_retriever(sections):
    """
    This function sets up a retriever for the given sections of text. It first summarizes the sections, then creates a
    Chroma vectorstore to index the summaries. It also sets up an InMemoryStore for the parent documents and a
    MultiVectorRetriever to retrieve the documents. Finally, it adds the summarized texts to the vectorstore and the
    original sections to the docstore.

    Args:
        sections (list): A list of sections of text to be indexed and retrieved.

    Returns:
        MultiVectorRetriever: A retriever set up with the given sections of text.
    """
    text_summaries = summarize(sections)

    # The vectorstore to use to index the child chunks
    vectorstore = Chroma(
        collection_name="summaries",
        embedding_function=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
    )

    # The storage layer for the parent documents
    store = InMemoryStore()
    id_key = "doc_id"

    # The retriever (empty to start)
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Add texts
    doc_ids = [str(uuid.uuid4()) for _ in sections]
    summary_texts = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(text_summaries)]
    retriever.vectorstore.add_documents(summary_texts)
    retriever.docstore.mset(list(zip(doc_ids, sections)))

    return retriever


def rag(retriever):
    """
    This function sets up a RAG (Retrieval-Augmented Generation) pipeline. It first sets up a prompt template, then
    initializes a GPT-3.5 model. It then creates a chain that takes a context from the retriever and a question,
    passes them through the prompt and the model, and parses the output into a string.

    Args:
        retriever (MultiVectorRetriever): A retriever set up with the sections of text to be used as context.

    Returns:
        Chain: A chain that can be used to answer questions based on the context provided by the retriever.
    """
    # Prompt template
    template = """Answer the question based only on the following context, which can include text and tables:
    {context}
    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # LLM
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", openai_api_key=open_ai_key)

    # RAG pipeline
    chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model | StrOutputParser()
    return chain

RAG Chain:

The chain is similar to the basic retrieval case, including the evaluation part.

texts_in_data = [element.page_content for element in data]
# sum_texts = summarize(texts_in_data)
retriever = setup_retriever(texts_in_data)
chain = rag(retriever)

answers = []
for question in q_a_10_k.keys():
    res = chain.invoke(question)
    # res = {question["question"]: res}
    expected_answer = q_a_10_k[question]

    template = """Please validate the output of the other LLM chain and compare it to the expected answer:
    {context}
    Expected answer: {expected_answer}

    Return YES if the answer is correct, otherwise return NO.
    """
    prompt = ChatPromptTemplate.from_template(template)

    # LLM
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo", openai_api_key=open_ai_key)
    verify_chain = LLMChain(llm=model, prompt=prompt)

    actual_answer = verify_chain.run({"expected_answer": expected_answer, "context": res})

    res = {
        "gem_id": "gem_tesla_10k_2022.json",
        "enrichment_type": "None",
        "question": question,
        "question_type": "Specific",
        "actual_answer": res,
        "expected_answer": expected_answer,
        "is_correct": actual_answer,
    }

    answers.append(res)

df_multi = pd.DataFrame(answers)
df_multi

Results:

Much better.

In the case of incorrect answers, the RAG pipeline at least didn’t produce hallucinated ones.

Conclusion

In conclusion, the practice of summarizing document pages and utilizing their embeddings for retrieval significantly outperforms the results of a more “naive” retrieval method.

However, a notable drawback of this strategy is the necessity to summarize all document pages in advance, which incurs additional costs.

On a positive note, this method works with GPT-3.5. Furthermore, GPT-4 is becoming increasingly affordable.

Nevertheless, multi-vector retrieval is not a panacea. Answering more complex, high-level questions necessitates improved data extraction from tables.

P.S. OpenAI RAG

What about latest OpenAI integrated RAG functionality?

It’s actually great!

Long story short, I was able to create my own Retrieval Agent and get all the answers correct.

I’ve used both the new gpt-4-1106-preview model and gpt-3.5-turbo-16k.

And in both cases all answers were correct.

OpenAI did it again.
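
For reference, here is a minimal sketch of the kind of Retrieval Assistant I mean, using the OpenAI Assistants API with the built-in retrieval tool (the instructions string and the polling loop are illustrative assumptions, not the exact code I ran):

import time
from openai import OpenAI

client = OpenAI(api_key=open_ai_key)

# Upload the 10-K so the built-in retrieval tool can index it
report = client.files.create(file=open("/content/Articles/10-K.html", "rb"), purpose="assistants")

# Assistant with retrieval enabled; "gpt-3.5-turbo-16k" gave correct answers as well
assistant = client.beta.assistants.create(
    name="10-K analyst",
    instructions="Answer questions using only the attached 10-K filing.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[report.id],
)

# Ask one of the evaluation questions in a fresh thread
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is the value of cash and cash equivalents in 2022?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then print the newest assistant message
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)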

Resources

Finance Lesson 10 by Martin Shkreli: https://www.youtube.com/watch?v=Kwu5vxTaEZg

Semi_Structured_RAG.ipynb

Colab with code: https://colab.research.google.com/drive/12CWP9wbeY_29uuuT5lyitBT43lLJqpnq?usp=sharing
