Is the Langchain Multi-Vector Retriever Worth It?

Andrew Batutin
8 min read · Nov 6, 2023


TLDR;

The Multi-Vector Retriever, which employs summaries of document sections or pages to retrieve original content for final answer generation, enhances the quality of RAG, particularly for table-intensive documents such as 10-K reports.

Intro

Tables are very data-dense sources of information, and financial documents like 10-K filings are full of them.

Let’s take Tesla’s 2022 10-K report as an example.

Tables can be in many different shapes and forms starting from pretty basic and small:

Summary of the status of production of each of our announced Tesla vehicle models in production and under development

And growing to very nuanced and data-rich, like:

Consolidated Balance Sheets

And if you check the underlying HTML code, things get even messier.

Inspecting underlying table HTML

There’s a compelling inclination to feed the entire content of 10-K reports into the RAG system, entrusting the Language Model to manage the complexity of extracting information from HTML tables and providing a useful answer.

However, the issue is that basic RAG systems, with fixed-size or HTML-element chunking, tend to underperform on questions that revolve around numerical data hidden inside tables.

With fixed-size chunking, a large table may be split across several chunks, so the table name and column headers are lost for all but the first chunk.

Consolidated Balance Sheets split into two chunks
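
To see why, here is a minimal sketch of what fixed-size chunking does to a table (the toy table and the chunk size are just for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# A toy balance-sheet-like table; real 10-K tables are far larger
table_text = (
    "Consolidated Balance Sheets (in millions)\n"
    "Item | 2022 | 2021\n"
    "Cash and cash equivalents | 16,253 | 17,576\n"
    "Accounts receivable, net | 2,952 | 1,913\n"
    "Total stockholders' equity | 44,704 | 30,189\n"
)

# Fixed-size chunking has no awareness of table structure
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=0)
for i, chunk in enumerate(splitter.split_text(table_text)):
    print(f"--- chunk {i} ---\n{chunk}")

# Only chunk 0 carries the table title and column headers; the later chunks
# are bare rows of numbers with no context for semantic search to latch onto.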

Conversely, dealing with large chunks can also lead to retrieval issues, as the combination of text paragraphs and numerical table data significantly hampers the performance of semantic search.

Splitting HTML by tags can disassemble a table down to individual rows, making correct retrieval nearly impossible.

While it’s feasible to create a custom Beautiful Soup parser, it would need to be tailored for each type of financial document. Moreover, such solutions are susceptible to changes in formatting in future document versions.
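
For illustration, such a parser could look something like this minimal sketch (the row-level chunking and the header-prepending strategy are my own assumptions, not something prescribed by the 10-K format):

from bs4 import BeautifulSoup

def table_to_row_chunks(html: str) -> list[str]:
    """Split every HTML table into row-level chunks, prepending the header
    row so that each chunk stays self-describing."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        header = " | ".join(c.get_text(strip=True) for c in rows[0].find_all(["th", "td"]))
        for row in rows[1:]:
            cells = " | ".join(c.get_text(strip=True) for c in row.find_all("td"))
            chunks.append(f"{header}\n{cells}")
    return chunks

Even this simple version bakes in assumptions (a single header row, tables built from plain tr/td tags) that a filing from another issuer, or next year's version of the same report, can easily break.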

Therefore, indexing a document with tables is a complex task.

Multi-Vector Retriever

One way to improve RAG performance for table-heavy docs is the Multi-Vector Retriever.

Its implementation is described in detail in the Langchain Semi_Structured_RAG notebook.

In short:

  • It uses Unstructured to parse both text and tables from HTML pages
  • The multi-vector retriever stores the raw HTML pages alongside summaries of them that are better suited for retrieval (see the short sketch below)
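
At query time the retriever matches the question against the summaries, then returns the corresponding raw pages. A minimal sketch of that flow, assuming the retriever has already been populated as in the code further down:

# Hypothetical query; the retriever embeds it, matches it against the stored
# summaries, then follows the doc_id metadata back to the raw pages.
query = "What is the value of cash and cash equivalents in 2022?"
raw_pages = retriever.get_relevant_documents(query)
# raw_pages are the original page texts (with full tables), not the summaries,
# so the LLM gets the complete table when generating the final answer.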

But it will require:

  • additional summarization of content
  • meaning more LLM calls
  • meaning a more expensive RAG pipeline
  • adoption of the Langchain lib
  • or an implementation of the multi-vector retriever from scratch

So there’s no free lunch here. A considerable chunk of work still has to be done.

Both the indexing and retrieval parts of the RAG system have to be seriously modified to make this approach work.

Evaluation Dataset

To get an idea about RAG performance on Tesla’s 2022 10-K filing, let’s use this small question-answer dataset.

q_a_10_k = {
    "What is the value of cash and cash equivalents in 2022?": "16,253 $ millions",
    "What is the value of cash and cash equivalents in 2021?": "17,576 $ millions",
    "What is the net value of accounts receivable in 2022?": "2,952 $ millions",
    "What is the net value of accounts receivable in 2021?": "1,913 $ millions",
    "What is the total stockholders' equity? in 2022?": "44,704 $ millions",
    "What is the total stockholders' equity? in 2021?": "30,189 $ millions",
    "What are total operational expenses for research and development in 2022?": "3,075 $ millions",
    "What are total operational expenses for research and development in 2021?": "2,593 $ millions",
}

It’s nowhere near enough for production-grade system evaluation.

But it’s good enough to get a general idea of whether things are better with the multi-vector approach.

Basic Retriever

As a baseline, let’s use:

  • Unstructured-based HTML chunking in “paged” mode
  • a Chroma DB vector store with all-MiniLM-L6-v2 embeddings

Set up Chroma DB:

from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# load data
doc_path = "/content/Articles/10-K.html"
loader = UnstructuredHTMLLoader(doc_path, mode="paged")
data = loader.load()

# load it into Chroma
data_texts = [element.page_content for element in data]
db = Chroma.from_texts(
    data_texts,
    collection_name="inacc-table",
    embedding=SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2"),
)

And run the evaluation with our sample question-answer pairs:

import pandas as pd
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

basic_answers = []
for question in q_a_10_k.keys():
    query = question
    expected_answer = q_a_10_k[query]

    docs = db.similarity_search(query)

    template = """Answer the question based only on the following context, which can include text and tables:
    {context}
    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # LLM
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", openai_api_key=open_ai_key)
    chain = LLMChain(llm=model, prompt=prompt)

    full_answer = chain.run({"question": query, "context": docs})

    template = """Please validate the output of the other LLM chain and compare it to the expected answer:
    {context}
    Expected answer: {expected_answer}

    Return YES if the answer is correct, otherwise return NO.
    """
    prompt = ChatPromptTemplate.from_template(template)

    chain = LLMChain(llm=model, prompt=prompt)

    actual_answer = chain.run({"expected_answer": expected_answer, "context": full_answer})

    res = {
        "doc_id": "10-K.html",
        "enrichment_type": "None",
        "question": query,
        "question_type": "Specific",
        "actual_answer": full_answer,
        "expected_answer": expected_answer,
        "is_correct": actual_answer,
    }
    basic_answers.append(res)

df = pd.DataFrame(basic_answers)
df

Note that the evaluation is also done with an LLMChain. The reason is simple: LLM output is non-deterministic. It can be:

The value of cash and cash equivalents in 2022 is $16,253 million.

or

The cash and cash equivalents value in 2022 was $16,253 million.

Both answers are semantically the same, but a strict string comparison will say they are not.

This is why I use a simple verification chain:

template = """Please validate the output of the other LLM chain and compare it to the expected answer:
{context}
Expected answer: {expected_answer}

Return YES if the answer is correct, otherwise return NO.
"""
prompt = ChatPromptTemplate.from_template(template)

chain = LLMChain(llm=model, prompt=prompt)

actual_answer = chain.run({"expected_answer": expected_answer, "context": full_answer})

Let’s check the results:

Not good.

Only for one pair of questions, about R&D expenses, did the RAG pipeline produce correct results.

And on top of that, we got a hallucinated answer for question 2.

Multi-Vector Retriever

Building the Multi-Vector Retriever requires a bit of preparation:

Helper functions to:

  • summarize pages of the 10-K report
  • build a MultiVectorRetriever over both the summaries and the raw pages
  • assemble the RAG retrieval pipeline

# helper functions
# to save time, the summaries were pre-calculated
import os
import uuid

import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

TABLE_SUMMARIES_CSV = "/content/Articles/table_summaries_0.csv"


def summarize(texts):
    """
    This function summarizes the given texts using a GPT-3.5 model. It also checks if a CSV file with previous
    summaries exists; if it does, it loads the summaries from there instead of generating new ones.

    Args:
        texts (list): A list of texts to be summarized.

    Returns:
        list: A list of summarized texts.
    """
    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text. \
    Give a concise summary of the table or text. Table or text chunk: {element} """
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Summary chain
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k-0613", openai_api_key=open_ai_key)
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    tables = [i for i in texts]
    table_summaries = []

    # load cached summaries from the CSV file if it exists
    if os.path.exists(TABLE_SUMMARIES_CSV):
        t_frame = pd.read_csv(TABLE_SUMMARIES_CSV)
        table_summaries = [elem[1] for elem in t_frame.values.tolist()]
    else:
        for i in range(0, len(tables)):
            res = summarize_chain.invoke(tables[i])
            table_summaries.append(res)

        t_frame = pd.DataFrame(table_summaries)
        t_frame.to_csv(TABLE_SUMMARIES_CSV)

    return table_summaries


def setup_retriever(sections):
    """
    This function sets up a retriever for the given sections of text. It first summarizes the sections, then creates a
    Chroma vectorstore to index the summaries. It also sets up an InMemoryStore for the parent documents and a
    MultiVectorRetriever to retrieve the documents. Finally, it adds the summarized texts to the vectorstore and the
    original sections to the docstore.

    Args:
        sections (list): A list of sections of text to be indexed and retrieved.

    Returns:
        MultiVectorRetriever: A retriever set up with the given sections of text.
    """
    text_summaries = summarize(sections)

    # The vectorstore to use to index the child chunks
    vectorstore = Chroma(
        collection_name="summaries",
        embedding_function=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
    )

    # The storage layer for the parent documents
    store = InMemoryStore()
    id_key = "doc_id"

    # The retriever (empty to start)
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Add texts
    doc_ids = [str(uuid.uuid4()) for _ in sections]
    summary_texts = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(text_summaries)]
    retriever.vectorstore.add_documents(summary_texts)
    retriever.docstore.mset(list(zip(doc_ids, sections)))

    return retriever


def rag(retriever):
    """
    This function sets up a RAG (Retrieval-Augmented Generation) pipeline. It first sets up a prompt template, then
    initializes a GPT-3.5 model. It then creates a chain that takes a context from the retriever and a question,
    passes them through the prompt and the model, and parses the output into a string.

    Args:
        retriever (MultiVectorRetriever): A retriever set up with the sections of text to be used as context.

    Returns:
        Chain: A chain that can be used to answer questions based on the context provided by the retriever.
    """
    # Prompt template
    template = """Answer the question based only on the following context, which can include text and tables:
    {context}
    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # LLM
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", openai_api_key=open_ai_key)

    # RAG pipeline
    chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model | StrOutputParser()
    return chain

RAG Chain:

The chain is similar to the basic retrieval case, including the evaluation part.

texts_in_data = [element.page_content for element in data]
# sum_texts = summarize(texts_in_data)
retriever = setup_retriever(texts_in_data)
chain = rag(retriever)

answers = []
for question in q_a_10_k.keys():
    res = chain.invoke(question)
    # res = {question["question"]: res}
    expected_answer = q_a_10_k[question]

    template = """Please validate the output of the other LLM chain and compare it to the expected answer:
    {context}
    Expected answer: {expected_answer}

    Return YES if the answer is correct, otherwise return NO.
    """
    prompt = ChatPromptTemplate.from_template(template)

    # LLM
    model = ChatOpenAI(temperature=0.1, model="gpt-3.5-turbo", openai_api_key=open_ai_key)
    verify_chain = LLMChain(llm=model, prompt=prompt)

    actual_answer = verify_chain.run({"expected_answer": expected_answer, "context": res})

    res = {
        "gem_id": "gem_tesla_10k_2022.json",
        "enrichment_type": "None",
        "question": question,
        "question_type": "Specific",
        "actual_answer": res,
        "expected_answer": expected_answer,
        "is_correct": actual_answer,
    }

    answers.append(res)

df_multi = pd.DataFrame(answers)
df_multi

Results:

Much better.

In the case of incorrect answers, the RAG pipeline at least didn’t produce hallucinated ones.

Conclusion

In conclusion, the practice of summarizing document pages and utilizing their embeddings for retrieval significantly outperforms the results of a more “naive” retrieval method.

However, a notable drawback of this strategy is the necessity to summarize all document pages in advance, which incurs additional costs.

On a positive note, this method works with GPT-3.5. Furthermore, GPT-4 is becoming increasingly affordable.

Nevertheless, multi-vector retrieval is not a panacea. Answering more complex, high-level questions necessitates improved data extraction from tables.

P.S. OpenAI RAG

What about latest OpenAI integrated RAG functionality?

It’s actually great!

Long story short, I was able to create my own Retrieval Agent and get all the answers correct.

I’ve used both the new gpt-4-1106-preview model and gpt-3.5-turbo-16k.

And in both cases all answers were correct.

OpenAI did it again.
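
For reference, here is a minimal sketch of the kind of Retrieval Assistant I mean, using the OpenAI Assistants API with the built-in retrieval tool (the instructions string and the polling loop are illustrative assumptions, not the exact code I ran):

import time
from openai import OpenAI

client = OpenAI(api_key=open_ai_key)

# Upload the 10-K so the built-in retrieval tool can index it
report = client.files.create(file=open("/content/Articles/10-K.html", "rb"), purpose="assistants")

# Assistant with retrieval enabled; "gpt-3.5-turbo-16k" gave correct answers as well
assistant = client.beta.assistants.create(
    name="10-K analyst",
    instructions="Answer questions using only the attached 10-K filing.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[report.id],
)

# Ask one of the evaluation questions in a fresh thread
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What is the value of cash and cash equivalents in 2022?",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then print the newest assistant message
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)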

Resources

Finance Lesson 10 by Martin Shkreli: https://www.youtube.com/watch?v=Kwu5vxTaEZg

Semi_Structured_RAG.ipynb

Colab with code: https://colab.research.google.com/drive/12CWP9wbeY_29uuuT5lyitBT43lLJqpnq?usp=sharing
