LLM Study Diary: Breaking Down and Analyzing the Official MultiVector Retriever Sample Code

While doing a comprehensive review of LangChain, I realized I wanted to study Retriever more in-depth. So, I decided to dive into the official documentation’s sample code for MultiVectorRetriever.

After just skimming through it, I didn’t really understand it; I was stumped. First of all, I couldn’t grasp what the end result of the code was supposed to be.

# Retriever returns larger chunks 
len(retriever.invoke("justice breyer")[0].page_content)

After pondering for a while and reading through the code, I suddenly had an ‘aha moment’. It came from this part of the code:

retriever.invoke("justice breyer")[0] 

It clicked that within the invoke() method, it’s not only performing a similarity search using the vector store, but also looking up the docs that share the same id as the matching sub_docs. That’s when it all fell into place.

In the middle of this official documentation, it states:

Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream.

Vector search, or similarity search, becomes more accurate when using smaller chunks. I understood that MultiVectorRetriever performs vector search on smaller chunks, but retrieves the original document data in larger chunks. That’s what MultiVectorRetriever does!

Working backward from the end of the code to understand why this works, I noticed that when creating the MultiVectorRetriever object, they set the parameter id_key="doc_id". This is subtle and easy to overlook.

id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

In the following code, the docs are further divided into sub_docs, and each sub_doc is assigned the same id as its parent doc via a metadata entry under the key id_key ("doc_id"). This setup lets the retriever perform accurate similarity searches at the sub_doc level using the vector store, and then use the "doc_id" of the matching sub_docs to retrieve the parent docs, which contain the wider context the sub_docs came from.

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Here, _doc.metadata[id_key] = _id sets metadata on the sub_doc under the key id_key (which is "doc_id"). Because "doc_id" was passed to the MultiVectorRetriever constructor earlier, the retriever can use it to link sub_docs to their parent docs. Thus, with a single invoke() call as shown below, MultiVectorRetriever performs a similarity search in the vector store, finds the matching sub_docs, uses their "doc_id" to look up the docs with the same id in the byte_store, and returns those docs.
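To make this two-step lookup concrete, here is a minimal pure-Python sketch of the flow. This is an illustration only, not LangChain’s actual implementation: the docstore is modeled as a plain dict, and similarity search is faked with word overlap instead of real embeddings.

```python
# Parent docs live in a key-value "byte store", keyed by doc_id.
docstore = {
    "id-1": "Parent document 1: a long passage about Justice Breyer ...",
    "id-2": "Parent document 2: an unrelated long passage ...",
}

# Small chunks live in the "vector store"; each carries its parent's id
# in metadata under id_key ("doc_id").
sub_docs = [
    {"page_content": "about Justice Breyer", "metadata": {"doc_id": "id-1"}},
    {"page_content": "an unrelated long", "metadata": {"doc_id": "id-2"}},
]

def similarity_search(query):
    # Stand-in for real vector similarity: rank chunks by word overlap.
    words = set(query.lower().split())
    return sorted(
        sub_docs,
        key=lambda d: len(words & set(d["page_content"].lower().split())),
        reverse=True,
    )

def invoke(query):
    # Step 1: similarity search over the small chunks.
    matches = similarity_search(query)
    # Step 2: collect the parents' ids (deduplicated, in order) ...
    ids = list(dict.fromkeys(m["metadata"]["doc_id"] for m in matches))
    # Step 3: ... and return the larger parent docs from the byte store.
    return [docstore[i] for i in ids]

print(invoke("justice breyer")[0])  # the parent doc, not the small chunk
```

The real retriever does exactly this kind of mapping from matched child chunks back to parent documents, only backed by an actual embedding model and vector store.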

retriever.invoke("justice breyer")

With this core concept in mind, when I looked at the sample again, everything fell into place much better. Let me review the parts I initially struggled with, from the beginning.

loaders = [
    TextLoader("../../paul_graham_essay.txt"),
    TextLoader("../../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

The two text files, paul_graham_essay.txt and state_of_the_union.txt, are loaded and flattened into a single docs list through docs.extend(loader.load()). (They are not structured as a list of lists divided by file.)

Next, RecursiveCharacterTextSplitter splits this docs list into chunks of at most 10000 characters each, and the result is stored back into docs as a flat list.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

Each item in this docs list is a doc. For each doc in docs, a UUID is generated and stored in the doc_ids list.

doc_ids = [str(uuid.uuid4()) for _ in docs]

# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

Next, each doc is split into chunks of at most 400 characters. Since each doc is already capped at 10000 characters, this yields roughly 25 sub_docs per doc. Before being added to sub_docs, each sub_doc (temporarily stored in the variable _doc) gets a metadata entry named "doc_id" containing the id of its original (parent) doc. (In other words, multiple sub_docs end up sharing the same doc_id.)

The sub_docs are added to the vector store, while the docs are added to the docstore.

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

The docstore is handled by the InMemoryByteStore object passed to the byte_store parameter when creating the retriever object.
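Conceptually, InMemoryByteStore is just an in-memory key-value map with batch operations. A hedged sketch of what mset(list(zip(doc_ids, docs))) amounts to, modeling the store as a plain dict (the mset/mget names mirror the LangChain byte-store interface, but this is an illustration, not the real class):

```python
doc_ids = ["id-1", "id-2"]
docs = ["large chunk one ...", "large chunk two ..."]

store = {}

def mset(pairs):
    # Store many (key, value) pairs at once.
    for key, value in pairs:
        store[key] = value

def mget(keys):
    # Fetch many values at once, in key order.
    return [store.get(k) for k in keys]

# zip pairs each doc_id with its doc, and mset stores them all.
mset(list(zip(doc_ids, docs)))
print(mget(["id-2"]))  # ['large chunk two ...']
```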

This sets up the MultiVectorRetriever for searching, allowing us to either search for similar sub_docs directly in the vector store:

# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("justice breyer")[0]

Or, as explained at the beginning, use invoke() to perform the coordinated search of sub_docs -> docs:

# Retriever returns larger chunks
len(retriever.invoke("justice breyer")[0].page_content)

While LangChain’s official samples may seem unfriendly due to their limited explanations, the code itself is quite approachable, and you can sense the authors’ intent to communicate through it.

To be continued.

Thank you for reading through this detailed analysis of the MultiVector Retriever sample code. I hope you found this breakdown informative and that it helped clarify some of the more complex aspects of LangChain’s functionality.

While this concludes our deep dive into MultiVector Retriever, the exploration of AI and language technologies is an ongoing journey. There’s always more to learn and discover in this rapidly evolving field.

If you have any questions about this blog post or are interested in discussing OpenAI API, LLM, or LangChain-related development projects, I’d be delighted to hear from you. Please feel free to contact me directly at:

mizutori@goldrushcomputing.com

At Goldrush Computing, we pride ourselves on our expertise as a Japanese company with native speakers. We specialize in developing prompts and RAG systems tailored for Japanese language and culture. If you’re seeking to optimize AI solutions for the Japanese market or create Japanese-language applications, we’re uniquely positioned to assist you. Don’t hesitate to reach out for collaborations or projects that require Japan-specific AI optimization.

Stay tuned for my upcoming blogs about AI and LLM!

--

Taka Mizutori
LLM Study Diary: A Beginner’s Path Through AI

Founder and CEO of Goldrush Computing Inc (https://goldrushcomputing.com). Keep making with Swift, Kotlin, Java, C, Obj-C, C#, Python, JS, and Assembly.