Enhancing Document Interaction with LangChain VectorStore Retrieval

Izabel Barros
Indicium Engineering
5 min read · Jun 14, 2024


In today’s world, the ability to find relevant information quickly and accurately is essential, especially when dealing with large collections of documents.
LangChain helps us achieve that by providing the tools to build a chatbot over our documents with Retrieval Augmented Generation (RAG). In this blog post, we are going to explore one of its key elements: retrievers, more specifically VectorStore retrieval.

Retrieval-Augmented Generation (RAG) is a method that first retrieves relevant documents from a knowledge base or a dataset and then uses this retrieved information to generate more accurate and contextually relevant responses. RAG helps overcome the limitations of LLMs by providing them with up-to-date and specific information that improves their outputs.
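To make the pattern concrete, here is a minimal, hedged sketch of the retrieve-then-generate flow using LangChain’s classic RetrievalQA chain. It assumes an OpenAI key in the environment and a vector store like the smalldb we build later in this post; treat it as an illustration of the flow rather than a finished implementation.

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# The generator: an LLM that writes the final answer
llm = ChatOpenAI(temperature=0)

# The retrieval step: fetches the k most relevant documents for the query
# (smalldb is the vector store we build further down in this post)
retriever = smalldb.as_retriever(search_kwargs={"k": 2})

# RetrievalQA first retrieves, then passes the documents to the LLM as context
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
result = qa_chain({"query": "What is the Death Cap mushroom?"})
print(result["result"])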

In the context of LangChain, a retriever is a component that extracts relevant information from a source. It’s like setting up a search engine that works with language models on your pre-loaded documents. Retrievers fetch the pertinent passages that help the LLM generate accurate and contextually relevant responses. This is an efficient way to handle large sources, and because it surfaces only the most relevant information, it makes interactions faster and more satisfying for the user.

LLMs, despite their capabilities, sometimes generate generic or less accurate responses when they lack specific context. Retrievers bridge this gap by supplying the model with precise information, increasing the accuracy of the generated output. Because they pre-select relevant information from a vast dataset, retrievers can also help reduce the amount of data that needs to be processed by the LLM, not only speeding up the response time but also lowering computational costs.

VectorStore retrieval is a type of retrieval that works over a vector database, using document embeddings to perform its search. Unlike traditional keyword-based retrieval, it captures the context and meaning behind the words, enabling more accurate and relevant document fetching.
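To illustrate the difference, here is a small, self-contained toy example. The vectors below are made up rather than real embeddings: the point is that documents are compared as vectors with cosine similarity, so a text can match a query even when they share no exact keywords.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embedding of the query "poisonous fungi"
query_vec = np.array([0.9, 0.1, 0.3])

# Hypothetical embeddings of two documents
doc_vecs = {
    "text about toxic mushrooms": np.array([0.8, 0.2, 0.4]),
    "text about stock markets": np.array([0.1, 0.9, 0.2]),
}

for name, vec in doc_vecs.items():
    print(name, round(cosine_similarity(query_vec, vec), 3))
# The mushroom text scores higher even without sharing the query's exact keywords.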

There are several types of VectorStore retrieval, but today we are going to focus on two: Similarity Search and Maximum Marginal Relevance (MMR).

Setting up

Before we can see our retrievers in action, let’s make sure that we have our OpenAI API key and our environment set up.

import os
import openai
from dotenv import load_dotenv, find_dotenv

# Load environment variables from a local .env file
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']
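Here, load_dotenv looks for a .env file in your project and loads its variables into the environment, so the file should contain a line of the form OPENAI_API_KEY=<your key>. Keep this file out of version control.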

After that, let’s install LangChain. The examples below also rely on the OpenAI and Chroma integrations, so we install those packages as well.

pip install langchain openai chromadb python-dotenv

To better showcase the differences between the two types of retrieval, we are going to query a small set of texts, but this approach scales to a database of any size.

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

Then, we move to embedding. Embedding documents in LangChain converts text into numerical vectors, allowing for efficient similarity searches and better understanding by the language model. After that, we load those embedding vectors into a vector store, a specialized database for storing and retrieving document embeddings. We are using Chroma today.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()

# Persistent Chroma store on disk (useful for larger document collections)
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

# In-memory Chroma store built from our three example texts
smalldb = Chroma.from_texts(texts, embedding=embedding)
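If you are curious what an embedding actually looks like, you can embed a single string directly. This is an optional aside, not required for the rest of the post; it simply shows that each text becomes a long list of floating-point numbers.

# Embed a single query string and inspect the resulting vector
vec = embedding.embed_query("all-white mushrooms")
print(len(vec))   # dimensionality of the embedding (1536 with OpenAI's default model)
print(vec[:5])    # first few components of the vector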

Using similarity_search

Similarity search involves retrieving documents that are semantically similar to a given query.

Unlike traditional keyword-based search, which depends on exact matches of keywords, similarity search uses embeddings to capture the contextual meaning of the text. This allows for more accurate and relevant document retrieval.

We are going to give our database a query and showcase similarity search first.

question = "Tell me about all-white mushrooms with large fruiting bodies"
docs_ss = smalldb.similarity_search(question, k=2)
print(docs_ss[0], docs_ss[1])
# page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'
# page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).'

Our query was converted into a numerical vector using the same embedding model. LangChain then searched through the vector representations of all documents in our vector store to find the ones closest to the query vector, and returned the documents that were semantically most similar to our query.
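In a full chatbot pipeline you would typically not call similarity_search directly; instead you expose the vector store as a retriever that chains can call for you. A short sketch of how that usually looks (the variable names are ours):

# Wrap the vector store as a retriever so it can be plugged into a chain
retriever_ss = smalldb.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2},
)
docs = retriever_ss.get_relevant_documents(question)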

Using MMR

Maximum Marginal Relevance (MMR) is a retrieval type that balances relevance and diversity in the results. MMR considers both the relevance of documents to the query and the novelty of the information they provide compared to already selected documents.

docs_mmr = smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)
print(docs_mmr[0], docs_mmr[1])
# page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'
# page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.'
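A quick note on the parameters: fetch_k is how many candidates are first fetched by plain similarity, and k is how many of those MMR keeps after re-ranking for diversity. There is also an optional lambda_mult parameter (between 0 and 1) that trades relevance against diversity; the snippet below just makes its default value explicit.

# MMR fetches fetch_k candidates by similarity, then keeps k diverse results.
# lambda_mult closer to 1 favors relevance, closer to 0 favors diversity.
docs_mmr_diverse = smalldb.max_marginal_relevance_search(
    question,
    k=2,
    fetch_k=3,
    lambda_mult=0.5,
)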

One of its biggest benefits is that MMR ensures that retrieved documents provide diverse perspectives and information, reducing redundancy. It also maintains a high level of relevance to the original query while ensuring that the information is varied and comprehensive.

All of this combined improves the user experience: users receive a broader range of relevant information, which makes their interactions more satisfying.

Conclusion

As seen in this blog post, LangChain’s Similarity Search and Maximal Marginal Relevance (MMR) are just two examples of how powerful retrievers can be. They leverage document embeddings and similarity measures to efficiently search through collections of documents of any size and prioritize relevant results.

It’s important to remember that LangChain offers a variety of other retrievers, each with its own features and capabilities. You can also explore the two retrievers shown here in more depth to further refine your information retrieval tasks, by fine-tuning parameters, integrating them with other algorithms, or customizing them for specific use cases.

With a deeper understanding and exploration of LangChain’s retrievers, users can maximize the potential of their natural language processing workflows and achieve more accurate and insightful results.

Acknowledgements

This blog post was inspired by the insights and knowledge gained from the LangChain: Chat with Your Data course offered by deeplearning.ai.

This blog post is also accompanied by a GitHub repository, where you can find the scripts showcased here and interact with them!
https://github.com/belbarros/langchain-vectorstore/tree/main

