Enough with Prototyping: Time for Persistent Multi-Vector Storage with PostgreSQL in LangChain

Eric Vaillancourt
18 min read · May 13, 2024


EDIT: I have created a follow-up article: Reducing Costs and Enabling Granular Updates with Multi-Vector Retriever in LangChain

Introduction

The Retrieval-Augmented Generation (RAG) approach in natural language processing has greatly benefited from the use of multi-vectors, which make it possible to index summaries, smaller chunks, and hypothetical questions alongside the original documents.

The complete code for the methods and techniques discussed in this article is available on GitHub.

Typically, LangChain implementations utilize InMemoryByteStore to manage these multi-vectors, which is advantageous for rapid prototyping due to its simplicity and speed.

However, this in-memory approach lacks persistence, posing challenges for applications that require durability and the ability to manage extensive datasets over time.

This article explores the persistent storage of multi-vectors using PostgreSQL, addressing the limitations posed by the non-persistent nature of InMemoryByteStore. By integrating PostgreSQL with LangChain, we can achieve a more robust and scalable solution that not only ensures data durability across sessions but also supports the complex data management needs of advanced RAG applications.

Setting Up the Project

To get started with the project, you’ll first need to clone the repository and install the necessary dependencies. Here’s a step-by-step guide:

Clone the repository: Open your terminal and run the following command to clone the project from GitHub:

git clone https://github.com/ericvaillancourt/LangChain_persistant_multi_vector.git

Navigate to the project directory: Change into the project directory with:

cd LangChain_persistant_multi_vector

Create a virtual environment (optional, but recommended): This step isolates the project dependencies from your global Python environment. Create the virtual environment by running:

python -m venv env

Activate the virtual environment:

  • On Windows:
.\env\Scripts\activate
  • On macOS and Linux:
source env/bin/activate

Install the dependencies: Install all required packages from the requirements.txt file:

pip install -r requirements.txt

Handling psycopg-binary Installation Issues

The psycopg-binary package, which is included in our requirements.txt, is optimized for Windows environments. However, Mac and Unix/Linux users might encounter issues with this binary distribution. To ensure compatibility and smooth operation across different operating systems, follow these platform-specific instructions:

For Mac and Unix/Linux Users:

  1. Remove psycopg-binary from requirements.txt: Before installing the requirements, open the requirements.txt file and remove the line containing psycopg-binary.
  2. Install psycopg: Instead of the binary package, install the standard psycopg package, which compiles the library from source. This method ensures that the installation is tailored to your specific operating system configuration. To install psycopg, activate your virtual environment and run:
pip install psycopg

Note: The source installation of psycopg requires Python headers and PostgreSQL development headers to be pre-installed on your system. For Unix/Linux systems, these can often be installed via your package manager (for example, sudo apt-get install libpq-dev python3-dev on Debian-based systems).

Starting the Docker Container for Database Services

To start the PostgreSQL database with vector support using Docker, follow these steps using the provided docker-compose.db.yml file. This file configures the database service to run in a Docker container.

Step 1: Check Docker Compose Installation

Ensure that Docker and Docker Compose are installed on your system. You can verify if Docker Compose is installed by running:

docker-compose --version

This command will display the version of Docker Compose installed, indicating it is ready to use.

Step 2: Navigate to the Project Directory

Change to the directory containing your docker-compose.db.yml file. This file contains the necessary configuration to set up your PostgreSQL database in a Docker container.

cd path/to/your/project

Step 3: Start the Database Service

Execute the following command to start the database service using Docker Compose:

docker-compose -f docker-compose.db.yml up -d

  • -f docker-compose.db.yml specifies the Docker Compose file to use.
  • up starts the services defined in the Docker Compose file.
  • -d runs the containers in the background.

Step 4: Verify the Container is Running

To ensure that the database container is running correctly, use:

docker ps

This command lists all active Docker containers. Look for the container running the ankane/pgvector image, which indicates that your database service is up and running.
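If you want an extra sanity check beyond docker ps, the short sketch below connects to the database with psycopg and confirms the pgvector extension is available. The DSN shown is a hypothetical default; replace the host, port, user, password, and database name with the values from your docker-compose.db.yml.

import psycopg

# Hypothetical DSN; adjust to match the credentials in docker-compose.db.yml.
dsn = "postgresql://postgres:postgres@localhost:5432/postgres"

with psycopg.connect(dsn) as conn, conn.cursor() as cur:
    # The ankane/pgvector image ships with the extension; this just enables it.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    print("pgvector version:", cur.fetchone()[0])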

To manage and inspect the content of your PostgreSQL database tables effectively, installing a graphical database management tool like pgAdmin is recommended. pgAdmin offers a user-friendly interface for handling database operations and visualizing table structures. However, the installation and usage of pgAdmin will not be covered in this article, as our focus is primarily on setting up and deploying the multi-vector retriever.

Understanding the InMemoryStore in LangChain

The InMemoryStore, commonly featured in LangChain examples and on its official website, is a widely used way to store data temporarily while a language-model application is running. This section will delve into the characteristics of the InMemoryStore, its typical applications, and why it might be used in many scenarios despite its limitations.

Key Features and Usage

The InMemoryStore works by holding data directly in the RAM of the machine running your application. This approach has several immediate benefits:

  • Speed: Access to data stored in memory is significantly faster compared to disk-based storage. This results in quicker response times, which is particularly beneficial in interactive applications that require real-time processing.
  • Simplicity: Setting up an InMemoryStore is straightforward (see the sketch after this list), which reduces the complexity of the code and removes the overhead of managing an external storage system.
  • Development Efficiency: It allows developers to focus more on other aspects of their application without worrying about the intricacies of data persistence and database management.
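To make the simplicity concrete, here is a minimal sketch of the key/value interface the store exposes: mset to write and mget to read. This is the same interface the persistent store introduced later in this article implements.

from langchain.storage import InMemoryStore
from langchain_core.documents import Document

store = InMemoryStore()
store.mset([("doc-1", Document(page_content="hello"))])
print(store.mget(["doc-1"]))    # [Document(page_content='hello')]
print(store.mget(["missing"]))  # [None]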

Limitations

Despite its advantages, the InMemoryStore is not suitable for all use cases. The primary limitations include:

  • Lack of Persistence: Data stored in memory is lost when the application is terminated or if the system is restarted. This trait makes it unsuitable for applications that require long-term data retention or recovery capabilities.
  • Scalability Issues: As the volume of data increases, the amount of available RAM can become a limiting factor, potentially leading to performance degradation or system crashes.

Code Walkthrough: Multi-Vector RAG Implementation in LangChain using the InMemoryStore and PostgresByteStore

In the Jupyter notebook Multi-vector-RAG.ipynb, we explore how LangChain can be used to implement a Retrieval-Augmented Generation (RAG) model using multi-vectors. This walkthrough covers two key sections of the notebook: loading and processing a PDF document, and setting up the retrieval system using multi-vectors.

Loading and Processing the PDF

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

file_path = r"data\montreal.pdf"

loader = PyPDFLoader(file_path=file_path)

# by default, we will split by pages with no text_splitter
documents = loader.load_and_split(text_splitter=None)
documents

Key Components:

  • Document Loading: PyPDFLoader is used to load a PDF file. This loader is specifically designed to handle PDF documents in LangChain.
  • File Path: Specifies the path to the PDF document that will be processed.
  • Data Loading and Splitting: The method load_and_split is called with text_splitter=None, meaning the document is split into pages without any additional text segmentation.

This section of the code is responsible for reading the PDF document and splitting it into manageable chunks (pages), which are essential for the subsequent retrieval tasks.
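As an optional quick inspection, you can confirm what load_and_split returns: one Document per page, each carrying its source and page number in its metadata.

# Optional sanity check: one Document per page, with source/page metadata.
print(len(documents))
print(documents[0].metadata)  # e.g. {'source': 'data\\montreal.pdf', 'page': 0}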

Setting Up the InMemoryStore Multi-Vector Retrieval System

from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from database import COLLECTION_NAME

vectorstore = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()

id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
retriever

Key Components:

  • Vector Store: Chroma is initialized with a collection_name and an embedding_function. The OpenAIEmbeddings function is used to generate embeddings for the documents.
  • Document Store: InMemoryStore is used to store documents temporarily in memory. This is suitable for demonstration purposes but not for persistent storage.
  • Retriever: MultiVectorRetriever is set up using the vectorstore and docstore. It uses the document ID key (id_key) to manage documents effectively.

This section establishes the retrieval system for the RAG model. It embeds the document chunks into a vector space and sets up an in-memory store for retrieval during the question-answering tasks.

Setting Up the PostgresByteStore Multi-Vector Retrieval System

from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_postgres import PGVector
from database import COLLECTION_NAME, CONNECTION_STRING
from store import PostgresByteStore
from langchain_postgres import PostgresSaver, PickleCheckpointSerializer

embeddings = OpenAIEmbeddings()
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

store = PostgresByteStore(CONNECTION_STRING, COLLECTION_NAME)
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

retriever

This code sets up a retrieval system for document embeddings using PostgreSQL for persistent storage. Instead of using InMemoryStore, which only provides temporary storage, it uses PostgresByteStore from the store.py file for persistent storage of document data. This ensures that the data remains intact across multiple sessions and can be accessed later, even after the system is restarted. The PostgresByteStore is connected to a PostgreSQL database specified by CONNECTION_STRING and COLLECTION_NAME. The MultiVectorRetriever then uses this store along with a PGVector vector store for retrieving document embeddings based on their doc_id.

Whether you choose to use InMemoryStore or PostgresByteStore, the rest of the code remains the same. This is because both classes are designed to provide a common interface for storing and retrieving data. The main difference lies in the persistence of the data: InMemoryStore provides temporary storage and is wiped clean when the program ends, while PostgresByteStore provides persistent storage that keeps the data intact across multiple sessions. Regardless of the storage method chosen, the retrieval system works seamlessly without any changes to the rest of the code.
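The actual PostgresByteStore lives in the repository's store.py, so it is not reproduced here. As an illustration of the idea, the sketch below shows what a Postgres-backed docstore implementing the same BaseStore interface (mget, mset, mdelete, yield_keys) could look like. The class name, table schema, and pickle-based serialization are assumptions for this sketch, not the article's implementation.

# A minimal, illustrative Postgres-backed docstore (NOT the repository's
# PostgresByteStore). It implements LangChain's BaseStore interface, so it can
# be handed to MultiVectorRetriever the same way InMemoryStore is.
import pickle
from typing import Iterator, List, Optional, Sequence, Tuple

import psycopg
from langchain_core.documents import Document
from langchain_core.stores import BaseStore


class SimplePostgresDocStore(BaseStore[str, Document]):
    def __init__(self, connection_string: str, collection_name: str):
        # Assumes a plain libpq-style DSN, e.g. "postgresql://user:pass@host/db".
        self.conn = psycopg.connect(connection_string)
        self.collection = collection_name
        with self.conn.cursor() as cur:
            cur.execute(
                """CREATE TABLE IF NOT EXISTS docstore (
                       collection TEXT NOT NULL,
                       key TEXT NOT NULL,
                       value BYTEA NOT NULL,
                       PRIMARY KEY (collection, key))"""
            )
        self.conn.commit()

    def mget(self, keys: Sequence[str]) -> List[Optional[Document]]:
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT key, value FROM docstore WHERE collection = %s AND key = ANY(%s)",
                (self.collection, list(keys)),
            )
            found = {key: pickle.loads(value) for key, value in cur.fetchall()}
        return [found.get(key) for key in keys]

    def mset(self, key_value_pairs: Sequence[Tuple[str, Document]]) -> None:
        with self.conn.cursor() as cur:
            for key, doc in key_value_pairs:
                cur.execute(
                    """INSERT INTO docstore (collection, key, value)
                       VALUES (%s, %s, %s)
                       ON CONFLICT (collection, key)
                       DO UPDATE SET value = EXCLUDED.value""",
                    (self.collection, key, pickle.dumps(doc)),
                )
        self.conn.commit()

    def mdelete(self, keys: Sequence[str]) -> None:
        with self.conn.cursor() as cur:
            cur.execute(
                "DELETE FROM docstore WHERE collection = %s AND key = ANY(%s)",
                (self.collection, list(keys)),
            )
        self.conn.commit()

    def yield_keys(self, prefix: Optional[str] = None) -> Iterator[str]:
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT key FROM docstore WHERE collection = %s", (self.collection,)
            )
            for (key,) in cur.fetchall():
                if prefix is None or key.startswith(prefix):
                    yield key

Because both stores satisfy the same interface, swapping one for the other only changes where the data lives, not how the retriever uses it.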

Creating the Vector Store

Uniquely Identifying Parent Chunks for Effective Retrieval

  1. Structured Retrieval: Each document can have multiple vectors representing different aspects, such as smaller chunks, summaries, or hypothetical questions. Assigning unique IDs to these parent chunks (main document parts) helps in structuring the retrieval process. It ensures that each vector can be accurately linked back to its parent chunk, facilitating a coherent retrieval of related content.
  2. Efficient Querying: With unique IDs, each parent chunk becomes a discrete entity that can be indexed and retrieved efficiently. This is especially important when using the MultiVectorRetriever, as it simplifies the querying mechanism to fetch vectors associated with specific parent chunks, enhancing performance and reducing computational overhead.
  3. Data Integrity and Tracking: Unique IDs maintain data integrity by ensuring that updates, deletions, or modifications are accurately reflected across related vectors. If a parent chunk is updated, all associated vectors (like summaries or hypothetical questions derived from it) can be easily identified and updated accordingly, thanks to the unique IDs.

In summary, assigning unique IDs to parent chunks when dealing with multiple vectors per document ensures efficient, scalable, and reliable document management and retrieval in systems leveraging advanced NLP tools like LangChain’s MultiVectorRetriever.

import uuid

doc_ids = [str(uuid.uuid4()) for _ in documents]
doc_ids

Enhancing Similarity Search with Finer-Grained Document Chunks

When documents are divided into smaller chunks, similarity search — the process of finding similar pieces of text based on content — can be significantly improved. Here’s how this advantage manifests:

Increased Granularity

By breaking down documents into smaller, more focused segments, similarity searches can operate at a finer granularity. This means that the search mechanism can match queries to the most relevant parts of a document rather than to the whole document. If a user is looking for information on a specific topic that is only mentioned in one paragraph, smaller chunks make it more likely that this paragraph will be identified and returned in the search results.

Improved Precision

Smaller chunks reduce the noise in each document segment. In a large document, multiple themes or topics might be covered, which can dilute the relevance of the search results. With smaller chunks, each segment is more likely to be homogenous in terms of its content, which enhances the precision of matching algorithms by limiting the scope to exactly what the query is about.

Efficient Similarity Calculations

Calculating similarity across entire documents can be computationally intensive and less accurate, especially when documents are long and cover diverse topics. With smaller chunks, similarity calculations can be performed more quickly and efficiently, as each calculation is less complex and requires fewer computational resources.

Better Use of Embeddings

In modern text retrieval systems, text segments are often converted into vector embeddings that capture semantic meanings. Smaller chunks ensure that these embeddings are more representative of the specific content of each segment. This precision improves the effectiveness of vector-based similarity searches, where the distance between vectors directly corresponds to the semantic similarity between text segments.

Handling Sparse Data

In cases where certain information within a document is sparse or only briefly mentioned, smaller chunks ensure that these elements don’t get overshadowed by the more dominant themes of the document. This aspect is crucial for ensuring that even the less frequently mentioned details are captured and made searchable.

Overall, using smaller chunks enhances the capabilities of similarity searches by focusing on precision, efficiency, and relevance, making them ideal for detailed and user-focused text retrieval applications.

child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

all_sub_docs = []
for i, doc in enumerate(documents):
    doc_id = doc_ids[i]
    sub_docs = child_text_splitter.split_documents([doc])
    for sub_doc in sub_docs:
        sub_doc.metadata[id_key] = doc_id
    all_sub_docs.extend(sub_docs)

all_sub_docs

This code performs the task of splitting a collection of documents into smaller chunks and associating each chunk with its original document using unique identifiers.
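A quick optional check confirms the linkage: every sub-chunk now carries the doc_id of the page it came from.

# Optional check: each sub-chunk's metadata points back to its parent page.
print(len(all_sub_docs))
print(all_sub_docs[0].metadata)  # contains the parent's 'doc_id'
assert all(sub.metadata[id_key] in doc_ids for sub in all_sub_docs)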

Updating Retrieval System Stores

This code segment efficiently updates the data stores within a document retrieval system, enhancing both the vector-based search capabilities and the direct document retrieval functionalities:

retriever.vectorstore.add_documents(all_sub_docs)
retriever.docstore.mset(list(zip(doc_ids, documents)))

Adding Documents to the Vector Store:

retriever.vectorstore.add_documents(all_sub_docs)

Function: This line adds the list of sub-documents (all_sub_docs) to the vectorstore of the retriever. The vectorstore is responsible for managing vector embeddings of documents, which are essential for performing similarity searches and retrievals based on content.

Purpose: By adding the sub-documents to the vector store, you are essentially preparing the system to perform efficient text retrieval operations based on the semantic content of these smaller chunks. Each sub-document’s content is converted into a vector representation that can be quickly compared to query vectors in retrieval tasks.

Updating the Document Store:

retriever.docstore.mset(list(zip(doc_ids, documents)))

Function: This line updates the docstore of the retriever. The mset method stands for "multi-set" and is used to set multiple documents in the store simultaneously. zip(doc_ids, documents) pairs each document ID from doc_ids with its corresponding full document from documents.

Purpose: The document store holds the actual content of the documents, indexed by their unique identifiers (doc_ids). By pairing each ID with its document, the store allows for rapid retrieval of the full text of each document based on its ID. This is crucial for applications where the complete document content needs to be fetched after initial retrieval steps based on vectors.

Combined Workflow:

Together, these lines set up a retrieval system where:

The vectorstore enables semantic-based searching by handling vectorized representations of smaller chunks (sub-documents), facilitating fine-grained and contextually relevant search capabilities.

The docstore maintains a reference to the original, complete documents, allowing for their retrieval once a particular sub-document or set of sub-documents has been identified as relevant through the vector search.
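As a quick test of the docstore half, you can fetch a parent document back by its ID. With InMemoryStore this only works within the current session; with PostgresByteStore the same call succeeds in a fresh session after a restart, because the data lives in PostgreSQL.

# Fetch a parent document back from the docstore by its unique ID.
parent = retriever.docstore.mget([doc_ids[0]])[0]
print(parent.page_content[:200])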

Testing the Retriever: Returning Smaller Chunks

Once the retriever is set up, it’s important to test its functionality. One of the key features of the multi-vector retriever is its ability to return smaller chunks of data that are most relevant to a given query. This is particularly useful in scenarios where you’re looking for specific information within a large dataset.

Let’s test this feature using the similarity_search method. This method takes a string query and returns the most similar vectors (i.e., smaller chunks of data) from the vector store.

Here’s an example of how to use this method:

retriever.vectorstore.similarity_search("give me some tips for managing my costs")

Result:

[Document(page_content='Once you have reached this limit, the plan will cover 100% of your expenses for  the \nremainder of the year.  \nTips for Managing Your Costs  \nThere are several steps that you can take to help manage your costs when you are enrolled \nin Northwind Standard. Here are a few tips that you can use to get the most out of your \ncoverage:', metadata={'doc_id': 'bce590b8-ca2f-4e50-b36c-e30d422d08ba', 'page': 3, 'source': 'data\\Northwind_Standard_Benefits_Details.pdf'}),
Document(page_content='There are several steps that you can take to help manage your costs when you are enrolled \nin Northwind Standard. Here are a few tips that you can use to get the most out of your \ncoverage: \n• Make su re to take advantage of preventive care services. These services are covered 100% \nby the plan and can help you avoid more costly treatments down the line.', metadata={'doc_id': 'bce590b8-ca2f-4e50-b36c-e30d422d08ba', 'page': 3, 'source': 'data\\Northwind_Standard_Benefits_Details.pdf'}),
Document(page_content='• Talk to your doctor about ways to save money. Many doctors are willing to work with you \nto find the most cost -effective treatment options available. \n• Review your Explanation of Benefits (EOB) statements carefully. This document will show \nyou exactly ho w much you are being charged for each service and what your plan is \ncovering.', metadata={'doc_id': 'bce590b8-ca2f-4e50-b36c-e30d422d08ba', 'page': 3, 'source': 'data\\Northwind_Standard_Benefits_Details.pdf'}),
Document(page_content='costs. \nFinally, take advantage of an y discount programs that may be available. Many providers \noffer discounts for cash payments on services, and these can help reduce the amount of \nmoney you need to pay out of pocket. \nBy following these tips, you can make sure that you reach your deductible and take \nadvantage of the full benefits of the Northwind Standard plan. \nCoinsurance', metadata={'doc_id': '1aef9ba6-1242-4968-bded-95298f89a14c', 'page': 12, 'source': 'data\\Northwind_Standard_Benefits_Details.pdf'})]

Retrieving Context-Rich Parent Chunks

In the end, what we really want is not just any chunks, but the parent chunks that provide a richer context for our Language Model (LLM). This is where the invoke method comes into play.

Consider the following example:

retriever.invoke("give me some tips for managing my costs")

Result:

[Document(page_content='Once you have reached this limit, the plan will cover 100% of your expenses for  the \nremainder of the year.  \nTips for Managing Your Costs  \nThere are several steps that you can take to help manage your costs when you are enrolled \nin Northwind Standard. Here are a few tips that you can use to get the most out of your \ncoverage:  \n• Make su re to take advantage of preventive care services. These services are covered 100% \nby the plan and can help you avoid more costly treatments down the line.  \n• Always make sure to visit in -network providers. Doing so will ensure that you receive the \nmaximum benefit from your plan.  \n• Consider generic prescription drugs when available. These drugs can often be cheaper \nthan brand -name drugs and are just as effective.  \n• Talk to your doctor about ways to save money. Many doctors are willing to work with you \nto find the most cost -effective treatment options available.  \n• Review your Explanation of Benefits (EOB) statements carefully. This document will show \nyou exactly ho w much you are being charged for each service and what your plan is \ncovering.  \nBy following these tips, you can ensure that you are getting the most out of your Northwind \nStandard health plan.  \nHOW PROVIDERS AFFECT YOUR COSTS  \nIn-Network Providers  \nHOW PROVID ERS AFFECT YOUR COSTS  \nWhen selecting a health insurance plan, one of the most important factors to consider is the \nnetwork of in -network providers that are available with the plan.  \nNorthwind Standard offers a wide variety of in -network providers, ranging from primary \ncare physicians, specialists, hospitals, and pharmacies. This allows you to choose a provider \nthat is convenient for you and your family, while also helping you to keep your costs low.  \nWhen you choose a provider that is in -network with your p lan, you will typically pay lower \ncopays and deductibles than you would with an out -of-network provider. In addition, many \nservices, such as preventive care, may be covered at no cost when you receive care from an \nin-network provider.  \nIt is important to n ote, however, that Northwind Standard does not offer coverage for \nemergency services, mental health and substance abuse coverage, or out -of-network', metadata={'source': 'data\\Northwind_Standard_Benefits_Details.pdf', 'page': 3}),
Document(page_content='The Northwind Standard plan has a calendar year deductible of $2,000 for each individual \nand $4,000 for each family. A calendar year deductible is the amount you must pay for \nhealth care services before your insurance plan starts to pay. The deduct ible applies to most \nservices received from in -network providers, including primary care physicians, specialists, \nhospitals, and pharmacies. \nHowever, there are some exceptions. For example, preventive care services, such as \nimmunizations and annual physic als, are covered at 100% with no deductible. Additionally, \nprescription drugs are subject to a separate prescription drug deductible of $250 per \nindividual and $500 per family. \nIt is important to note that this deductible does not roll over into the next year. This means \nthat you must meet the deductible amount in the current year before your insurance begins \nto pay. Additionally, the deductible may not apply to all services. For example, you may not \nbe subject to the deductible when you receive in -network emergency services. \nTips for Meeting the Calendar Year Deductible \nMeeting your calendar year deductible may seem like a daunting task, but there are a few \nsteps you can take to help ensure that you reach it. \nFirst, take advantage of any preventive care services that are covered at 100%. These \nservices are important for your health, and you can use them to help meet your deductible \nwithout paying out of pocket. \nSecond, use caut ion when selecting providers. The Northwind Standard plan has a large \nnetwork of in -network providers, and using these providers will help ensure that you are \nnot paying more than you have to for services. \nThird, consider using a health savings account (H SA). An HSA is a tax -advantaged savings \naccount that can be used to pay for qualified medical expenses. Contributions to an HSA are \ntax-deductible and the funds can be used to help pay for deductibles and other medical \ncosts. \nFinally, take advantage of an y discount programs that may be available. Many providers \noffer discounts for cash payments on services, and these can help reduce the amount of \nmoney you need to pay out of pocket. \nBy following these tips, you can make sure that you reach your deductible and take \nadvantage of the full benefits of the Northwind Standard plan. \nCoinsurance \nIMPORTANT PLAN INFORMATION: Coinsurance \nCoinsurance is a type of cost sharing that you are responsible for after meeting your \ndeductible. Coinsurance is often a percentage of the cost of the service you receive. For', metadata={'source': 'data\\Northwind_Standard_Benefits_Details.pdf', 'page': 12})]

Creating Summaries for Each Parent Chunk

In addition to creating Smaller Chunks, we can also create summaries for each of the parent chunks. This can be particularly useful when dealing with large chunks of text or complex tables, where a summary can provide a quick overview of the content.

Here’s how you can set up a summary chain:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt_text = """You are an assistant tasked with summarizing text. \
Directly summarize the following text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Initialize the Language Model (LLM)
model = ChatOpenAI(temperature=0, model="gpt-4")

# Define the summary chain
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

Invoking the Chain in Batches for Parallelism:

parent_chunk = [i.page_content for i in documents]
text_summaries = summarize_chain.batch(parent_chunk, {"max_concurrency": 5})

Assigning Unique IDs to Text Summaries

from langchain.schema.document import Document

summary_docs = []
for i, (summary, doc_id) in enumerate(zip(text_summaries, doc_ids)):
    # Define your new metadata here
    new_metadata = {"page": i, "doc_id": doc_id}

    # Create a new Document instance for each summary
    doc = Document(page_content=str(summary))

    # Replace the metadata
    doc.metadata = new_metadata

    # Add the Document to the list
    summary_docs.append(doc)

This code creates a list of Document objects from the langchain.schema.document module, each representing a text summary from the text_summaries list. Each Document is associated with a unique doc_id from the doc_ids list and a page number, both stored in the metadata attribute of the Document. The page_content attribute of each Document is set to the corresponding text summary. This way, each summary text is encapsulated in a Document object with its own doc_id and page number, effectively assigning the doc_id of the parent chunk to each summary.

Storing Document Objects in Vector and Document Stores

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, documents)))

This code is used to store the Document objects created earlier into a vector store and a document store for later retrieval. The add_documents method of the vectorstore object is used to add the summary_docs (which are Document objects) into the vector store. The mset method of the docstore object is then used to add the documents into the document store, with each document associated with a unique doc_id from the doc_ids list. This allows for efficient storage and retrieval of documents based on their doc_id.

Generating Hypothetical Questions for Each Parent Chunk

functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

question_chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 5 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        """Generate a list of exactly 5 hypothetical questions that the below document could be used to answer:\n\n{doc}
        separate each question with a comma (,)
        """
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

This code creates an automated pipeline, question_chain, to generate hypothetical questions based on the content of a document. It extracts the document content, constructs a chat prompt, and uses the GPT-4 model to generate questions. The JsonKeyOutputFunctionsParser then parses the output to retrieve the generated questions. This allows for the efficient creation of relevant questions that can be used to probe deeper into the document’s content.
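Before running the full batch below, it can be worth sanity-checking the chain on a single page (an optional step, not part of the original notebook); the parser should return a plain list of five question strings.

# Optional single-document check before the batch run below.
sample_questions = question_chain.invoke(documents[0])
print(sample_questions)  # expected: a list of 5 question strings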

Invoking the Chain in Batches for Parallelism:

hypothetical_questions = question_chain.batch(documents, {"max_concurrency": 5})

Creating Document Objects for Hypothetical Questions

from langchain.schema.document import Document

hypothetical_docs = []
for question_list, doc_id in zip(hypothetical_questions, doc_ids):
    for question in question_list:
        # Define your new metadata here
        new_metadata = {"doc_id": doc_id}

        # Create a new Document instance for each question
        # The question itself is the page_content
        doc = Document(page_content=question, metadata=new_metadata)

        # Add the Document to the list
        hypothetical_docs.append(doc)

This code constructs a list of Document objects, each representing a hypothetical question. For each question in the hypothetical_questions list, a Document object is created with the question as the page_content and the corresponding doc_id from the doc_ids list in the metadata. This results in a list of Document objects (hypothetical_docs), each encapsulating a hypothetical question and its associated doc_id.

Storing Document Objects in Vector and Document Stores

retriever.vectorstore.add_documents(hypothetical_docs)
retriever.docstore.mset(list(zip(doc_ids, documents)))

Creating an LCEL Chain and Testing the Retriever

In this section, we will create a LangChain Expression Language (LCEL) chain and test our retriever. The LCEL chain is a powerful feature of LangChain that allows us to define a sequence of operations for our RAG model.

Here’s how you can set it up:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Prompt template
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0, model="gpt-4")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

Invoking the chain:

chain.invoke("What dining options are available in Montreal for those interested in Middle Eastern cuisine?")

Conclusion

Throughout this article, we’ve demonstrated how to break down larger documents into smaller, manageable chunks, generate concise summaries, and create hypothetical questions for each chunk. These techniques can be invaluable for information retrieval, text analysis, and enhancing comprehension. The complete code for the methods and techniques discussed in this article is available on GitHub. This allows you to explore, adapt, and implement these strategies in your own projects. Happy coding!

Thank you for reading!

→ EDIT: SQL stores are now available in LangChain (0.1.4) https://github.com/langchain-ai/langchain/releases/tag/v0.1.4


Eric Vaillancourt, an AI enthusiast, began his career in 1989, founded a tech consultancy in 1996, and has led over 1,500 trainings in IT and AI.