Reducing Costs and Enabling Granular Updates with Multi-Vector Retriever in LangChain

Eric Vaillancourt
23 min read · May 20, 2024


This article is a follow-up to my previous article on the Multi-Vector Retriever. While the Multi-Vector Retriever enhances retrieval capabilities by using multiple vectors to represent different aspects of a document, it still faces challenges with the standard indexing and SQL management provided by LangChain. Let’s dive into these challenges and how a custom solution can provide better management and retrieval of documents.

Understanding the Challenge

The SQLRecordManager and Index work well for many use cases, but they fall short in scenarios requiring granular updates to summaries, smaller chunks, and hypothetical questions. Specifically, when you reload an updated version of already vectorized documents, it becomes challenging to identify and update the parent chunks’ IDs.

Limitations of the Current Mechanism

While LangChain’s indexing mechanism is effective for managing and querying large datasets, it has notable limitations, especially when integrating with the document store (docstore) using the SQLRecordManager.

  1. Record Manager Limitations: The current SQLRecordManager provided by LangChain works well with the index but does not extend to the docstore. This limitation means that while we can efficiently manage and query the vector store, the same capability is not available for the docstore.
  2. Lack of Granular Control: When adding documents to the vector store using the SQLRecordManager, we can handle the embeddings and indexing efficiently. However, when adding the parent document to the docstore, there is no mechanism to track whether documents have been inserted, deleted, or updated. This lack of granular control creates a significant challenge in maintaining an accurate and up-to-date docstore.
  3. Reinsertion of Documents: Due to the inability to determine changes in the parent document, the current mechanism forces us to re-insert all documents whenever there is an update. This process is inefficient and computationally expensive, as it requires reprocessing and re-indexing the entire document set.

In my article on the Multi-Vector Retriever, I detailed the process of creating smaller chunks, summaries, and hypothetical questions for documents. That foundational work is crucial for understanding the enhancements discussed here, but to avoid repetition, we will not revisit those concepts in this article.

Instead, we will focus on the limitations of LangChain’s standard tools and how custom solutions can address the challenges of granular updates and efficient document management.

To illustrate, consider the scenario where a document is updated. With the existing system, we cannot simply update the affected chunks and their embeddings. Instead, we must reinsert the entire document, generating new embeddings for all chunks, even those that haven’t changed. Additionally, it is impossible to identify which chunks require the regeneration of summaries and hypothetical questions. This not only increases computational costs but also leads to significant inefficiencies in document management.

Adapting the “Multi-vector-RAG with SQLRecordManager” Notebook

In this section, we will discuss how to adapt the “Multi-vector-RAG” notebook to utilize LangChain’s SQLRecordManager and Index. The “Multi-vector-RAG with SQLRecordManager” notebook is essentially a copy of the original “Multi-vector-RAG” notebook, with modifications to integrate the SQLRecordManager for more efficient document management and retrieval.

Setting Up the SQLRecordManager

To begin, we need to instantiate a record manager. This involves defining the record manager and creating the necessary schema in the database. Here’s how you can set up the SQLRecordManager in your notebook:

# Define record manager
namespace = f"pgvector/{COLLECTION_NAME}"
record_manager = SQLRecordManager(
    namespace, db_url=CONNECTION_STRING
)
record_manager.create_schema()

Indexing and Document Store Management

When integrating LangChain’s SQLRecordManager and Index into our multi-vector retrieval process, it’s crucial to understand the flow and outputs of key operations. Here’s a detailed explanation of how the indexing process works and its limitations.

Calling the Index Function

In our adapted notebook, we call the index function to add documents to the vector store using the SQLRecordManager. This function indexes the documents and manages their embeddings efficiently. Here's how the call looks:

idx = index(all_sub_docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")

After executing this line, the idx dictionary provides an overview of the indexing operation:

{
    'num_added': 13,
    'num_updated': 0,
    'num_skipped': 0,
    'num_deleted': 0
}

This dictionary indicates that 13 new documents were added, with no updates, skips, or deletions. This feedback is crucial for understanding the changes made to the indexed documents.
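A note on how this deduplication works: the record manager hashes each document's page content and metadata to decide whether it has already been indexed. On a hypothetical second run over the same, unchanged sub-documents, we would therefore expect everything to be skipped:

# Hypothetical re-run on the identical, unchanged sub-documents
idx = index(all_sub_docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
# Expected: {'num_added': 0, 'num_updated': 0, 'num_skipped': 13, 'num_deleted': 0}

The flip side, as we will see below, is that because the hash also covers metadata, regenerating a doc_id between imports changes the hash and defeats this deduplication.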

Managing the Document Store

To manage the parent documents in the docstore, we generate a list of (doc_id, document) tuples from the documents and pass them to the retriever’s docstore using the mset method:

# Generate the list of (doc_id, document) tuples from the documents
doc_id_document_tuples = [(doc.metadata["doc_id"], doc) for doc in documents]

# Pass the list of tuples to retriever.docstore.mset
retriever.docstore.mset(doc_id_document_tuples)

While mset effectively updates the docstore with the provided documents, it does not return any feedback or status. This lack of return information creates a challenge: it is impossible to determine whether any changes occurred in the parent documents. Without knowing whether documents were inserted, deleted, or updated, we are forced to reinsert all documents and regenerate everything derived from them, leading to inefficiencies.

Challenges with UUID-based Document IDs

In our indexing and retrieval system, we use UUIDs to generate unique document IDs. While this ensures that each document is uniquely identifiable, it introduces a significant problem when re-importing documents, even if there are no changes.

The UUID Issue

When documents are re-imported, new UUIDs are generated for each document. This results in new, distinct document IDs for every import, even if the content of the documents has not changed. Here’s how we generate the UUIDs:

import uuid

# Add a unique doc_id to each document's metadata
for doc in documents:
    doc.metadata["doc_id"] = str(uuid.uuid4())
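Because uuid4 draws random bits, two imports of the same file produce completely unrelated IDs. A quick illustration:

# uuid4 ignores document content entirely; every call returns a fresh random ID
print(uuid.uuid4())  # a random value
print(uuid.uuid4())  # a different random value, even for identical documents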

Consequences of New UUIDs

  1. Inconsistent Linking: Since new UUIDs are generated each time, the newly imported documents cannot be linked to their corresponding entries in the docstore. This breaks the association between the parent documents stored in the docstore and the vectors inserted into the vector store.
  2. Inefficiency: Every re-import of the same document set, even without changes, results in the generation of new document IDs. This leads to unnecessary duplication of documents and vectors, as the system cannot recognize that the content is identical to previously imported documents.
  3. Data Integrity Issues: The inability to link new imports to existing parent documents undermines the integrity of the document management system. It becomes challenging to maintain a consistent and accurate representation of the document set, as each import is treated as entirely new data.

Example Scenario

Consider the following scenario: you import a set of documents, each assigned a unique UUID. These documents are split into chunks, and their embeddings are stored in the vector store. The parent documents are stored in the docstore. Upon re-importing the same set of documents, new UUIDs are generated, resulting in new document IDs. Consequently, the system now has multiple entries for what is essentially the same content, but with no way to link these new entries to the original parent documents or their associated vectors.

Addressing the UUID Challenge with Reproducible IDs

To mitigate the issues caused by generating new UUIDs for each document import, we can introduce reproducible IDs. These IDs are generated based on the content of the document, ensuring that the same document will always receive the same ID. Here’s how we can implement this approach:

from utils.utils import generate_reproducible_id_by_content

# Add a reproducible unique doc_id to each document's metadata
for doc in documents:
    doc.metadata["doc_id"] = generate_reproducible_id_by_content(doc.page_content, doc.metadata)
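The helper lives in the accompanying repository's utils module; a minimal sketch of the idea, assuming a uuid5 hash over the content and identifying metadata, might look like this:

import uuid

# Minimal sketch (assumption): derive a deterministic UUID from the document
# content plus its identifying metadata, so identical input always yields the same ID
def generate_reproducible_id_by_content(page_content: str, metadata: dict) -> str:
    key = f"{metadata.get('source')}|{metadata.get('page')}|{page_content}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))

Because uuid5 is a deterministic hash, re-importing an unchanged document reproduces the same doc_id, preserving the link between the docstore and the vector store.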

Remaining Challenges

Despite using reproducible UUIDs, we still face significant challenges. Specifically, we do not have a mechanism to determine what has changed in the docstore. Without knowing which documents have been inserted, deleted, or updated, we cannot manage document updates efficiently.

Solution: CustomSQLRecordManager, index_with_ids, and conditional_mset

To overcome these challenges, we introduce a CustomSQLRecordManager, together with a new index_with_ids function and a conditional_mset docstore method, to manage updates more effectively.

Updating the Docstore

To update the docstore with new or updated documents, we generate a list of (doc_id, document, source) tuples and pass them to the conditional_mset method:

# Generate the list of (doc_id, document, source) tuples from the documents
doc_id_document_tuples = [(doc.metadata["doc_id"], doc, doc.metadata["source"]) for doc in documents]

# Pass the list of tuples to retriever.docstore.conditional_mset
parent_docs_operations = retriever.docstore.conditional_mset(doc_id_document_tuples)

The conditional_mset method ensures that only documents with changes are updated, improving efficiency and consistency in the docstore.
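The full implementation is in the repository, but conceptually conditional_mset can hash each incoming document, compare the hash with what is already stored, and report the operation it performed. A rough sketch of such a method on the custom docstore (the stored-hash helper and the 'UPD' code are assumptions; only 'INS' and 'SKIP' appear in this article's outputs):

import hashlib
import json

# Conceptual sketch (assumptions noted above), not the repository's exact code
def conditional_mset(self, doc_tuples):
    operations = []
    for doc_id, doc, source in doc_tuples:
        new_hash = hashlib.sha256(
            (doc.page_content + json.dumps(doc.metadata, sort_keys=True)).encode()
        ).hexdigest()
        stored_hash = self._get_stored_hash(doc_id)  # hypothetical helper
        if stored_hash is None:
            self.mset([(doc_id, doc)])
            operations.append((doc_id, "INS"))
        elif stored_hash != new_hash:
            self.mset([(doc_id, doc)])
            operations.append((doc_id, "UPD"))
        else:
            operations.append((doc_id, "SKIP"))
    return operations

Returning these (doc_id, operation) tuples is what lets the rest of the pipeline decide, per document, whether smaller chunks, summaries, and hypothetical questions need to be regenerated.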

Updating the Vector Database

To update the vector database, we use the index_with_ids function, which integrates the CustomSQLRecordManager to manage document chunks and their embeddings.

idx = index_with_ids(all_sub_docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")

Code Walkthrough

Let’s walk through the code step-by-step from the notebook Multi-vector-RAG update vectors.ipynb, explaining each snippet and its purpose in the context of the overall document management and retrieval process using LangChain.

Loading Environment Variables

First, we load the environment variables required for database connections and API keys:

from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()
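The database module imported later builds COLLECTION_NAME and CONNECTION_STRING from these variables. A minimal sketch of what it might contain (the environment variable names here are assumptions; adapt them to your setup):

import os
from dotenv import load_dotenv

load_dotenv()

# Hypothetical database.py: variable names are illustrative
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "multi_vector_rag")
CONNECTION_STRING = (
    f"postgresql+psycopg://{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
    f"@{os.getenv('PG_HOST', 'localhost')}:{os.getenv('PG_PORT', '5432')}"
    f"/{os.getenv('PG_DATABASE')}"
)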

Loading and Splitting Documents

Next, we load a PDF document and split it into pages:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

file_path = r"data\toronto.pdf"
loader = PyPDFLoader(file_path=file_path)

# By default, split by pages with no text_splitter
documents = loader.load_and_split(text_splitter=None)
documents
[Document(page_content='Things to Do in Toronto \nPage 1: Introduction \nToronto, the capital of Ontario, is the largest city in Canada and a dynamic, cosmopolitan \nhub. Known for its towering skyline, bustling waterfront, and numerous cultural attractions, \nToronto o Ưers a wealth of experiences for every visitor. \nKey Attractions:  \n\uf0b7 CN Tower:  This iconic symbol of Toronto o Ưers panoramic views of the city. Don’t \nmiss the glass floor and the revolving restaurant at the top. \n\uf0b7 Royal Ontario Museum (ROM):  Canada’s largest museum of world cultures and \nnatural history is a must-visit. \n\uf0b7 Toronto Islands:  A group of small islands located just o Ư the city’s shore, o Ưering \nbeautiful beaches, picnic spots, and bike rentals.', metadata={'source': 'data\\toronto.pdf', 'page': 0}),
Document(page_content='Page 2: Cultural Experiences \nToronto is a melting pot of cultures, and this is reflected in its neighborhoods and festivals. \nNeighborhoods: \n\uf0b7 Chinatown: One of North America’s largest Chinatowns, known for its vibrant food \nscene. \n\uf0b7 Kensington Market: A bohemian neighborhood o Ưering vintage shops, eclectic \nboutiques, and international food stalls. \n\uf0b7 Distillery District: Known for its well-preserved Victorian Industrial architecture, it’s \nnow home to boutiques, art galleries, and performance spaces. \nFestivals: \n\uf0b7 Caribana: A festival celebrating Caribbean culture and traditions, held in summer. \n\uf0b7 Toronto International Film Festival (TIFF): One of the most prestigious film \nfestivals in the world, held annually in September.', metadata={'source': 'data\\toronto.pdf', 'page': 1}),
Document(page_content='Page 3: Outdoor Activities \nToronto o Ưers numerous opportunities for outdoor activities. \n\uf0b7 High Park: Toronto’s largest public park featuring many hiking trails, sports facilities, \na beautiful lakefront, a zoo, and several playgrounds. \n\uf0b7 Toronto Zoo: Home to over 5,000 animals representing over 500 species. \n\uf0b7 Ripley’s Aquarium of Canada: Located at the base of the CN Tower, this enormous \naquarium is one of the city’s newest top attractions.', metadata={'source': 'data\\toronto.pdf', 'page': 2}),
Document(page_content='Page 4: Food and Nightlife \nToronto’s food scene is as diverse as its population. \n\uf0b7 St. Lawrence Market: Named the world’s best food market by National Geographic \nin 2012, this is a must-visit for foodies. \n\uf0b7 Nightlife: Toronto has a vibrant nightlife with a plethora of bars, nightclubs, and live \nmusic venues. The Entertainment District is known for its nightclubs and theaters. \nIn conclusion, whether you’re a lover of art and culture, outdoor activities, food, or just \nlooking to have a good time, Toronto has something for everyone.', metadata={'source': 'data\\toronto.pdf', 'page': 3})]

Setting Up the Vector Store and Retriever

We initialize the vector store, docstore, and multi-vector retriever. The CustomSQLRecordManager is also set up to manage document embeddings:

from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_postgres import PGVector
from database import COLLECTION_NAME, CONNECTION_STRING
from utils.store import PostgresByteStore
from utils.custom_sql_record_manager import CustomSQLRecordManager
from utils.index_with_ids import index_with_ids

embeddings = OpenAIEmbeddings()
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

store = PostgresByteStore(CONNECTION_STRING, COLLECTION_NAME)
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

# Define record manager
namespace = f"pgvector/{COLLECTION_NAME}"
record_manager = CustomSQLRecordManager(
    namespace, db_url=CONNECTION_STRING
)
record_manager.create_schema()

retriever

Adding Reproducible Unique Document IDs

To ensure documents have consistent IDs across imports, we generate reproducible IDs based on document content:

from utils.utils import generate_reproducible_id_by_content

# Add a reproducible unique doc_id to each document's metadata
for doc in documents:
    doc.metadata["doc_id"] = generate_reproducible_id_by_content(doc.page_content, doc.metadata)
[Document(page_content='Things to Do in Toronto \nPage 1: Introduction \nToronto, the capital of Ontario, is the largest city in Canada and a dynamic, cosmopolitan \nhub. Known for its towering skyline, bustling waterfront, and numerous cultural attractions, \nToronto o Ưers a wealth of experiences for every visitor. \nKey Attractions:  \n\uf0b7 CN Tower:  This iconic symbol of Toronto o Ưers panoramic views of the city. Don’t \nmiss the glass floor and the revolving restaurant at the top. \n\uf0b7 Royal Ontario Museum (ROM):  Canada’s largest museum of world cultures and \nnatural history is a must-visit. \n\uf0b7 Toronto Islands:  A group of small islands located just o Ư the city’s shore, o Ưering \nbeautiful beaches, picnic spots, and bike rentals.', metadata={'source': 'data\\toronto.pdf', 'page': 0, 'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa'}),
Document(page_content='Page 2: Cultural Experiences \nToronto is a melting pot of cultures, and this is reflected in its neighborhoods and festivals. \nNeighborhoods: \n\uf0b7 Chinatown: One of North America’s largest Chinatowns, known for its vibrant food \nscene. \n\uf0b7 Kensington Market: A bohemian neighborhood o Ưering vintage shops, eclectic \nboutiques, and international food stalls. \n\uf0b7 Distillery District: Known for its well-preserved Victorian Industrial architecture, it’s \nnow home to boutiques, art galleries, and performance spaces. \nFestivals: \n\uf0b7 Caribana: A festival celebrating Caribbean culture and traditions, held in summer. \n\uf0b7 Toronto International Film Festival (TIFF): One of the most prestigious film \nfestivals in the world, held annually in September.', metadata={'source': 'data\\toronto.pdf', 'page': 1, 'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68'}),
Document(page_content='Page 3: Outdoor Activities \nToronto o Ưers numerous opportunities for outdoor activities. \n\uf0b7 High Park: Toronto’s largest public park featuring many hiking trails, sports facilities, \na beautiful lakefront, a zoo, and several playgrounds. \n\uf0b7 Toronto Zoo: Home to over 5,000 animals representing over 500 species. \n\uf0b7 Ripley’s Aquarium of Canada: Located at the base of the CN Tower, this enormous \naquarium is one of the city’s newest top attractions.', metadata={'source': 'data\\toronto.pdf', 'page': 2, 'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b'}),
Document(page_content='Page 4: Food and Nightlife \nToronto’s food scene is as diverse as its population. \n\uf0b7 St. Lawrence Market: Named the world’s best food market by National Geographic \nin 2012, this is a must-visit for foodies. \n\uf0b7 Nightlife: Toronto has a vibrant nightlife with a plethora of bars, nightclubs, and live \nmusic venues. The Entertainment District is known for its nightclubs and theaters. \nIn conclusion, whether you’re a lover of art and culture, outdoor activities, food, or just \nlooking to have a good time, Toronto has something for everyone.', metadata={'source': 'data\\toronto.pdf', 'page': 3, 'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767'})]

Updating the Docstore

We create a list of document tuples and use the conditional_mset method to update the docstore. The method determines the change required for each document and returns a list of (doc_id, operation) tuples; on this first import, every document is inserted ('INS'), while unchanged documents on a later run would be reported as 'SKIP':

# Generate the list of (doc_id, document) tuples from the documents
doc_id_document_tuples = [(doc.metadata["doc_id"], doc) for doc in documents]

# Pass the list of tuples to retriever.docstore.conditional_mset
parent_docs_operations = retriever.docstore.conditional_mset(doc_id_document_tuples)
[('eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'INS'),
('4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'INS'),
('80e1944b-044e-5dc8-899f-7a941f1fa08b', 'INS'),
('a9cace25-2615-59b1-9669-ccd6656ac767', 'INS')]
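For comparison, re-running the same call on an unchanged document set would be expected to report every document as skipped (a hypothetical second run, shown for illustration):

# Hypothetical re-run with the identical documents
parent_docs_operations = retriever.docstore.conditional_mset(doc_id_document_tuples)
# Expected output:
# [('eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'SKIP'),
#  ('4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'SKIP'),
#  ('80e1944b-044e-5dc8-899f-7a941f1fa08b', 'SKIP'),
#  ('a9cace25-2615-59b1-9669-ccd6656ac767', 'SKIP')]

These 'SKIP' operations are exactly what the next section branches on.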

Managing Sub-Documents with SQLAlchemy

This section involves fetching and splitting documents into smaller chunks, managing the database connections, and updating the document metadata:

from sqlalchemy import create_engine, Column, String, LargeBinary, select, Table, MetaData
from sqlalchemy.orm import sessionmaker
from sqlalchemy.dialects.postgresql import JSONB
from langchain.schema.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

separators = ["\n\n", "\n", ".", "?", "!"]

# Initialize the RecursiveCharacterTextSplitter with fixed parameters
child_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    separators=separators
)

# List to store all sub-documents
all_sub_docs = []

# Database connection setup
engine = create_engine(CONNECTION_STRING)
Session = sessionmaker(bind=engine)
session = Session()

# Define table structure
metadata = MetaData()
langchain_pg_embedding = Table(
    'langchain_pg_embedding', metadata,
    Column('id', String, primary_key=True),
    Column('collection_id', String),
    Column('embedding', LargeBinary),
    Column('document', String),
    Column('cmetadata', JSONB)
)

# Sort the parent documents operations to ensure deterministic processing order
parent_docs_operations = sorted(parent_docs_operations, key=lambda x: x[0])

# Iterate through the operations
for doc_id, operation in parent_docs_operations:
    if operation == 'SKIP':
        # Fetch records from langchain_pg_embedding table for SKIP documents
        query = select(
            langchain_pg_embedding.c.id,
            langchain_pg_embedding.c.collection_id,
            langchain_pg_embedding.c.embedding,
            langchain_pg_embedding.c.document,
            langchain_pg_embedding.c.cmetadata
        ).where(
            (langchain_pg_embedding.c.cmetadata['doc_id'].astext == doc_id) &
            (langchain_pg_embedding.c.cmetadata['type'].astext == 'smaller chunk')
        ).order_by(langchain_pg_embedding.c.id)  # Ensure fixed order

        result = session.execute(query).fetchall()

        # Recreate sub-documents from fetched records
        for row in result:
            metadata = row.cmetadata
            sub_doc_content = row.document
            sub_doc = Document(page_content=sub_doc_content, metadata=metadata)
            all_sub_docs.append(sub_doc)
    else:
        # Retrieve the document from the docstore for non-SKIP documents
        doc = retriever.docstore.get(doc_id)
        if doc:
            source = doc.metadata.get("source")  # Retrieve the source from the document's metadata
            sub_docs = child_text_splitter.split_documents([doc])
            # Ensure fixed order for sub-documents
            sub_docs = sorted(sub_docs, key=lambda x: x.page_content)
            for sub_doc in sub_docs:
                sub_doc.metadata["doc_id"] = doc_id  # Assign the same doc_id to each sub-document
                sub_doc.metadata["source"] = f"{source}(smaller chunk)"  # Add the suffix to the source
                sub_doc.metadata["type"] = "smaller chunk"
            all_sub_docs.extend(sub_docs)

# Close the session after use
session.close()

# The resulting sub-documents
all_sub_docs
[Document(page_content='Things to Do in Toronto \nPage 1: Introduction \nToronto, the capital of Ontario, is the largest city in Canada and a dynamic, cosmopolitan \nhub. Known for its towering skyline, bustling waterfront, and numerous cultural attractions, \nToronto o Ưers a wealth of experiences for every visitor. \nKey Attractions:  \n\uf0b7 CN Tower:  This iconic symbol of Toronto o Ưers panoramic views of the city. Don’t', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 0, 'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'type': 'smaller chunk'}),
Document(page_content='Toronto o Ưers a wealth of experiences for every visitor. \nKey Attractions: \n\uf0b7 CN Tower: This iconic symbol of Toronto o Ưers panoramic views of the city. Don’t \nmiss the glass floor and the revolving restaurant at the top. \n\uf0b7 Royal Ontario Museum (ROM): Canada’s largest museum of world cultures and \nnatural history is a must-visit.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 0, 'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'type': 'smaller chunk'}),
Document(page_content='miss the glass floor and the revolving restaurant at the top. \n\uf0b7 Royal Ontario Museum (ROM): Canada’s largest museum of world cultures and \nnatural history is a must-visit. \n\uf0b7 Toronto Islands: A group of small islands located just o Ư the city’s shore, o Ưering \nbeautiful beaches, picnic spots, and bike rentals.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 0, 'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'type': 'smaller chunk'}),
Document(page_content='Page 2: Cultural Experiences \nToronto is a melting pot of cultures, and this is reflected in its neighborhoods and festivals. \nNeighborhoods: \n\uf0b7 Chinatown: One of North America’s largest Chinatowns, known for its vibrant food \nscene. \n\uf0b7 Kensington Market: A bohemian neighborhood o Ưering vintage shops, eclectic \nboutiques, and international food stalls.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 1, 'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'type': 'smaller chunk'}),
Document(page_content='scene. \n\uf0b7 Kensington Market: A bohemian neighborhood o Ưering vintage shops, eclectic \nboutiques, and international food stalls. \n\uf0b7 Distillery District: Known for its well-preserved Victorian Industrial architecture, it’s \nnow home to boutiques, art galleries, and performance spaces. \nFestivals: \n\uf0b7 Caribana: A festival celebrating Caribbean culture and traditions, held in summer.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 1, 'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'type': 'smaller chunk'}),
Document(page_content='now home to boutiques, art galleries, and performance spaces. \nFestivals: \n\uf0b7 Caribana: A festival celebrating Caribbean culture and traditions, held in summer. \n\uf0b7 Toronto International Film Festival (TIFF): One of the most prestigious film \nfestivals in the world, held annually in September.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 1, 'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'type': 'smaller chunk'}),
Document(page_content='Page 3: Outdoor Activities \nToronto o Ưers numerous opportunities for outdoor activities. \n\uf0b7 High Park: Toronto’s largest public park featuring many hiking trails, sports facilities, \na beautiful lakefront, a zoo, and several playgrounds. \n\uf0b7 Toronto Zoo: Home to over 5,000 animals representing over 500 species. \n\uf0b7 Ripley’s Aquarium of Canada: Located at the base of the CN Tower, this enormous', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 2, 'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'type': 'smaller chunk'}),
Document(page_content='\uf0b7 Toronto Zoo: Home to over 5,000 animals representing over 500 species. \n\uf0b7 Ripley’s Aquarium of Canada: Located at the base of the CN Tower, this enormous \naquarium is one of the city’s newest top attractions.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 2, 'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'type': 'smaller chunk'}),
Document(page_content='Page 4: Food and Nightlife \nToronto’s food scene is as diverse as its population. \n\uf0b7 St. Lawrence Market: Named the world’s best food market by National Geographic \nin 2012, this is a must-visit for foodies. \n\uf0b7 Nightlife: Toronto has a vibrant nightlife with a plethora of bars, nightclubs, and live \nmusic venues. The Entertainment District is known for its nightclubs and theaters.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 3, 'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'type': 'smaller chunk'}),
Document(page_content='\uf0b7 Nightlife: Toronto has a vibrant nightlife with a plethora of bars, nightclubs, and live \nmusic venues. The Entertainment District is known for its nightclubs and theaters. \nIn conclusion, whether you’re a lover of art and culture, outdoor activities, food, or just \nlooking to have a good time, Toronto has something for everyone.', metadata={'source': 'data\\toronto.pdf(smaller chunk)', 'page': 3, 'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'type': 'smaller chunk'})]

Indexing the Sub-Documents

We index the sub-documents using the index_with_ids function to manage embeddings and ensure efficient retrieval:

idx = index_with_ids(all_sub_docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
{'status': 'success',
'ids': [{'key': 'de64f070-8219-5b39-8032-cc0f8b38129c', 'operation': 'INS'},
{'key': '7e3089a7-17c8-5f81-85bf-85d2ffbd49e5', 'operation': 'INS'},
{'key': 'f5d04788-e830-5f9e-a59b-6f868320a1aa', 'operation': 'INS'},
{'key': '503d7916-e17c-560b-8f6d-fb43883ba7bf', 'operation': 'INS'},
{'key': 'be89892a-f871-5170-9e26-ea9b31662aab', 'operation': 'INS'},
{'key': '1bdda57a-675b-5bb3-8a3c-0484976f70f1', 'operation': 'INS'},
{'key': '5a393ada-9b12-5354-b38f-295b353ce558', 'operation': 'INS'},
{'key': '3862ca2f-e184-5d94-8ed9-1747dbecdfc0', 'operation': 'INS'},
{'key': '55c17723-722d-53da-ad32-56d0c6e00111', 'operation': 'INS'},
{'key': '664c65ff-79b2-551c-93e5-fba7ece620cb', 'operation': 'INS'}],
'results': [{'num_added': 10,
'num_updated': 0,
'num_skipped': 0,
'num_deleted': 0}]}

Summarizing Documents

We generate summaries for the parent chunks using the language model and the defined summary chain:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt_text = """You are an assistant tasked with summarizing text. \
Directly summarize the following text chunk: {element} """
prompt = ChatPromptTemplate.from_template(prompt_text)

# Initialize the Language Model (LLM)
model = ChatOpenAI(temperature=0, model="gpt-4o")

# Define the summary chain
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
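Before batching over all documents, the chain can be tried on a single chunk (a hypothetical sanity check; the exact wording of the summary will vary with the model):

# Hypothetical usage: summarize the first page on its own
sample_summary = summarize_chain.invoke(documents[0].page_content)
print(sample_summary)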

Fetching and Creating Summary Documents

We fetch existing summary documents or create new ones based on the summaries generated by the language model:

from sqlalchemy import create_engine, Column, String, LargeBinary, select, Table, MetaData
from sqlalchemy.orm import sessionmaker
from sqlalchemy.dialects.postgresql import JSONB
from langchain.schema.document import Document

# List to store all summary documents
summary_docs = []

# Database connection setup
engine = create_engine(CONNECTION_STRING)
Session = sessionmaker(bind=engine)
session = Session()

# Define table structure
metadata = MetaData()
langchain_pg_embedding = Table(
    'langchain_pg_embedding', metadata,
    Column('id', String, primary_key=True),
    Column('collection_id', String),
    Column('embedding', LargeBinary),
    Column('document', String),
    Column('cmetadata', JSONB)
)

# Create a dictionary to map doc_id to documents
documents_dict = {doc.metadata['doc_id']: doc for doc in documents}

# Collect parent chunks and associated document IDs for documents that are not SKIP and exist in documents_dict
non_skip_docs = [(documents_dict[doc_id], doc_id) for doc_id, operation in parent_docs_operations if operation != 'SKIP' and doc_id in documents_dict]
skip_doc_ids = [doc_id for doc_id, operation in parent_docs_operations if operation == 'SKIP']


# Generate summaries for the parent chunks that are not SKIP
parent_chunk = [doc.page_content for doc, _ in non_skip_docs]
text_summaries = summarize_chain.batch(parent_chunk, {"max_concurrency": 5})
text_summaries_iter = iter(text_summaries)

# Dictionary to store summaries temporarily
temp_summary_docs = {}

# Process non-SKIP documents and store their summaries
for doc, doc_id in non_skip_docs:
    source = doc.metadata.get("source")
    page = doc.metadata.get("page")
    summary_content = next(text_summaries_iter)
    summary_doc = Document(page_content=summary_content, metadata={
        "doc_id": doc_id,
        "source": f"{source}(summary)",
        "page": page,
        "type": "summary"
    })
    temp_summary_docs[doc_id] = summary_doc


# Process SKIP documents and store their summaries
for doc_id in skip_doc_ids:
    query = select(
        langchain_pg_embedding.c.id,
        langchain_pg_embedding.c.collection_id,
        langchain_pg_embedding.c.embedding,
        langchain_pg_embedding.c.document,
        langchain_pg_embedding.c.cmetadata
    ).where(
        (langchain_pg_embedding.c.cmetadata['doc_id'].astext == doc_id) &
        (langchain_pg_embedding.c.cmetadata['type'].astext == 'summary')
    )

    result = session.execute(query).fetchall()

    if not result:
        print(f"No result found for SKIP doc_id {doc_id}")
    else:
        for row in result:
            metadata = row.cmetadata
            summary_content = row.document
            summary_doc = Document(page_content=summary_content, metadata=metadata)
            temp_summary_docs[doc_id] = summary_doc


# Combine the summaries into the final summary_docs list
for doc in documents:
    doc_id = doc.metadata['doc_id']
    if doc_id in temp_summary_docs:
        summary_docs.append(temp_summary_docs[doc_id])
    else:
        # Handle the case where no summary was found or generated
        print(f"No summary found for document ID {doc_id}")

# Close the session after use
session.close()

# The resulting summary documents
summary_docs

[Document(page_content="Toronto, the capital of Ontario and Canada's largest city, is a vibrant and cosmopolitan destination. Key attractions include the CN Tower with its panoramic views and revolving restaurant, the Royal Ontario Museum (ROM) which is Canada's largest museum of world cultures and natural history, and the Toronto Islands, which offer beautiful beaches, picnic spots, and bike rentals.", metadata={'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'source': 'data\\toronto.pdf(summary)', 'page': 0, 'type': 'summary'}),
Document(page_content="Toronto's cultural experiences are highlighted through its diverse neighborhoods and festivals. Key neighborhoods include Chinatown, known for its vibrant food scene; Kensington Market, offering vintage shops and international food stalls; and the Distillery District, noted for its Victorian Industrial architecture and cultural spaces. Major festivals include Caribana, celebrating Caribbean culture in the summer, and the Toronto International Film Festival (TIFF), a prestigious annual event in September.", metadata={'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'source': 'data\\toronto.pdf(summary)', 'page': 1, 'type': 'summary'}),
Document(page_content="Toronto offers numerous opportunities for outdoor activities, including High Park, which is the city's largest public park with hiking trails, sports facilities, a lakefront, a zoo, and playgrounds. The Toronto Zoo houses over 5,000 animals from more than 500 species. Ripley's Aquarium of Canada, located at the base of the CN Tower, is a major new attraction in the city.", metadata={'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'source': 'data\\toronto.pdf(summary)', 'page': 2, 'type': 'summary'}),
Document(page_content="Toronto's food scene is diverse, reflecting its population. St. Lawrence Market, named the world's best food market by National Geographic in 2012, is a must-visit for food enthusiasts. The city also boasts a vibrant nightlife with numerous bars, nightclubs, and live music venues, particularly in the Entertainment District. Overall, Toronto offers a wide range of activities and attractions for everyone.", metadata={'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'source': 'data\\toronto.pdf(summary)', 'page': 3, 'type': 'summary'})]

Indexing the Summary Documents

We index the summary documents using the index_with_ids function to manage embeddings:

idx = index_with_ids(summary_docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
{'status': 'success',
'ids': [{'key': 'cfec048e-c4ad-5750-a165-439aedd3cc1a', 'operation': 'INS'},
{'key': 'c8f19c3f-3fca-5bd3-9702-1ade48aca1b1', 'operation': 'INS'},
{'key': '5b16db17-9eee-59ff-9600-a15f9075d821', 'operation': 'INS'},
{'key': '15482b0f-b1c0-59c3-9e8f-7801a4409cc6', 'operation': 'INS'}],
'results': [{'num_added': 4,
'num_updated': 0,
'num_skipped': 0,
'num_deleted': 0}]}

Generating Hypothetical Questions

We create a prompt template and use the language model to generate hypothetical questions for each document:

functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

question_chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 5 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        """Generate a list of exactly 5 hypothetical questions that the below document could be used to answer:\n\n{doc}
separate each question with a comma (,)
"""
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
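As with the summary chain, it can be worth trying this on a single document first (a hypothetical usage example):

# Hypothetical usage: generate questions for the first page only
sample_questions = question_chain.invoke(documents[0])
print(sample_questions)  # a Python list of five question strings

Note that the chain's first step extracts page_content, so it takes a Document rather than a raw string.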

Fetching and Creating Question Documents

We fetch existing question documents or create new ones based on the questions generated by the language model:

from sqlalchemy import create_engine, Column, String, LargeBinary, select, Table, MetaData
from sqlalchemy.orm import sessionmaker
from sqlalchemy.dialects.postgresql import JSONB
from langchain.schema.document import Document

# List to store all question documents
question_docs = []

# Database connection setup
engine = create_engine(CONNECTION_STRING)
Session = sessionmaker(bind=engine)
session = Session()

# Define table structure
metadata = MetaData()
langchain_pg_embedding = Table(
    'langchain_pg_embedding', metadata,
    Column('id', String, primary_key=True),
    Column('collection_id', String),
    Column('embedding', LargeBinary),
    Column('document', String),
    Column('cmetadata', JSONB)
)

# Create a dictionary to map doc_id to documents
documents_dict = {doc.metadata['doc_id']: doc for doc in documents}

# Separate non-SKIP and SKIP document IDs
non_skip_docs = [(documents_dict[doc_id], doc_id) for doc_id, operation in parent_docs_operations if operation != 'SKIP' and doc_id in documents_dict]
skip_doc_ids = [doc_id for doc_id, operation in parent_docs_operations if operation == 'SKIP']

# Generate hypothetical questions for the parent documents that are not SKIP
parent_documents = [doc for doc, _ in non_skip_docs]
hypothetical_questions = question_chain.batch(parent_documents, {"max_concurrency": 5})
hypothetical_questions_iter = iter(hypothetical_questions)

# Dictionary to store questions temporarily
temp_question_docs = {}

# Process non-SKIP documents and store their questions
for doc, doc_id in non_skip_docs:
    source = doc.metadata.get("source")
    page = doc.metadata.get("page")
    question_list = next(hypothetical_questions_iter)

    # Ensure there are exactly 5 questions for each document
    if len(question_list) < 5:
        question_list = question_list + [""] * (5 - len(question_list))  # Pad with empty strings if fewer than 5

    for question_content in question_list[:5]:
        question_doc = Document(page_content=question_content, metadata={
            "doc_id": doc_id,
            "source": f"{source}(question)",
            "page": page,
            "type": "question"
        })
        if doc_id not in temp_question_docs:
            temp_question_docs[doc_id] = []
        temp_question_docs[doc_id].append(question_doc)

# Process SKIP documents and store their questions
for doc_id in skip_doc_ids:
    query = select(
        langchain_pg_embedding.c.id,
        langchain_pg_embedding.c.collection_id,
        langchain_pg_embedding.c.embedding,
        langchain_pg_embedding.c.document,
        langchain_pg_embedding.c.cmetadata
    ).where(
        (langchain_pg_embedding.c.cmetadata['doc_id'].astext == doc_id) &
        (langchain_pg_embedding.c.cmetadata['type'].astext == 'question')
    )

    result = session.execute(query).fetchall()

    if result:
        questions = []
        for row in result:
            metadata = row.cmetadata
            question_content = row.document
            question_doc = Document(page_content=question_content, metadata=metadata)
            questions.append(question_doc)

        # Ensure there are exactly 5 questions for each document
        if len(questions) < 5:
            questions = questions + [Document(page_content="", metadata={
                "doc_id": doc_id,
                "source": f"{documents_dict[doc_id].metadata.get('source')}(question)",
                "page": documents_dict[doc_id].metadata.get("page"),
                "type": "question"
            }) for _ in range(5 - len(questions))]  # Pad with empty documents if fewer than 5

        temp_question_docs[doc_id] = questions[:5]

# Combine the questions into the final question_docs list
for doc in documents:
    doc_id = doc.metadata['doc_id']
    if doc_id in temp_question_docs:
        question_docs.extend(temp_question_docs[doc_id])

# Close the session after use
session.close()

# The resulting question documents
question_docs
[Document(page_content='What are the must-visit attractions in Toronto?', metadata={'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'source': 'data\\toronto.pdf(question)', 'page': 0, 'type': 'question'}),
Document(page_content="Where can I find a good view of Toronto's skyline?", metadata={'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'source': 'data\\toronto.pdf(question)', 'page': 0, 'type': 'question'}),
Document(page_content='What activities are available on the Toronto Islands?', metadata={'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'source': 'data\\toronto.pdf(question)', 'page': 0, 'type': 'question'}),
Document(page_content='Which museum in Toronto focuses on world cultures and natural history?', metadata={'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'source': 'data\\toronto.pdf(question)', 'page': 0, 'type': 'question'}),
Document(page_content='Is there a unique dining experience available at the CN Tower?', metadata={'doc_id': 'eac9bbc7-a391-5931-a26c-11d9ee2402aa', 'source': 'data\\toronto.pdf(question)', 'page': 0, 'type': 'question'}),
Document(page_content='What are some culturally diverse neighborhoods to visit in Toronto?', metadata={'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'source': 'data\\toronto.pdf(question)', 'page': 1, 'type': 'question'}),
Document(page_content='Which neighborhood in Toronto is known for its Victorian Industrial architecture?', metadata={'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'source': 'data\\toronto.pdf(question)', 'page': 1, 'type': 'question'}),
Document(page_content='What is the Toronto International Film Festival (TIFF) and when is it held?', metadata={'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'source': 'data\\toronto.pdf(question)', 'page': 1, 'type': 'question'}),
Document(page_content="Where can I experience a vibrant food scene in Toronto's Chinatown?", metadata={'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'source': 'data\\toronto.pdf(question)', 'page': 1, 'type': 'question'}),
Document(page_content='What is Caribana and when does it take place?', metadata={'doc_id': '4d722603-1c85-56ab-82f2-2d4dfdd3eb68', 'source': 'data\\toronto.pdf(question)', 'page': 1, 'type': 'question'}),
Document(page_content='What outdoor activities can you do in Toronto?', metadata={'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'source': 'data\\toronto.pdf(question)', 'page': 2, 'type': 'question'}),
Document(page_content='What can you find in High Park?', metadata={'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'source': 'data\\toronto.pdf(question)', 'page': 2, 'type': 'question'}),
Document(page_content='How many animals are there in the Toronto Zoo?', metadata={'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'source': 'data\\toronto.pdf(question)', 'page': 2, 'type': 'question'}),
Document(page_content='Where is Ripley’s Aquarium of Canada located?', metadata={'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'source': 'data\\toronto.pdf(question)', 'page': 2, 'type': 'question'}),
Document(page_content='What are some of the features of Toronto’s largest public park?', metadata={'doc_id': '80e1944b-044e-5dc8-899f-7a941f1fa08b', 'source': 'data\\toronto.pdf(question)', 'page': 2, 'type': 'question'}),
Document(page_content='What makes St. Lawrence Market a must-visit for food enthusiasts?', metadata={'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'source': 'data\\toronto.pdf(question)', 'page': 3, 'type': 'question'}),
Document(page_content="How does Toronto's nightlife compare to other major cities?", metadata={'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'source': 'data\\toronto.pdf(question)', 'page': 3, 'type': 'question'}),
Document(page_content="What are some popular bars and nightclubs in Toronto's Entertainment District?", metadata={'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'source': 'data\\toronto.pdf(question)', 'page': 3, 'type': 'question'}),
Document(page_content='How diverse is the food scene in Toronto?', metadata={'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'source': 'data\\toronto.pdf(question)', 'page': 3, 'type': 'question'}),
Document(page_content="What are the top recommendations for someone looking to experience Toronto's nightlife?", metadata={'doc_id': 'a9cace25-2615-59b1-9669-ccd6656ac767', 'source': 'data\\toronto.pdf(question)', 'page': 3, 'type': 'question'})]

Indexing the Question Documents

We index the question documents using the index_with_ids function to manage embeddings:

idx = index_with_ids(question_docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")
{'status': 'success',
'ids': [{'key': 'd8956c05-e16e-51e6-9802-3a7c653302af', 'operation': 'INS'},
{'key': '28da1e5f-0173-531c-b01e-dd77fb80b3bd', 'operation': 'INS'},
{'key': '177a0495-c04f-56da-8b87-e186dfeb16cd', 'operation': 'INS'},
{'key': 'de413224-1663-5da8-8cd5-077c91128a72', 'operation': 'INS'},
{'key': '79496681-d438-5aad-bdee-189eb2fc6160', 'operation': 'INS'},
{'key': 'a490421e-cfa3-5898-bd49-1b3935f4382d', 'operation': 'INS'},
{'key': '0341f6fb-8c54-5017-a8d5-e62ea0d491ef', 'operation': 'INS'},
{'key': '72dc46a8-294d-5f3f-bd38-3e6dbc1e099b', 'operation': 'INS'},
{'key': '2b6bd764-6b1c-57fa-a236-1adb95d21e6e', 'operation': 'INS'},
{'key': 'd41ee898-c49e-5e07-909a-3a384adfe238', 'operation': 'INS'},
{'key': '6220ccf4-fa2c-53ca-9b7e-83c97fe3362c', 'operation': 'INS'},
{'key': '654d6c24-9e64-55a0-b58c-4073db2f7b1a', 'operation': 'INS'},
{'key': '10f6c346-4cb0-540c-8cd8-f86d9c0413c6', 'operation': 'INS'},
{'key': '514a6032-f24a-51ef-a035-132990968947', 'operation': 'INS'},
{'key': '9927f2d6-a23f-5c31-a9d0-733a26cedb1e', 'operation': 'INS'},
{'key': '1dec701b-838c-5db2-9d16-d8facc48424d', 'operation': 'INS'},
{'key': 'b860ef0a-4e76-560b-82e0-bb7b14740b11', 'operation': 'INS'},
{'key': '8c71e9ca-485e-5b26-a0e1-a15bcb9ea3bb', 'operation': 'INS'},
{'key': '1ae755a8-e7f2-523c-a36d-80e5ffe79aac', 'operation': 'INS'},
{'key': 'f9436aeb-486b-57af-9c5b-fd82d65e4682', 'operation': 'INS'}],
'results': [{'num_added': 20,
'num_updated': 0,
'num_skipped': 0,
'num_deleted': 0}]}

Querying the System with RAG Pipeline

Finally, we set up a Retrieval-Augmented Generation (RAG) pipeline to answer questions based on the context provided by the retriever:

from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Prompt template
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM
model = ChatOpenAI(temperature=0, model="gpt-4o")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Example query
chain.invoke("What types of shops and food can I find in Kensington Market?")

Conclusion

In this walkthrough, we have explored the comprehensive process of managing and retrieving documents using LangChain, enhanced with custom utilities and SQLAlchemy for more efficient and granular updates. By leveraging reproducible UUIDs, we addressed the challenges of re-importing documents and maintaining consistent document IDs across imports. Additionally, we introduced the CustomSQLRecordManager, index_with_ids, and conditional_mset methods to ensure that document updates are handled efficiently, reducing computational overhead and improving data integrity.

Through this process, we demonstrated how to split documents into manageable chunks, generate embeddings, and update the vector store and docstore accurately. By integrating advanced retrieval methods such as multi-vector retrieval and Retrieval-Augmented Generation (RAG), we significantly enhanced the system’s ability to provide accurate and relevant information retrieval.

These improvements make the document management system more robust, scalable, and capable of handling frequent updates and complex document structures. For AI practitioners and developers, this approach offers a powerful framework for building efficient, reliable, and scalable document retrieval systems, ensuring that your AI applications can manage large datasets effectively and provide high-quality information retrieval.

The complete code for the methods and techniques discussed in this article is available on GitHub. This allows you to explore, adapt, and implement these strategies in your own projects. Happy coding!

Thank you for reading!
