Using LangChain for Question Answering on Your Own Data

A step-by-step guide to using LangChain to chat with your own data

Onkar Mishra
23 min read · Aug 7, 2023

Large language models can answer questions on topics they were trained on, but they cannot answer questions about our personal data, a company's proprietary documents, or articles written after the LLM was trained. It would be really useful if we could have conversations with our own documents and have an LLM answer questions from them. We can have conversations with two types of documents: structured and unstructured. This article discusses how to have conversations with unstructured documents such as PDFs. It takes most of its content from the course LangChain: Chat with Your Data by Prof. Andrew Ng and Harrison Chase, founder of LangChain. This is the third article on LangChain: the first article discusses how LangChain can be used for LLM application development, and the second discusses how to use chains and agents for LLM application development. I have discussed how to have conversations with structured data such as a relational database in another article, Context Management in text2sql tasks.

LangChain is an open-source developer framework for building LLM applications. In this article, we will focus on a specific use case of LangChain, i.e. how to use LangChain to chat with our own data. We will cover the following topics:

  • Document Loading
  • Document Splitting
  • Vector Store and Embeddings
  • Retrieval
  • Question Answering

Document Loading

In retrieval augmented generation (RAG) framework, an LLM retrieves contextual documents from an external dataset as part of its execution. This is useful when we want to ask questions about specific documents (e.g., PDFs, videos, etc). If we want to create an application to chat with our data, we need to first load our data into a format where it can be worked with.

Retrieval Augmented Generation (RAG)

We use LangChain's document loaders for this purpose. Document loaders deal with the specifics of accessing and converting data from a variety of formats and sources into a standardized format. We may have to load data from structured or unstructured sources: websites, databases, YouTube, arXiv, Twitter, Hacker News, proprietary sources like Figma and Notion, or services like Airbyte, Stripe, and Airtable. These documents come in different data types, like PDF, HTML, JSON, Word, and PowerPoint, or can be in tabular format. Document loaders take in data from these sources and load it into a standard Document object, consisting of content and associated metadata. LangChain has more than 80 different document loaders, as seen below.

Types of Document Loaders in LangChain

PyPDF DataLoader

Now, we will use the PyPDF loader to load a PDF. We will be loading MachineLearning-Lecture01.pdf from Andrew Ng's famous CS229 course.

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")

#Load the document by calling loader.load()
pages = loader.load()

print(len(pages))
print(pages[0].page_content[0:500])

print(pages[0].metadata)
# {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

This loads a list of documents. In this case, there are 22 pages in this PDF and each page is a document. A document contains page_content and metadata: page_content is the text of the page, and metadata holds the metadata associated with each document.

Youtube DataLoader

LangChain provides YoutubeAudioLoader that loads videos from YouTube. We can use this loader to ask questions from videos or lectures.

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

print(docs[0].page_content[0:500])

YoutubeAudioLoader loads an audio file from a YouTube link, and OpenAIWhisperParser uses OpenAI's speech-to-text Whisper model to convert the YouTube audio into a text format that we can work with. We need to specify a YouTube URL and a directory in which to save the audio files.

WebBaseLoader

WebBaseLoader is used to load URLs from the Internet.

from langchain.document_loaders import WebBaseLoader

# Use a markdown file from github page
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

docs = loader.load()
print(docs[0].page_content[:500])

Here, we need to do some post-processing on the above output, as there is a lot of white space followed by the actual text.
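For example (a minimal clean-up sketch, not part of the original lesson), we can collapse the extra blank lines and spaces with a regular expression:

import re

# Collapse runs of blank lines and repeated spaces left over from the HTML
cleaned = re.sub(r"\n{2,}", "\n\n", docs[0].page_content)
cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
print(cleaned[:500])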

NotionDirectoryLoader

NotionDirectoryLoader is used to load data from Notion. Notion is a popular store of personal and company data. We can duplicate a page from our Notion workspace and export it as a Markdown/CSV file.

from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

print(docs[0].page_content[0:200])
print(docs[0].metadata)

So far, we have covered how to load data from a variety of sources into a standardized document format. But if these documents are large, we may have to split them into smaller chunks. This is important because, in retrieval augmented generation, we want to retrieve only the pieces of content that are most relevant to the question.

Document Splitting

Document splitting breaks documents into smaller chunks. It happens after we load the data into the standardized document format but before it goes into the vector store.

Document Splitting in LangChain

Splitting documents into smaller chunks is important and tricky, as we need to maintain meaningful relationships between the chunks. For example, suppose we have two chunks about the Toyota Camry as follows:

chunk 1: on this model. The Toyota Camry has a head-snapping

chunk 2: 80 HP and an eight-speed automatic transmission that will

In this case, we did a simple split and ended up with part of the sentence in one chunk and the other part in another chunk. So we would not be able to answer a question about the Camry's specifications, because neither chunk contains the right information. It is therefore important to split documents into semantically meaningful chunks.

We will now initialize RecursiveCharacterTextSplitter and CharacterTextSplitter as below:

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size =26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

The input text is split based on a defined chunk size with some defined chunk overlap. The chunk size is measured by a length function, often the number of characters or tokens.

Chunk Size and Chunk Overlap in Document Splitting

Chunk overlap keeps a small piece of text shared between two consecutive chunks, which gives us some notion of continuity between them. There are different types of splitters in LangChain, as can be seen below:

Types of Splitters in LangChain

The text splitters in LangChain have two methods, create_documents and split_documents. Both share the same logic under the hood, but one takes in a list of texts and the other takes in a list of documents. These text splitters vary across a few dimensions, such as how they split the chunks (by character or by token) and how they measure the length of a chunk. We can sometimes use smaller models to determine the end of a sentence and use that information to decide where to split. Metadata is also important when splitting texts/documents into chunks: we may have to add new pieces of metadata while keeping the existing metadata consistent across all chunks. Sometimes the splitting is specific to the type of document, as with source code, where a language-aware splitter uses different separators for languages like Python, Ruby, and C, as in the sketch below.
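As an illustration of language-aware splitting (a small sketch using LangChain's Language enum, not taken from the course), we can build a splitter that prefers Python-specific boundaries such as class and function definitions:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Splitter that prefers Python-specific separators (classes, functions) over raw characters
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=60,
    chunk_overlap=0
)

python_code = """def hello():
    print("Hello")

def goodbye():
    print("Goodbye")"""

python_splitter.split_text(python_code)
# The chunks tend to break at the function definitions rather than mid-statement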

Now, we will look into some examples of text splitters in LangChain with some toy use cases.

# Recursive text splitter on text shorter than the chunk size
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)
# Output - ['abcdefghijklmnopqrstuvwxyz']

# Recursive text splitter on text longer than the chunk size
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)
# Output - ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

# Recursive text Splitter
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)
# output - ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

# Character Text Splitter
c_splitter.split_text(text3)
# output - ['a b c d e f g h i j k l m n o p q r s t u v w x y z']

# Character Text Splitter with separator defined
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text3)
# Output - ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Recursive Splitting

Now, we will try some real-world examples and see how RecursiveCharacterTextSplitter and CharacterTextSplitter behave differently.

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

len(some_text) -> 496

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Here, the CharacterTextSplitter uses a space as its separator, while for the RecursiveCharacterTextSplitter we pass a list of separators.

Calling c_splitter.split_text(some_text) gives:

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
'have a space.and words are separated by space.']

Calling r_splitter.split_text(some_text) gives:

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In the case of the RecursiveCharacterTextSplitter, we pass a list of separators: double newline, single newline, space, and the empty string. It first tries to split the text on double newlines, then splits the resulting chunks on single newlines, then on spaces, and finally character by character. The recursive splitter splits on the double newline here, so the text ends up as two paragraphs. The first paragraph is shorter than 450 characters because the split on the double newline is considered the better split. The CharacterTextSplitter, by contrast, splits only on spaces, so we end up with an awkward break in the middle of a sentence.

Now, we will run one more real-world example of TextSplitter with a PDF.

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

docs = text_splitter.split_documents(pages)

len(docs) -> 77
len(pages) -> 22

Here, we also passed the length function, which is Python's built-in len and is the default.

Token splitting

So far, we have split text based on characters. We can also split based on token count, which is useful because LLM context windows are specified in tokens. A token is generally around four characters.

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)
# ['foo', ' bar', ' b', 'az', 'zy', 'foo']
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)

docs[0]
# Document(page_content='MachineLearning-Lecture01 \n', metadata={'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})

pages[0].metadata
# {'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}

Context-aware splitting

The purpose of chunking is to keep text with a common context together. Text splitting often relies on sentences or other delimiters to keep related text together, but many documents (such as Markdown) have explicit structure, like headers, that can be used for splitting.

We can use the Markdown header text splitter for this purpose to preserve header information in our chunks. It splits a Markdown file based on headers and subheaders and adds those headers to the metadata fields, which get passed along to any chunks that originate from those splits.

from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

We have a document with a title and then a subheader (chapter 1) with some sentences. Then, we have another section with a subheader (chapter 2) and some sentences there.

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n
## Chapter 2\n\n \
Hi this is Molly"""


headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

Now, we define our MarkdownHeaderTextSplitter.

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

Finally, we get text splits as follows:

md_header_splits[0]
# Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

md_header_splits[1]
# Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})

We were able to get semantically relevant chunks with appropriate metadata. Now, we will move these chunks of data into a vector store.

Vector Stores and Embeddings

We have split our document into small chunks, and now we need to put these chunks into an index so that we can retrieve them easily when we want to answer questions about this document. We use embeddings and vector stores for this purpose.

Vector Store and Embeddings in LangChain

Vector stores and embeddings come after text splitting, as we need to store our documents in an easily accessible format. Embeddings take a piece of text and create a numerical representation of it, so text with semantically similar content will have similar vectors in the embedding space. This lets us compare embeddings (vectors) and find texts that are similar.

The whole pipeline starts with documents. We split these documents into smaller splits and create embeddings of those splits or documents. Finally, we store all these embeddings in a vector store.

Storage of embeddings in a vector store

A vector store is a database where you can easily look up similar vectors later on. This becomes useful when we try to find documents that are relevant to a question.

Question Answer using comparison of embeddings in vector store

Thus, when we want to answer a question, we create an embedding of the question, compare it with all the vectors in the vector store, and pick the n most similar chunks. Finally, we pass these n chunks along with the question to an LLM and get the answer.
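To make this concrete, here is a small sketch (assuming numpy is installed) that embeds a few sentences and compares them with a dot product; semantically similar sentences should score higher:

from langchain.embeddings.openai import OpenAIEmbeddings
import numpy as np

embedding = OpenAIEmbeddings()

emb1 = embedding.embed_query("i like dogs")
emb2 = embedding.embed_query("i like canines")
emb3 = embedding.embed_query("the weather is ugly outside")

# Similar sentences should have a higher dot product than unrelated ones
print(np.dot(emb1, emb2))  # relatively high
print(np.dot(emb1, emb3))  # relatively low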

Now, we will see how we load a set of documents into a vector store.

from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

We use RecursiveCharacterTextSplitter to create chunks after documents are loaded.

# Define the text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

# Create a split of the document using the text splitter
splits = text_splitter.split_documents(docs)

Now, we will create embeddings for all the chunks of the PDFs and store them in a vector store. We use OpenAI to create these embeddings and Chroma as the vector store. Chroma is lightweight and in-memory, which makes it easy to get started with.

from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

We save this vector store in a persistent directory so that we can use it later.

persist_directory = 'docs/chroma/'

# Create the vector store
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

We pass the splits created earlier, the OpenAI embedding model, and the persist directory to create the vector store.

Similarity Search

We will now ask questions using the similarity_search method and pass k, which specifies the number of documents we want to return.

question = "is there an email i can ask for help"

docs = vectordb.similarity_search(question,k=3)

# Check the number of documents returned
len(docs)

# Check the content of the first document
docs[0].page_content

# Persist the database to use it later
vectordb.persist()

Similarity Search: Edge Cases

A basic similarity search gets most of the results correct. But, there are some edge cases where similarity search fails. Now, we will make another query and will check for duplicate results.

question = "what did they say about matlab?"

# Similarity search with k = 5
docs = vectordb.similarity_search(question,k=5)

# Check for first two results
print(docs[0])
print(docs[1])

Here, the first two results are identical because we loaded a duplicate PDF (MachineLearning-Lecture01.pdf) at the beginning, so we got duplicate chunks and would pass both of them to the language model. We can conclude that semantic search fetches all similar documents but does not enforce diversity. In the next section, we will cover how to retrieve chunks that are both relevant and distinct.

There is another failure mode of similarity search, which we can see with the following query.

question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(question,k=5)


# Print the metadata of the similarity search result
for doc in docs:
    print(doc.metadata)

print(docs[4].page_content)

We checked the metadata of the search results, i.e. the lectures they came from, and we can see that the results came from the third, second, and first lectures. The likely reason for this failure is that "documents from the third lecture only" is a piece of structured information, but we are just doing a semantic lookup based on embeddings, and the embedding focuses on the word regression without capturing the constraint about the third lecture. So we get all results that are relevant to regression. We can check this by printing the fifth document and confirming that it does in fact mention the word regression.

Retrieval

Retrieval is the centrepiece of our retrieval augmented generation (RAG) flow and one of the biggest pain points when doing question answering over documents: most of the time, when question answering fails, it is due to a mistake in retrieval. We will also discuss some advanced retrieval mechanisms in LangChain, such as self-query and contextual compression. Retrieval matters at query time, when a query comes in and we want to retrieve the splits most relevant to it.

Retrieval of splits in context of a query

We saw that semantic search worked pretty well for a good amount of use cases. But it failed for some edge cases. Thus, we are going to deep dive into retrieval and discuss a few different and more advanced methods to overcome these edge cases.

  1. Accessing/indexing data in the vector store
  • Basic semantic similarity
  • Maximum Marginal Relevance
  • Including Metadata

  2. LLM-aided retrieval

  3. Contextual Compression

1. Maximum Marginal Relevance (MMR)

MMR is an important method to enforce diversity in the search results. With plain semantic search, we get the documents most similar to the query in the embedding space, and we may miss out on diverse information. For example, for the query "Tell me about all-white mushrooms with large fruiting bodies", the two most similar documents both repeat information about a large fruiting body and being all-white, while we miss other information that is important but less similar to the query. MMR helps solve this problem by selecting a diverse set of documents.

Maximum Marginal Relevance (MMR)

The idea behind MMR is that we first query the vector store and fetch the "fetch_k" most similar responses. We then work on this smaller set of "fetch_k" documents, optimizing for both relevance to the query and diversity among the results, and finally choose the "k" most diverse responses from these "fetch_k" candidates. If we print the first 100 characters of the first two documents, we find that we get the same result as with the similarity search above. Now, we will run a search query with MMR and look at the first few results.

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)
question = "Tell me about all-white mushrooms with large fruiting bodies"
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

Here, we were able to get diverse results by using MMR search as mentioned above. Now, we will compare the results of similarity search and maximum marginal relevance search.

# Compare the results of similarity search and MMR search
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)
docs_ss[0].page_content[:100]
docs_ss[1].page_content[:100]

docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
docs_mmr[0].page_content[:100]
docs_mmr[1].page_content[:100]

We can see that the first 100 characters of the first two documents are identical with similarity search, whereas with MMR they are different, so we get some diversity in the query results.

2. Metadata

Metadata is also used to address specificity in the search. Earlier, we found that the answer to the query “What did they say about regression in the third lecture?” returned results not just from the third lecture but also from the first and second lectures.

To address this, we will specify a metadata filter. Many vector stores support operations on metadata, so we pass the requirement that the source must be the third lecture's PDF. Here, metadata provides context for each embedded chunk.

question = "what did they say about regression in the third lecture?"

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

# Print metadata of the document retrieved
for d in docs:
    print(d.metadata)

Now, if we look at the metadata of the retrieved documents, we can see that all of them come from the third lecture.

3. Self Query

Self-query is an important tool when we want to infer the metadata filter from the query itself. We can use SelfQueryRetriever, which uses an LLM to extract:

  • The query string to use for the vector search
  • A metadata filter to pass in as well

Here, we use a language model to filter results based on metadata, but we don't need to manually specify the filters as we did earlier; the self-query retriever infers them for us.

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

This method is used when the query is not solely about the content we want to look up semantically, but also includes some metadata that we want to filter on.

We have two fields in the metadata, source and page. We need to provide a name, a description, and a type for each attribute. This information is used by the language model, so the descriptions should be as informative as possible.

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

We also need to describe what is actually in the document store. The LLM then infers the search query and the metadata filters that should be applied.

document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

Now we run the retriever with the following question.

question = "what did they say about regression in the third lecture?"
docs = retriever.get_relevant_documents(question)

For example, consider the query "What are some movies about aliens made in 1980?". This query has two components, and we can use the language model to split the original question into a metadata filter and a search term.

Addressing Specificity: Self Query with metadata

In this case, we look up "aliens" in our database of movies and filter on each movie's metadata, with 1980 as the year of the movie. Most vector stores support metadata filters, so we don't need any new databases or indexes. A sketch of this setup is shown below.
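As a sketch (using a hypothetical movie_vectordb and a made-up year attribute, not the lecture data), the setup mirrors what we did above: describe the attribute and let the retriever infer the filter from the question.

# Hypothetical movie vector store with a "year" metadata field
movie_metadata_info = [
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
]
movie_retriever = SelfQueryRetriever.from_llm(
    llm,
    movie_vectordb,  # assumed: a vector store of movie descriptions
    "Brief summary of a movie",
    movie_metadata_info,
    verbose=True
)

# The LLM splits this into the search term "aliens" and the filter year == 1980
movie_retriever.get_relevant_documents("What are some movies about aliens made in 1980?")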

4. Contextual Compression

Compression is another approach to improving the quality of retrieved docs. Since passing full documents through the application can lead to more expensive LLM calls and poorer responses, it is useful to pull out only the most relevant bits of the retrieved passages.

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Wrap our vectorstore
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

With compression, we run all the retrieved documents through a language model to extract only their most relevant segments, and then pass just those segments into the final language model call.

Contextual Compression

This comes at the cost of making more calls to the language model, but it also helps focus the final answer on only the most important things, so it is a bit of a tradeoff.
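We can also combine compression with the MMR search from earlier, so that the retrieved segments are both condensed and diverse; this is a small variation on the code above, passing search_type="mmr" to the base retriever.

# Combine contextual compression with MMR-based retrieval
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)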

Question Answering

We have discussed how to retrieve documents relevant to a question. Now, we take those documents and the original question, pass both to a language model, and ask it to answer the question.

RetrievalQA Chain

We will first see how to do question answering after multiple relevant splits have been retrieved from the vector store. We may also need to compress the relevant splits to fit into the LLM context. Finally, we send these splits along with a system prompt and human question to the language model to get the answer.

Retrieval QA Chain

By default, we pass all the chunks into the same context window, i.e. into a single call to the language model. But if the number of documents is high and they cannot all fit in the context window, we can use other methods: MapReduce, Refine, and MapRerank. We will look into these methods later in this section.

We will first load the vector database that we persisted in earlier.

# Load vector database that was persisted earlier and check collection count in it
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
print(vectordb._collection.count())

We will first do a similarity search to check if the database is working properly.

question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

Now, we will use the RetrievalQA chain to get the answer to this question. For this, we initialize the language model (the ChatOpenAI model). We set the temperature to zero, as zero temperature gives low-variability, more factual and reliable answers.

from langchain.chat_models import ChatOpenAI

# Assuming the chat model used in the course
llm_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_name, temperature=0)

We also need the RetrievalQA chain, which does question answering backed by a retrieval step. It is created by passing the language model and the vector database as a retriever.

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

Now, we call qa_chain with the question that we want to ask.

# Pass question to the qa_chain
question = "What are major topics for this class?"
result = qa_chain({"query": question})
result["result"]

RetrievalQA chain with Prompt

Let's try to understand a little better what's going on under the hood. First, we define the prompt template, which has instructions about how to use the context and a placeholder for the context variable. The prompt takes in the documents and the question and passes them to a language model.

from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

We create a new retrieval QA chain using a language model, a vector database and a few new arguments.

# Initialize the chain
# Set return_source_documents to True to get the source documents back
# Pass the prompt template via chain_type_kwargs
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

This time, we will try a new question and check the result.

question = "Is probability a class topic?"
result = qa_chain({"query": question})
# Check the result of the query
result["result"]
# Check the source document from where we
result["source_documents"][0]

So far, we have used the "stuff" method by default, which stuffs all the documents into the final prompt and involves only one call to the language model. But if we have too many documents, they may not fit inside the context window. In such cases, we can use different techniques, namely map_reduce, refine, and map_rerank.

RetrievalQA Chain with MapReduce, Refine and MapRerank

In the MapReduce technique, each of the individual documents is first sent to the language model to get an initial answer, and then these answers are composed into a final answer with a final call to the language model. This involves many more calls to the language model, but it has the advantage that it can operate over arbitrarily many documents.

This method has two limitations: it is slower than the previous one, and the result can be worse. This may happen when information is spread across two documents, because that information is never present in the same context.

qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

The RetrievalQA chain calls MapReduceDocumentsChain under the hood. This involves four separate calls to the language model (ChatOpenAI in this case), one for each of the retrieved documents. The results of these calls are combined in a final chain (StuffDocumentsChain), which stuffs all the responses into a final call that uses the system message, the four summaries from the previous documents, and the user question to get the answer.

qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_mr({"query": question})
result["result"]

If we use "refine" as the chain type, the RetrievalQA chain invokes RefineDocumentsChain, which involves four sequential calls to an LLM chain. Each call uses a prompt that includes the system message defined in the prompt template, the context (one of the retrieved documents), the user question, and the answer so far. The final prompt sent to the language model combines the previous response with new data and asks for an improved, refined response given the added context. This runs four times, over all the documents, before arriving at the final answer. The refine chain tends to give a better answer because it combines information sequentially, so more information carries over than with the MapReduce chain.
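The third option, map_rerank (a brief sketch, not shown in the lecture code), sends each document to the language model separately, asks it to answer and to score how confident it is, and then returns the highest-scoring answer.

# MapRerank: answer per document, score each answer, return the best one
qa_chain_mrr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_rerank"
)
result = qa_chain_mrr({"query": question})
result["result"]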

RetrievalQA limitations

One of the biggest disadvantages of RetrievalQA chain is that the QA chain fails to preserve conversational history. This can be checked as follows:

# Create a QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

We will now ask a question to the chain.

question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]

Now, we will ask a second question to the chain.

question = "why are those prerequesites needed?"
result = qa_chain({"query": question})
result["result"]

We got a reply from the chain that was not related to the previous answer. Basically, the RetrievalQA chain doesn't have any concept of state: it doesn't remember previous questions or previous answers. In order for the chain to remember them, we need to introduce the concept of memory. This ability to remember previous questions and answers is required for chatbots, where we want to ask follow-up questions or ask for clarification about earlier answers.
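As a quick preview, here is a minimal sketch (assuming ConversationBufferMemory and ConversationalRetrievalChain from LangChain) of how memory could be wired in:

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Keep the chat history in a buffer and pass it along with every new question
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=vectordb.as_retriever(),
    memory=memory
)

result = qa({"question": "Is probability a class topic?"})
result = qa({"question": "why are those prerequesites needed?"})  # now answered with context
result["answer"]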

We discussed how to use LangChain to load data from a variety of documents and how to split the documents into chunks. After that, we created embeddings for these chunks and put them into a vector store, and then did semantic search using this vector store. Semantic search fails in certain edge cases, so we then covered retrieval and the various retrieval algorithms that overcome these edge cases. We combined retrieval with LLMs in question answering, where we take the retrieved documents and the user question and pass them to an LLM to generate an answer. We did not discuss the conversational aspect of question answering; I will cover that later by building an end-to-end chatbot over our data. I have also discussed how to have conversations with structured data in my article Context Management in text2sql tasks.
