Using AI to chat with documents: Leveraging LangChain, FAISS, and OpenAI

Ahmed Mohiuddin
5 min read · Aug 9, 2023


Introduction

In the age of information overload, documents remain timeless repositories of valuable knowledge. However, turning these unstructured data sources into actionable insights has been a persistent challenge. In this article, I show how a set of powerful technologies can be combined into a seamless question-answering pipeline, orchestrated by LangChain.

OpenAI is a company that develops and provides access to Large Language Models (LLMs). These models are trained on massive datasets of text and code, and they can be used to generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. OpenAI’s large language models are a powerful tool for answering questions from documents, and they can be used to generate more natural and informative answers than other approaches.

LangChain 🦜️🔗 is a framework for developing applications powered by language models. It provides modular abstractions for the components necessary to work with LLMs while also leveraging the reasoning capabilities of LLMs to perform tasks.

FAISS, or Facebook AI Similarity Search, is a library that unlocks the power of similarity search algorithms, enabling swift and efficient retrieval of relevant documents based on semantic similarity. With its high-dimensional indexing capabilities and fast search performance, it acts as our compass, directing us towards the most pertinent documents it stores as vectors.

Answering questions from a document involves the following steps:

  • Splitting the document into smaller chunks.
  • Converting the text chunks into embeddings.
  • Performing a similarity search on the embeddings.
  • Generating answers to questions using an LLM.

Splitting the document into smaller chunks

The first step in answering questions from documents is to load the document. LangChain provides document loaders that can help with this. For example, the PyPDFLoader can be used to load PDF documents. The loaded pages can then be split into smaller chunks. This is necessary because LLMs can only process a limited amount of text at a time. For example, the gpt-3.5-turbo model has a maximum limit of 4,096 tokens shared between the prompt and the completion. LangChain's CharacterTextSplitter can be used here to split the text into chunks of a configurable size.
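
As a minimal sketch of this step (using the same PDF path as the full example later in this article):

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader("pdf/constitution.pdf")
documents = loader.load()

# Split pages into ~1000-character chunks with a small overlap,
# so text spanning a chunk boundary is not lost entirely
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=30, separator="\n")
docs = text_splitter.split_documents(documents=documents)
print(f"Split {len(documents)} pages into {len(docs)} chunks")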

Converting chunks into embeddings

Embeddings are numerical representations that capture the semantic essence of words, phrases, or sentences. The idea is to create vectors in a high-dimensional space such that the distance between vectors carries semantic meaning.

LangChain provides an abstraction for interfacing with the embedding model via the Embeddings class. We will be using the embeddings model provided by OpenAI.

We then use LangChain's abstraction over FAISS, passing it the chunks and the embedding model; it converts the chunks into vectors. These vectors can be kept in memory or persisted to local disk.
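
A minimal sketch of this step, continuing from the docs list produced by the splitter above (it assumes your OpenAI API key is available in the OPENAI_API_KEY environment variable):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embedding model; reads the OPENAI_API_KEY environment variable
embeddings = OpenAIEmbeddings()

# Embed every chunk and build an in-memory FAISS index over the vectors
vectorstore = FAISS.from_documents(docs, embeddings)

# Optionally persist the index to local disk so it can be reloaded
# later without re-embedding the document
vectorstore.save_local("faiss_index_constitution")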

A vector is a fundamental mathematical concept that represents both magnitude and direction. In simpler terms, you can think of a vector as an arrow in space, where the length of the arrow represents the magnitude of the vector, and the direction in which it points indicates its orientation. In the context of natural language processing and embeddings, vectors are used to represent words, sentences, or documents in a numerical format. These vectors capture semantic information, allowing computers to perform operations like measuring similarity or performing mathematical computations on text data.
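
To make "distance with meaning" concrete, here is a toy illustration using cosine similarity. The three-dimensional vectors are invented purely for illustration; real embeddings, such as OpenAI's, have over a thousand dimensions:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; closer to 1.0 means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 3-dimensional "embeddings" for three words
cat = np.array([0.9, 0.1, 0.2])
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.3])

print(cosine_similarity(cat, kitten))  # high score: semantically close
print(cosine_similarity(cat, car))     # lower score: semantically distant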

Performing a similarity search on the embeddings

We can use advanced algorithms and tools like FAISS (Facebook AI Similarity Search) to conduct this search. Imagine you need an answer to a question from a specific document. FAISS acts like a guide, helping you identify embeddings that are closest in resemblance to what you're seeking.

Similarity search on embeddings helps us find articles, paragraphs, or sentences that are closely related to the question at hand. It's as if we're using a telescope to spot constellations of relevant information amidst the vast universe of data. The search transforms language into a space where we can measure how similar things are, enabling us to sift through information, pinpoint relevant content, and ultimately deliver accurate answers that align with the context of our questions.
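
With the FAISS index built earlier, running such a search directly looks like this (the query string and the choice of k, the number of chunks to retrieve, are illustrative):

# Retrieve the 4 chunks whose embeddings are closest to the query
query = "What does the Constitution say about free speech?"
relevant_docs = vectorstore.similarity_search(query, k=4)

for doc in relevant_docs:
    print(doc.page_content[:200])  # preview each matching chunk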

Generate answers to questions using an LLM

Once the most similar chunks have been found, the next step is to generate an answer to the question using an LLM. Here is where LangChain shines, as it does all the heavy lifting for us and orchestrates the whole process. To generate an answer, LangChain passes the question, together with the most similar chunks it retrieved from FAISS, to the LLM as input. The LLM then uses this input to generate a text response that is relevant to the question. We use LangChain's RetrievalQA chain to accomplish this.

Putting all of the steps discussed above together, here is a complete example of chatting with a PDF document in Python using LangChain, OpenAI, and FAISS. The example uses a PDF containing the Constitution of the United States.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI


def query_pdf(query):
    # Load document using PyPDFLoader document loader
    loader = PyPDFLoader("pdf/constitution.pdf")
    documents = loader.load()
    # Split document into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=30, separator="\n")
    docs = text_splitter.split_documents(documents=documents)

    embeddings = OpenAIEmbeddings()
    # Create vectors
    vectorstore = FAISS.from_documents(docs, embeddings)
    # Persist the vectors locally on disk
    vectorstore.save_local("faiss_index_constitution")

    # Load from local storage
    persisted_vectorstore = FAISS.load_local("faiss_index_constitution", embeddings)

    # Use RetrievalQA chain for orchestration
    qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=persisted_vectorstore.as_retriever())
    result = qa.run(query)
    print(result)


def main():
    query = input("Type in your query: \n")
    while query != "exit":
        query_pdf(query)
        query = input("Type in your query: \n")


if __name__ == "__main__":
    main()
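
Note that, for simplicity, this example rebuilds the FAISS index on every query. In a real application you would build and persist the index once, then load it with FAISS.load_local and reuse it across queries.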

Conclusion

In conclusion, we have discussed the topic of answering questions from documents using LangChain, FAISS, and OpenAI. We have seen how LangChain drives the whole process: it splits the PDF document into smaller chunks, uses FAISS to perform similarity search over the chunks, and uses OpenAI's models to generate answers to questions. We have also seen how these technologies can be combined into a system that answers questions from a PDF document in a natural and informative way.
