Question Answering over Documents

Weston Bassler
The Emburse Tech Blog
Jul 19, 2023

As with most organizations these days, we have recently been spending a lot of time researching, investigating, and playing around with Large Language Models (LLMs). LLMs are an amazing technology that can solve a variety of problems, but due to factors such as complexity, cost, and size, they are not for every organization. Although I believe we have multiple use cases where an LLM could be beneficial, one immediate example is being able to ask questions over documents. These documents could be anything from a contract, to a textbook, to any of our internal documentation sites.

This initial use case provides a great introduction to getting started with LLMs and showcases their potential. This post describes some of the research we are doing around LLMs, our findings, and how to get started using them to do “Question Answering over Documents”.

This is not a primer explaining how “Question Answering over Documents” works, as there are a ton of resources out there that already describe this, but rather explains the technologies we used to accomplish each aspect.

If you wish to follow along, each section contains code blocks that can be run in order from a Jupyter Notebook.

Overview

It was decided early on that we would not use an external LLM such as ChatGPT due to data privacy concerns; ALL of our data must be kept in-house. We also decided that we would invest in open source (Apache 2.0) wherever we could.

We utilize LangChain to index our documents, set up a vector store, and retrieve answers from an LLM. We investigated several models locally from Hugging Face (Dolly, Falcon, MPT, and Flan) and very quickly decided that we were getting the best and most consistent results from the Falcon Instruct models. For this post, we will be looking at the 7B instruct model.

The following was tested with Python 3.9.16 and requires the following packages installed:

pip install -U pypdf torch transformers langchain ipywidgets accelerate \
sentence_transformers pyarrow pandas bitsandbytes einops xformers

Documents

We begin with PDFs. PDFs are the most common file type that we see and the most readily available to us for testing. LangChain has several different PDF loader modules, but the one we found the most success with was PyPDFLoader, which appears to provide the best structure for our documents, especially the ones that contain tables. Another advantage of this loader is that it provides metadata about the document, such as the page number, which we can return as part of the answer.

As an example PDF, we will use an eBook from NVIDIA called “A Beginner’s Guide to Large Language Models”. You can download your own copy from this link.

from langchain.document_loaders import PyPDFLoader

# Load the PDF file from current working directory
loader = PyPDFLoader("llm-ebook-part1.pdf")

# Split the PDF into Pages
pages = loader.load_and_split()
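If you want to confirm the page-level metadata mentioned above, each loaded page is a LangChain Document that carries a metadata dictionary you can inspect directly:

# Inspect the metadata attached to the first page
print(pages[0].metadata)
# {'source': 'llm-ebook-part1.pdf', 'page': 0}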

Because our PDF file contains several pages of text, it is likely that we won't be able to send all the pages, or possibly even an entire page, to the model at one time. Doing so would likely run us into token length errors.

To avoid that error and fit the text into the Falcon model's context window, we are going to split the pages into chunks of text.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define chunk size, overlap and separators
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
    separators=['\n\n', '\n', '(?=>\. )', ' ', '']
)

# Split the pages into texts as defined above
texts = text_splitter.split_documents(pages)

The above splits the page text into chunks no larger than 1,024 characters, with a bit of overlap to allow for some continued context (see the video here for a better understanding). The separators tell the splitter to break on new lines first, then periods, then spaces, and, as a last resort, anywhere.

If you are curious, you can see how many pages and text chunks were produced using the following:

# Print the number of pages and number of texts
print(len(pages), len(texts))
# 27, 72

Also, if you want to take a look at how the pages were loaded and how the text was split, you can print the page content of both using the following sample code:

# Print a full page of text 
print(pages[2].page_content)

# Print the split text
print(texts[50].page_content)

# Print the len of the above split text. Note it will not be more than 1024.
print(len(texts[50].page_content))
# 1023

LangChain contains a ton of other loaders beyond PDFs. Others of interest to us include Snowflake, CSV, Confluence, GitHub, Word, DataFrame, PySpark DataFrame, and more. We also investigated LlamaIndex because of its indexing capabilities but did not have much success with some of its loaders. LlamaHub is definitely something we will be keeping our eye on, but at this time LangChain does what we need and more.

Embeddings

I will not go into detail about what an embedding model is, but the point of an embedding model is to produce a vector representation of a piece of text. The embedding model will be used alongside our vector store to return the pieces of text whose vectors are numerically closest to the question that we ask.

An example of similarity search in the next section will make this clearer, but let's go ahead and bring in our embedding model. We decided to use the mpnet-base-v2 sentence transformer model from Hugging Face and had good enough success that we didn't investigate others.

from langchain.embeddings import HuggingFaceEmbeddings

# Load embeddings from HuggingFace
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
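If you are curious what an embedding actually looks like, you can embed a question directly and inspect the resulting vector. This is just a quick sanity check; the exact values will differ, but all-mpnet-base-v2 always produces 768 dimensions:

# Embed a sample question and inspect the resulting vector
sample_vector = embedding.embed_query("What is a large language model?")
print(len(sample_vector))
# 768
print(sample_vector[:3])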

Vector Store

We decided to use SKLearnVectorStore as our vector store to index our documents. The main reason for choosing this module for experimentation was that we could persist the vectors to a parquet file and then load that file into pandas for investigation. Being able to do this helped us better understand how vector stores work and what is being stored in them.

As we got started and allowed users to begin uploading documents, we would save the document as well as the vector store in a parquet file. This allowed us to better understand user feedback, since we could inspect and compare the documents as well as the embedding data in the vector store.

from langchain.vectorstores import SKLearnVectorStore
# Set the persisted vector store
vector_db_path = "./document_vector_db.parquet"

# Create the vector store
vector_db = SKLearnVectorStore.from_documents(
    texts,
    embedding=embedding,
    persist_path=vector_db_path,
    serializer="parquet"
)

Coming back to similarity search, to make it clearer how the embedding model works alongside the vector store, you can execute the code below. The call returns the four "texts" with the highest similarity scores. This is the context that will be passed to the LLM to make its prediction.

print(vector_db.similarity_search("What is a large language model?"))
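If you also want to see the scores themselves, most LangChain vector stores, SKLearnVectorStore included, expose a scored variant of the search. A quick sketch:

# Return (document, score) pairs instead of documents alone
for doc, score in vector_db.similarity_search_with_score("What is a large language model?", k=4):
    print(round(score, 4), doc.metadata)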

As mentioned above, you can also view the vector store using pandas. First, persist the store locally; persisting creates a local file, which we named "document_vector_db.parquet" and assigned to the variable "vector_db_path". Second, load it as a parquet file in pandas.

import pandas as pd

# persist the store
vector_db.persist()

# load into pandas
df = pd.read_parquet(vector_db_path)

# Have a look at the store
df

# Show first row of store
# df.iloc[0]
#ids 94fb087f-6206-430d-8fb3-2549b1c64d1a
#texts A Beginner’s Guide to \nLarge Language Models\...
#metadatas {'page': 0, 'source': 'llm-ebook-part1.pdf'}
#embeddings [0.033385541290044785, 0.020513998344540596, -...
#Name: 0, dtype: object

Here you will find the texts as well as their embeddings and the metadata for page number and source.

Model

As mentioned, we tested several models but found Falcon 7B Instruct to be the best and most consistent. Obviously, Falcon 40B Instruct would likely provide even better results, but it requires far too much hardware for testing and investigation. Falcon 7B can very simply be run locally and easily fits on a single GPU machine with a lower memory and CPU footprint.

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain import HuggingFacePipeline

#Set model
model = "tiiuae/falcon-7b-instruct"
# If GPU available use it
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model,
    trust_remote_code=True,
    load_in_8bit=True,
    device_map='auto'
)
# Set to eval mode
model.eval()
# Create a pipeline
generate_text = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    max_new_tokens=100,
    repetition_penalty=1.1,
    model_kwargs={
        "device_map": "auto",
        "max_length": 1200,
        "temperature": 0.01,
        "torch_dtype": torch.bfloat16
    }
)
# LangChain HuggingFacePipeline set to our transformer pipeline
hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

I will not go into all the different options here but want to point out a couple of things:

In our transformers pipeline config, we set "max_new_tokens" to 100. This means that the model will only generate up to 100 tokens. This is fine for what we are testing but might not be enough for your use case; it can be adjusted lower or higher.

We also set "max_length" in our transformers pipeline as a model argument. This means that the model can handle a maximum of 1,200 input tokens. This is important because the Falcon default is much lower than this. Also, if you recall, when we split our text above we set the "chunk_size" to 1,024. If we do not set "max_length" higher than "chunk_size" + "chunk_overlap", we will get errors.

It is important to note that tokens coming from the prompt must also be counted against "max_length". If you have a custom prompt or memory, "max_length" would need to be even higher. So it's best to budget for "chunk_size" + "chunk_overlap" + prompt tokens. Another great video by James Briggs explains this.
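A quick way to sanity-check that budget is to count the tokens a chunk plus a prompt actually produces, using the same tokenizer the model uses. The prompt string below is just an illustrative placeholder, not the prompt the chain uses internally:

# Count tokens for a sample chunk plus a placeholder prompt prefix
sample_prompt = "Use the following context to answer the question:\n"
sample_input = sample_prompt + texts[50].page_content
print(len(tokenizer.encode(sample_input)))
# Should stay comfortably below the 1,200 max_length set above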

Lastly, let’s point out that although we are using a pipeline from transformers, we must pass this pipeline into a “HuggingFacePipeline” from LangChain.

Retrieval

Finally, the magical part. We use the RetrievalQA class to create a chain that combines our vector store, our embeddings, and our model. Again, we aren't going to go into detail about how the retrieval process works, but you can read more about it in this article.

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=hf_pipeline,
    chain_type="stuff",
    retriever=vector_db.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    verbose=False,
)

Now we can send a query (question) to our chain and get an answer from our model. Retrieval QA expects a dictionary containing our query.

query = "Who is this book for?"
# Send question as a query to qa chain
result = qa({"query": query})

The response also comes back as a dictionary. Below is a trimmed-down version of it.

print(result)
{'query': 'Who is this book for?',
'result': '\nThis book is for anyone interested in learning about large language models, including researchers, developers, and students. It covers the basics of language models, their applications, and their potential benefits for businesses and individuals.',
'source_documents': [...trimmed; we will cover these below...]}

As you can see above, our chain returns the "query", the "result", and the "source_documents". You can trim this down to just the answer by printing only the "result".

print(result["result"])
# \nThis book is for anyone interested in learning about large language models, including researchers, developers, and students. It covers the basics of language models, their applications, and their potential benefits for businesses and individuals.

Because we also specified "return_source_documents" in our chain, we receive additional information about the sources used from our vector store. In our case, since we told our retriever to use 3 sources (search_kwargs={"k": 3}), we get 3 source documents back.

print(len(result['source_documents']))
# 3

What is really cool about “source_documents” is that you can actually see additional metadata such as page number and the name of the document used to generate a result.

# Print page document source and page number of first source_document
print(result['source_documents'][0].metadata['source'], result['source_documents'][0].metadata['page'])
# llm-ebook-part1.pdf 7

Lessons Learned

Things that we learned along the way or found to be interesting include:

Loading in 8-bit gives the fastest inference

We tested inference with the 4-bit loader as well as a quantized configuration using bitsandbytes. Loading in 8-bit gave quite a speed boost compared to the others: on average, 4-bit responses were taking about 45 seconds to a minute, whereas 8-bit normally landed between five and 20 seconds.
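For reference, the 4-bit load we compared against looked roughly like the following sketch, using a bitsandbytes quantization config through transformers; your exact settings may differ:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantization config for a 4-bit load
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)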

Structured documents give better results

This is fairly intuitive. Books or PDFs that don't contain things like images or tables tend to give much faster and better results. Given how some of our contracts are structured, we got mixed results, with the more cleanly structured ones working better.

Perhaps some additional preprocessing could be done to make the text more structured, as in the sketch below.
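As a simple sketch of that idea, you could normalize whitespace in each page before splitting. The regex below is just one illustrative approach, not something we have settled on:

import re

# Collapse runs of spaces and excessive blank lines before chunking
for page in pages:
    page.page_content = re.sub(r"[ \t]+", " ", page.page_content)
    page.page_content = re.sub(r"\n{3,}", "\n\n", page.page_content)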

Prompting provided very little enhancement to our results

This applies mostly to the unstructured documents that had poor or mixed results; additional prompting did not prove helpful in those cases. We created several custom prompts to try to enhance the context, but without success.
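For completeness, this is roughly how a custom prompt gets wired into the chain. The template text itself is only illustrative:

from langchain.prompts import PromptTemplate

# Illustrative prompt template; "context" and "question" are the variables
# the "stuff" chain type expects
prompt_template = """Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know.

{context}

Question: {question}
Answer:"""

custom_prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

qa_with_prompt = RetrievalQA.from_chain_type(
    llm=hf_pipeline,
    chain_type="stuff",
    retriever=vector_db.as_retriever(search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": custom_prompt}
)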

SKLearnVectorStore appears to be a bit slower than others

Above, we used SKLearnVectorStore to gain a better understanding of what was going on. We also investigated DocArrayInMemorySearch, which keeps the vector store in memory. DocArrayInMemorySearch proved to be faster both on queries and when loading documents.
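Swapping it in is a near drop-in change. A minimal sketch, assuming the docarray package is installed:

from langchain.vectorstores import DocArrayInMemorySearch

# In-memory alternative that we found faster for loading and querying
memory_db = DocArrayInMemorySearch.from_documents(texts, embedding=embedding)
print(memory_db.similarity_search("What is a large language model?")[0].metadata)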

Next Steps

We have already been investigating putting a UI in front of the vector store and model using Streamlit, so users can more easily upload documents and have a more chat-like experience.
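As a rough sketch of what that could look like, assuming the loading, splitting, and chain-building steps above are wrapped into a hypothetical build_qa_chain helper:

import streamlit as st

st.title("Question Answering over Documents")

# build_qa_chain is a hypothetical helper wrapping the loading, splitting,
# embedding, and RetrievalQA steps shown earlier in this post
uploaded_file = st.file_uploader("Upload a PDF", type="pdf")
question = st.text_input("Ask a question about the document")

if uploaded_file and question:
    with open("uploaded.pdf", "wb") as f:
        f.write(uploaded_file.getbuffer())
    qa_chain = build_qa_chain("uploaded.pdf")
    result = qa_chain({"query": question})
    st.write(result["result"])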

Future research will go toward:

  • Conversation memory (see the sketch after this list)
  • Other document types
  • Few shot prompting to see if it will improve unstructured document predictions
  • Flexible UI for users
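For conversation memory, one direction we are looking at is LangChain's ConversationalRetrievalChain combined with a memory object. A minimal sketch, reusing the pipeline and vector store from above:

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Keep chat history so follow-up questions retain context
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chat_qa = ConversationalRetrievalChain.from_llm(
    llm=hf_pipeline,
    retriever=vector_db.as_retriever(search_kwargs={"k": 3}),
    memory=memory
)

print(chat_qa({"question": "Who is this book for?"})["answer"])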

At Emburse, we are always looking for ways to utilize AI and data science to improve technology both internally and for our customers. We will continue our pursuit of LLMs and fit them in where beneficial. We are already finding many areas where this technology can be effective.
