A LangChain chatbot using PDFs

Param Mehta
Published in USF-Data Science
Feb 1, 2024 · 14 min read

Table of Contents

· Abstract
· Problem Statement
· A brief about RAG (Retrieval Augmented Generation)
· RAG Components
· Coding
· Parsing PDFs
· Splitting
· Embedding
· Storing in a Vector Database
· Prompt
· Conversational Memory
· Model
· Conversational Chain
· Inference
· Conclusion
· Read more

Abstract

This blog talks about:

  • Building an LLM chatbot for customer support
  • Text extraction from PDFs
  • Information search from documents
  • Coding all of this with Python and LangChain

The entire code can be found as a Google Colab notebook here. Feel free to fork it and play around.

Problem Statement

My company had over 100 pages of unstructured documents containing information about the topics and issues that most customer queries revolve around. They wanted to integrate a chatbot into their app that could answer these queries using the given documents. Once deployed, the chatbot is expected to automate query resolution for over 5,000 app users, bringing the average response time down by 66%.

So how do we make an LLM use our private knowledge base to answer user questions?

A brief about RAG (Retrieval Augmented Generation)

There are two ways of imparting new knowledge to an LLM:

  • Fine-tuning - where you train the model on labelled data (mappings of user queries to desired AI answers) so that the weights of the model’s last few layers encode the new information.
  • RAG - which is like having a personalized search engine that searches your own documents rather than the internet. You ask a question, it retrieves relevant information from your documents, and then an LLM crafts a thoughtful answer using the question and the retrieved information.

I will be using the second method.

Why RAG

  • In most practical use cases, fine-tuning is not possible due to a lack of clean labeled data.
  • Fine-tuning models on limited labeled data can lead to overfitting.
  • Fine-tuning also requires significant computational resources, expertise, and time, while RAG is comparatively easy to set up.
  • Fine-tuned models need to be retrained repeatedly when the underlying information is dynamic. For RAG, you just need to modify the source documents or add new ones.
  • Fine-tuned LLMs often make up answers or answer from their prior knowledge, which might be factually incorrect. In RAG, you can limit this (hallucination) to some extent by prompting the LLM to answer strictly from the retrieved information and not from its prior knowledge.

RAG Components

A typical RAG pipeline looks like the following:

Image Source: https://python.langchain.com/docs/use_cases/question_answering/
  1. Loading structured or unstructured data (JSON, PDF, images) and converting it into a desirable format (txt).
  2. Splitting the documents into smaller chunks, since it’s easier to search and compare smaller pieces of text against the user query. Also, larger chunks may not fit into the context window of certain language models.
  3. Embedding the chunks as dense vectors
  4. Storing the embeddings in a vector database

Once you have completed the above steps, this is how you get an answer from a RAG system:

Image Source: https://python.langchain.com/docs/use_cases/question_answering/

The user question is converted into an embedding and compared with all the chunks stored in the vector database. Whichever chunk is closest to the user question is retrieved and fed to an LLM along with the question. (This can be the top k closest chunks based on vector similarity, but we will stick to the top 1.) The model can be prompted with specific instructions on how to use this context to produce an answer.
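
To make the flow concrete, here is an illustrative sketch of the retrieve-then-generate loop in plain Python. The names embed_text, vector_db and llm are hypothetical stand-ins for the concrete components we build later in this post.

def rag_answer(question: str) -> str:
    query_vector = embed_text(question)              # embed the user question
    context = vector_db.closest_chunk(query_vector)  # retrieve the most similar chunk
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                               # generate the final answer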

Coding

Getting back to the problem at hand, let’s see how to implement such a system using Python and Langchain.

Note: The reason I have used Langchain is that it’s perfect for simple use cases where you want to set up a quick baseline. In practice, it’s advisable not to rely too heavily on external frameworks, as they have many complicated abstractions that make it difficult to debug or customize code. For example, when I switched from a plain RetrievalQA chain to a ConversationalRetrievalChain, I had to dig through Stack Overflow threads and the Langchain repo to understand how to pass the system prompt and integrate conversation memory with the chain. So it’s often better to write your own helper functions, although you may copy some useful wrappers from the source code of these frameworks.

Alright, let’s get started!

Parsing PDFs

The pdf documents that I was working with had a fairly complex layout with multiple tables, nested sidebars, graphical elements and a multi-column structure. At first, I tried the document_loader module of Langchain, which uses pypdf to parse the documents, but it wasn’t giving good results. Here’s a comparison of a sample page from a pdf with the parsed output in a text file. If you look at the highlighted section in the text file, it combines the first bullet points of both tables into one sentence that makes no logical sense.

Since it wasn’t giving desirable results, I tried Google Cloud’s Vision API. To use it, the pdf documents must be stored in a GCS bucket. You also need to get an API key from Google Cloud and enable the Cloud Vision API. Set the environment variable ‘GOOGLE_APPLICATION_CREDENTIALS’ to the path of your API key’s JSON file.

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path_to_your_api_key.json'

Next, we define a helper function that does all the heavy lifting for us. The function takes as input the GCS URI of a pdf file and the GCS URI of a Cloud Storage folder where the parsed output will be stored.

import json
import re

from google.cloud import vision
from google.cloud import storage


def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """Asynchronous OCR with PDF/TIFF as source files on GCS.

    Args:
        gcs_source_uri (str): The GCS URI of the source document.
        gcs_destination_uri (str): The GCS URI for storing the OCR results.

    Returns:
        list: A list of extracted text, one string per page of the document.
    """

    # Define MIME type for PDF documents
    mime_type = "application/pdf"

    # Set batch size for asynchronous processing (one page per output file)
    batch_size = 1

    # Create a Vision API client
    client = vision.ImageAnnotatorClient()

    # Define the feature type for document text detection
    feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

    # Define GCS source and input configuration
    gcs_source = vision.GcsSource(uri=gcs_source_uri)
    input_config = vision.InputConfig(gcs_source=gcs_source, mime_type=mime_type)

    # Define GCS destination and output configuration
    gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size
    )

    # Create an asynchronous annotation request
    async_request = vision.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config, output_config=output_config
    )

    # Submit the asynchronous request and get the operation
    operation = client.async_batch_annotate_files(requests=[async_request])

    # Wait for the operation to finish (timeout: 7 minutes)
    print("Waiting for the operation to finish.")
    operation.result(timeout=420)

    # Initialize Storage client
    storage_client = storage.Client()

    # Extract bucket name and prefix from GCS destination URI
    match = re.match(r"gs://([^/]+)/(.+)", gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    # Get the bucket
    bucket = storage_client.get_bucket(bucket_name)

    # List all blobs (files) in the specified prefix
    blob_list = [
        blob
        for blob in list(bucket.list_blobs(prefix=prefix))
        if not blob.name.endswith("/")
    ]

    # Extract text from each OCR response and store in a list
    docs = []
    for filename in blob_list:
        json_string = filename.download_as_bytes().decode("utf-8")
        response = json.loads(json_string)
        response = response["responses"][0]
        annotation = response["fullTextAnnotation"]
        docs.append(annotation['text'])

    return docs

We iterate over all the documents, pass them to the above function and get a list of strings where each string is the text belonging to a single page.

filenames = ['gs://input_folder/doc1.pdf',
             'gs://input_folder/doc2.pdf',
             'gs://input_folder/doc3.pdf',
             'gs://input_folder/doc4.pdf']

output_path = 'gs://output_folder/'

docs = []

for filename in filenames:
    docs.extend(async_detect_document(filename, output_path))

I store these strings as text files in a local directory. The total number of text files will be equal to the total number of pages across all your pdfs.

for i, text in enumerate(docs):
    filename = f"/parsed_documents/{i}.txt"
    with open(filename, "w") as file:
        file.write(text)

Here is the result of performing OCR on the sample page that I showed above. The bounding boxes are fairly accurate and seem aware of the multi column layout.

OCR Results for a sample page

Shown below is the output txt file, compared with the original image. If you look at the highlighted section in the txt file, you can see that all the bullet points of the left table are together under the word ‘Previous’ and all the bullet points of the right table are together under the word ‘New’.

Note: If your documents are already in text format, you can skip the preceding steps and begin directly with the splitting stage. Keep in mind that the cleaner and better formatted your text files are, the better the system will perform. Therefore, invest sufficient time in preprocessing your files: adding appropriate headers, adjusting line breaks, eliminating links, special characters, redundant text, etc.
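
As a rough illustration, a cleaning pass might look something like the sketch below. The exact rules depend entirely on your documents; you could apply such a function to each page string before saving it, or to the file contents when loading them in the next section.

import re

def clean_text(text: str) -> str:
    """A light-touch, illustrative cleaning pass to run on each page before splitting."""
    text = re.sub(r"https?://\S+", "", text)       # drop links
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)    # strip unusual special characters
    text = re.sub(r"\n{3,}", "\n\n", text)         # collapse runs of blank lines
    text = re.sub(r"[ \t]{2,}", " ", text)         # collapse repeated spaces and tabs
    return text.strip()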

Splitting

To split the text, we will use RecursiveCharacterTextSplitter from Langchain’s text_splitter module. Before that, we need to convert each text file into a Document object.

from langchain.schema.document import Document

directory_path = '/parsed_documents'
docs = []

for filename in os.listdir(directory_path):
    file_path = os.path.join(directory_path, filename)
    with open(file_path, 'r') as file:
        content = file.read()
        docs.append(Document(page_content=content))

The RecursiveCharacterTextSplitter splits text on the separators ["\n\n", "\n", " ", ""] in order. It first tries splitting on "\n\n"; if a resulting chunk is still larger than chunk_size, it recursively splits that chunk on "\n", then on spaces, and finally on individual characters. If a chunk already fits within chunk_size, it is kept and the splitter moves on to the next one. You can also specify chunk_overlap to maintain a sense of continuity between consecutive chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
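
After splitting, it’s worth printing a few quick statistics to confirm the chunks look reasonable, for example:

print(f"Number of chunks: {len(texts)}")
print(f"Longest chunk: {max(len(t.page_content) for t in texts)} characters")
print(texts[0].page_content[:300])  # eyeball the first chunk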

It’s important to remember that the optimal chunk size can depend on various factors including the nature of your documents, the capabilities of your model, and your system’s memory constraints. It’s often a good idea to experiment with different values to find the one that works best for your specific use case. However, this is a general guideline that I keep in mind when deciding chunk_size:

  • Smaller chunks — can be beneficial when your task usually involves asking about specific details because they allow the model to focus on a smaller portion of the text, which might make it easier for the model to find the specific detail you’re asking about.
  • Larger chunks — can be better if your typical question is more of a summarization task, because they allow the model to take in more of the document’s context at once. This can help the model generate a more accurate and comprehensive summary.

Embedding

Next, we use the HuggingFaceInstructEmbeddings to convert our texts into numerical representations (embeddings). These embeddings capture the semantic meaning of the texts.

import torch
from langchain.embeddings import HuggingFaceInstructEmbeddings

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
)
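
As an optional sanity check, you can embed a sample query and one of the chunks and compare them with cosine similarity; a higher score means the texts are more semantically similar. The query string below is just an illustrative example.

import numpy as np

query_vec = np.array(embeddings.embed_query("How do I reset my device?"))
chunk_vec = np.array(embeddings.embed_documents([texts[0].page_content])[0])

cosine = np.dot(query_vec, chunk_vec) / (
    np.linalg.norm(query_vec) * np.linalg.norm(chunk_vec)
)
print(f"Cosine similarity: {cosine:.3f}")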

Storing in a Vector Database

Langchain provides integrations with a huge array of vector databases for storing the embeddings. Several factors govern the choice of vector database, such as scalability and retrieval efficiency, and there are benchmarks you can refer to that show which database offers the fastest performance for specific distance operations. I am using Chroma, but there’s no specific reason I opted for it; retrieval has been quite efficient so far, but there might be a better option.

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings, persist_directory="db")
  • persist_directory: the path where you want to store the db files
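
Before wiring the vector store into a chain, you can sanity-check retrieval directly. The snippet below reuses the question we will ask at inference time and prints the single closest chunk.

results = db.similarity_search("What are yard moves?", k=1)
print(results[0].page_content[:500])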

Prompt

Every time we want an LLM to make an inference, we pass it a prompt. This prompt is composed of two parts: the system prompt, which is the instruction you give to the LLM and stays constant throughout the task, and the input_prompt, which combines the user question with the retrieved context and changes every time a user asks a new question. We define a function to create a template for this prompt.

def generate_prompt(
    input_prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""
{system_prompt}

{input_prompt}
""".strip()

Next, we define the system prompt that is most appropriate for the task. This is where you can get really creative and descriptive. Improving the quality of this prompt can single-handedly boost the performance of your model.

from langchain import PromptTemplate

SYSTEM_PROMPT = """
Use the following pieces of context to answer the question
at the end. Each retrieved context will have a symptom that
best describes the issue that user is facing with his device.
The context will also have the solution. Return only this
solution broken down into nicely formatted steps. If you don't
know the answer, just say that you don't know, don't try to
make up an answer.
"""

template = generate_prompt(
    """
{context}

Question: {question}
""",
    system_prompt=SYSTEM_PROMPT,
)

prompt = PromptTemplate(
    template=template, input_variables=["context", "question"]
)

Notice how I explicitly instruct the model to extract the symptom and solution from the context. This would change depending on your use case.
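
To see exactly what the LLM will receive, you can render the template with placeholder values. The context and question below are made up purely for illustration.

print(prompt.format(
    context="Symptom: device not syncing. Solution: restart the app, then re-pair the device.",
    question="Why is my device not syncing?",
))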

Conversational Memory

LLMs are stateless and do not inherently track the conversation, so every time you invoke them they have no access to the previous messages in the chat. To handle this, we create an instance of ConversationBufferMemory.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history", output_key='answer', return_messages=False
)
  • memory_key="chat_history": Specifies the key for accessing the conversation history in the memory.
  • output_key='answer': Specifies the key for accessing the output (answer) stored in the memory.
  • return_messages=False: Determines whether the module returns individual messages or only the final output.
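
To get a feel for what the buffer stores, you can write and read one turn manually; the question and answer below are placeholders. When the memory is attached to the chain, this bookkeeping happens automatically.

memory.save_context(
    {"question": "What are yard moves?"},  # what the user asked
    {"answer": "Yard moves are transfers of CMVs within a facility."},  # what the AI replied
)
print(memory.load_memory_variables({})["chat_history"])
# Human: What are yard moves?
# AI: Yard moves are transfers of CMVs within a facility.

memory.clear()  # start with an empty history before attaching the memory to the chain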

Model

Choosing which LLM to use depends on several factors like use case and resource constraints, which, again, is a separate topic in itself. I am using Gemini Pro, which you can access through the ChatVertexAI integration of Langchain. You will need to enable the Vertex AI API on your GCP project. Using this model is currently free as long as you don’t make more than one request per second, so it’s a good place to start if you are just playing around with your pipeline and want to generate quick inferences in the absence of computing resources.

from google.cloud import aiplatform
from langchain.chat_models import ChatVertexAI

llm = ChatVertexAI(model="gemini-pro")
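
As an optional smoke test, you can confirm that your Vertex AI credentials and the model work before wiring the LLM into the chain; the prompt here is arbitrary.

print(llm.invoke("In one sentence, what is a yard move?").content)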

In case you don’t want to expose your data to closed-source models or don’t want to pay for API calls, you can opt for an open-source alternative. I will be using a quantized version of LLaMa 2 13B which, at 7.26 GB, fits well within Google Colab’s default memory constraints. It also exhibits fast inference on the default Tesla T4 GPU.

(If you don’t mind using the Gemini Pro model, skip directly to the Conversational Chain section.)

First, we specify the pre-trained model to be loaded. In this case, it’s the Llama-2-13B-chat-GPTQ model by TheBloke from the Hugging Face Model Hub. The base name of the model is used to construct the local directory path where the downloaded model weights will be stored.

model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
model_basename = "model"

Then we use the auto_gptq library to download the quantized model. It’s alright if you don’t understand all the parameters.

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    revision="gptq-4bit-128g-actorder_True",
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    inject_fused_attention=False,
    device=DEVICE,
    quantize_config=None,
    disable_exllama=True,
    auto_devices=True,
)
  • revision: Specifies the revision of the model and quantization details (bit precision = 4bit, group size = 128g, and activation order = True).
  • use_safetensors: Loads the weights from the safetensors format, a safer serialization format than pickle.
  • trust_remote_code: Trusts the remote code for loading the model.
  • inject_fused_attention: Specifies whether to inject fused attention kernels.
  • device: Specifies the device (e.g., "cuda" for GPU) on which the model will be loaded.
  • quantize_config: Configuration for quantization; set to None for default.
  • disable_exllama: Disables ExLLAMA functionality.
  • auto_devices: Automatically sets devices for multi-GPU usage.

Next, we use the transformers library to set up the tokenizer and streamer. AutoTokenizer automatically selects the appropriate tokenizer for a given pre-trained model. TextStreamer prints generated tokens as they are produced, so you can watch the model’s output stream in real time.

from transformers import AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
  • model_name_or_path: Specifies the pre-trained model to be used by the tokenizer.
  • use_fast=True: Enables the use of a faster tokenizer implementation if available.
  • tokenizer: The tokenizer instance to be used for tokenizing input text.
  • skip_prompt=True: Skips the input prompt so that only newly generated tokens are streamed.
  • skip_special_tokens=True: A boolean indicating whether to skip special tokens (e.g., [CLS], [SEP]) during processing.

Next, we set up a text-generation pipeline, passing the instances of the model, tokenizer and streamer. The other parameters are mostly the default parameters of the LLaMa 2 model.

from transformers import pipeline

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.15,
    streamer=streamer,
)
  • max_new_tokens=1024: Set the maximum number of new tokens (words or subwords) to generate in the output.
  • temperature=0: Set the temperature parameter to 0, which results in deterministic (non-random) output. Higher values (e.g., 1) introduce randomness into the generated text.
  • top_p=0.95: Set the top-p parameter, controlling the diversity of the generated text. Higher values (e.g., 0.95) allow for more diverse outputs.
  • repetition_penalty=1.15: Set the repetition penalty, discouraging the model from repeating the same tokens in the generated text.

Then we use the HuggingFacePipeline to wrap the above pipeline so that it’s compatible with a retrieval chain instance of Langchain.

from langchain import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})

Conversational Chain

Now that we have every component of the pipeline ready, we can use Langchain’s chains module to tie everything together. We will create an instance of ConversationalRetrievalChain, which is meant specifically for retrieval-based conversational pipelines. Under the hood, it uses the chat history and the new question to create a standalone question. This standalone question is used for retrieval rather than the new user question alone, which might not capture the context of the conversation and could consequently lead to inaccurate retrieval.

from langchain.chains import ConversationalRetrievalChain

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 1}),
    get_chat_history=lambda o: o,
    memory=memory,
    combine_docs_chain_kwargs={'prompt': prompt},
)
  • llm=llm: Specifies the language model to be used in the chain.
  • chain_type="stuff": uses all of the text from the retrieved documents ("stuffed" into the prompt) to answer the question
  • retriever=db.as_retriever(search_kwargs={"k": 1}): this is a vector store retriever used to retrieve documents. By default the retrieval method is similarity search and we are fetching the top k = 1 contexts.
  • get_chat_history=lambda o: o: passes the chat history through unchanged (an identity function) rather than letting the chain reformat it
  • memory=memory: passing the memory buffer object created above
  • combine_docs_chain_kwargs={'prompt': prompt}: passing the prompt along with the retrieved documents
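
If you want to customize how the standalone question mentioned above is generated, you can pass your own condense_question_prompt to ConversationalRetrievalChain.from_llm. The template below is only a minimal sketch of the default rephrasing behaviour.

CONDENSE_PROMPT = PromptTemplate.from_template(
    """Given the following conversation and a follow-up question,
rephrase the follow-up question to be a standalone question.

Chat History:
{chat_history}
Follow-up question: {question}
Standalone question:"""
)

# Pass it when building the chain:
# qa_chain = ConversationalRetrievalChain.from_llm(..., condense_question_prompt=CONDENSE_PROMPT)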

Inference

Finally, we can use the qa_chain object created above to answer a user question.

result = qa_chain("What are yard moves?")
print(result['answer'])
OUTPUT:

Yard moves refer to the transfer of commercial motor vehicles (CMVs) between locations within a terminal or similar facility on private property. These movements must not occur on a highway, which is defined as any public or private road, street, or way that allows the public to operate four-wheeled vehicles without restrictions from signs or gates.

You can keep calling the same qa_chain object to ask follow-up questions. You can also access the chat history via result['chat_history']. It will return a string of text with each message on a new line, preceded by either 'Human:' or 'AI:'.
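
For example, a follow-up that leans on the previous turn might look like this (the question below is illustrative). The chain condenses the history and the new question into a standalone question before retrieval, so pronouns like "they" still resolve correctly.

followup = qa_chain("Are they allowed on a public highway?")
print(followup['answer'])

print(followup['chat_history'])  # inspect the accumulated conversation so far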

Conclusion

That was a quick walkthrough of how you can set up a baseline RAG system that performs retrieval over a large number of documents and prompts an LLM to answer user questions. Note that I didn’t focus much on how to optimize and evaluate a RAG system, as these topics necessitate an in-depth discussion of their own. I will also write a follow-up post on how I built a chatbot application using Django.

There’s plenty of room for experimentation in this task but that’s beyond the scope of this article. However, here’s a non-exhaustive list of things you can try to improve the system:

  • Experiment with different models, embeddings, retrieval techniques and prompts.
  • Few shot learning — pass examples of questions and desired answers as part of the system prompt.
  • Adding metadata to chunks
  • Summarize retrieved chunks to pass concise information to an LLM
  • Try reranker algorithms to re-rank the top k retrieved documents
  • Summarize and condense chat history to ensure that the LLM captures relevant parts of the context.
  • A hybrid of fine-tuning and RAG

This blog post is a great read about strategies to optimize RAG systems.

Read more

Companies across various industries are increasingly recognizing the value of incorporating RAG-based solutions into their workflows. It’s also the reason why the landscape of RAG models is expanding so quickly, with a plethora of novel techniques, cookbooks and algorithms showing up every day. LlamaIndex blog is a great resource to keep up with these developments. If you have specific questions about implementing LLMs, the LocalLLaMa subreddit is the perfect place to find the answer.
