Integrating the ChatPDF Feature into a Local Streamlit Chat Interface, Including Non-OpenAI Models (Llama2)

Moto DEI
11 min read · Jul 27, 2023


Continuing from my last blog post, I’m now introducing a feature that allows the chat to draw on information from a loaded PDF while responding to user queries. This is the capability commonly known as “ChatPDF”.

The image below provides a glimpse of the final look once all the steps outlined in this post are completed. Notice the “Document Upload” section positioned above the user chat area, where you can upload your PDF.

In this post, I utilized a PDF as the document source. However, with the LangChain data loader, a variety of other data types can be loaded, including text, HTML, CSV, Confluence pages, YouTube transcripts, and more.
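For instance, here is a minimal sketch (not part of this post’s app) of how a couple of other LangChain loaders could feed the same pipeline; the file name and URL below are placeholders.

# Hypothetical sources; CSVLoader / WebBaseLoader come from LangChain's data loaders.
from langchain.document_loaders import CSVLoader, WebBaseLoader

csv_docs = CSVLoader(file_path="data.csv").load()        # one Document per CSV row
web_docs = WebBaseLoader("https://example.com").load()   # page text as Documents

# Each loader returns Document objects whose .page_content can go through the
# same split -> embed -> store steps used for the PDF later in this post.
texts = [doc.page_content for doc in csv_docs + web_docs]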


Setting the Stage: Essential Preparation

The foundational steps outlined in my previous post remain relevant: acquiring the OpenAI API key and downloading the Llama2 model, though the latter is only needed if you want to run the local Llama2 model used later in this post.
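As a reminder, the app reads the OpenAI API key from the environment via python-dotenv. A minimal sketch of that setup, assuming a local .env file containing a line like OPENAI_API_KEY=..., looks like this:

import os
from dotenv import load_dotenv, find_dotenv

# Assumes a .env file in the project root with: OPENAI_API_KEY=<your key>
load_dotenv(find_dotenv())
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"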

Here’s the new requirements.txt:

langchain==0.0.234
openai==0.27.8
python-dotenv==1.0.0
streamlit==1.24.1
llama-cpp-python==0.1.65
PyPDF2==3.0.1
tiktoken==0.4.0
qdrant-client==1.3.1

Main Code

Now, our complete code in app.py looks like this:

# app.py
from typing import List, Union, Optional

from dotenv import load_dotenv, find_dotenv
from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import (SystemMessage, HumanMessage, AIMessage)
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.text_splitter import TokenTextSplitter
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Qdrant
from PyPDF2 import PdfReader
import streamlit as st

PROMPT_TEMPLATE = """
Use the following pieces of context enclosed by triple backquotes to answer the question at the end.
\n\n
Context:
```
{context}
```
\n\n
Question: [][][][]{question}[][][][]
\n
Answer:"""


def init_page() -> None:
    st.set_page_config(
        page_title="Personal ChatGPT"
    )
    st.sidebar.title("Options")


def init_messages() -> None:
    clear_button = st.sidebar.button("Clear Conversation", key="clear")
    if clear_button or "messages" not in st.session_state:
        st.session_state.messages = [
            SystemMessage(
                content=(
                    "You are a helpful AI QA assistant. "
                    "When answering questions, use the context enclosed by triple backquotes if it is relevant. "
                    "If you don't know the answer, just say that you don't know, "
                    "don't try to make up an answer. "
                    "Reply your answer in markdown format.")
            )
        ]
        st.session_state.costs = []


def get_pdf_text() -> Optional[List[str]]:
    """
    Function to load PDF text and split it into chunks.
    """
    st.header("Document Upload")
    uploaded_file = st.file_uploader(
        label="Here, upload your PDF file you want ChatGPT to use to answer",
        type="pdf"
    )
    if uploaded_file:
        pdf_reader = PdfReader(uploaded_file)
        text = "\n\n".join([page.extract_text() for page in pdf_reader.pages])
        text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
        return text_splitter.split_text(text)
    else:
        return None


def build_vectore_store(
        texts: List[str], embeddings: Union[OpenAIEmbeddings, LlamaCppEmbeddings]) \
        -> Optional[Qdrant]:
    """
    Store the embedding vectors of text chunks into vector store (Qdrant).
    """
    if texts:
        with st.spinner("Loading PDF ..."):
            qdrant = Qdrant.from_texts(
                texts,
                embeddings,
                path=":memory:",
                collection_name="my_collection",
                force_recreate=True
            )
        st.success("File Loaded Successfully!!")
    else:
        qdrant = None
    return qdrant


def select_llm() -> tuple[str, float]:
    """
    Read user selection of parameters in Streamlit sidebar.
    """
    model_name = st.sidebar.radio("Choose LLM:",
                                  ("gpt-3.5-turbo-0613",
                                   "gpt-3.5-turbo-16k-0613",
                                   "gpt-4",
                                   "llama-2-7b-chat.ggmlv3.q2_K"))
    temperature = st.sidebar.slider("Temperature:", min_value=0.0,
                                    max_value=1.0, value=0.0, step=0.01)
    return model_name, temperature


def load_llm(model_name: str, temperature: float) -> Union[ChatOpenAI, LlamaCpp]:
    """
    Load LLM.
    """
    if model_name.startswith("gpt-"):
        return ChatOpenAI(temperature=temperature, model_name=model_name)
    elif model_name.startswith("llama-2-"):
        callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
        return LlamaCpp(
            model_path=f"./models/{model_name}.bin",
            input={"temperature": temperature,
                   "max_length": 2048,
                   "top_p": 1
                   },
            n_ctx=2048,
            callback_manager=callback_manager,
            verbose=False,  # True
        )


def load_embeddings(model_name: str) -> Union[OpenAIEmbeddings, LlamaCppEmbeddings]:
    """
    Load embedding model.
    """
    if model_name.startswith("gpt-"):
        return OpenAIEmbeddings()
    elif model_name.startswith("llama-2-"):
        return LlamaCppEmbeddings(model_path=f"./models/{model_name}.bin")


def get_answer(llm, messages) -> tuple[str, float]:
    """
    Get the AI answer to user questions.
    """
    if isinstance(llm, ChatOpenAI):
        with get_openai_callback() as cb:
            answer = llm(messages)
        return answer.content, cb.total_cost
    if isinstance(llm, LlamaCpp):
        return llm(llama_v2_prompt(convert_langchainschema_to_dict(messages))), 0.0


def find_role(message: Union[SystemMessage, HumanMessage, AIMessage]) -> str:
    """
    Identify role name from langchain.schema object.
    """
    if isinstance(message, SystemMessage):
        return "system"
    if isinstance(message, HumanMessage):
        return "user"
    if isinstance(message, AIMessage):
        return "assistant"
    raise TypeError("Unknown message type.")


def convert_langchainschema_to_dict(
        messages: List[Union[SystemMessage, HumanMessage, AIMessage]]) \
        -> List[dict]:
    """
    Convert the chain of chat messages in list of langchain.schema format to
    list of dictionary format.
    """
    return [{"role": find_role(message),
             "content": message.content
             } for message in messages]


def llama_v2_prompt(messages: List[dict]) -> str:
    """
    Convert the messages in list of dictionary format to Llama2 compliant
    format.
    """
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
    BOS, EOS = "<s>", "</s>"
    DEFAULT_SYSTEM_PROMPT = f"""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

    if messages[0]["role"] != "system":
        messages = [
            {
                "role": "system",
                "content": DEFAULT_SYSTEM_PROMPT,
            }
        ] + messages
    messages = [
        {
            "role": messages[1]["role"],
            "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"],
        }
    ] + messages[2:]

    messages_list = [
        f"{BOS}{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} {EOS}"
        for prompt, answer in zip(messages[::2], messages[1::2])
    ]
    messages_list.append(
        f"{BOS}{B_INST} {(messages[-1]['content']).strip()} {E_INST}")

    return "".join(messages_list)


def extract_userquesion_part_only(content):
    """
    Function to extract only the user question part from the entire question
    content combining user question and pdf context.
    """
    content_split = content.split("[][][][]")
    if len(content_split) == 3:
        return content_split[1]
    return content


def main() -> None:
    _ = load_dotenv(find_dotenv())

    init_page()

    model_name, temperature = select_llm()
    llm = load_llm(model_name, temperature)
    embeddings = load_embeddings(model_name)

    texts = get_pdf_text()
    qdrant = build_vectore_store(texts, embeddings)

    init_messages()

    st.header("Personal ChatGPT")
    # Supervise user input
    if user_input := st.chat_input("Input your question!"):
        if qdrant:
            context = [c.page_content for c in qdrant.similarity_search(
                user_input, k=10)]
            user_input_w_context = PromptTemplate(
                template=PROMPT_TEMPLATE,
                input_variables=["context", "question"]) \
                .format(
                    context=context, question=user_input)
        else:
            user_input_w_context = user_input
        st.session_state.messages.append(
            HumanMessage(content=user_input_w_context))
        with st.spinner("ChatGPT is typing ..."):
            answer, cost = get_answer(llm, st.session_state.messages)
        st.session_state.messages.append(AIMessage(content=answer))
        st.session_state.costs.append(cost)

    # Display chat history
    messages = st.session_state.get("messages", [])
    for message in messages:
        if isinstance(message, AIMessage):
            with st.chat_message("assistant"):
                st.markdown(message.content)
        elif isinstance(message, HumanMessage):
            with st.chat_message("user"):
                st.markdown(extract_userquesion_part_only(message.content))

    costs = st.session_state.get("costs", [])
    st.sidebar.markdown("## Costs")
    st.sidebar.markdown(f"**Total cost: ${sum(costs):.5f}**")
    for cost in costs:
        st.sidebar.markdown(f"- ${cost:.5f}")


# streamlit run app.py
if __name__ == "__main__":
    main()

Code Breakdown

The PDF text data undergoes a series of processing stages, as depicted below: Load, Transform, Embed, Store, and Retrieve. Each of these steps has corresponding sections in the Python code.

Data load, embed, store, and retrieve. The image is from the LangChain page (https://python.langchain.com/docs/modules/data_connection/)

"get_pdf_text()" function ('load' and 'transform')

def get_pdf_text() -> Optional[List[str]]:
    """
    Function to load PDF text and split it into chunks.
    """
    st.header("Document Upload")
    uploaded_file = st.file_uploader(
        label="Here, upload your PDF file you want ChatGPT to use to answer",
        type="pdf"
    )
    if uploaded_file:
        pdf_reader = PdfReader(uploaded_file)
        text = "\n\n".join([page.extract_text() for page in pdf_reader.pages])
        text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
        return text_splitter.split_text(text)
    else:
        return None

This function sets up the PDF file uploader on the Streamlit UI, extracts the text from the uploaded PDF, and splits it into manageable chunks with TokenTextSplitter, so that only the text most relevant to the user question needs to be fed to the LLM within its maximum token limit.
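To get a feel for what the splitter produces, here is a small standalone sketch; the sample text and printed values are illustrative only.

from langchain.text_splitter import TokenTextSplitter

# Illustrative sample text; in the app this is the text extracted from the PDF.
sample_text = "Xenobi Amilen was a prolific and influential composer of the Classical period. " * 30

splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)  # same settings as get_pdf_text()
chunks = splitter.split_text(sample_text)

print(len(chunks))     # number of ~100-token chunks
print(chunks[0][:80])  # beginning of the first chunk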

“build_vectore_store()” function (‘embed’ and ‘store’)

def build_vectore_store(
        texts: List[str], embeddings: Union[OpenAIEmbeddings, LlamaCppEmbeddings]) \
        -> Optional[Qdrant]:
    """
    Store the embedding vectors of text chunks into vector store (Qdrant).
    """
    if texts:
        with st.spinner("Loading PDF ..."):
            qdrant = Qdrant.from_texts(
                texts,
                embeddings,
                path=":memory:",
                collection_name="my_collection",
                force_recreate=True
            )
        st.success("File Loaded Successfully!!")
    else:
        qdrant = None
    return qdrant

This function takes the text chunks generated earlier by the get_pdf_text() function, converts them into embedding vectors, and stores them in a vector store. For the demonstration in this blog post, I’ve chosen Qdrant as the vector store.
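As a standalone illustration of the same embed, store, and retrieve round trip (assuming a valid OpenAI API key; the sentences are toy examples), something like this would work outside the Streamlit app:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

texts = [
    "Xenobi Amilen was a prolific composer of the Classical period.",
    "His Requiem was largely unfinished at the time of his death.",
]
qdrant = Qdrant.from_texts(
    texts,
    OpenAIEmbeddings(),
    path=":memory:",          # same in-memory setting as in build_vectore_store()
    collection_name="demo",
    force_recreate=True,
)
hits = qdrant.similarity_search("Who was Xenobi Amilen?", k=1)
print(hits[0].page_content)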

Pull PDF text chunks similar to the user question and combine the texts and the user question (‘retrieve’)

PROMPT_TEMPLATE = """
Use the following pieces of context enclosed by triple backquotes to
answer the question at the end.
\n\n
Context:
```
{context}
```
\n\n
Question: [][][][]{question}[][][][]
\n
Answer:"""

......

if user_input := st.chat_input("Input your question!"):
    if qdrant:
        context = [c.page_content for c in qdrant.similarity_search(
            user_input, k=10)]
        user_input_w_context = PromptTemplate(
            template=PROMPT_TEMPLATE,
            input_variables=["context", "question"]) \
            .format(
                context=context, question=user_input)
    else:
        user_input_w_context = user_input

This segment, found within the main() function, identifies the 10 text chunks most similar to user_input using the qdrant.similarity_search() function. These chunks are then merged with the user question into an enriched question text via PromptTemplate. Here’s an example of what the final question might look like. The user simply typed the question that appears after the “Question:” section, “Who is Xenobi Amilen?”; the enriching context above it (“Xenobi Amilen (27 January 1756–5 December 1791) was a prolific and influential composer…”) was pulled from the contents of the PDF file.

Use the following pieces of context enclosed by triple backquotes to answer the question at the end.

Context:

['Xenobi Amilen (27 January 1756 – 5 December 1791) was a prolific and influential composer of the Classical period. Despite his short life, his rapid pace of composition resulted in more than 800 works of virtually every genre of his time. Many of these composition are acknowledged as pinnacles of the symphonic, concertante, chamber, operatic, and choral repertoire. Amilen is widely regarded as among the greatest composers in the history of Western music,[', '1] with his music admired for its "melodic beauty, its formal elegance and its richness of harmony and texture".[2] Born in Salzburg, then in the Holy Roman Empire and currently in Austria, Amilen showed prodigious ability from his earliest childhood. Already competent on keyboard and violin, he composed from the age of five and performed before European royalty. His father took him on a grand tour of Europe and then three trips to Italy. At 17, he was a musician', ' at the Salzburg court but grew restless and travelled in search of a better position. While visiting Vienna in 1781, Amilen was dismissed from his Salzburg position. He stayed in Vienna, where he achieved fame but little financial security. During his final years there, he composed many of his best-known symphonies, concertos, and operas. His Requiem was largely unfinished by the time of his death at the age of 35, the circumstances of which are', ' uncertain and much mythologized. ']

Question: [][][][]Who is Xenobi Amilen?[][][][]

Answer:

By the way, don’t fret about who Xenobi Amilen is! We’ll get to why this individual is mentioned soon.

After the enriched question is sent to the LLM, the rest of the flow is unchanged: get the LLM’s answer, display it, and wait for the next user question.

The chat UI can be launched with this bash command:

streamlit run app.py

Testing the Chat with an Example PDF File

To test the new feature, I crafted a PDF file to load into the chat. However, given that the LLM is already quite knowledgeable about the world, I decided to create an entirely fictitious document. This document discusses a made-up individual and their imaginary achievements.

Here’s a glimpse of what it looks like:

# Contents in Xenobi Amilen.pdf
Xenobi Amilen (27 January 1756 – 5 December 1791) was a prolific and influential composer of the Classical period. Despite his short life, his rapid pace of composition resulted in more than 800 works of virtually every genre of his time. Many of these composition are acknowledged as pinnacles of the symphonic, concertante, chamber, operatic, and choral repertoire. Amilen is widely regarded as among the greatest composers in the history of Western music,[1] with his music admired for its "melodic beauty, its formal elegance and its richness of harmony and texture".[2]

Born in Salzburg, then in the Holy Roman Empire and currently in Austria, Amilen showed prodigious ability from his earliest childhood. Already competent on keyboard and violin, he composed from the age of five and performed before European royalty. His father took him on a grand tour of Europe and then three trips to Italy. At 17, he was a musician at the Salzburg court but grew restless and travelled in search of a better position.

While visiting Vienna in 1781, Amilen was dismissed from his Salzburg position. He stayed in Vienna, where he achieved fame but little financial security. During his final years there, he composed many of his best-known symphonies, concertos, and operas. His Requiem was largely unfinished by the time of his death at the age of 35, the circumstances of which are uncertain and much mythologized.

Xenobi Amilen is most likely a non-existent person, as his name was the result of my random keystrokes on the laptop keyboard. The accomplishments attributed to him were actually sourced from Mozart’s Wikipedia page.

Saving this text as “Xenobi Amilen.pdf”, I had a short interaction with the chat:

Yes! Now, Xenobi Amilen appears to be quite an impressive individual!

There are a few inaccuracies, particularly concerning his gender and whether he’s still alive. These kinds of errors are quite common in LLM chat and point to areas where tuning might be needed. For instance:

  • The LLM itself may simply not be capable enough,
  • The text chunk size and overlap, i.e. how much text neighboring chunks share (the `chunk_size` and `chunk_overlap` attributes of the `TokenTextSplitter()` class), could be too small or too large (see the sketch after this list),
  • The wording of the prompt template (`PROMPT_TEMPLATE`) might not be quite right,
  • The number of text chunks pulled into the prompt (the parameter `k` in the `qdrant.similarity_search()` function) could be too small or too large.
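For example, a hypothetical alternative configuration might look like this; the numbers are illustrative, not tuned recommendations.

from langchain.text_splitter import TokenTextSplitter

# Larger chunks with some overlap keep related sentences together and avoid
# cutting a fact in half at a chunk boundary.
text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)

# ...and in main(), pull fewer but larger chunks into the prompt:
# context = [c.page_content for c in qdrant.similarity_search(user_input, k=4)]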

Here’s another example using the locally-hosted llama-2-7b-chat.ggmlv3.q2_K, set up just as in my previous post. The results are sufficiently correct but a bit chattier than the previous ones, which may suggest adding some restrictions to the answers in the prompt, such as “Please provide a concise answer within 30 words.”
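If you are curious what the Llama2-formatted prompt actually looks like, here is a quick check using the helper functions defined in app.py above (the message contents are just an example):

from langchain.schema import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You are a helpful AI QA assistant."),
    HumanMessage(content="Who is Xenobi Amilen?"),
]
print(llama_v2_prompt(convert_langchainschema_to_dict(messages)))
# <s>[INST] <<SYS>>
# You are a helpful AI QA assistant.
# <</SYS>>
#
# Who is Xenobi Amilen? [/INST]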

Looking Toward the Next Steps

There are several possibilities I want to try:

  • I plan to deploy this setup on a cloud instance with more robust specifications to serve the chat more capably. Hugging Face Inference Endpoints offer another promising avenue to explore, and AWS provides SageMaker JumpStart for one-click deployment as well as Amazon Bedrock as a fully managed LLM host.
  • LLM fine-tuning is another crucial step: tailoring the model to specific tasks or domains optimizes its performance and improves response accuracy, making it more effective and useful. OpenAI has a nice page showing how to fine-tune some of their models.
  • Lastly, as discussed in the previous post, ChatGPT’s Code Interpreter feature has been a remarkable addition to the chat functionality. A potential extension of my Streamlit chat could merge these features with an open-source LLM to emulate the Code Interpreter. This would enable users to run confidential data through the LLM for analysis, enhancing the utility and versatility of the chat interface.
