Building a Local PDF Chat Application with Mistral 7B LLM, Langchain, Ollama, and Streamlit

Harjot
7 min read · Nov 2, 2023

Introduction:

PDFs are a common way to share documents and information. However, they can be difficult to navigate and search, especially if they are large or complex. Chatbots can provide a more user-friendly way to interact with PDFs.

A PDF chatbot is a chatbot that can answer questions about a PDF file. It can do this by using a large language model (LLM) to understand the user’s query and then searching the PDF file for the relevant information. The chatbot can then generate a response to the user’s query in plain language.

PDF chatbots can be used for a variety of purposes, such as:

  • Answering questions about the content of a PDF file
  • Providing summaries of a PDF file
  • Searching for specific keywords or phrases in a PDF file
  • Translating a PDF file into another language

In this article, I will show you how to build a PDF chatbot using the Mistral 7B LLM, LangChain, Ollama, and Streamlit.

Mistral 7B

Mistral 7B is a 7-billion-parameter large language model (LLM) developed by Mistral AI. It is trained on a large corpus of text and code and can perform a variety of tasks. Although it is a relatively new model, it has already achieved state-of-the-art results on a range of benchmarks: it outperforms other pre-trained LLMs of similar size and even beats larger models such as Llama 2 13B. It is designed to run on commodity hardware, such as GPUs and CPUs, without the need for expensive TPU clusters, which makes it accessible to a wider range of users and businesses.

Prerequisites: Running Mistral 7B locally using Ollama🦙

Ollama allows you to run open-source large language models, such as Llama 2, locally. Ollama bundles model weights, configuration, and data into a single package, defined by a Modelfile. It optimizes setup and configuration details, including GPU usage.

For Mac and Linux Users: Ollama integrates smoothly with macOS and Linux and offers a simple installation process, so Mac and Linux users can quickly set it up for local language model usage. Detailed instructions can be found here: Ollama GitHub Repository for Mac and Linux.

For Windows users, the process involves a few additional steps to ensure a smooth Ollama experience:

1. Install WSL 2: To enable WSL 2, kindly refer to the official Microsoft documentation for comprehensive installation instructions: Install WSL 2.

2. Install Docker: Docker for Windows is a crucial component. Installation guidance is provided in the official Docker documentation: Install Docker for Windows.

3. Utilize Docker Image: Windows users can access Ollama by using the Docker image provided here: Ollama Docker Image.

Now we can run Mistral from the command line using the following command:

docker exec -it ollama ollama run mistral
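If the Ollama container is not running yet, you can start it first with the command from the Ollama Docker documentation (shown here for CPU-only use; adjust the volume and port mapping to your setup):

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama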

Langchain 🦜

LangChain is a framework for developing applications powered by language models. It makes it easy to develop AI-powered applications and has libraries in both Python and JavaScript. I have used LangChain to integrate Ollama with my application.

I am using Retrieval-Augmented Generation (RAG) to generate responses in the context of a particular document. RAG applications are a type of large language model (LLM) application that augments generation by retrieving relevant information from an external knowledge base. This allows RAG applications to produce more informative and comprehensive responses to a wider range of prompts and questions.

In this article, I will show how to build a RAG application with the Mistral 7B model and a Chroma vector database. The architecture is shown below.

Architecture

The code for the RAG application using Mistral 7B, Ollama, and Streamlit can be found in my GitHub repository here.

Lets Code 👨‍💻

Let us start by importing the necessary libraries:

# Import required modules
from langchain import hub
from langchain.chains import RetrievalQA
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.callbacks.manager import CallbackManager
from langchain.llms import Ollama
from langchain.embeddings.ollama import OllamaEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
import streamlit as st
import os
import time

Load the document and split it into chunks before embedding them in your vector database. I chose a chunk size of 1,500 characters (RecursiveCharacterTextSplitter measures length in characters by default). You can change this to fit your specific use case.

loader = PyPDFLoader("example.pdf")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=100)
all_splits = text_splitter.split_documents(data)

Persist the database to disk using vectorstore.persist() so that you don’t have to preprocess the document every time.

persist_directory = 'jj'

vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OllamaEmbeddings(model="mistral"),
    persist_directory=persist_directory
)

vectorstore.persist()

After saving, you can load the persisted database from disk by pointing Chroma at the same persistence directory and use it as normal. Remember to use the same embedding model that you used when creating it.

vectorstore = Chroma(
    persist_directory=persist_directory,
    embedding_function=OllamaEmbeddings(model="mistral")
)
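To sanity-check that the store loaded correctly, you can run a quick similarity search before wiring it into a chain (the query string here is just a hypothetical example):

# Retrieve the chunks most similar to a test query
docs = vectorstore.similarity_search("What is this document about?", k=3)
print(f"Retrieved {len(docs)} chunks")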

Now initialize the LLM and create a retriever:

llm = Ollama(
    base_url="http://localhost:11434",
    model="mistral:instruct",
    verbose=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)

retriever = vectorstore.as_retriever()

Conversation buffer memory is used to maintain a history of the chat so that the LLM can also refer to previous messages in the prompt.

Initializing the conversation buffer memory and the prompt template:

template = """
You are a knowledgeable chatbot, here to help with questions of the user. Your tone should be professional and informative.

Context: {context}
History: {history}

User: {question}
Chatbot:""
"""
prompt = PromptTemplate(
input_variables=["history", "context", "question"],
template=template,
)

memory = ConversationBufferMemory(
    memory_key="history",
    return_messages=True,
    input_key="question"
)

Creating a Q&A chain:

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
    verbose=True,
    chain_type_kwargs={
        "verbose": True,
        "prompt": prompt,
        "memory": memory,
    }
)

Now you can query the LLM directly:

while True:
    query = input("Ask a question: ")
    response = qa_chain(query)

Make UI using Streamlit

Streamlit is an open-source app framework for machine learning and data science teams that lets you create beautiful web apps in minutes.

First, we will use a file uploader component to upload a pdf file and preprocess it.

import streamlit as st
uploaded_file = st.file_uploader("Upload your PDF", type='pdf')
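All of the preprocessing and chat handling below only makes sense once a file has actually been selected, so in the app it sits under a check like this (a sketch; the else-branch message is just an example):

if uploaded_file is not None:
    # preprocessing and chat UI go here (next sections)
    pass
else:
    st.write("Please upload a PDF file to start chatting.")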

We will use a chat_history variable to maintain the chat history across the Streamlit session.

# Initialize the chat history
if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []

NOTE: We cannot simply use ordinary variables to store these values because Streamlit reinitializes every variable on every rerun, so we have to store them as session state variables. (It took me 4 hours to realise this 😢😓)

So we have to create session state variables for the memory, prompt, LLM, and vector store like this:

# Initialize the prompt template
if 'prompt' not in st.session_state:
    st.session_state.prompt = PromptTemplate(
        input_variables=["history", "context", "question"],
        template=template,
    )

# Initialize the memory for conversation history
if 'memory' not in st.session_state:
    st.session_state.memory = ConversationBufferMemory(
        memory_key="history",
        return_messages=True,
        input_key="question"
    )

# Initialize the vector store for document embeddings
if 'vectorstore' not in st.session_state:
    st.session_state.vectorstore = Chroma(
        persist_directory='jj',
        # Use the same embedding model that was used to build the store
        embedding_function=OllamaEmbeddings(model="mistral")
    )

# Initialize the Ollama large language model (LLM)
if 'llm' not in st.session_state:
    st.session_state.llm = Ollama(
        base_url="http://localhost:11434",
        model="mistral:instruct",
        verbose=True,
        callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
    )

Streamlit’s file uploader doesn’t give us a file path, so we have to read its bytes first and then save them to a files directory.

# Make sure the files directory exists before writing to it
os.makedirs("files", exist_ok=True)

bytes_data = uploaded_file.read()
with open("files/" + uploaded_file.name + ".pdf", "wb") as f:
    f.write(bytes_data)

loader = PyPDFLoader("files/" + uploaded_file.name + ".pdf")
data = loader.load()

Now we will repeat the same steps to process the uploaded data and build a vector store from it using ChromaDB.

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    length_function=len
)
all_splits = text_splitter.split_documents(data)

# Create and persist the vector store (same directory as the one loaded above)
st.session_state.vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OllamaEmbeddings(model="mistral"),
    persist_directory='jj'
)
st.session_state.vectorstore.persist()
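The QA chain below expects a retriever in st.session_state.retriever, so create one from the freshly built vector store, mirroring the retriever we set up earlier:

# Expose the vector store as a retriever for the QA chain
st.session_state.retriever = st.session_state.vectorstore.as_retriever()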

Initializing the QA chain:

if 'qa_chain' not in st.session_state:
    st.session_state.qa_chain = RetrievalQA.from_chain_type(
        llm=st.session_state.llm,
        chain_type='stuff',
        retriever=st.session_state.retriever,
        verbose=True,
        chain_type_kwargs={
            "verbose": True,
            "prompt": st.session_state.prompt,
            "memory": st.session_state.memory,
        }
    )

Now we have to handle user queries and the responses from the LLM. We do this by assigning roles such as “user” and “assistant” to the messages. We also append each message to the chat history to maintain the conversation across the session.

if user_input := st.chat_input("You:", key="user_input"):
    user_message = {"role": "user", "message": user_input}
    st.session_state.chat_history.append(user_message)

    with st.chat_message("user"):
        st.markdown(user_input)

    with st.chat_message("assistant"):
        with st.spinner("Assistant is typing..."):
            response = st.session_state.qa_chain(user_input)
        message_placeholder = st.empty()
        full_response = ""
        for chunk in response['result'].split():
            full_response += chunk + " "
            time.sleep(0.05)
            # Add a blinking cursor to simulate typing
            message_placeholder.markdown(full_response + "▌")
        message_placeholder.markdown(full_response)

    chatbot_message = {"role": "assistant", "message": response['result']}
    st.session_state.chat_history.append(chatbot_message)
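Because Streamlit reruns the whole script on every interaction, you may also want to replay the stored history near the top of the app so earlier messages stay visible (a small sketch using the same chat_history structure as above):

# Re-render earlier messages on every rerun
for message in st.session_state.chat_history:
    with st.chat_message(message["role"]):
        st.markdown(message["message"])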

You can find the complete code in my GitHub repo:

https://github.com/SonicWarrior1/pdfchat

Conclusion

In this guide, we’ve unlocked the potential of AI to revolutionize how we engage with PDF documents. Our PDF chatbot, powered by Mistral 7B, Langchain, and Ollama, bridges the gap between static content and dynamic conversations.

By understanding the capabilities of Retrieval-Augmented Generation (RAG) and leveraging open-source tools, you now have the foundation to create intelligent document interaction solutions. The possibilities are limitless.

Take this knowledge, customize it for your specific needs, and explore the endless applications of AI in document interactions. We’re excited to see how you’ll shape the future of PDF interactions.

Thank you for embarking on this AI journey with me😉.
