
Chat with your PDF files using Mistral-7B and Langchain

5 min read · Oct 27, 2023

I am an academician. Reading from and creating PDF files is an important part of my life. My students also read a lot of PDFs.

In this article, I have created a simple Python program using LangChain, HuggingFaceEmbeddings, and the Mistral-7B LLM from HuggingFace to answer my questions from any PDF file. I have referred to a few online blogs to create this code (see References).


Let us first see all the libraries used:

What is LangChain?

LangChain is a framework for developing context-aware applications on top of language models. Such applications use an LLM along with prompts and few-shot examples to provide relevant responses and to reason. LangChain has prominent use cases in document question answering, chatbots, structured data analysis, etc. LangChain provides components, which are abstractions and their implementations for working with LLMs; it also provides chains of components for higher-level tasks.

You can install langchain like this:

pip install langchain

Modules in LangChain: Model I/O, Retrieval, Chains, Agents, Memory, Callbacks

Model I/O module

Model I/O is the core element of the application. Using LangChain, you can interface with any language model. This interface requires three components: Language Models, Prompts, and Output Parsers.

LangChain offers many classes and functions to construct your prompts and work with them. It offers ready-made prompt templates for various tasks, and you can also create custom prompt templates. You can use Feature Stores to store relevant data.
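
For illustration, here is a minimal prompt template (the toy summarisation prompt is my own example, not part of the PDF program below):

from langchain.prompts import PromptTemplate

# A reusable template with a {text} placeholder
template = PromptTemplate.from_template(
    "Summarise the following text in one sentence:\n\n{text}"
)

# Fill in the placeholder to get the final prompt string
print(template.format(text="LangChain is a framework for building LLM applications."))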

LangChain can work with LLMs or with chat models that take a list of chat messages as input and return a chat message. It can work with many LLMs, including OpenAI LLMs and open-source LLMs.

Output Parsers are used to structure the response received from LLMs. PydanticOutputParser is the main type of output parser in LangChain.
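
Here is a small sketch of PydanticOutputParser; the BookInfo schema and the sample JSON reply are made-up examples:

from pydantic import BaseModel, Field
from langchain.output_parsers import PydanticOutputParser

# Define the structure we want the LLM's answer to follow
class BookInfo(BaseModel):
    title: str = Field(description="title of the book")
    author: str = Field(description="author of the book")

parser = PydanticOutputParser(pydantic_object=BookInfo)

# These instructions are appended to the prompt so the LLM replies in JSON
print(parser.get_format_instructions())

# parse() turns the model's JSON reply into a BookInfo object
book = parser.parse('{"title": "Deep Learning", "author": "Ian Goodfellow"}')
print(book.author)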

Retrieval module

The Retrieval module implements Retrieval Augmented Generation (RAG) to access user-specific data that is not part of the model's training data. Retrieval includes the following steps: loading data, transforming data, creating or obtaining embeddings, storing embeddings, and retrieving embeddings. LangChain has around 100 document loaders to read documents of all major formats: CSV, HTML, PDF, code, etc. It can transform data using different algorithms. LangChain has integrations with over 25 embedding providers and over 50 vector stores, both open-source and proprietary. Once your data is in the database, you can retrieve it using the many retrieval algorithms implemented in LangChain.
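
As a minimal sketch of the first two steps (loading and transforming), assuming a plain-text file named notes.txt; the remaining steps (embedding, storing, retrieving) appear in the full program later in this article:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Step 1: load the raw document
docs = TextLoader("notes.txt").load()

# Step 2: transform it into smaller, overlapping chunks
chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
print(len(chunks))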

Chains Module

Complex applications often require combining, or chaining, calls to several LLMs to construct a more relevant response. A "chain" is a sequence of calls to components, which can include other chains. You can use the "Chain" interface or the LangChain Expression Language (LCEL) to create chains with LangChain.
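
For example, here is a one-step chain that wires a prompt to the Mistral-7B model introduced later in this article (a sketch; the translation prompt is my own example):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFaceHub

prompt = PromptTemplate.from_template("Translate to French: {sentence}")
llm = HuggingFaceHub(
    huggingfacehub_api_token='your huggingface access token here',
    repo_id="mistralai/Mistral-7B-v0.1",
)

# LLMChain wires the prompt and the model into one callable unit
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(sentence="Good morning"))

With LCEL, the same chain can be written more compactly as prompt | llm.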

Agents Module

An agent is a chain responsible for deciding what step to take next. It is powered by a language model and a prompt. It needs the following inputs: a list of available tools, the user input, and previously executed steps (if any). An agent calls functions known as "Tools". Agents use an LLM to decide which actions to take and in what order. Actions include using a tool, observing the output of a tool, and returning a response to the user.
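
A minimal sketch using LangChain's built-in llm-math calculator tool (treat this as illustrative; a small model served via the Hub may struggle with the ReAct prompt format an agent uses):

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    huggingfacehub_api_token='your huggingface access token here',
    repo_id="mistralai/Mistral-7B-v0.1",
)

# Give the agent a calculator tool; the LLM decides when to call it
tools = load_tools(["llm-math"], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("What is 15% of 240?")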

Memory Module

The Memory module enables a system to remember past information; for a conversational bot, this may be past conversations.
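
For example, ConversationBufferMemory keeps the full chat history in a buffer:

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({"input": "Hi, I am Alice"}, {"output": "Hello Alice!"})

# The stored history can be injected into the next prompt
print(memory.load_memory_variables({}))
# {'history': 'Human: Hi, I am Alice\nAI: Hello Alice!'}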

Callbacks Module

The callbacks mechanism allows you to hook into the different stages of your LLM application using the 'callbacks' argument of the API. It is used for logging, monitoring, streaming, etc.
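
For example, the built-in StdOutCallbackHandler logs each stage of a chain to the console (a sketch, reusing the LLMChain pattern from above):

from langchain.callbacks import StdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    huggingfacehub_api_token='your huggingface access token here',
    repo_id="mistralai/Mistral-7B-v0.1",
)
chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Tell me about {topic}"))

# The handler prints chain start/end events as the call runs
chain.run(topic="embeddings", callbacks=[StdOutCallbackHandler()])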

Mistral-7B

Mistral-7B is a powerful, currently open-sourced language model with 7.3 billion parameters that outperforms many state-of-the-art models with more parameters. It can be downloaded for offline use, used in the cloud, or accessed through HuggingFace. Using HuggingFaceHub from LangChain, you can load and use Mistral-7B with the following code:

repo_id = "mistralai/Mistral-7B-v0.1"
llm = HuggingFaceHub(
    huggingfacehub_api_token='your huggingface access token here',
    repo_id=repo_id,
    model_kwargs={"temperature": 0.2, "max_new_tokens": 50},
)

Hugging Face Embeddings

An embedding is a numerical representation of a piece of data in the form of a multidimensional floating-point vector. You can have embeddings for text, images, audio, video, documents, etc. An embedding is not just any numerical representation; it is a numerical representation that captures the contextual and semantic meaning of the data it represents.

A pre-trained model can be used for creating embeddings. The Sentence Transformers library from HuggingFace offers many such models. You can install it like this:

pip install -U sentence-transformers

Then use it to load a pre-trained model to encode your text sentences.
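
For example (I picked the small all-MiniLM-L6-v2 model here for illustration; any Sentence Transformers model works the same way):

from sentence_transformers import SentenceTransformer

# Load a small pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["Reading PDFs is part of my life.", "My students read a lot of PDFs."]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence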

chromadb Vector Store

Chroma is an open-source embedding database (vector store) for creating, storing, retrieving, and running semantic search over your embeddings. You can install it like this:

pip install chromadb

It allows you to connect to a Chroma client, create a collection, add documents with metadata and ids to the collection (this step creates the embeddings), and then query the collection (semantic retrieval).
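
A minimal sketch of that workflow (the two documents, metadata, and ids are made-up examples):

import chromadb

# An in-memory Chroma client; no server needed
client = chromadb.Client()
collection = client.create_collection("my_docs")

# Adding documents creates embeddings using Chroma's default embedding function
collection.add(
    documents=["Mistral-7B is an open LLM.", "LangChain builds LLM apps."],
    metadatas=[{"source": "blog"}, {"source": "docs"}],
    ids=["doc1", "doc2"],
)

# Semantic retrieval: find the document closest in meaning to the query
results = collection.query(query_texts=["open source language model"], n_results=1)
print(results["documents"])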

pypdf library

I have used the pypdf library, which can read, split, merge, crop, and transform pages of PDF files; add custom data; change viewing options; add passwords; and retrieve text and metadata from PDF files. You need to install it before use:

pip install pypdf

To use pypdf with AES encryption or decryption, install the extra dependencies:

pip install pypdf[crypto]
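
A quick sketch of reading text and metadata with pypdf (assuming the same report.pdf used in the full program below):

from pypdf import PdfReader

reader = PdfReader("report.pdf")
print(len(reader.pages))               # number of pages
print(reader.metadata)                 # document metadata
print(reader.pages[0].extract_text())  # text of the first page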

Code:

# Install dependencies
!pip install huggingface_hub
!pip install chromadb
!pip install langchain
!pip install pypdf
!pip install sentence-transformers

# Import required libraries
import sys
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFaceHub
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain

# Load the pdf file
loader = PyPDFLoader('report.pdf')
documents = loader.load()

# Split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# We will use HuggingFace embeddings
embeddings = HuggingFaceEmbeddings()

# Use the Chroma vector database to store and retrieve embeddings of our text
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever(search_kwargs={'k': 2})

# We are using Mistral-7B for this question answering
repo_id = "mistralai/Mistral-7B-v0.1"
llm = HuggingFaceHub(
    huggingfacehub_api_token='your huggingface access token here',
    repo_id=repo_id,
    model_kwargs={"temperature": 0.2, "max_new_tokens": 50},
)

# Create the Conversational Retrieval Chain
qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever, return_source_documents=True)

# Run an infinite loop to ask questions to the LLM and retrieve answers until the user wants to quit
chat_history = []
while True:
    query = input('Prompt: ')
    # To exit: type 'exit', 'quit', or 'q'
    if query.lower() in ["exit", "quit", "q"]:
        print('Exiting')
        sys.exit()
    result = qa_chain({'question': query, 'chat_history': chat_history})
    print('Answer: ' + result['answer'] + '\n')
    chat_history.append((query, result['answer']))

There you have it: your own question-answering system that answers all your questions from a long, difficult PDF. Now, go on and explore more.

References:

  1. https://medium.com/@woyera/how-to-chat-with-your-pdf-using-python-llama-2-41df80c4e674
  2. https://www.shakudo.io/blog/build-pdf-bot-open-source-llms
