Generative AI QA Model Using Chroma and Mistral 7B

Alekh Sinha
Published in DataSeries · Feb 15, 2024

I recently completed the course 'Generative AI with Large Language Models' on Coursera, and it inspired me to do a hobby project in that field. I am writing this blog to explain that project.

Books contain vast reservoirs of knowledge and play a crucial role in enriching our understanding and learning. This project is designed to tap into that knowledge. On receiving a query from the user, it finds the relevant extracts from a book and generates an answer based on the query and those extracts. All my code is available at https://github.com/Alekh-sinha/Generative-AI-QA-Model.

Fig. 1: Process flow diagram

This project can be divided into the following segments:

  • Data Acquisition
  • Updating the Vector Database
  • RAG (Retrieval-Augmented Generation) with Maximal Marginal Relevance
  • LLM Model Inference
  • Training with LoRA (Low-Rank Adaptation)

Data Acquisition: In this project, data acquisition is done by extracting data from the internet and by uploading PDFs into a cloud bucket. While searching for easier ways to extract data from websites, I came across MarkupLM (https://huggingface.co/docs/transformers/model_doc/markuplm). This is a Hugging Face model that comes with a preprocessor, MarkupLMFeatureExtractor (https://huggingface.co/docs/transformers/v4.37.2/en/model_doc/markuplm#transformers.MarkupLMFeatureExtractor), which can extract the text under every xpath of a page. We can also build an extractive QA model using MarkupLM.

MarkupLM diagram. Source: https://huggingface.co/docs/transformers/model_doc/markuplm

from transformers import AutoProcessor, MarkupLMForQuestionAnswering
import torch

processor = AutoProcessor.from_pretrained("microsoft/markuplm-base-finetuned-websrc")
model = MarkupLMForQuestionAnswering.from_pretrained("microsoft/markuplm-base-finetuned-websrc")

html_string = "<html> <head> <title>My name is Niels</title> </head> </html>"
question = "What's his name?"

encoding = processor(html_string, questions=question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = encoding.input_ids[0, answer_start_index : answer_end_index + 1]
processor.decode(predict_answer_tokens).strip()

The code block above demonstrates MarkupLM's ability to answer a question directly from an HTML file. This is very useful because, in an HTML page, information is scattered across different xpaths, so reading the text of each xpath separately can give better insight into the page. In this project, however, I have used MarkupLM only to extract data from HTML pages.
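
For that extraction step, the feature extractor alone is enough. Below is a minimal sketch of pulling the text nodes and their xpaths out of a page; the HTML string here is just a placeholder:

from transformers import MarkupLMFeatureExtractor

feature_extractor = MarkupLMFeatureExtractor()
html_string = "<html><body><h1>Chapter 1</h1><p>Some text from the page.</p></body></html>"

encoding = feature_extractor(html_string)
print(encoding["nodes"])   # nested list: the text nodes found in the page
print(encoding["xpaths"])  # the xpath of each text node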

For PDFs, I am using pypdf (https://pypi.org/project/pypdf/) to extract text from the uploaded files.
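
A minimal sketch of the pypdf extraction, assuming the PDF has already been downloaded from the bucket (the file name is just a placeholder):

from pypdf import PdfReader

reader = PdfReader("book.pdf")  # placeholder path for a downloaded PDF
text = "\n".join(page.extract_text() or "" for page in reader.pages)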

Updating the Vector Database: For storing vectors, I am using the Chroma database. I wrote code that checks whether the database already exists; if it does, the code updates it with the new text data after converting the text to vectors. To keep track of files, I created two folders: new files land in the first, and once the database has been updated with them, they are moved to the second.

import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
)

class embedding:
    def __init__(self, cfg):
        # Split first by structure (paragraphs, lines, sentences), then by token count
        self.character_splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\n", ". ", " ", ""], chunk_size=1000, chunk_overlap=0)
        self.embedding_function = SentenceTransformerEmbeddingFunction()
        self.chroma_client = chromadb.PersistentClient(path="/content/chromadb")
        self.token_splitter = SentenceTransformersTokenTextSplitter(
            chunk_overlap=0, tokens_per_chunk=cfg.tokens_per_chunk)

    def collection_update(self, cfg, text):
        self.character_split_texts = self.character_splitter.split_text(text)
        token_split_texts = []
        for chunk in self.character_split_texts:
            token_split_texts += self.token_splitter.split_text(chunk)
        # Create the collection if it does not exist yet, otherwise reuse it
        self.chroma_collection = self.chroma_client.get_or_create_collection(
            cfg.collection, embedding_function=self.embedding_function)
        l = self.chroma_collection.count()
        print(l)
        # Continue the id sequence from the current collection size
        ids = [str(i + l) for i in range(len(token_split_texts))]
        self.chroma_collection.add(ids=ids, documents=token_split_texts)

This code first divides the text into chunks and then uses Sentence Transformers to compute an embedding for each chunk. Chroma's get_or_create_collection method creates the collection if it does not already exist, and chroma_collection.add inserts the new documents.
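
The two-folder bookkeeping mentioned above can be as simple as the sketch below; the folder names are hypothetical, and the real paths live in the project config:

import shutil
from pathlib import Path

new_dir = Path("new_files")              # files not yet in the database
processed_dir = Path("processed_files")  # files already embedded

emb = embedding(cfg)  # cfg as defined in the project config
for path in new_dir.iterdir():
    text = path.read_text()  # or pypdf / MarkupLM extraction for PDF / HTML files
    emb.collection_update(cfg, text)
    shutil.move(str(path), str(processed_dir / path.name))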

RAG (Retrieval-Augmented Generation): This is a very important piece of this project. When the user submits a query as text, the text is converted to a vector using the same embedding model that was used to build the database. A similarity score is then computed between the query vector and each vector in the database, and the documents with the best scores are returned. Chroma offers three similarity/distance functions: squared L2, cosine similarity, and inner product.
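
The distance function is chosen per collection. A minimal sketch, with a placeholder collection name and query:

import chromadb

chroma_client = chromadb.PersistentClient(path="/content/chromadb")
collection = chroma_client.get_or_create_collection(
    "books", metadata={"hnsw:space": "cosine"})  # "l2" (default), "cosine", or "ip"

results = collection.query(query_texts=["Who is the protagonist?"], n_results=5)
print(results["documents"][0])  # top-5 chunks for the first (and only) query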

Now suppose I have requested the top 5 documents; it may well happen that all of them contain the same information. To avoid such scenarios, we can use Maximal Marginal Relevance (MMR).

MMR(D) = λ · Sim(D, Q) − (1 − λ) · max(Sim(D, Di))

Where:

  • D is the document being scored.
  • Q is the query.
  • Di represents each previously selected document.
  • Sim(D, Q) is the similarity score between document D and the query Q, typically computed using a similarity metric such as cosine similarity.
  • max(Sim(D, Di)) represents the maximum similarity score between document D and any previously selected document Di.
  • λ is a parameter that controls the trade-off between relevance and diversity. It typically ranges between 0 and 1, where higher values prioritize relevance over diversity, and lower values prioritize diversity over relevance.

As the formula indicates, MMR ensures that the selected documents are similar to the query but not very similar to each other.
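
Chroma/LangChain handle this internally when the retriever is created with search_type="mmr" (shown in the next code block), but a bare-bones version of the greedy selection looks roughly like this; it is an illustrative sketch, not the library implementation:

import numpy as np

def mmr_select(query_vec, doc_vecs, k=5, lam=0.7):
    # Greedily pick k documents, trading relevance to the query
    # against redundancy with the documents already picked.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected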

import chromadb
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

class inference:
    def __init__(self, cfg):
        self.chroma_client = chromadb.PersistentClient(path="/content/chromadb")
        self.embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
        self.chroma_collection = self.chroma_client.get_collection(cfg.collection)
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.1", low_cpu_mem_usage=True)
        # Wrap the Chroma collection as a LangChain retriever that uses MMR
        self.langchain_chroma = Chroma(client=self.chroma_client,
                                       collection_name=cfg.collection,
                                       embedding_function=self.embedding_function)
        self.retriever = self.langchain_chroma.as_retriever(
            search_type="mmr", search_kwargs={"k": 5})

    def generate_text(self, query):
        # Plain similarity search (without MMR) would be:
        # self.results = self.chroma_collection.query(query_texts=[query], n_results=5)
        # self.retrieved_documents = self.results['documents'][0]
        self.retrieved_documents = self.retriever.get_relevant_documents(query)
        self.retrieved_documents = [i.page_content for i in self.retrieved_documents]
        self.information = "\n\n".join(self.retrieved_documents)
        prompt_template = """
Answer the question based on the context below. Respond "Unsure about answer" if not sure about the answer.
question: {question}
context: {context}
"""
        message = prompt_template.format(question=query, context=self.information)
        inputs = self.tokenizer(message, return_tensors="pt")
        outputs = self.model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.5)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True).replace(message, '')

LLM Model Inference: The retrieved documents, along with the query, are combined into a prompt, which is first passed through the tokenizer and then through the LLM. The model used here is Mistral-7B. One of the major concerns in any LLM project is the memory required to host the model. Mistral-7B has approximately 7 billion parameters; if we load it in 32-bit format, each parameter takes 4 bytes, which roughly translates to 28 GB. I had chosen a 50 GB system, so that was sufficient for inference.

If you are training in 32-bit precision, consider roughly 4 bytes × number of parameters (for the weights) plus about 20 bytes × number of parameters (additional memory for gradients, optimizer states and activations) as the minimum memory requirement. This makes training a large model a real problem. To solve it, we have LoRA (Low-Rank Adaptation) (https://arxiv.org/pdf/2106.09685.pdf).
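
As a back-of-envelope check of those numbers for Mistral-7B, using the rough 4-byte and 20-byte rules of thumb above:

params = 7e9                            # ~7 billion parameters
inference_gb = params * 4 / 1e9         # weights only, fp32 -> ~28 GB
training_gb = params * (4 + 20) / 1e9   # weights + training overhead -> ~168 GB
print(f"inference: ~{inference_gb:.0f} GB, full fine-tuning: ~{training_gb:.0f} GB")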

Training with LoRA (Low-Rank Adaptation): This is based on matrix decomposition. Suppose you have a matrix of dimension 300×300. It can be approximated by the product of two matrices of dimensions 300×12 and 12×300. This decomposition reduces the total number of parameters from 90,000 to 2×(300×12) = 7,200, which is a significant reduction.

Source: https://www.youtube.com/watch?v=dA-NhCtrrVE

In terms of the weight update, h = W0x + ∆Wx = W0x + BAx. Here BA is the product of the two decomposed matrices, and it is added to the frozen pre-trained weight W0; only A and B are trained. The authors also apply a scaling factor to the low-rank update.
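
A small numpy sketch of that forward pass, following the paper's conventions (A initialised randomly, B initialised to zero, update scaled by alpha/r); the dimensions here are arbitrary:

import numpy as np

d, r, alpha = 300, 8, 16
x = np.random.randn(d)
W0 = np.random.randn(d, d)         # frozen pre-trained weight
A = np.random.randn(r, d) * 0.01   # trainable low-rank factor (r x d)
B = np.zeros((d, r))               # trainable low-rank factor (d x r), starts at zero

h = W0 @ x + (alpha / r) * (B @ (A @ x))   # h = W0 x + scaled BA x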

from peft import LoraConfig

config = LoraConfig(
    r=8,              # rank of the low-rank matrices
    lora_alpha=16,    # scaling factor
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # conventional value
    task_type="CAUSAL_LM",
)

Here lora_alpha is the scaling factor and r is the rank of the low-rank matrices; in the 300×12 example above, 12 is the rank. In target_modules we give the names of the modules to which the decomposition should be applied. These names come from the LLM itself: they are the modules that contain the attention (and, here, MLP) projection layers.
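
A sketch of wrapping the base model with this config using peft (model here is the Mistral-7B model loaded earlier):

from peft import get_peft_model

peft_model = get_peft_model(model, config)  # injects LoRA adapters into the listed modules
peft_model.print_trainable_parameters()     # only a small fraction of weights are trainable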

For evaluation, I have used the ROUGE metric (https://pypi.org/project/rouge-score/).
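
A minimal sketch of scoring a generated answer against a reference answer with the rouge-score package (the two strings are placeholders):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score("the reference answer", "the generated answer")
print(scores["rougeL"].fmeasure)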

Now let's see how these pieces add up.

Output after deployment

In this image, the blue box shows the context, i.e. the relevant text extracted from the book (relevance is based on the question). The lower box is the answer generated by the LLM based on that context.
