Optimizing Your Query and Getting Relevant Answers with Chroma DB Vector Database

Suraj Meshram
8 min readJan 15, 2024

When you want a language model to produce a specific kind of output, prompting and fine-tuning will get you a long way. Common tasks like sentiment analysis and named entity recognition often yield satisfactory results without any additional knowledge. However, for more intricate or knowledge-intensive tasks, building a system that leverages external knowledge sources becomes necessary to achieve optimal results. This is where RAG steps in and plays a crucial role.

In the image above, a robot is searching a library for a relevant book to answer its query. Credit: https://wepik.com/ai

Overview of Embedding-Based Retrieval:

RAG, which stands for Retrieval Augmented Generation, is a technique designed to extract relevant information from a given set of documents or a knowledge base. It typically employs techniques like information retrieval or semantic search to identify the most relevant pieces of information based on a given query.

RAG allows the generation of contextually relevant and domain-specific information. Currently, many systems employ simple RAG techniques based on semantic similarity and embeddings. The overall system diagram below illustrates how it works in practice.

Fig. Workflow of a RAG system

The process of retrieval augmented generation starts with a user query. You already have a set of documents that you have previously embedded and stored in your retrieval system. You run the user's query through the same embedding model used for the documents, generating an embedding for the query.

After embedding the query, the retrieval system identifies the most relevant documents by finding the nearest-neighbor embeddings. The system then passes both the query and the pertinent documents to the LLM (large language model), which synthesizes information from the retrieved documents to generate a comprehensive answer.
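In code, the loop described above looks roughly like the sketch below. It is only an illustration: embed, nearest_neighbors, and ask_llm are hypothetical placeholders for the embedding model, vector store, and LLM call that the rest of this article implements with Chroma and OpenAI.

# Illustrative sketch only -- embed(), nearest_neighbors() and ask_llm() are hypothetical placeholders
doc_embeddings = [embed(doc) for doc in documents]          # done once, ahead of time

def answer(query):
    query_embedding = embed(query)                          # same embedding model as the documents
    relevant_docs = nearest_neighbors(query_embedding, doc_embeddings)
    return ask_llm(query, relevant_docs)                    # LLM synthesizes the final answer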

Retrieving query answers from embedding vectors comes with challenges: a plain similarity search may return passages on similar topics without providing the exact answer. In this article, you will walk through a technique that delivers more accurate and precise results.

Chroma DB

Chroma DB is an open-source vector storage system, also known as a vector database, created to store and retrieve vector embeddings. Its main purpose is to store embeddings along with their associated metadata for later use by large language models.

Additionally, it serves as a robust foundation for semantic search engines that operate on textual data. A vector database is well suited to handling large amounts of unstructured and semi-structured data: it lets you store embeddings and their metadata, embed documents and queries, and efficiently search over embeddings.

In this article, I’ll demonstrate simple usage of Chroma DB. If you wish to delve deeper, please refer to the links I’ve included at the end of this article. To begin, install the most up-to-date chromadb Python library.

1. Install

Install the module using the following command in your terminal. For Windows users, you can simply use the pip command in the Command Prompt (cmd). If you are using a different operating system, please check for alternative commands tailored to your system.

pip install chromadb

2. Get the Chroma client

Next, create an object for the Chroma DB client by executing the appropriate code.

import chromadb
chroma_client = chromadb.Client()

3. Create a collection

Collections serve as the repository for your embeddings, documents, and any supplementary metadata. To create a collection, specify a name using the following syntax:

collection = chroma_client.create_collection(name="my_documents")  # a collection for your documents / external data source

4. Add some text documents to the collection

Chroma will automatically handle the storage of your text, as well as the processes of tokenization, embedding, and indexing.

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

If you have previously generated embeddings, you can load them directly into the system using:

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

5. Query the collection

You can query the collection with a list of query texts, and Chroma will return the most similar results. It’s that easy!

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)

Now, let’s dive in and demonstrate how this works in practice.

I have a PDF document containing Microsoft's 2022 annual report. I'll read it, convert it into embedding vectors, and attempt to retrieve answers to queries. If you're interested, you can view the PDF in your browser here.

Let’s begin with the code section. Firstly, we import all the necessary libraries required for the entire code execution.

from helper_utils import word_wrap  # helper functions from utilities
from pypdf import PdfReader         # read the PDF file
import chromadb
import os
import openai
from openai import OpenAI

reader = PdfReader("microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0]))

To begin, we’ll import some helper functions from our utilities. This function is essentially a basic word-wrap function that enables us to view documents in a neatly formatted manner.

We’re going to read from a PDF, so we’ll import the PDF Reader. This is a straightforward Python package that you can easily import to extract all the text into the pdf_texts variable.

Now, I’m going to use some helpful utilities from LangChain: the recursive character text splitter and the sentence transformers token text splitter.

The character_splitter lets us recursively divide text based on a list of separator characters. The recursive character text splitter first looks for double newlines and splits on them; if the resulting chunks are still larger than our target chunk size, it keeps splitting on the remaining separators (single newlines, sentence boundaries, spaces).

from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\nTotal chunks: {len(character_split_texts)}")

We utilize the sentence transformer token text splitter, and I’ll explain the reasoning behind this choice in just a moment.

token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")

The recursive character text splitter gives us 347 chunks in total for this annual report. But character splitting on its own isn't quite enough. The reason is that the embedding model we use, a sentence transformer, has a limited context window: it can only look at 256 tokens at a time. That's why we split the character chunks again with the token splitter, using 256 tokens per chunk, which produces 349 chunks. This is a minor pitfall to keep in mind whenever you use a sentence transformer as your embedding model.

The Sentence Transformer embedding model is essentially an extension of the BERT transformer architecture. While the BERT architecture embeds each token individually, the Sentence Transformer model pools those token embeddings into a single fixed-size vector for the whole sentence or chunk.
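To make that concrete, here is a minimal sketch, assuming the sentence-transformers package is installed and using the all-MiniLM-L6-v2 model (the same model Chroma's default embedding function uses). It shows the 256-token context window that motivated the token splitter above, and that each chunk becomes one fixed-length vector:

from sentence_transformers import SentenceTransformer  # assumed to be installed alongside chromadb

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)   # 256 tokens -- the context window that motivated the token splitter
vector = model.encode("This is an example sentence.")
print(vector.shape)           # (384,) -- one dense vector per piece of text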

Now, that’s the first step in any retrieval augmented generation system. The next step is to load all the chunks into our retrieval system. To utilize Chroma, we need to import Chroma itself.

We’re going to use the Sentence Transformer embedding function, which maps each chunk of text to a single dense vector of floating-point numbers. By default it uses the all-MiniLM-L6-v2 model, which produces 384-dimensional embeddings.


from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

The next step is to set up Chroma. We use the default Chroma client and create a new Chroma collection called microsoft_annual_report_2022, passing in the embedding function we defined before. We also create an ID for each text chunk; each ID is simply the string of the chunk's position in token_split_texts.

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

Now we can pass in our query text, asking for five results.

query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

And what we’re going to output now are essentially the retrieved documents themselves.

Now that you understand how to retrieve relevant answers from the embedding vector database using Chroma DB, the next step is to use these results in conjunction with a large language model (LLM) to answer our query. We're going to use GPT for this, and we need to do a little bit of setup to have an OpenAI client. We'll load our OpenAI API key from the environment to authenticate.

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

Now, we’re going to define a function that allows us to call out to the model using our retrieved results along with our query. We’re going to use GPT-3.5 Turbo, which does a reasonably good job in RAG (Retrieval-Augmented Generation) loops and is fairly quick and efficient. The first thing is to pass in our query and the retrieved documents.

def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))
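Once everything is wired up, answering another question is just another retrieve-then-generate round trip. Here is a small usage sketch reusing the same collection and rag() function; the follow-up question is an arbitrary example, not one from the original notebook.

new_query = "Describe what Azure is."   # arbitrary example question
results = chroma_collection.query(query_texts=[new_query], n_results=5)
new_output = rag(query=new_query, retrieved_documents=results['documents'][0])
print(word_wrap(new_output))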

You can find all the code in this Notebook.

In summary,

This article introduced Chroma DB, an open-source vector storage system tailored for managing vector embeddings. It covered installing the chromadb Python library, creating a Chroma DB client object, and setting up a Chroma collection. Exploring text-splitting techniques from LangChain, it addressed the limitation imposed by the embedding model's context window.

It then demonstrated querying Chroma DB, retrieving results, and integrating the retrieved passages with GPT-3.5 Turbo for answer synthesis. Step by step, the article equips readers with the tools to use Chroma DB effectively in a retrieval augmented generation system, showcasing its role in storing, retrieving, and synthesizing information for better language model answers.

References:

To learn more about the Chroma embedding vector database:

  1. https://docs.trychroma.com/usage-guide
  2. https://docs.trychroma.com/getting-started
  3. https://www.sbert.net/docs/installation.html#install-sentencetransformers

Connect with me: https://www.linkedin.com/in/suraj-meshram-009090168/

For the full code: https://github.com/surajmesh/Genrative_AI/blob/master/cromaDB.ipynb
