Building a RAG System with Gemini and ChromaDB

Large Language Models (LLMs) like Google’s Gemini are transforming how intelligent question-answering systems are built. This article explores the development of a Retrieval Augmented Generation (RAG) system using Gemini and ChromaDB, demonstrating how their integration improves the reliability and efficiency of QnA platforms.

Gemini Overview

Google’s Gemini launched as Bard in March 2023 as a strategic response to OpenAI’s ChatGPT and was renamed Gemini in February 2024, with availability expanding to additional countries. Built on sophisticated language models, it generates human-like text, and as a pivotal component of a Retrieval Augmented Generation (RAG) system, its API turns retrieved context into coherent, contextually relevant answers, significantly enhancing the functionality of QnA systems. This integration matters because it addresses inherent limitations of LLMs, such as outdated or overly generic responses, by grounding answers in retrieved source material, keeping responses accurate and timely in dynamic interaction environments.

Source: https://blog.google/technology/ai/google-gemini-ai/#sundar-note

Vector Databases and ChromaDB’s Role in Enhancing RAG Systems

The efficiency of RAG systems heavily relies on vector databases, which store and retrieve embeddings — vector representations of text data. These databases are crucial for the rapid retrieval of pertinent information, thereby enhancing system responsiveness without the need for reprocessing large datasets.
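
To make this concrete, here is a minimal sketch in plain Python of the cosine-similarity comparison a vector database performs at scale. The three-dimensional toy vectors are invented for illustration; real embeddings have hundreds of dimensions.

import math

def cosine_similarity(a, b):
    # Measures how closely two embedding vectors point in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" of a query and a document
query_vec = [0.1, 0.9, 0.2]
doc_vec = [0.2, 0.8, 0.1]
print(cosine_similarity(query_vec, doc_vec))  # ~0.99: nearly the same direction, i.e. similar meaning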

ChromaDB, designed specifically for AI applications that involve embeddings, integrates seamlessly into RAG systems. With built-in functionalities optimized for local operations and a hosted version on the horizon, ChromaDB stands out as an exemplary choice for developers looking to leverage the full potential of RAG systems. Its ability to efficiently handle embeddings makes it indispensable for providing the memory capacity necessary to support the advanced computational requirements of systems like Gemini, further streamlining the process from data retrieval to response generation.
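
As a minimal sketch of that retrieval workflow, the snippet below uses an in-memory client and ChromaDB's built-in default embedding function, purely for illustration; the implementation later in this article uses a persistent client and a custom Gemini embedding function instead.

import chromadb

# In-memory client; the full implementation below uses PersistentClient
client = chromadb.Client()
collection = client.create_collection(name="demo")

# Store two short documents; ChromaDB embeds them with its default model
collection.add(
    documents=["Gemini is Google's LLM.", "ChromaDB stores embeddings."],
    ids=["0", "1"],
)

# Retrieve the document closest in embedding space to the query
results = collection.query(query_texts=["Which database stores vectors?"], n_results=1)
print(results["documents"][0][0])  # prints the nearest document (here, the ChromaDB sentence)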

Implementation: Building a RAG System with Gemini and ChromaDB

The objective of this implementation is to construct a Retrieval Augmented Generation (RAG) system that combines Google’s Gemini LLM with the ChromaDB vector database. The setup enables querying a specific document, in this case a Google white paper, in a QnA format: relevant passages are retrieved from the document and the LLM generates answers from them.

Source: https://services.google.com/fh/files/misc/ai_adoption_framework_whitepaper.pdf

Step-by-Step Implementation Guide

1. Data Acquisition and Text Extraction

  • Downloading the Document: The first step fetches the document from its URL using the requests library and writes the response content to a local file.
def download_pdf(url, save_path):
    response = requests.get(url)
    with open(save_path, 'wb') as f:
        f.write(response.content)

# URL and local path for the PDF
pdf_url = "https://services.google.com/fh/files/misc/ai_adoption_framework_whitepaper.pdf"
pdf_path = "ai_adoption_framework_whitepaper.pdf"  # change this to a specific path if needed
download_pdf(pdf_url, pdf_path)
  • Loading and Extracting Text from PDF: The PdfReader from the pypdf library is utilized to load and extract text from each page of the PDF. This process converts the document's content into a string format that can be processed further.
def load_pdf(file_path):
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text
    return text

pdf_text = load_pdf(pdf_path)

2. Text Processing and Embedding

  • Text Splitting: The extracted text is then split into manageable chunks using regular expressions. This helps in processing large documents efficiently by breaking them into smaller sections based on paragraphs.
def split_text(text):
    return [i for i in re.split(r'\n\n', text) if i.strip()]

chunked_text = split_text(pdf_text)
  • Embedding Generation: Using the GeminiEmbeddingFunction, each text chunk is converted into a vector representation (embedding) using Gemini's API. This step is crucial as it transforms textual data into a format that can be easily queried and compared.
class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        gemini_api_key = os.getenv("GEMINI_API_KEY")
        genai.configure(api_key=gemini_api_key)
        model = "models/embedding-001"
        title = "Custom query"
        return genai.embed_content(model=model,
                                   content=input,
                                   task_type="retrieval_document",
                                   title=title)["embedding"]
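
As a quick sanity check (a hypothetical snippet, assuming GEMINI_API_KEY is already set in the environment), the function can be invoked directly to inspect the shape of its output:

embed_fn = GeminiEmbeddingFunction()
vectors = embed_fn(["What does the AI adoption framework cover?"])
print(len(vectors), len(vectors[0]))  # one vector; embedding-001 produces 768-dimensional embeddings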

3. Vector Database Setup

  • Database Creation: ChromaDB is set up to store these embeddings. The database is created locally, and each document’s embedding is stored with a unique identifier.
def create_chroma_db(documents: List[str], path: str, name: str):
    chroma_client = chromadb.PersistentClient(path=path)
    db = chroma_client.create_collection(name=name, embedding_function=GeminiEmbeddingFunction())
    for i, d in enumerate(documents):
        db.add(documents=[d], ids=[str(i)])
    return db, name

db, db_name = create_chroma_db(chunked_text, db_path, db_name)
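
One caveat: create_collection raises an error if a collection with the same name already exists, so re-running the script fails at this step. Below is a sketch of an idempotent variant using ChromaDB's get_or_create_collection; the function name is hypothetical, not part of the original implementation.

def get_or_create_chroma_db(documents: List[str], path: str, name: str):
    chroma_client = chromadb.PersistentClient(path=path)
    # Reuses the existing collection instead of raising if it already exists
    db = chroma_client.get_or_create_collection(name=name, embedding_function=GeminiEmbeddingFunction())
    if db.count() == 0:  # only embed and insert documents on the first run
        db.add(documents=documents, ids=[str(i) for i in range(len(documents))])
    return db, name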

4. Query Processing and Response Generation

  • Loading the Collection and Retrieving Data: Based on user queries, the system searches the database for the most relevant text passages. (The load_chroma_collection helper that reopens the persisted collection appears in the full listing at the end of the article.)
def get_relevant_passage(query: str, db, n_results: int):
    results = db.query(query_texts=[query], n_results=n_results)
    # results['documents'] holds one list of passages per query;
    # return the passage list for our single query
    return results['documents'][0]
  • Prompt Construction and Answer Generation: A prompt is constructed to contextualize the query with the retrieved passage, which is then fed into Gemini to generate a coherent and relevant response.
def make_rag_prompt(query: str, relevant_passage: str):
    escaped_passage = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
    prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below...
QUESTION: '{query}'
PASSAGE: '{escaped_passage}'
ANSWER:
"""
    return prompt

def generate_answer(prompt: str):
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel('gemini-pro')
    result = model.generate_content(prompt)
    return result.text
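
Putting these pieces together for a single query looks like this, mirroring the flow of the full listing at the end of the article:

query = "What is the AI Maturity Scale?"
relevant_text = get_relevant_passage(query, db, n_results=1)
final_prompt = make_rag_prompt(query, "".join(relevant_text))
print(generate_answer(final_prompt))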

5. Interactive User Interface

  • Interactive Query Handling: An interactive function allows users to input their queries, which the system processes to generate and display answers.
def process_query_and_generate_answer():
    query = input("Please enter your query: ")
    ...
    print("Generated Answer:", answer)

process_query_and_generate_answer()

Interactive Demonstration: Query and Response Visualization

For a practical demonstration, we will show how the RAG system processes a query about the “AI Maturity Scale” from a Google white paper.

Query Submitted: “Explain The AI Maturity Scale in the paper.”

System Response: (screenshot of the generated answer)

Corresponding White Paper Segment: (screenshot of the matching passage in the white paper)

The RAG system effectively summarizes the “AI Maturity Scale” from the white paper, accurately describing its structure and purpose through clear and concise language. The response demonstrates the system’s ability to distill complex information into accessible and contextually relevant insights, aligning closely with the source material.

Full Code

!pip install pypdf
!pip install google-generativeai
!pip install chromadb
# typing is part of the Python standard library, so no install is needed
import requests
from pypdf import PdfReader
import os
import re
import google.generativeai as genai
from chromadb import Documents, EmbeddingFunction, Embeddings
import chromadb
from typing import List

# Download the PDF from the specified URL and save it to the given path
def download_pdf(url, save_path):
    response = requests.get(url)
    with open(save_path, 'wb') as f:
        f.write(response.content)

# URL and local path for the PDF document
pdf_url = "https://services.google.com/fh/files/misc/ai_adoption_framework_whitepaper.pdf"
pdf_path = "ai_adoption_framework_whitepaper.pdf"
download_pdf(pdf_url, pdf_path)

# Load the PDF file and extract text from each page
def load_pdf(file_path):
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text
    return text

pdf_text = load_pdf(pdf_path)

# Set and validate the API key for Gemini API
os.environ['GEMINI_API_KEY'] = '<your api key>'
gemini_api_key = os.getenv("GEMINI_API_KEY")
if not gemini_api_key:
    raise ValueError("Gemini API Key not provided or incorrect. Please provide a valid GEMINI_API_KEY.")
try:
    genai.configure(api_key=gemini_api_key)
    print("API configured successfully with the provided key.")
except Exception as e:
    print("Failed to configure API:", str(e))

# Split the text into chunks based on double newlines
def split_text(text):
    return [i for i in re.split(r'\n\n', text) if i.strip()]

chunked_text = split_text(pdf_text)

# Define a custom embedding function using Gemini API
class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        gemini_api_key = os.getenv("GEMINI_API_KEY")
        genai.configure(api_key=gemini_api_key)
        model = "models/embedding-001"
        title = "Custom query"
        return genai.embed_content(model=model,
                                   content=input,
                                   task_type="retrieval_document",
                                   title=title)["embedding"]

# Create directory for database if it doesn't exist
db_folder = "chroma_db"
if not os.path.exists(db_folder):
    os.makedirs(db_folder)

# Create a Chroma database with the given documents
def create_chroma_db(documents: List[str], path: str, name: str):
    chroma_client = chromadb.PersistentClient(path=path)
    db = chroma_client.create_collection(name=name, embedding_function=GeminiEmbeddingFunction())
    for i, d in enumerate(documents):
        db.add(documents=[d], ids=[str(i)])
    return db, name

# Specify the path and collection name for Chroma database
db_name = "rag_experiment"
db_path = os.path.join(os.getcwd(), db_folder)
db, db_name = create_chroma_db(chunked_text, db_path, db_name)

# Load an existing Chroma collection
def load_chroma_collection(path: str, name: str):
    chroma_client = chromadb.PersistentClient(path=path)
    return chroma_client.get_collection(name=name, embedding_function=GeminiEmbeddingFunction())

db = load_chroma_collection(db_path, db_name)

# Retrieve the most relevant passages based on the query
def get_relevant_passage(query: str, db, n_results: int):
    results = db.query(query_texts=[query], n_results=n_results)
    # results['documents'] holds one list of passages per query;
    # return the passage list for our single query
    return results['documents'][0]

query = "What is the AI Maturity Scale?"
relevant_text = get_relevant_passage(query, db, n_results=1)

# Construct a prompt for the generation model based on the query and retrieved data
def make_rag_prompt(query: str, relevant_passage: str):
    escaped_passage = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
    prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below.
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and
strike a friendly and conversational tone.
QUESTION: '{query}'
PASSAGE: '{escaped_passage}'

ANSWER:
"""
    return prompt

# Generate an answer using the Gemini Pro API
def generate_answer(prompt: str):
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    if not gemini_api_key:
        raise ValueError("Gemini API Key not provided. Please provide GEMINI_API_KEY as an environment variable")
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel('gemini-pro')
    result = model.generate_content(prompt)
    return result.text

# Construct the prompt and generate the answer
final_prompt = make_rag_prompt(query, "".join(relevant_text))
answer = generate_answer(final_prompt)
print(answer)

# Interactive function to process user input and generate an answer
def process_query_and_generate_answer():
    query = input("Please enter your query: ")
    if not query:
        print("No query provided.")
        return
    db = load_chroma_collection(db_path, db_name)
    relevant_text = get_relevant_passage(query, db, n_results=1)
    if not relevant_text:
        print("No relevant information found for the given query.")
        return
    final_prompt = make_rag_prompt(query, "".join(relevant_text))
    answer = generate_answer(final_prompt)
    print("Generated Answer:", answer)

# Invoke the function to interact with user
process_query_and_generate_answer()

Conclusion

The integration of Google’s Gemini with ChromaDB to create a Retrieval Augmented Generation (RAG) system proves effective at delivering precise, contextually relevant answers drawn from extensive documents. This application underscores the potential of advanced LLMs and vector databases to transform information retrieval and interaction across various sectors.
