Building a Naive RAG from scratch with LlamaIndex, OpenAI, and Chroma

Vincent
5 min read · Sep 29, 2024

As AI continues to revolutionize various sectors, integrating knowledge retrieval with language models to build powerful applications is becoming increasingly popular. One such approach is Retrieval-Augmented Generation (RAG), which combines information retrieval and generative models to deliver contextually accurate responses.

In this guide, I will walk you through building a simple RAG system from scratch using LlamaIndex, OpenAI, and Chroma. The system will retrieve relevant content from indexed documents and generate answers using an OpenAI model.

What We Will Cover

  • Architecture of a RAG
  • Setting up the environment
  • Extracting URLs from a sitemap
  • Fetching and processing web content
  • Vectorizing the content using LlamaIndex
  • Storing vectors in Chroma
  • Running a RAG query to generate responses

Let’s dive in!

Architecture of a RAG

The architecture of a Retrieval-Augmented Generation (RAG) system combines information retrieval techniques with text generation powered by large language models (LLMs). The goal is to provide more accurate and relevant answers by leveraging external data or documents, rather than generating responses solely from the model’s parameters. Here are the main components of this architecture:

Naive RAG architecture

1. Data Source / Knowledge Base

The RAG process starts with a structured or unstructured knowledge base, which can include:

  • Vector database: Stores vector representations of documents, web pages, articles, etc.
  • Document sources: Text files, relational databases, PDFs, websites, or other repositories.

2. Indexing and Embedding

The documents or data are transformed into vector representations (embeddings). These embeddings are produced by an embedding model (such as OpenAI’s embedding models or open-source alternatives) that encodes textual information into a numeric form suitable for comparison. The vectors are then stored in a vector database (such as Chroma) that enables fast and efficient similarity searches.

3. Retrieval of Relevant Documents

When a question is submitted to the RAG system, it is converted into a vector and compared to the vectors stored in the database. The system retrieves the most relevant documents for the query, typically using a cosine similarity measure between vectors. This is the retrieval stage.
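
To make this retrieval step concrete, here is a minimal, illustrative sketch of cosine similarity with toy vectors (the names and values are invented for illustration; in practice the vector database performs this comparison for you at scale):

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" (real embeddings have hundreds of dimensions)
query_vector = np.array([0.2, 0.7, 0.1])
doc_vectors = {
    "page_about_drupal": np.array([0.25, 0.65, 0.05]),
    "page_about_recipes": np.array([0.9, 0.05, 0.4]),
}

# Rank documents by similarity to the query, most similar first
ranking = sorted(doc_vectors.items(),
                 key=lambda item: cosine_similarity(query_vector, item[1]),
                 reverse=True)
print(ranking)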

4. Answer Generation via LLM

After retrieving relevant documents, this information is passed to a text generation model (LLM), such as GPT, to produce an answer.

The model uses the retrieved documents as context to generate a more accurate response, and it can cite the sources used to answer the question. The generated response combines the model’s ability to reason over the retrieved data with its ability to produce coherent text.

Setting Up the Environment

To begin, you need to install the following libraries:

pip install requests beautifulsoup4 lxml
pip install llama-index
pip install chromadb llama-index-vector-stores-chroma
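
The embedding and generation steps below call the OpenAI API, so an API key must be available. One common way (assuming you already have a key) is to expose it through the OPENAI_API_KEY environment variable before running the code:

import os

# Replace the placeholder with your own key, or export OPENAI_API_KEY in your shell
os.environ["OPENAI_API_KEY"] = "sk-..."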

Content Preparation

— Extracting URLs from the Sitemap

We’ll start by parsing a sitemap to extract URLs that will be indexed as part of our knowledge base.

import requests
from bs4 import BeautifulSoup

# Sitemap URL
sitemap_url = 'https://www.eurelis.com/page-sitemap.xml'

# Fetch and parse the sitemap
response = requests.get(sitemap_url)
soup = BeautifulSoup(response.content, 'xml')

# Extract URLs
urls = [url.text for url in soup.find_all('loc')]
print(f"URLs found: {urls}")

— Fetching Content from the URLs

Now that we have the URLs, we’ll fetch the content from each page. The goal is to retrieve the main text from these pages, which we will later vectorize.

def get_page_content(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Extract the main text from the <p> tags
    paragraphs = soup.find_all('p')
    content = '\n'.join([p.get_text() for p in paragraphs])

    return content

contents = [(url, get_page_content(url)) for url in urls]
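
In practice, some URLs may be slow or unreachable. The variant below (a sketch with a hypothetical get_page_content_safe helper, not required for the rest of the tutorial) adds a timeout and skips pages that fail:

def get_page_content_safe(url):
    try:
        page = requests.get(url, timeout=10)
        page.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return ""

    soup = BeautifulSoup(page.content, 'html.parser')
    paragraphs = soup.find_all('p')
    return '\n'.join(p.get_text() for p in paragraphs)

# Keep only the pages that returned some text
contents = [(url, get_page_content_safe(url)) for url in urls]
contents = [(url, text) for url, text in contents if text]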

Vectorizing Content with LlamaIndex

Once we have the content from the URLs, we’ll convert it into document vectors using LlamaIndex and the OpenAI model for embeddings.

— Loading the Documents

from llama_index.core import Document

# Create a list of Document objects
documents = [Document(text=content, metadata={'source': url}) for url, content in contents]

— Storing Vectors with Chroma

For this example, we’ll use Chroma to store the vectors. You can choose between a persistent or ephemeral setup depending on your use case.

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore

# Initialize Chroma client
chroma_client = chromadb.EphemeralClient()

# Create a collection for storing vectors
chroma_collection = chroma_client.get_or_create_collection("knowledge_base")

# Create the vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
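
If you want the vectors to survive between runs, Chroma also offers a persistent client. Here is a sketch of the alternative setup (the ./chroma_db path is just an example):

# Persistent alternative: vectors are written to disk and reloaded on the next run
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("knowledge_base")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)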

— Setting up the Storage Context

from llama_index.core import StorageContext

# Initialize the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

— Embedding Content

Next, we’ll embed the content using OpenAI’s embedding model.

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Embed sample text
embeddings = embed_model.get_text_embedding("OpenAI’s new embedding models are fantastic!")
print(len(embeddings))

This model converts text into a dense vector representation that will be used for similarity search later.

— Indexing the Documents

We’ll now index the documents using VectorStoreIndex to prepare them for retrieval.

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Create a sentence splitter for chunking text
parser = SentenceSplitter(chunk_size=768, chunk_overlap=56)

# Build the index with the OpenAI embedding model defined above
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    transformations=[parser],
    show_progress=True,
)
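
If the vectors were stored in a persistent Chroma collection, the index can later be rebuilt directly from the vector store, without re-reading or re-embedding the pages (a sketch, assuming the persistent setup shown earlier):

# Rebuild the index from an existing vector store, without re-embedding the documents
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)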

Querying the RAG System

Now that our content is indexed, we can query it using Retrieval-Augmented Generation. This process retrieves the most relevant vectors for a given query and then generates an answer using a language model.

— Define the Query

query = "What is Drupal?"

— Setup the Retriever and Query Engine

We will set up the retriever and the query engine using OpenAI as the language model.

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

retriever = VectorIndexRetriever(index=index, similarity_top_k=10, filters=None)
llm = OpenAI(model="gpt-4o")
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
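
Before generating an answer, it can be useful to inspect what the retriever actually returns for the query. A quick sketch to print the retrieved chunks with their similarity scores and source URLs:

# Inspect the retrieved chunks before generation
for node_with_score in retriever.retrieve(query):
    print(node_with_score.score, node_with_score.node.metadata.get('source'))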

— Customizing the Prompt

You can customize the prompt template used for generating answers:

from llama_index.core import PromptTemplate

new_prompt_template_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context and not prior knowledge, "
    "answer the query in less than 15 words.\n"
    "Query: {query_str}\n"
    "Answer: "
)

new_prompt_template = PromptTemplate(new_prompt_template_str)
query_engine.update_prompts({"response_synthesizer:text_qa_template": new_prompt_template})

— Run the Query

Finally, we run the query and print the response:

response = query_engine.query(query)
print(str(response))
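
Since each Document was created with a 'source' metadata field, the response object also exposes the chunks the answer was grounded on, which makes it easy to cite sources (a short sketch):

# List the source URLs and scores of the chunks used to ground the answer
for source_node in response.source_nodes:
    print(source_node.node.metadata.get('source'), source_node.score)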

Conclusion

In this tutorial, we’ve built a simple RAG system from scratch using LlamaIndex, OpenAI, and Chroma. This system retrieves relevant knowledge and uses a language model to generate answers based on the retrieved content. The possibilities for extending this are vast, from integrating more complex retrieval mechanisms to deploying the system in production environments.

Thanks for reading! You can follow me on LinkedIn or X. More about Eurelis on LinkedIn or X.
