Multistage RAG with LlamaIndex and Cohere Reranking: A Step-by-Step Guide

Michael Ryaboy
Published in KX Systems
5 min read · Apr 19, 2024

Retrieval Augmented Generation (RAG) is a powerful technique that allows language models to draw upon relevant information from external knowledge bases when generating responses. However, the effectiveness of RAG heavily relies on the quality of the retrieved results. In this article, we’ll explore an advanced multistage RAG architecture using LlamaIndex for indexing and retrieval and Cohere for semantic reranking. We’ll provide a detailed, step-by-step guide to implementing this architecture, complete with code snippets from the accompanying Colab notebook.

To provide the best possible context for our LLM, we need to surface the most relevant snippets we can. When the document store is large, it's difficult to retrieve highly relevant documents in a single step. To remedy this, we will retrieve in two stages: first searching individual sentences to find any documents relevant to our query vector, and then reranking the wider context in which each sentence was found. Luckily, the SentenceWindowNodeParser from LlamaIndex allows us to not only split our documents into sentences, but also attach some metadata: in this case, the three sentences on each side of the target sentence. This will come in handy in our reranking step!

Here’s a full image of our pipeline. Don’t be intimidated! We can achieve this with very little code:

Multistage RAG Pipeline

Step 1: Set Up the Environment

First, let’s install the necessary libraries:

!pip install cohere spacy llama-index kdbai_client llama-index-vector-stores-kdbai llama-index-embeddings-fastembed

Then, import the required modules:

from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core import Document, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from fastembed import TextEmbedding
import kdbai_client as kdbai
import pandas as pd
import cohere

Step 2: Data Preparation

We’ll be using the Paul Graham Essay Dataset as our knowledge corpus. Download the dataset:

!llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data
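
The CLI drops a rag_dataset.json file and a source_files/ folder into ./data. We then load the essays into LlamaIndex Document objects; this is the docs variable used in the parsing step below. The paths here assume that default layout:

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
docs = SimpleDirectoryReader(input_dir="./data/source_files").load_data()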

Step 3: KDB.AI Setup

First, sign up for KDB.AI. We’re using KDB.AI here due to its fast insertion speeds and support for metadata filtering. However, if you have only a few thousand documents, you might not need multistage retrieval or even a vector database — Cohere reranking on its own can be a perfectly reasonable solution.
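
As a point of comparison, a rerank-only setup for a small corpus is just a few lines. This is an illustrative sketch, not part of the notebook: documents here is simply a list of strings, and COHERE_API_KEY is your Cohere key.

co = cohere.Client(COHERE_API_KEY)
results = co.rerank(
    model="rerank-english-v3.0",
    query="How do you decide what to work on?",
    documents=documents,
    top_n=5,
)
for hit in results.results:
    # Each result carries the index of the original document and a relevance score
    print(round(hit.relevance_score, 3), documents[hit.index][:80])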

Grab your endpoint and API key from the KDB.AI cloud console:

KDB.AI Cloud Console

Create a KDB.AI session and table to store the embeddings:

session = kdbai.Session(endpoint=KDBAI_ENDPOINT, api_key=KDBAI_API_KEY)

if KDBAI_TABLE_NAME in session.list():
    session.table(KDBAI_TABLE_NAME).drop()

schema = dict(
    columns=[
        dict(name="document_id", pytype="bytes"),
        dict(name="text", pytype="bytes"),
        dict(
            name="embedding",
            vectorIndex=dict(type="flat", metric="L2", dims=384),
        ),
    ]
)

table = session.create_table(KDBAI_TABLE_NAME, schema)

Step 4: Parsing Documents into Sentences

The core idea behind our multistage RAG approach is to index and retrieve at the granularity of individual sentences, while providing the language model with a broader sentence window as context for generation.

We use LlamaIndex’s SentenceWindowNodeParser to parse documents into individual sentence nodes, while preserving metadata about the surrounding sentence window.

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(docs)
parsed_nodes = [node.to_dict() for node in nodes]

Here, we use a window_size of 3, meaning for each sentence, we keep the 3 sentences before and 3 sentences after as its "window". This window is stored in the node metadata.
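
To see what the parser produced, you can peek at one node's metadata:

example = nodes[0]
print(example.metadata["original_text"])  # the single sentence that will be embedded
print(example.metadata["window"])         # that sentence plus up to 3 neighbors on each side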

Here is an example pipeline for Sentence Window Retrieval without reranking:

It’s worth noting that sentence window parsing is just one type of small-to-big retrieval. Another approach is to use smaller chunks referring to bigger parent chunks. This strategy isn’t included in this notebook, but here is a diagram of chunking based small-to-big retrieval:

Chunking Based Small-to-Big Retrieval Pipeline
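
For completeness, a chunk-based small-to-big setup could be sketched with LlamaIndex's HierarchicalNodeParser. This is a hedged illustration rather than code from the notebook, and the chunk sizes are arbitrary:

from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

hier_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
all_nodes = hier_parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(all_nodes)  # small chunks to embed; parents stay reachable via node relationships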

Step 5: Indexing and Storing Embeddings

Next, we generate embeddings for each sentence node using FastEmbed and store them in our KDB.AI table. FastEmbed is a fast, lightweight library for generating embeddings that supports many popular text models. The default model is the 384-dimensional Flag Embedding (BGE) model, which matches the dims=384 in our table schema, but you can swap in any other supported model.
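
FastEmbed also lets you pin the model explicitly instead of relying on the default; the name below is just that default spelled out, shown here as a minimal illustration:

# BAAI/bge-small-en-v1.5 is FastEmbed's default: 384-dimensional vectors, matching the table schema
embedding_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")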

# sentence_parentId is a list of (sentence, parent document id) pairs built from the
# parsed nodes, pairing each sentence with the id of the document it came from.
embedding_model = TextEmbedding()

parent_ids = []
sentences = []
for sentence, parent_id in sentence_parentId:
    parent_ids.append(parent_id)
    sentences.append(sentence)

# Embed every sentence in a single batch
embeddings = list(embedding_model.embed(sentences))

records_to_insert_with_embeddings = pd.DataFrame({
    "document_id": parent_ids,
    "text": sentences,
    "embedding": embeddings,
})

table = session.table(KDBAI_TABLE_NAME)
table.insert(records_to_insert_with_embeddings)

Step 6: Querying and Reranking

With our knowledge indexed, we can now query it with natural language questions. The retrieval process has two stages:

  1. Initial sentence retrieval
  2. Reranking based on sentence windows

For the initial retrieval, we generate an embedding for the query and use it to retrieve the 1,500 most similar sentences from the vector database. The number 1,500 is somewhat arbitrary, but it pays to cast a wide net: you don't want to miss any sentence whose window might be relevant.

query = "How do you decide what to work on?"
# fastembed is a FastEmbedEmbedding instance, e.g. FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
embeddings = fastembed.get_text_embedding(query)
search_results = session.table(KDBAI_TABLE_NAME).search([embeddings], n=1500)
search_results_df = search_results[0]  # search returns one result DataFrame per query vector

Performing this first-pass retrieval at the sentence level ensures we don’t miss any potentially relevant windows.

The second stage is where the magic happens. We take the unique sentence windows from the initial retrieval results and rerank them using Cohere’s powerful reranking model. By considering the entire window, the reranker can better assess the relevance to the query in context.

# parentid_parentTexts maps each document_id to the full sentence-window text for
# that parent (built from the parsed nodes). co is the Cohere client:
co = cohere.Client(COHERE_API_KEY)

unique_parent_ids = search_results_df["document_id"].unique()
texts_to_rerank = [
    parentid_parentTexts[id] for id in unique_parent_ids
    if id in parentid_parentTexts
]

reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=texts_to_rerank,
    top_n=len(texts_to_rerank),
)

After reranking, the top sentence windows provide high-quality, contextually relevant information to be used for generating the final response.
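
The article stops at retrieval, but wiring the reranked windows into a generation call is straightforward. Here is a minimal sketch, assuming you keep using the Cohere client (co) and the texts_to_rerank list from above; any chat model would work in its place:

# Take the five best windows after reranking and stitch them into a grounded prompt
top_windows = [texts_to_rerank[r.index] for r in reranked.results[:5]]
context = "\n\n".join(top_windows)

response = co.chat(
    model="command-r",
    message=(
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    ),
)
print(response.text)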

This multistage approach offers several key advantages:

  1. Indexing and initial retrieval at the sentence level are fast and memory-efficient.
  2. The initial sentence retrieval stage is highly scalable and can support very large knowledge bases.
  3. Reranking based on sentence windows allows incorporating broader context without sacrificing the specificity of the initial retrieval.
  4. Using an external reranking model allows leveraging a larger, more powerful model for assessing relevance, while keeping the main generative model lightweight.
  5. Providing sentence windows as context to the generative model strikes a balance between specificity and sufficient context.

Multistage RAG with LlamaIndex and Cohere showcases the power of thoughtful retrieval architectures for knowledge-intensive language tasks. By indexing at a granular sentence level, performing efficient initial retrieval, and reranking with a powerful model, we can provide high-quality, contextually relevant information to generative language models — enabling them to engage in grounded, information-rich conversations without sacrificing specificity or efficiency.

To learn more about optimizing RAG for production and making the most of vector databases, check out the KDB.AI Learning Hub, which is chock-full of useful resources.

I also encourage you to experiment with this approach on your own datasets and knowledge domains. The full code is available in the accompanying Colab notebook below.

https://colab.research.google.com/drive/1r-4g-r9JphE6qEKX4Vap-DupZ3AWH2nW?usp=sharing
