Enhancing Retrieval Augmented Generation with ReRanker and UMAP Visualization: Llama_Index and Llama 2

Sandeep Shah
10 min read · Mar 21, 2024


Hello folks!

Continuing the journey with RAG (Retrieval Augmented Generation), I explore UMAP for visualization and reranking techniques. Specifically, I look at one way of reranking: using the LLM itself to re-rank the retrieved documents and surface more relevant chunks than plain vector search gives us. As always, I will also discuss the challenges I faced. I will be using llama-index and Llama 2-chat 7B.

This is not original work; it is more about understanding the implementation of currently available techniques and seeing their pros and cons for oneself. Some of the references/sources:
https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/3/pitfalls-of-retrieval---when-simple-vector-search-fails
https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/LLMReranker-Gatsby.html

Related posts that I have written —

Exploring RAG Implementation with Metadata Filters — llama_Index
Langchain agents and function calling using Llama 2 locally
PandasDataFrame and LLMs: Navigating the Data Exploration Odyssey
Advance RAG — Query Augmentation using Llama 2 and LlamaIndex
Starting to Learn Agentic RAG

Topics —

  1. Using UMAP to visualize the embeddings
  2. Re-ranking with LLMRerank
    a. Creating a prompt for re-ranking
    b. Parsing the output of the re-ranker

Broadly, this is what I will be covering.

UMAP
As per a Google search:
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

There is a Python package that converts embedding vectors into 2D UMAP coordinates, which can then be plotted. It works much like a MinMax scaler: we first fit the transform on all the available nodes/chunks, and then apply the fitted transform to the query and/or retrieved chunks to see how they relate to the actual nodes. You can find complete installation instructions in the code file shared (link at the end). Remember, UMAP is just one way to visualize our embedding space: embedding vectors typically have 300 to 700 dimensions, and we are condensing them into 2D, so we need to be careful when interpreting the plots.
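One small note before the code: the UMAP package is published on PyPI under a different name from the module you import, so a typical install (assuming pip) looks like this:

# UMAP is imported as `umap` but installed as `umap-learn`
pip install umap-learn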

# Imports (module paths follow the llama-index 0.9.x API used in this post;
# they may differ on newer releases)
from pathlib import Path

import numpy as np
import umap
import matplotlib.pyplot as plt
from tqdm import tqdm

from llama_index import Document, ServiceContext, VectorStoreIndex
from llama_index.embeddings import resolve_embed_model
from llama_index.node_parser import SimpleNodeParser
from llama_index.readers.file.docs_reader import PDFReader

# `llm` is the local Llama 2-chat 7B model, set up elsewhere in the notebook
# (see the code link at the end)

# Read document(s)
loader = PDFReader()
docs0 = loader.load_data(file=Path(r"all_post_in_one.pdf"))

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]

# Create nodes from the loaded documents
node_parser = SimpleNodeParser.from_defaults(chunk_size=600, chunk_overlap=100)
base_nodes = node_parser.get_nodes_from_documents(docs)

# Set the service context and select the embedding model
embed_model = resolve_embed_model("local:BAAI/bge-small-en")
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model
)

# Create embeddings for each node and store them in `embeddings`
base_index = VectorStoreIndex(base_nodes, service_context=service_context)
base_retriever = base_index.as_retriever(similarity_top_k=5)
base_node_text = [base_nodes[i].text for i in range(len(base_nodes))]

embeddings = embed_model.get_text_embedding_batch(base_node_text)

# Finally, fit a UMAP transform on the node embeddings
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)

# Define a function to project any set of embeddings into the fitted 2D UMAP space
def project_embeddings(embeddings, umap_transform):
    umap_embeddings = np.empty((len(embeddings), 2))
    for i, embedding in enumerate(tqdm(embeddings)):
        umap_embeddings[i] = umap_transform.transform([embedding])
    return umap_embeddings

# Project the embeddings of all the chunks into 2D
projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)

# Note - you can save the projected embeddings to a file so that next time you can load them directly
# instead of recomputing them every time.

# Plot the embeddings
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.gca().set_aspect('equal', 'datalim')

Plotting the Query and Retrieved Chunks

# Sample query and its embedding
query = 'tell me about hobbies of author'
query_embedding = embed_model.get_text_embedding_batch([query])
projected_original_query_embedding = project_embeddings(query_embedding, umap_transform)

# Use the retriever and get embeddings for the retrieved chunks
retrievals = base_retriever.retrieve(query)
retrieved_text = [retrievals[i].text for i in range(len(retrievals))]

retrieve_embedding = embed_model.get_text_embedding_batch(retrieved_text)
projected_retrieved_embeddings = project_embeddings(retrieve_embedding, umap_transform)


# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{query}')

# Grey are base nodes
# Red is the query
# Green are the retrieved chunks
[Figure: base nodes (grey), query (red), and retrieved chunks (green) in the 2D UMAP projection]

ReRanker

It is hard to decide where to begin, so below is the syntax of the LLMRerank we will be utilizing. When we use the regular retriever, we typically retrieve the top few chunks, say the top 5 or 7. With the reranker, we still run the usual retriever first, but we retrieve a larger number of chunks, perhaps 20 or 25. We then apply LLMRerank to these chunks to obtain a relevance score for each. This scoring is done by an LLM, and the resulting relevance scores are used to select the top-ranked chunks.

In essence, we are doing a two-step retrieval. First, we use a vector-based method and retrieve a large number of chunks. Then, we use an LLM to identify the relevant ones. This means extra LLM calls, so we need to be cautious about the time and cost implications. The LLM has a fixed context limit, which is where the first parameter, choice_batch_size, comes into play: it determines how many documents are scored together in a single LLM call. The parameter top_n is the final number of chunks we want from the initial set of, say, 20 or 25.
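As a rough illustration (plain Python, not LlamaIndex code), the number of extra LLM calls per query depends on how many candidates you re-score and how many fit in one batch:

# Illustration only: extra LLM calls made by one rerank pass
import math

def rerank_llm_calls(num_candidates: int, choice_batch_size: int) -> int:
    # each batch of candidate chunks is scored in a single LLM call
    return math.ceil(num_candidates / choice_batch_size)

print(rerank_llm_calls(20, 5))  # 4 scoring calls, on top of the final answer call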

from llama_index.postprocessor import LLMRerank

# custom_parse_choice_select_answer_fn is defined further below;
# define it before running this cell.
A = LLMRerank(
    service_context=service_context,  # so the local Llama 2 is used for scoring
    choice_batch_size=5,
    top_n=5,
    parse_choice_select_answer_fn=custom_parse_choice_select_answer_fn,
)

First, we need the relevance scores in a particular format, so we have to prompt the LLM in a particular way. A default prompt is available, but I had to modify it. I assume the default prompt works fine with OpenAI, but I had to change it for my local implementation. The default prompt can be modified as below:

# change the default prompt of LLMReranker

A.choice_select_prompt.template = '''[INST]<<SYS>>
You are a helpful assistant who rates the relevancy of the given context based on the query asked.
Give a relevance score from 1 to 10, with 10 being very relevant. No need to answer the question itself.
If a particular document answers the query then it has higher relevance, otherwise it has lower relevance.
Just tell me the relevance score. ORDER the documents from most relevant to least relevant.<</SYS>>
Sample question and answer format -

Doc 1:<context 1>
Doc 2:<context 2>
Question: <question>
Answer:
Doc: 2, Relevance: 9/10
Doc: 3, Relevance: 7/10
Doc: 4, Relevance: 4/10
Doc: 1, Relevance: 3/10

STRICTLY follow the above format and order the documents in descending order of relevance,
i.e. answer as -
Document: <number>, Relevance: <score>

Remember to answer in the above format only - Document number and relevance score. Nothing else.

Output only the Document number and Relevance score as shown above.

--------------------------------

Let's try this now:

{context_str}

Below is the question. Give a relevance score for each of the documents based on the question.

Question: {query_str}
Answer:
[/INST]'''
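If you are curious what the default prompt looks like before replacing it, you can print it from the same attribute used in the assignment above:

# Inspect the default choice-select prompt before overwriting it
print(A.choice_select_prompt.template)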

Also, the output of the reranker's LLM calls needs to be in a particular format; for me even that was not correct, so I kept getting errors. For now I use regex to capture most of the unique cases I was getting. We need to extract the relevance scores along with their corresponding document numbers. Below is the function I use to overwrite the default output parser.

import re

def custom_parse_choice_select_answer_fn(answer: str, num_choices: int, raise_error=False):
    # Split the answer into lines
    answer_lines = answer.split("\n")

    # Regex patterns for "Document <n>" and "<score>/10"
    doc_pattern = r'Document (\d+)'
    relevance_pattern = r'(\d+)/10'

    document_numbers = []
    relevance_scores = []

    for line in answer_lines:
        doc_match = re.search(doc_pattern, line)
        if doc_match:
            document_number = int(doc_match.group(1))
            document_numbers.append(document_number)

            relevance_match = re.search(relevance_pattern, line)
            if relevance_match:
                relevance_score = int(relevance_match.group(1))
                relevance_scores.append(relevance_score)
            else:
                # No score found on this line; keep the two lists aligned
                relevance_scores.append(0)
            # print('\n--------------------')
            # print(line)

    print('\n--------------------')

    answer_nums = document_numbers
    answer_relevances = relevance_scores

    # Sort answer_relevances in descending order and get the corresponding sorted indices
    sorted_indices = sorted(range(len(answer_relevances)), key=lambda i: answer_relevances[i], reverse=True)

    # Sort answer_nums based on the sorted indices of answer_relevances
    answer_nums = [answer_nums[i] for i in sorted_indices]

    # Sort answer_relevances in descending order
    answer_relevances = sorted(answer_relevances, reverse=True)

    print(answer)
    print(answer_nums)
    print(answer_relevances)

    return answer_nums, answer_relevances
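As a quick sanity check (the answer text below is made up), you can feed the parser a sample response and confirm that it returns the document numbers sorted by score:

# Hypothetical LLM output, just to exercise the parser
sample_answer = (
    "Document 2: Relevance score = 9/10\n"
    "Document 1: Relevance score = 4/10\n"
    "Some extra chatter the model added anyway."
)
nums, scores = custom_parse_choice_select_answer_fn(sample_answer, num_choices=2)
# nums -> [2, 1], scores -> [9, 4]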

Putting it together: one way to do so is shown below.

from llama_index.schema import QueryBundle

# Retrieve a larger candidate set, then let the reranker pick the top 5
base_retriever_2 = base_index.as_retriever(similarity_top_k=20)

query_bundle = QueryBundle(query)

retrievals_rerank = base_retriever_2.retrieve(query_bundle)

reranker = A

retrievals_rerank = reranker.postprocess_nodes(retrievals_rerank, query_bundle)

retrieved_text_rerank = [retrievals_rerank[i].text for i in range(len(retrievals_rerank))]
retrieve_embedding_rerank = embed_model.get_text_embedding_batch(retrieved_text_rerank)
projected_retrieved_embeddings_rerank = project_embeddings(retrieve_embedding_rerank, umap_transform)


# Sample Output -

--------------------
Document Numbers: [1, 2, 3, 4, 5, 3]
Relevance Scores: [9, 7, 6, 4, 8, 0]
I can certainly help you with that! Based on the information provided in the documents, here are the relevance scores for each document:
Document 1: Relevance score = 9/10
The document mentions the author's experience with running and their goal of completing a full marathon within three hours. This is directly related to the question asked, making it highly relevant.
Document 2: Relevance score = 7/10
The document mentions the author's experience with running and their goal of completing a full marathon within three hours. However, it also includes information about the author's personal life and interests, which is less relevant to the question asked.
Document 3: Relevance score = 6/10
The document mentions the author's experience with running and their goal of completing a full marathon within three hours. However, it also includes information about the author's personal life and interests, which is less relevant to the question asked.
Document 4: Relevance score = 4/10
The document does not mention anything related to the author's running experience or full marathon time, making it less relevant to the question asked.
Document 5: Relevance score = 8/10
The document mentions the author's experience with running and their goal of completing a full marathon within three hours. It also includes information about the author's personal life and interests, which is less relevant to the question asked.
Overall, Documents 1 and 5 are the most relevant to the question asked, followed by Document 3.
[1, 5, 2, 3, 4, 3]
[9, 8, 7, 6, 4, 0]


Note: document number 3 is repeated in the reranked output above, so there is scope for improvement in the prompt and the parsing.
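One possible fix (a sketch, not part of the original notebook) is to de-duplicate the sorted document numbers, keeping only the first, and therefore highest-scoring, occurrence of each:

# Sketch: drop repeated document numbers after sorting, keeping the best score
def dedupe_ranked(nums, scores):
    seen, out_nums, out_scores = set(), [], []
    for n, s in zip(nums, scores):
        if n not in seen:
            seen.add(n)
            out_nums.append(n)
            out_scores.append(s)
    return out_nums, out_scores

print(dedupe_ranked([1, 5, 2, 3, 4, 3], [9, 8, 7, 6, 4, 0]))
# -> ([1, 5, 2, 3, 4], [9, 8, 7, 6, 4])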
# Plot the projected query and retrieved documents in the embedding space
plt.figure(figsize=(12, 6))

# Subplot for the original query and retrieved embeddings
plt.subplot(1, 2, 1)
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.gca().set_aspect('equal', 'datalim')
plt.title('Original Query and Retrieved Embeddings')

# Subplot for the augmented query embeddings
plt.subplot(1, 2, 2)
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings_rerank[:, 0], projected_retrieved_embeddings_rerank[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.gca().set_aspect('equal', 'datalim')
plt.title('Reranked-Original Query and Retrieved Embeddings')

plt.tight_layout()
plt.show()

On the left is the basic retrieval, while on the right is the ReRanker, and you can see that we get a different set of chunks as output. Based on LLM accuracy and prompting, the reranked set of chunks should be superior. However, what I observed was that the final response was not consistently better with reranking; this may be because my queries had data concentrated in one region. Ideally, we should have a diverse set of documents and queries that require chunks from various parts of the document.

You are welcome to take the code and experiment with different queries and prompts. Additionally, in the notebook, I have included some more methods. I will now demonstrate how to use the ReRanker with our query and then finally obtain a response. Also, I have modified the default prompt of our LLM and passed it to the query engine. Since we are using the same LLM for reranking and final answer generation, I have a separate prompt for the ReRanker, as shown above. Below, I will illustrate how we can pass a modified prompt to the query engine, which will be used to generate the final answer.

# Modify the default prompt to suit the Llama 2 template
from llama_index.prompts import PromptTemplate

new_summary_tmpl_str = (
'''[INST]<<SYS>> You are given a bunch of context text and asked a query.\
Answer the query based on the documents and do not bring in outside knowledge.\
Be precise, concise and crisp.<</SYS>>\n
Context information from multiple sources is below.\n
---------------------\n
{context_str}\n
---------------------\n
Given the information from multiple sources and not prior knowledge, answer the query.\n
Query: {query_str}\n
Answer: \n
[/INST]'''
)
new_summary_tmpl = PromptTemplate(new_summary_tmpl_str)
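To find the exact prompt key to overwrite (here "response_synthesizer:summary_template"), you can list the prompts a query engine exposes. A small sketch, assuming the same base_index as above:

# List the prompt keys available on a query engine
qe = base_index.as_query_engine(response_mode="tree_summarize")
for key in qe.get_prompts():
    print(key)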



query = 'Which all cities has the author travel too? Fetch all the city names mentioned.'


# Non-reranker answer and embeddings
query_engine = base_index.as_query_engine(
    similarity_top_k=5,
    response_mode="tree_summarize",
)

query_engine.update_prompts(
    {"response_synthesizer:summary_template": new_summary_tmpl}
)

response = query_engine.query(query)

retrievals = response.source_nodes
retrieved_text = [retrievals[i].text for i in range(len(retrievals))]
retrieve_embedding = embed_model.get_text_embedding_batch(retrieved_text)
projected_retrieved_embeddings = project_embeddings(retrieve_embedding, umap_transform)

# Reranker-based response
query_engine = base_index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[A],  # A is our reranker defined above
    response_mode="tree_summarize",
)

query_engine.update_prompts(
    {"response_synthesizer:summary_template": new_summary_tmpl}
)

response_reranked = query_engine.query(query)

retrievals_rerank = response_reranked.source_nodes
retrieved_text_rerank = [retrievals_rerank[i].text for i in range(len(retrievals_rerank))]
retrieve_embedding_rerank = embed_model.get_text_embedding_batch(retrieved_text_rerank)
projected_retrieved_embeddings_rerank = project_embeddings(retrieve_embedding_rerank, umap_transform)


# Plotting all together
# Plot the projected query and retrieved documents in the embedding space
plt.figure(figsize=(12, 6))

# Subplot for the original query and retrieved embeddings
plt.subplot(1, 2, 1)
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.gca().set_aspect('equal', 'datalim')
plt.title('Original Query and Retrieved Embeddings')

# Subplot for the augmented query embeddings
plt.subplot(1, 2, 2)
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings_rerank[:, 0], projected_retrieved_embeddings_rerank[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.gca().set_aspect('equal', 'datalim')
plt.title('Reranked-Original Query and Retrieved Embeddings')

plt.tight_layout()
plt.show()
# OUTPUT

tell me about hobbies of author
------------------------
NON RERANKER RESPONSE
Based on the provided text, the author has traveled to the following cities:

1. Mumbai
2. Pondicherry
3. Chennai
4. Lonavala
5. Mumbai (twice)
6. Bombay (twice)
7. Pune

Note: The author has mentioned these cities in their text, but it is not clear if they have traveled to all of these cities personally or if they are just mentioning them in their text.


------------------------
RERANKER RESPONSE
Based on the provided text, the author has traveled to the following cities:

1. Bangalore
2. Chennai
3. Mumbai
4. Pondicherry
5. Mahindra World City
6. Zuca (mentioned as a chocolate shop)
7. Promenade
8. Mahabalipura (mentioned as a lunch spot)

These cities are mentioned in the text as places the author has visited or driven through during their trips.

For some queries I get better results with the reranker, and for some the results are worse :D

I will close now with some remarks:
1. Prompting is key, again.
2. We can use UMAP to get some sense of our data and retrieval.
3. We need to combine this with, say, small-to-big retrieval, sentence-window parsing, and many other techniques.
Watch this space for more code blocks.
Code — https://github.com/SandyShah/llama_index_experiments
