Improving Retrieval Augmented Generation (RAG) Performance through Hybrid Search and Reranking Techniques: Mistral-7B in Colab

Jane Yang
7 min read · Apr 3, 2024


Hybrid Search RAG Pipeline.

Recently, there has been a surge of interest in Retrieval Augmented Generation (RAG) because it has been shown to boost the capabilities of large language models (LLMs) in specialized fields. Essentially, RAG supplements LLMs with extra knowledge from external sources, opening up exciting possibilities for more accurate and dependable outcomes in domain-specific tasks. Whether driven by cost-efficiency or privacy concerns, more individuals and companies are opting for open-source LLMs like Mistral and Llama-2.

In this article, I’ll walk you through building an advanced RAG pipeline using Mistral-7B in a Colab environment, offering a step-by-step guide to implementation.

Install packages

!pip install -q langchain pypdf langchain-community \
sentence_transformers chromadb transformers bitsandbytes accelerate
!pip install rank_bm25
!pip install --upgrade --quiet cohere
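The embedding model below is loaded on "cuda" and the 4-bit quantization relies on bitsandbytes, both of which assume a GPU runtime. A quick sanity check before running anything heavy:

import torch

# the pipeline assumes a GPU runtime (Runtime -> Change runtime type -> GPU)
print(torch.cuda.is_available(),
      torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU")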

Document Retrieval

Here we build the RAG pipeline with LangChain, using Chroma as the vector database.

Semantic Search

Semantic search retrieves relevant information by comprehending the intent and contextual meaning of a query, rather than relying on keyword matching alone.

We’ll take Andrew Ng’s renowned machine learning course (the first 3 lectures) as a case study, ask a simple question: “What are major topics for this class?”, and evaluate the performance of our RAG pipeline.

Our retrieval process involves three key steps:

  1. Segmenting the documents into smaller chunks.
  2. Embedding these chunks and preparing them for retrieval.
  3. Storing the chunks in a vector database and employing a cosine similarity approach to retrieve the top k most relevant chunks, while using Maximum Marginal Relevance (MMR) to mitigate redundancy.
import os
import json
from os import walk
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from google.colab import userdata

def docs_retriever(question, k=5, top_n=5):
    """
    Load original documents and retrieve relevant documents.
    top_n is only used by the (commented-out) rerank step below.
    """

    # load documents
    file_path = "./data/"
    f = []
    for (dirpath, dirnames, filenames) in walk(file_path):
        f.extend(filenames)
        break
    loaders = []
    for filename in f:
        loaders.append(PyPDFLoader(file_path + filename))
    docs = []
    for loader in loaders:
        docs.extend(loader.load())

    # semantic search
    # split loaded docs into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100
    )
    splits = text_splitter.split_documents(docs)

    # embed chunks
    # the embedding model can be swapped for another one,
    # e.g., 'sentence-transformers/all-mpnet-base-v2'
    embeddings = HuggingFaceEmbeddings(
        model_name="thenlper/gte-large",
        model_kwargs={"device": "cuda"},
        encode_kwargs={"normalize_embeddings": True},
    )

    # create vector database for retrieval
    vectordb = Chroma.from_documents(
        documents=splits,
        embedding=embeddings)
    # search for top k chunks, with maximum marginal relevance as the search type
    sm_retriever = vectordb.as_retriever(search_kwargs={"k": k},
                                         search_type="mmr")

    # # keyword search
    # keyword_retriever = BM25Retriever.from_documents(docs,
    #                                                  search_kwargs={"k": k})

    # # ensemble semantic search and keyword search
    # ensemble_retriever = EnsembleRetriever(retrievers=[keyword_retriever,
    #                                        sm_retriever], weights=[0.5, 0.5])

    # # cohere rerank
    # compressor = CohereRerank(cohere_api_key=userdata.get('cohere_api_key'),
    #                           top_n=top_n)
    # compression_retriever = ContextualCompressionRetriever(
    #     base_compressor=compressor, base_retriever=ensemble_retriever
    # )

    return sm_retriever
    # return compression_retriever

Generate Response

Upon retrieving the documents, we add a prompt instructing the model to act as an efficient question-answering assistant. We then feed this prompt, together with the retrieved documents, to a quantized LLM (Mistral-7B) to generate an answer to the question.

Load the quantized Mistral-7B model

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
import transformers
from langchain import HuggingFacePipeline

def load_llm():
    """
    Load Mistral-7B and quantize the model
    """

    # set model name
    MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

    # set quantization parameters: quantize the model in 4-bit NormalFloat (NF4) format
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
    )

    # tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16,
        trust_remote_code=True,
        device_map="auto",
        quantization_config=quantization_config
    )

    # set the generation_config.
    # For better answer generation, set the context_length to 4096 (Mistral's maximum is 8192)
    generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
    generation_config.max_new_tokens = 4096
    generation_config.context_length = 4096
    generation_config.temperature = 0.0001
    generation_config.top_p = 0.95
    generation_config.do_sample = True
    generation_config.repetition_penalty = 1.15

    # build the pipeline
    pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        return_full_text=True,
        generation_config=generation_config,
    )

    llm = HuggingFacePipeline(
        pipeline=pipeline,
    )
    return llm

Generate Answer

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

def get_result(question, k=5, top_n=5):
    """
    Get answer and relevant documents
    """

    # load documents retriever
    retriever = docs_retriever(question, k=k, top_n=top_n)

    # Build prompt
    template = """
    [INST] <>
    Act as a helpful question answer assistant. Use the following information to answer the question at the end.
    <>

    {context}

    {question} [/INST]
    """
    QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

    # load llm
    llm = load_llm()

    # build question answer chain
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=retriever,
        chain_type="stuff",
        return_source_documents=True,
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
    )

    # get answer and relevant documents
    result = qa_chain(question)
    answer = result["result"].strip()
    rel_docs = result["source_documents"]

    return answer, rel_docs

# ask a question
question = "What are major topics for this class?"
ans, rel_docs = get_result(question, k=10)

Here is the answer to the question:

This class covers various topics related to Machine Learning. Some of the major topics include:
1. Linear Regression
2. Gradient Descent
3. Normal Equations
4. Convex Optimization
5. Hidden Markov Models
6. Extensions to the materials covered in the main lectures
7. Projects on various applications of Machine Learning such as musical instrument detection, irony sequence alignment, understanding the brain using neuroscience, market making, financial trading, etc.

Regarding the student's question, "Are these group projects?" - Yes, the instructor encourages forming study groups and working together on projects. However, the size of the group is not specified.

Looks good, but it doesn’t quite get to the point. Upon a thorough review of lecture 1, the answer is indeed present in the lecture; however, it is scattered across several pages. The summarized answer is provided below:

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Learning Theory
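Whenever the answer looks off, it helps to examine the chunks the chain actually used, which are returned in rel_docs. A minimal sketch, assuming each chunk keeps the source and page metadata that PyPDFLoader attaches:

# quick look at where the retrieved chunks came from
for i, doc in enumerate(rel_docs):
    meta = doc.metadata
    print(f"[{i}] {meta.get('source')} (page {meta.get('page')})")
    print(doc.page_content[:200], "...\n")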

After closely examining the relevant documents (rel_docs), it appears that the necessary context was not retrieved. Let’s revisit the retrieval process and work on enhancing it.

Enhance Answer with Relevant Documents Reranking

We re-rank the relevant documents with Cohere’s rerank method, which assigns a relevance score to each chunk; the documents are then sorted by these scores in descending order.

# Uncomment the rerank code below in the docs_retriever function above
# cohere rerank
compressor = CohereRerank(cohere_api_key=userdata.get('cohere_api_key'),
                          top_n=top_n)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=sm_retriever
)

return compression_retriever
question = "What are major topics for this class?"
ans2, rel_docs2 = get_result(question, k=20, top_n=10)

1. Supervised Learning: This includes understanding various types of supervised learning algorithms such as regression, decision trees, and neural networks.
2. Learning Theory: This topic covers the theoretical foundations of machine learning, including concepts like bias-variance tradeoff, overfitting, and underfitting.
3. Unsupervised Learning: This involves exploring techniques for finding patterns and relationships in data without labeled examples, such as clustering and dimensionality reduction.
4. Deep Learning: This advanced topic focuses on deep neural networks, which are a subset of artificial neural networks with multiple hidden layers.
5. Reinforcement Learning: This topic deals with training agents to perform tasks based on rewards and punishments.
6. Support Vector Machines (SVM): SVM is a popular supervised learning algorithm used for classification and regression analysis.
7. Convex Optimization: This topic covers optimization techniques used in machine learning, particularly in solving large-scale optimization problems.
8. Hidden Markov Models: These are statistical models used for modeling sequential data, where the underlying state transitions are unknown.

These topics cover both the fundamental concepts and advanced techniques in machine learning.
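To see how strongly each returned chunk actually matches the question, we can look at the reranker’s scores on the source documents. A minimal sketch, assuming (as in current LangChain versions) that CohereRerank stores its score under the relevance_score metadata key:

# inspect the reranker's scores attached to the returned chunks
for doc in rel_docs2:
    score = doc.metadata.get("relevance_score")
    snippet = doc.page_content[:80].replace("\n", " ")
    print(f"{score:.3f}" if score is not None else "n/a", snippet)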

Much better this time, yet the top-ranked documents still miss the most relevant passages, and reinforcement learning hasn’t been addressed. Let’s continue refining.

Enhance Answer with Hybrid Search

Let’s ensemble semantic search with classical keyword search, then re-rank the relevant documents.
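Under the hood, LangChain’s EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF). The toy sketch below only illustrates the idea; the rrf helper, the document IDs, and the constant c=60 are illustrative, not the library’s actual implementation:

def rrf(rankings, weights, c=60):
    # weighted Reciprocal Rank Fusion over ranked lists of document IDs
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # BM25 ranking (illustrative)
semantic_hits = ["doc1", "doc5", "doc3"]  # vector-search ranking (illustrative)
print(rrf([keyword_hits, semantic_hits], weights=[0.5, 0.5]))
# documents ranked highly by both retrievers rise to the top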

# Uncomment the code below in the docs_retriever function above
# keyword search
keyword_retriever = BM25Retriever.from_documents(docs,
                                                 search_kwargs={"k": k})
keyword_docs = keyword_retriever.get_relevant_documents(question)
# ensemble semantic search and keyword search
ensemble_retriever = EnsembleRetriever(retrievers=[keyword_retriever,
                                       sm_retriever], weights=[0.5, 0.5])
# cohere rerank
compressor = CohereRerank(cohere_api_key=userdata.get('cohere_api_key'),
                          top_n=top_n)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=ensemble_retriever
)

return compression_retriever

question = "What are major topics for this class?"
ans3, rel_docs3 = get_result(question, k=10, top_n=10)

1. Supervised Learning: This topic covers various types of supervised learning algorithms, including linear regression, logistic regression, and support vector machines.
2. Unsupervised Learning: Techniques for finding patterns and relationships within data without labeled examples, such as clustering and dimensionality reduction.
3. Reinforcement Learning: Algorithms for making sequences of decisions based on maximizing long-term rewards.
4. Learning Theory: Understanding the theoretical foundations of machine learning, including concepts like bias-variance tradeoff, overfitting, and underfitting.

Additionally, there will be discussions on extensions to the material covered in the main lectures, such as convex optimization, hidden Markov models, and other advanced topics.

Cool! However, the response contains information not present in the relevant documents. For example, the bias-variance tradeoff isn’t discussed in the retrieved documents.
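A simple way to spot this kind of hallucination is to check whether key phrases in the answer actually appear in the retrieved chunks. A rough sketch (the phrase list here is illustrative):

# crude grounding check: does each key phrase appear in any retrieved chunk?
phrases = ["bias-variance", "overfitting", "reinforcement learning"]
corpus = " ".join(doc.page_content.lower() for doc in rel_docs3)
for phrase in phrases:
    print(phrase, "->", "found" if phrase in corpus else "NOT in retrieved docs")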

Enhance Answer with Citations

Let’s revise one more component and add citations to validate the answer. There are various ways to cite documents; here, we’ll use a modified prompt to request citations.

template = """
[INST]
Act as a question answer assistant. Use the document snippets to answer the question at the end. Think step by step. \
If there's not enough information in the context provided, just say that you don't know, DO NOT try to make up an answer.

Remember, you must return both an answer and citations. A citation consists of a VERBATIM and DETAILED quote. \
Return a citation for every quote across all documents \
that JUSTIFY the ANSWER. Use the following format for your final output:

<cited_answer>
<answer></answer>
<citations>
<citation><quote></quote></citation>
<citation><quote></quote></citation>
<citation><quote></quote></citation>
...
</citations>
</cited_answer>

Here are the document snippets: {context}

{question} [/INST]
"""

The answer now is:


<cited_answer>
<answer>This class covers four major topics: supervised learning, learning theory, unsupervised learning, and reinforcement learning.</answer>
<citations>
<citation>"So let’s talk about the first of the four major topics in this class, which is supervised learning."</citation>
<citation>"The second of the four major topics in this class is learning theory."</citation>
<citation>"Last of the four major topics I wanna tell you about is reinforcement learning."</citation>
</citations>
</cited_answer>

Great, a concise answer, well supported by citations! One minor modification would be to ensure that the third citation directly references unsupervised learning, which is mentioned in the documents. The third citation could be revised as follows:

<citation><quote>So that was unsupervised learning, and then the last of the four major topics I wanna tell you about is reinforcement learning.</quote></citation>
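Since the model returns its citations in a lightweight XML-like wrapper, the answer and quotes can be pulled out programmatically. A minimal sketch using a regex; it assumes the model follows the tag format requested in the prompt (which is not guaranteed), and ans stands for the raw string returned by get_result with the citation prompt:

import re

def parse_cited_answer(text):
    # pull the answer and each verbatim quote out of the tagged output
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    quotes = re.findall(r"<quote>(.*?)</quote>", text, re.DOTALL)
    return (answer.group(1).strip() if answer else None,
            [q.strip() for q in quotes])

answer_text, citations = parse_cited_answer(ans)
print(answer_text)
for quote in citations:
    print("-", quote)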

In the following posts, we will return to the topic of document citations and explore more techniques.

Summary

In this article, we built a RAG pipeline using LangChain. We employed a hybrid search approach, combining keyword search and semantic search, to improve the retrieval of relevant documents. Subsequently, we utilized a lightweight LLM, specifically quantized Mistral-7B, to generate answers to questions based on the retrieved documents. We further enhanced the answers by incorporating supporting citations.
