RAG V: Fusion Retrieval
Fusion Retrieval in Document Search: Combining Semantic and Keyword-Based Search
This article is built on top of NirDiamant’s work in this GitHub repository.
In the evolving landscape of document search, striking a balance between understanding the meaning behind a query and matching exact keywords is key to delivering precise, relevant results. Fusion Retrieval is a cutting-edge approach that achieves this by blending vector-based semantic similarity search with traditional keyword-based BM25 retrieval. By fusing these two powerful methods, the system enhances document retrieval quality, making it more robust and versatile for a wide range of search scenarios.
Overview
Fusion Retrieval merges the strengths of both semantic and keyword-based search methods to improve the relevance of retrieved documents. This system processes documents into manageable chunks, creates vector representations for semantic similarity searches, and builds a BM25 index for exact keyword matching. Through a custom fusion function, the system combines the results from both methods to deliver more accurate and context-aware document retrieval.
Motivation
Traditional document retrieval systems typically use either:
- Semantic Search: Leveraging vector-based methods that understand conceptual relationships and meaning behind words.
- Keyword Search: Relying on BM25, a ranking function that matches exact keywords in the documents to those in the query.
Each method has its advantages, but also limitations. Vector-based methods excel in capturing conceptual similarity but may miss important keyword matches, while BM25 is great for exact keyword matches but struggles with nuanced or context-dependent queries. Fusion Retrieval aims to combine these two approaches, creating a more versatile and accurate retrieval system that handles a broader spectrum of queries.
Key Components
- PDF Processing and Text Chunking: The system begins by processing documents (such as PDFs) and splitting them into manageable text chunks for easier indexing and retrieval.
- Vector Store Creation: OpenAI embeddings are used to create vector representations of the text chunks, which are then stored in a FAISS vector store for efficient semantic similarity search.
- BM25 Index Creation: A BM25 index is built from the same text chunks to support keyword-based retrieval alongside the vector-based method.
- Fusion Retrieval Function: This function combines the results of both vector-based and BM25 searches, normalizing and weighting their scores to rank documents effectively.
Method Details
1. Document Preprocessing
The system first processes documents by splitting them into manageable text chunks. This step is handled by a RecursiveCharacterTextSplitter, which breaks the document into smaller, structured segments. A text-cleaning step is then applied to address formatting issues, such as replacing stray tab characters ('\t') with spaces.
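As a rough, illustrative sketch of this step (assuming a LangChain version where RecursiveCharacterTextSplitter can be imported from langchain.text_splitter, and using a made-up snippet of text), the chunking and tab-cleaning could look like this:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Made-up input text containing a stray tab character
raw_text = "Climate change refers to long-term shifts in temperatures\tand weather patterns."
cleaned_text = raw_text.replace("\t", " ")  # same idea as the replace_t_with_space helper used later

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # target number of characters per chunk
    chunk_overlap=200,    # overlap preserves context across chunk boundaries
    length_function=len,  # measure chunk size in characters
)
chunks = splitter.split_text(cleaned_text)
print(f"Produced {len(chunks)} chunk(s)")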
2. Vector Store Creation
Using OpenAI embeddings, the system generates vector representations for each text chunk. These embeddings capture the semantic meaning of the text, allowing the system to perform similarity searches based on conceptual understanding rather than exact keyword matches. The vectors are stored in a FAISS (Facebook AI Similarity Search) vector store for fast, scalable similarity search operations.
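A minimal sketch of this idea on a toy in-memory corpus is shown below; the exact import paths depend on the installed LangChain version, and an OpenAI API key is assumed to be set in the environment:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

toy_chunks = [
    "Rising sea levels threaten coastal cities.",
    "BM25 is a keyword-based ranking function.",
    "Greenhouse gases trap heat in the atmosphere.",
]

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY in the environment
toy_store = FAISS.from_texts(toy_chunks, embeddings)

# Semantic search matches on meaning, not exact wording
for doc in toy_store.similarity_search("effects of ocean warming", k=2):
    print(doc.page_content)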
3. BM25 Index Creation
Simultaneously, a BM25 index is created from the same text chunks. BM25 ranks documents based on how well the keywords in the query match the keywords in the document. This allows for precise keyword-based retrieval, providing a complement to the semantic vector search.
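To make the keyword side concrete, here is a small, self-contained sketch using the rank_bm25 library on a made-up corpus:
from rank_bm25 import BM25Okapi

corpus = [
    "Climate change increases the frequency of extreme weather events",
    "BM25 ranks documents by keyword overlap with the query",
    "Deforestation contributes to rising carbon dioxide levels",
]

# rank_bm25 expects pre-tokenized documents; whitespace splitting keeps the example simple
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

scores = bm25.get_scores("extreme weather events".split())
print(scores)  # one relevance score per document; the first document scores highest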
4. Fusion Retrieval Function
The core of the Fusion Retrieval system is its custom fusion_retrieval function, which combines the strengths of both methods:
- Dual Retrieval: The function takes a query and performs both vector-based (semantic) and BM25-based (keyword) retrieval.
- Score Normalization: Scores from both methods are normalized onto a common scale to ensure fair comparison.
- Weighted Scoring: A weighted combination of the two scores is computed using an alpha parameter, which controls the balance between semantic similarity and keyword relevance.
- Document Ranking: Based on the combined scores, the documents are ranked, and the top-k results are returned to the user.
This approach allows the system to adaptively balance between conceptual and exact match search depending on the nature of the query and the specific use case.
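The scoring logic itself is easy to illustrate in isolation. The following sketch uses made-up vector distances and BM25 scores for five hypothetical chunks (in the same order) to show the normalization and weighting described above:
import numpy as np

vector_distances = np.array([0.21, 0.35, 0.50, 0.62, 0.80])  # FAISS-style distances: lower = more similar
bm25_scores = np.array([4.2, 0.0, 7.5, 1.1, 0.3])            # BM25 scores: higher = better keyword match

eps = 1e-8  # guard against division by zero when all scores are identical

# Min-max normalize both score sets to [0, 1]; distances are flipped into similarities
vector_scores = 1 - (vector_distances - vector_distances.min()) / (vector_distances.max() - vector_distances.min() + eps)
keyword_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + eps)

alpha = 0.5  # 1.0 = purely semantic, 0.0 = purely keyword-based
combined = alpha * vector_scores + (1 - alpha) * keyword_scores

top_k = np.argsort(combined)[::-1][:3]
print(top_k)  # indices of the three best chunks under the fused score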
Benefits of This Approach
- Improved Retrieval Quality: By combining both semantic and keyword-based search methods, Fusion Retrieval captures the broader meaning of queries while still honoring exact keyword matches, leading to more relevant search results.
- Flexibility: The system can adjust the balance between vector and keyword search using the alpha parameter, making it adaptable for different types of queries or application needs. This flexibility is particularly useful in scenarios where either conceptual relevance or exact keyword matching is more important.
- Robustness: Fusion Retrieval can handle a wide variety of queries, effectively addressing the limitations of individual methods. Whether the query requires understanding context and meaning or matching exact terms, this combined approach delivers reliable results.
- Customizability: The system can be easily modified to incorporate different vector stores, embedding models, or keyword-based retrieval methods, allowing it to evolve and improve as new technologies and techniques emerge.
Implementation
The following is a step-by-step implementation of the algorithm (as illustrated in the architecture above). The source GitHub repository contains the complete code, including the requirements.txt file, the helper functions, the evaluation pipeline, and the data used. Note that this implementation was developed with Python 3.11.0.
- Import Necessary Libraries and Set up Environment Variables
import os
import sys
from dotenv import load_dotenv
from langchain.docstore.document import Document
from typing import List
from rank_bm25 import BM25Okapi
import numpy as np
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))  # Add the parent directory to the path, since we are working with notebooks
from helper_functions import *
from evaluation.evaluate_rag import *
# Load environment variables from a .env file
load_dotenv()
# Set the OpenAI API key environment variable
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
- Encode the PDF into a Vector Store and Return the Split Documents for Later Creating the BM25 Instance
"""
Encode the PDF into a vector store and return the split documents, which will later be used to create the BM25 instance
"""
def encode_pdf_and_get_split_documents(path, chunk_size=1000, chunk_overlap=200):
"""
Encodes a PDF book into a vector store using OpenAI embeddings.
Args:
path: The path to the PDF file.
chunk_size: The desired size of each text chunk.
chunk_overlap: The amount of overlap between consecutive chunks.
Returns:
A tuple of the FAISS vector store containing the encoded book content and the cleaned, chunked documents (used later to build the BM25 index).
"""
# Load PDF documents
loader = PyPDFLoader(path)
documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
)
texts = text_splitter.split_documents(documents)
cleaned_texts = replace_t_with_space(texts)
# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(cleaned_texts, embeddings)
return vectorstore, cleaned_texts
- Load Data and Create Vector Store and get the Chunked Documents
"""
Load data and create vectorstore and get the chunked documents
"""
path = "data/Understanding_Climate_Change.pdf"
vectorstore, cleaned_texts = encode_pdf_and_get_split_documents(path)
- Create a Best Matching 25 (BM25) Index for Retrieving Documents by Keywords
"""
Create a bm25 index for retrieving documents by keywords
"""
def create_bm25_index(documents: List[Document]) -> BM25Okapi:
"""
Create a BM25 index from the given documents.
BM25 (Best Matching 25) is a ranking function used in information retrieval.
It's based on the probabilistic retrieval framework and is an improvement over TF-IDF.
Args:
documents (List[Document]): List of documents to index.
Returns:
BM25Okapi: An index that can be used for BM25 scoring.
"""
# Tokenize each document by splitting on whitespace
# This is a simple approach and could be improved with more sophisticated tokenization
tokenized_docs = [doc.page_content.split() for doc in documents]
return BM25Okapi(tokenized_docs)
bm25 = create_bm25_index(cleaned_texts) # Create BM25 index from the cleaned texts (chunks)
- Define a Function that Retrieves both Semantically and by Keyword, Normalizes the Scores and Gets the Top k Documents
"""
Define a function that retrieves both semantically and by keyword, normalizes the scores and gets the top k documents
"""
def fusion_retrieval(vectorstore, bm25, query: str, k: int = 5, alpha: float = 0.5) -> List[Document]:
"""
Perform fusion retrieval combining keyword-based (BM25) and vector-based search.
Args:
vectorstore (VectorStore): The vectorstore containing the documents.
bm25 (BM25Okapi): Pre-computed BM25 index.
query (str): The query string.
k (int): The number of documents to retrieve.
alpha (float): The weight for vector search scores (1-alpha will be the weight for BM25 scores).
Returns:
List[Document]: The top k documents based on the combined scores.
"""
# Step 1: Get all documents from the vectorstore
all_docs = vectorstore.similarity_search("", k=vectorstore.index.ntotal)
# Step 2: Perform BM25 search
bm25_scores = bm25.get_scores(query.split())
# Step 3: Perform vector search
vector_results = vectorstore.similarity_search_with_score(query, k=len(all_docs))
# Step 4: Normalize scores (a small epsilon guards against division by zero when all scores are equal)
epsilon = 1e-8
vector_scores = np.array([score for _, score in vector_results])
vector_scores = 1 - (vector_scores - np.min(vector_scores)) / (np.max(vector_scores) - np.min(vector_scores) + epsilon)
bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores) + epsilon)
# Step 5: Combine scores
combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores
# Step 6: Rank documents
sorted_indices = np.argsort(combined_scores)[::-1]
# Step 7: Return top k documents
return [all_docs[i] for i in sorted_indices[:k]]
- Test the Algorithm — Add a Usage Example
"""
Test the Algorithm - Add a Usage Example
"""
# Query
query = "What are the impacts of climate change on the environment?"
# Perform fusion retrieval
top_docs = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=0.5)
docs_content = [doc.page_content for doc in top_docs]
show_context(docs_content)
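To get a feel for the alpha parameter, one could also compare the two extremes of the weighting on the same query. The following optional check simply reuses the objects defined above:
# Compare purely semantic retrieval against purely keyword-based retrieval
semantic_only = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=1.0)  # vector scores only
keyword_only = fusion_retrieval(vectorstore, bm25, query, k=5, alpha=0.0)   # BM25 scores only

overlap = {doc.page_content for doc in semantic_only} & {doc.page_content for doc in keyword_only}
print(f"Chunks retrieved by both extremes: {len(overlap)} of 5")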
Conclusion
Fusion Retrieval is a powerful solution for document search, blending the strengths of semantic understanding and keyword matching to create a more flexible, accurate, and adaptable system. By leveraging both vector-based and BM25 retrieval methods, it offers a more comprehensive and nuanced approach to information retrieval. Whether applied in academic research, legal document search, or general-purpose search engines, this method provides superior search quality and adaptability.
The fusion of these two retrieval techniques represents a significant advancement in document search systems, paving the way for more effective and intelligent query handling in a wide range of fields.
References and Resources