The aRt of RAG Part 2: Hybrid Retrieval with Atlas

Ross Ashman (PhD)
7 min read · Jan 7, 2024

Photo by Pierre Bamin on Unsplash

Introduction

In part one we looked at using MongoDB Atlas, Textract and LlamaIndex to build the “R” (retrieval) part of a RAG system. This time we combine vector search with the built-in keyword search functionality of MongoDB Atlas.

Hybrid retrieval

Hybrid search is a search methodology that integrates multiple search algorithms to improve the accuracy of search results. The term does not prescribe which algorithms are combined, but in practice it usually means pairing traditional keyword-based search with vector search.

In the past, search engines predominantly relied on keyword-based search as the primary option. However, the development of word and sentence embedding algorithms introduced vector embeddings, giving rise to a novel search approach known as vector or semantic search. This method enables semantic searching across data, presenting a contrast to the traditional keyword-based approach. Both search techniques come with inherent tradeoffs:

Keyword-based search excels at precise keyword matching, making it useful for specific terms like product names or industry jargon. However, its reliance on exact words makes it susceptible to typos and synonyms. Results can be too broad (a given word can appear anywhere, regardless of context) or too narrow (passages with a similar meaning but different words are missed).

Vector or semantic search, on the other hand, leverages semantic meaning for multi-lingual and multi-modal search capabilities, proving resilient to typos. However, it may overlook crucial keywords, relying heavily on the quality of generated vector embeddings and being sensitive to out-of-domain terms.

By amalgamating keyword-based and vector-based searches into a hybrid search, one can capitalize on the strengths of both techniques, thereby enhancing the relevance of search results, particularly in text-search scenarios.

Atlas keyword search (bm25 index)

The Atlas search engine for MongoDB is built on Lucene, the same library that underpins keyword search applications such as Elasticsearch, Solr, and OpenSearch. BM25 is the default scoring formula in Lucene and is the scoring formula used by Atlas Search. BM25 stands for “Best Match 25” (the 25th iteration of this scoring algorithm).
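To give a feel for what BM25 rewards, here is a minimal sketch of the textbook formula in plain Python. It is not the Lucene or Atlas implementation: the whitespace tokenizer and the `bm25_scores` helper are assumptions made for illustration, and `k1=1.2`, `b=0.75` are simply the commonly used defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Average document length; fall back to 1 to avoid dividing by zero
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalised
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "cosmic ray protons",
    "magnetic fields in plasma",
    "protons protons everywhere in the cosmic plasma",
]
scores = bm25_scores("protons", docs)
print(scores)
```

The second document never mentions the query term, so its score is zero. That is exactly the exact-match behaviour that vector search is meant to complement.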

In order to build a hybrid system we will need to build a keyword index. Creating keyword search indexes in Atlas is very simple. In part 1 of “aRt of RAG” we created a database called medium01 and a collection called test_collection. Using this collection, we can easily create a keyword index.

Log in to MongoDB Atlas and select “Database”, which you can see top left. Then select “Collections” to find our “test_collection”. To the right of “Collections” you will see a tab called “Atlas Search”.

After selecting the “Atlas Search” tab you will see a button called “Create Search Index”. After pushing the button you will see the following page:

Select “JSON Editor”, then add the following json:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "text": {
        "type": "string"
      }
    }
  }
}

Save with the name “test_collection_keyword_index”, and Atlas will start building your keyword index for the collection “test_collection”.
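If you prefer to script this step rather than use the UI, the same definition can be submitted from Python. The sketch below is an assumption-laden alternative, not what was done in this article: it assumes PyMongo 4.5 or later (which provides `create_search_index` and `SearchIndexModel`) and a cluster that allows programmatic search index creation. The helper names `keyword_index_definition` and `create_keyword_index` are hypothetical.

```python
def keyword_index_definition(field="text"):
    """Build the same index definition that we pasted into the JSON Editor."""
    return {
        "mappings": {
            "dynamic": True,
            "fields": {field: {"type": "string"}},
        }
    }


def create_keyword_index(collection, name="test_collection_keyword_index"):
    """Ask Atlas to build the keyword index on the given PyMongo collection."""
    # Imported lazily so the definition helper works without PyMongo installed
    from pymongo.operations import SearchIndexModel  # assumes PyMongo >= 4.5

    model = SearchIndexModel(definition=keyword_index_definition(), name=name)
    return collection.create_search_index(model)


definition = keyword_index_definition()
print(definition)
```

Calling `create_keyword_index(mycollection)` with a connected collection should return the name of the index, which Atlas then builds in the background.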

Atlas vector search (vector index)

In part one we built a vector index and search engine using the built-in capabilities of Atlas; see part one for how this was done. It was a deliberately generic implementation, so you can see what goes on under the hood when creating your collection, generating the embeddings and performing the search.
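For orientation, a vector index definition compatible with the `$vectorSearch` stage used later might look like the sketch below. This is an assumption rather than the definition from part one: the `embedding` path matches the query code in this article, 768 is the output size of the `distilbert-base-nli-stsb-mean-tokens` model, and the cosine similarity setting and the `vector_index_definition` helper are hypothetical choices.

```python
def vector_index_definition(path="embedding", num_dimensions=768, similarity="cosine"):
    """Build an Atlas Vector Search index definition (illustrative values only)."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                      # field that stores the embedding
                "numDimensions": num_dimensions,   # must match the embedding model
                "similarity": similarity,          # assumption: cosine similarity
            }
        ]
    }


vector_definition = vector_index_definition()
print(vector_definition)
```

Whatever definition you use, the number of dimensions has to match the length of the vectors returned by your embedding model.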

Reciprocal Rank Fusion

Reciprocal Rank Fusion (RRF) is a method used in information retrieval and search engine result merging to combine ranked lists of items from multiple sources. The goal is to improve the overall ranking by using the position of each item in several previously ranked result lists, rather than their raw search scores, to produce a unified result set.

Here’s a breakdown of the key components:

  1. Ranked Lists: In information retrieval, various algorithms or sources may generate ranked lists of items (such as documents, search results, recommendations, etc.) based on their perceived relevance to a given query or user.
  2. Reciprocal Rank: The reciprocal rank of an item is the reciprocal of its position in a ranked list. If an item is at position k, its reciprocal rank is 1/k. RRF adds a smoothing constant c (60 in the original paper) so that an item at position k contributes 1/(c + k), which stops the very top positions from dominating the result.
  3. Fusion: Reciprocal Rank Fusion merges multiple ranked lists by summing each item's reciprocal rank contributions across the individual lists and sorting by the total. Items that appear near the top of several lists are boosted above items that appear in only one list or far down the rankings.

The fusion process can optionally assign weights to the different sources, for example based on their historical performance or reliability. These weights control how much each source contributes to the final merged ranking. Standard RRF weights all sources equally, which is what we do here.

Reciprocal Rank Fusion is often used in collaborative filtering, meta-search engines, and recommendation systems where information from multiple sources needs to be integrated to provide a more accurate and comprehensive ranking of items. By leveraging the reciprocal rank, we can prioritise items that are consistently ranked higher across different sources, improving the overall quality of the merged list.
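A small worked example makes the arithmetic concrete. The sketch below fuses two toy ranked lists of document ids using the unweighted formula with c = 60; the `rrf_scores` helper and the ids are made up for illustration.

```python
def rrf_scores(ranked_lists, c=60):
    """Sum 1 / (c + rank) for every item across all ranked lists."""
    scores = {}
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (c + rank)
    return scores


vector_hits = ["a", "b", "c"]   # hypothetical vector search ranking
keyword_hits = ["b", "c", "d"]  # hypothetical keyword search ranking

scores = rrf_scores([vector_hits, keyword_hits])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['b', 'c', 'a', 'd']
```

Documents "b" and "c" appear in both lists, so they outrank "a" even though "a" was the top vector hit. "a" still beats "d" because 1/61 is larger than 1/63.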

Amalgamating keyword and vector search

We are going to use the RRF algorithm to combine the keyword and vector search results into our final hybrid search result. To do this we use two functions. The first is weighted_reciprocal_rank. As input it takes a list of ranked lists and returns a single sorted list.

def weighted_reciprocal_rank(doc_lists):
    """
    This is a modified version of the function in the langchain repo
    https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/retrievers/ensemble.py

    Perform weighted Reciprocal Rank Fusion on multiple rank lists.
    You can find more details about RRF here:
    https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

    Args:
        doc_lists: A list of rank lists, where each rank list contains unique
            items. Each item is a dict with at least a "text" key.

    Returns:
        list: The final aggregated list of items sorted by their weighted RRF
            scores in descending order.
    """
    c = 60  # c comes from the paper
    weights = [1] * len(doc_lists)  # you can apply weights if you like, here they are all the same, ie 1

    if len(doc_lists) != len(weights):
        raise ValueError(
            "Number of rank lists must be equal to the number of weights."
        )

    # Create a union of all unique documents in the input doc_lists
    all_documents = set()
    for doc_list in doc_lists:
        for doc in doc_list:
            all_documents.add(doc["text"])

    # Initialize the RRF score dictionary for each document
    rrf_score_dic = {doc: 0.0 for doc in all_documents}

    # Calculate RRF scores for each document
    for doc_list, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(doc_list, start=1):
            rrf_score = weight * (1 / (rank + c))
            rrf_score_dic[doc["text"]] += rrf_score

    # Sort documents by their RRF scores in descending order
    sorted_documents = sorted(
        rrf_score_dic.keys(), key=lambda x: rrf_score_dic[x], reverse=True
    )

    # Map the sorted page_content back to the original document objects
    page_content_to_doc_map = {
        doc["text"]: doc for doc_list in doc_lists for doc in doc_list
    }
    sorted_docs = [
        page_content_to_doc_map[page_content] for page_content in sorted_documents
    ]

    return sorted_docs

The second function is atlas_hybrid_search. It takes our query and a few other parameters, runs the vector and keyword searches, and then calls weighted_reciprocal_rank to produce our hybrid search result. It relies on two small helpers, mongo_connect and generate_embedding.

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi


def mongo_connect(uri):
    """
    Args:
        uri: MongoDB connection string

    Returns:
        client: a MongoClient for the deployment
    """
    # Create a new client and connect to the server.
    # This is done outside the try block so that `client` is always defined.
    client = MongoClient(uri, server_api=ServerApi('1'))

    # Send a ping to confirm a successful connection
    try:
        client.admin.command('ping')
        print("Pinged your deployment. You successfully connected to MongoDB!")
    except Exception as e:
        print(e)

    return client

def generate_embedding(text):
    from sentence_transformers import SentenceTransformer
    # Note: the model is loaded on every call; cache it if you embed many queries
    model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
    embedding = model.encode(text)
    return embedding.tolist()

def atlas_hybrid_search(query, top_k, db_name, collection_name, vector_index_name, keyword_index_name):
    """
    Run a vector search and a keyword search, then use
    weighted_reciprocal_rank to get the final result.

    Note: the searches run against the module-level `mycollection` object
    (created in the example usage); db_name and collection_name are not
    used to look the collection up.

    Args:
        query: The query to search for.
        top_k: Number of results to retrieve from each search.
        vector_index_name: Name of the Atlas vector index.
        keyword_index_name: Name of the Atlas keyword index.

    Returns:
        A list of reranked documents.
    """

    # vector search
    query_vector = generate_embedding(query)

    vector_results = mycollection.aggregate([
        {
            "$vectorSearch": {
                "queryVector": query_vector,
                "path": "embedding",
                # numCandidates must not be smaller than limit
                "numCandidates": max(10, top_k),
                "limit": top_k,
                "index": vector_index_name
            },
        },
        {
            "$project": {
                "_id": 1,
                "page": 1,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"}
            }
        }
    ])
    x = list(vector_results)

    # keyword search
    keyword_results = mycollection.aggregate([
        {
            "$search": {
                "index": keyword_index_name,
                "text": {
                    "query": query,
                    "path": "text"
                }
            }
        },
        {"$addFields": {"score": {"$meta": "searchScore"}}},
        {"$limit": top_k}
    ])
    y = list(keyword_results)

    doc_lists = [x, y]
    # Enforce that retrieved docs have the same form in each list
    for i in range(len(doc_lists)):
        doc_lists[i] = [
            {"_id": str(doc["_id"]), "text": doc["text"], "score": doc["score"]}
            for doc in doc_lists[i]
        ]

    # apply rank fusion
    fused_documents = weighted_reciprocal_rank(doc_lists)

    return fused_documents

Example usage

Now that we have everything set up, it's time to put it into action. To do so we create the parameters needed and pass them to our shiny new function atlas_hybrid_search.

from pymongo import MongoClient

uri = "mongodb+srv://blah_blah_blah" #your connection credentials
client = mongo_connect(uri)
database_names = client.list_database_names() #just to test you are connected


db_name='medium01'
collection_name='test_collection'
vector_index_name = 'test_index'
keyword_index_name = 'test_collection_keyword_index'
db = client.get_database(db_name)
mycollection = db.get_collection(collection_name)

query = "Our Monte Carlo model predicts that protons are easily accelerated beyond the knee \
in the cosmic ray gy density as the plasma expands downstream from the spectrum; the high magnetic fields"
top_k = 4


result = atlas_hybrid_search(query, top_k, db_name, collection_name, vector_index_name, keyword_index_name)
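To inspect what comes back, a small formatting helper is handy. The sketch below assumes the fused results are dicts with `_id`, `text` and `score` keys, as produced by atlas_hybrid_search. The `summarize` helper and the sample documents are hypothetical stand-ins, so the block runs without a database.

```python
def summarize(docs, width=60):
    """Return one short line per fused document, in ranked order."""
    lines = []
    for position, doc in enumerate(docs, start=1):
        # Collapse newlines and repeated spaces, then truncate the snippet
        snippet = " ".join(doc["text"].split())[:width]
        lines.append(f"{position}. [{doc['_id']}] {snippet}")
    return lines


# Stand-in for the list returned by atlas_hybrid_search
sample_result = [
    {"_id": "1", "text": "protons are accelerated\nbeyond the knee", "score": 0.91},
    {"_id": "2", "text": "high magnetic fields in the plasma", "score": 3.2},
]

for line in summarize(sample_result):
    print(line)
```

Note that the `score` field on each fused document is the original score from one of the two searches, not the RRF score. The order of the list is what carries the fused ranking.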

Happy Hybrid searching with Atlas

Ross Ashman (PhD)

Data Scientist Lead, AI/ML/DL, Unstructured data specialist