Enhancing RAG with Hybrid Search Using Qdrant and LlamaIndex 🚀

Samvardhan V G
Apr 21, 2024


In the Retrieval-Augmented Generation (RAG) architecture, three main components — ingestion, retrieval, and synthesis — work together to enhance information processing and response generation:

1. Ingestion 📥: In this first step, documents are collected, split into smaller chunks, converted into embeddings, and stored in a vector database so they can be retrieved efficiently later.

2. Retrieval 🔍: Here, the system retrieves relevant information based on user queries, using techniques like vector searches or hybrid searches to ensure relevance.

3. Synthesis 🧬: The final stage integrates the retrieved data with a language model to generate coherent and contextually enriched responses.

Retrieval 🔍

Retrieval plays a crucial role in this pipeline, and two methods are popular: keyword search and semantic search.

  • Keyword Search 🔑: This method retrieves chunks that exactly match the specified keywords.
  • Semantic Search 🧠: This method uses vector similarity to retrieve chunks. It employs the k-nearest neighbors algorithm to find the most relevant matches based on the query’s context.
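To make the semantic-search idea concrete, here is a toy sketch of k-nearest-neighbor retrieval over embeddings: each chunk and the query become vectors, and the k chunks whose vectors have the highest cosine similarity to the query win. The 3-dimensional "embeddings" below are stand-ins for real model output.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn_search(query_vec, chunk_vecs, k=2):
    # Score every chunk against the query, keep the k best indices.
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

chunks = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.9]]
print(knn_search([1.0, 0.0, 0.0], chunks, k=2))  # → [0, 1]
```

In a real pipeline the vectors come from an embedding model and the search is served by a vector database such as Qdrant rather than a Python loop.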

Hybrid Search:

We can also increase accuracy using hybrid search, which combines both sparse (exact match or keyword-based search) and dense vectors (semantic search) for retrieval. This approach leverages the strengths of both methods to improve the relevance and precision of the results.
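One common way to combine the two result lists is Reciprocal Rank Fusion (RRF), which scores each document by its rank position in every list. The sketch below is illustrative (the document ids and rankings are made up); it is one fusion strategy, not necessarily the exact algorithm any particular engine uses.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: a list of ranked lists of document ids, best first.
    # Each appearance at rank r contributes 1 / (k + r + 1) to the doc's score.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc4"]    # sparse / exact-match ranking
semantic_hits = ["doc1", "doc2", "doc3"]   # dense / embedding ranking

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# → ['doc1', 'doc3', 'doc2', 'doc4']
```

Documents that rank well in both lists (like doc1 here) rise to the top of the fused ranking, which is exactly the behavior hybrid search relies on.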

In this article, we’ll explore how to build a straightforward RAG pipeline using hybrid search retrieval, utilizing the Qdrant vector database and the LlamaIndex framework. 🛠️

Access the code on GitHub 📌

Objective 🔮

Build a simple RAG pipeline over a web article, index it in Qdrant, and compare retrieval results from the SPARSE and HYBRID modes for the same query.

Let’s get started

First, let’s run Qdrant locally with Docker:

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant


Let’s take a web page, for instance, this one: https://www.thoughtworks.com/en-in/insights/blog/data-strategy/building-an-amazon-com-for-your-data-products which discusses Building An “Amazon.com” For Your Data Products

To process the content from the webpage, we’ll first extract the article’s main text, then divide it into manageable sections, and convert these sections into vector embeddings using the “avsolatorio/GIST-Embedding-v0” model. Finally, these embeddings will be ready to load into a vector database for advanced retrieval applications. 📝🔄🗃️
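The chunking step can be pictured with a simplified stand-in: split the extracted article text into overlapping windows of words. In the actual pipeline a LlamaIndex node parser does this work and each chunk is then embedded with “avsolatorio/GIST-Embedding-v0”; the sizes here are purely illustrative.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    # Slide a window of `chunk_size` words, advancing by chunk_size - overlap
    # each step so consecutive chunks share `overlap` words of context.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

article = ("word " * 120).strip()  # placeholder for the extracted article body
chunks = chunk_text(article, chunk_size=50, overlap=10)
print(len(chunks))  # → 3
```

Overlap matters because a sentence cut in half at a hard boundary can lose the context a retriever needs to match it.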

After loading the embeddings, information can be retrieved using two query modes provided by LlamaIndex: SPARSE and HYBRID.

1. SPARSE (keyword-based search): This mode corresponds to keyword search. It uses sparse vectors, which are ideal for exact matches and keyword-based queries, relying on precise matches between the query and the data.

2. HYBRID: Combining both sparse vectors (keyword-based search) and dense vectors (semantic search), this method retrieves information based on both exact keyword matches and semantic similarities. It integrates results from both vector types using a fusion algorithm to rank and organize the retrieved data effectively.
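Wiring this up in LlamaIndex might look like the configuration sketch below. It assumes `llama-index` with the Qdrant integration (`llama-index-vector-stores-qdrant`) is installed, Qdrant is running on localhost:6333, and `documents` holds the chunked article from the ingestion step; the collection name `hybrid_demo` is illustrative.

```python
import qdrant_client
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(url="http://localhost:6333")

# enable_hybrid=True tells the store to keep both dense and sparse vectors.
vector_store = QdrantVectorStore(
    client=client,
    collection_name="hybrid_demo",  # illustrative name
    enable_hybrid=True,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Two retrievers over the same index: sparse-only vs. hybrid fusion.
sparse_retriever = index.as_retriever(
    vector_store_query_mode="sparse", similarity_top_k=2
)
hybrid_retriever = index.as_retriever(
    vector_store_query_mode="hybrid", similarity_top_k=2, sparse_top_k=10
)
```

Both retrievers read from the same Qdrant collection; only the query mode changes, which is what makes a side-by-side comparison fair.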

Let’s define a function called execute_and_compare to evaluate and contrast the retrieval results of the hybrid and sparse methods for the specific question “What are Data products?”
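A minimal sketch of that helper is shown below. The real version would build its retrievers from the Qdrant-backed index; here they are passed in, and anything exposing `.retrieve(query)` returning nodes with `.get_text()` and `.score` (the LlamaIndex retriever shape) will work.

```python
def execute_and_compare(query, retrievers):
    # retrievers: dict mapping a label (e.g. "hybrid", "sparse") to a
    # retriever object. Returns {label: [(chunk_text, score), ...]}.
    results = {}
    for name, retriever in retrievers.items():
        nodes = retriever.retrieve(query)
        results[name] = [(node.get_text(), node.score) for node in nodes]
    return results
```

Called as `execute_and_compare("What are Data products?", {"hybrid": hybrid_retriever, "sparse": sparse_retriever})`, it returns both result sets so the top chunks and scores can be printed side by side.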

Hybrid Results:

Data product characteristics as originally defined by Zhamak Dehghani. In the focus on building data trust with business stakeholders, it’s easy to lose sight of the importance of also building trust with data teams across different domains. However, a data product must be trustworthy if it’s to encourage the reuse of data products. This is what ultimately **separates data mesh from data silo.** The data product is trustworthy if data consumers are confident in the accuracy and reliability of the data. Data products should be transparent with regards to information quality metrics and performance promises. Creating a central marketplace or catalog of internal data products is a great first step to raising awareness, but more is needed to convince skeptical data consumers to actually start using them. For this, we can take a page out of Amazon.com’s playbook. Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench

Score: 0.8020849742832907

Sparse Results:

Data product characteristics as originally defined by Zhamak Dehghani. In the focus on building data trust with business stakeholders, it’s easy to lose sight of the importance of also building trust with data teams across different domains. However, a data product must be trustworthy if it’s to encourage the reuse of data products. This is what ultimately **separates data mesh from data silo.** The data product is trustworthy if data consumers are confident in the accuracy and reliability of the data. Data products should be transparent with regards to information quality metrics and performance promises. Creating a central marketplace or catalog of internal data products is a great first step to raising awareness, but more is needed to convince skeptical data consumers to actually start using them. For this, we can take a page out of Amazon.com’s playbook. Amazon provides an incredible amount of detail to help consumers purchase products from unknown third-parties. Take the example of something as simple as a wrench

Score: 0.8020849742832907

Both the hybrid and sparse methods retrieved the same chunk with the same relevance score, which suggests that keyword relevance and semantic context are strongly aligned for this particular query. This won’t always be the case: for many queries the two strategies return different chunks or rank them differently.

Conclusion

In theory and practice, hybrid search generally outperforms both semantic and keyword searches by combining their strengths for more accurate and relevant results.
