Qdrant Hybrid Search under the hood using Haystack

Nicola Procopio
8 min read · Jun 6, 2024


Image from: Haystack blog

Gentle Introduction to Hybrid Search

Semantic search seeks to improve search accuracy by understanding the content of the search query.
In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.
With the RAG explosion, semantic search is increasingly being used in more and more specific contexts, and this has exposed a problem with the models: they have no access to domain-specific vocabulary.
Think, for example, of healthcare or legal settings, where technical terms and acronyms are used that have hardly appeared in the retriever's training set.
Retrievers are generally trained on general-domain sentences, and if you don't want to fine-tune them, you can boost their performance using classical keyword-based search.
Hybrid search merges dense and sparse vectors together to deliver the best of both search methods.
Generally speaking, dense vectors excel at understanding the context of the query, whereas sparse vectors excel at keyword matches.
Consider the query: “How to catch an Alaskan Pollock”.
The dense vector representation is able to disambiguate “catch” as meaning fishing rather than baseball or sickness. The sparse vector search will match the words “Alaskan Pollock” only.
This example query shows where hybrid search combines the best of both sparse and dense vectors.

TL;DR: Sparse & Dense Vectors

Classical search engines are based on sparse representations for retrieval: documents containing the query words are considered relevant. If you search for "manual", you don't find documents containing "handbook" or "guide".
Bag-of-Words (BoW) is the simplest sparse representation. You build a vocabulary composed of all the words in your corpus and, for each document, construct a vector with the same length as the vocabulary: put a 1 for each word that appears in the document and a 0 for each missing word.

Image from: neural search pills
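
To make this concrete, here is a minimal BoW sketch (the toy corpus is invented for illustration):

corpus = [
    "the quick brown fox",
    "the lazy dog",
]

# Vocabulary: every unique word in the corpus, in a fixed order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def bow_vector(doc: str) -> list[int]:
    # 1 if the vocabulary word appears in the document, 0 otherwise.
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocabulary]

print(vocabulary)             # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
print(bow_vector(corpus[0]))  # [1, 0, 1, 0, 1, 1]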

The most widely used algorithm for sparse retrieval is BM25, which is based on TF-IDF. SPLADE is a neural retrieval model that learns sparse query/document expansions.
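
For reference, a simplified sketch of the textbook BM25 scoring function (k1 and b set to common defaults; real implementations add an inverted index and other optimizations):

import math

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n       # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                      # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score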

In Neural Search, a deep learning model is used to represent both the documents and the query as vectors. At retrieval time, the query vector is compared against the document vectors to provide the most pertinent documents. The vectors can also capture semantic information and other information related to the context in which a given word appears.

Image from: dense vectors capturing meaning with code
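
In code, dense retrieval boils down to ranking documents by vector similarity; a toy sketch with invented 4-dimensional vectors (real embeddings have hundreds of dimensions):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.1, 0.8, 0.3, 0.0])
docs = {
    "doc_a": np.array([0.2, 0.7, 0.1, 0.0]),
    "doc_b": np.array([0.9, 0.0, 0.0, 0.4]),
}

# Rank documents by similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']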

Introduction to Haystack, Qdrant and FastEmbed

Haystack is an open source Python framework by deepset for building custom apps with LLMs.
IMHO, Haystack is easy to use for fast prototyping: it is clear thanks to its graph/pipeline-based logic, and it has many integrations and clear documentation.
Haystack supports several document stores, one of them is Qdrant.

Image from: choosing a document store

Qdrant is an open source, pure vector database: it focuses on vector storage, semantic search, and recommender systems for maximum performance and scalability.
This brings us to the last component of our technology stack: embedders. To generate the vectors in our example we will use models supported by the fastembed library.
FastEmbed is a library developed and maintained by Qdrant that aims to distribute quantized models, which are therefore lightweight and carry fewer dependencies, great for scaling applications in production.
Recent versions of Qdrant also support sparse vectors (and sparse retrieval), which makes it possible to build hybrid search applications without resorting to workarounds.
Similarly, FastEmbed has added some sparse embedders to its catalog, currently SPLADE models, but only for English.
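
For a quick taste of the fastembed API outside Haystack, a hedged sketch (assuming a recent fastembed version; check its docs for the current model catalog):

from fastembed import TextEmbedding, SparseTextEmbedding

text = "How to catch an Alaskan Pollock"

# Dense embedder: one fixed-size vector per text.
dense_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
dense = list(dense_model.embed([text]))
print(len(dense[0]))  # 384 dimensions

# Sparse embedder (SPLADE): index/value pairs instead of a full vector.
sparse_model = SparseTextEmbedding(model_name="prithvida/Splade_PP_en_v1")
sparse = list(sparse_model.embed([text]))
print(sparse[0].indices[:5], sparse[0].values[:5])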

Step-by-step example

The beauty of Haystack is that it lets you write applications in a few lines of code, and with pipelines everything comes out very elegant. Now let's take the hybrid search example with Qdrant and explain it step by step.

pip install haystack-ai qdrant-haystack fastembed-haystack

The complete example from the QdrantHybridRetriever documentation:

# Imports

from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.fastembed import (
    FastembedTextEmbedder,
    FastembedDocumentEmbedder,
    FastembedSparseTextEmbedder,
    FastembedSparseDocumentEmbedder,
)

# Indexing

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    use_sparse_embeddings=True,
    embedding_dim=384,
)

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="fastembed is supported by and maintained by Qdrant."),
]

indexing = Pipeline()
indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1"))
indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5"))
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
indexing.connect("dense_doc_embedder", "writer")

indexing.run({"sparse_doc_embedder": {"documents": documents}})

# Querying

querying = Pipeline()
querying.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
querying.add_component("dense_text_embedder", FastembedTextEmbedder(
    model="BAAI/bge-small-en-v1.5",
    prefix="Represent this sentence for searching relevant passages: ",
))
querying.add_component("retriever", QdrantHybridRetriever(document_store=document_store))

querying.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
querying.connect("dense_text_embedder.embedding", "retriever.query_embedding")

question = "Who supports fastembed?"

results = querying.run(
    {"dense_text_embedder": {"text": question},
     "sparse_text_embedder": {"text": question}}
)

print(results["retriever"]["documents"][0])

# Document(id=...,
#          content: 'fastembed is supported by and maintained by Qdrant.',
#          score: 1.0)

The example has three parts:

  • Imports: all libraries and methods to develop your pipeline
  • Indexing: the first pipeline, from the DocumentStore initialization until the documents are written to the index
  • Querying: the second pipeline, ask in natural language and retrieve documents

After the imports, the first step is to initialize the DocumentStore; here QdrantDocumentStore is set to in-memory mode.
That's a good choice for test scenarios and quick experiments in which you don't plan to store lots of vectors.

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    use_sparse_embeddings=True,
    embedding_dim=384,
)

Another type of initialization is disk-persisted mode, which is also useful for prototyping and experiments.

document_store = QdrantDocumentStore(
    path="/home/qdrant/storage_local",
    index="Document",
    recreate_index=True,
    use_sparse_embeddings=True,
    embedding_dim=384,
)

Another connection option is Qdrant Cloud, if you have an API key and your cluster URL from the Qdrant dashboard.

from haystack.dataclasses.document import Document
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    url="https://xxxxxx-xxxxx-xxxxx-xxxx-xxxxxxxxx.us-east.aws.cloud.qdrant.io:6333",
    api_key="<your-api-key>",
)

We pay attention to two parameters in particular:

  • embedding_dim: the size of the dense vectors, which must be chosen during initialization
  • use_sparse_embeddings: this must be set to True to enable sparse vectors; by default Qdrant uses only dense vectors

If you want to use Document Store or collection previously created with this feature disabled, you must migrate the existing data. You can do this by taking advantage of the migrate_to_sparse_embeddings_support utility function.
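
A hedged sketch of that migration (the import path and call follow the qdrant-haystack docs, but may differ across versions, so treat them as assumptions and check the integration documentation):

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
# Assumption: this import path may vary by qdrant-haystack version.
from haystack_integrations.document_stores.qdrant.migrate_to_sparse import (
    migrate_to_sparse_embeddings_support,
)

# Existing store, created without sparse support.
old_document_store = QdrantDocumentStore(path="/home/qdrant/storage_local", index="Document")

# Copies the data into a new index with sparse embeddings enabled.
migrate_to_sparse_embeddings_support(old_document_store, "Document_sparse")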

Now we can build the indexing pipeline. Clearly, we need both a model that produces dense vectors and one that produces sparse vectors. Both are initialized using fastembed, in particular the Document variants (which work on Document objects rather than plain text). Once the vectors are created and attached to each Document, we can write it to the DocumentStore; as the strategy for duplicate documents we chose to overwrite.

indexing = Pipeline()
indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1"))
indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5"))
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
indexing.connect("dense_doc_embedder", "writer")

indexing.run({"sparse_doc_embedder": {"documents": documents}})
Image created with Haystack pipeline.draw()

The second part is the most interesting one, because Qdrant has developed a component that acts directly as a hybrid retriever: it accepts both sparse and dense vectors and contains the mechanism for merging the result lists that come from the two searches. Let's look at the code and then start the analysis.

querying = Pipeline()
querying.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
querying.add_component("dense_text_embedder", FastembedTextEmbedder(
model="BAAI/bge-small-en-v1.5", prefix="Represent this sentence for searching relevant passages: ")
)
querying.add_component("retriever", QdrantHybridRetriever(document_store=document_store))

querying.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
querying.connect("dense_text_embedder.embedding", "retriever.query_embedding")

As with documents, here too we need models that transform the query into vectors; clearly the models are the same, but they accept plain text as input.
Note that FastembedTextEmbedder adds a prefix to the query for better performance: it gives the model context about what we want, in this case the instruction "Represent this sentence for searching relevant passages: ".
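
To see what each query embedder produces, a hedged sketch of running them outside the pipeline (the output keys match the socket names used in the connections above; the prefix is prepended to the text before embedding):

sparse_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")
sparse_embedder.warm_up()
print(sparse_embedder.run(text="Who supports fastembed?")["sparse_embedding"])

dense_embedder = FastembedTextEmbedder(
    model="BAAI/bge-small-en-v1.5",
    prefix="Represent this sentence for searching relevant passages: ",
)
dense_embedder.warm_up()
print(len(dense_embedder.run(text="Who supports fastembed?")["embedding"]))  # 384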

Both vectors are passed as input to QdrantHybridRetriever, a component that retrieves documents from a QdrantDocumentStore using both dense and sparse vectors and fuses the results using Reciprocal Rank Fusion.
So we have two vectors representing the same query:

  • dense, like [0.75, -0.83, …, 0.21]
  • sparse, like {0: 0.1, 3: 0.5, 5: 0.12}, i.e. index: value pairs

In the retrieval step, the dense branch uses the similarity metric set when the DocumentStore is initialized (in this case cosine similarity), while the sparse branch uses the dot product: multiplying the corresponding elements of the query and document vectors and summing the products.
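
A minimal sketch of that sparse dot product, with invented index/value pairs:

def sparse_dot(query: dict[int, float], doc: dict[int, float]) -> float:
    # Only indices present in both vectors contribute to the score.
    return sum(value * doc[idx] for idx, value in query.items() if idx in doc)

query_vec = {0: 0.1, 3: 0.5, 5: 0.12}
doc_vec = {3: 0.4, 5: 0.3, 7: 0.9}
print(sparse_dot(query_vec, doc_vec))  # 0.5*0.4 + 0.12*0.3 = 0.236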

To go deeper into sparse retrieval and the score formula, check this Qdrant article.

Now we have two lists and we need to merge them, but the scores are not comparable, or even combinable with linear functions.
Which joiner to choose is an open question, and there are different schools of thought:

  1. Concatenate the two lists, remove duplicates, and apply a cross-encoder to do two-stage retrieval (see the sketch after this list). This is a good solution, but you have to add a step to the pipeline, and the ordering may be upset by the re-ranker.
  2. Use a joiner based on the position of each document in the list, not its score.
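
A hedged sketch of option 1 (the model name is just an example, and dense_results/sparse_results are hypothetical lists of Haystack Documents coming from the two searches):

from sentence_transformers import CrossEncoder

# Deduplicate the concatenated candidate lists by document id.
candidates = list({doc.id: doc for doc in dense_results + sparse_results}.values())

# Second stage: score each (query, passage) pair with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(question, doc.content) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates),
                                     key=lambda pair: pair[0], reverse=True)]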

Strategy 2 is the one used by Qdrant, which applies Reciprocal Rank Fusion (RRF) to the two lists.
The RRF score is calculated by taking the sum of the reciprocal ranks given by each list. Putting the rank of the document in the denominator penalizes documents that are ranked lower in a list.
In the following example we have three documents, [A, B, C], perform sparse and dense searches, and calculate the RRF in the Result column.

If the calculations with the scores that come out of the pipeline and the table above don't add up for you, no problem.
The RRF formula, given r, the rank of the document in a list, is 1/(k + r), where k is a constant that mitigates the impact of high rankings by outlier systems.
In the original paper k has the value 60 (Haystack's DocumentJoiner uses a similar default), so you may see very low scores even in the top positions of the list for documents very similar to the query. Qdrant sets this constant to 2.
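
To make the formula concrete, a small worked example with invented rankings and Qdrant's k=2:

k = 2
dense_ranking = ["A", "B", "C"]   # ranks 1, 2, 3 from the dense search
sparse_ranking = ["B", "C", "A"]  # ranks 1, 2, 3 from the sparse search

def rrf(doc: str) -> float:
    # Sum 1/(k + rank) over every list the document appears in.
    return sum(1 / (k + ranking.index(doc) + 1)
               for ranking in (dense_ranking, sparse_ranking))

for doc in sorted("ABC", key=rrf, reverse=True):
    print(doc, round(rrf(doc), 3))
# B 0.583  (1/4 + 1/3)
# A 0.533  (1/3 + 1/5)
# C 0.45   (1/5 + 1/4)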

With this step the pipeline has completed its work and returns as output the top-k documents most similar to our query.

Image created with Haystack pipeline.draw()
