How to perform hybrid search using watsonx.data Milvus

Published in Milvus Meets Watsonx, Jul 19, 2024 (8 min read)

Introduction

Modern applications increasingly require efficient retrieval of information from vast, unstructured datasets. Vector search has emerged as a powerful solution, representing data as high-dimensional vectors to find similarities. However, traditional methods can struggle with real-world data complexities.

Enter hybrid search, which combines the strengths of sparse and dense vectors for more accurate results. At the forefront of this technology is Milvus, an open-source vector database designed for scalability and high performance.

Milvus is integrated into watsonx.data, IBM’s open data lakehouse, offering enterprise-grade search capabilities directly within a scalable, efficient and secure data platform.

Background

Dense Vectors

Dense vectors are fixed-length numerical representations in which nearly every dimension holds a non-zero value. These vectors are typically generated by neural networks, enabling searches based on contextual similarity rather than just exact keyword matches. For example, in natural language processing, words or sentences with similar meanings have dense vector representations that are close to each other in the vector space.

Key characteristics of dense vectors:

Fixed length (e.g., 384 dimensions)
All dimensions contain values
Capture semantic and contextual information
Efficient for nearest neighbor search algorithms
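
To make "close to each other in the vector space" concrete, here is a minimal sketch of cosine similarity, one common closeness measure for dense vectors. The three 4-dimensional vectors are made-up toy values, not real embeddings, and the function name is only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values for illustration only).
cat = [0.9, 0.1, 0.3, 0.2]
kitten = [0.8, 0.2, 0.4, 0.1]
invoice = [0.1, 0.9, 0.1, 0.7]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

A real embedding model produces hundreds of dimensions, but the comparison works the same way.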

Sparse Vectors

Sparse vectors, on the other hand, have most of their values as zero. They are typically much higher-dimensional than dense vectors, but only a small fraction of these dimensions have non-zero values. Common methods for generating sparse vectors include TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25). Sparse vectors are excellent at capturing specific features or exact matches.

Key characteristics of sparse vectors:

Very high dimensionality (e.g., vocabulary size)
Most values are zero
Efficient for capturing exact matches or specific features
Commonly used in traditional information retrieval systems
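
Because most values are zero, a sparse vector is usually stored as a mapping from dimension index to weight rather than as a full array. The sketch below assumes that representation; the indices and weights are hypothetical vocabulary positions chosen for illustration.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {dimension_index: weight}."""
    # Iterate over the smaller vector so the cost depends on non-zero entries only.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


# Hypothetical vocabulary positions and term weights.
doc = {12: 1.4, 507: 0.8, 20341: 2.1}
query = {507: 1.0, 20341: 1.0}

# Only dimensions present in both vectors contribute to the score.
print(sparse_dot(doc, query))
```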

Hybrid Search

Hybrid search combines dense and sparse vector search to improve recall. Its main benefits are:

Semantic Understanding: Dense vectors provide a deep semantic understanding of the data, allowing us to capture similarities that may not be apparent from exact word matches.
Exact Matching: Sparse vectors excel at capturing specific features or exact matches, which can be crucial for certain types of queries.
Flexibility: The hybrid approach allows for more flexible querying, catering to both semantic similarity and feature-specific searches.
Improved Accuracy: By considering both semantic context and specific features, hybrid search often yields more accurate and relevant results than either method alone.
Handling Cold Start: In recommendation systems, sparse vectors can help address the cold start problem where new items lack sufficient interaction data for dense vector representations.

In the following sections, we’ll explore how to implement this hybrid approach using Milvus, combining the semantic power of dense vectors with the precise matching capabilities of sparse vectors. We’ll set up a Milvus collection that accommodates both vector types, implement a hybrid search strategy, and apply reranking to refine our results.

Reranking

Once documents have been retrieved by both the sparse and the dense vector search, their relevance scores need to be combined into a single ranking. Each search scores its retrieved documents against the user’s question; we assign a weight to each retriever and tune these weights so that the most relevant results are returned first.

Currently, Milvus offers two reranking strategies:

WeightedRanker: Calculates a weighted average of the scores (or vector distances) from the different vector searches, with weights assigned according to the significance of each vector field.
RRFRanker: Reciprocal Rank Fusion (RRF) is employed to consolidate and standardize the rankings from multiple, previously ranked results into a unified result set, which is then returned in the query response.
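
The two strategies can be sketched in plain Python to show the idea behind each. This is a simplified illustration, not Milvus's implementation: the weighted version assumes the scores are already on a comparable 0 to 1 scale and does not reproduce any normalization Milvus applies internally, and the RRF version assumes 1-based ranks with a smoothing constant `k` of 60. Function names and sample scores are hypothetical.

```python
def weighted_fusion(score_lists, weights):
    """Weighted sum of per-retriever scores (assumes comparable score scales)."""
    fused = {}
    for scores, weight in zip(score_lists, weights):
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def rrf_fusion(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores from a dense and a sparse retriever.
dense_scores = {"a": 0.9, "b": 0.6, "c": 0.2}
sparse_scores = {"b": 0.8, "c": 0.7, "d": 0.4}

print(weighted_fusion([dense_scores, sparse_scores], [0.5, 0.5]))
print(rrf_fusion([["a", "b", "c"], ["b", "c", "d"]]))
```

Weighted fusion depends on the score values themselves, whereas RRF looks only at rank positions, which makes it less sensitive to retrievers that score on different scales.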

Having explored the theoretical landscape of vector similarity search and retrieval, it’s time to translate concept into code. Let’s dive into a hands-on implementation that will bring these abstract ideas to life.

Setting up the Environment

Prerequisites:

Create a watsonx.data account.
Create a user API key.
Create a Milvus service in watsonx.data from the infrastructure manager.
Grab the gRPC endpoint from the provisioned Milvus service.
Install Python 3.12.2.
Install pymilvus, the Python client SDK for Milvus (version 2.4.0 or above).
Install sentence-transformers and a BM25 package.

Coding time!

Open a Jupyter notebook and import the libraries that were installed in the prerequisites.
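
A first notebook cell might look like the sketch below. It assumes the packages are importable as `pandas`, `pymilvus` and `sentence_transformers`; the BM25 library is left out because its import name depends on which package you installed. The helper `missing_modules` is illustrative and simply reports anything that still needs installing before the imports run.

```python
import importlib.util

# Import names assumed for the prerequisite packages.
REQUIRED_MODULES = ["pandas", "pymilvus", "sentence_transformers"]


def missing_modules(names):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    import pandas as pd
    from pymilvus import (
        AnnSearchRequest,
        Collection,
        CollectionSchema,
        DataType,
        FieldSchema,
        WeightedRanker,
        connections,
    )
    from sentence_transformers import SentenceTransformer
```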

Data Preparation:

Prepare a sample dataset. Here I have used a few short texts about the Python language, each paired with a more detailed context, stored in a pandas dataframe of 10 rows and 3 columns: id, text and context.

Creating dense vectors:

Since dense vectors capture semantic meaning well, we will use a sentence transformer model to generate dense vectors from the data in the ‘context’ column.
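
A minimal sketch of this step follows. The model name `all-MiniLM-L6-v2` is an assumption (the article does not name the model); it is chosen here because it produces 384-dimensional vectors, matching the example dimension mentioned earlier. The `model` parameter is a hypothetical convenience so an already loaded model can be reused.

```python
DENSE_MODEL_NAME = "all-MiniLM-L6-v2"  # assumption: any sentence-transformers model works
DENSE_DIM = 384                        # output size of the assumed model


def embed_contexts(contexts, model=None):
    """Encode each context string into a dense vector (a list of floats)."""
    if model is None:
        # Imported lazily so the helper can be defined before the package is installed.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(DENSE_MODEL_NAME)
    embeddings = model.encode(list(contexts))
    # Convert to plain Python floats so the vectors are easy to insert later.
    return [[float(x) for x in vector] for vector in embeddings]
```

With a dataframe named `df`, the call would be along the lines of `df["dense_vector"] = embed_contexts(df["context"])`.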

Creating sparse vectors:

Since sparse vectors are well suited to keyword-based searches, we use a BM25 model to generate sparse vectors from the data in the ‘text’ column.

Unlike sentence transformers, which use pre-trained neural networks to generate dense vector embeddings, BM25 is a probabilistic ranking function based on the bag-of-words model. It therefore has to be fitted on the entire corpus first, so that term frequencies, document lengths and inverse document frequencies can be calculated.
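
To show why the whole corpus is needed, here is a small self-contained BM25 sketch. It is not the library used in this article; it assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant. Each document becomes a sparse vector keyed by vocabulary index.

```python
import math
from collections import Counter


def bm25_sparse_vectors(docs, k1=1.5, b=0.75):
    """Return (vocab, vectors) where each vector maps term index -> BM25 weight."""
    if not docs:
        raise ValueError("BM25 needs at least one document")
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({term for tokens in tokenized for term in tokens})
    vocab = {term: idx for idx, term in enumerate(terms)}

    # Corpus-level statistics: these are why BM25 must see every document first.
    n_docs = len(tokenized)
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1.0
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0) for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        vectors.append({
            vocab[t]: idf[t] * f * (k1 + 1) / (f + length_norm)
            for t, f in term_freq.items()
        })
    return vocab, vectors


# Tiny illustrative corpus (not the article's dataset).
corpus = ["python is a language", "python is popular for data science", "java is a language"]
vocab, vectors = bm25_sparse_vectors(corpus)
print(len(vocab), [len(v) for v in vectors])
```

Rare terms receive a higher weight than terms that occur in most documents, which is what makes these vectors good at keyword matching.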

Connect to Milvus on IBM watsonx.data:

In watsonx.data, you can provision either a SaaS or an on-premises instance of Milvus, and the connection parameters must follow the format for that deployment type.

For more details, refer to the documentation.
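
As a hedged sketch for a SaaS instance: the parameters below assume a TLS connection authenticated with the user name `ibmlhapikey` and your API key as the password. Treat the user name, the alias and the helper names as assumptions to verify against the IBM documentation; host and port come from the gRPC endpoint of your provisioned service. On-premises instances use different parameters that are not shown here.

```python
def saas_connection_kwargs(host, port, api_key, alias="default"):
    """Build keyword arguments for pymilvus `connections.connect` (SaaS, assumed format)."""
    return {
        "alias": alias,
        "host": host,            # host part of the gRPC endpoint
        "port": str(port),       # port part of the gRPC endpoint
        "secure": True,          # TLS is assumed to be required
        "user": "ibmlhapikey",   # assumption: fixed user name for API-key authentication
        "password": api_key,
    }


def connect_to_milvus(kwargs):
    """Open the connection; pymilvus is imported lazily."""
    from pymilvus import connections
    connections.connect(**kwargs)


# Placeholder values only; substitute your own endpoint and key.
example = saas_connection_kwargs("<host>", 12345, "<api-key>")
print(sorted(example))
```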

Create a collection:

A collection is analogous to a table in MySQL: it has a schema consisting of field names, data types and metadata about the fields. Milvus searches only one collection at a time.
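
A schema for this use case could be sketched as follows. The field names, the maximum string lengths and the dense dimension of 384 are assumptions for illustration; the helper expects an open connection under the default alias and imports pymilvus lazily.

```python
# Assumed field layout: the three dataframe columns plus one field per vector type.
FIELD_SPECS = [
    {"name": "id", "dtype": "INT64", "is_primary": True, "auto_id": False},
    {"name": "text", "dtype": "VARCHAR", "max_length": 1000},
    {"name": "context", "dtype": "VARCHAR", "max_length": 5000},
    {"name": "sparse_vector", "dtype": "SPARSE_FLOAT_VECTOR"},
    {"name": "dense_vector", "dtype": "FLOAT_VECTOR", "dim": 384},
]


def create_hybrid_collection(name, field_specs=FIELD_SPECS):
    """Create a collection holding both a sparse and a dense vector field."""
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    fields = []
    for spec in field_specs:
        options = {k: v for k, v in spec.items() if k not in ("name", "dtype")}
        fields.append(
            FieldSchema(name=spec["name"], dtype=getattr(DataType, spec["dtype"]), **options)
        )
    schema = CollectionSchema(fields, description="Hybrid search demo")
    return Collection(name=name, schema=schema)
```

Usage would be `collection = create_hybrid_collection("hybrid_demo")`, where the collection name is a placeholder.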

Create Index:

Milvus automatically generates an index and loads it into memory when creating a collection if either of the following is specified in the collection creation request:

The dimensionality of the vector field and the metric type.
The schema and the index parameters.

Note: The sparse vectors have been indexed using ‘SPARSE_INVERTED_INDEX’ and the dense vectors have been indexed using ‘IVF_SQ8’. For more details on all available index types, refer to the documentation.

You can index a collection before inserting data into it. In fact, creating an index before inserting data is one of the recommended workflows in Milvus.
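
The two index definitions could look like the sketch below. The index types come from the note above; the metric types and the tuning values (`drop_ratio_build`, `nlist`) are assumptions picked for illustration, as are the field names.

```python
# Index parameters (metric types and tuning values are illustrative assumptions).
SPARSE_INDEX = {
    "index_type": "SPARSE_INVERTED_INDEX",
    "metric_type": "IP",                  # inner product for sparse vectors
    "params": {"drop_ratio_build": 0.2},  # fraction of smallest values ignored at build time
}
DENSE_INDEX = {
    "index_type": "IVF_SQ8",
    "metric_type": "COSINE",
    "params": {"nlist": 128},             # number of clusters
}


def create_indexes(collection, sparse_field="sparse_vector", dense_field="dense_vector"):
    """Create one index per vector field on an existing pymilvus Collection."""
    collection.create_index(sparse_field, SPARSE_INDEX)
    collection.create_index(dense_field, DENSE_INDEX)
```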

Inserting into Milvus:

Milvus expects the sparse vector embeddings in a particular format: a list of dictionaries, one per document, in which each key is the index of a non-zero dimension and each value is the corresponding term weight.

A custom function converts the BM25-generated sparse embeddings into the format Milvus expects.
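
One way such a conversion could be written is sketched here. It assumes the BM25 output arrives as one row of weights per document (a list of lists or a 2-D array); the function name is hypothetical and the original helper may differ.

```python
def to_milvus_sparse(weight_rows):
    """Convert rows of term weights into a list of {dimension_index: weight} dictionaries."""
    converted = []
    for row in weight_rows:
        # Keep only the non-zero entries; the position in the row becomes the key.
        converted.append({idx: float(w) for idx, w in enumerate(row) if w != 0})
    return converted


# Two documents over a five-term vocabulary (illustrative weights).
print(to_milvus_sparse([[0, 1.5, 0, 0, 0.3], [2.0, 0, 0, 0, 0]]))
```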

The final dataframe to be inserted into Milvus holds the sparse and dense embeddings alongside the original columns.

Insert the data into the Milvus collection and load the collection.
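
A sketch of this step, assuming row-based insertion with field names `id`, `text`, `context`, `sparse_vector` and `dense_vector` (these names are illustrative and must match your schema). The `collection` argument is an existing pymilvus `Collection`.

```python
def build_rows(ids, texts, contexts, sparse_vectors, dense_vectors):
    """Zip the columns into one dictionary per entity."""
    columns = (ids, texts, contexts, sparse_vectors, dense_vectors)
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all columns must have the same number of rows")
    return [
        {"id": i, "text": t, "context": c, "sparse_vector": s, "dense_vector": d}
        for i, t, c, s, d in zip(*columns)
    ]


def insert_and_load(collection, rows):
    """Insert the rows, flush them to storage and load the collection for searching."""
    result = collection.insert(rows)
    collection.flush()
    collection.load()
    return result
```

Loading brings the collection into memory, which is required before it can be searched.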

Implementing Hybrid Search:

Hybrid search in Milvus allows you to perform searches on multiple vector fields within a single collection.

We define multiple AnnSearchRequest instances, one for the dense and another for the sparse vector search.

The query text also needs to be vectorised with the same embedding model before performing a search.
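
The two requests could be assembled as in the sketch below. The field names, metric types and search parameters (`drop_ratio_search`, `nprobe`) are assumptions and should mirror whatever was used when the indexes were built. The query vectors are expected to come from the same BM25 and sentence transformer models used for the documents.

```python
def search_request_specs(sparse_query, dense_query, limit=5):
    """Describe one ANN request per vector field; sparse first, dense second."""
    return [
        {
            "data": [sparse_query],
            "anns_field": "sparse_vector",
            "param": {"metric_type": "IP", "params": {"drop_ratio_search": 0.2}},
            "limit": limit,
        },
        {
            "data": [dense_query],
            "anns_field": "dense_vector",
            "param": {"metric_type": "COSINE", "params": {"nprobe": 10}},
            "limit": limit,
        },
    ]


def to_ann_requests(specs):
    """Turn the specs into pymilvus AnnSearchRequest objects (imported lazily)."""
    from pymilvus import AnnSearchRequest
    return [AnnSearchRequest(**spec) for spec in specs]
```

The order of the requests matters later, because the reranker weights are matched to the requests by position.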

Configure a Reranking Strategy:

As discussed earlier, reranking in hybrid search combines results from multiple vector fields, ensuring that the final output is relevant and accurately prioritised.

Figure 12. Reranking Strategy

In our implementation, we’ve employed a WeightedRanker for the hybrid search results. This approach allows us to assign relative importance to the outputs from both our dense and sparse embedding models. Initially, we’ve allocated equal weights to maintain a balanced influence from each model. However, this weighting scheme is highly configurable and can be fine-tuned based on empirical performance and the specific characteristics of your queries.

The flexibility of the WeightedRanker enables us to optimize the search results by adjusting the model weights.

For instance, if we observe that certain types of queries benefit more from the semantic understanding of dense embeddings, we can increase their weight. Conversely, for queries where exact keyword matching is crucial, we might lean more heavily on the sparse embeddings.

This adaptive weighting strategy allows us to:

Calibrate the search algorithm to the nuances of our data and query patterns
Leverage the strengths of each embedding model dynamically
Implement a feedback loop for continuous improvement of search relevance
Potentially implement query-specific weighting for even more granular control

By fine-tuning these weights, we can progressively enhance the accuracy and relevance of our hybrid search system, ensuring it aligns closely with the specific requirements of our application and user expectations.

Performing the hybrid search in Milvus:

My query text is: “Why python is chosen for data science and AI?”

It contains both keywords (‘python’, ‘data science’) and a contextual ‘why’ question. With equal weights for both models, we get the desired results.

Figure 13. Hybrid search on both keyword and contextual query

If, on the other hand, I know that my query is closer to a keyword search, I can increase the weight of the sparse vectors and rerank the results accordingly, for example for the query text: “What is python?”

Figure 14. Hybrid search on keyword based query
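
Both searches above can be expressed with one helper, sketched here under a few assumptions: the requests are ordered sparse first and dense second, the two weights are chosen to sum to 1 for readability, and the output field names match the schema. The helper names and the 0.8 weight are illustrative, not values taken from the figures.

```python
def split_weights(sparse_weight):
    """Return (sparse_weight, dense_weight) with the dense weight as the remainder."""
    if not 0.0 <= sparse_weight <= 1.0:
        raise ValueError("sparse_weight must be between 0 and 1")
    return sparse_weight, 1.0 - sparse_weight


def run_hybrid_search(collection, requests, sparse_weight=0.5, limit=5):
    """Run both ANN requests and fuse them with a WeightedRanker."""
    from pymilvus import WeightedRanker

    # Weights are matched to the requests by position: sparse first, dense second.
    ranker = WeightedRanker(*split_weights(sparse_weight))
    return collection.hybrid_search(
        requests, ranker, limit=limit, output_fields=["text", "context"]
    )


print(split_weights(0.5))  # balanced, for mixed keyword and contextual queries
print(split_weights(0.8))  # leans on sparse vectors, for keyword-style queries
```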

Conclusion:

In summary, Milvus hybrid search represents a significant advancement in data retrieval technology. By combining the strengths of sparse and dense vectors, it offers more accurate and efficient search results, addressing the complexities of real-world data. As data continues to grow in volume and complexity, this hybrid approach, integrated into a scalable and high-performance platform like Milvus, empowers organizations to unlock deeper insights and drive innovation.