Mastering Grouping Search with Milvus in watsonx.data

Published in Milvus Meets Watsonx · 7 min read · Aug 2, 2024

Learn how to enhance search relevance and reduce redundancy using the powerful grouping search feature in Milvus, on IBM watsonx.data. This article was co-authored by Swati Karot.

Introduction

Imagine you’re searching for the perfect recipe online. You type “chocolate cake”, but instead of getting a diverse array of delicious options, you’re bombarded with 20 nearly identical versions of the same basic recipe. Frustrating, right? This is where the magic of grouping search comes in. In our data-rich world, finding truly diverse and relevant results can feel like searching for a needle in a digital haystack. But fear not! Milvus, a powerful vector database now part of IBM’s watsonx.data, has a secret weapon called grouping search. Let’s dive into how grouping search works its magic, why it’s a game-changer, and how it can make our digital explorations more rewarding.

Understanding Grouping Search

Grouping search in Milvus is analogous to the GROUP BY clause in a traditional database. It lets you organize search results by a specific field, cutting down on duplicate results and providing more variety. This is especially helpful when each data item, like a document, is divided into smaller parts, such as passages, each represented by a vector embedding. By grouping results based on a common field, like a document ID, you can find the most relevant and unique items.
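To make the idea concrete, here is a minimal pure-Python sketch of what grouping does to a ranked list of hits. It is only a conceptual illustration, not how Milvus implements the feature internally; the function name, the `doc_id` field, and the scores are all made up for this example.

```python
def group_top_hits(hits, group_key, limit):
    """Keep the best-scoring hit per group, then return the top `limit` groups."""
    best = {}
    for hit in hits:
        key = hit[group_key]
        # Remember only the highest-scoring hit seen so far for this group.
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:limit]


# Hypothetical passage-level hits: document "A" matches three times.
hits = [
    {"doc_id": "A", "passage": 1, "score": 0.91},
    {"doc_id": "A", "passage": 2, "score": 0.89},
    {"doc_id": "B", "passage": 1, "score": 0.75},
    {"doc_id": "A", "passage": 3, "score": 0.74},
    {"doc_id": "C", "passage": 1, "score": 0.60},
]

grouped = group_top_hits(hits, "doc_id", limit=3)
print([h["doc_id"] for h in grouped])
```

Each document now shows up once, represented by its best-matching passage, which is the behavior the rest of this article relies on.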

Practical Example: Grouping Search in Action

Consider a collection of customer reviews for various products. Each review is represented by a vector embedding and is linked to a specific product. When searching for unique relevant products with a particular type of review, you want to avoid getting multiple reviews from the same product. Instead, you want a diverse set of products to choose from. We’ll see a practical demo of this example in the section below.

Why Use Grouping Search?

No More Déjà Vu: Say goodbye to seeing the same information over and over. Grouping search is like having a personal assistant who filters out the duplicates for you.
Quality Over Quantity: It’s not about how many results you get, but how useful they are. Grouping search helps you find the gems in a sea of data.
Super Time-Saver: It’s like having a shortcut through a maze of information, helping you find what you need faster.

Now, let’s roll up our sleeves and see how this works in real life. We’re about to dive into a hands-on demo that’ll bring these concepts to life. Get ready to see grouping search work its magic!

Setting up the Environment

Prerequisites:

Create a watsonx.data account.
Create a user API key.
Create a Milvus service in watsonx.data from the infrastructure manager.
Grab the gRPC endpoint from the provisioned Milvus service.
Install Python 3.12.2.
Install pymilvus, the Python client SDK for Milvus (version 2.4.0 or later).
Install transformers and sentence-transformers.

Let’s turn theory into practice!

Open a Jupyter notebook and import the necessary libraries installed in the prerequisites as shown below:

Figure 1. Import Libraries

Data Preparation:

The dataset includes 5 products, each with 10 reviews.

BestProductA: All reviews are positive.
GoodProductB: Mostly positive reviews, with a few negative or neutral ones.
AverageProductC: Mostly mixed or neutral reviews.
BadProductD: Predominantly negative reviews, with a few positive or neutral ones.
WorstProductE: Mostly negative reviews, with a few neutral ones.

The product names have been chosen to give a hint about their overall quality, but this does not affect the similarity calculations.

We also added a primary key field ‘id’ to our dataset to uniquely identify each review. Each review also carries the ID of the product it belongs to, which is the field we group search results by, so that we retrieve the most relevant and unique products without redundancy.

The code below generates vector embeddings for each review using a sentence transformer, adds these embeddings as a new column, and then reorders the DataFrame to include the primary key ‘id’, product details, review text, and the embeddings. This transformed DataFrame is then ready for further analysis or operations such as grouping search.
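The preparation step described above can be sketched roughly as follows. This is a simplified stand-in rather than the original notebook code: it uses plain dictionaries instead of a DataFrame, the sample reviews are invented, and the encoder is injected so the sketch runs without downloading a model. In practice you would likely pass the `encode` method of a `SentenceTransformer` instance; which model the original demo used is not stated, so treat any model choice as an assumption.

```python
def add_embeddings(rows, encode):
    """Attach an 'id' primary key and an 'embeddings' vector to every review row."""
    texts = [row["reviews"] for row in rows]
    vectors = encode(texts)  # one vector per review text
    prepared = []
    for i, (row, vec) in enumerate(zip(rows, vectors), start=1):
        prepared.append({
            "id": i,  # primary key, starting at 1
            "product_id": row["product_id"],
            "product_name": row["product_name"],
            "reviews": row["reviews"],
            "embeddings": [float(x) for x in vec],  # plain Python floats
        })
    return prepared


def toy_encode(texts):
    # Stand-in encoder for illustration only; NOT a real embedding model.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# Hypothetical sample rows (the real dataset has 5 products x 10 reviews).
rows = [
    {"product_id": 1, "product_name": "BestProductA", "reviews": "Absolutely love it"},
    {"product_id": 5, "product_name": "WorstProductE", "reviews": "Broke after a day"},
]
prepared = add_embeddings(rows, toy_encode)
print(prepared[0]["id"], prepared[0]["embeddings"])
```

With sentence-transformers installed, swapping `toy_encode` for a real model's `encode` method should be the only change needed.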

Connecting to Milvus on IBM watsonx.data:

To connect to Milvus on IBM watsonx.data, refer to this blog.
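As a rough sketch of what that connection tends to look like, the helper below assembles the arguments you would pass to `MilvusClient`. Everything specific here is an assumption to verify against the referenced blog and the IBM documentation for your deployment: the `ibmlhapikey` user name, the use of an `https://` URI for TLS, and the placeholder host and port.

```python
def build_connection_args(host, port, api_key, user="ibmlhapikey"):
    """Assemble keyword arguments for pymilvus.MilvusClient.

    Assumption: API-key authentication uses the literal user name
    'ibmlhapikey' with the API key as the password. Check this against
    the documentation for your watsonx.data deployment.
    """
    return {
        "uri": f"https://{host}:{port}",  # https assumed for a TLS endpoint
        "user": user,
        "password": api_key,
    }


# Placeholder values; use the GRPC host and port of your Milvus service.
args = build_connection_args("milvus.example.com", 19530, "<your-api-key>")
print(args["uri"])

# With pymilvus installed and real values filled in:
# from pymilvus import MilvusClient
# client = MilvusClient(**args)
```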

Create Collection in Milvus

The next step is to create the schema for our Milvus collection. Five fields are added to the schema: id (primary key), product_id, product_name, reviews, and embeddings (vector field).

After defining the schema, we’ll set up two crucial indexes:

A scalar index on the “id” field for efficient sorting and retrieval.
A vector index on the “embeddings” field using “IVF_FLAT” index type and “COSINE” similarity metric, to enable fast similarity searches across product embeddings.

This schema and index configuration will allow us to efficiently store, retrieve, and search product data in our Milvus database.

After that, we’ll create a Milvus collection, which you can think of as a table in a traditional relational database. Our collection, named “Product_Reviews_Collection”, will use the schema and index parameters we defined earlier.
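The schema, index, and collection steps above might look something like the sketch below, written against the `MilvusClient` interface in pymilvus 2.4. Several details are assumptions because the original code is not shown: the `INT64` type for `product_id`, the `max_length` values, `auto_id=False`, the `STL_SORT` scalar index type, and the `nlist` value. The vector dimension is a parameter because it depends on the embedding model. The client and the `DataType` enum are passed in so the function can be exercised without a running server.

```python
COLLECTION_NAME = "Product_Reviews_Collection"


def create_reviews_collection(client, data_type, dim, name=COLLECTION_NAME):
    """Create the five-field schema, two indexes, and the collection."""
    # auto_id=False is an assumption: the dataset supplies its own 'id' values.
    schema = client.create_schema(auto_id=False, enable_dynamic_field=False)
    schema.add_field(field_name="id", datatype=data_type.INT64, is_primary=True)
    schema.add_field(field_name="product_id", datatype=data_type.INT64)  # assumed type
    schema.add_field(field_name="product_name", datatype=data_type.VARCHAR,
                     max_length=256)   # assumed length
    schema.add_field(field_name="reviews", datatype=data_type.VARCHAR,
                     max_length=2048)  # assumed length
    schema.add_field(field_name="embeddings", datatype=data_type.FLOAT_VECTOR,
                     dim=dim)

    index_params = client.prepare_index_params()
    # Scalar index on the primary key; STL_SORT is an assumed index type.
    index_params.add_index(field_name="id", index_type="STL_SORT")
    # Vector index as described in the article; nlist is an assumed value.
    index_params.add_index(field_name="embeddings", index_type="IVF_FLAT",
                           metric_type="COSINE", params={"nlist": 128})

    client.create_collection(collection_name=name, schema=schema,
                             index_params=index_params)
    return schema, index_params


# Usage with pymilvus installed and a connected client:
# from pymilvus import DataType
# create_reviews_collection(client, DataType, dim=384)  # dim depends on your model
```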

Then, we’ll prepare our data for insertion by:

Organizing our data into a dictionary that matches our schema structure.
Converting this data into a list of dictionaries, where each dictionary represents a single review entry.

This setup allows Milvus to efficiently store and search our product data within the collection.

Now, we’ll insert our prepared data into the Milvus collection. This is done using the client’s insert method, specifying the collection name and passing our data_list. The result shows that 50 records were successfully inserted, with Milvus automatically assigning unique IDs from 1 to 50 for each entry.
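A minimal sketch of the conversion and insert steps is shown below. The helper names are made up for this example, and the assumption that `insert` returns a dictionary with an `insert_count` key reflects the `MilvusClient` behavior in pymilvus 2.4; check it against the version you have installed.

```python
def columns_to_rows(columns):
    """Turn {'field': [v1, v2, ...]} into [{'field': v1}, {'field': v2}, ...]."""
    names = list(columns)
    lengths = {len(columns[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    return [dict(zip(names, values))
            for values in zip(*(columns[n] for n in names))]


def insert_rows(client, collection_name, rows):
    """Insert row dictionaries and return how many records Milvus accepted."""
    result = client.insert(collection_name=collection_name, data=rows)
    return result["insert_count"]  # assumed key in the returned dictionary


# Hypothetical two-row example in the column layout that matches the schema.
columns = {
    "id": [1, 2],
    "product_id": [1, 1],
    "product_name": ["BestProductA", "BestProductA"],
    "reviews": ["Absolutely love it", "Works perfectly"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}
data_list = columns_to_rows(columns)
print(data_list[0]["product_name"], len(data_list))

# With a connected client:
# inserted = insert_rows(client, "Product_Reviews_Collection", data_list)
```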

Query Text

We tested Milvus with various query texts to evaluate its grouping search capabilities. The queries included a mix of positive and negative sentiments, such as “Superb,” “worst experience in my life,” “lovely,” “terrible,” “yuck,” “falling in love with it,” and “was ok”. The results highlighted how Milvus effectively groups and retrieves relevant data based on these diverse inputs. Please note that results can vary depending on the quality of embeddings used, the length of the text, and the diversity of the data. For instance, consider the word “Superb” as our query text.

Without Grouping Search

In this example, we performed a search in Milvus using cosine similarity as the metric. Higher distance values indicate greater similarity between data points. We queried with the text “superb” without using the group_by_field parameter and observed significant repetition in the results. For instance, 'BestProductA' appeared three times out of the top ten results. This redundancy demonstrates the limitation of not grouping search results, as it can lead to less diverse and informative outputs.

Figure 9. Performing Search with group_by_field disabled
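For readers less familiar with the metric, here is a small worked sketch of cosine similarity in plain Python, using toy two-dimensional vectors rather than real embeddings. It illustrates why a higher value means a closer match: vectors pointing the same way score 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```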

With Grouping Search

When we enabled grouping search in Milvus, we set the group_by_field parameter to group results by product_id. This adjustment significantly improved the diversity of our search results. For example, 'BestProductA' appeared only once at the top of the results list, making it the most relevant, while eliminating redundancy. The results now feature a range of products with reviews that closely match the query text “superb.” By grouping search results, Milvus provides a more comprehensive and meaningful set of data points tailored to the query. This approach ensures that each product appears only once in the results, preventing duplicate entries for the same product.

Figure 10. Performing Search with group_by_field enabled
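The two searches compared above differ only in whether `group_by_field` is passed. The sketch below wraps both cases in one hypothetical helper built on `MilvusClient.search` from pymilvus 2.4. The `nprobe` value and the choice of output fields are assumptions, and the query vector is expected to come from the same embedding model used for the reviews.

```python
def search_reviews(client, collection_name, query_vector, limit=10,
                   group_by_field=None):
    """Run a similarity search, optionally grouping hits by a scalar field."""
    kwargs = {
        "collection_name": collection_name,
        "data": [query_vector],  # a single query vector
        "limit": limit,
        "output_fields": ["product_id", "product_name", "reviews"],
        # nprobe is an assumed tuning value for the IVF_FLAT index.
        "search_params": {"metric_type": "COSINE", "params": {"nprobe": 16}},
    }
    if group_by_field is not None:
        # With grouping enabled, at most one hit per distinct field value.
        kwargs["group_by_field"] = group_by_field
    return client.search(**kwargs)[0]  # hits for the single query


# With a connected client and an embedded query:
# plain = search_reviews(client, "Product_Reviews_Collection", query_vec)
# grouped = search_reviews(client, "Product_Reviews_Collection", query_vec,
#                          group_by_field="product_id")
```

Keeping grouping as an optional argument makes it easy to run the same query both ways and compare the result lists side by side.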

In a real-time RAG use case, a customer might prompt “show me the products that have the best reviews,” or a seller or website owner might prompt “show me products where customers complain most about fitting issues.” You might think you can simply query a scalar field such as avg_rating. You could, but an average rating generalizes away crucial information: it does not tell us why a rating is high or low. Running grouping search directly on the reviews gives us those specific insights.

Let’s consider another example with the query text ‘yuck.’ Below, we compare the results obtained using grouping search versus those from a non-grouped search.

Figure 12. Performing Search with group_by_field disabled

We can observe that ‘WorstProductE’ appeared four times out of the top ten results when we disabled group_by_field.

Figure 13. Performing Search with group_by_field enabled

On the other hand, looking at the grouped results, it’s no surprise to see ‘WorstProductE’ sitting at the top of the list, without any repetition. This is exactly what we’d expect when we turn on the group_by_field feature.
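If you want to check for repetition programmatically rather than by eye, a small helper like the one below can count how often each product appears in a result list. It assumes each hit is a dictionary with the requested output fields under an `entity` key, which matches what `MilvusClient.search` returns in pymilvus 2.4; the sample hits are hypothetical placeholders, not the actual results from the demo.

```python
from collections import Counter


def product_counts(hits):
    """Count how many hits each product contributes to a result list."""
    return Counter(hit["entity"]["product_name"] for hit in hits)


def has_repeats(hits):
    """True if any product appears more than once."""
    return any(n > 1 for n in product_counts(hits).values())


# Hypothetical hit lists shaped like search output (scores omitted).
ungrouped = [{"entity": {"product_name": name}} for name in
             ["WorstProductE", "WorstProductE", "BadProductD", "WorstProductE"]]
grouped = [{"entity": {"product_name": name}} for name in
           ["WorstProductE", "BadProductD", "AverageProductC"]]

print(product_counts(ungrouped).most_common(1))
print(has_repeats(ungrouped), has_repeats(grouped))
```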

Conclusion

The grouping search feature significantly enhances the quality and relevance of search results. By clustering data based on shared attributes, it reduces redundancy and delivers richer, more informative outputs. This is particularly valuable as data grows in volume and complexity. Integrated with IBM’s open lakehouse platform, watsonx.data, Milvus offers a secure, enterprise-grade solution. With its robust and scalable platform, businesses can easily navigate their data, uncover valuable insights, and stay ahead in the ever-evolving data landscape. Embracing Milvus with watsonx.data empowers businesses to turn data into their greatest asset.