Efficient Vector Similarity Search in Recommender Workflows using Milvus with NVIDIA Merlin


by Burcin Bozkaya, Filip Haltmayer, William Hicks, Li Liu

Introduction

Modern recommender systems consist of training/inference pipelines involving multiple stages of data ingestion, data preprocessing, model training, and hyperparameter tuning for retrieving, filtering, ranking, and scoring relevant items. The result is highly personalized recommendations that improve user experience and engagement. In a recent blog post, the NVIDIA Merlin team describes these workflows as multistage recommender systems implemented across industries and use cases such as e-commerce, streaming services, and social media.

An important component of a recommender system pipeline is the retrieval or discovery of the items most relevant to a user, particularly in the presence of large item catalogs. This step typically involves an approximate nearest neighbor (ANN) search over an indexed database of low-dimensional vector representations (i.e., embeddings) of product and user attributes, created by deep learning models that train on interactions between users and products/services. Vector representations can also be derived from other modalities of data, such as images, videos, or textual descriptions of products and/or users, using computer vision algorithms or language models. The critical step is to conduct an efficient top-k (i.e., k most similar) search over a large set of vector embeddings, on the order of hundreds of thousands (e.g., as in an e-commerce product inventory), if not millions or billions.

NVIDIA Merlin, an open-source framework developed for training end-to-end models that make recommendations at any scale, integrates with an efficient vector database index and search framework. One such framework that has gained much attention recently is Milvus, an open-source framework by Zilliz, offering fast index and query capabilities. Most recently, Milvus added GPU acceleration support, using NVIDIA GPUs to sustain AI workflows. This is great news because an accelerated vector search library makes fast concurrent queries possible, which has a huge positive impact on latency in today’s recommender systems, where large numbers of concurrent requests are common. Milvus has had 3.4M Docker pulls and ~21k GitHub stars (as of July 2023) and has been used in many application use cases.

In this blog, we demonstrate how Milvus works with the Merlin recsys framework at both training and inference time. We show how Milvus complements Merlin in the item retrieval stage with a highly efficient top-k vector embedding search and how it can be used with NVIDIA Triton Inference Server (TIS) at inference time (see Figure 1). Our benchmark results show an impressive 37x to 91x speedup with GPU-accelerated Milvus, which uses NVIDIA RAFT, on the vector embeddings generated by Merlin Models. The code we use to show the Merlin-Milvus integration and the detailed benchmark results are available here, along with the library that facilitated our benchmark study.

Figure 1. Multistage recommender system with Milvus framework contributing to the retrieval stage. (Source for the original multistage figure: this blog post.)

The Challenge

Given the multistage nature of recommenders and the availability of the various components and libraries integrated within them, a major challenge is integrating all components seamlessly in an end-to-end pipeline. We aim to show in our example notebooks that this integration can be done with relatively little effort.

Another aspect of recommender workflows is the need to accelerate certain parts of the pipeline. While known to play a huge role in training large neural networks, GPUs are only recent additions to the world of vector databases and ANN search. As e-commerce product inventories and streaming media catalogs grow larger, along with the number of users of these services, CPUs alone fail to provide the required performance to serve millions of users in performant recsys workflows. To address this challenge, GPU acceleration in other parts of the pipeline has become necessary. The solution in this blog addresses this challenge by showing how efficient ANN search becomes when GPUs are used.

Tech Stack

Let’s start by reviewing some of the fundamentals needed to conduct our work. We need a recsys framework to base our work on, which is NVIDIA Merlin. Merlin is an open-source library with high-level APIs for accelerating recommenders on NVIDIA GPUs. It enables data scientists, ML engineers, and researchers to build high-performing recommenders at scale. We also use the following open-source tools/libraries in this work:

  • NVTabular: for pre-processing the input tabular data and feature engineering.
  • Merlin Models: for training deep learning models, and to learn, in this case, user and item embedding vectors from user interaction data.
  • Merlin Systems: for combining a TensorFlow-based recommendation model with other elements (e.g., feature store, ANN search with Milvus) to be served with TIS.
  • Triton Inference Server: for the inference stage where a user feature vector is passed and product recommendations are generated.
  • Containerization: all of the above is available via container(s) provided by NVIDIA in the NGC catalog. We used the Merlin TensorFlow 23.06 container available here.
  • Milvus 2.3: for conducting GPU-accelerated vector indexing and querying.
  • Milvus 2.2.11: same as above, but for doing it on CPU.
  • pymilvus SDK: for connecting to the Milvus server, creating vector database indexes, and running queries via a Python interface.
  • Feast: for saving and retrieving user and item attributes in an (open source) feature store as part of our end-to-end RecSys pipeline.

We also note that several underlying libraries and frameworks are used under the hood. For example, Merlin relies on other NVIDIA libraries such as cuDF and Dask, both available under RAPIDS cuDF. Likewise, Milvus relies on NVIDIA RAFT for GPU acceleration primitives and on modified versions of libraries such as HNSW and FAISS for search.

Understanding Vector Databases

Approximate nearest neighbor search is a functionality that relational databases are not built to handle. Relational DBs are designed for tabular data with predefined structures and directly comparable values. Relational database indexes rely on this to compare data and build structures that take advantage of knowing whether each value is less than or greater than another. Embedding vectors cannot be directly compared in this fashion, as we do not know what each value in a vector represents, and we cannot say that one vector is necessarily "less than" another. The only thing we can do is compute the distance between two vectors. If the distance between two vectors is small, we can assume the features they represent are similar, and if it is large, we can assume the data they represent are more dissimilar. This is useful because, with this knowledge, we can create data structures that provide efficient means of searching through the data. These efficient indexes come at a cost: computing the distance between two vectors is computationally expensive, and vector indexes are not easily adaptable and sometimes not modifiable. Because of these two limitations, integrating such indexes into relational databases is not simple, which is why purpose-built vector databases are needed.
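
To make this concrete, here is a minimal NumPy sketch (with made-up 4-dimensional vectors; real embeddings typically have tens to hundreds of dimensions) of the only meaningful operations on embeddings: computing a distance or a similarity score between two vectors.

```python
import numpy as np

# Made-up 4-dimensional embeddings; item_b is deliberately close to item_a.
item_a = np.array([0.12, -0.45, 0.91, 0.03])
item_b = np.array([0.10, -0.40, 0.88, 0.05])
item_c = np.array([-0.80, 0.52, -0.10, 0.33])

def l2_distance(u, v):
    # Euclidean (L2) distance: smaller means more similar.
    return np.linalg.norm(u - v)

def cosine_similarity(u, v):
    # Cosine similarity: closer to 1 means more similar.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(l2_distance(item_a, item_b), l2_distance(item_a, item_c))          # small vs. large
print(cosine_similarity(item_a, item_b), cosine_similarity(item_a, item_c))
```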

Milvus was created to solve the problems relational databases hit with vectors and was designed from the ground up to handle embedding vectors and their indexes at a large scale. To live up to its cloud-native label, Milvus separates storage from compute and separates the different compute tasks: querying, data wrangling, and indexing. Users can scale each part of the database to match their use case, whether it is insert-heavy or search-heavy. If there is a large influx of insertion requests, the user can temporarily scale the index nodes horizontally and vertically to handle the ingestion. Likewise, if no data is being ingested but there is a large volume of searches, the user can reduce the index nodes and instead scale up the query nodes for more throughput. This system design (see Figure 2) required us to think with a parallel-compute mindset, ultimately resulting in a compute-optimized system with many doors open for further optimizations.

Figure 2. Milvus system design

Milvus also uses many state-of-the-art indexing libraries to give users as much customization for their system as possible. It improves them by adding the ability to handle CRUD operations, streamed data, and filtering. Later on, we will discuss how these indexes differ and what the pros and cons of each are.

Indexing Vector Data

Most vector indexes fall into one of two computation categories: clustering and graphs. IVF, an algorithm in the clustering bucket, uses k-means to partition the vectors into clusters. At search time, the query vector is compared against the cluster centroids, and only the closest clusters are searched. HNSW, DiskANN, and other algorithms in the graph category mainly revolve around navigating spreading-out graphs, which are graphs that are efficient at nearest-neighbor search. The graph algorithms in this space are a bit more complex, so if you are interested, check out this post.
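
To illustrate the clustering idea behind IVF, here is a toy NumPy/scikit-learn sketch (not the Milvus implementation): vectors are partitioned with k-means at build time, and at query time only the nprobe closest partitions are scanned.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)  # toy item embeddings

# Index build: partition vectors into nlist clusters ("inverted lists") with k-means.
nlist = 100
kmeans = KMeans(n_clusters=nlist, n_init=10, random_state=0).fit(vectors)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(nlist)}

def ivf_search(query, nprobe=5, k=10):
    # Probe only the nprobe clusters whose centroids are closest to the query...
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    candidates = np.concatenate(
        [inverted_lists[c] for c in np.argsort(centroid_dists)[:nprobe]]
    )
    # ...then run an exact top-k distance search within those candidates only.
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    top = np.argsort(dists)[:k]
    return candidates[top], dists[top]

ids, dists = ivf_search(rng.normal(size=64).astype(np.float32))
print(ids)  # approximate top-10 neighbors; raising nprobe improves recall but costs time
```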

On top of these sits another set of knobs based on product quantization (PQ). PQ is a way to compress vector data to reduce resource usage and increase performance, at the cost of recall/accuracy. Most algorithms in this space come with a quantized variant that lets you tone down memory usage or increase the performance of the approach.
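
To give a feel for what quantization buys, here is a highly simplified product-quantization sketch (again NumPy/scikit-learn, not the production code path): each vector is split into sub-vectors, each sub-space gets a small k-means codebook, and the original floats are replaced by one-byte code IDs.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 64)).astype(np.float32)

m, nbits = 8, 8                      # 8 sub-vectors, 2**8 = 256 centroids per sub-space
sub_dim = vectors.shape[1] // m

# Train one small codebook per sub-space; encode each sub-vector as a 1-byte centroid ID.
codebooks = []
codes = np.empty((len(vectors), m), dtype=np.uint8)
for i in range(m):
    sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
    km = KMeans(n_clusters=2 ** nbits, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes[:, i] = km.labels_

# 64 float32 values (256 bytes) per vector shrink to 8 bytes of codes; distances are
# then computed against the approximate reconstruction, which is what costs recall.
reconstructed = np.hstack([codebooks[i][codes[:, i]] for i in range(m)])
print("compression ratio:", vectors.nbytes / codes.nbytes)  # 32x here
```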

What’s the difference between all these algorithms and combinations, and why are there so many? The answer is the tradeoff between performance, recall, and memory usage. An index like IVF_FLAT is a balanced index that gives good results at a good speed without too much memory overhead. Compression-based indexes such as IVF_SQ8 and IVF_PQ are stronger in terms of speed and reduced memory usage, at the cost of heavy recall reductions depending on the compression level used. HNSW is on the other side of the spectrum, as it targets the best performance and recall at the cost of memory. DiskANN is unique compared to the rest, as it is a disk-based index. The previous indexes are all fully in-memory and require large amounts of RAM, whereas DiskANN holds only a small portion of the index in memory and keeps the bulk of the data in disk storage. This allows DiskANN to heavily reduce memory usage while maintaining recall, at the cost of lower throughput and, depending on the type of SSD used, higher latency.

The field has changed; it is no longer the case that only large users/corporations have access to extremely large datasets. A small user might generate billions of vectors from their data and need to search them in the most cost-effective way, while a large user might have only a few hundred thousand data points but need to handle tens of thousands of queries per second (QPS). To address this, there are many customizations at the index level to support each use case; for more information, look here.

GPU vs CPU

For most users, GPU indexes are key to getting the performance they are looking for. They provide the high throughput many use cases require while saving money in the long run.

Building and searching indexes relies heavily on vectorized compute, which can be done on CPU but is greatly accelerated on GPU. Milvus achieved a 37x to 91x increase in QPS by moving the search computations to the GPU. This increase is significant, as the only other way to get such a boost in performance is to scale up the cluster considerably. By using GPUs, users can greatly increase performance while streamlining the cluster, reducing the need for extra nodes and scheduling overhead.

There is a caveat to GPU-based search, though: the case of low concurrency. GPU performance is much better than CPU when concurrency is high and when the slowdown from transferring data between CPU and GPU memory is smaller than the total savings in search time. In low-concurrency situations, latency will be higher with GPUs, because the CPU can complete the search faster than the time it takes to transfer the data to and from the GPU.

Example Solution

The example solution we present demonstrates the integration of Milvus with Merlin at the item retrieval stage (when k most relevant items are retrieved through an ANN search). We use a real-life dataset from a RecSys challenge, described below. We train a Two-Tower deep learning model that learns vector embeddings for users and items. In this section, we also provide the blueprint of our benchmarking work, including the metrics we collect and the range of parameters we use.

Dataset

The dataset we use in this integration and benchmark study was provided by YOOCHOOSE GmbH for the RecSys 2015 challenge and is available on Kaggle. It contains user click/buy events from a European online retailer with attributes such as session ID, timestamp, the item ID associated with the click/buy, and item category, available in the file yoochoose-clicks.dat. The sessions are independent, and there is no hint of returning users, so we treat each session as belonging to a distinct user. The dataset has 9,249,729 unique sessions (users) and 52,739 unique items.

Our approach involves a) data ingestion and preprocessing, b) Two-Tower deep learning model training, c) Milvus index building, and d) Milvus similarity search. We briefly describe each step and refer the reader to our notebooks for details.

Data Ingestion and Preprocessing

The tool we use for data preprocessing is NVTabular, a GPU-accelerated, highly scalable feature engineering and preprocessing component of Merlin. It is designed to easily manipulate terabyte-scale datasets and to help train deep learning based recommender systems. It provides a high-level abstraction that simplifies code and accelerates computation on the GPU using the RAPIDS Dask-cuDF library. We use NVTabular to read the data into GPU memory, rearrange features as necessary, export to parquet files, and create a train-validation split for training. This results in 7,305,761 unique users and 49,008 unique items to train on. We also categorify each column, mapping its values to integer IDs. The dataset is now ready for training with the Two-Tower model.
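
A minimal sketch of this preprocessing step with NVTabular is shown below; the input/output paths are placeholders, and the column list follows the YOOCHOOSE clicks file rather than the exact feature set used in our notebook.

```python
import nvtabular as nvt
from nvtabular import ops

# Read the (parquet-converted) YOOCHOOSE click data into a GPU-backed Dataset.
clicks = nvt.Dataset("yoochoose_clicks.parquet")  # placeholder path

# Categorify maps session/item/category values to contiguous integer IDs,
# which is what the embedding layers of the Two-Tower model expect.
cat_features = ["session_id", "item_id", "category"] >> ops.Categorify()

workflow = nvt.Workflow(cat_features)
workflow.fit(clicks)
workflow.transform(clicks).to_parquet("processed/train")  # placeholder output dir
```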

Model Training

We use the Two-Tower deep learning model to train and generate user and item embeddings, which are later used in vector indexing and querying. The model takes user attributes (user_id, user_age) and item attributes (item_id, item_category) as input. Optionally, we can also provide a target column so that only rows with positive interactions are used. After the model is trained, we can extract the learned user and item embeddings.
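
A condensed sketch of this training step with Merlin Models (TensorFlow) is shown below; the tower sizes, metrics, and training hyperparameters are illustrative rather than the exact settings used in our notebooks.

```python
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("processed/train/*.parquet")  # output of the NVTabular workflow
valid = Dataset("processed/valid/*.parquet")

# Two-Tower retrieval model: one MLP tower encodes user (query) features,
# the other encodes item (candidate) features; both map into the same space.
model = mm.TwoTowerModel(
    train.schema,
    query_tower=mm.MLPBlock([128, 64], no_activation_last_layer=True),
)

model.compile(optimizer="adam", run_eagerly=False,
              metrics=[mm.RecallAt(10), mm.NDCGAt(10)])
model.fit(train, validation_data=valid, batch_size=4096, epochs=3)

# After training, the learned user and item embeddings can be extracted and
# written to parquet; the exact export helpers differ across Merlin Models versions.
```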

The next two steps are optional: training a DLRM model to rank the retrieved items for recommendation, and using a feature store (in this case, Feast) to store and retrieve user and item features. We include them for the completeness of the multistage workflow.

Finally, we export the user and item embeddings to parquet files, which can later be reloaded to create a Milvus vector index. We can now launch the Milvus server and feed the item embeddings to create a vector index. This would be followed by making similarity search queries for existing and new users at inference time using NVIDIA TIS with a custom Merlin Systems operator. This is in our second notebook example.

Building and Querying Milvus Index

Milvus facilitates vector indexing and similarity search via a “server” launched on the inference machine. In our notebook #2, we set this up by pip-installing the milvus server and pymilvus packages, then starting the server on its default listening port. Next, we demonstrate building a simple index (IVF_FLAT) and querying against it with the two functions setup_milvus and query_milvus, respectively.
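
For readers who want a feel for what setup_milvus and query_milvus do, here is a condensed, self-contained sketch using the pymilvus client API; the collection name, field names, vector dimension, and index/query parameters are illustrative placeholders rather than the exact values used in the notebook, and it assumes a Milvus server is already listening on the default port.

```python
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

# Assumes a Milvus server is already running on the default port.
connections.connect("default", host="localhost", port="19530")

# Stand-in data: 49K item IDs with 64-dim embeddings (dimension is a placeholder).
item_ids = np.arange(49_000, dtype=np.int64).tolist()
item_vectors = np.random.rand(49_000, 64).astype(np.float32).tolist()

fields = [
    FieldSchema(name="item_id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=64),
]
collection = Collection("item_embeddings", CollectionSchema(fields))
collection.insert([item_ids, item_vectors])

collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 1024}},
)
collection.load()

# Top-100 most similar items for a small batch of (stand-in) user embeddings.
user_vectors = np.random.rand(5, 64).astype(np.float32).tolist()
results = collection.search(
    data=user_vectors,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=100,
)
print([hit.id for hit in results[0]][:10])
```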

It gets more interesting when we accomplish the same tasks as part of a multi-stage inference within the TIS framework. Merlin provides a high-level API, Merlin Systems, that allows combining different stages of the recommender system into a single chained “ensemble model”. As a result, all the stages described above are executed in a single request to TIS. We implement a custom Merlin Systems operator here, called QueryMilvus, as part of the ensemble.
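
The real QueryMilvus operator in our repository plugs into the Merlin Systems operator interface; the conceptual sketch below does not use that interface (the class name, constructor, and transform() signature are simplified placeholders), but it shows what the Milvus call inside the ensemble boils down to.

```python
from pymilvus import Collection, connections


class QueryMilvusSketch:
    """Conceptual stand-in for our QueryMilvus ensemble operator.

    The real operator subclasses the Merlin Systems operator interface; the class
    name, constructor, and transform() signature here are simplified placeholders.
    """

    def __init__(self, collection_name: str, top_k: int = 100):
        connections.connect("default", host="localhost", port="19530")
        self.collection = Collection(collection_name)
        self.collection.load()
        self.top_k = top_k

    def transform(self, user_embeddings):
        # user_embeddings: a 2-D array of query vectors produced by the user tower.
        results = self.collection.search(
            data=user_embeddings,
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 16}},
            limit=self.top_k,
        )
        # Return candidate item IDs per user, to be passed on to the ranking stage.
        return [[hit.id for hit in hits] for hits in results]
```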

You may have noticed that the pymilvus library did not utilize the GPU, while NVTabular and Merlin Models did. This is because the GPU-accelerated version of Milvus requires launching multiple containers, which is not possible with the Merlin container we used. Instead, we used a version of pymilvus that launches the Milvus server as a process in the same container where our notebooks are running. To run Milvus on GPU, check out this recent release that offers GPU support. The benchmarks conducted on GPU, as reported below, use this library version.

Benchmarking

To demonstrate the case for using a fast and efficient vector indexing/search library such as Milvus, we have designed two sets of benchmarks.

  1. Using Milvus to build vector indexes with the two sets of embeddings we generated: 1) user embeddings for 7.3M unique users, split as 85% train set (for indexing) and 15% test set (for querying), and 2) item embeddings for 49K products (with a 50–50 train-test split). This benchmark is done independently for each vector dataset, and results are reported separately.
  2. Using Milvus to build a vector index for the 49K item embeddings dataset and querying the 7.3M unique users against this index for similarity search.

In these benchmarks, we used the IVF_PQ and HNSW indexing algorithms, executed on GPU and CPU, with various combinations of parameters. Details are available here.

An important performance consideration, especially in a production environment, is the search quality-throughput tradeoff. Milvus allows full control over the indexing parameters, so this tradeoff can be explored for a given use case to achieve better search results with respect to ground truth. Better search quality may mean increased computational cost, in the form of a reduced throughput rate or queries per second (QPS). We measure the quality of ANN search with a recall metric and provide QPS-recall curves that demonstrate the tradeoff. One can then decide on an acceptable level of search quality given the compute resources or latency/throughput requirements of the business case.
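
For reference, here is a minimal sketch of the recall metric, assuming (as is conventional in ANN benchmarking) that recall is measured as the overlap between the ANN results and exact brute-force top-k ground truth.

```python
import numpy as np

def recall_at_k(ann_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Average fraction of the exact top-k neighbors that the ANN search returned.

    Both arrays have shape (num_queries, k): one row of neighbor IDs per query.
    """
    hits = [len(set(a) & set(e)) for a, e in zip(ann_ids, exact_ids)]
    return float(np.mean(hits)) / ann_ids.shape[1]

# Example: 2 queries, k = 4; the ANN search misses one neighbor in the second query.
ann = np.array([[1, 2, 3, 4], [10, 11, 12, 99]])
exact = np.array([[1, 2, 3, 4], [10, 11, 12, 13]])
print(recall_at_k(ann, exact))  # 0.875
```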

Also note the query batch size (nq) used in our benchmarks. This is useful in workflows where multiple simultaneous requests are sent to inference (e.g., offline recommendations generated and sent to a list of email recipients, or online recommendations created by pooling concurrent incoming requests and processing them all at once). Depending on the use case, TIS can also help process these requests in batches.

Results

We now report the results of the three sets of benchmarks on both CPU and GPU, using the HNSW (CPU only) and IVF_PQ (CPU and GPU) index types implemented by Milvus.

Items vs. Items vector similarity search

With this smallest dataset, each run, for a given parameter combination, takes 50% of the item vectors as query vectors and queries the top 100 similar vectors from the rest. We find that HNSW and IVF_PQ produce high recall with the parameter settings tested, in the range 0.958–1.0 and 0.665–0.997, respectively. This suggests that HNSW performs better w.r.t. recall, but IVF_PQ with small nlist settings produces highly comparable recall. We should also note that the recall values can vary greatly depending on the indexing and querying parameters. The values we report have been obtained after preliminary experimentation with general parameter ranges and zooming further into a select subset.

The total time to execute all queries on CPU with HNSW for a given parameter combination ranges between 5.22 and 5.33 seconds (faster as m gets larger, relatively unchanged with ef), and with IVF_PQ between 13.67 and 14.67 seconds (slower as nlist and nprobe get larger). GPU acceleration has a noticeable effect, as seen in Figure 3.

Figure 3 shows the recall-throughput trade-off over all runs completed on CPU and GPU with this small dataset using IVF_PQ. We find that GPU provides a speedup of 4x to 15x across all parameter combinations tested (larger speedup as nprobe gets larger). This is calculated by taking the ratio of QPS from GPU over QPS from CPU runs, for each parameter combination. Overall, this set presents little challenge for CPU or GPU, and shows prospects for further speedup with the larger datasets, as we discuss below.

Figure 3. GPU speedup with Milvus IVF_PQ algorithm running on NVIDIA A100 GPU (item-item similarity search)

Users vs. Users vector similarity search

With the much larger second dataset (7.3M users), we set aside 85% (~6.2M) of the vectors as the “train” set (the vectors to be indexed) and the remaining 15% (~1.1M) as the “test” or query set. HNSW and IVF_PQ perform extremely well in this case, with recall values of 0.884–1.0 and 0.922–0.999, respectively. They are, however, computationally much more demanding, especially IVF_PQ on CPU. The total time to execute all queries on CPU ranges from 279.89 to 295.56 seconds with HNSW and from 3,082.67 to 10,932.33 seconds with IVF_PQ. Note that these query times are cumulative over the 1.1M vectors queried, so a single query against the index is still very fast. However, CPU-based querying may not be a viable option if the inference server must handle many thousands of concurrent requests running queries against an inventory of millions of items.

The A100 GPU delivers a blazing speedup of 37x to 91x (averaging 76.1x) across all parameter combinations with IVF_PQ in terms of throughput (QPS), shown in Figure 4. This is consistent with what we observed with the small dataset, which suggests the GPU performance scales reasonably well using Milvus with millions of embedding vectors.

Figure 4. GPU speedup with Milvus IVF_PQ algorithm running on NVIDIA A100 GPU (user-user similarity search)

Figure 5 below shows the recall-QPS tradeoff for all parameter combinations tested on CPU and GPU with IVF_PQ. Each point set (top for GPU, bottom for CPU) in this chart depicts the tradeoff faced when changing vector indexing/query parameters to achieve higher recall at the expense of lower throughput. Note the considerable loss of QPS in the GPU case as one tries to achieve higher recall levels.

Figure 5. Recall-Throughput tradeoff for all parameter combinations tested on CPU and GPU with IVF_PQ (users vs. users)

Users vs. Items vector similarity search

Finally, we consider another realistic scenario where user vectors are queried against item vectors (as demonstrated in Notebook 01 above). In this case, 49K item vectors are indexed, and 7.3M user vectors are each queried for the top 100 most similar items.

This is where things get interesting, because querying 7.3M vectors in batches of 1,000 against an index of 49K items appears to be time-consuming on CPU for both HNSW and IVF_PQ. The GPU handles this case better (see Figure 6). The highest accuracy levels achieved by IVF_PQ on CPU (with nlist = 100) take about 86 minutes to compute on average, but vary greatly as the nprobe value increases (51 min. when nprobe = 5 vs. 128 min. when nprobe = 20). The NVIDIA A100 GPU speeds up performance considerably, by a factor of 4x to 17x (with higher speedups as nprobe gets larger). Remember that the IVF_PQ algorithm, through its quantization technique, also reduces the memory footprint and, combined with GPU acceleration, provides a computationally viable ANN search solution.

Figure 6. GPU speedup with Milvus IVF_PQ algorithm running on NVIDIA A100 GPU (user-item similarity search)

Similar to Figure 5, the recall-throughput tradeoff is shown in Figure 7 for all parameter combinations tested with IVF_PQ. Here, one can still see that some ANN search accuracy may need to be given up in favor of increased throughput, though the differences are much less noticeable, especially for the GPU runs. This suggests that one can expect quite consistently high levels of computational performance with the GPU while still achieving high recall.

Figure 7. Recall-Throughput tradeoff for all parameter combinations tested on CPU and GPU with IVF_PQ (users vs. items)

Conclusion

If you’ve made it this far, we would like to share a few concluding remarks. We want to remind you that the complexity and multistage nature of modern recsys workflows necessitate performance and efficiency at every step. In this blog, we have hopefully given you compelling reasons to consider using two key capabilities in your RecSys pipelines:

  • NVIDIA Merlin with its Merlin Systems library that allows you to easily plug in Milvus, an efficient GPU-accelerated vector search engine.
  • Use of GPU to accelerate computations for vector database indexing and ANN search with a technology such as RAPIDS RAFT.

These findings suggest that the Merlin-Milvus integration presented is highly performant and much less complex than other options, both for training and inference. Also, both frameworks are actively developed and many new features (e.g. new GPU-accelerated vector database indexes by Milvus) are added in every release. The fact that vector similarity search is a key component in various workflows, such as computer vision, large language modeling, and recommender systems, makes this effort all the more worthwhile.

In closing, we would like to thank everyone from the Zilliz/Milvus, Merlin, and RAFT teams who contributed to producing this work and the blog post. We look forward to hearing from you, should you have a chance to implement Merlin and Milvus in your recsys or other workflows.
