Milvus Webinar Series #1 Recap: Vector Similarity Search & Indexing Methods
In this talk, our speaker, Dr. Yi, covers how various index types in the Milvus vector database can accelerate vector similarity search on large datasets.
Vector similarity search & indexing methods
Dr. Xiaomeng Yi, Senior Researcher, Zilliz
❓Q&A for the session:
Q: What are some common issues when indexing large datasets?
A: When building an index, the most common issue is performance: on a typical 32-core server, building an index for one billion vectors can take minutes to hours. For search, memory usage is another critical issue. Most indexing algorithms require all of the data to reside in memory, which becomes costly at large data volumes.
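To see why memory becomes the bottleneck, a back-of-envelope estimate helps. The numbers below are illustrative assumptions (128-dimensional float32 vectors, no index overhead), not a measurement of any particular index:

```python
# Rough memory footprint of keeping one billion raw vectors in memory.
# Dimension and per-vector overhead vary by embedding model and index type.
n_vectors = 1_000_000_000      # one billion vectors
dim = 128                      # assumed embedding dimension
bytes_per_float = 4            # float32

raw_bytes = n_vectors * dim * bytes_per_float
print(f"raw vectors alone: {raw_bytes / 2**30:.0f} GiB")  # ~477 GiB, before any index overhead
```

Graph-based indexes add neighbor lists on top of this, while quantization-based indexes (like PQ) shrink it substantially, which is why they are popular at billion scale.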
Q: To perform similarity search, we need to generate embeddings for our textual data. What would you recommend for choosing an embedding model?
A: The text feature-extraction models we use are open source and pre-trained, such as BERT and Word2Vec. For general-purpose text embedding, BERT meets most needs. If you have a more specific goal, you can also fine-tune the model on your own data, for example by adjusting the network's parameters.
Q: Can we use hardware accelerators (GPU, FPGA) to speed up vector indexing and similarity search?
A: Yes, hardware accelerators can improve performance. Some indexes, such as the inverted index and PQ, are naturally parallel, so they are easier to implement on accelerators and can exploit their massive compute capacity. Graph-based indexes are harder to parallelize, and how best to accelerate them remains an open problem.
Q: For the space partition-based indexes, how is the vector space divided into regions?
A: One way is to use hyperplanes to cut the space into multiple regions. We can use a tree structure to record the hyperplanes and regions. We can also use locality-sensitive hashing to generate a set of parallel hyperplanes and use the hash value to denote the region. Another way is to take a small sample of data and perform K-means clustering to get several cluster centroids, with each corresponding to a region. Each vector in the space is assigned to the region corresponding to its nearest centroid.
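The centroid-based approach described above can be sketched in a few lines. This is a minimal pure-Python illustration of the idea (real systems use optimized k-means on a data sample); the function names are my own:

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Toy k-means: returns k centroids, each defining one region."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's region.
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: squared_dist(p, centroids[c]))
            buckets[i].append(p)
        # Update step: move each centroid to the mean of its region.
        for i, b in enumerate(buckets):
            if b:
                centroids[i] = tuple(sum(xs) / len(b) for xs in zip(*b))
    return centroids

def region_of(vector, centroids):
    """Index of the region (nearest centroid) a vector is assigned to."""
    return min(range(len(centroids)), key=lambda c: squared_dist(vector, centroids[c]))

rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(200)]
centroids = kmeans(data, k=4)
print(region_of((0.5, 0.5), centroids))  # a region id in 0..3
```

At query time, only the regions whose centroids are closest to the query need to be scanned, which is what makes this partitioning useful for search.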
Q: I was curious to hear more about the two-stage similarity search that Xiaomeng talked about — in particular the accuracy/search time trade-off and the actual algorithm.
A: I guess this question is referring to the candidate filtering and result validation procedures. We use approximate nearest neighbor search to retrieve somewhat more candidates than the user requests, then use the candidates' original vector data to compute similarity precisely. Since we only need to check a small number of extra original vectors, performance is barely affected, while precision can improve by up to 10% when product quantization is used in the approximate stage.
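The two stages can be sketched as follows. For brevity a crude scalar quantizer stands in for product quantization (that substitution, and all names here, are my own assumptions): the approximate stage over-fetches candidates on compressed vectors, and the exact stage re-ranks them on the originals:

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(v, levels=4):
    # Lossy compression: snap each coordinate onto a coarse grid.
    return tuple(round(x * levels) / levels for x in v)

def two_stage_search(query, vectors, k, overfetch=3):
    # Stage 1 (candidate filtering): approximate distances on compressed
    # vectors; keep k * overfetch candidates instead of just k.
    approx = sorted(range(len(vectors)),
                    key=lambda i: squared_dist(query, quantize(vectors[i])))
    candidates = approx[: k * overfetch]
    # Stage 2 (result validation): exact distances on the original
    # vectors, keeping only the top k.
    return sorted(candidates, key=lambda i: squared_dist(query, vectors[i]))[:k]

rng = random.Random(0)
db = [tuple(rng.random() for _ in range(8)) for _ in range(500)]
q = tuple(rng.random() for _ in range(8))
print(two_stage_search(q, db, k=5))  # indices of 5 near neighbors
```

The `overfetch` factor is the accuracy/time knob: a larger factor means more exact distance computations but a lower chance that a true neighbor was dropped in the approximate stage.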
Q: Is Milvus an open-source software? Is it used as a Cloud System or do we need to install it locally? Are there DBMS installation requirements?
A: Milvus is, of course, open source; it is a vector database for AI that powers similarity search applications. We are preparing to release Milvus 2.0, a cloud-native version, while the current LTS version (Milvus 1.0) requires local installation.
The Milvus vector database is easy to use: it requires only a Docker installation and has no DBMS prerequisites.
Shiyu Chen, Data Engineer/DevRel, Zilliz
In this talk, our speaker, Shiyu, shows how to use MilvusDM, a data migration tool for Milvus, to transfer data into and out of Milvus.
Got any questions about vector similarity search or Milvus? Join our Slack discussion or follow us on Twitter. 👇🏻