Evaluating Vector Databases 101
With the growing interest in Large Language Models (LLMs) and AI solutions, vector databases have seen a rise in popularity. This is especially evident in the recent adoption of vector databases in LLM solutions that employ the Retrieval Augmented Generation (RAG) architecture. These solutions depend on retrieving pertinent documents from a database before generating a response. By facilitating efficient and accurate retrieval, vector databases substantially improve the performance of LLMs and reduce hallucinations.
Vector databases address these search needs by enabling efficient operations on vectors and offering scalable and robust solutions for data management and retrieval.
This post aims to provide a comprehensive overview of vector databases (also called vector stores), delving into their constituent elements, functionalities, and key considerations. Understanding these components equips you with the knowledge necessary to evaluate and select the most fitting solution and algorithm for your project.
Vectors and Vector Embeddings
Because computers and machine learning models operate only on numerical data, textual information must be converted into numerical vectors before it can be processed and analyzed.
These vectors are numerical representations of words or phrases; when they are created in a way that captures the semantic meaning of what they represent, we call them vector embeddings (or simply embeddings).
Embeddings are a way of representing data as points in an n-dimensional space so that similar data points cluster together [2]. A classic example is shown in figure 2, where similar words are closer to each other.
In the example, by representing words as vectors, embeddings let us leverage vector arithmetic for tasks like analogy completion (e.g., king - man + woman ≈ queen).
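As a minimal sketch of that arithmetic, here is a toy example with hand-made 3-dimensional vectors; the numbers are invented for illustration, and real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

# Toy 3-dimensional "embeddings", invented for illustration only.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman lands closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max(embeddings, key=lambda w: cosine_similarity(target, embeddings[w]))
print(best)  # queen
```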
Embeddings are created using an embedding model. These models are generally algorithms or neural networks trained to represent information in a multi-dimensional space. There are many ways to train embedding models; popular options include classic models such as word2vec and GloVe, as well as modern deep-learning-based models such as Cohere's embed-english-v3.0 and the widely used OpenAI text-embedding-ada-002. A good place to find available embedding models is the MTEB Leaderboard hosted on Hugging Face, where different embedding models can be compared across a variety of benchmarks.
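As one concrete example (among many options), embeddings can be generated with the open-source sentence-transformers library; the model below is a small general-purpose one from the sentence-transformers family:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# A small general-purpose model; swap in whatever fits your use case.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vector databases store embeddings.",
    "Embeddings map text to points in space.",
]
vectors = model.encode(sentences)
print(vectors.shape)  # (2, 384) for this particular model
```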
Vector databases
In contrast to a regular database, which focuses on structured data with a tabular organization, a vector database emphasizes the use of vectors for data representation and similarity-based querying, making it suitable for unstructured or high-dimensional data.
A vector database stores vectors, along with metadata related to each vector, in a way that facilitates the search process, accelerates queries through indexing, and allows users to retrieve vectors according to a similarity metric of their choice. Moreover, a vector database is expected to offer the same functionality as a regular database in terms of data management, scalability, backups, data security, integration with other systems, and so on.
Vector similarity
Before we proceed to similarity search, we need to define what similarity is and how we measure it.
In the context of vector embeddings, similarity is expressed by a measurement of how distant one vector is from another in multi-dimensional space; it is the basis of how a vector database identifies the best results for your query.
There are many ways to compute distance; common scores include Hamming Distance, Inner Product, Cosine Similarity, and Mahalanobis Distance [11]. Understanding these similarity metrics, and the pros and cons of each, is key to understanding how vector databases work; further information can be found in references [11] and [15].
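As a quick illustration, a few of these scores can be computed directly with NumPy and SciPy (Mahalanobis Distance is omitted here because it additionally requires a covariance matrix):

```python
import numpy as np
from scipy.spatial.distance import cosine, hamming  # pip install scipy

a = np.array([1.0, 0.0, 1.0, 1.0])
b = np.array([1.0, 1.0, 0.0, 1.0])

inner_product = np.dot(a, b)      # higher means more similar
cosine_sim = 1 - cosine(a, b)     # SciPy returns cosine *distance*
hamming_dist = hamming(a, b)      # fraction of positions that differ
print(inner_product, cosine_sim, hamming_dist)  # 2.0 0.666... 0.5
```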
Vector indexing and search
The simplest way to retrieve vectors similar to a query is to exhaustively compare it against all stored vectors and return the most similar one(s). At small scale, this is done with an algorithm called k-Nearest Neighbors (KNN) search.
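Brute-force KNN is short enough to sketch directly; this toy version scores every stored vector against the query with cosine similarity:

```python
import numpy as np

def knn_search(query, vectors, k=3):
    """Exhaustive k-nearest-neighbor search using cosine similarity."""
    # Normalizing first makes the dot product equal to cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = vectors @ query           # one score per stored vector
    return np.argsort(-scores)[:k]     # indices of the k most similar

rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 128))  # 10k stored 128-d vectors
print(knn_search(rng.normal(size=128), stored))
```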
At large scale (millions of vectors), this exhaustive comparison becomes unfeasible (slow and memory intensive), and new approaches are needed. The most common approach is to index the stored vectors with Approximate Nearest Neighbor (ANN) algorithms, which work similarly to KNN but trade precision or recall for speed.
To speed up search, ANN indexing strategies include reducing the dimensionality of the vectors to make similarity computation cheaper, narrowing down the search space, implementing an efficient data structure, or a combination of all of these. Popular methods such as Product Quantization (PQ), Locality-Sensitive Hashing (LSH), Inverted File System (IVF), and Hierarchical Navigable Small Worlds (HNSW) apply these strategies; their inner workings are covered in depth in these resources: [3], [4].
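As a hedged sketch of what ANN indexing looks like in practice, here is an IVF index built with the open-source Faiss library; the nlist and nprobe values are illustrative, not tuned recommendations:

```python
# Requires: pip install faiss-cpu
import faiss
import numpy as np

d = 128
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, d)).astype("float32")

# IVF: cluster the vectors into nlist cells, then search only a few cells.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)  # nlist = 256
index.train(vectors)
index.add(vectors)

index.nprobe = 8  # cells to visit per query; more = slower but more accurate
distances, ids = index.search(vectors[:1], 5)  # top-5 neighbors of one query
print(ids)
```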
When a query requires an “exact search”, as opposed to the approximate case described above, the database uses techniques based on words and their frequencies. Popular algorithms for this case are the Inverted Index and Okapi BM25 / TF-IDF.
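For illustration, here is a minimal keyword-based ranking example using the open-source rank-bm25 package, one of several BM25 implementations:

```python
# Requires: pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "vector databases store embeddings",
    "inverted indexes power keyword search",
    "bm25 ranks documents by term frequency",
]
tokenized = [doc.split() for doc in corpus]  # naive whitespace tokenizer

bm25 = BM25Okapi(tokenized)
scores = bm25.get_scores("keyword search".split())
print(scores)  # the second document scores highest
```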
Components of a vector database
The main goal of a vector database is to reliably search for vectors similar to a query you submit; in other words, given a vector, return the vectors most similar to it. This process can be described as a pipeline, as illustrated in figure 3 (a minimal end-to-end sketch follows the list below).
Vectors: a set of texts/documents transformed into vector embeddings by an embedding model.
Indexing: the process of efficiently mapping the vectors to a data structure that enables fast and effective search operations.
Vector Database: stores the indexed vectors and their metadata, addressing the limitations of a standalone vector index (persistence, data management, scalability, security).
Querying/Searching: the step that compares the query vector with the vectors in the database and finds the nearest neighbors.
Pre-Processing (optional): any steps applied before the search, such as pre-filtering on metadata.
Post-Processing (optional): any steps applied after the search, such as post-filtering on metadata or reranking the results.
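To make the pipeline concrete, here is a minimal end-to-end sketch using the open-source Chroma client as an example store; any comparable product follows the same add-then-query flow (the collection name and metadata values are invented for illustration):

```python
# Requires: pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection("docs")  # uses a default embedding model

# Vectors + indexing: documents are embedded and indexed on insert.
collection.add(
    ids=["1", "2"],
    documents=[
        "Vector databases enable similarity search.",
        "BM25 is a keyword-based ranking function.",
    ],
    metadatas=[{"topic": "vectors"}, {"topic": "keywords"}],
)

# Querying with pre-filtering on metadata (the `where` clause).
results = collection.query(
    query_texts=["how do I search by similarity?"],
    n_results=1,
    where={"topic": "vectors"},
)
print(results["documents"])
```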
Vector database use cases
Representing unstructured data as embeddings allows us to improve many existing solutions and opens doors to new use cases. Some common ones include:
• Knowledge base search
• Image search / Audio search
• Multimodal search (for example, images with text)
• Recommendation systems
• Retrieval Augmented Generation (RAG)
With the surge of interest in large language models (LLMs), currently the most popular use case for vector databases is as a component of a RAG architecture (see figure 4). RAG improves an LLM's output by referencing a knowledge base before generating a response; adding this external information to the prompt helps reduce problems like hallucinations.
To learn more about RAG, check references [5], [6], and [7].
Getting started with RAG
Now that you have made it this far, you might be interested in getting your hands on a RAG implementation. Choosing and setting up a vector database can be a bit overwhelming if you are just starting out.
You can try a minimal implementation, like the one linked below, to get used to it. The suggested implementation uses an in-memory vector index instead of a vector database:
https://python.langchain.com/docs/expression_language/cookbook/retrieval [8]
After getting used to it, you will be able to notice what vector databases bring in addition to a standalone vector index, and why you will need a proper database when your solution reaches the production stage.
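To make that difference tangible, here is a deliberately minimal in-memory vector index (a toy sketch, not production code): it can add and search vectors, but provides none of the persistence, CRUD, scaling, or security features discussed in the next section.

```python
import numpy as np

class InMemoryVectorIndex:
    """A bare-bones standalone vector index: search works, but there is
    no persistence, CRUD, scaling, or access control."""

    def __init__(self):
        self.texts, self.vectors = [], []

    def add(self, text, vector):
        vector = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(vector / np.linalg.norm(vector))

    def search(self, query_vector, k=2):
        query = np.asarray(query_vector, dtype=float)
        query = query / np.linalg.norm(query)
        scores = np.stack(self.vectors) @ query
        return [self.texts[i] for i in np.argsort(-scores)[:k]]

# Toy usage with hand-made vectors; in a real RAG setup these would come
# from an embedding model, and the retrieved texts would go into the prompt.
index = InMemoryVectorIndex()
index.add("doc about cats", [1.0, 0.0])
index.add("doc about cars", [0.0, 1.0])
print(index.search([0.9, 0.1], k=1))  # ['doc about cats']
```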
Evaluating a vector database solution
Evaluating a vector database solution requires comparing many aspects across a variety of criteria, such as service offerings, functional requirements, maintainability and security, scalability, cost, and integration with AI tools.
We will dive deeper into these main points and highlight aspects that might help in the decision to choose a vector database solution. As always, there is no single best solution, and exploration is required to find the one best suited to a specific use case.
An excellent reference is the Vector DB Comparison [9] created by Superlinked, which compares vector stores across many different aspects. With this blog post, we aim to prepare the reader to understand most of the evaluation criteria used in that comparison and to present additional criteria that might be worth looking at.
1 — Service Offerings
The overall offerings of each solution.
1.1 — Service type
Typically, there are two service types: self-hosted and managed software as a service (SaaS). It is important to start by evaluating solutions that can meet the requirements of your project. The choice of service type will also impact the cost estimates for the project.
If opting for self-hosting, check for integrations with cloud services for deploying the solution, as well as its computation/memory requirements.
1.2 — Documentation maturity
Check the documentation of each solution. Some vendors, like Pinecone, Weaviate, and Milvus, have been in the market for a while and have good support materials and a support community. Open-source solutions can also be evaluated by how active the community is in improving the codebase (number of GitHub stars, issues, and pull requests).
1.3 — Ecosystem integration
Some solutions offer integration with a variety of tools. If your project uses platforms like Amazon Web Services or Microsoft Azure, check whether the vector database solutions you are considering provide integration with them, or whether the platform offers its own vector database solution.
If you are using AWS, for example, it might be worth analyzing the pros and cons of a native solution like Amazon OpenSearch versus a dedicated solution like Pinecone. In either case, examples of documentation to look for are Amazon OpenSearch k-NN search and Pinecone's integration with AWS.
2 — Functional Aspects
Evaluates search and indexing algorithms, post-processing options, and other database functionalities.
2.1 — Search and Indexing algorithms
As mentioned previously, ANN algorithms trade accuracy/precision/recall for speed, and this increase in speed also requires more memory to perform well. To evaluate a solution, it is important to understand the minimum requirements in terms of accuracy, latency, memory use, and storage size.
Asking questions like these can be useful while evaluating a solution:
• “How many queries per second are we going to make?”
• “How fast should the database return an answer?”
• “How accurate do I want the answers to be?”
• “Do we need exact search capabilities in addition to ANN search?”
• “If self-hosting, what is the budget we have? Can we afford a machine with 1.5TB of memory?”
Evaluations of the speed vs. accuracy trade-off can be found in benchmarks such as ANN-Benchmarks [10] (see figure 5):
A very interesting example of an evaluation that takes memory into consideration can be found in Amazon's blog post on choosing ANN algorithms [4].
In that blog post, the AWS team calculated that, for 1 billion 128-dimensional vectors, the estimated memory required by each algorithm is:
• HNSW algorithm: 1,408 GB
• IVF algorithm: 1,126 GB
• PQ algorithm: 70 GB
According to the blog post, “this savings does come at a cost, however, because HNSW offers a better query latency versus approximation accuracy tradeoff” [4].
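The AWS figures above include each algorithm's index overhead. As a sanity check, a back-of-the-envelope calculation of the raw fp32 vector storage alone, before any index structure, looks like this:

```python
# Raw storage for the vectors themselves, before any index overhead.
n_vectors, dims, bytes_per_float = 1_000_000_000, 128, 4
raw_gib = n_vectors * dims * bytes_per_float / 1024**3
print(f"{raw_gib:.0f} GiB")  # ~477 GiB; HNSW/IVF add graph or cluster
# structures on top of this, while PQ compresses well below it.
```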
By figuring out the needs of the project, it is possible to narrow the search down to the solutions whose algorithms are most suitable for the use case.
2.2 — Similarity Scores
Any similarity search is based on a measurement of how distant one vector is from another. Common scores such as Hamming Distance, Inner Product, Cosine Similarity, and Mahalanobis Distance are widely supported by vector database vendors. Choosing the appropriate one for your use case will require experimentation, so it is important to consider vendors that support a considerable variety of options. In practice, the best similarity score for your use case is the one that best reflects the semantic similarities between the entities represented by the vector embeddings. As a rule of thumb, the preferred similarity metric for vector search is the one used to train the embedding model of choice [15], but the ideal way to select a score remains an open question [11].
For more details about this topic, check reference [11].
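One relationship worth knowing when comparing options: for L2-normalized vectors, cosine similarity and inner product coincide, which is why many vendors suggest storing normalized vectors. A quick check:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)

# After L2 normalization, the inner product equals cosine similarity.
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(np.dot(a, b), cosine)
```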
2.3 — Filtering
Part of the pre- or post-processing step is filtering on metadata [12]. Some vector databases offer only pre-filtering or only post-filtering functionality. Understanding the pros and cons of each can help you evaluate the best solution for your project.
Post-filtering: the process of metadata filtering is executed following the vector search.
While this approach can be advantageous in ensuring all pertinent results are considered, it can also introduce extra overhead and slow down the query, since irrelevant results must be eliminated after the search has concluded.
In the worst-case scenario, post-filtering could exclude all top k-nearest neighbors, resulting in zero retrievals.
Pre-filtering: metadata filtering takes place prior to the vector search.
This process can be beneficial in decreasing the search space, but it might inadvertently cause the system to miss relevant results that do not meet the metadata filter criteria. Furthermore, comprehensive metadata filtering could potentially slow the query process due to the increased computational overhead.
The worst-case scenario is retrieving less relevant results because pre-filtering eliminated relevant ones as a consequence of sub-optimal metadata.
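The contrast between the two approaches, including the zero-retrieval worst case of post-filtering, can be shown with a few lines of toy code (the vectors and metadata values are invented):

```python
import numpy as np

def search(vectors, query, k):
    return list(np.argsort(-(vectors @ query))[:k])

vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
metadata = ["en", "en", "fr"]
query = np.array([1.0, 0.0])

# Post-filtering: search first, filter after; may return fewer than k.
top = search(vectors, query, k=2)                  # -> [0, 1]
post = [i for i in top if metadata[i] == "fr"]     # -> [] (zero retrievals)

# Pre-filtering: shrink the candidate set, then search within it.
candidates = [i for i, m in enumerate(metadata) if m == "fr"]
pre = [candidates[j] for j in search(vectors[candidates], query, k=2)]
print(post, pre)  # [] [2]
```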
2.4 — Database operations and Reindexing
Evaluate whether the solution offers all the necessary traditional database operations: create, read, update, and delete (CRUD).
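As an example of what those operations look like in practice, here is a minimal CRUD round-trip using the open-source Chroma client; stores with comparable APIs look much the same:

```python
# Requires: pip install chromadb
import chromadb

client = chromadb.Client()
collection = client.create_collection("crud-demo")

collection.add(ids=["1"], documents=["First draft."])        # Create
print(collection.get(ids=["1"])["documents"])                # Read
collection.update(ids=["1"], documents=["Revised draft."])   # Update
collection.delete(ids=["1"])                                 # Delete
```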
In addition, another important operation to consider is the reindexing capability of the vector database. As new vectors are introduced into the database, the index needs to be updated or rebuilt to keep searches fast. If the use case requires frequently inserting a high volume of new vectors, be mindful of the reindexing time and indexing details of the chosen solution.
Reindexing will also be needed any time the embedding model changes, since that changes the embeddings themselves. With a new model, a new index needs to be built, and depending on the volume of vectors stored, this operation can be very time-consuming and costly. As a reference on indexing/reindexing operations in vector databases, check this resource.
3 — Deployment, Maintenance and Security
The deployment options, maintainability, and security functionalities to consider for vector databases are very similar to those we evaluate for any regular database. Here we highlight some of the main points.
3.1 — Deployment options
Deployment options should be weighed against the needs and tech stack of the project. Options vary between cloud providers' managed services, deployment on EC2 or EKS instances, and private clouds.
Considerations regarding these options can be important for maintenance efforts.
3.2 — Latency, Availability and Scalability
The levels of latency, availability, and scalability required for the project can help narrow the search for an optimal solution, as well as guide other deployment requirements such as the amount of memory and computational resources.
Consider checking features such as auto-scaling, replication architecture, and horizontal and vertical scaling to better understand which options are most suitable for the project.
3.3 — Backups and Migration
Plan the needs for backup, recovery, and migration from the beginning of the project. Some providers might offer interesting features for your use case, and some might have better documentation about these steps than others.
3.4 — Maintainability
One important maintenance topic for vector databases is reindexing effort, whether updating the index due to new data or completely re-vectorizing and re-indexing the data due to a change in the embedding model.
Another need to consider is the availability of logging and monitoring/metrics-tracking options.
Keep in mind that more mature solutions tend to have better documentation and community support in case you need help in the future.
3.5 — Role based access
As odd as it might seem, some databases do not offer role-based access control at certain levels. If your project requires role-based access to restrict certain groups' access to data, check which solutions already support it and which are planning to by looking at their roadmaps.
4 — Cost and Pricing
This criterion is likely the trickiest to compare. How cost and pricing are calculated varies with service offerings and deployment options.
Although it might be more straightforward to measure cost when signing up for a SaaS product, even then there are nuances to study. For example, some vendors charge a base fee plus queries per hour, while others charge fees based on storage size and query volume, making it difficult to compare solutions effectively; the toy comparison below illustrates why.
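To see why, consider this toy comparison with entirely hypothetical price sheets; the numbers are invented and do not reflect any real vendor:

```python
# Hypothetical pricing models, invented for illustration only.
monthly_queries, stored_gb = 5_000_000, 50

# Vendor A (hypothetical): base fee plus a rate per million queries.
vendor_a = 70 + (monthly_queries / 1_000_000) * 8
# Vendor B (hypothetical): storage rate plus a rate per million queries.
vendor_b = stored_gb * 0.30 + (monthly_queries / 1_000_000) * 12

print(f"A: ${vendor_a:.0f}/mo  B: ${vendor_b:.0f}/mo")  # A: $110/mo  B: $75/mo
# Which vendor is cheaper flips as query volume and storage grow,
# so model your own workload rather than comparing list prices.
```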
5 — Integration with AI Tools
Consider the AI tools you will be using for the project. Many libraries, such as LangChain, LlamaIndex, and Haystack, offer integrations with vector stores. Some vector databases have more integration features implemented than others; Pinecone and Weaviate, for example, are well known for good integration and documentation with LangChain and LlamaIndex.
Check integration examples in references [13] and [14].
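As one hedged example, here is LangChain wired to a local FAISS vector store with Hugging Face embeddings; LangChain's package layout changes frequently, so the import paths below (langchain-community 0.x style) may differ in your version:

```python
# Requires: pip install langchain-community faiss-cpu sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = FAISS.from_texts(
    ["Vector DBs integrate with LLM frameworks.",
     "Swapping the store class swaps the backend."],
    embeddings,
)
print(store.similarity_search("framework integration", k=1))
```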
Conclusion
This blog post has built up knowledge about vector databases, starting from the fundamental definition of vector embeddings and their practical applications. It has delved into the rationale behind storing contexts in the form of vectors, enabling fast and accurate retrieval processes. The post has culminated in equipping readers with a comprehensive understanding of the key considerations for evaluating and selecting the most suitable vector database solution for their specific project requirements.
By thoroughly exploring the constituent elements of vector databases, such as indexing techniques, similarity metrics, filtering options, and database operations, readers can gain invaluable insights into the functional aspects that differentiate various vector database solutions. This knowledge serves as a solid foundation for comprehending benchmarks and comparisons, such as the one provided by Superlinked (https://superlinked.com/vector-db-comparison/), and recognizing the solutions that align best with their particular use case.
Furthermore, the post has highlighted the significance of vector databases in the context of emerging technologies like Retrieval Augmented Generation (RAG) architectures for large language models. By facilitating efficient retrieval of relevant information from knowledge bases, vector databases play a crucial role in enhancing the performance of these models and mitigating issues like hallucinations.
Armed with the knowledge acquired from this post, readers are now better equipped to navigate the landscape of vector database solutions and evaluate their suitability based on factors such as service offerings, functional requirements, maintainability, scalability, cost, and integration with AI tools. This comprehensive understanding empowers readers to make informed decisions and leverage the full potential of vector databases in their projects, driving innovation and unlocking new possibilities in various domains, including knowledge base search, image and audio search, recommender systems, and multimodal applications.
References
[2] https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings
[3] https://www.pinecone.io/learn/vector-database/
[5] https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1
[6] https://aws.amazon.com/what-is/retrieval-augmented-generation/
[7] https://learn.microsoft.com/pt-br/azure/search/retrieval-augmented-generation-overview
[8] https://python.langchain.com/docs/expression_language/cookbook/retrieval
[9] https://superlinked.com/vector-db-comparison
[10] https://github.com/erikbern/ann-benchmarks
[12] https://www.pinecone.io/learn/vector-search-filtering
[13] https://python.langchain.com/docs/integrations
[14] https://docs.llamaindex.ai/en/stable/examples/vector_stores