Vector Databases for Generative AI and beyond
Ioannis Papapanagiotou (Steel Perlot)
Large language models, Generative AI, and semantic search rely on vector embeddings, a type of vector data representation that carries semantic information the AI can use to gain understanding and maintain a long-term memory it can draw upon when executing complex tasks. We recently posted a blog post on the generality of transformers. In this post, we will cover parts of the infrastructure that supports semantic search.
Embeddings capture the “relatedness” of text, images, video, or other types of information. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. Embeddings are generated by AI models (such as Large Language Models) and have many attributes or features, making their representation challenging to manage. In the context of AI and machine learning, these features represent different dimensions of the data that are essential for understanding patterns, relationships, and underlying structures.
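As a rough illustration (the sentence-transformers package and the all-MiniLM-L6-v2 model are assumptions made for this sketch, not something the post prescribes), here is how text becomes embeddings and how “relatedness” reduces to a distance computation:

```python
# Minimal sketch: turning text into embeddings and comparing them.
# Assumes the sentence-transformers package; the model name is illustrative.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by 12%.",
]
# normalize_embeddings=True lets us use a plain dot product as cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)

print(np.dot(embeddings[0], embeddings[1]))  # high: semantically close sentences
print(np.dot(embeddings[0], embeddings[2]))  # low: unrelated topics
```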
Vector databases are specialized, not universal, and are not suited to serve as the general-purpose back end of your application. Their primary role is to swiftly pinpoint large numbers of documents with comparable content through similarity search over clustered vectors. Consequently, most applications will typically employ a mix of a SQL/NoSQL database and a vector database. The latter is specifically used to facilitate document searches and to supply data to AI models such as OpenAI’s ChatGPT and other Large Language Models (LLMs). This approach mirrors the use of other databases like ElasticSearch, which are tailored to deliver lexical search capabilities for particular scenarios.
For effective similarity searches, vector databases utilize distinct indexing frameworks and methodologies, including tree-based structures like k-d trees, graph-based structures like k-nearest neighbor graphs, or hashing strategies such as locality-sensitive hashing. These indexing techniques efficiently categorize and divide the vectors to allow quick access to similar ones. Essentially, a vector index is a data structure in computer science and information retrieval designed to store and retrieve high-dimensional vector data efficiently, thus enabling rapid similarity searches and queries for the nearest neighbors.
In a vector database, the vectors are typically stored along with their associated metadata, such as labels, identifiers, or any other relevant information. The database is optimized for efficient storage, retrieval, and querying of vectors based on their similarity or distance to other vectors.
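No particular product is needed to see the pattern; the following toy sketch (all names and the filtering scheme are invented for illustration) shows vectors stored next to metadata, with a query that filters on the metadata before ranking by similarity:

```python
# Toy illustration of a vector record: an id, the embedding, and metadata.
import numpy as np

records = [
    {"id": "doc-1", "vector": np.array([0.1, 0.9]), "meta": {"lang": "en"}},
    {"id": "doc-2", "vector": np.array([0.8, 0.2]), "meta": {"lang": "de"}},
    {"id": "doc-3", "vector": np.array([0.2, 0.8]), "meta": {"lang": "en"}},
]

def query(q, lang, k=2):
    # Filter on metadata first, then rank the survivors by cosine similarity.
    candidates = [r for r in records if r["meta"]["lang"] == lang]
    q = q / np.linalg.norm(q)
    scored = [
        (float(np.dot(q, r["vector"] / np.linalg.norm(r["vector"]))), r["id"])
        for r in candidates
    ]
    return sorted(scored, reverse=True)[:k]

print(query(np.array([0.15, 0.85]), lang="en"))
```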
Areas of application
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)
Vector Indexes
Flat indexing
Flat indexes simply encode the vectors into codes of a fixed size and store them in an array of ntotal * code_size bytes. At search time, all the indexed vectors are decoded sequentially and compared to the query vectors. This indexing approach is straightforward, simple to implement, and yields exact results. However, its major drawback is that it is slow. In a flat index, the system calculates the similarity between the query vector and every other vector in the index. It then retrieves the K vectors with the highest similarity (equivalently, the lowest distance). Flat indexing is the preferred method when utmost accuracy is paramount and speed is less of a concern. Additionally, if the dataset being searched is relatively small, flat indexing might also be a viable option, as the search speed can remain fairly acceptable.
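The brute-force behavior is easy to sketch with nothing but numpy (an illustration, not any library’s implementation):

```python
# Flat (brute-force) search: every query is compared against every stored
# vector, so results are exact but cost grows linearly with the index size.
import numpy as np

rng = np.random.default_rng(0)
ntotal, dim = 10_000, 128
index = rng.standard_normal((ntotal, dim)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalize once, up front

def flat_search(query, k=5):
    query = query / np.linalg.norm(query)
    scores = index @ query                    # cosine similarity with all vectors
    top = np.argpartition(-scores, k)[:k]     # unordered indices of the k best
    order = top[np.argsort(-scores[top])]     # sort just those k
    return order, scores[order]

ids, scores = flat_search(rng.standard_normal(dim).astype(np.float32))
```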
Locality Sensitive Hashing (LSH) indexes
Locality Sensitive Hashing is an indexing method that prioritizes speed and the identification of an approximate nearest neighbor, rather than conducting an exhaustive search for the actual nearest neighbor as flat indexing does. This strategy constructs the index using a hashing function, where vector embeddings close to one another are directed to the same bucket. Consequently, all similar vectors can be grouped within a single table or bucket. Upon receiving a query vector, its nearest neighbors are identified by hashing the query vector and then calculating the similarity metric for all vectors in the table that have been hashed to the same value. This approach significantly narrows down the search scope compared to flat indexing, where the similarity metric is evaluated across the entire space, thus substantially enhancing the speed of the query.
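A toy random-hyperplane LSH gives the flavor (a single hash table for brevity; real systems typically use several tables and tuned signature lengths):

```python
# Toy LSH with random hyperplanes: vectors that fall on the same side of every
# hyperplane share a bit signature and land in the same bucket.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
dim, n_planes = 64, 12
planes = rng.standard_normal((n_planes, dim))

def signature(v):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple((planes @ v > 0).astype(int))

data = rng.standard_normal((5_000, dim))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[signature(v)].append(i)

def lsh_search(q, k=5):
    # Score only the vectors that hashed into the same bucket as the query.
    candidates = buckets.get(signature(q), [])
    cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(candidates, key=lambda i: -cos(q, data[i]))[:k]
```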
Inverted file (IVF) indexes
Inverted file (IVF) indexes are similar to LSH in that the goal is to first map the query vector to a smaller subset of the vector space and then only search that smaller space for approximate nearest neighbors. This will greatly reduce the number of vectors we need to compare the query vector to, and thus speed up our ANN search.
In LSH that subset of vectors was produced by a hashing function. In IVF, the vector space is first partitioned or clustered, and then centroids of each cluster are found. For a given query vector, we then find the closest centroid. Then for that centroid, we search all the vectors in the associated cluster.
Note that there is a potential problem when the query vector is near the edge of multiple clusters. In this case, the nearest vectors may be in the neighboring cluster. In these cases, we generally need to search multiple clusters.
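A miniature IVF makes this concrete (scikit-learn’s k-means is an assumption here, used only for the clustering step); probing more than one cluster (nprobe > 1) is exactly the mitigation for the edge-of-cluster problem above:

```python
# IVF in miniature: cluster the collection, then probe only the nprobe clusters
# whose centroids are closest to the query.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
dim, n, nlist = 64, 20_000, 100
data = rng.standard_normal((n, dim)).astype(np.float32)

kmeans = KMeans(n_clusters=nlist, n_init=4, random_state=0).fit(data)
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(nlist)}

def ivf_search(q, k=5, nprobe=4):
    dist_to_centroids = np.linalg.norm(kmeans.cluster_centers_ - q, axis=1)
    probe = np.argsort(dist_to_centroids)[:nprobe]         # closest clusters
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dist = np.linalg.norm(data[candidates] - q, axis=1)    # exact search within candidates
    return candidates[np.argsort(dist)[:k]]

print(ivf_search(rng.standard_normal(dim).astype(np.float32)))
```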
HNSW
Hierarchical Navigable Small World (HNSW) is a highly performant indexing algorithm that nets users great recall at high throughput. It’s used extensively throughout the vector-search world as the core of many notable open-source indexing libraries, such as NMSLIB and Faiss.
Like many graph-based algorithms, HNSW is fast. It gets its speed from its layered, hierarchical graph structure, which allows it to quickly home in on target neighborhoods (i.e. the potential matches for a user’s query).
HNSW generally gets accurate results (or “recall”). Its nodes and edges (i.e. the parts that make up the graph) connect semantically similar content, making it easy to find the most relevant vectors to a user’s query once within a target neighborhood. Since each layer of the graph in an HNSW index is a subsample of the layer above, every time the algorithm travels down a layer, the search space (the universe of possible ‘right’ answers) gets reduced. This translates to speed and accuracy.
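A short sketch with the hnswlib package (one of several HNSW implementations; faiss’s IndexHNSWFlat is another option, and all parameter values below are illustrative):

```python
# HNSW index via hnswlib: M controls graph connectivity, ef_construction the
# build-time search breadth, and ef the query-time breadth/recall trade-off.
import hnswlib
import numpy as np

dim, n = 128, 50_000
data = np.random.default_rng(3).standard_normal((n, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(64)  # higher ef = better recall, slower queries
labels, distances = index.knn_query(data[:1], k=10)
```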
In summary:
- A Flat index is one that stores vectors in their unmodified form, and is used for exact kNN search. It is the most accurate, but also the slowest.
- IVF-Flat indexes use inverted file indexes to rapidly narrow down the search space, which is much faster than brute-force search, but they sacrifice some accuracy in the form of recall.
- IVF-PQ uses IVF in combination with Product Quantization to compress the vectors, reducing the memory footprint and speeding up search, while being better in recall than a pure PQ index.
- HNSW is by far the most popular index, and is often combined with Product Quantization, in the form of HNSW-PQ, to improve search speed and memory efficiency compared to IVF-PQ.
- Vamana is a relatively new index, designed and optimized for on-disk performance — it offers the promise of storing larger-than-memory vector data while performing as well, and as fast, as HNSW. Vamana is among the most recently developed graph-based indexing algorithms, first presented at NeurIPS 2019 by Subramanya et al. in collaboration with Microsoft Research India.
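For concreteness, here is how several of these index families map onto faiss index classes (a sketch with illustrative parameters; Vamana/DiskANN is not part of faiss itself):

```python
# Flat, IVF-Flat, IVF-PQ and HNSW side by side in faiss.
import faiss
import numpy as np

d, n = 128, 50_000
xb = np.random.default_rng(4).standard_normal((n, d)).astype(np.float32)

flat = faiss.IndexFlatL2(d)                            # exact kNN

coarse1 = faiss.IndexFlatL2(d)
ivf_flat = faiss.IndexIVFFlat(coarse1, d, 256)         # IVF over raw vectors

coarse2 = faiss.IndexFlatL2(d)
ivf_pq = faiss.IndexIVFPQ(coarse2, d, 256, 16, 8)      # IVF + 16-byte PQ codes

hnsw = faiss.IndexHNSWFlat(d, 32)                      # graph index, M=32

for idx in (ivf_flat, ivf_pq):
    idx.train(xb)        # IVF variants learn centroids from a training pass
for idx in (flat, ivf_flat, ivf_pq, hnsw):
    idx.add(xb)

distances, ids = flat.search(xb[:1], 10)   # same call shape for every index
```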
Vector Indexes or Vector Databases
Standalone vector indices like FAISS (Facebook AI Similarity Search) can significantly improve the search and retrieval of vector embeddings, but they lack capabilities that exist in any database. Vector databases, on the other hand, are purpose-built to manage vector embeddings, providing several advantages over using standalone vector indices:
- Data management: Vector databases offer well-known and easy-to-use features for data storage, like inserting, deleting, and updating data. This makes managing and maintaining vector data easier than using a standalone vector index like FAISS, which requires additional work to integrate with a storage solution.
- Metadata storage and filtering: Vector databases can store metadata associated with each vector entry. Users can then query the database using additional metadata filters for finer-grained queries.
- Scalability: Vector databases are designed to scale with growing data volumes and user demands, providing better support for distributed and parallel processing. Standalone vector indices may require custom solutions to achieve similar levels of scalability (such as deploying and managing them on Kubernetes clusters or other similar systems).
- Real-time updates: Vector databases often support real-time data updates, allowing for dynamic changes to the data, whereas standalone vector indexes may require a full re-indexing process to incorporate new data, which can be time-consuming and computationally expensive.
- Backups and collections: Vector databases handle the routine operation of backing up all the data stored in the database. Pinecone also allows users to selectively choose specific indexes that can be backed up in the form of “collections,” which store the data in that index for later use.
- Ecosystem integration: Vector databases can more easily integrate with other components of a data processing ecosystem, such as ETL pipelines (like Spark), analytics tools (like Tableau and Segment), and visualization platforms (like Grafana) — streamlining the data management workflow. It also enables easy integration with other AI related tools like LangChain, LlamaIndex and ChatGPT’s Plugins.
- Data security and access control: Vector databases typically offer built-in data security features and access control mechanisms to protect sensitive information, which may not be available in standalone vector index solutions.
Main Features of Vector Databases
- Perform complex mathematical operations to filter and locate “nearby” vectors using similarity metrics such as cosine similarity
- Provide specialized Vector indexes to make retrieval of data significantly faster and deterministic
- Store vectors in a way that makes them more compact, like by compressing and quantizing them, to keep as much data in memory as possible (further reducing load + query latency)
- Shard data across multiple machines to handle large amounts of data (which many databases can do, but SQL databases in particular take more effort to scale out)
HNSW Issues
Memory issues
HNSW is primarily tailored for datasets that are static, yet the nature of production applications is often dynamic. As the underlying data evolves, the vector embeddings that symbolize your data must also adapt, along with the vector index itself. With regular updates to an integrated HNSW index, you’ll start to notice an increase in memory usage. To address the escalating memory demands that typical CRUD operations (such as updates and deletes) impose on HNSW indexes, one might need to implement one or several strategies.
- Tolerate the higher memory cost, which results in higher query latencies, timeouts, and a higher cost.
- Frequently adjust the HNSW index parameters to achieve and maintain production-level performance, despite less and less available memory (if you have access and expertise to do so).
- Periodically rebuild the index, and either tolerate downtime during the rebuild process or manage a blue-green deployment each time.
It turns out the apparent convenience and initial performance of a bolted-on index come with a significant cost and operational burden. See some interesting data.
Soft Deletes
In vector databases, especially those utilizing HNSW or similar layered graph algorithms, deleting data can force a rebuild of the entire index from scratch. Given the memory-intensive nature of vector operations, it’s crucial to promptly reclaim the memory occupied by deleted data to preserve performance. This is where the complexity arises. A characteristic of layered graph indexes like those created by HNSW is their sparsity, meaning some layers contain very few nodes (vectors). Therefore, if you remove a node from a layer that contains only that node, the nodes in lower layers reachable only through it become detached from the graph, essentially leaving them in a state of limbo. These disconnections can cause the index to fail.
To fix the graph (repair the connections), you must rebuild the index entirely from scratch. Even to update a vector, you must first delete the old one, so every change involves a delete. To avoid doing this for every single change, vector indexes mark to-be-deleted data with something called a tombstone. This process is a type of soft delete. At retrieval time, this tombstone acts as a flag to the system, indicating that the piece of data is not eligible to be returned to the user. Eventually, though, this to-be-deleted data has to be permanently removed from the graph, breaking the graph.
When dealing with a sparse graph index like HNSW in vector databases, completely removing data, commonly known as a “hard delete”, necessitates rebuilding the entire index. This process has a significant drawback: while reindexing is ongoing, the database becomes inaccessible to users. A common workaround is to employ a blue-green deployment strategy, where a static version of the index is used to handle user queries during the reindexing period. However, this approach has its own set of challenges. Primarily, it means users are temporarily served outdated or “stale” data. Additionally, duplicating the entire index for this purpose can require substantial storage resources. This becomes a critical concern for businesses that depend on providing up-to-date, near real-time data to their users, as any delay or inaccessibility can impact the user experience and the business’s efficacy.
Optimizations
HNSW is not the best index in terms of memory utilization. However, if this is important and using another index isn’t an option, we can improve it by compressing our vectors using product quantization (PQ). Using PQ will reduce recall and increase search time — but as always, much of ANN is a case of balancing these three factors.
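As a rough sense of what PQ buys, the sketch below uses faiss’s standalone IndexPQ (an assumption; several libraries also offer HNSW variants with PQ-compressed storage) to shrink 512-byte float vectors into 16-byte codes:

```python
# Product quantization on its own: 128 float32 dimensions (512 bytes/vector)
# become 16 one-byte codes per vector, trading some recall for memory.
import faiss
import numpy as np

d = 128
xb = np.random.default_rng(5).standard_normal((200_000, d)).astype(np.float32)

pq = faiss.IndexPQ(d, 16, 8)   # 16 sub-quantizers, 8 bits each -> 16 bytes/vector
pq.train(xb)                   # learn the sub-quantizer codebooks
pq.add(xb)

distances, ids = pq.search(xb[:5], 10)   # approximate neighbors over compressed codes
print(ids.shape)                         # (5, 10)
```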
Leaders
Pgvector (OSS)
Pgvector is an open-source extension that adds vector similarity search to PostgreSQL. It supports exact and approximate nearest-neighbor search and can be used from any language with a Postgres client. By using pgvector, one can turn a regular PostgreSQL instance into a database capable of processing vectors, and thereby keep ACID compliance, point-in-time recovery, JOINs, and all of the other great features of Postgres.
Pgvector implements multiple algorithms, supports indexes, allows for tuning the performance of the operations, and does all of that directly in PostgreSQL. This gives the ability to keep both the business data and the embeddings together, work on them in one environment, and monitor the performance accordingly. Pgvector also supports indexing, so we can improve the performance of the operations easily. There are two types of indexes: Inverted File (IVFFlat) and Hierarchical Navigable Small Worlds (HNSW).
The biggest advantage of using pgvector over a dedicated vector database is the ability to keep all the data in the same database, using the same workflows that we currently use in production. This changes the landscape of vector databases and open source significantly, because they turn from “new approach” to “commodity”, especially since PostgreSQL is hosted by multiple providers, both in the cloud and on-premise. Since pgvector is open source, we can read its internals, tune them to our needs, and deploy it even in the most limited contexts.
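A minimal sketch of what this looks like in practice (psycopg2, the table layout, and the 384-dimension embeddings are assumptions made for the example):

```python
# pgvector from Python: create the extension, an embedding column, an HNSW
# index, and run a cosine-distance (<=>) nearest-neighbor query.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(384)
    );
""")
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
""")
conn.commit()

query_embedding = [0.01] * 384   # would come from the same model used to embed the rows
cur.execute(
    "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT 5;",
    (str(query_embedding),),
)
print(cur.fetchall())
```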
Pgvector faces the HNSW challenges we mentioned above. Some of the performance issues are showcased in this benchmark.
Pinecone (Closed Source)
The Pinecone Graph Algorithm, or PGA, is similar to HNSW in that it also builds a graph structure, but that is where the similarities end. PGA is based on Microsoft’s Vamana (commonly referred to as FreshDiskANN) algorithm. Unlike HNSW, which builds a hierarchical graph structure (a sparse graph), FreshDiskANN builds a flat graph structure (a dense graph). FreshDiskANN uses a small, fast storage system (a memory buffer) for new data points and a large, slower storage system (disk) for older data points. Once the memory buffer hits capacity, it merges its items into the index on disk, maintaining data freshness. When searching for retrieval candidates, the algorithm searches over both storage units and combines the results. It periodically performs maintenance operations, such as compaction and splitting, which optimize the index. Its flat structure is also well suited to caching and parallel processing, which increase speed and recall.
- A big driver for Pinecone is the ability to provide read-after-write guarantees (when data is inserted, the indexes are updated before the write is acknowledged).
- The massive datasets that drive LLM applications are fairly static, and many do not need read-after-write guarantees.
- The market is highly segmented on the importance of speed/latency of reindexing. If you are doing a recommendation system, you might only recompute all your embeddings once a day, and in many cases this is totally fine. For example, Amazon books does something like this — once a day updates.
- However, there are other use cases where it matters much more to incrementally reindex quickly. Gong has a set of LLM features for analyzing sales calls, and does vector search across a company’s recent calls as part of this. Many sales leaders use Gong to analyze a sales call almost immediately after it happens. They therefore care much more that, once a new call happens, it is rapidly reflected in the vector index.
- Similarly — many tools in code generation with LLMs have to be able to immediately retrieve recently written code in the context of their generations. If your vector index is one day stale that would be quite bad.
- On a smaller scale, a lot of LLM applications do RAG with Pgvector or some other OSS option. Most of the big companies tend to go with Pinecone, Radiant, etc.
- Some companies are combining something like Pgvector and ElasticSearch or Pinecone and ElasticSearch, sometimes even building their own ranker on top.
- Pinecone has captured the market of recommender systems, and its success with LLM workloads remains to be seen.
- Rapid re-indexing is one of the primary drivers of whether someone may lean towards a high-end solution like Pinecone versus a “naive” solution such as a dockerized or self-hosted Pgvector or Redis deployment.
Alternative Solutions:
- Weaviate (Open Source)
- Milvus (Open Source)
- FAISS (Open Source)
- Chroma (Open Source)
- Qdrant (Open Source) → notable for strong engineering.
Vector Search in current databases
Many databases are now rushing to support vector search by integrating a vector index into their core offering. A common strategy is to take a well-known vector-indexing algorithm and bolt it onto an existing (not vector-native) architecture.
- MongoDB Atlas Vector Search
- ElasticSearch
  - Lucene currently uses the hierarchical navigable small world (HNSW) algorithm to index vectors. At a high level, HNSW organizes vectors into a graph where similar vectors are likely to be connected.
  - Because data is stored on disk, Elasticsearch will allow data sets that are larger than the total amount of RAM available on the local host, and performance will degrade as the ratio of the HNSW data that can fit in the page cache decreases.
Interesting new solutions
- Lance: a very interesting project, unique in design (they differentiate on data storage with their Lance format, and they have a serverless architecture). Typically, vector databases store the index in memory, requiring machines with a lot of RAM. There are ways to reduce the memory usage, e.g. with quantization, but more recently disk-first databases (LanceDB) and algorithms (DiskANN) are becoming more popular as they scale better and are cheaper.
- TurboPuffer: builds serverless retrieval on object storage: vector search, lexical search, no configuration, no sizing, just retrieval. The team is very interesting and technical. Separating storage and compute allows it to scale in response to load, without complex data migration or replication techniques. TurboPuffer commits ingested data directly to highly reliable, low-cost object storage. It is built on top of familiar components like RocksDB/LSM trees and uses Nginx as a load balancer. These components external to the database system introduce several failure scenarios, but the LLM space may not even need higher nines of availability. The project is still evolving, with the data currently hosted in Google Cloud in `us-central` (Iowa, USA), expanding to more regions next year.
Lexical and Semantic Search
Vector databases have grown in popularity, especially in the Generative AI space. However, in many cases, traditional keyword search can yield more relevant results and increased user satisfaction. This is because ranking based on metrics like cosine similarity causes results with a higher similarity score to appear above partial matches that may contain the input keywords, which reduces the quality of the search results for the end user. On the other hand, pure keyword search has its own limitations. Most well-known search databases (Elasticsearch, OpenSearch, etc.) use BM25 as an indexing algorithm. BM25 was introduced by Robertson and Walker in 1994 as an improvement over the previous Okapi BM11 algorithm.
It produces a sparse vector by considering keyword term frequencies in relation to their inverse document frequency (IDF). In contrast, vector databases typically encode and store text as embeddings represented by dense vectors. This is typically done via a bi-encoder model like BERT, which produces a sentence embedding for each document; a pair of embeddings can then be compared to produce a cosine similarity score. The failure mode of keyword search is the opposite: the user enters a term that is semantically similar to the stored data but shares no keywords with it, and a lexical search engine cannot find the correlation. A very good analysis is provided here.
Hence, there is a trade-off, and in most use cases we need to combine lexical and semantic search. To effectively perform hybrid search, it becomes necessary to combine search result candidates obtained via BM25 (keyword search) and cosine similarity (vector search), which requires a cross-encoder. That allows two sentences to be simultaneously passed to an encoder model like BERT. Unlike the bi-encoder that’s used to produce sentence embeddings, a cross-encoder doesn’t produce embeddings; rather, it allows us to classify a pair of sentences by assigning them a score between 0 and 1, via a softmax layer. This is termed re-ranking, and is a very powerful approach to obtaining results that combine the best of both worlds (keyword + vector search).
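A sketch of the re-ranking step (sentence-transformers’ CrossEncoder and the ms-marco model name are assumptions, and the BM25/vector candidates are hard-coded placeholders standing in for real retrieval results):

```python
# Hybrid-search re-ranking: merge BM25 and vector-search candidates, then let a
# cross-encoder score each (query, document) pair jointly.
from sentence_transformers import CrossEncoder

query = "how do I reset my password"
bm25_hits = ["Password reset instructions", "Billing FAQ"]
vector_hits = ["Recovering access to your account", "Team permissions overview"]
candidates = list(dict.fromkeys(bm25_hits + vector_hits))  # merge, drop duplicates

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

for doc, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```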
Market opportunity
- Typical vector databases store the index in memory, requiring machines with a lot of RAM. There are ways to reduce the memory usage, e.g. with quantization. Most recently, disk-first databases (LanceDB) and algorithms (DiskANN) are becoming more popular as they scale better and they are cheaper.
- Provide the ability for both lexical search and semantic search. You need different algorithms because naive filtering can disconnect the graph, and traversing the graph adds some delay, so it is hard to do lexical search directly on the graph.
- People may prefer unified solutions at scale. If you’re already using an existing data store like Elasticsearch, Redis or PostgreSQL, it’s pretty straightforward to utilize their vector indexing and search offerings without having to resort to a new technology. Many users would prefer this route because it does not require maintaining or paying for additional solutions.
- Purpose-built and specialized vector databases may slowly out-compete established databases in areas that require semantic search. Mainly because they are innovating on the most critical component when it comes to vector search — the storage layer. Indexing methods like HNSW and ANN algorithms are well-documented in the literature and most database vendors can roll out their own implementations. Purpose-built vector databases have the benefit of being optimized to the task at hand (not to mention that they’re written in modern programming languages like Go, Rust, etc.), and for reasons of scalability and performance, will most likely win out in this space in the long run.
- Serverless retrieval databases that address all modalities. This can be enabled by separating compute (vector clustering) from storage (low-latency, always-fresh vector search over a large number of records at low cost), so that you are not managing re-ranking pipelines and multiple databases.
Summary
As we navigate the intricate world of vector databases together, your insights and experiences are invaluable. If you have thoughts, questions, or unique perspectives on vector databases, or if there are specific aspects you’d like to discuss further, I invite you to reach out. Let’s continue this conversation and explore these technological frontiers together. Please feel free to leave your comments below or contact me directly. Your input not only enriches our collective understanding but also sparks innovative ideas and solutions.