Vector Databases: Understanding the Algorithm (part 3)

David Gutsch
15 min read · Aug 3, 2023


Introduction

In parts one and two of this series, we embarked on a journey through the fascinating world of vector databases and embeddings. We started by exploring the concept of vector databases, a revolutionary technology that transforms data into vectors in a multi-dimensional space, enabling a nuanced understanding of data and the ability to compare almost anything. We then delved deeper into the art of embeddings, a fundamental concept in deep learning that allows us to capture rich context in ones and zeros and transform raw data into a machine-readable and indexable format. These two components, vector databases and embeddings, are the building blocks of a powerful AI tool that can process and understand data in ways that were once the exclusive domain of humans. The final piece of the puzzle is the way vector embeddings are stored in and retrieved from the database.

As we move into the third part of this series, we turn our attention to the algorithms that organize our comparable embeddings into a searchable database. The choice of algorithm used in a vector database is a critical factor that can significantly impact its performance. Different algorithms have different strengths and weaknesses, and the choice of algorithm can have a profound effect on the latency, throughput, build time, accuracy, and re-indexing capabilities of a vector database. Understanding these algorithms is not just a matter of academic interest; it’s a practical necessity for anyone looking to get the best performance out of their vector database. Different products use different algorithms, and the performance of these products can vary widely depending on the size and shape of your data. In the following sections, we will delve into these algorithms, explore their performance trade-offs, and provide you with the knowledge you need to make an informed choice when evaluating vector database and vector index solutions.

Introducing the Families of Algorithms

Now that we know the importance of choosing the right algorithm for our vector database, let’s introduce the types of algorithms we have at our disposal. Each family of algorithms (tree-based, quantization, hashing, clustering, and graph methods) comes with its own set of performance trade-offs. These trade-offs can significantly impact the latency, throughput, build time, accuracy, and re-indexing capabilities of a vector database. Note: some argue that there are only three distinct categories: trees, hashes, and graphs. I purposefully added clustering and quantization methods as well, since I think they are unique enough to merit their own explanation, but know that some folks would categorize quantization as a hashing method and clustering as a graph method, which is also fair.

Tree-based methods are efficient for low-dimensional data and offer exact nearest neighbor search. However, their performance generally degrades in high-dimensional spaces due to the “curse of dimensionality”. They also require substantial memory and are less efficient for large datasets, leading to longer build times and higher latency.

Quantization methods are memory-efficient and offer fast search times by compressing vectors into compact codes. However, this compression can lead to a loss of information, potentially reducing search accuracy. Additionally, these methods can be computationally expensive during the training phase, increasing build time.

Hashing methods are fast and relatively memory-efficient, mapping similar vectors to the same hash bucket. They perform well with high-dimensional data and large-scale datasets, offering high throughput. However, the quality of search results can be lower due to hash collisions leading to false positives and negatives. Choosing the right number of hash functions and hash tables is imperative, as both significantly affect performance.

Clustering methods can expedite search operations by narrowing down the search space to a specific cluster, but the accuracy of the search results can be influenced by the quality of the clustering. Clustering is typically a batch process, which means it’s not well-suited to dynamic data where new vectors are constantly being added, leading to frequent re-indexing.

Graph methods provide a good balance between accuracy and speed. They are efficient for high-dimensional data and can provide high-quality search results. However, they can be memory-intensive, as they need to store the graph structure, and the construction of the graph can also be computationally expensive.

The choice of algorithm in a vector database depends on the specific requirements of the task, including the size and dimensionality of the dataset, the available computational resources, and the acceptable trade-off between accuracy and efficiency. It’s also worth noting that many modern vector databases use hybrid methods, combining the strengths of different methods to achieve high speed and accuracy. Understanding these algorithms and their trade-offs is crucial for anyone looking to get the best performance out of their vector database. Now let’s take a look at an example of a popular algorithm in each one of these categories.

Annoy (Approximate Nearest Neighbors Oh Yeah)

Among the tree-based algorithms used in vector databases, one stands out for its efficiency and simplicity: Annoy, or “Approximate Nearest Neighbors Oh Yeah”. Annoy is a C++ library with Python bindings, designed at Spotify, that searches for points in space that are close to a given query point. It creates large read-only, file-based data structures that are memory-mapped, which allows multiple processes to share the same data. I admit I am kind of cheating by including this algorithm, as it is a vector index rather than an algorithm in a vector database. That being said, I found it to be the most compelling tree-based approach, and one that scales unusually well for a tree-based algorithm.

Annoy operates by using a forest of random projection trees to perform efficient approximate nearest neighbor search. The algorithm projects points onto random hyperplanes and partitions the space according to which side of the hyperplane the points fall on. This process is repeated recursively, resulting in a binary tree of partitions. A forest of such trees is created, each built with a different random seed. When a query point is received, the algorithm traverses down each tree in the forest to find the leaf node where the point would belong. The nearest neighbors are then approximated by collecting the points in the leaf nodes found across all trees and returning the top-k points from this list that are closest to the query point.
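To make this concrete, here is a minimal sketch of building and querying an Annoy index through its Python bindings; the dimensionality, metric, and tree count below are arbitrary illustration values, not recommendations.

```python
# A minimal sketch of building and querying an Annoy index.
# dim, the metric, and the tree count are illustrative values only.
import random
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, 'angular')  # 'angular' approximates cosine distance

# Add 10,000 random vectors, each identified by an integer id.
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

# Build the forest: more trees means better accuracy but a bigger index.
index.build(50)
index.save('vectors.ann')  # file-based index, memory-mapped when loaded

# Top 10 approximate neighbors of item 0; the optional search_k argument
# controls how many nodes are inspected per query (accuracy vs. latency).
print(index.get_nns_by_item(0, 10))
```

Because the saved index is memory-mapped, several worker processes can load the same file and share it without duplicating memory.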

Annoy is particularly well-suited for high-dimensional data, where exact nearest neighbor search can be prohibitively expensive. It is used in production at Spotify for music recommendations, where it helps find similar songs based on their audio features. It’s worth noting that Annoy provides approximate nearest neighbors, which means it may not always return the exact nearest neighbors, but it does offer great accuracy for an approximation method. That being said, accuracy can be influenced by the number of trees in the forest and the number of points inspected during the search, both of which can be tuned and adjusted based on the specific requirements of the task.

Annoy’s speed and low memory footprint make it a strong choice for handling high-dimensional data and large databases. Despite being an approximate method, it often returns results close to the true nearest neighbors. There are a few trade-offs to consider. Building the index can take a significant amount of time, especially for large datasets. And because Annoy partitions the space with random projection trees, the index cannot be updated with new data and must be rebuilt from scratch. Depending on the size of your dataset or how often your data changes, rebuilding the index might be prohibitively expensive.

Visualization of the random projection forest structure

Product Quantization

Quantization is a process used to transform large, complex datasets into simpler, more manageable forms. Product Quantization (PQ) is a method introduced by Hervé Jégou, Matthijs Douze, and Cordelia Schmid in 2011, and in the context of vector databases it is used to efficiently store and search high-dimensional vectors. While technically PQ is a technique for creating an index rather than a full-fledged vector database algorithm, its importance and effectiveness in handling high-dimensional data make it a crucial player in the field and a component of many hybrid algorithms. It is particularly notable for the balance it strikes between memory usage and search quality, which makes it well-suited for large-scale datasets where memory is a concern.

Product Quantization operates by dividing the high-dimensional space into low-dimensional subspaces using Cartesian products, and then quantizing each of these subspaces independently. The original high-dimensional vector space is divided into distinct subspaces, and for each of these subspaces a separate codebook is learned. Each codebook contains centroids that are representative of the data distribution in that subspace. High-dimensional vectors are then represented as a concatenation of codes, where each code corresponds to the nearest centroid in the respective subspace. This results in a compact and efficient representation of the high-dimensional vectors. Because these compact codes can be used as keys to quickly retrieve vectors during a search operation, PQ is frequently paired with an inverted file (IVF) index.
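To illustrate those mechanics, here is a minimal sketch of learning per-subspace codebooks and encoding vectors with NumPy and scikit-learn’s KMeans; the dataset size, subspace count, and codebook size are illustrative choices.

```python
# A minimal sketch of Product Quantization: per-subspace codebooks + encoding.
# The dataset size, M, and K below are illustrative values, not recommendations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
data = rng.normal(size=(5_000, 128)).astype(np.float32)  # 5k vectors, 128-dim

M = 8                         # number of subspaces
K = 256                       # centroids per subspace: each code fits in a byte
sub_dim = data.shape[1] // M  # 16 dimensions per subspace

# Learn one codebook (a set of K centroids) per subspace.
codebooks = [
    KMeans(n_clusters=K, n_init=4, random_state=0)
    .fit(data[:, m * sub_dim:(m + 1) * sub_dim])
    for m in range(M)
]

# Encode: each 128-float vector (512 bytes) becomes M one-byte codes (8 bytes).
codes = np.stack(
    [codebooks[m].predict(data[:, m * sub_dim:(m + 1) * sub_dim]) for m in range(M)],
    axis=1,
).astype(np.uint8)
print(codes.shape)  # (5000, 8)
```

At query time, PQ-based indexes typically avoid decompressing the codes: they precompute a table of query-to-centroid distances per subspace and sum table lookups, a trick known as asymmetric distance computation.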

Product Quantization is particularly well-suited for large-scale datasets in high-dimensional spaces where memory usage and speed are most important. On the positive side, it provides a very compact representation of high-dimensional vectors, reducing memory usage by up to 97–98%. It allows for fast nearest neighbor search, especially when combined with an inverted file index, which further narrows the portion of the database that must be scanned. On the downside, PQ, like other methods for approximate nearest neighbor search, is known to have lower recall and accuracy. Learning the codebooks can be computationally expensive, especially for large datasets, which increases preprocessing time, and sufficient data drift requires re-indexing. There is a tradeoff between the quality of the results and the compactness of the representation: using more centroids will generally give better results but will also require more memory. Additionally, the division of the original space into subspaces is fixed, which may not always be optimal for a given data distribution.

Product Quantization is a powerful method for approximate nearest neighbor search in high-dimensional spaces, offering a tradeoff biased towards memory efficiency and speed at the cost of marginally lower search quality. It may not be the best choice for every scenario, so the specific requirements of the task should be considered before choosing it. It is often used as one component of a more complex hybrid algorithm in vector databases; for instance, PQ combined with inverted file indexes (IVFPQ) is a common optimization, and PQ can likewise be paired with graph indexes such as HNSW.

Product Quantization visual

Locality-Sensitive Hashing (LSH)

In the realm of hashing algorithms utilized in vector databases, one shines for its efficiency and simplicity: Locality-Sensitive Hashing, or LSH. LSH is a method designed to handle high-dimensional data by hashing input items in such a way that similar items map to the same “buckets” with high probability, while dissimilar items map to different buckets with high probability. It’s a popular technique in machine learning, data mining, and information retrieval, especially when dealing with large-scale, high-dimensional data. It is particularly notable for its scalability and its ability to deliver approximate nearest neighbor results efficiently in high-dimensional spaces.

In the process of transforming high-dimensional data for efficient similarity search, several steps are involved. First, text is encoded into sparse matrices using a combination of shingling and one-hot encoding. Then the MinHashing algorithm transforms the sparse matrices into dense matrix embeddings called signatures. Finally, these signatures are partitioned into sub-vectors, each of which is passed through a hash function into a bucket. If sub-vectors of different signatures land in the same buckets, those signatures are considered candidate pairs, a candidate pair simply being a pair of vectors that are potential nearest neighbors.
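Here is a minimal, pure-Python sketch of that pipeline (shingling, MinHash signatures, and banding into buckets); the shingle size, signature length, and band count are illustrative, and a real system would use a library with vectorized hashing.

```python
# A minimal sketch of the LSH pipeline described above:
# shingling -> MinHash signatures -> banding into buckets -> candidate pairs.
# Shingle size, signature length, and band count are illustrative choices.
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def candidate_pairs(signatures, bands=16):
    """Split each signature into bands; docs sharing any band bucket are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",  # near-duplicate of "a"
    "c": "vector databases index embeddings for similarity search",
}
sigs = {k: minhash_signature(shingles(v)) for k, v in docs.items()}
print(candidate_pairs(sigs))  # "a" and "b" will very likely share a bucket
```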

As far as algorithmic trade-offs are concerned, LSH is designed to handle large-scale, high-dimensional data, making it suitable for large datasets. It can provide a significant speedup over traditional methods for nearest neighbor search, especially in high-dimensional spaces. LSH provides approximate results, which means trading away some accuracy. The performance of LSH depends heavily on the choice of parameters, such as the number of hash functions and the size and number of buckets. Choosing these parameters can be challenging and may require domain knowledge and experimentation. Additionally, LSH can require a large amount of memory to store the hash tables, especially for large datasets and high-dimensional data.

In terms of latency, LSH can provide fast query times due to its hashing mechanism, though throughput can be degraded by the number of hash tables and the size of the buckets. The build time is typically fast, but it can increase with the number of hash functions used. The accuracy of LSH is often lower than exact methods, but it can be improved by using more hash tables or adjusting the bucket size. Lastly, re-indexing in LSH can be done by adding new items to the existing hash tables, but if the data distribution changes significantly, the hash functions may need to be redefined, which requires rebuilding the entire index.

LSH hashing algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a popular clustering algorithm used in machine learning and data mining. Unlike many other clustering algorithms, DBSCAN does not require the user to specify the number of clusters in advance and can discover clusters of arbitrary shape. It’s particularly useful in situations where there is a lot of noise or outliers in the data, and where the clusters are density-defined, meaning there are dense regions of data points separated by regions of lower density. It works by defining a cluster as the maximal set of density-connected points.

DBSCAN operates on the idea of density reachability and density connectivity. It starts with an arbitrary point in the dataset, and if there are at least ‘minPts’ points within a given radius ‘eps’ of that point, a new cluster is created. ‘Eps’ stands for epsilon, a user-defined input denoting the maximum distance between two points for them to be considered neighbors in a cluster, while ‘minPts’ is the minimum number of data points required to form a dense region. The algorithm then iteratively adds all directly reachable points within the ‘eps’ radius to the cluster. This process continues until no further points can be added. The algorithm then proceeds to the next unvisited point in the dataset and repeats the process until all points have been visited.

DBSCAN does have its trade-offs. In terms of latency and throughput, it can be relatively efficient for smaller datasets, but its performance degrades for very large datasets as it needs to compute distances between many pairs of points. Build time can be quite high for large datasets and high-dimensional data. In terms of accuracy, DBSCAN can discover clusters of arbitrary shape, but it is sensitive to the settings of ‘eps’ and ‘minPts’; if these parameters are not set properly, it might fail to identify clusters correctly. Lastly, re-indexing or updating the model with new data is challenging: if new data points are added, the entire clustering process needs to be rerun, which may not be practical for dynamic datasets that are updated frequently.
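As a quick illustration of the mechanics described above, here is a minimal sketch running scikit-learn’s DBSCAN on synthetic 2-D data; the eps and min_samples values are illustrative and would need tuning for real embeddings.

```python
# A minimal sketch of DBSCAN on synthetic 2-D data using scikit-learn.
# eps and min_samples are illustrative; real embeddings would need tuning.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=0.0, scale=0.3, size=(100, 2))   # dense region 1
blob2 = rng.normal(loc=3.0, scale=0.3, size=(100, 2))   # dense region 2
noise = rng.uniform(low=-2.0, high=5.0, size=(20, 2))   # sparse outliers
X = np.vstack([blob1, blob2, noise])

# eps: max distance for two points to be neighbors; min_samples: minPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

print("clusters found:", len(set(labels) - {-1}))  # label -1 marks noise
print("noise points:", int((labels == -1).sum()))
```

Note that the number of clusters is never specified; it falls out of the density parameters, which is exactly the property described above.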

DBSCAN clustering visualization

HNSW (Hierarchical Navigable Small World)

Hierarchical Navigable Small World (HNSW) is a graph-based algorithm used for efficient nearest neighbor search in high-dimensional spaces, and is among the top-performing indexes for vector similarity search. It’s a part of a broader family of Navigable Small World (NSW) graphs, which are designed to provide a balance between search efficiency and accuracy. HNSW is particularly useful in situations where high accuracy is required, speed is preferred, and some computational expense can be afforded. HNSW works by constructing a layered graph structure, where each layer is a subset of the previous one, and search starts from the top layer and navigates down to the bottom layer.

HNSW constructs a hierarchical graph where each layer corresponds to a subset of the data points, with the topmost layer containing the fewest points. The algorithm assigns each point a maximum layer at random, drawn from an exponentially decaying distribution, so higher layers are exponentially sparser. Then, point by point, it connects each new point to its nearest neighbors in its own layer and the layers below. This results in a graph where each node is connected to other nodes that are close in the high-dimensional space. When a query comes in, the algorithm starts from a node in the topmost layer and navigates down the layers, always moving towards nodes that decrease the distance to the query point. This process continues until the algorithm reaches the bottom layer, where it performs a finer-grained search among the remaining candidate nodes.
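For a feel of how these parameters surface in practice, here is a minimal sketch of building and querying an index with the hnswlib library; dim, M, ef_construction, and ef below are illustrative values rather than tuned settings.

```python
# A minimal sketch of building and querying an HNSW index with hnswlib.
# dim, M, ef_construction, and ef are illustrative values, not tuned settings.
import numpy as np
import hnswlib

dim = 128
num_elements = 10_000
data = np.random.random((num_elements, dim)).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)   # also supports 'ip' and 'cosine'
# M: connections per node; ef_construction: candidate list size at build time.
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(50)  # query-time candidate list: higher ef -> better recall, slower
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10): ten approximate neighbors per query
```

M and ef_construction govern how dense the graph is and how carefully it is wired at build time, which is where the memory cost and long build times discussed below come from.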

In terms of latency, HNSW can provide fast query times due to its hierarchical structure, which allows it to quickly narrow down the search space. Throughput can be affected by the number of layers and the number of connections per node. The build time can be quite high, especially for large datasets, as the algorithm needs to construct the graph and determine the connections between nodes; it is still much more efficient than clustering and tree-based algorithms. In terms of accuracy, HNSW can provide high-quality results, but its performance is sensitive to the choice of parameters, such as the number of layers and the number of connections per node. In general, HNSW strikes a great balance between near-perfect accuracy and low enough latency to thrive in production. Lastly, re-indexing or updating the model with new data can be challenging: with significant data drift, the entire graph may need to be reconstructed, which may not be practical for extremely dynamic datasets that are updated frequently.

HNSW visualization

Part 3 Recap

In this article we navigated the various algorithms that power vector databases. We emphasized the importance of choosing the right algorithm for a vector database, as it can significantly impact performance, including latency, throughput, build time, accuracy, and re-indexing requirements. We introduced the families of algorithms available for vector databases, namely tree-based, quantization, hashing, clustering, and graph methods, each of which comes with its own set of performance trade-offs. Then we highlighted that many modern vector databases use hybrid methods, combining the strengths of different approaches to achieve high speed and accuracy. Finally, we took an in-depth look at several specific algorithms, including Annoy, Product Quantization, Locality-Sensitive Hashing, DBSCAN, and HNSW, comparing how their tradeoffs will affect their use on production datasets. While each algorithm has its strengths and weaknesses, the choice should depend on the specific requirements of the task, including the size and dimensionality of the dataset, the available computational resources and budget, and the acceptable trade-off between accuracy and efficiency.

Conclusion

Well, we’ve reached the end of our series on vector databases, and if you are still reading, I want to profusely thank you for the attention you’ve given this series. As a thank you, I’d like to boil this topic down into the most concise explanation I can offer. A vector database comprises three parts:

  • A procedure for encoding text (or other data for that matter) into vector embeddings that our computers can compare and “reason about”.
  • A mechanism for comparing these vector embeddings to one another (i.e., similarity measures).
  • An algorithm that allows us to use these comparisons to efficiently search, store, and retrieve our vector embeddings for use in production systems.

In this three-part series on vector databases, we’ve embarked on a comprehensive journey through the fascinating world of vector databases and embeddings. We began with the foundational concepts: what a vector database is, what a vector embedding is, and how we use similarity search to compare embeddings. Then we dove deep into the art of creating embeddings by tokenizing words and using transformer models to translate the tokens into rich multi-dimensional embeddings. In this final part, we turned our attention to the intricate algorithms that enable us to search and store those embeddings. Throughout, we’ve aimed to demystify these procedures and provide practical insights for those looking to leverage vector databases in their work. From the basics of vector transformation to the nuanced trade-offs of various embeddings and algorithms, we’ve strived to offer a valuable guide for both newcomers and seasoned professionals in the field.

I wrote this series to be a practical explanation of “the How” for vector databases. I’m not done writing on this topic, nor on this year’s advances in AI. So if you are interested in the Why, other topics within AI, or the practical applications therein, PLEASE drop me a comment and let me know what you’re interested in.

Keep learning! I’ll see you soon!
