Introduction to AI-Native Vector Databases

Tony Siciliani
6 min read · Jan 19, 2024


Unlike traditional relational databases that organize data in rows and columns, vector databases represent information as ‘vectors’, a format particularly suited for capturing nuanced relationships between data points. This novel approach not only facilitates efficient similarity searches but also significantly enhances the capabilities of modern AI tasks, such as Retrieval Augmented Generation (RAG), recommendation systems, and video or image recognition.

Vectors and Embeddings

Embeddings are numerical representations of words, phrases, or other objects that capture their meaning and relationships. In large language models (LLMs), words or sentences with similar meanings will have similar embeddings. Likewise, related images (e.g. pictures of cars) will have analogous embeddings. Embeddings are typically represented as vectors, which are multi-dimensional arrays of numbers used as coordinates in a high-dimensional space. Vectors can represent embeddings of anything from text to complex data like molecular structures.
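As a toy illustration, consider the sketch below. The 4-dimensional vectors are made up for the example (real models produce hundreds of dimensions), but they show the key property: related concepts point in similar directions.

```python
import numpy as np

# Hypothetical embeddings, invented for illustration only.
car = np.array([0.9, 0.1, 0.3, 0.0])
truck = np.array([0.8, 0.2, 0.4, 0.1])
banana = np.array([0.0, 0.9, 0.1, 0.8])

def cosine_similarity(a, b):
    # Close to 1.0 for vectors pointing the same way, near 0.0 for unrelated ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(car, truck))   # high: related concepts
print(cosine_similarity(car, banana))  # low: unrelated concepts
```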

A vector database (VDB) is a comprehensive and feature-rich AI-native data management system, specifically designed for handling large collections of vectors while fully supporting CRUD (Create-Read-Update-Delete) operations. VDBs let users search by nearest neighbors rather than by exact matches or substrings, as traditional databases do.

Similar records will be ‘closer’ (i.e. neighbors) in high-dimensional space, and therefore easier to find. A variety of techniques can be used to measure ‘proximity’, such as the classic Euclidean distance (L2), Manhattan distance (L1), dot product, or cosine similarity.
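All four metrics are a one-liner with NumPy. The toy vectors below were chosen to highlight why the choice of metric matters: b points in exactly the same direction as a, so their cosine similarity is 1.0 even though the Euclidean distance between them is non-zero.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

l2 = np.linalg.norm(a - b)           # Euclidean (L2) distance
l1 = np.sum(np.abs(a - b))           # Manhattan (L1) distance
dot = np.dot(a, b)                   # dot product (higher = more similar)
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity

print(l2, l1, dot, cos)  # non-zero distances, yet cosine similarity is 1.0
```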

A similarity search runs the query through the embedding model to generate a query vector, which is then compared to the vectors stored in the VDB. The results go back through the model, which converts cryptic mathematical objects into something more human-readable, such as… words.

Once we store potentially millions or even billions of vectors, the next question would be, how do we implement efficient search mechanisms to make all this technically viable?

Vector search algorithms

The K-Nearest Neighbors (k-NN) proximity search algorithm identifies the k most similar data points to a given query point. However, k-NN turns out to be computationally expensive, as it calculates the similarity between a query vector and every entry in a VDB, which could theoretically store billions of embeddings.
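A brute-force k-NN sketch makes that cost concrete: every single query scans the entire collection (toy random data and Euclidean distance, for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.random((10_000, 64))  # 10k stored vectors of dimension 64
query = rng.random(64)

def knn(query, database, k):
    # Distance from the query to EVERY stored vector: O(n * d) per query.
    distances = np.linalg.norm(database - query, axis=1)
    # Indices of the k closest vectors, nearest first.
    return np.argsort(distances)[:k]

neighbors = knn(query, database, k=5)
```

With billions of stored vectors, this linear scan per query becomes intractable, which is exactly what motivates the approximate methods below.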

A solution to this problem would be to pre-organize vectors by similarity, so that the search will start by only looking at the closest elements and not waste time calculating the distances for all data points. This is basically what Approximate Nearest Neighbor (ANN) does, with the trade-off of sacrificing some accuracy for significant gains in speed.

Efficient and scalable similarity search in high-dimensional spaces, involving massive amounts of data, requires a few key components, such as:

  1. A framework for quick neighbor searches such as ANN.
  2. Enhanced search efficiency using some type of vector indexing.
  3. Vector compression to help reduce resource consumption in large-scale environments.

Vector compression and vector indexing are complementary techniques that are essential in scenarios requiring both rapid processing and efficient scaling. Let’s look at them in more detail.

Vector indexing

Vector indexing is a method for organizing vectors in a way that allows for fast search and retrieval, which is especially important when dealing with large-scale, high-dimensional data. There are various options for indexing. Most VDBs support HNSW (Hierarchical Navigable Small World), which works on multi-layered graphs, and is known for its high search efficiency and speed. A series of layers acting as filters (for data skipping) are used to build a list of nearest neighbors quickly and efficiently.

The HNSW index type scales well to large datasets. For smaller datasets, VDBs usually provide a more suitable, lightweight flat index type.
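The full HNSW algorithm is involved, but the core idea of a single ‘navigable’ layer can be sketched: link each vector to its closest neighbors, then greedily walk the graph toward the query. This is a deliberately simplified, single-layer illustration, not the real hierarchical algorithm; the graph is built by brute force, and the parameter name M merely mirrors HNSW’s neighbor-count parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((500, 16))
M = 8  # links per node, loosely analogous to HNSW's M parameter

# Build a crude proximity graph: each node links to its M nearest neighbors.
# (HNSW builds this incrementally with heuristics; brute force is fine here.)
dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
graph = [np.argsort(dists[i])[1:M + 1].tolist() for i in range(len(data))]

def greedy_search(query, entry=0):
    # Repeatedly hop to whichever linked neighbor is closer to the query;
    # stop at a local minimum -- the approximate nearest neighbor.
    current = entry
    current_dist = np.linalg.norm(data[current] - query)
    while True:
        improved = False
        for nbr in graph[current]:
            d = np.linalg.norm(data[nbr] - query)
            if d < current_dist:
                current, current_dist, improved = nbr, d, True
        if not improved:
            return current

query = rng.random(16)
approx = greedy_search(query)
```

Each hop only inspects M candidates instead of the whole dataset, which is where the speed comes from; the real HNSW adds coarser upper layers so the walk starts near the right region instead of at an arbitrary entry point.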

Vector compression

Vector compression is a method for reducing the amount of disk space required to store vectors. This is important for large-scale applications, where vector collections can become very large. One such technique is PQ (Product Quantization), which breaks up a high-dimensional vector into a set of smaller sub-vectors in a lower-dimensional space, allowing for efficient storage and fast ANN searches in large datasets.

Clustering the sub-vectors using a method like k-means results in a set of centroids for each sub-vector set. Each sub-vector is then replaced by the index of the nearest centroid, effectively compressing the data. When a vector needs to be reconstructed during a search operation, each indexed centroid is used to approximate the original sub-vectors. These approximated sub-vectors are then concatenated to form an approximate version of the original high-dimensional vector.
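These steps, training one codebook per segment, encoding each vector as centroid indices, and reconstructing, can be sketched in NumPy. The sizes are toy values and the k-means is deliberately minimal; production systems use a library such as Faiss.

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.random((1000, 8))  # 1000 vectors of dimension 8
m, k = 4, 16                     # 4 sub-vectors of dim 2, 16 centroids each

def kmeans(points, k, iters=10):
    # Tiny k-means, just enough to train a codebook for the demo.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids

subs = np.split(vectors, m, axis=1)       # split each vector into m segments
codebooks = [kmeans(s, k) for s in subs]  # one codebook per segment

# Encode: each vector becomes m small integers (here 4 codes instead of 8 floats).
codes = np.stack([
    np.argmin(np.linalg.norm(s[:, None] - cb[None], axis=2), axis=1)
    for s, cb in zip(subs, codebooks)
], axis=1)

# Reconstruct: concatenate the centroids named by each code.
approx = np.hstack([codebooks[i][codes[:, i]] for i in range(m)])
```

Each vector is now stored as m centroid indices rather than its full float coordinates, and the reconstruction is only approximate, which is the accuracy-for-space trade-off the text describes.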

Now that we briefly covered the important concepts, let’s switch to a more practical setting.

Example VDBs

Some common VDBs are Chroma, Milvus, Qdrant, and Weaviate, which are all open-source, with available paid subscription options. All of these VDBs support HNSW indexing.

For an initial comparison in terms of raw performance, community presence, and other criteria, see Vectorview. In the rapidly shifting landscape of AI, both comparative analyses and any current benchmark would be transient at best.

As an example, here’s a practical application of Chroma v0.4.22 (most recent version as of this writing) within a Databricks Python notebook. Basically, we upload a PDF to a Databricks volume and segment it to adhere to LLM input size limits. Chroma automatically converts text into embeddings using the all-MiniLM-L6-v2 model by default, although it also offers lightweight wrappers around popular embedding providers like OpenAI or Hugging Face.

The demonstration code below uses the default embedding, HNSW indexing, Cosine similarity, and the langchain library. The VDB is tested at the end by returning a couple of relevant documents.

```shell
pip install chromadb langchain pypdf
```

To verify that the embeddings were created:

Closing thoughts

Although we described the use of VDBs with embeddings and vectors, keep in mind that VDBs are not limited to storing vectors. While primarily designed to handle and optimize vector data for similarity searches, VDBs can also store a variety of data types, like metadata, documents, and multimedia files. The reason for that versatility is that VDBs need to cater to complex applications that may require access to multiple types of data.

Nevertheless, VDBs form the backbone of semantic search, although it is possible, to a certain extent, to implement it with alternatives such as vector libraries. The pros and cons of using a library vs. a database are beyond the scope of this article; here’s a Weaviate blog that explores some of the trade-offs.

VDBs combine AI’s advanced analytical capabilities with efficient data storage and retrieval methods, addressing complex data analysis problems beyond the reach of traditional relational databases or text search engines. As they continue to evolve and mature, we can expect to see more innovative applications harnessing their power.
