The Power of Vector Databases: Organizing Information at Lightning Speed

Shashank Vats
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
7 min read · May 7, 2023
Figure: visualization of a high-dimensional vector

In recent weeks, there’s been a surge in investor interest in vector databases. In the last few months (as of May 2023), the vector database startup Weaviate landed $50M in Series B funding, Pinecone raised a massive $100M in Series B funding at a $750M valuation, and Chroma, an open-source project, raised $18M for its embeddings database. But you might be wondering: what are vector databases?

Vector Databases

A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. These vectors are usually generated by applying some kind of embedding function to the raw data, such as text, images, audio, video, and others.

Vector databases can be defined as a tool that indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like metadata filtering and horizontal scaling.

Vector databases rely heavily on vector embeddings, a type of data representation that carries the semantic information an AI system needs to gain understanding and to maintain a long-term memory it can draw upon when executing complex tasks. So, before we go further with vector databases, let's understand what vector embeddings are!

Vector Embeddings

Vector embeddings are like a map, but instead of showing us where things are in the world, they show us where things are in something called vector space. Vector space is kind of like a big playground where each thing has its own spot to play. Imagine that we have a bunch of animals: a cat, a dog, a bird, and a fish. We can create a vector embedding for each animal by giving it a special location in the playground. The cat might be in one corner and the dog on the other side. The bird might be in the sky and the fish in the pond. This playground is a high-dimensional space, and each dimension corresponds to a different aspect of the animals: for example, fish have fins, birds have wings, and cats and dogs have legs. Another aspect is habitat: fish belong to the water, birds mostly to the sky, and cats and dogs to the land. Once we have these vectors, we can use mathematical techniques to group them based on their similarity, so that animals with similar characteristics, like cats and dogs, sit closer together in the vector space.

So, vector embeddings are like a map that helps us find the similarity between the things in vector space. Just like a map helps us find our way around the world, vector embeddings help us find our way around in the playground of vectors.
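To make the playground analogy concrete, here is a toy sketch in Python. The six dimensions and all the numbers are invented by hand purely for illustration; real embeddings are produced by a trained model and usually have hundreds or thousands of dimensions whose meaning is not directly readable.

```python
import math

# Hypothetical, hand-picked dimensions: [legs, wings, fins, land, sky, water]
animal_vectors = {
    "cat":  [1.0, 0.0, 0.0, 0.9, 0.1, 0.0],
    "dog":  [1.0, 0.0, 0.0, 1.0, 0.0, 0.1],
    "bird": [0.5, 1.0, 0.0, 0.3, 1.0, 0.0],
    "fish": [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
}

def cosine_similarity(a, b):
    """Return 1.0 for vectors pointing the same way, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(name):
    """Find the other animal whose vector is closest in direction to `name`."""
    others = (other for other in animal_vectors if other != name)
    return max(
        others,
        key=lambda other: cosine_similarity(animal_vectors[name], animal_vectors[other]),
    )

print(most_similar("cat"))
```

With these made-up numbers, the cat's closest neighbour is the dog, while the cat and the fish share no features at all and end up with a similarity of zero.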

As we have seen, the key idea is that embeddings that are semantically similar to each other have a smaller distance between them. To measure how similar they are, we can use vector distance functions such as Euclidean distance or cosine distance. However, comparing distances becomes daunting at scale, because a naive search has to calculate and compare the distance between the query vector and every other vector we have stored. This is why we have vector databases and vector libraries. The biggest question that arises here is: since both of them allow an efficient search through vectors, what is the difference between them? Let's find out!
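The naive approach described above can be sketched in a few lines of Python. The function names and the sample vectors are assumptions made for this example; the point is only to show the two distance functions and why scanning every vector costs work proportional to the size of the collection.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: depends on direction only, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k, distance):
    """Compare the query with every stored vector: one distance call per vector."""
    scored = [(distance(query, vector), index) for index, vector in enumerate(vectors)]
    scored.sort()
    return [index for _, index in scored[:k]]

# Illustrative 2-dimensional data (made up for this sketch)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
query = [0.9, 0.1]

print(brute_force_search(query, vectors, k=2, distance=euclidean_distance))
print(brute_force_search(query, vectors, k=1, distance=cosine_distance))
```

Note that `[1.0, 1.0]` and `[5.0, 5.0]` point in the same direction, so their cosine distance is (up to rounding) zero even though their Euclidean distance is large. Which metric is appropriate depends on how the embeddings were produced.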

Vector Databases vs Vector Libraries

Vector libraries store vector embeddings in in-memory indexes in order to perform similarity search. They have the following features and limitations:

  1. Store vectors only: Vector libraries store only the vector embeddings and not the objects they were generated from. This means that when we run a query, a vector library responds with the relevant vectors and object ids. This is limiting, since the actual information is stored in the object and not in the id. To solve this problem, we need to store the objects in secondary storage. We can then take the ids returned by the query and match them to the objects to understand the results.
  2. Index data is immutable: Indexes produced by vector libraries are typically immutable. This means that once we have imported our data and built the index, we cannot make any modifications (no new inserts, deletes or changes). To make any changes to our index, we need to rebuild it from scratch.
  3. Query during import limitation: Most vector libraries cannot be queried while the data is being imported. All of our data objects have to be imported first, and only then is the index built. This can be a concern for applications that need to import millions or even billions of objects.
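The three limitations above can be imitated with a small Python sketch. `TinyVectorLibrary` is a hypothetical class written for this article, not the API of FAISS, Annoy, or ScaNN; it simply behaves the way the list describes: it keeps vectors and ids only, refuses changes after the build, and cannot be queried before the build.

```python
import math

class TinyVectorLibrary:
    """Hypothetical stand-in for a vector library (vectors and ids only)."""

    def __init__(self):
        self._pending = []
        self._index = None  # becomes an immutable tuple once build() is called

    def add(self, object_id, vector):
        if self._index is not None:
            raise RuntimeError("index already built; rebuild from scratch to change it")
        self._pending.append((object_id, tuple(vector)))

    def build(self):
        self._index = tuple(self._pending)

    def search(self, query, k):
        if self._index is None:
            raise RuntimeError("cannot query before the index is built")
        ranked = sorted(self._index, key=lambda item: math.dist(query, item[1]))
        return [object_id for object_id, _ in ranked[:k]]

# Secondary storage for the actual objects, kept outside the library
documents = {
    "doc-1": "Cats are small carnivorous mammals.",
    "doc-2": "Dogs are loyal companions.",
    "doc-3": "Goldfish live in fresh water.",
}

library = TinyVectorLibrary()
library.add("doc-1", [0.9, 0.1])
library.add("doc-2", [0.8, 0.3])
library.add("doc-3", [0.0, 1.0])
library.build()

ids = library.search([1.0, 0.0], k=2)      # the library returns ids only
results = [documents[i] for i in ids]      # we look the objects up ourselves
print(results)
```

Calling `library.add(...)` after `build()` raises an error in this sketch, which mirrors the rebuild-from-scratch workflow described in point 2.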

There are many vector search libraries available, including Facebook’s FAISS, Spotify’s Annoy, and Google’s ScaNN. FAISS uses clustering, Annoy uses trees, and ScaNN uses vector compression. Each comes with its own performance trade-off, so we can choose among them depending on our application and performance measure.

The core feature that sets vector databases apart from vector libraries is the ability to store, update and delete the data. Vector databases have full CRUD (create, read, update, and delete) support that solves the limitations of a vector library.

  1. Store vectors and objects: Databases can store both the data objects and their vectors. Since both are stored, we can combine vector search with structured filters. Filters ensure that the nearest neighbours returned also match the conditions we set on the metadata.
  2. Mutability: Since vector databases fully support CRUD, we can easily add, remove, or update entries in our index after it has been created. This is especially useful when working with data that is continuously changing.
  3. Real-time search: Unlike vector libraries, databases allow us to query and modify our data during the import process. As we upload millions of objects, the imported data remains fully accessible and operational, so we don’t need to wait for the import to complete before working with what is already in.
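As a rough illustration of these three points, the sketch below keeps vectors, objects, and metadata together in one hypothetical in-memory class. `TinyVectorDB` and its method names are assumptions made for this example and do not correspond to any particular product; it also scans every row exactly, whereas a real vector database would use an approximate index.

```python
import math

class TinyVectorDB:
    """Hypothetical in-memory sketch: vector, object and metadata stored together."""

    def __init__(self):
        self._rows = {}

    def upsert(self, object_id, vector, obj, metadata=None):
        # Create or update: the entry is searchable immediately, no rebuild needed
        self._rows[object_id] = (list(vector), obj, dict(metadata or {}))

    def delete(self, object_id):
        self._rows.pop(object_id, None)

    def search(self, query, k, where=None):
        where = where or {}
        candidates = [
            (math.dist(query, vector), object_id, obj)
            for object_id, (vector, obj, metadata) in self._rows.items()
            if all(metadata.get(key) == value for key, value in where.items())
        ]
        candidates.sort(key=lambda candidate: candidate[0])
        return [(object_id, obj) for _, object_id, obj in candidates[:k]]

db = TinyVectorDB()
db.upsert("a1", [0.9, 0.1], "Cats purr when content.", {"species": "cat"})
db.upsert("a2", [0.8, 0.2], "Dogs wag their tails.", {"species": "dog"})
db.upsert("a3", [0.1, 0.9], "Goldfish live in water.", {"species": "fish"})

nearest = db.search([1.0, 0.0], k=2)                               # objects come back directly
dogs_only = db.search([1.0, 0.0], k=2, where={"species": "dog"})   # vector search + metadata filter

db.upsert("a2", [0.0, 1.0], "Dogs also love to swim.", {"species": "dog"})  # update in place
db.delete("a1")                                                             # delete
after_changes = db.search([1.0, 0.0], k=1)
print(nearest, dogs_only, after_changes)
```

Unlike the library-style workflow, the query returns the stored objects themselves, and inserts, updates, and deletes take effect without rebuilding anything.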

In short, a vector database provides a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices as discussed in the above points.

But what makes vector databases superior to traditional databases?

Vector Databases vs Traditional Databases

Traditional databases are designed to store and retrieve structured data using relational models, which means that they are optimized for queries based on columns and rows of data. While it is possible to store vector embeddings in traditional databases, these databases are not optimized for vector operations and cannot perform similarity searches or other complex operations on large datasets in an efficient manner.

This is because traditional databases use indexing techniques that are built for simple data types, such as strings or numbers. These techniques are not suitable for vector data, which has high dimensionality and requires specialized indexing techniques such as inverted file (IVF) indexes or spatial trees.

Additionally, traditional databases are not designed to handle the large amounts of unstructured or semi-structured data that are often associated with vector embeddings. For example, an image or audio file can contain millions of data points, which traditional databases are not equipped to handle efficiently.

Vector databases, on the other hand, are specifically designed to store and retrieve vector data, and are optimized for similarity searches and other complex operations on large datasets. They use specialized indexing techniques and algorithms that are designed to work with high-dimensional data, making them much more efficient than traditional databases for storing and retrieving vector embeddings.

Now that you’ve read so much about vector databases, you might be wondering, how do they work? Let's look at it.

How does a vector database work?

We all know how traditional databases work — they store strings, numbers, and other types of scalar data in rows and columns. On the other hand, a vector database operates on vectors, so the way it’s optimized and queried is quite different.

In traditional databases, we usually query for rows whose values exactly match our query. In vector databases, we instead apply a similarity metric to find the vectors that are most similar to our query.

A vector database uses a combination of different algorithms that all participate in the Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, or graph-based search.

These algorithms are assembled into a pipeline that provides fast and accurate retrieval of the neighbours of a queried vector. Since the vector database provides approximate results, the main trade-offs we consider are between accuracy and speed. The more accurate the result, the slower the query will be. However, a good system can provide ultra-fast search with near-perfect accuracy.
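The accuracy versus speed trade-off can be observed with a small experiment. The sketch below is a simplified, clustering-style approximate search written for this article: it uses randomly chosen data points as cluster centres (an assumption made to keep the code short; real systems typically train the centres, for example with k-means) and a hypothetical `nprobe` parameter that controls how many clusters are scanned per query.

```python
import math
import random

random.seed(0)
DIM, N, N_CLUSTERS, K = 8, 500, 10, 5
vectors = [[random.random() for _ in range(DIM)] for _ in range(N)]

# Coarse partition: randomly picked data points serve as centroids (no refinement)
centroids = random.sample(vectors, N_CLUSTERS)
clusters = [[] for _ in range(N_CLUSTERS)]
for index, vector in enumerate(vectors):
    nearest = min(range(N_CLUSTERS), key=lambda c: math.dist(vector, centroids[c]))
    clusters[nearest].append(index)

def exact_search(query, k):
    """Ground truth: scan all N vectors."""
    return sorted(range(N), key=lambda i: math.dist(query, vectors[i]))[:k]

def approximate_search(query, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(N_CLUSTERS), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in clusters[c]]
    top = sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]
    return top, len(candidates)

def recall_at_k(query, k, nprobe):
    """Fraction of the true top-k found, plus the number of vectors scanned."""
    truth = set(exact_search(query, k))
    found, scanned = approximate_search(query, k, nprobe)
    return len(truth & set(found)) / k, scanned

query = [random.random() for _ in range(DIM)]
for nprobe in (1, 3, N_CLUSTERS):
    recall, scanned = recall_at_k(query, K, nprobe)
    print(f"nprobe={nprobe:2d}  scanned={scanned:3d}  recall@{K}={recall:.2f}")
```

Scanning more clusters means more distance computations (slower) but can only keep or improve recall; scanning all of them degenerates into the exact brute-force search.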

Below is a common pipeline for a vector database:

Figure: pipeline for a vector database
  • Indexing: The vector database indexes vectors using an algorithm such as PQ, LSH, or HNSW. This step maps the vectors to a data structure that enables faster searching.
  • Querying: The vector database compares the indexed query vector to the indexed vectors in the dataset to find the nearest neighbours, applying the similarity metric used by that index.
  • Post Processing: In some cases, the vector database retrieves the final nearest neighbours from the dataset and post-processes them to return the final results. This step can include re-ranking the nearest neighbours using a different similarity measure.
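The three stages above can be sketched end to end with a very small locality-sensitive hashing (LSH) index based on random hyperplanes. This is a minimal illustration under stated assumptions, not how any specific database implements its pipeline: the data is random, the single hash table and the constants `DIM` and `N_BITS` are arbitrary choices, and production systems typically combine several tables or other index types.

```python
import math
import random

random.seed(42)
DIM, N_BITS = 16, 6

# Indexing: random hyperplanes turn each vector into a short bit signature
hyperplanes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BITS)]

def signature(vector):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vector)) >= 0) for plane in hyperplanes
    )

vectors = {f"vec-{i}": [random.gauss(0, 1) for _ in range(DIM)] for i in range(1000)}
buckets = {}
for vector_id, vector in vectors.items():
    buckets.setdefault(signature(vector), []).append(vector_id)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query, k):
    # Querying: hash the query the same way and fetch only the matching bucket
    candidates = buckets.get(signature(query), [])
    # Post-processing: re-rank the candidates with an exact similarity measure
    ranked = sorted(
        candidates,
        key=lambda vector_id: cosine_similarity(query, vectors[vector_id]),
        reverse=True,
    )
    return ranked[:k]

print(search(vectors["vec-7"], k=3))
```

Vectors that point in similar directions tend to share a signature, so the query only touches one bucket instead of all 1,000 vectors. The price is that a true neighbour sitting just across a hyperplane can be missed, which is the approximation the pipeline accepts in exchange for speed.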

Wrapping up

In conclusion, vector databases are a powerful tool for similarity search and other complex operations on large datasets, which traditional databases cannot perform efficiently. Embeddings are essential to a functional vector database, as they capture the semantic meaning of the data and enable accurate similarity searches. Unlike vector libraries, vector databases are designed to scale with our use case, making them ideal for applications where performance and scalability are critical. With the rise of machine learning and AI, vector databases are becoming increasingly important for a wide range of applications, including recommendation systems, image search, semantic similarity, and more. As the field continues to evolve, we can expect to see even more innovative applications of vector databases in the future.
