Vector Databases | Which one to Choose?

Apurva Kumar
Aug 19, 2023


What is a vector database?

A vector database is a type of database that stores data as high-dimensional vectors. Each vector represents a single entity, such as a document, image, or audio clip. The vectors are usually generated by applying some kind of transformation or embedding function to the raw data. This function can be based on various methods, such as machine learning models, word embeddings, or feature extraction algorithms.

Vector databases are designed to efficiently store and search for data based on its vector representation. This makes them ideal for applications that require similarity search, such as:

  • Product recommendation systems
  • Image search
  • Natural language processing
  • Fraud detection
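To make "similarity search" concrete, here is a minimal sketch of the core operation, written as a brute-force scan in plain Python. The item names and 3-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions, and real vector databases use approximate indexes instead of scanning every row.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(index, query, k=2):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(vec, query), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hypothetical embeddings, made up for this example.
index = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}
results = search(index, query=[1.0, 0.0, 0.0], k=2)
print(results)
```

With these made-up vectors, the two footwear items rank above the coffee maker because their vectors point in nearly the same direction as the query.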

To have a complete picture of a vector database, it’s helpful to define what a vector embedding and an embedding model are:

Vector embedding

Vector embeddings are the representations of data stored and analyzed in vector databases. These vectors place semantically similar items close together in space, and dissimilar items far apart.

These (vector) embeddings can be produced for any kind of information — words, phrases, sentences, images, nodes in a network, etc. Once you have vector embeddings for your data, algorithms can detect patterns, group similar items, find logical relationships, and make predictions.

Figure 1. Vector embedding example using Star Wars characters

Figure 1 shows an embedding representation of Star Wars characters, learned from analyzing patterns in dozens of Star Wars books. This embedding space could be used as follows:

  • Cluster characters into groups like “Jedi”, “Sith”, “Droids”, etc. based on vector proximity.
  • For a character like Yoda, the nearest neighbors in the vector space may be other Jedi masters (e.g. Luke), indicating an affiliation we could infer even with no label for the given cluster.
  • Find edge cases, e.g. Anakin Skywalker may sit at the intersection of Jedi and Sith, even though we know his final form is more akin to Sith and Droid once he is fully led to the dark side⚡.
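The uses listed above can be sketched in a few lines of Python. The 2-D coordinates below are invented for illustration and are not read off Figure 1; the helper names `nearest` and `centroid` are likewise hypothetical.

```python
import math

# Hypothetical 2-D coordinates, invented for this example.
characters = {
    "Yoda": (0.9, 0.8),
    "Luke": (0.8, 0.9),
    "Obi-Wan": (0.85, 0.7),
    "Palpatine": (-0.9, 0.8),
    "Darth Maul": (-0.8, 0.7),
    "R2-D2": (0.1, -0.9),
    "C-3PO": (0.2, -0.8),
    "Anakin": (0.0, 0.75),
}

def nearest(name, k=2):
    """Names of the k characters closest to `name` by Euclidean distance."""
    origin = characters[name]
    others = [
        (math.dist(origin, pos), other)
        for other, pos in characters.items()
        if other != name
    ]
    return [other for _, other in sorted(others)[:k]]

def centroid(names):
    """Average position of a group of characters."""
    points = [characters[n] for n in names]
    return tuple(sum(axis) / len(points) for axis in zip(*points))

jedi = centroid(["Yoda", "Luke", "Obi-Wan"])
sith = centroid(["Palpatine", "Darth Maul"])
anakin = characters["Anakin"]

print(nearest("Yoda"))
print(round(math.dist(anakin, jedi), 2), round(math.dist(anakin, sith), 2))
```

With these made-up coordinates, Yoda's nearest neighbors are Obi-Wan and Luke, and Anakin lands almost equidistant from the Jedi and Sith centroids, which is the kind of edge case described above.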

Different embeddings capture different underlying notions of similarity (see Figure 2). For example, CLIP can capture the high-level semantic similarity of concepts like “Jedi” and “Sith”, whereas other representations, such as a PCA projection, may capture lower-level similarities, such as shapes or colours.

Figure 2. A different vector embedding space of the same Star Wars characters

Embedding model

Vector databases use embedding models as a key component for translating data into vector formats optimized for similarity search and pattern analysis. The embedding models produce the vector representations that vector databases are built to store, query and analyze.

Some ways embedding models work with vector databases include:

  • Vector databases rely on embedding models to encode data such as words, images, knowledge graphs, etc. into numeric vector representations.
  • Because embedding models map semantically related items close together in vector space, vector databases can perform rapid vector similarity searches.
  • Embedding models map sparse data into lower-dimensional dense vectors, which vector databases are optimized to work with.
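As a rough illustration of the last point, the sketch below turns sparse text (a bag of words) into a short dense vector using feature hashing. This is a crude stand-in, not a learned embedding model: the `DIM` value, the `bucket` and `embed` helpers, and the sample sentences are all assumptions made for the example.

```python
import hashlib
import math

DIM = 8  # assumed output size; real models typically use hundreds of dimensions

def bucket(token):
    """Map a token to a stable position in [0, DIM) using a hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return digest[0] % DIM

def embed(text):
    """Turn text into a dense, unit-length vector of DIM numbers."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[bucket(token)] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(u, v))

a = embed("the jedi master trains the padawan")
b = embed("the padawan trains with the jedi master")
c = embed("quarterly revenue grew sharply")
print(len(a), round(dot(a, b), 3), round(dot(a, c), 3))
```

Sentences that share most of their words end up with a high similarity score. Unrelated text usually scores lower, although with so few buckets hash collisions can blur the difference; a trained model learns the mapping instead of hashing.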

Vector embeddings, embedding models and vector databases work together to provide an end-to-end solution for generating, storing, and using vector data to power AI applications.
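The three pieces can be wired together in a small in-memory sketch. Everything here is hypothetical: `TinyVectorStore` is not a real library, and the letter-frequency `embed` function is only a placeholder for a real embedding model.

```python
import math

def embed(text):
    """Toy embedding (an assumption for illustration): unit-length letter frequencies."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class TinyVectorStore:
    """Hypothetical in-memory store keeping ids, vectors, and the original text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.rows = {}  # id -> (vector, payload)

    def add(self, item_id, text):
        """Generate an embedding and store it next to the payload."""
        self.rows[item_id] = (self.embed_fn(text), text)

    def query(self, text, k=1):
        """Embed the query and return the k most similar stored items."""
        q = self.embed_fn(text)
        scored = sorted(
            (
                (sum(a * b for a, b in zip(vec, q)), item_id, payload)
                for item_id, (vec, payload) in self.rows.items()
            ),
            reverse=True,
        )
        return [(item_id, payload) for _, item_id, payload in scored[:k]]

store = TinyVectorStore(embed)
store.add("doc-1", "lightsaber duel on Mustafar")
store.add("doc-2", "droid repair manual")
top = store.query("lightsaber duel", k=1)
print(top)
```

The embedding function is passed in as a parameter so that the storage and query logic stays the same when a real model replaces the toy one.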

Popular vector databases

There are a number of popular vector databases available, including:

Vertex AI Matching Engine (since renamed Vertex AI Vector Search) is a fully managed vector search service from Google Cloud. It builds on Google's ScaNN approximate nearest neighbor research and offers a variety of features, including:

  • High performance for similarity search
  • Scalability to handle large datasets
  • Support for a variety of vector data types

Chroma is an open source vector database that is designed for high performance and scalability. It uses an HNSW index built on the hnswlib library and offers a number of features, including:

  • Support for a variety of vector data types
  • Distributed indexing for scalability
  • Real-time query processing

Milvus is an open source vector database that is designed for real-time search and recommendation applications. It supports several index types, including HNSW and IVF variants, and offers a number of features, including:

  • High performance for real-time search
  • Scalability to handle large datasets
  • Support for a variety of vector data types
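To give a feel for the graph-based search behind HNSW, here is a heavily simplified sketch: a single-layer neighbor graph searched greedily. It is not HNSW itself, which adds a hierarchy of layers and keeps a list of candidates during search, and it does not reflect how Milvus implements its indexes. The function names and parameters are assumptions.

```python
import math
import random

def build_graph(vectors, m=4):
    """Link every vector to its m nearest neighbours (brute force, for clarity)."""
    graph = {}
    for i, v in enumerate(vectors):
        dists = sorted((math.dist(v, w), j) for j, w in enumerate(vectors) if j != i)
        graph[i] = [j for _, j in dists[:m]]
    return graph

def greedy_search(vectors, graph, query, entry=0):
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for j in graph[current]:
            d = math.dist(vectors[j], query)
            if d < best_dist:
                best, best_dist = j, d
        if best == current:
            # No neighbour is closer: a local (not necessarily global) optimum.
            return current, current_dist
        current, current_dist = best, best_dist

random.seed(0)
vectors = [(random.random(), random.random()) for _ in range(200)]
graph = build_graph(vectors)
query = (0.5, 0.5)
found, found_dist = greedy_search(vectors, graph, query)
exact_dist = min(math.dist(v, query) for v in vectors)
print(found, round(found_dist, 4), round(exact_dist, 4))
```

The search only inspects the neighbors along its path instead of every vector, which is where the speed comes from. The trade-off is that it can stop at a point that is close but not the closest, which is why such indexes are called approximate.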

Pinecone is a commercial vector database that is designed for enterprise applications. It offers a number of features, including:

  • High performance for similarity search
  • Scalability to handle large datasets
  • Security and compliance features
  • Integration with a variety of data sources

Comparison of vector databases

The following table compares some of the key features of the vector databases mentioned above:

Table 1. Comparing Vector databases in terms of functionalities offered

Choosing a vector database

The best vector database for you will depend on your specific needs and requirements. Consider the following factors when making your decision:

  • The size and complexity of your dataset
  • The performance requirements of your application
  • The features that are important to you
  • Your budget
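For the first factor, a back-of-the-envelope estimate of raw vector storage helps. The sketch below assumes 4-byte floats and ignores index overhead, metadata, and replication; the function name and the example figures (one million vectors of 768 dimensions) are illustrative assumptions.

```python
def raw_index_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed to hold the vectors alone, assuming fixed-width floats."""
    return num_vectors * dimensions * bytes_per_value

size = raw_index_bytes(1_000_000, 768)
print(f"{size / 1024**3:.2f} GiB")  # prints "2.86 GiB"
```

An estimate like this gives a lower bound on memory or disk needs, which in turn narrows down which deployment options and pricing tiers are realistic.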

If you are unsure which vector database to choose, try out a few on a sample of your own data to see which one works best for you. Open source options such as Chroma and Milvus can be run locally at no cost, and commercial services often offer a free tier or trial.
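When trying out candidates, one common way to compare result quality is recall@k: the share of the true nearest neighbors that a database actually returns. The sketch below assumes you already have ground-truth results from an exact brute-force scan; the ids and the `recall_at_k` helper are hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the candidate system returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

exact = ["a", "b", "c", "d", "e"]      # ground truth from a brute-force scan
candidate = ["a", "c", "x", "b", "y"]  # hypothetical results from the database under test
score = recall_at_k(candidate, exact, k=5)
print(score)  # prints 0.6
```

Averaging this score over many queries, alongside query latency, gives a fairer comparison than feature lists alone.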

Figure 3. ML pipeline with vector database used to store embeddings

In the field of artificial intelligence, vector databases are an emerging database technology that is transforming how we represent and analyze data by using vectors — multi-dimensional numerical arrays — to capture the semantic relationships between data points.

In this article, we begin by defining what a vector database is. We compare some of the top companies offering vector database solutions. Then, we highlight how vector databases differ from relational, NoSQL and graph databases. We illustrate with an example how vector databases work in action. Finally, we discuss what might be on the horizon for this technology.

How Vector DBs compare to other kinds of DBs

Vector databases excel in their particular niche: handling embedding vectors at scale. The following table shows some of the differences between vector DBs and other types of databases.

Table 2. Comparing Vector databases with other kinds of databases

Bear in mind that while this table provides a general overview, there can be specific databases within each category that have unique features and characteristics.
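One difference is easy to show in code: a relational query matches rows exactly, while a vector query ranks rows by closeness. The sketch below uses Python's built-in `sqlite3` module for the relational side and a plain distance scan for the vector side. The table, the 2-D vectors, and the query embedding are invented for illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("running shoes", 0.9, 0.1),
        ("trail sneakers", 0.8, 0.2),
        ("coffee maker", 0.0, 0.9),
    ],
)

# Relational-style lookup: returns only rows whose value equals the predicate.
exact = conn.execute(
    "SELECT name FROM products WHERE name = ?", ("jogging shoes",)
).fetchall()

# Vector-style lookup: rank every row by distance to a query vector.
query = (0.9, 0.05)  # hypothetical embedding of "jogging shoes"
rows = conn.execute("SELECT name, x, y FROM products").fetchall()
nearest = min(rows, key=lambda r: math.dist((r[1], r[2]), query))[0]

print(exact, nearest)
```

The exact-match query finds nothing because no product is literally named "jogging shoes", while the distance-based lookup still surfaces the closest item.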

Future

Vector databases are likely to become commodities as demand grows for managing machine learning vector data at scale. They provide the performance, scale, and flexibility that AI applications require across industries.

Unlike other databases, vector databases were created specifically for vector embeddings and neural network applications. They introduce a vector-native data model and query language providing functionality beyond SQL or graphs. As machine learning enriches use cases that understand the world through vectors, vector databases deliver the data solution to gain insights from them.

Vector databases exhibit characteristics of both commodities and novel technologies. They are becoming commonplace for enterprises developing AI but represent a new database with a vector-first architecture no other technology provides.



Apurva Kumar

Principal Software Engineer at Walmart Labs | Ex-Uber, Amazon, Yahoo and Samsung. MS in CS @UC San Diego