World of Vector Databases: A Comprehensive Overview

This article is a good starting point for anyone looking to understand vector databases and their practical applications.

Introduction

For many years, relational databases and full-text search engines have been the foundation of search in modern IT systems.
However, as semantic search has become more important, this approach has run into its limitations. In response, vector similarity search has emerged.

The data is transformed into a numeric representation: a vector (a list of numbers). Vector similarity search then finds similar content by comparing the distances between these vectors.

Vector databases play a pivotal role in this paradigm, storing vectors for fast retrieval and similarity search.
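To make the idea concrete, here is a minimal sketch of comparing vectors with cosine similarity; the three-dimensional vectors and their values are invented for illustration (real embeddings have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" of three concepts.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.15]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meaning
print(cosine_similarity(cat, car))     # much lower: different meaning
```

The closer the score is to 1.0, the more similar the content the vectors represent.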

Today it is the technology behind Google Search, YouTube, Google Play, and many other modern systems.

Google Cloud partner Groovenauts, Inc. published a live demo called MatchIt Fast. As the demo shows, you can find images and text similar to a selected sample from a collection of millions in a matter of milliseconds. You can also try the demo of the Milvus vector database.

Vector Databases vs. Relational Databases

Traditional databases store strings, numbers, and other scalar values in rows and columns. To store an image, its features must be described almost manually.

For example, an image of a cat might be described in columns:

animal: cat, color: white, tags: [little, kind].

We then query for rows that match this description.

A vector database, by contrast, represents data in an n-dimensional space, which makes it possible to group similar data points together based on their distance within this space.

Compared to traditional keyword search, vector search often yields more relevant results and can execute faster.
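The contrast can be sketched in a few lines of Python; the rows and vector values are invented for illustration. A traditional query filters on exact column values, while a vector query ranks rows by distance in the embedding space.

```python
import math

rows = [
    {"animal": "cat", "color": "white", "vector": [0.9, 0.8, 0.1]},
    {"animal": "dog", "color": "brown", "vector": [0.7, 0.9, 0.2]},
    {"animal": "car", "color": "red", "vector": [0.1, 0.2, 0.9]},
]

# Traditional query: exact match on column values.
exact = [r for r in rows if r["animal"] == "cat" and r["color"] == "white"]

# Vector query: rank rows by Euclidean distance to the query vector.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.85, 0.75, 0.15]  # e.g. the embedding of a kitten photo
ranked = sorted(rows, key=lambda r: euclidean(r["vector"], query))
print(ranked[0]["animal"])  # "cat": the nearest row in the vector space
```

The vector query needs no hand-written column filters; the notion of "looks like a kitten" is carried entirely by the distances.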

But how does an image become a vector?

Image embedding

To transform an image into a vector, the first step is to extract meaningful features from the image that can be represented numerically, such as textures, shapes, and more. In the second step, each feature becomes a dimension of the vector, and its value represents a specific aspect of the image.

The process of transforming an image into a vector is known as "image embedding". The AI models that learn to extract hierarchical features from images and map them to lower-dimensional vector representations are known as "embedding models".
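As a toy illustration of the two steps, the sketch below maps a grayscale image (a list of pixel brightness values) to a fixed-length vector. The feature choice here, a brightness histogram, is invented for simplicity; a learned embedding model extracts far richer features, but the idea of "image in, numeric vector out" is the same.

```python
def embed_image(pixels, bins=4):
    # Step 1: extract a numeric feature, here a histogram of pixel
    # brightness values (0-255) over a few bins.
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    # Step 2: each bin becomes one dimension of the vector,
    # normalized so the values are comparable across image sizes.
    total = len(pixels)
    return [count / total for count in hist]

bright_image = [200, 220, 240, 250, 230, 210]
dark_image = [10, 20, 30, 15, 25, 5]
print(embed_image(bright_image))  # mass in the high-brightness bin
print(embed_image(dark_image))    # mass in the low-brightness bin
```

Two visually similar images end up with nearby vectors, which is exactly the property similarity search relies on.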

The following image illustrates the process of transforming data into a vector.

Global view of the text to vector transformation

Searching process

To find similar images or text, a vector database uses the distance between vectors to determine their similarity.

The following image visualizes word vectors. As you can see, "cat" and "kitten" are very close together and will be identified as text with similar meaning.

Numerically, this proximity is reflected in how close the values of the corresponding dimensions are.

Semantic search operates by evaluating the distances and similarities between these vectors.

The distance analysis relies on approximate nearest neighbor (ANN) algorithms and distance metrics. For more details about metrics, refer to the article titled Different types of Distances used in Machine Learning Explained!

For each search, the first step is to transform the query into a vector representation. Next comes the indexing stage: the vector database maps vector embeddings to data structures that allow faster searching.

Following indexing, the process advances to the querying stage. Here, the vector database compares the query vector to the indexed vectors, applying the similarity metric to find the nearest neighbors.

Lastly, the process ends with the post-processing stage. Depending on the vector database you use, it will post-process the nearest neighbors to produce the final output of the query, possibly re-ranking them as well.
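The stages above can be sketched with a brute-force search in plain Python (exact, not approximate, so it only illustrates the flow; the document IDs and vectors are invented).

```python
import math

# Indexing: in this brute-force sketch the "index" is just a dict;
# real vector databases build ANN structures for speed.
index = {
    "doc1": [0.9, 0.1, 0.3],
    "doc2": [0.2, 0.8, 0.5],
    "doc3": [0.88, 0.15, 0.25],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Querying: compare the query vector against every indexed vector.
query_vector = [0.85, 0.12, 0.28]
scores = [(doc_id, cosine(vec, query_vector)) for doc_id, vec in index.items()]

# Post-processing: rank by score and keep the top-k results.
top_k = sorted(scores, key=lambda s: s[1], reverse=True)[:2]
print(top_k)  # doc1 and doc3 are the nearest neighbors
```

A production system replaces the linear scan with an ANN index, but the transform, compare, and re-rank stages stay conceptually the same.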

Use Case

The ability of vector databases to handle similarity-based searches efficiently makes them a powerful tool for data retrieval.

They find application in several areas, including:

  • clustering
  • searching
  • recommendations
  • classifications
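For instance, classification reduces to a nearest-neighbor lookup over labeled vectors; a minimal sketch, with all labels and values invented for illustration:

```python
import math

# Labeled reference vectors, e.g. embeddings of already-classified items.
labeled = [
    ([0.9, 0.8, 0.1], "animal"),
    ([0.85, 0.7, 0.2], "animal"),
    ([0.1, 0.2, 0.9], "vehicle"),
    ([0.2, 0.1, 0.8], "vehicle"),
]

def classify(vector):
    # 1-nearest-neighbor: assign the label of the closest stored vector.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(labeled, key=lambda item: dist(item[0], vector))[1]

print(classify([0.8, 0.75, 0.15]))   # "animal"
print(classify([0.15, 0.15, 0.85]))  # "vehicle"
```

Recommendations work the same way with the roles reversed: instead of labeling the query, the system returns the nearest stored items themselves.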

Vector Database systems

Pinecone — A vector database designed for machine learning applications. It is fast, scalable, and supports a variety of machine learning algorithms.

Weaviate (GitHub: 7.5k ⭐) — An open-source vector database that allows you to store data objects and vector embeddings from your favorite ML models, and scale seamlessly into billions of data objects.

Chroma (GitHub: 8.6k ⭐) — An AI-native open-source embedding database. It is simple, feature-rich, and integrable with various tools and platforms for working with embeddings.

Milvus (GitHub: 23k ⭐) — An open-source vector database built to power embedding similarity search and AI applications.

Qdrant (GitHub: 12.9k ⭐) — Powering the next generation of AI applications with advanced, high-performance vector similarity search technology.

Vespa (GitHub: 4.7k ⭐) — A fully featured search engine and vector database. It supports vector search (ANN), lexical search, and search in structured data, all in the same query.

For more systems, see the following article: Vector database systems and related technologies.

From Theory to Practice

The practical implementation of vector databases involves understanding the fundamental stages for handling a vector database:

1. Create an index and define the algorithm for similarity search
2. Define the vector transformer (embedding model)
3. Load the dataset
4. Create a query
5. Transform the query into a vector
6. Run the query
7. Find the similar items using the created index

For practical application examples with Pinecone, you can explore the code provided in the Pinecone documentation; the version below is simplified to aid fundamental comprehension.

1. Create index and define the algorithm for similarity search:

import pinecone

# Initialize the client (the credentials are placeholders)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

index_name = "example-index"
pinecone.create_index(
    name=index_name,
    dimension=384,  # must match the embedding model's output dimension
    metric='cosine'
)
index = pinecone.Index(index_name)

The metric refers to the distance measure employed for similarity search; options include 'euclidean', 'cosine', and 'dotproduct'.
The choice of metric depends on the specific problem and the nature of the data being analyzed.
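The three metrics behave differently; for example, cosine similarity ignores vector magnitude while Euclidean distance does not. A plain-Python comparison on two invented vectors that point in the same direction but differ in length:

```python
import math

def euclidean(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Angle-based similarity, independent of vector length.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(a, b))    # ~3.74: the points are far apart in space
print(dot_product(a, b))  # 28.0
print(cosine(a, b))       # 1.0: same direction, maximally similar in angle
```

This is why cosine is a common default for text embeddings, where direction carries the meaning, while Euclidean suits data where magnitude matters.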

2. Defining the embedding model (vector transformer) and transforming the query:

from sentence_transformers import SentenceTransformer

# Initialize the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "which city has the highest population in the world?"

# Create the query vector with the encode method
query_vector = model.encode(query).tolist()
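Between defining the model and running a query, the dataset itself has to be embedded and loaded into the index (step 3 of the stages above). The sketch below only shows the payload shape Pinecone's upsert accepts; the document texts, IDs, and the `fake_encode` helper are invented stand-ins, and a real run would use the embedding model's encode method instead.

```python
# Hypothetical stand-in for a real embedding model's encode method.
def fake_encode(text):
    return [float(len(word)) for word in text.split()[:3]]

documents = [
    {"id": "doc1", "text": "Tokyo is the most populous city in the world."},
    {"id": "doc2", "text": "The cheetah is the fastest land animal."},
]

# Build (id, vector, metadata) tuples: the payload shape index.upsert accepts.
vectors = [(doc["id"], fake_encode(doc["text"]), {"text": doc["text"]})
           for doc in documents]

# With a live index, this single call would load the dataset:
# index.upsert(vectors=vectors)
print(vectors[0][0], len(vectors))
```

Storing the original text as metadata is what lets the query stage return human-readable results rather than raw vectors.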

3. Run query using the index:

# Run the query, returning the 5 closest matches with their metadata
xc = index.query(vector=query_vector, top_k=5, include_metadata=True)

4. Show the obtained results:

for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

Conclusion

Vector databases have revolutionized the way we approach data retrieval and similarity-based searches in modern IT systems.

They store vectors, which are numeric representations of data.

To transform an object, such as an image or text, into a vector, an embedding model is used. This model extracts meaningful features from the object and maps them to a lower-dimensional vector representation.

Finding similar vectors is a key operation in a vector database. This is achieved by evaluating the distance between vectors with approximate nearest neighbor algorithms, which efficiently identify the vectors closest in similarity to the query vector.

The applications of vector databases are diverse and powerful, including clustering, searching, recommendations, and classifications. Their ability to handle similarity-based searches efficiently positions them as essential tools in various fields.
