Vector Database

Ishika Garg
3 min read · Jul 3, 2024


In the world of databases, we’re all familiar with traditional systems such as relational databases (RDBMS). But have you heard about vector databases? Unlike an RDBMS, which returns rows that exactly match the conditions you specify, a vector database finds the items most similar to a query based on their semantic or contextual meaning. Let’s explore vector databases, as they are incredibly important if you’re working with machine learning.

A vector database is designed to store and search high-dimensional data efficiently, which makes it a good fit for applications built on large language models (LLMs). This is crucial for AI and machine learning, where understanding context and similarity is key.

The vector representations encode facts and commonsense concepts that may not be stated explicitly in the LLM’s training data. For example, vector(“King”) - vector(“Man”) + vector(“Woman”) results in a vector close to vector(“Queen”) in the vector space.
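
To see the arithmetic in action, here is a minimal Python sketch. The four 2-D vectors are made up for illustration (one axis for royalty, one for gender). Real embeddings are learned by a model and have hundreds of dimensions, so the match there is only approximate.

```python
# Toy 2-D "embeddings" invented for illustration only:
# axis 0 = royalty, axis 1 = gender (+1 male, -1 female).
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def squared_distance(a, b):
    """Squared straight-line distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(target, vectors):
    """Return the word whose vector lies closest to the target vector."""
    return min(vectors, key=lambda word: squared_distance(vectors[word], target))


# king - man + woman
result = add(sub(toy_vectors["king"], toy_vectors["man"]), toy_vectors["woman"])
closest = nearest_word(result, toy_vectors)
print(result, closest)
```

With these hand-picked vectors the result lands exactly on “queen”. With real embeddings you would look for the nearest neighbour instead of an exact hit.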

How we can use a vector database:

  1. First, we use an embedding model to generate vector embeddings for the content.
  2. These vector embeddings are then inserted into the vector database, along with a reference to the original content.
  3. When a user or application issues a query, we use the same embedding model to create an embedding for the query. This embedding is then used to search the database for similar vector embeddings.
  4. Finally, the content associated with these similar embeddings is forwarded to the LLM for further processing.

Figure: Model workflow
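
The four steps above can be sketched in plain Python. Everything here is a toy stand-in: `embed` is a hypothetical word-count function and not a real embedding model, `ToyVectorStore` is an in-memory list and not a real vector database, and the final prompt string only shows where the retrieved text would be handed to an LLM.

```python
import math

# Hypothetical toy vocabulary; a real embedding model learns its own features.
VOCAB = ["cat", "dog", "pet", "stock", "market", "price"]


def embed(text):
    """Stand-in for an embedding model: counts how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory stand-in for a vector database (brute-force search)."""

    def __init__(self):
        self.items = []  # list of (embedding, original text) pairs

    def insert(self, text):
        # Steps 1 and 2: embed the content and store the embedding with its text.
        self.items.append((embed(text), text))

    def search(self, query, k=1):
        # Step 3: embed the query with the same model and rank stored items.
        query_vec = embed(query)
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vec, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
for doc in [
    "the cat is a quiet pet",
    "the dog is a loyal pet",
    "the stock market price fell",
]:
    store.insert(doc)

# Step 4: the retrieved text becomes context for the LLM prompt.
context = store.search("which pet is a dog", k=1)
prompt = "Answer using this context: " + " ".join(context)
print(prompt)
```

A real vector database replaces the brute-force `sorted` call with an index built for fast approximate search, which is what keeps queries quick over millions of vectors.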

Here are a few similarity measures –

1. Cosine Similarity: Cosine similarity measures the cosine of the angle between two vectors in a vector space. It ranges from -1 to 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. It ignores vector length, so two vectors of different magnitude can still score 1.
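
Here is a small Python sketch of the calculation, using only the standard library. The example vectors are arbitrary and chosen to hit the three landmark values.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Because of floating-point rounding, the first and last values may come out a hair away from 1 and -1.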

2. Euclidean Distance: Euclidean distance measures the straight-line distance between two vectors in a vector space. It ranges from 0 to infinity, where 0 represents identical vectors and larger values represent increasingly dissimilar vectors.

Figure: Euclidean distance
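
A minimal Python sketch of Euclidean distance follows. The example vectors are arbitrary. Python 3.8 and later also ship `math.dist`, which computes the same value.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


identical = euclidean_distance([1.0, 2.0], [1.0, 2.0])
three_four_five = euclidean_distance([0.0, 0.0], [3.0, 4.0])
# Same direction, different length: distance is non-zero even though
# the angle between the vectors is zero.
scaled = euclidean_distance([1.0, 2.0], [2.0, 4.0])
print(identical, three_four_five, scaled)
```

Unlike cosine similarity, this measure is sensitive to vector length, which is why the last pair is not treated as identical.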

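3. Jaccard Similarity: Jaccard similarity measures the similarity between two sets (or binary vectors). It is the number of elements the two sets share divided by the total number of distinct elements across both, so it ranges from 0 (nothing in common) to 1 (identical sets).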

Figure: Jaccard similarity
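
As a quick sketch, Jaccard similarity can be computed with Python sets. The two word lists are invented for the example, and returning 1.0 for two empty sets is a convention assumed here, not a universal rule.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(union)


doc_a = "vector databases store embeddings".split()
doc_b = "vector databases search embeddings quickly".split()

# Shared words: vector, databases, embeddings (3)
# Distinct words overall: 6
score = jaccard_similarity(doc_a, doc_b)
print(score)
```

Three shared words out of six distinct words gives a score of 0.5.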

Following are some popular vector databases and libraries:

  1. FAISS (Facebook AI Similarity Search): Developed by Facebook AI (now Meta AI), FAISS is a library, not a full database, designed to efficiently index and search large collections of high-dimensional vectors. This makes it well suited to tasks such as image and text similarity search.
  2. Pinecone: Pinecone is a managed vector database service that offers real-time vector similarity search.
  3. Chroma: Chroma is an open-source vector database for storing and querying vector embeddings.

Fact—

Many venture capitalists are investing in vector databases, on the view that a successful LLM application needs a robust vector database with very low latency that can handle a wide range of customer workloads.

References —

  1. https://www.pinecone.io/learn/vector-database/

Finally —

Hopefully, you enjoyed reading it. This was an introduction to vector databases. Buckle up, because our next blog is gonna be EPIC!

Got questions? Don’t be shy! Hit me up on LinkedIn. Coffee’s on me (virtually, of course) ☕️
