Exploring Vector Databases with PGVector

Published in

Ekohe

4 min readMay 28, 2024

In software engineering, the way we store and query data is crucial. Traditional databases have served us well, but as the complexity and volume of data grow, so does the need for more advanced solutions. Enter vector databases, a revolutionary approach to handling high-dimensional data. PGVector, an extension of PostgreSQL, integrates vector storage and querying seamlessly. In this article, we’ll explore vector databases and how PGVector is transforming data handling.

Applications

Vector databases offer powerful capabilities that transform how we handle and analyze data. Below are some key applications where vector databases like PGVector excel, demonstrating their practical value in real-world scenarios.

Retrieval-Augmented Generation (RAG): Vector DB is crucial in RAG systems, helping retrieve related information quickly for LLMs while maintaining the integration of relational databases.

Source: Building RAG-based LLM Applications for Production

Recommendation System: Similar search on vector databases forms the basis of content-based recommendation systems, showing items similar to those a user has liked or selected.
Training data gathering: Let’s say you’re building an NLP classifier, instead of downloading or purchasing training data. You can gather them by manually selecting a few samples, then look for samples that are similar to the ones you just selected. For example: To gather training data for a spam classifier, you’ll only need to pick a spam email with poor grammar, too good to be true offers, suspicious sender, etc, and look for similar emails.

What are vector databases?

Vector databases (DBs) are specialized databases designed to store and retrieve high-dimensional vectors. Unlike traditional databases that excel at handling structured data, vector databases are built to manage unstructured data, such as images, text, and audio, which are often represented as vectors. These vectors can capture complex relationships and similarities that are not easily discernible through traditional database queries.

Traditional DBs will store their records like the following example:

To convert those records to a suitable format for a vector DB we typically use an embedding model. For my projects, I use sentence-transformers/distiluse-base-multilingual-cased-v2
This is because traditional text representations, like bag-of-words or TF-IDF, focus on the occurrence of words but often fail to capture the semantic meaning and context of the text. Embedding models is designed to understand the context and meaning behind words and sentences and convert them into dense vectors.

However, now our records in the DB will look drastically different.

It’s obvious that those records are not human-readable and should go through other processing steps before we can make use of them. We can apply a distance metric to measure how these embeddings are far apart from each other. A commonly used metric is cosine distance.

When we need to compare a record versus millions or even hundreds of millions, comparing one by one and sorting the distance is incredibly inefficient. Which is why we need something to combine the robust features of relational DB with the specialized functions to store, index, and execute vector operations efficiently.

PGVector

PGVector is an extension of PostgreSQL that provides these capabilities. It enables the storage, indexing, and querying of vector data within PostgreSQL, allowing for seamless integration of vector operations in a relational database environment.

Key features:

Vector Data Type: PGVector introduces a new data type for storing high-dimensional vectors.
Efficient Similarity Search: Perform fast similarity searches using vector operations like cosine similarity.
Indexing: Create indexes on vector columns to speed up search queries.
SQL Integration: Use standard SQL commands to manage and query vector data.

You can either create a table with traditional PSQL as usual, or gem Neighbor if you’re using Rails. For simplicity’s sake, I’m using PSQL

CREATE TABLE companies_vectors (
 id bigserial PRIMARY KEY,
 company_id int8 REFERENCES companies,
 embeddings vector (512),
 created_at TIMESTAMP,
 updated_at TIMESTAMP
);

You also need to create an index with either IVFFlat or HNSW. Here I used IVFFlat but if you have memory to spare HNSW is a very good choice with excellent query time.

CREATE INDEX ON companies_vectors USING ivfflat (embeddings vector_cosine_ops);

Source: Nearest Neighbor Indexes for Similarity Search

Now we can start querying, I’m looking for companies with a description similar to that of BBC.com. The 2 tables I used have 30 million and 3 rows million respectively. Note that this query finishes after only 0.8s.

Conclusion

PGVector revolutionizes data handling by combining relational database features with efficient vector operations. This makes it ideal for applications like recommendation systems, retrieval-augmented generation, and training data gathering. As data complexity grows, tools like PGVector will be essential for efficient and meaningful data analysis.

Exploring Vector Databases with PGVector

Applications

What are vector databases?

PGVector

Conclusion

Written by Mai