Storing Vectors Easily with the pgvector Extension for PostgreSQL

Shania E. Astagina · Published in EDTS · Aug 30, 2024

ML and AI have become an important part of our lives. Generative AI models like ChatGPT and Gemini assist us daily, helping us write emails and generate creative content. Some parts of this article were also written with help from Gen AI 😀.

Similarity search is one of the key applications of AI: it lets us find similar objects without needing exact matches, comparing objects by their context or semantic meaning instead. This opens up possibilities such as recommendation systems for unstructured data like images and video, chatbot question answering, text summarization, and many more.

To perform a similarity search, we need to convert the object into something machines can understand, such as vectors. An algorithm that translates objects into vectors is called a Vector Space Model. It maps an object to a collection of numbers in an imaginary multi-dimensional continuous space that captures the object's meaning and its relationships [1]. The length of these vectors varies: Word2Vec, a popular model for generating vector spaces, uses 300 dimensions in its pre-trained vectors [2], while OpenAI embeddings have 1536 dimensions by default [3].

So, how do we store these vectors? And once stored, how do we query them efficiently without sacrificing processing time? The pgvector extension for PostgreSQL comes to the rescue!! 🎉

This article gives a brief introduction to pgvector by exploring its capabilities based on the documentation. It also explains a little bit about the Vector Space Model (VSM) and vector databases. Then, we will use them to build a semantic similarity search with pgvector and Gemini. You can find the full code here.

Table of Contents

  1. What is Vector Space Model (VSM)?
  2. Vector Database and Why is it Important?
  3. Why pgvector?
  4. Explore Distance Function and Indexing Method in pgvector
  5. Use Case: Semantic Search using pgvector and Gemini.
  6. What’s next?

1. What is the Vector Space Model (VSM) in Machine Learning?

As mentioned in the previous section, the Vector Space Model is an algorithm that translates an object into a collection of numbers called a vector. A vector consists of dimensions and values associated with those dimensions, and it captures the semantic meaning behind an object. The object can be text, an image, a video, a document, etc.

Machines read objects as zeros and ones; they don't understand the meaning behind them. Without vectors, the machine reads an object as-is: if we compare “desk” with “table”, it will evaluate them as dissimilar because the letters differ, even though the meanings are similar. Vectors, on the other hand, capture the semantic meaning behind an object by assigning dimensions/features and values to it.

Figure 1 gives a simple visualization of vectors, adapted from a book by Pilehvar and Camacho-Collados [1]. With vectors, the words “desks”, “desk”, and “table” cluster together even though each has a different letter arrangement.

Figure 1. Simple Visualization of Vectors Space of the word “Desk”, “Desks”, “Table”, and “Plate”
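
To make this concrete, here is a tiny sketch with made-up 3-dimensional vectors (the values are invented purely for illustration): cosine similarity scores “desk” as close to “desks” and “table”, and further from “plate”.

import numpy as np

# Toy vectors; imagine the dimensions as features such as
# "furniture-ness", "flat surface", and "kitchen-ness".
vectors = {
    "desk":  np.array([0.9, 0.8, 0.1]),
    "desks": np.array([0.9, 0.7, 0.1]),
    "table": np.array([0.8, 0.9, 0.3]),
    "plate": np.array([0.1, 0.6, 0.9]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for word in ["desks", "table", "plate"]:
    print(f"desk vs {word}: {cosine_similarity(vectors['desk'], vectors[word]):.3f}")

# desk vs desks: ~1.00, desk vs table: ~0.98, desk vs plate: ~0.50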

2. Vector Database and Why is it Important?

Figure 1 only shows vectors with three dimensions. Generated vectors for text usually range between 100 and 1,000 dimensions, and between 1,000 and 10,000 for images. You can imagine how large the data becomes, and how slow the queries get.

This creates a need for tools that manage vector data well: tools that can query vast amounts of vector data without sacrificing accuracy, speed, or cost. For that, a good indexing algorithm is needed.

An index is a pointer that tells the database where data is stored. With an index, the database doesn't have to do a full table scan, resulting in faster queries.

3. Why pgvector?

pgvector is a PostgreSQL extension for managing vector data. It has two powerful features for this: indexing and built-in distance functions. Let's start with the indexing methods in pgvector!
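
As a quick taste, here is a minimal sketch of pgvector in action, assuming a running PostgreSQL instance with the extension installed and a placeholder connection string (adjust the credentials to your setup):

import psycopg2

with psycopg2.connect("postgresql://postgres:password@localhost:5432/pgvector_sandbox") as conn:
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")  # requires sufficient privileges
    cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3));")
    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');")
    # <-> is the L2 distance operator; more on the operators below
    cur.execute("SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 1;")
    print(cur.fetchone())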

A. pgvector Indexing Method

pgvector has two indexing methods: IVFFlat and HNSW. Based on the pgvector documentation, IVFFlat, or Inverted File with Flat Compression, creates clusters of vectors. When searching for vectors similar to the input, pgvector finds the nearest cluster centroids and then looks at those clusters' members. IVFFlat has a faster build time and uses less memory than HNSW, but has lower query performance. Figure 2 visualizes how the IVFFlat index works.

Figure 2. IVFFlat Indexing Method Visualization
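
To build intuition, here is a toy sketch of the IVFFlat idea in plain numpy. This is not pgvector's actual implementation; the data and the crude centroid assignment are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 8))   # 1000 vectors, 8 dimensions
k = 10                              # number of lists (clusters)

# Crude stand-in for k-means: pick k random vectors as centroids,
# then assign every vector to its nearest centroid.
centroids = data[rng.choice(len(data), k, replace=False)]
assignments = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(query, probes=2, top_k=5):
    # 1. Find the `probes` nearest centroids.
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:probes]
    # 2. Scan only the vectors belonging to those clusters.
    candidates = np.where(np.isin(assignments, nearest))[0]
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:top_k]]

print(ivf_search(rng.normal(size=8)))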

HNSW, or Hierarchical Navigable Small Worlds, uses a different approach to indexing vector data. Instead of creating clusters, pgvector builds layers of densely linked vectors. When searching for vectors similar to the input, pgvector finds the nearest vector on the top layer, then descends a layer and searches for the nearest vectors there, repeating until it reaches the closest vector on the bottom layer. Figure 3 visualizes how HNSW works.

Figure 3. HNSW Indexing Method Visualization
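
Again purely for intuition, here is a heavily simplified numpy sketch of HNSW's greedy layered search. Real HNSW builds its layers probabilistically and maintains candidate lists; this toy version just links each node to its nearest neighbors within hand-picked layers.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))

# Fake layers: upper layers keep progressively fewer nodes (top to bottom).
layers = [np.arange(0, 200, 16), np.arange(0, 200, 4), np.arange(200)]

def build_links(layer, m=4):
    # Link each node to its m nearest neighbors within the layer.
    links = {}
    for node in layer:
        d = np.linalg.norm(data[layer] - data[node], axis=1)
        links[node] = layer[np.argsort(d)[1 : m + 1]]  # skip the node itself
    return links

graphs = [build_links(layer) for layer in layers]

def greedy_search(query):
    current = layers[0][0]            # entry point on the (sparse) top layer
    for links in graphs:              # descend layer by layer
        while True:
            neighbors = links[current]
            d_curr = np.linalg.norm(data[current] - query)
            d_nbrs = np.linalg.norm(data[neighbors] - query, axis=1)
            if d_nbrs.min() >= d_curr:
                break                 # local minimum on this layer; go deeper
            current = neighbors[np.argmin(d_nbrs)]
    return current

print(greedy_search(rng.normal(size=8)))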

B. pgvector Distance Function

Besides indexing, the other key feature of pgvector is its distance functions. Based on the official pgvector documentation, four distance functions are supported:

  • <-> L2 Distance (Euclidean)
  • <=> Cosine Distance
  • <#> (Negative) Inner Product
  • <+> L1 Distance (added in ver. 0.7.0)

Let's discuss them one by one! A short SQL sketch showing all four operators follows this list.

  • <-> L2 Distance (Euclidean)
    L2 distance, or Euclidean distance, measures the straight-line distance between two points. It is sensitive to both vector magnitude and orientation. In Natural Language Processing (NLP), similarity is often measured by orientation rather than magnitude, so choose Euclidean distance when magnitude plays a part in determining similarity. (The code in section 4 visualizes two vectors that are semantically similar, pointing in the same direction, but have different magnitudes.) Figure 4 shows how the distance between points is measured with the L2 distance function.
Figure 4. L2 Distance Measurement
  • <=> Cosine Distance
    Cosine distance measures the angle, or orientation, between vectors. Magnitude is ignored, which is beneficial when comparing objects of different lengths, such as a document against a title. Figure 5 shows how the distance between points is measured with the cosine distance function.
Figure 5. Cosine Distance Measurement
  • <#> (Negative) Inner Product
    The inner product measures the projection of one vector onto another. It is sensitive to both orientation and magnitude. For normalized vectors, the inner product is equivalent to cosine similarity. Figure 6 shows how the distance between points is measured with the inner product function.
Figure 6. Inner Product Distance Measurement
  • <+> L1 Distance (added in ver. 0.7.0)
    L1 and L2 distance are similar in that both measure the distance between two points; the difference is that L1 measures it along a grid-like path (it is also known as Manhattan distance). Use L1 distance when you care about the sum of the absolute differences between coordinates. L1 is also less sensitive to outliers than L2, because L2 squares the differences. Figure 7 shows how the distance between points is measured with the L1 distance function.
Figure 7. L1 Distance Measurement
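
Here is the sketch promised above: all four operators side by side on literal vectors. This is a minimal example assuming the sandbox connection from earlier and pgvector >= 0.7.0 (required for <+>):

import psycopg2

with psycopg2.connect("postgresql://postgres:password@localhost:5432/pgvector_sandbox") as conn:
    cur = conn.cursor()
    cur.execute("""
        SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance,
               '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance,
               '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product,
               '[1,2,3]'::vector <+> '[4,5,6]'::vector AS l1_distance;
    """)
    print(cur.fetchone())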

4. Explore Distance Function and Indexing Method in pgvector

Let's do a little hands-on to better understand the distance functions and indexing methods of pgvector. You can find the full code here.

Before we do anything, make sure you follow the guide in the README file and prepare the database.

  • Exploring the Built-In Distance Functions in pgvector

In this hands-on we won't jump into pgvector just yet. Our aim here is to understand the difference between L2 distance, cosine distance, inner product, and L1 distance.

First, import the necessary libraries. If you haven't installed them yet, install the dependencies from requirements.txt:

pip install -r /path/to/requirements.txt

Then import the library.

import numpy as np
import matplotlib.pyplot as plt

We will create two sample vectors with two dimensions.

# Example vectors. Both are pointing to the same direction (semantically similar), but have different magnitude.
# Vector A: (3, 3)
# Vector B: (6, 8)

a = np.array((3,3))
b = np.array((6,8))

V = np.array([a, b])
origin = np.array([[0, 0],[0, 0]]) # origin point

Then we create the plot

# set max x and y
plt.xlim(-max(V[:, 0]) - 1, max(V[:, 0]) + 1)
plt.ylim(-max(V[:, 1]) - 1, max(V[:, 1]) + 1)

plt.quiver(*origin, V[:,0], V[:,1], color=['r','b'], scale=25)
plt.show()

It will produce a plot of the two vectors, both pointing in the same direction but with different magnitudes.

Then we compare the result of each distance function using this code.

# L2 Distance / Euclidean Distance
# Smaller number (nearer distance) means more similar
l2_dist = np.linalg.norm(a - b)
print("L2 Distance: ", l2_dist)

# Cosine Similarity
# Result closer to one means more similar (cosine distance = 1 - similarity)
from numpy import dot
from numpy.linalg import norm

cosine = dot(a, b) / (norm(a) * norm(b))
print("Cosine Similarity: ", cosine)

# Inner Product
# Larger result means more similar
inner = np.inner(a, b)
print("Inner Product: ", inner)

# L1 Distance (Manhattan)
# Smaller number (nearer distance) means more similar
l1_dist = np.linalg.norm(a - b, ord=1)
print("L1 Distance: ", l1_dist)

It will return this result.

L2 Distance:  5.830951894845301
Cosine Similarity:  0.9899494936611667
Inner Product: 42
L1 Distance: 8.0

From this result we can see how the distance functions differ. L2 distance reports a large distance between vector a and vector b, while the cosine similarity is almost 1, meaning it evaluates the two vectors as very similar. We can conclude that L2 distance is indeed sensitive to magnitude, while cosine is not. The inner product also returns a high value, evaluating the two vectors as similar, while L1 distance returns a larger number than L2, meaning that by L1 the vectors are even further apart.
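
As a quick check of the earlier claim that the inner product of normalized vectors is equivalent to cosine similarity, we can normalize a and b and recompute (continuing from the code above):

# Normalize both vectors to unit length; their inner product
# now matches the cosine similarity computed earlier (~0.98995).
a_n = a / norm(a)
b_n = b / norm(b)
print("Inner product of normalized vectors: ", np.inner(a_n, b_n))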

Now we understand the different types of distance measurement in pgvector, but how about the indexing methods?

  • Exploring the Indexing Methods Available in pgvector

For this hands-on, we will use a pre-vectorized dataset publicly available on Hugging Face. First we load the dataset using the datasets library.

from datasets import load_dataset # HuggingFace dataset library

# Import hugging face dataset as dataframe
dataset = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")
df = dataset.to_pandas()

Let's see some of the data we loaded.

print("Show 5 items from the dataset:")
print(df.iloc[:5,:])
print("Vector Length: ",len(df.iloc[1]['emb']))

It will return something like this.

Show 5 items from the dataset:
id title text
0 0 24-hour clock The 24-hour clock is a way of telling the time... \
1 1 24-hour clock A time in the 24-hour clock is written in the ...
2 2 24-hour clock However, the US military prefers not to say 24...
3 3 24-hour clock 24-hour clock time is used in computers, milit...
4 4 24-hour clock In railway timetables 24:00 means the "end" of...

url wiki_id views
0 https://simple.wikipedia.org/wiki?curid=9985 9985 2450.625488 \
1 https://simple.wikipedia.org/wiki?curid=9985 9985 2450.625488
2 https://simple.wikipedia.org/wiki?curid=9985 9985 2450.625488
3 https://simple.wikipedia.org/wiki?curid=9985 9985 2450.625488
4 https://simple.wikipedia.org/wiki?curid=9985 9985 2450.625488

paragraph_id langs emb
0 0 30 [0.077112876, 0.3197174, -0.20515901, 0.630257...
1 1 30 [0.1961289, 0.51426697, 0.03864574, 0.5552187,...
2 2 30 [0.1391919, 0.17759687, -0.11669647, 0.3985595...
3 3 30 [0.12796868, 0.067080714, -0.079901434, 0.3950...
4 4 30 [0.07533602, 0.3530838, -0.08269313, 0.6591874...
Vector Length: 768

Now we will add the dataset to our table.

import os
import psycopg2
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector

# I use a .env file to store my credentials
load_dotenv()
PASSWORD = os.getenv("PASSWORD")

# Adjust accordingly
conn_string = f'postgresql://postgres:{PASSWORD}@localhost:5432/pgvector_sandbox'

# Create db connection
db = create_engine(conn_string)

with psycopg2.connect(conn_string) as conn:
    register_vector(conn)  # register the vector type to work with numpy arrays
    cursor = conn.cursor()

    query = '''
    -- CAUTION!! This will delete all data in this table. Proceed with caution.
    DROP TABLE IF EXISTS wiki_article;

    CREATE TABLE wiki_article (
        id INTEGER NOT NULL,
        title TEXT,
        text TEXT,
        url TEXT,
        wiki_id INTEGER,
        views FLOAT,
        paragraph_id INTEGER,
        langs INTEGER,
        emb VECTOR(768)
    );
    '''
    cursor.execute(query)

# Insert into db
with db.connect() as conn:
    df.to_sql('wiki_article', con=conn, if_exists='append', index=False)

Now let's compare the indexing methods. Before creating any index, we will first run a similarity search without one, so we have a baseline to compare against.

We have populated the table with 485,859 rows. That is small compared to real-world production data, but it is still enough to compare the indexing methods.

Before we start, we need to generate a vector embedding from our input string. We will use Gemini with the text-embedding-004 model for this. We want to find 10 famous places in the world, so this is our prompt:

Give me 10 famous places in the world.

To create the vector embedding, we need a model. Let's try the Gemini API. You need to sign in to Google AI Studio first and get a Gemini API key. Currently, the Gemini API is free of charge with some limitations; you can find more information about this here.

Our dataset's embedding length is the same as the Gemini API's embedding length. If you want to use another model, make sure the vector length is the same.

import google.generativeai as genai

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

prompt = "Give me 10 famous places in the world"

genai.configure(api_key=GEMINI_API_KEY)
prompt_embed = genai.embed_content(
    model="models/text-embedding-004",
    content=prompt,
    task_type="semantic_similarity",
)

print(str(prompt_embed['embedding'])[:50], '... TRIMMED]')  # Show part of the vector
print(len(prompt_embed['embedding']))  # Show dimension size

[-0.042778704, -0.010645142, -0.021331308, -0.0098 ... TRIMMED]
768

Now let's run a semantic similarity search from the input to the nearest articles. We will also start a timer to show how long the search takes.

We haven't applied any indexing method yet.

import time

print("Executing similarity search with cosine distance function with no index")

# Query the db
with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = f'''
    DROP INDEX IF EXISTS wiki_article_emb_idx_ivfflat;
    DROP INDEX IF EXISTS wiki_article_emb_idx_hnsw;
    WITH input AS (
        SELECT '{prompt_embed["embedding"]}'::VECTOR(768) AS ve
    )
    SELECT
        title,
        1 - (emb <=> ve) AS similarity -- cosine, since our input string is much shorter than the article text
    FROM wiki_article JOIN input ON 1=1
    ORDER BY 1 - (emb <=> ve) DESC
    LIMIT 10;
    '''
    # Mark the start of the execution
    start_time = time.time()
    cursor.execute(query)
    print("--- %s seconds ---" % (time.time() - start_time))

    # Show data
    for i in cursor.fetchall():
        print(i)

Executing similarity search with cosine distance function with no index
--- 1.0352156162261963 seconds ---
('Juneau, Alaska', 0.10167275595485259)
('Singrauli district', 0.10040002682733795)
('Judenburg District', 0.09355199853086849)
('Squash (drink)', 0.09190088175644828)
('Knittelfeld District', 0.09044343345245787)
('Claypool, Arizona', 0.09006537389065139)
('Juneau, Alaska', 0.0898781109338872)
('Juneau, Alaska', 0.08747756723047428)
('Unionville, Missouri', 0.08697072096916048)
('Lugu Lake', 0.08659680950396198)

The code above executed in about 1 second before applying any index (the execution time may differ for every run). Now let's apply an index, starting with IVFFlat!

First we need to create the index. Based on the official documentation, there are three keys to achieving good recall with an IVFFlat index:

  1. Create the index after the table has some data.
  2. Choose an appropriate number of lists. A good starting point is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows. A "list" is a cluster; this parameter determines how many clusters are created.
  3. Specify an appropriate number of probes, i.e. the number of clusters examined during a search. A higher value provides better recall at the cost of speed; a good place to start is sqrt(lists). For example:
     SET ivfflat.probes = 10;

So we will set lists to 485859 / 1000 ≈ 485, and probes to sqrt(485) ≈ 22.
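
In code, the arithmetic is simply (a small helper following the documentation's rule of thumb):

import math

rows = 485_859
lists = rows // 1000 if rows <= 1_000_000 else round(math.sqrt(rows))
probes = round(math.sqrt(lists))
print(lists, probes)  # 485 22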

If you encounter the error below, increase maintenance_work_mem in postgresql.conf to more than 122 MB (I used 150 MB), then restart the database.

ProgramLimitExceeded: memory required is 122 MB, maintenance_work_mem is 64 MB

Run this query to create the IVFFlat index.

with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = '''
    DROP INDEX IF EXISTS wiki_article_emb_idx_ivfflat;
    CREATE INDEX wiki_article_emb_idx_ivfflat ON wiki_article USING ivfflat (emb vector_cosine_ops) WITH (lists = 485);
    SET ivfflat.probes = 22;
    '''
    # Mark the start of the execution
    start_time = time.time()
    cursor.execute(query)
    print("--- %s seconds ---" % (time.time() - start_time))

--- 37.56687045097351 seconds ---

We found that the build time is about 38 seconds (it may vary). Now that the IVFFlat index is in place, let's run the same query. Will it be faster?

import time

print("Executing similarity search with cosine distance function with IVFFlat index")

# Query the db
with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = f'''
    WITH input AS (
        SELECT '{prompt_embed["embedding"]}'::VECTOR(768) AS ve
    )
    SELECT
        title,
        1 - (emb <=> ve) AS similarity -- cosine, since our input string is much shorter than the article text
    FROM wiki_article JOIN input ON 1=1
    ORDER BY 1 - (emb <=> ve) DESC
    LIMIT 10;
    '''
    # Mark the start of the execution
    start_time = time.time()
    cursor.execute(query)
    print("--- %s seconds ---" % (time.time() - start_time))

    # Show data
    for i in cursor.fetchall():
        print(i)

Executing similarity search with cosine distance function with IVFFlat index
--- 0.8755912780761719 seconds ---
('Juneau, Alaska', 0.10167275595485259)
('Singrauli district', 0.10040002682733795)
('Judenburg District', 0.09355199853086849)
('Squash (drink)', 0.09190088175644828)
('Knittelfeld District', 0.09044343345245787)
('Claypool, Arizona', 0.09006537389065139)
('Juneau, Alaska', 0.0898781109338872)
('Juneau, Alaska', 0.08747756723047428)
('Unionville, Missouri', 0.08697072096916048)
('Lugu Lake', 0.08659680950396198)

We found that the query time is around 0.87 seconds, faster than the un-indexed run, though not dramatically so on this dataset. How about the HNSW index?
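
One way to check whether the planner actually used the index is to prepend EXPLAIN to the query; here is a minimal sketch, reusing conn_string and prompt_embed from above. Note that, per the pgvector documentation, the index is only matched when you ORDER BY the distance operator directly in ascending order (ORDER BY emb <=> ve LIMIT 10); expressions like 1 - (emb <=> ve) DESC can fall back to a sequential scan, so if the plan shows Seq Scan, try rewriting the ORDER BY.

with psycopg2.connect(conn_string) as conn:
    cursor = conn.cursor()
    cursor.execute(f'''
    EXPLAIN
    SELECT title
    FROM wiki_article
    ORDER BY emb <=> '{prompt_embed["embedding"]}'::VECTOR(768)
    LIMIT 10;
    ''')
    for row in cursor.fetchall():
        print(row[0])  # look for "Index Scan using wiki_article_emb_idx_ivfflat"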

First, drop the previous index and create a new HNSW index.

Based on the official documentation, the HNSW index has two parameters:

  • m is the maximum number of connections per layer (16 by default)
  • ef_construction is the size of the dynamic candidate list used when constructing the graph (64 by default). A higher ef_construction provides better recall at the cost of index build time / insert speed.

Index Options

Based on the paper by Malkov and Yashunin [4], a reasonable range for m is 5 to 48. A smaller m produces better results for lower recall and/or lower-dimensional data, while a bigger m is better for high recall and/or high-dimensional data. This parameter also determines the memory consumption of the algorithm, so use it with care.

As for ef_construction, it is best to keep the number relatively small, since increasing it has little impact on search performance in exchange for significantly longer construction time.

Query Options

Besides tuning the index itself, we can also tune search speed at query time. The ef_search parameter is similar to ef_construction, but applies to the search: it sets the size of the dynamic candidate list used during a query (40 by default). Since I haven't found any recommendation for this parameter, I will use the default value.
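
For reference, changing it is a one-line, per-session setting; a sketch assuming the same connection as before:

with psycopg2.connect(conn_string) as conn:
    cursor = conn.cursor()
    # Higher ef_search = better recall, slower queries (default is 40).
    # The setting only lasts for the current session.
    cursor.execute("SET hnsw.ef_search = 100;")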

Creating an index needs a lot of disk space and memory. We already increased maintenance_work_mem to 150 MB when creating the IVFFlat index. For the HNSW index, you might also need to increase shared memory. I already set shm_size when building the database container; you can check the README file in my repo.

If you encounter an error like the one below, you might need to increase shm_size further (following the error message) when building the container.

DiskFull: could not resize shared memory segment “/PostgreSQL.3018413434” to 154177056 bytes: No space left on device 

Run this code to build the HNSW index.

with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = '''
    DROP INDEX IF EXISTS wiki_article_emb_idx_hnsw;
    CREATE INDEX wiki_article_emb_idx_hnsw ON wiki_article USING hnsw (emb vector_cosine_ops) WITH (m = 16, ef_construction = 64);
    SET hnsw.ef_search = 40; -- the default; note that ivfflat.probes has no effect on an HNSW index
    '''
    # Mark the start of the execution
    start_time = time.time()
    cursor.execute(query)
    print("--- %s seconds ---" % (time.time() - start_time))

--- 1555.6995317935944 seconds ---

Building the HNSW index takes a lot of compute. Mine took about 1,555 seconds, or roughly 26 minutes!!
But let's see: is it faster than IVFFlat?

Let's run the same similarity-search query we used for IVFFlat.

import time

print("Executing similarity search with cosine distance function with HNSW index")

# Query the db
with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = f'''
    WITH input AS (
        SELECT '{prompt_embed["embedding"]}'::VECTOR(768) AS ve
    )
    SELECT
        title,
        1 - (emb <=> ve) AS similarity -- cosine, since our input string is much shorter than the article text
    FROM wiki_article JOIN input ON 1=1
    ORDER BY 1 - (emb <=> ve) DESC
    LIMIT 10;
    '''
    # Mark the start of the execution
    start_time = time.time()
    cursor.execute(query)
    print("--- %s seconds ---" % (time.time() - start_time))

    # Show data
    for i in cursor.fetchall():
        print(i)

Executing similarity search with cosine distance function with HNSW index
--- 0.7794852256774902 seconds ---
('Juneau, Alaska', 0.10167275595485259)
('Singrauli district', 0.10040002682733795)
('Judenburg District', 0.09355199853086849)
('Squash (drink)', 0.09190088175644828)
('Knittelfeld District', 0.09044343345245787)
('Claypool, Arizona', 0.09006537389065139)
('Juneau, Alaska', 0.0898781109338872)
('Juneau, Alaska', 0.08747756723047428)
('Unionville, Missouri', 0.08697072096916048)
('Lugu Lake', 0.08659680950396198)

Hmm, it's not that much different from IVFFlat. That might be because our dataset is not that big, or because we need to tune the parameters.

The official documentation doesn't suggest starting values for these parameters. Just keep in mind that a large m and ef_construction will negatively impact the build speed and the query speed.
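
If you do have the compute to experiment, tuning just means rebuilding the index with different values; here is a sketch (the m and ef_construction values are arbitrary examples, not recommendations):

with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()
    cursor.execute('''
    DROP INDEX IF EXISTS wiki_article_emb_idx_hnsw;
    -- Larger m / ef_construction: better recall, but longer build and more memory.
    CREATE INDEX wiki_article_emb_idx_hnsw ON wiki_article
    USING hnsw (emb vector_cosine_ops) WITH (m = 24, ef_construction = 100);
    ''')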

Building HNSW is time-consuming and requires a lot of computing power. Since I deployed my database in a container, creating an HNSW index takes a long time, so I will skip the parameter-tuning step for HNSW and continue with the hands-on.

If you want a deeper comparison between IVFFlat and HNSW, you can refer to this blog. The difference is quite large in terms of build speed and query speed.

5. Use Case: Semantic Search using pgvector and Gemini.

We have explored the capabilities of pgvector. Let's put this knowledge to work in a real-world use case: a search engine for movies, where users can find similar movies based on a plot description.

What will we do in this hands-on?

  1. Insert the Wikipedia movie plot dataset, retrieved from Kaggle by JustinR.
  2. Create chunks of the movie plots so they fit within the Gemini API request size limit.
  3. Create vector embeddings of the plots using Gemini and store them in the database.
  4. Create an index for the table. We will use IVFFlat to minimize build time.
  5. Create a function that takes user input and generates its vector embedding.
  6. Search for similarity between the user input and the movies.

This code is based on Google’s tutorial.
Let’s insert the data first!

# A different source dataset
FILE_PATH = "./dataset/wiki_movie_plots_deduped.csv"

# Import the csv file as a dataframe
df = pd.read_csv(FILE_PATH)

# Insert into db
with db.connect() as conn:
    df.to_sql('movies_plot', con=conn, if_exists='replace', index=False)

with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = '''
    -- Add an ID column and set it as the primary key
    ALTER TABLE movies_plot
    ADD COLUMN movie_id INTEGER PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY;

    SELECT * FROM movies_plot LIMIT 10;
    '''
    cursor.execute(query)

    # Print the rows
    for i in cursor.fetchall():
        print(i)

  • Create chunks of the movie plots so they fit within the Gemini API request size limit.

We will only generate vector embeddings for the movies' plots. The plot text can be long, so we need to split it into smaller chunks to match the API request size limit. For demonstration, we will use a chunk size of 500 characters (the splitter below measures length in characters, not tokens).

Please note that at the moment the Gemini API free tier for the text-embedding-004 model has a limit of 1,500 requests per minute. Our dataset has about 35,000 rows, far over the limit, so we will only send about 800 rows to the API.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Retrieve 800 rows from the database
with psycopg2.connect(conn_string) as conn:
    query = '''
    SELECT movie_id, "Plot"
    FROM movies_plot LIMIT 800;
    '''
    df_movie = pd.read_sql_query(query, con=conn)

# Split each plot into chunks of at most 500 characters
text_splitter = RecursiveCharacterTextSplitter(
    separators=[".", "\n"],
    chunk_size=500,
    chunk_overlap=0,
    length_function=len,
)

chunks = []

for idx, row in df_movie.iterrows():
    movie_id = row["movie_id"]
    plot = row["Plot"]
    splits = text_splitter.create_documents([plot])
    for s in splits:
        chunks.append({"movie_id": movie_id, "content": s.page_content})

  • Create vector embeddings of the plots using Gemini.

We will send every chunk to the Gemini API. It will create a 768-dimensional vector for each chunk of text.

# Helper function to retry failed API requests with exponential backoff
def retry_with_backoff(func, *args, retry_delay=5, backoff_factor=2, **kwargs):
    max_attempts = 10
    for attempt in range(max_attempts):
        try:
            response = func(*args, **kwargs)
            time.sleep(1)  # stay under the requests-per-minute limit
            return response
        except Exception as e:
            print(f"error: {e}")
            wait = retry_delay * (backoff_factor ** (attempt + 1))
            print(f"Retry after waiting for {wait} seconds...")
            time.sleep(wait)

batch_size = 5
for i in range(0, len(chunks), batch_size):
    batch = chunks[i : i + batch_size]
    request = [x["content"] for x in batch]
    response = retry_with_backoff(
        genai.embed_content,
        model="models/text-embedding-004",
        content=request,  # embed the chunk batch, not the earlier prompt
        task_type="semantic_similarity",
    )
    # Store the retrieved vector embedding for each chunk back
    for x, e in zip(batch, response["embedding"]):
        x["embedding"] = e

# Store the generated embeddings in a pandas dataframe
plot_embeddings = pd.DataFrame(chunks)
plot_embeddings.head()

We have the embedded data; now we will store it in our database.

with psycopg2.connect(conn_string) as conn:
    register_vector(conn)  # register the vector type to work with numpy arrays
    cursor = conn.cursor()

    query = '''
    -- CAUTION!! This will delete all data in this table. Proceed with caution.
    DROP TABLE IF EXISTS movies_plot_embed;

    CREATE TABLE movies_plot_embed (
        movie_id INTEGER NOT NULL REFERENCES movies_plot(movie_id),
        content TEXT,
        embedding VECTOR(768)
    );
    '''
    cursor.execute(query)

# Insert into db
with db.connect() as conn:
    plot_embeddings.to_sql('movies_plot_embed', con=conn, if_exists='append', index=False)

  • Create an index for the table, using IVFFlat to minimize build time.

Now we will create an index on our newly created table. We will use IVFFlat, since our data is small and we want to minimize build time.

We will use a lists value of 100.

with psycopg2.connect(conn_string) as conn:
    conn.autocommit = True
    cursor = conn.cursor()

    query = '''
    DROP INDEX IF EXISTS movie_plot_embed_idx_ivfflat;
    CREATE INDEX movie_plot_embed_idx_ivfflat ON movies_plot_embed USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
    SET ivfflat.probes = 10;
    '''
    # Mark the start of the execution
    start_time = time.time()
    cursor.execute(query)
    print("--- %s seconds ---" % (time.time() - start_time))

  • Create a function that takes user input and generates its vector embedding.

Then we generate the vector embedding for our user input.

prompt = "Thriller movies with vampire and warewolf"

genai.configure(api_key=GEMINI_API_KEY)
prompt_embed = genai.embed_content(
model="models/text-embedding-004",
content=prompt,
task_type="semantic_similarity",

)

print(str(prompt_embed['embedding'])[:50], '... TRIMMED]') # Show some part of the vector
print(len(prompt_embed['embedding'])) # Show dimension size
  • Search for similarity between the user input and the movies.

After that, we search for movies similar to our prompt.

print("Executing similarity search with cosine distance function with IVFFlat index")

# Get data from db
with psycopg2.connect(conn_string) as conn:
conn.autocommit = True
cursor = conn.cursor()

query = f'''
WITH input AS (
SELECT '{prompt_embed["embedding"]}'::VECTOR(768) AS ve
),
similar_movies as (
SELECT
movies_id,
1 - (emb <=> ve) similarity
FROM movies_plot_embed join input on 1=1
ORDER BY similarity desc
LIMIT 10
)
SELECT
title,
origin,
director,
cast,
genre
FROM movies_plot
WHERE movie_id in (SELECT movie_id FROM similar_movies)

'''
# marks the start of the execution
start_time = time.time()
cursor.execute(query)
print("--- %s seconds ---" % (time.time() - start_time))

# Show data
for i in cursor.fetchall():
print(i)

We did it!! 🎉
We have created a semantic similarity search using pgvector and Gemini!

6. What’s Next?

We have built a simple semantic similarity search in a local deployment. Computing power is one of the limitations of local deployment; a next step is to try this hands-on in a cloud environment, which provides more computing power. You can check Google's tutorial on semantic similarity search in the Google Cloud environment.

You can also try a bigger dataset for this demo. Here we used two datasets, of around 500,000 rows and 800 rows.

Parameter tuning is another topic you can explore when learning pgvector. It affects index build speed, query speed, and recall.

This demo didn't focus on the accuracy of the similarity results. That is another interesting topic to explore.

[1] M. T. Pilehvar and J. Camacho-Collados, Embeddings in natural language processing: Theory and Advances in Vector Representations of Meaning. Morgan & Claypool Publishers, 2020.

[2] “Google Code Archive — Long-term storage for Google Code Project Hosting.” https://code.google.com/archive/p/word2vec/

[3] “OpenAI Platform.” https://platform.openai.com/docs/guides/embeddings

[4] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, Apr. 2020, doi: 10.1109/tpami.2018.2889473.
