Vector similarity search with duckdb

Chang She · LanceDB · 4 min read · Oct 3, 2022

I’ve been playing with duckdb a bit lately (who hasn’t?). The good folks at DuckDB Labs just released a postgres scanner to query postgres tables directly from duckdb. Since I’ve been using duckdb for computer vision, I decided to do a quick hack to enable vector similarity computations using the new postgres scanner.

The Setup

First, we set up an embeddings table in postgres like this:

[Image: embeddings table for the Oxford Pet dataset]

Here the id column was generated using a uuid per image, and the embeddings were extracted from a trained classification model.
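The post doesn’t show the exact DDL, so here’s a minimal sketch of what such a table could look like — the table name, column names, and embedding dimension are all illustrative assumptions, not taken from the original setup:

```sql
-- Hypothetical schema (names and dimension are illustrative).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id UUID PRIMARY KEY,     -- one uuid per image
    embedding vector(128)    -- feature vector from the trained classifier
);
```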

The Goal

We want to get to something like this:

[Image: results table sorted in descending order by cosine similarity score to the first row]

The Approach

Given a particular id, we want to be able to rank this table by the cosine similarity of the embedding column to that id’s embedding. For example, we can select a particular row using the id and see that it corresponds to a samoyed:
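As a sketch (the table name and id value here are made up for illustration), that lookup is just a filter on the primary key:

```sql
-- Fetch the query row by its uuid (the id literal is hypothetical).
SELECT id, embedding
FROM embeddings
WHERE id = '4c4b4b2c-0000-0000-0000-000000000000';
```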

Vector similarity

Using pgvector, we get a vector data type for postgres, which is the data type of the embedding column. The pgvector extension also defines distance functions like cosine_distance, l2_distance, and inner_product.
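pgvector also exposes shorthand operators for these metrics (`<=>` for cosine distance, `<->` for L2, `<#>` for negative inner product). A sketch of the ranking query — with illustrative table and id names, and noting that `<=>` returns cosine *distance*, so similarity is one minus that:

```sql
-- Rank all rows by cosine similarity to one query embedding.
-- Names and the id literal are illustrative assumptions.
SELECT e.id,
       1 - (e.embedding <=> q.embedding) AS score
FROM embeddings e,
     (SELECT embedding FROM embeddings
      WHERE id = '4c4b4b2c-0000-0000-0000-000000000000') AS q
ORDER BY score DESC;
```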

And if you have a lot of vectors (or high-dimensional ones), you can create an index to speed up the search:

CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);
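Per the pgvector docs, the ivfflat index also takes a `lists` parameter that trades recall for speed, and it only kicks in when the query orders by the matching distance operator. A sketch (the `lists` value and the query vector are illustrative):

```sql
-- More lists = faster but less exact; 100 is just an example value.
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- The index is used when ordering by the matching operator (<=> for cosine):
SELECT * FROM items ORDER BY embedding <=> '[1,2,3]' LIMIT 5;
```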

Duckdb Postgres Scanner

The postgres_scanner from duckdb enables you to query postgres tables directly from a running postgres instance:

SELECT * FROM postgres_scan('dbname=myshinydb', 'public', 'mytable'); 
SELECT * FROM postgres_scan_pushdown('dbname=myshinydb', 'public', 'mytable');

There are two gaps here:

  1. The duckdb postgres scanner has limited type support, so if you try to query a table with an embedding column, you get an error about unsupported types (https://github.com/duckdblabs/postgres_scanner/blob/main/postgres_scanner.cpp#L198).
  2. The extension doesn’t know to push down the cosine_distance function to postgres and it’s not a registered function in duckdb.

Fortunately, we can add a quick hack to the extension to add this functionality. We define a new table function, similarity, which takes an id and returns a table with a score column.
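The real signature lives in the hack branch linked below; a hypothetical invocation, assuming it mirrors postgres_scan’s connection/schema/table arguments plus the query id, might look like:

```sql
-- Hypothetical call shape -- check the similarity branch for the
-- actual signature. Returns the scanned table plus a score column.
SELECT * FROM similarity('dbname=myshinydb', 'public', 'embeddings',
                         '4c4b4b2c-0000-0000-0000-000000000000')
ORDER BY score DESC;
```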

You can find the source code on GitHub (note that it’s the similarity branch). The absolute quickest way to hack this was to create a temporary table with the results of the embeddings search query, and then use that table to generate the PostgresBindData.

There are certainly a lot better ways of doing this, but this was the quickest :)

Putting it all together

Following the directions for the postgres scanner extension, we build it from source using make and then load the extension from the duckdb CLI.

You can set up pgvector however you’d like. In the hack repo, I used a pre-made Docker image, ankane/pgvector; after that, all you have to do is install the pgvector extension by running the SQL command:

CREATE EXTENSION IF NOT EXISTS vector;

This is all assembled by the docker-compose.yml file on GitHub, so you can just run docker compose up db.

Conclusion

Vector similarity is an extremely important task in computer vision (and really in ML in general). Being able to do it in duckdb means more and more computer vision queries can be run in duckdb directly. For the postgres scanner, it would be great to be able to 1) set a default connection string and 2) push the function evaluation down to postgres. Ultimately, though, it would be great to see a native vector search extension in duckdb.

Links

https://github.com/pgvector/pgvector

https://github.com/changhiskhan/postgres_scanner/tree/similarity
