Leveraging the power of Vector databases (CrateDB) with OpenAI’s CLIP model

Abdulkhader Sakivelu · DataPebbles · Mar 23, 2024

Vector databases have been around for some time, but with the advent of generative AI they are gaining more prominence.

Let's see why.

Vector databases are designed explicitly to store and manage high-dimensional vector data, which represents the semantics of unstructured data such as text, video, and audio.

They differ from traditional databases, which offer mechanisms for retrieving exact matches. Vector similarity search instead lets users find semantically similar texts or images, an operation also known as k-Nearest-Neighbor (kNN) search.
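To make the idea concrete, here is a minimal, self-contained sketch (not part of the demo code, and assuming NumPy is installed) that ranks a handful of toy vectors by cosine similarity to a query vector. This is, in essence, what a vector database does at scale with purpose-built indexes.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity measures the angle between two vectors, ignoring their length.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_search(query, stored, k=3):
    # Score every stored vector against the query and return the k best matches.
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 4-dimensional "embeddings"; real models produce hundreds of dimensions.
stored = {
    "red_shirt.jpg": np.array([0.9, 0.1, 0.0, 0.2]),
    "blue_jeans.jpg": np.array([0.1, 0.8, 0.3, 0.0]),
    "white_shirt.jpg": np.array([0.8, 0.2, 0.1, 0.3]),
}
query = np.array([0.85, 0.15, 0.05, 0.25])  # pretend this encodes the text "a shirt"
print(knn_search(query, stored, k=2))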

In this blog, we will walk through an implementation that stores image embeddings produced by a multimodal model and builds text-to-image search on top of them, so that images can be retrieved using natural-language queries.

We will be using an e-commerce dataset from Kaggle. It contains images of fashion products in JPG format. The original dataset of 25 GB is a bit too large for the demo, so I will be using a smaller version of about 280 MB, which contains approximately 45k images.

The core technological components of this implementation are:

CrateDB

CrateDB is a distributed SQL database management system that integrates a fully searchable document-oriented data store. It is open-source, written in Java, based on a shared-nothing architecture, and designed for high scalability.

CrateDB has built-in support for storing and querying data in vector format, along with a kNN match function that returns the stored vectors nearest to the one we are searching with.

More information about CrateDB can be found here.

CLIP

CLIP is a revolutionary model that introduced joint training of a text encoder and an image encoder to connect two modalities.

The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. More details about the model can be found here.

We will be using the pre-trained version of CLIP available in the Transformers library for the example.
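As a rough illustration of what that looks like (the exact model wiring lives in the repository's conf module, so treat the checkpoint name below as an assumption), the pre-trained weights can be loaded from the Hugging Face Hub like this:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor, CLIPTokenizer

# "openai/clip-vit-base-patch32" projects both images and text into 512-dimensional embeddings.
checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)

image_embedding = model.get_image_features(**processor(images=Image.open("example.jpg"), return_tensors="pt"))
text_embedding = model.get_text_features(**tokenizer(["a white shirt"], return_tensors="pt"))
print(image_embedding.shape, text_embedding.shape)  # both torch.Size([1, 512])

Because both encoders project into the same 512-dimensional space, a text embedding can be compared directly against stored image embeddings.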

DOCKER

We will be using Docker to set up a single node CrateDB locally on our computer.

docker run --publish=4200:4200 --publish=5432:5432 --env CRATE_HEAP_SIZE=1g --pull=always crate

The above command also exposes CrateDB's Admin UI on port 4200 of localhost.

I created a table from the Console tab of the Admin UI using the statement below.

CREATE TABLE retail_data (
    filename STRING,
    embeddings FLOAT_VECTOR(512)
);

CrateDB stores vectors in the type ‘FLOAT_VECTOR’. The length of the vector has been kept to 512 to match the length of embeddings generated by CLIP.
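If you prefer to create the table programmatically instead of through the Admin UI, a minimal sketch with the official crate Python client (assuming the Docker setup above is running on localhost) looks like this:

from crate import client

# Connect to the single-node instance exposed on port 4200 by the Docker command above.
connection = client.connect("http://localhost:4200")
cursor = connection.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS retail_data (
        filename STRING,
        embeddings FLOAT_VECTOR(512)
    )
""")
cursor.close()
connection.close()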

Overview

The diagram below gives an overview of the implementation.

The source code for the above example can be found here. The structure of the project repository looks like this

It contains two scripts that drive the entire process.

train.py

The script takes an input dataset of images, iteratively generates embeddings for each image, and stores them in a table in CrateDB along with the file name.

The script uses the crate Python client to establish a connection with our local installation and a cursor to insert the data.
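Both scripts import their model and database handles from a conf module in the repository. Its exact contents are not reproduced here, but a minimal sketch of what ModelConf and CrateConf could look like (the checkpoint name and connection URL are assumptions) is:

# conf.py -- hypothetical sketch; the repository's actual version may differ.
from crate import client
from transformers import CLIPModel, CLIPProcessor, CLIPTokenizer

CHECKPOINT = "openai/clip-vit-base-patch32"  # assumed checkpoint, 512-dimensional embeddings

class ModelConf:
    # Bundles the pre-trained CLIP model with its image processor and text tokenizer.
    def __init__(self):
        self.model = CLIPModel.from_pretrained(CHECKPOINT)
        self.processor = CLIPProcessor.from_pretrained(CHECKPOINT)
        self.tokenizer = CLIPTokenizer.from_pretrained(CHECKPOINT)

class CrateConf:
    # Wraps a connection to the local CrateDB instance started with Docker.
    def __init__(self, url="http://localhost:4200"):
        self.connection = client.connect(url)

    def get_cursor(self):
        return self.connection.cursor()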

#!/usr/bin/env python3
from conf import ModelConf, CrateConf
import argparse
import sys
import os
from PIL import Image


def create_folder(path):
    if not os.path.exists(path):
        os.makedirs(path)


def parse_args(args):
    parser = argparse.ArgumentParser(description="Generate embeddings for images")
    parser.add_argument(
        '--folder-path',
        help="Folder with image dataset which should be embedded",
        required=True
    )
    parser.add_argument(
        '--processed-path',
        help="Folder where images should be moved to after processing",
        required=False
    )
    return parser.parse_args(args)


def main(args):
    parsed_args = parse_args(args)
    kwargs = vars(parsed_args)
    print(kwargs)
    files_loc = kwargs['folder_path']
    processed_files = kwargs['processed_path']
    if processed_files is not None:
        create_folder(processed_files)
    if os.path.isdir(files_loc):
        files = os.listdir(files_loc)
        model = ModelConf().model
        processor = ModelConf().processor
        crate_cursor = CrateConf().get_cursor()
        results = []
        counter = 0
        file_counter = 1
        for file in files:
            file_name_path = files_loc + '/' + file
            print(f"processing file no: {file_counter}, name: {file}")
            # Generate a 512-dimensional embedding with CLIP's image encoder.
            image = Image.open(file_name_path)
            embedding = model.get_image_features(**processor(images=image, return_tensors="pt"))
            results.append((file_name_path, embedding.tolist()[0]))
            image.close()
            counter = counter + 1
            file_counter += 1
            # Optionally move the processed image so it is not re-processed after a failure.
            if processed_files is not None:
                os.rename(file_name_path, f"{processed_files}/{file}")
            # Insert embeddings in batches of ten.
            if counter == 10:
                crate_cursor.executemany("insert into retail_data (filename,embeddings) values (?,?)", results)
                print("inserted batch of 10 file embeddings")
                results = []
                counter = 0
        # Insert any remaining embeddings that did not fill a full batch.
        if len(results) > 0:
            crate_cursor.executemany("insert into retail_data (filename,embeddings) values (?,?)", results)
            print("inserted final batch")
        crate_cursor.close()
    else:
        raise Exception(f"{files_loc} Not a directory")


if __name__ == "__main__":
    main(sys.argv[1:])

search.py

This script is designed to accept a user query from stdin and return the name of the image that resembles the input query the most in our database.

In the backend, the script generates embeddings for the user input using the text encoder from CLIP, queries the database using the KNN matching function against the embedding, and returns the image with the highest matching score.

#!/usr/bin/env python3
from conf import CrateConf, ModelConf


def search_str(prmt):
    model = ModelConf().model
    tokenizer = ModelConf().tokenizer
    crate_cursor = CrateConf().get_cursor()
    # Encode the query text into a 512-dimensional embedding with CLIP's text encoder.
    text = model.get_text_features(**tokenizer([prmt], return_tensors="pt", truncation=True))
    embedding = text.tolist()[0]
    # Ask CrateDB for the stored image vector closest to the query embedding.
    query = f"SELECT filename FROM retail_data WHERE knn_match(embeddings, {embedding}, 2) ORDER BY _score DESC LIMIT 1"
    crate_cursor.execute(query)
    result = crate_cursor.fetchall()
    print(result[0])


if __name__ == "__main__":
    user_query = input("What are you looking for? \n")
    if len(user_query) > 0:
        search_str(user_query)
    else:
        print('Acceptable query format: String')
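To actually see the result rather than just its filename, a small illustrative variation (not part of the repository) can open the best match with Pillow. Note that the stored filename points to the folder that was indexed, so if you pass --processed-path to train.py the file will have been moved and the path may need adjusting.

from PIL import Image
from conf import CrateConf, ModelConf

def show_best_match(prmt):
    # Encode the query text and fetch the closest image, exactly like search_str above.
    model = ModelConf().model
    tokenizer = ModelConf().tokenizer
    cursor = CrateConf().get_cursor()
    embedding = model.get_text_features(**tokenizer([prmt], return_tensors="pt", truncation=True)).tolist()[0]
    cursor.execute(
        f"SELECT filename FROM retail_data WHERE knn_match(embeddings, {embedding}, 2) "
        "ORDER BY _score DESC LIMIT 1"
    )
    rows = cursor.fetchall()
    if rows:
        Image.open(rows[0][0]).show()  # the filename column stores the path that was indexed

if __name__ == "__main__":
    show_best_match("a white shirt with writing on it")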

Execution

Now that we understand the contents of the repository, let us start executing the code.

Before running the code, make sure you have:

  1. Set up a CrateDB instance.
  2. Created the retail_data table.
  3. Made your image dataset available locally.

We can start loading the embeddings into our table by running the script train.py.

It takes one mandatory named argument '--folder-path', which points to the folder containing the images, and an optional named argument '--processed-path'. When the optional argument is passed, processed files are moved into that folder to avoid re-processing in case of failures.

Example

python train.py --folder-path c:/Downloads/images/

This might take some time depending on the size of your dataset. The script inserts embeddings in batches of ten at a time.

Once the data has been loaded, we can start querying using search.py.

I queried 'Get me a white shirt with writing on it' and it gave me a pretty accurate result. If there is no exact match, the script returns the closest matching image.

Tip: Try using the same search term while the data is being loaded and check how the results keep varying as it looks for the closest match.

Conclusion

We saw how simple it is to set up text-to-image search with CrateDB. With minor changes, the same approach works with any other ML model that can generate embeddings.

Even though the demo uses a simple retail dataset, the approach can be customized into a sophisticated tool that delivers real business value.

Did you find this article helpful? Clap, share, and if you have any questions or suggestions about this article, please contact me at abdulkhader.sakivelu@datapebbles.com .
