Using Weaviate to Find Similar Images

Vector database in action to identify stamps

Estelle Scifo
8 min read · Apr 29, 2022
An English stamp on an envelope
Photo by Brett Jordan on Unsplash.

Problem statement

I have had a long-standing idea of building a machine-learning-powered tool to identify the stamps in my collection from an image of them (picture or scan). The end goal would be to produce an estimate for a whole collection from pictures of album pages, for instance.

I quickly got to the point of extracting a vector representation of images (an embedding), but didn’t know how to perform efficient pairwise comparisons to find the closest image.

Disclaimer: this is probably the first time I’ve had to deal with image data, so be indulgent! If you find any bad practice while reading this story, kindly let me know in a comment. Thanks.

A few months ago, I discovered Weaviate at a conference, and it seemed like a good candidate to solve the above issue. Let’s see how we can use it to identify stamps from a picture.

You said “Weaviate”?

Weaviate defines itself as:

an open source vector search engine that stores both objects and vectors, allowing for combining vector search with structured filtering with the fault-tolerance and scalability of a cloud-native database.

It is an awesome tool for NLP, letting users ask questions with a GraphQL-like syntax and get incredibly relevant results:

Image from the Weaviate GitHub repository: https://github.com/semi-technologies/weaviate

In short, Weaviate stores the vector representation of your observations and uses it when performing queries, so that you can easily find close vectors. Sounds interesting? Bear with me then!

Dataset

For this task, I extracted a few images from a stamp listing website. Since this is only a proof of concept, I didn’t spend time on scraping scripts and the like; I just downloaded the images manually. Here is a sample of the dataset used:

Example data

It contains around 200 stamps with various sizes, orientations, and colors. Not a large dataset, but good enough for a first prototype.

Show me the code

Let’s stop talking and actually build stuff. We’ll be using Jupyter notebooks and… Weaviate. I assume you know how to install and start Jupyter, so let’s see how to set up Weaviate on your local machine.

Firing up Weaviate in Docker

Weaviate is super easy to set up, since the only thing you need installed on your machine is Docker Compose.

Based on this tutorial, I used the docker-compose.yml file below:

# docker-compose.yml
version: '3.4'
services:
  weaviate:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: semitechnologies/weaviate:1.12.2
    ports:
      - 8080:8080
    restart: on-failure:0
    environment:
      IMG2VEC_INFERENCE_API: 'http://i2v-neural:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'img2vec-neural'
      ENABLE_MODULES: 'img2vec-neural'
      CLUSTER_HOSTNAME: 'node1'
  i2v-neural:
    image: semitechnologies/img2vec-pytorch:resnet50
    environment:
      ENABLE_CUDA: '0'

Then simply run:

docker-compose up

And your Weaviate container will be up and running after a few minutes, depending on your internet connection (the images need to be pulled first).
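You can confirm the instance is actually ready by querying its readiness endpoint, which returns a 2xx status code once Weaviate is up:

curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/v1/.well-known/ready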

Once started, it exposes endpoints to interact with the database schema and the data it contains. You can for instance inspect the data schema with the following command:

curl -s http://localhost:8080/v1/schema

But in the rest of this story, I’ll use the Python client for convenience, which can be installed with pip:

pip install weaviate-client

The first thing to do is to instantiate a new client. With the above configuration, it is as simple as:

import weaviate

client = weaviate.Client("http://localhost:8080")

We’ll use this client in the following sections to define the database schema, insert data and perform searches.
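Before moving on, it can be worth checking that the client actually reaches the running instance. A minimal sanity check, using the client’s is_ready and schema helpers, could look like this:

# quick sanity check: returns True once the containers are up and responding
print(client.is_ready())
# inspect the current schema (no classes yet on a fresh instance)
print(client.schema.get())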

Creating the class to store Stamps

In order to add data to our database, we first have to define its schema, made of a name, a list of properties, and the information Weaviate needs to vectorize the objects. Here is our definition of the “Stamp” class:

def define_schema(client):
    class_obj = {
        "class": "Stamp",
        "description": "Stamp with an image blob and a path to the image file",
        "properties": [
            {
                "dataType": ["blob"],
                "description": "Image",
                "name": "image"
            },
            {
                "dataType": ["string"],
                "description": "",
                "name": "path"
            }
        ],
        "vectorIndexType": "hnsw",
        "moduleConfig": {
            "img2vec-neural": {
                "imageFields": [
                    "image"
                ]
            }
        },
        "vectorizer": "img2vec-neural"
    }
    client.schema.create_class(class_obj)

define_schema(client)

We create a class named “Stamp”, with two properties (think “columns” in SQL):

  • path: the image path on disk (not strictly necessary, just stored for convenience)
  • image: the b64-encoded image itself, stored as a blob

We also tell Weaviate that the blob lives in the image property (in case there are several of them). Finally, we specify which encoder has to be used to vectorize each data point, here the img2vec-neural module. Each time we insert data into the “Stamp” collection, Weaviate will take the data in the image field and extract its vector representation with the img2vec-neural model. This vector is saved internally, so that it can be used for queries. But let’s not skip steps: in the next section, we are going to insert some data into the database.

Importing data

In order to import data, we’ve put all stamp images we downloaded into the data/stamps folder.

Stamp images like the ones displayed above come in various sizes and shapes. For them to be “comparable”, I apply a preprocessing step to “standardize” their size, using the Python OpenCV package.

import os
import cv2  # opencv-python package
import uuid
import base64

DATA_DIR = "data/stamps"
IMAGE_DIM = (100, 100)

def _prepare_image(file_path):
    img = cv2.imread(file_path)
    # resize image to a common size
    resized = cv2.resize(img, IMAGE_DIM, interpolation=cv2.INTER_LINEAR)
    return resized

Note: in a first version, I converted images to grayscale in the preparation step, but it turned out to be counterproductive and led to lower prediction scores.

Once the image is prepared, we can compute its base64 encoding and insert the data into Weaviate:

def insert_data(client):
    for file_name in os.listdir(DATA_DIR):
        file_path = os.path.join(DATA_DIR, file_name)
        img = _prepare_image(file_path)
        # encode image as a base64 string
        jpg_img = cv2.imencode('.jpg', img)
        b64_string = base64.b64encode(jpg_img[1]).decode('utf-8')
        # define properties as expected by the class definition
        data_properties = {
            "path": file_name,
            "image": b64_string,
        }
        # create a data object of type "Stamp" with a random UUID
        r = client.data_object.create(
            data_properties,
            "Stamp",
            str(uuid.uuid4())
        )
        print(file_path, r)

insert_data(client)

Our database now contains some data; let’s use it to find stamps from a picture.
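But first, as a quick check that the import went through, we can count the objects in the “Stamp” class and peek at one stored vector. This is a sketch using the client’s aggregate API and the _additional GraphQL field; the ResNet-50 model produces 2048-dimensional vectors:

# count the number of "Stamp" objects in the database
res = client.query.aggregate("Stamp").with_meta_count().do()
print(res["data"]["Aggregate"]["Stamp"][0]["meta"]["count"])  # ~200

# fetch one object together with its vector (2048 floats with ResNet-50)
res = (
    client.query
    .get("Stamp", ["path"])
    .with_additional(["vector"])
    .with_limit(1)
    .do()
)
print(len(res["data"]["Get"]["Stamp"][0]["_additional"]["vector"]))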

Finding similar images

When users upload a picture, it won’t necessarily be well centered and oriented. Before we can compare the image to the ones in our database, we need to clean it.

Image cleaning

Image cleaning is done by the clean_image function reproduced below. After converting the image to grayscale, we look for the largest contour and crop the image to it; then we resize the cropped image so that it matches the size of the images in our database:

def clean_image(image_path):
    img = cv2.imread(image_path)
    # convert to gray scale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 180, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(thresh, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    # find the largest contour
    cnt = max(contours, key=cv2.contourArea)
    # crop the image based on the largest contour
    x, y, w, h = cv2.boundingRect(cnt)
    cropped_contour = img[y:y+h, x:x+w]
    # resize (same size as the train data)
    resized = cv2.resize(cropped_contour, (100, 100), interpolation=cv2.INTER_LINEAR)
    return img, resized

Finally, the function returns both the initial and the prepared image, so that we can visualize them together. An example output is shown below.

Image transformation: cropping and resizing
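To see the effect of the cleaning step, the two returned images can be displayed side by side. Here is a minimal sketch using matplotlib (the file path is illustrative):

import matplotlib.pyplot as plt

original, cleaned = clean_image("data/test/photo.jpg")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
# OpenCV loads images as BGR, matplotlib expects RGB
ax1.imshow(cv2.cvtColor(original, cv2.COLOR_BGR2RGB))
ax1.set_title("Original")
ax2.imshow(cv2.cvtColor(cleaned, cv2.COLOR_BGR2RGB))
ax2.set_title("Cleaned")
plt.show()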

Searching the database

In order to search for the most similar image in the database, we’ll build a Weaviate query using the Python client:

def search(client, test_image_path):
    # prepare the query payload
    near_image = {
        "image": test_image_path,
    }
    query = (
        client.query
        .get(
            "Stamp",   # get data from class "Stamp"
            ["path"],  # return the "path" property
        )
        .with_near_image(
            near_image # find "Stamp" close to the image in embedding space
        )                # (the client base64-encodes the file at this path)
        .with_limit(3)   # limit results to the 3 best matches
    )
    # perform the query
    res = query.do()
    # return results
    # (res is a dict following the GraphQL response structure)
    return res["data"]["Get"]["Stamp"]

We can then use the last two functions in combination:

# read the test image and transform it
TEST_DIR = "data/test"  # folder containing the test pictures (adjust to your setup)
test_image = os.path.join(TEST_DIR, "photo.jpg")
_, prepared = clean_image(test_image)
# save the prepared image into a tmp file
test_image_prepared = "test_image.jpg"
cv2.imwrite(test_image_prepared, prepared)
# perform the query
res = search(client, test_image_prepared)
print(res)
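Since search already unwraps the GraphQL response, res is simply a list of dicts with the requested path property, roughly like this (the file names are illustrative):

[
    {"path": "stamp_042.jpg"},
    {"path": "stamp_017.jpg"},
    {"path": "stamp_101.jpg"},
]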

Results

The following images show a few results, obtained by comparing stamp pictures from my collection to the train data using the procedure described above.

Search result from the “student” stamp
Search result for the “Claudine” stamp (character from a series of novels by Colette)
Search result for Michelangelo sculpture
Search result for the comic “Blake and Mortimer — The Yellow Mark”
Search result for Marilyn by Andy Warhol

While we can see from the “Cleaned” image that the data preparation and image cleaning process can be improved (see next section), the results already look quite promising.

Next steps

This is a very simple approach which, in a real-life project, wouldn’t yet be ready for production. Here are some ideas to improve it and make it user-friendly:

  1. Improve image preparation: the tests performed here use pictures I took myself, which show the full stamp in the proper orientation; that won’t always be the case! Image preparation needs to be improved: for instance, taking the largest contour is probably not the best option.
  2. Extend the training dataset: right now, the train dataset contains few observations (about 200); we could use web scraping to download more images and more information about the stamps.
  3. Create a metric to evaluate the model: perhaps by labeling test images?
  4. Add more filters: maybe users already know the year the stamp was issued or the country it comes from? Usually, both pieces of information are written on the stamp itself, so we could also try to improve our CV approach to read them from the stamp, which would considerably narrow down the possible matches (see the sketch after this list).
  5. Build a UI for users to identify their stamps from a picture: users usually don’t use notebooks :)
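As an illustration of point 4, Weaviate can combine vector search with structured filtering out of the box. The sketch below assumes a hypothetical country property has been added to the Stamp class (our current schema doesn’t have one):

# hypothetical filter: only consider stamps from France
where_filter = {
    "path": ["country"],  # assumes a "country" property (not in our schema yet)
    "operator": "Equal",
    "valueString": "France",
}
res = (
    client.query
    .get("Stamp", ["path"])
    .with_near_image({"image": "test_image.jpg"})
    .with_where(where_filter)
    .with_limit(3)
    .do()
)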

If you want to play with the code, the notebooks used in this story are available on GitHub.

Going further

In addition to the inline links, you can explore the following resources to learn more about this topic:

