Image similarity with DINOv2 and FAISS

Earlier this year, MetaAI reached a notable milestone in the field of computer vision by open-sourcing DINOv2, a model trained on an impressive dataset of 142 million images. This release places DINOv2 in direct competition with OpenAI Clip, and initial assessments suggest it may even surpass it on certain tasks.

Yet, the available documentation can make it challenging to harness DINOv2’s capabilities. In this article, we will walk through the steps required to perform image similarity tasks, accompanied by a comprehensive assessment of its performance.

The significance of image similarity

In a prior story, we delved into the diverse applications of image similarity and demonstrated how OpenAI Clip facilitated this process.

For a quick recap, the following instances exemplify its relevance:

  1. Organizing visual content: Image similarity proves invaluable for assembling visually cohesive albums or slideshows. By grouping similar images together, a seamless narrative or thematic progression can be achieved.
  2. Content moderation: It serves as a robust tool for preventing the upload of unauthorized or inappropriate content. This application is particularly crucial in platforms where user-generated content is prevalent, ensuring a safe and compliant environment.
  3. Efficient content retrieval: In scenarios involving vast repositories of image files, finding specific visuals can be a daunting task. Image similarity streamlines this process by swiftly retrieving visually analogous images, saving time and resources.

Now, let’s explore how DINOv2 enhances and refines these applications.

Exploring DINOv2

DINOv2 is a versatile model with a wide array of applications, making it a powerful asset for various tasks. Some of its key applications include:

  1. Depth estimation: The process of determining the distance of objects in a scene from the viewpoint of a camera.
  2. Semantic segmentation: The precise classification of each pixel in an image into distinct categories, enabling a detailed understanding of object boundaries and their semantics.
  3. Image similarity: The quantification of visual resemblance between two or more images and retrieval by discerning similarities in their visual content.

DINOv2 is available in several models, each tailored to specific requirements:

  • Small: 21 million parameters, 85MB on disk, and extracted features with a dimensionality of 384.
  • Base: 86 million parameters, 331MB on disk, and extracted features with a dimensionality of 768.
  • Large: 300 million parameters, 1.2GB on disk, and extracted features with a dimensionality of 1024.
  • Giant: the largest of the available models, with 1.1 billion parameters, 4.3GB on disk, and extracted features with a dimensionality of 1536.

The diverse range of models ensures that DINOv2 can be tailored to fit a multitude of applications, from resource-conscious tasks to those requiring high precision.
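Since the dimensionality of the extracted features changes with the model, it can be handy to check it programmatically before creating the FAISS index. Here is a minimal sketch, assuming the facebook/dinov2-* checkpoints published on the Hugging Face Hub:

from transformers import AutoModel

#Map each DINOv2 variant to its Hugging Face checkpoint
checkpoints = {
    'small': 'facebook/dinov2-small',   #384-dimensional features
    'base': 'facebook/dinov2-base',     #768-dimensional features
    'large': 'facebook/dinov2-large',   #1024-dimensional features
    'giant': 'facebook/dinov2-giant',   #1536-dimensional features
}

model = AutoModel.from_pretrained(checkpoints['small'])
#hidden_size is the dimensionality to use when creating the FAISS index
print(model.config.hidden_size)   #384 for the small model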

Getting started

To search for similar images, we will follow several steps:

  1. Feature extraction: Begin by extracting the features of all images within a dataset. These features act as numerical representations of the visual characteristics of each image.
  2. Indexing with FAISS: Next, store these embeddings in a FAISS index. FAISS provides a powerful tool for performing rapid similarity searches, and it adds an abstraction layer that simplifies the process. (For more detailed information on FAISS, you can read this story.)
  3. Input image feature extraction: Extract the features of the input image to be used as a query for similarity search.
  4. Similarity search: Utilize the FAISS index to perform a similarity search using the features of the input image. Retrieve the top-3 images that are most similar.

To get started, several packages must be installed. It is recommended to use a virtual environment:

#Creating the virtual env
virtualenv venv-dinov2
source venv-dinov2/bin/activate
#Installing the packages
pip install transformers faiss-gpu torch Pillow
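Note: faiss-gpu assumes a CUDA-capable machine; if you are running on CPU only, installing faiss-cpu instead should work for the snippets below, only more slowly.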

Feature extraction and storage with the small model:

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import faiss
import numpy as np
import os

#Load the model and processor
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-small')
model = AutoModel.from_pretrained('facebook/dinov2-small').to(device)

#Populate the images variable with all the images in the dataset folder
images = []
for root, dirs, files in os.walk('./dataset'):
    for file in files:
        if file.endswith('jpg'):
            images.append(root + '/' + file)

#Define a function that normalizes embeddings and adds them to the index
def add_vector_to_index(embedding, index):
    #Convert embedding to numpy
    vector = embedding.detach().cpu().numpy()
    #Convert to float32 numpy
    vector = np.float32(vector)
    #Normalize vector: important to avoid wrong results when searching
    faiss.normalize_L2(vector)
    #Add to index
    index.add(vector)

#Create a FAISS index of type FlatL2 with 384 dimensions, as this
#is the dimensionality of the features extracted by the small model
index = faiss.IndexFlatL2(384)

import time
t0 = time.time()
for image_path in images:
    img = Image.open(image_path).convert('RGB')
    with torch.no_grad():
        inputs = processor(images=img, return_tensors="pt").to(device)
        outputs = model(**inputs)
    features = outputs.last_hidden_state
    add_vector_to_index(features.mean(dim=1), index)

print('Extraction done in :', time.time()-t0)

#Store the index locally
faiss.write_index(index, "vector.index")

Using the small model, the feature extraction process analyzed 1000 images in approximately 12 seconds. This indicates a processing rate of approximately 83 images per second.

To further improve throughput, you could provide images to the processor in batches, allowing multiple images to be handled simultaneously. This can significantly reduce the time required to extract the features, although it will consume more GPU memory.
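Here is a minimal sketch of what batched extraction could look like, reusing the processor, model and add_vector_to_index function from the snippet above (the batch size is an arbitrary value to tune to your GPU memory):

batch_size = 32   #assumption: adjust to your GPU memory
for start in range(0, len(images), batch_size):
    batch_paths = images[start:start + batch_size]
    batch_imgs = [Image.open(p).convert('RGB') for p in batch_paths]
    with torch.no_grad():
        inputs = processor(images=batch_imgs, return_tensors="pt").to(device)
        outputs = model(**inputs)
    #Mean-pool the patch tokens of each image and add the whole batch at once
    add_vector_to_index(outputs.last_hidden_state.mean(dim=1), index)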

We can now use FAISS to search for similar images.

import faiss
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

#Input image
image = Image.open('banana.jpg')

#Load the model and processor
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-small')
model = AutoModel.from_pretrained('facebook/dinov2-small').to(device)

#Extract the features
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt").to(device)
    outputs = model(**inputs)

#Normalize the features before search
embeddings = outputs.last_hidden_state
embeddings = embeddings.mean(dim=1)
vector = embeddings.detach().cpu().numpy()
vector = np.float32(vector)
faiss.normalize_L2(vector)

#Read the index file and perform a search for the top-3 images
index = faiss.read_index("vector.index")
d, i = index.search(vector, 3)
print('distances:', d, 'indexes:', i)

Note: The image search was performed using FAISS with a GPU, resulting in an impressively fast search time of 0.2 milliseconds.
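FAISS returns positions in the index rather than filenames, so you need to keep the list of image paths used at extraction time. Here is a minimal sketch, assuming you saved the images list from the extraction script to a JSON file (the filename below is an assumption):

import json

#Assumption: the images list was saved with json.dump(images, f) after extraction
with open('list_images.json') as f:
    images = json.load(f)

for rank, (dist, idx) in enumerate(zip(d[0], i[0]), start=1):
    print(f'Top-{rank}: {images[idx]} (L2 distance: {dist:.4f})')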

A first look at image similarity with the COCO dataset

For our initial evaluation, we will utilize the validation set from the COCO dataset, which comprises 5,000 images. We’ll be conducting image similarity searches with multiple input images, yielding the following results:

Input image and top-3 similar images

These initial results are already quite promising. Let’s delve even deeper into image similarity.

Benchmark against a larger dataset

Utilized dataset

To benchmark our models, we have chosen the DISC21 dataset, purposefully created for image similarity searches by MetaAI Research. Due to its substantial size of 350GB, we will be using a subset of 50,000 images available on Kaggle. You can find the dataset here.

Metrics employed

In terms of metrics, we will calculate:

  • Accuracy: the ratio of correctly predicted images to the total number of images.
  • Top-3 Accuracy: the ratio of times the correct image is found within the top three similar images to the total number of images.
  • Computational time: the time required to process the entire dataset.

Benchmark script

Firstly, we will examine the “ground truth” CSV file to identify which images, among the 50,000, have similar counterparts within the dataset. A preliminary review of the document shows that 529 images fall into this category.

Next, we will develop a script to locate similar images for these 529 identified images. Subsequently, we will compute both accuracy and top-3 accuracy.
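The benchmark script below expects this ground truth as a JSON mapping from query filename to reference filename. Here is a minimal sketch of the conversion, with hypothetical column names that must be adapted to the actual DISC21 CSV:

import csv
import json

ground_truth = {}
with open('ground_truth.csv', newline='') as f:
    for row in csv.DictReader(f):
        #'query_image' and 'reference_image' are placeholder column names
        ground_truth[row['query_image']] = row['reference_image']

with open('ground_truth.json', 'w') as f:
    json.dump(ground_truth, f)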

After having extracted the features of all the images in the dataset, stored them in a FAISS index and also stored the filenames, we can use the following script (to be adapted to your filenames):

import faiss
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import json
import time

#Assume we have stored the filenames in this JSON file
f = open('list_images_disc21.json')
images = json.load(f)
f.close()

#Assume you have stored the ground truth for the 529 images
#JSON is formatted as follows: { image1_filename: ground_truth_filename, ...}
f = open('ground_truth.json')
ground_truth = json.load(f)
f.close()

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

index = faiss.read_index("disc21_base.index")

positive = 0
top3_positive = 0
total = 0
t0 = time.time()

for filename in ground_truth:
    image = Image.open(filename)
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt").to(device)
        outputs = model(**inputs)
    #Convert and normalize before searching
    embeddings = outputs.last_hidden_state
    embeddings = embeddings.mean(dim=1)
    vector = embeddings.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    #Search top-3 images
    d, i = index.search(vector, 3)
    if ground_truth[filename] == images[i[0][0]]:
        positive += 1
    for res in i[0]:
        if ground_truth[filename] == images[res]:
            top3_positive += 1
    total += 1

print('Accuracy:', positive/total)
print('Top3-Accuracy:', top3_positive/total)
print('Time:', time.time()-t0)

Please note this script must be adapted to the folders/filenames you are using. If you need help to reproduce it, do not hesitate to contact me.

Outcome of the benchmark

  • Accuracy and Top3-Accuracy
  • Computational Time
+-------+---------------------+-------------------------------+
| Model | Feature extraction  | Benchmark time (529 images    |
|       | (images / second)   | against 50,000 images)        |
+-------+---------------------+-------------------------------+
| Small | 47.80               | 22s                           |
| Base  | 39.87               | 24s                           |
| Large | 21.83               | 35s                           |
| Giant | 6.76                | 93s                           |
+-------+---------------------+-------------------------------+
  • Other statistics
+-------+--------------------+----------------+
| Model | Size of index (MB) | GPU usage (MB) |
+-------+--------------------+----------------+
| Small | 74                 | 1322           |
| Base  | 147                | 1594           |
| Large | 196                | 2398           |
| Giant | 293                | 5722           |
+-------+--------------------+----------------+
  • Examining the results

The following examples showcase the results obtained by the different models:

1. All models correctly identifying the image

2. Mixed performance among models

3. All models failing to find the correct image

A comment on the last example: the input image has undergone several transformations compared to the ground truth, including rotation, grayscale conversion, and masking, so it is no surprise that none of the models retrieved the expected image.

Analysis

DINOv2 performs very well on image similarity tasks, even on challenging datasets such as DISC21. The bigger the model, the better the performance, although this comes with a significant trade-off in computational time and GPU consumption. Moreover, by leveraging FAISS, searches are extremely fast.
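If you want to make sure the search itself runs on the GPU, FAISS lets you copy a flat index to a CUDA device. Here is a minimal sketch, assuming the faiss-gpu package and the vector.index file built earlier (the random query stands in for the normalized features of a real input image):

import faiss
import numpy as np

index = faiss.read_index("vector.index")
res = faiss.StandardGpuResources()
#Copy the flat index onto GPU 0 so that searches are executed there
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)

query = np.float32(np.random.rand(1, 384))
faiss.normalize_L2(query)
d, i = gpu_index.search(query, 3)
print('distances:', d, 'indexes:', i)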

Limitations

While this benchmark provides valuable insights, it’s important to acknowledge its limitations. Due to its enormous size, we could not assess the performance of DINOv2 on the whole DISC21 dataset.

Conclusion

In this evaluation, we delved into the performance of DINOv2 on image similarity tasks, employing challenging datasets like DISC21. While the size of the dataset presented some limitations, our benchmark provided valuable insights into DINOv2’s capabilities and the varying performance levels of different models.

It became evident that larger models yield better results, albeit at the cost of increased computational demands. Leveraging FAISS further demonstrated the impressive speed at which searches can be conducted. DINOv2 surely qualifies as a potent model for image similarity tasks.

Stay tuned for our upcoming article, where we’ll compare the performance of OpenAI Clip with DINOv2, providing further insights into the capabilities of these powerful models.
