CLIP vs DINOv2 in image similarity

In the world of artificial intelligence, two giants stand tall in the realm of computer vision: CLIP and DINOv2. I’ve previously written stories about their capabilities for image similarity tasks (link for CLIP and DINOv2). CLIP revolutionized image understanding, while DINOv2 brought a fresh approach to self-supervised learning.

In this article, we embark on a journey to uncover the strengths and subtleties that define CLIP and DINOv2. We aim to discover which of these models truly excels in the world of image similarity tasks. Let’s witness the clash of the titans and find out which model emerges victorious.

Image similarity with CLIP

Calculating the similarity between two images with CLIP is a straightforward process, achieved in just two steps: first, extract the features of both images, and then compute their cosine similarity.

For more in-depth guidance, refer to these two stories: CLIP part 1 and CLIP part 2.

To begin, ensure you have the necessary packages installed. It is advisable to set up and utilize a virtual environment:

#Start by setting up a virtual environment
virtualenv venv-similarity
source venv-similarity/bin/activate
#Install required packages
pip install transformers Pillow torch

Next, proceed with the computation of image similarity:

import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

#Extract features from image1
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    image_features1 = model.get_image_features(**inputs1)

#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    image_features2 = model.get_image_features(**inputs2)

#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0], image_features2[0]).item()
sim = (sim + 1) / 2
print('Similarity:', sim)
2 similar images

Using the provided example with two similar images, the obtained similarity score is an impressive 96.4%.

Image similarity with DINOv2

The process of computing similarity between two images with DINOv2 mirrors that of CLIP. For a deeper understanding of DINOv2, you can explore this story.

Utilizing DINOv2 requires the same set of packages as previously mentioned, without the need for any additional installations:

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Extract features from image1: mean-pool the token embeddings of the last hidden state
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    outputs1 = model(**inputs1)
    image_features1 = outputs1.last_hidden_state
    image_features1 = image_features1.mean(dim=1)

#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    outputs2 = model(**inputs2)
    image_features2 = outputs2.last_hidden_state
    image_features2 = image_features2.mean(dim=1)

#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0], image_features2[0]).item()
sim = (sim + 1) / 2
print('Similarity:', sim)

Using the identical pair of images as in the CLIP example, the similarity score obtained with DINOv2 is 93%.

Test with the COCO dataset

Before delving into an in-depth assessment of their performance, let’s compare the results yielded by CLIP and DINOv2 using images from the validation set of the COCO dataset.

The process we employ is as follows:

  1. Iterate through the dataset to extract the features of all the images.
  2. Store the embeddings in a FAISS index.
  3. Extract the features of an input image.
  4. Retrieve the top-three similar images.

For those interested in delving deeper into FAISS, refer to this informative piece. Be sure to install it first with: pip install faiss-cpu (or faiss-gpu on a CUDA-enabled machine).

Part 1: feature extraction and creation of two indexes

import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel, AutoImageProcessor, AutoModel
import faiss
import os
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

#Load CLIP model and processor
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

#Load DINOv2 model and processor
processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Retrieve all filenames
images = []
for root, dirs, files in os.walk('./val2017/'):
    for file in files:
        if file.endswith('jpg'):
            images.append(root + '/' + file)


#Define a function that normalizes embeddings and adds them to the index
def add_vector_to_index(embedding, index):
    #Convert embedding to numpy
    vector = embedding.detach().cpu().numpy()
    #Convert to float32 numpy
    vector = np.float32(vector)
    #Normalize vector: important to avoid wrong results when searching
    #(with L2-normalized vectors, L2 distance is equivalent to cosine similarity)
    faiss.normalize_L2(vector)
    #Add to index
    index.add(vector)

def extract_features_clip(image):
    with torch.no_grad():
        inputs = processor_clip(images=image, return_tensors="pt").to(device)
        image_features = model_clip.get_image_features(**inputs)
    return image_features

def extract_features_dino(image):
    with torch.no_grad():
        inputs = processor_dino(images=image, return_tensors="pt").to(device)
        outputs = model_dino(**inputs)
        image_features = outputs.last_hidden_state
    return image_features.mean(dim=1)

#Create 2 indexes.
index_clip = faiss.IndexFlatL2(512)
index_dino = faiss.IndexFlatL2(768)

#Iterate over the dataset, extract the features with both models and store them in the two indexes
for image_path in images:
    img = Image.open(image_path).convert('RGB')
    clip_features = extract_features_clip(img)
    add_vector_to_index(clip_features, index_clip)
    dino_features = extract_features_dino(img)
    add_vector_to_index(dino_features, index_dino)

#store the indexes locally
faiss.write_index(index_clip,"clip.index")
faiss.write_index(index_dino,"dino.index")

Part 2: image similarity search

import faiss
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel, AutoProcessor, CLIPModel
from PIL import Image
import os

#Input image
source='laptop.jpg'
image = Image.open(source)

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

#Load the CLIP and DINOv2 models and processors
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Extract features for CLIP
with torch.no_grad():
    inputs_clip = processor_clip(images=image, return_tensors="pt").to(device)
    image_features_clip = model_clip.get_image_features(**inputs_clip)

#Extract features for DINOv2
with torch.no_grad():
    inputs_dino = processor_dino(images=image, return_tensors="pt").to(device)
    outputs_dino = model_dino(**inputs_dino)
    image_features_dino = outputs_dino.last_hidden_state
    image_features_dino = image_features_dino.mean(dim=1)

#Normalize the embeddings the same way as when the indexes were built
def normalizeL2(embeddings):
    vector = embeddings.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    return vector

image_features_dino = normalizeL2(image_features_dino)
image_features_clip = normalizeL2(image_features_clip)

#Load the indexes
index_clip = faiss.read_index("clip.index")
index_dino = faiss.read_index("dino.index")

#Search the top 5 images and get their distances and indexes
d_dino, i_dino = index_dino.search(image_features_dino, 5)
d_clip, i_clip = index_clip.search(image_features_clip, 5)
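
The search only returns positions in the FAISS indexes. To display the retrieved images, those positions must be mapped back to file paths. Below is a minimal sketch of that step, assuming the same ./val2017/ folder and the same listing order as when the indexes were built in Part 1:

#Rebuild the list of filenames in the same order as in Part 1
images = []
for root, dirs, files in os.walk('./val2017/'):
    for file in files:
        if file.endswith('jpg'):
            images.append(root + '/' + file)

#Map the returned positions to image paths (top 5 matches for each model)
print('DINOv2 results:', [images[i] for i in i_dino[0]])
print('CLIP results:', [images[i] for i in i_clip[0]])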

Results

Using four different images as inputs, the searches produced the following outcomes:

CLIP vs DINOv2

In this small subset, it appears that DINOv2 demonstrates a slightly superior performance.

Benchmark against the DISC21 dataset

To compare their performance, we will follow the same method described in this story: https://medium.com/aimonks/image-similarity-with-dinov2-and-faiss-741744bc5804

We will also reuse the scripts above to extract features and then compute image similarity.

Dataset

To benchmark CLIP and DINOv2, we have chosen the DISC21 dataset, which was created specifically for image similarity search. Due to its substantial size of 350GB, we will be using a subset of 150,000 images.

Metrics employed

In terms of metrics, we will calculate the following (a sketch of how they can be computed is shown after the list):

  • Accuracy: the ratio of correctly predicted images to the total number of images.
  • Top-3 Accuracy: the ratio of times the correct image is found within the top three similar images to the total number of images.
  • Computational time: the time required to process the entire dataset.
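
To make these metrics concrete, here is a rough sketch of how they could be computed from FAISS search results. This is not the exact evaluation script: ground_truth (the mapping from each query image to the position of its reference image in the index), query_features (the normalized float32 query embeddings) and query_names (the query filenames) are hypothetical placeholders.

#Hypothetical evaluation helper, for illustration only
def evaluate(index, query_features, query_names, ground_truth, k=3):
    _, retrieved = index.search(query_features, k)
    top1_hits, topk_hits = 0, 0
    for name, neighbors in zip(query_names, retrieved):
        expected = ground_truth[name]
        top1_hits += int(neighbors[0] == expected)   #Accuracy: correct image ranked first
        topk_hits += int(expected in neighbors)      #Top-3 Accuracy: correct image within the top k
    n = len(query_names)
    return top1_hits / n, topk_hits / n

accuracy, top3_accuracy = evaluate(index_dino, query_features, query_names, ground_truth)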

Outcome of the benchmark

  • Feature extraction speed

CLIP: 70.7 images per second

DINOv2: 69.7 images per second

  • Accuracy and Top-3 Accuracy

  • Examining the results

  1. Both models correctly predicting the image

  2. Both models failing to find the correct image

  3. Only CLIP predicting the correct image, with DINOv2 finding it in its top 3

  4. Only DINOv2 predicting the correct image

Analysis

DINOv2 emerges as the clear frontrunner, achieving an impressive accuracy of 64% on a notably challenging dataset. By contrast, CLIP demonstrates a more modest accuracy, reaching 28.45%.

Regarding computational efficiency, both models exhibit remarkably similar feature extraction times. This parity places neither model at a distinct advantage in this regard.

Limitations

While this benchmark offers valuable insights, it’s crucial to recognize its limitations. The evaluation was conducted on a subset of 1,448 images, compared against a pool of 150,000 images. Given the entire dataset’s size of 2.1 million images, this narrowed focus was necessary to conserve resources.

It’s worth noting that Meta AI uses the DISC21 dataset as a benchmark for its model, which potentially gives DINOv2 an advantage. However, our tests on the COCO dataset revealed intriguing nuances: DINOv2 displays a heightened ability to identify the primary elements in an image, whereas CLIP is adept at focusing on specific details within an input image (as exemplified by the image of the bus).

Lastly, it’s essential to consider the difference in embedding dimensions between CLIP and DINOv2. CLIP utilizes an embedding dimension of 512, whereas DINOv2 operates with 768. While an alternative could be to employ the larger CLIP model with a matching embedding dimension, it’s worth noting that this comes at the cost of speed. A quick test on a small subset showed a slight performance boost but without achieving the level of performance demonstrated by DINOv2.
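
As a quick sanity check, both embedding sizes can be read directly from the configurations of the checkpoints used throughout this article:

from transformers import AutoModel, CLIPModel

model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
model_dino = AutoModel.from_pretrained("facebook/dinov2-base")

print(model_clip.config.projection_dim)  #512: dimension of the CLIP image embeddings
print(model_dino.config.hidden_size)     #768: dimension of the DINOv2 embeddings

The larger CLIP model referred to above is presumably openai/clip-vit-large-patch14, whose image embeddings have 768 dimensions.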

Conclusion

DINOv2 demonstrates superior accuracy in image similarity tasks, showcasing its potential for practical applications. CLIP, while commendable, falls short in comparison. It’s worth noting that CLIP can be particularly useful in scenarios that demand a focus on small details. Both models exhibit similar computational efficiency, making the choice task-specific.

PS: Your applause serves as a huge motivation for me to continue crafting insightful content. 👇
