Unlocking OpenAI CLIP. Part 2: image similarity

Jeremy K
7 min read · Jul 11, 2023


This article delves into the concept of image similarity using OpenAI CLIP. By employing the methodologies discussed in the previous article, readers can master image similarity in just five minutes.

Use cases

The computation of image similarity serves various purposes, including:

  • Prohibiting the upload of unauthorised content
  • Image retrieval: searching a large collection of images using an image as the input
  • Object recognition

Traditional approaches to computing image similarity and their limitations

Historically, a common and uncomplicated approach to determining whether two images are identical has been to select a hash function, calculate the hash value of each image, and compare the results. While this method is fast, it suffers from several drawbacks.
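As a reference point, a minimal sketch of this whole-file comparison might look like the following (the file names are placeholders used only for illustration):

import hashlib

def file_md5(path):
    # Hash the raw bytes of the file, metadata included
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# "car.jpeg" and "car_copy.jpeg" are hypothetical file names
print(file_md5("car.jpeg") == file_md5("car_copy.jpeg"))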

To illustrate this, let’s consider an example using an image of a car and attempt to find similar images.

The original image

The image in question is a 512x512 pixel representation of a car. Its checksum value (MD5) is 262631d0f8c909812addc88afeb56f98.

Finding an exact copy of this image on the internet, computing its hash value, and comparing it to the original may indeed yield a match, but there are numerous reasons why this approach fails. Let's examine them in detail.

Collision

A collision occurs when two hashed values are identical despite having different input values. Although collisions are rare, they can occur. In the example below, both images possess the same md5sum value, thus proving the existence of collisions.

2 images with the same MD5 hash. Source here

Modified metadata

If any metadata within a file is edited, added, or removed, the checksum value of the file changes accordingly.

For instance, using exiftool, let's add a "rights" attribute to the file "car_meta.jpeg":

exiftool -rights="Copyright" car_meta.jpeg

The metadata now includes a rights attribute:

...
Rights : Copyright
Image Width : 512
Image Height : 512
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
Image Size : 512x512
Megapixels : 0.262

Consequently, the md5sum values change as follows:

Original image: 262631d0f8c909812addc88afeb56f98
Image with new metadata: fd03512b962564c075a5f22d75784dc1

In conclusion, identical images with different metadata yield distinct hash values. To overcome this limitation, one can compute the md5sum solely based on the file’s content, excluding metadata.

Code snippet to compute the md5sum of the image content:

from PIL import Image
import hashlib

# Hash the pixel data only, ignoring file metadata
md5hash1 = hashlib.md5(Image.open('img1.png').tobytes())
md5hash2 = hashlib.md5(Image.open('img2.png').tobytes())
print(md5hash1.hexdigest(), md5hash2.hexdigest())

One-pixel change

Another drawback of hash functions is their sensitivity to minute changes in image data. Even a single pixel alteration can cause a significant variation in the hash value. Humans may not perceive the difference, particularly if the change is limited to a one-unit shift in the RGB color values.

2 similar images but with a difference of one pixel

For instance, let's consider these two images. Upon close inspection, the only difference between them is a white pixel in the top-left corner. However, their hash values differ significantly:

Original image: 262631d0f8c909812addc88afeb56f98
Image with white pixel: ef201086c7a26dfbfb1ab26c34c010dc
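If you want to reproduce this behaviour, a minimal sketch that flips a single pixel and hashes the pixel data might look like this (the file name is an assumption based on the example above):

from PIL import Image
import hashlib

img = Image.open("car.jpeg").convert("RGB")
modified = img.copy()
modified.putpixel((0, 0), (255, 255, 255))  # white pixel in the top-left corner

# Hash the pixel data of both versions; the digests will differ
print(hashlib.md5(img.tobytes()).hexdigest())
print(hashlib.md5(modified.tobytes()).hexdigest())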

Different sizes

Lastly, let’s explore the impact of resizing an image. When the original 512x512 pixel image is resized to 256x256 pixels, the resulting hash value differs:

Left image: 512x512px. Right image: 256x256px

As expected, the MD5 values of the two images differ:

Original image: 262631d0f8c909812addc88afeb56f98
256x256px image: 56422706e85bf0f3d639c52124bdb753
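To reproduce this, the resized copy can be generated with PIL along these lines (a sketch, assuming the file names used above):

from PIL import Image
import hashlib

original = Image.open("car.jpeg")
resized = original.resize((256, 256))
resized.save("car256.jpeg")

# Whether you hash the files or the pixel data, the values no longer match
print(hashlib.md5(original.tobytes()).hexdigest())
print(hashlib.md5(resized.tobytes()).hexdigest())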

In conclusion, hash functions are useful but insufficient when it comes to determining image similarity. Hence, it becomes imperative to analyze the content of the image itself to make accurate assessments.

Image similarity with CLIP

The process of image similarity using OpenAI CLIP is straightforward:

  1. Compute the embeddings of two images.
  2. Calculate the cosine similarity between the embeddings.
  3. If the resulting score is sufficiently high (close to 1), the images are deemed similar.

Hands-on Exercise: Compute image similarity between cars of different resolutions (512x512 vs. 256x256).

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image1 = "car.jpeg"
image2 = "car256.jpeg"

cos = torch.nn.CosineSimilarity(dim=0)

# Compute the embedding of each image (no gradients needed at inference time)
with torch.no_grad():
    image1_preprocess = preprocess(Image.open(image1)).unsqueeze(0).to(device)
    image1_features = model.encode_image(image1_preprocess)

    image2_preprocess = preprocess(Image.open(image2)).unsqueeze(0).to(device)
    image2_features = model.encode_image(image2_preprocess)

# Cosine similarity, rescaled from [-1, 1] to [0, 1]
similarity = cos(image1_features[0], image2_features[0]).item()
similarity = (similarity + 1) / 2
print("Image similarity", similarity)

Output: Image similarity 0.971923828125

As evident from the results, this method surpasses the limitations of hash functions.

Similarly high scores are obtained with the earlier examples: different metadata (cosine similarity close to 1) and the one-pixel change (cosine similarity of 0.978).

Image retrieval

Another use case for image similarity is image retrieval, which involves identifying and retrieving similar images based on an input image. In the previous article, we explored the process of searching a vast volume of images using prompts. With image similarity, the input is an image itself.

For this experiment, we will employ the COCO validation dataset, consisting of 5,000 images, to find images of teddy bears. Using the prompt “a teddy bear,” the top three images retrieved are as follows:

top 3 images when searching with the prompt “a teddy bear”
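As a refresher from part 1, the prompt-based score for a single image can be sketched as follows; looping over the whole dataset works exactly like the image-based snippet shown further below (the file name "example.jpg" is a placeholder):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode the text prompt
text = clip.tokenize(["a teddy bear"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)

    # Encode one image of the dataset
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)

# Cosine similarity between the prompt and the image embeddings
cos = torch.nn.CosineSimilarity(dim=0)
print(cos(image_features[0], text_features[0]).item())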

While the results above are satisfactory, the first one does not depict an actual teddy bear. To refine the search, we will employ the following image as the query:

Input image to search teddy bears in the dataset

The search results corresponding to this input image are as follows:

Top 3 results using an image as input

To perform the search, you can utilize the following code snippet:

import torch
import clip
from PIL import Image
import os
import itertools

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

print(device)

dataset_folder = './val2017/'

# Collect all jpg/jpeg files from the dataset folder
images = []
for root, dirs, files in os.walk(dataset_folder):
    for file in files:
        if file.endswith(('jpg', 'jpeg')):
            images.append(root + '/' + file)

# Embedding of the input image
input_image = preprocess(Image.open("teddy.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    input_image_features = model.encode_image(input_image)

# Compare each dataset image with the input image
cos = torch.nn.CosineSimilarity(dim=0)
result = {}
for img in images:
    with torch.no_grad():
        image_preprocess = preprocess(Image.open(img)).unsqueeze(0).to(device)
        image_features = model.encode_image(image_preprocess)
        sim = cos(image_features[0], input_image_features[0]).item()
        sim = (sim + 1) / 2
        result[img] = sim

# Sort by similarity score and keep the three best matches
sorted_value = sorted(result.items(), key=lambda x: x[1], reverse=True)
sorted_res = dict(sorted_value)

top_3 = dict(itertools.islice(sorted_res.items(), 3))

print(top_3)

For those skeptical of these capabilities, I challenge you to come up with prompts that accurately describe the following flags (which do exist, should you have any doubts). With image retrieval, the search becomes much easier.

3 examples that cannot be easily described using prompts

Caveats

Image similarity is not a panacea; it may fail to yield the desired results for various reasons. For instance, if a cropped image is used to retrieve the original image, the original might not appear within the top three results.

To illustrate this limitation, let’s employ a cropped image to search our dataset:

Original image on the left. Cropped image on the right.

Top-3 results:

Top 3 results with the cropped image

The results are quite different from our expectations. The key takeaways from this example are as follows:

  • When computing the image embedding, CLIP attempts to make sense of the cropped image, resulting in an embedding that represents a person, some letters, and a blue background.
  • Based on this embedding, the cosine similarity of the search images is calculated.
  • The top three results align with what CLIP identified in the image:
    - Image 1: A person
    - Image 2: A person
    - Image 3: A blue background

However, by cropping another section of the image, we can achieve better results:

Original image and cropped image

Top-3 results:

Top 3 results

Conclusion

OpenAI CLIP overcomes the limitations of traditional hash functions and offers an effective approach to determine image similarity. Nonetheless, it is crucial to acknowledge its limitations and understand the contexts in which it may fall short.

That’s it for today. There is much more to learn about CLIP, so stay tuned.

#AI #CLIP #computervision #imagesimilarity

Link to part 1: https://medium.com/@jeremy-k/unlocking-openai-clip-part-1-intro-to-zero-shot-classification-f81194f4dff7

Link to part 3: https://medium.com/@jeremy-k/unlocking-openai-clip-part-3-optimizing-image-embedding-storage-and-retrieval-pickle-vs-faiss-25d0f02c049d
