
Diving Into CLIP by Creating Semantic Image Search Engines

CLIP (Contrastive Language-Image Pre-training) is a foundational Deep Learning model by OpenAI that connects images with their natural language descriptions. We explore CLIP’s capabilities by building Text-Image, Image-Image, and Image-Text search engines.

Ahmad Anis · Published in Red Buffer · 12 min read · Feb 20, 2023

Introduction

Semantic Image Search is a highly useful technique for finding the images most relevant to a given text description. Traditionally, this was mostly done with hacks such as attaching metadata to each image and using text matching against that metadata. While it worked, it wasn’t intelligent.

Consider this example:
You want to search for the phrase “A dog on a foggy day sitting on the beach”.

Now, do you think Google has an image matching this exact description in its index? Almost certainly yes, but if you search for it on Google Images, the top results do not return an image that exactly matches the description. They are quite related, so there is some semantic understanding at play (credit to Google), but the results miss the “foggy day” part.

Google Images search result.

But if you search for the exact same phrase on You.com, the world’s first search engine powered by GPT-3, the first result matches the exact description.

You.com search results.

While this isn’t a criticism of Google’s Image Search algorithm, it shows that understanding the semantics of the query text is essential for image search and cannot be achieved with traditional search approaches alone.

Why CLIP for Image Search

Before diving into CLIP, you may have wondered: why not use traditional Deep Learning methods such as CNNs? If you think about a simple image classification task (cats vs. dogs, or any other problem), you are still, in a sense, trying to connect a piece of text (the label “cat” or “dog”) to an image. Framing classification as connecting text to images is an interesting way to think about it, and once you do, a much wider range of possible solutions opens up.

While traditional deep learning systems for these kinds of problems (connecting text and images) have revolutionized the world of Computer Vision, there are some key problems that we all face.

  • It is very labor-intensive to label the big datasets required to train a state-of-the-art model with supervised learning.
  • Strictly supervised learning restricts the model to a single task; such models are not good at multiple tasks.
  • The reasons they are not good at multiple tasks are:
    1) Labeled datasets are very costly, so it is difficult to obtain labels for multiple tasks at the scale a deep learning model needs.
    2) Since the training is strictly supervised, the model learns a narrow set of visual concepts; standard vision models are good at one task and one task only. A good example is ResNet-101, a strong deep learning model: while it performs really well on ImageNet, as soon as the task shifts slightly, for example to the sketches in ImageNet Sketch, its performance drops sharply.
ResNet-101 on ImageNet vs. ImageNet Sketch.

This makes me think that we should stop framing these problems purely as classification tasks and start thinking of them as connecting text to images.

To avoid the problems mentioned above, researchers have proposed multiple solutions, leveraging self-supervised learning (which removes the need for manual labeling) as well as multimodal learning, where both text and image are fed to the model as inputs instead of using the image as input and the text as the label for a supervised loss.

Self-supervised learning sits at the top of the list of the most important ideas in Deep Learning according to Yann LeCun, a pioneer of CNNs.

CLIP is one of the most notable and impactful works done in multimodal learning.

Multimodal learning attempts to model the combination of different modalities of data, often arising in real-world applications. An example of multi-modal data is data that combines text (typically represented as discrete word count vectors) with imaging data consisting of pixel intensities and annotation tags. As these modalities have fundamentally different statistical properties, combining them is non-trivial, which is why specialized modeling strategies and algorithms are required. (Definition taken from Wikipedia)

In simple words, multimodal deep learning is a field of artificial intelligence that focuses on developing algorithms and models that can process and understand multiple types of data, such as text, images, and audio, unlike traditional models that deal with only a single type of data.

The goal of multimodal deep learning is to enable machines to process information from different sources simultaneously and generate a better understanding of the overall context. For example, when processing an image, a multimodal deep learning model such as CLIP may also take into account the text description (or prompt) that accompanies it, which can help to clarify the content of the image.

Multimodal deep learning is like teaching a robot to understand different things at the same time. Just like how we can see a picture and read a description to understand what’s happening in the picture, a robot can also do the same thing.

For example, let’s say the robot sees a picture of a dog, but it doesn’t know what kind of dog it is. Multimodal deep learning can help the robot understand what kind of dog it is by also reading a description of the dog, like “This is a Golden Retriever”. By looking at the picture and reading the description, the robot can learn what a Golden Retriever looks like, and use that information to recognize other Golden Retrievers in the future.

A Simple Multimodal Architecture. Image Source

The way CLIP is designed is very simple yet very effective. It uses contrastive learning, one of the main techniques for measuring similarity; originally, it was used to measure the similarity between images.

Contrastive learning for image similarity.

This technique was widely used for image comparison and similarity, but CLIP took a different route: it uses both text and images in a self-supervised fashion, since there is a ton of image-text data online that does not need manual labeling, and it turned out to perform amazingly well at connecting text to images.

CLIP by OpenAI

Pre-training in CLIP involves a text encoder and an image encoder that compute semantic embeddings for text and images. The text encoder is based on Vaswani et al.’s Transformer architecture. The image encoder is a Vision Transformer (ViT); CLIP also ships ResNet-based image encoders.

The goal of pre-training is to maximize the similarity between the embeddings of images and their correct text descriptions. This is done by passing the image and text embeddings through a Contrastive Function, which in the case of CLIP is Cosine Similarity. The Contrastive Function measures the similarity between the two embeddings and adjusts the model’s parameters to increase the similarity between matching image and text embeddings.
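
To make the objective concrete, here is a minimal PyTorch sketch of a CLIP-style contrastive step, similar in spirit to the pseudocode in the CLIP paper (the function name, the temperature value, and the variable names are illustrative placeholders, not CLIP’s actual training code):

import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, temperature=0.07):
    # Normalize so the dot product equals cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N matrix of cosine similarities, scaled by a temperature
    logits = image_features @ text_features.T / temperature

    # The i-th image matches the i-th text, so the targets are the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image
    # and the right image for each text, then average the two losses
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2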

In the inference phase, CLIP takes an image and finds the text description that best matches it. This is done by passing the candidate texts through the text encoder and the image through the image encoder. Both sets of embeddings are then compared with the same contrastive function (cosine similarity) used in pre-training, and the text whose embedding has the highest similarity to the image embedding is selected as the best match. Hence it is an image-text search: it searches for the best available text description for a given image.

CLIP was trained with a batch size of ~32,000 on a training set of 400M image-text pairs. Using such a large batch size has some real benefits.

Firstly, a larger batch size allows for more efficient use of hardware resources. When processing a large batch, parallelization is more effective, allowing the model to make more efficient use of multiple GPUs. This can lead to faster training times and better utilization of computing resources.

Secondly, with a larger batch size, the model is exposed to more diverse and varied examples during each iteration, allowing it to better generalize across the entire dataset. This means that the model can learn to recognize and encode more complex relationships between images and text, rather than memorizing specific image-text pairs.

Lastly, a larger batch size can also help mitigate the effects of noisy or outlier data. By exposing the model to a greater number of examples, it is less likely to be swayed by individual noisy examples that may not be representative of the overall distribution.

This huge batch size and pre-training have made it a really good zero-shot learner. This means that it can classify new things without even being trained on them since it has already seen enough images and learned their descriptions.

If you have ever used Stable Diffusion, you must have heard of prompt engineering: the more precise the prompt, the better the output image. This is because Stable Diffusion uses CLIP under the hood, and CLIP connects the prompt (text) to the generated image (output).
The same applies when you use CLIP directly: the better defined the prompt, the more likely CLIP is to match the image correctly, so well-crafted prompts are key to getting maximum accuracy.
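
As a small illustration (the label list below is made up for the example, not taken from this article’s dataset), the CLIP authors wrap class names in descriptive templates such as “a photo of a {label}” rather than feeding bare labels:

labels = ["golden retriever", "tiger", "red sports car"]  # illustrative labels
prompts = [f"a photo of a {label}" for label in labels]
# These templated prompts are what you would pass to clip.tokenize(prompts)
# instead of the bare labels; the added context usually improves accuracy.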

Image-Text Search Engine using CLIP (Default Image Classification)

By default, CLIP is designed to predict the best possible prompt that matches the input image. This is a standard classification task; the only difference is that in place of labels, we have prompts/descriptions for the input image.

Installation:

You can install CLIP directly from the OpenAI CLIP repo (ftfy, regex, and tqdm are its dependencies).

$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

You can confirm the installation by importing it.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
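
If you are curious which pre-trained checkpoints the package ships with (both ResNet and ViT variants), clip.available_models() lists them:

print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']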

Let’s take this input image as an example.

Source: Pixabay

To perform a Classification task, we need to have a few descriptions with one of them being the closest to the given image.

descriptions = [
    "A Picture of a Tiger and Girl on Rocks",
    "A picture of Donkey and a Man",
    "A picture of a red car",
    "A picture of a Sparrow and Butterfly",
    "A picture of Animal and Human",
]

If you notice, out of these 5 descriptions there are 2 that we can relate to this image. Let’s see how CLIP ranks them.

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # load the ViT-B/32 image encoder and text encoder

image = preprocess(all_images[0]).unsqueeze(0).to(device)  # preprocess the example image (a PIL image)
descriptions = [
    "A Picture of a Tiger and Girl on Rocks",
    "A picture of Donkey and a Man",
    "A picture of a red car",
    "A picture of a Sparrow and Butterfly",
    "A picture of Animal and Human",
]
text = clip.tokenize(descriptions).to(device)  # tokenize the descriptions with CLIP's tokenizer

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)  # pass both the image and the texts to the model
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()  # turn the logits into probabilities

results = dict(zip(descriptions, map(lambda x: x * 100, probs[0])))
results = {k: v for k, v in sorted(results.items(), key=lambda item: item[1], reverse=True)}  # sort by similarity
for description, percentage in results.items():
    print(f"Description: {description}, Similarity: {percentage}")
Results.

BOOM! You can see CLIP has correctly identified the picture of the Tiger and Girl on Rocks with a similarity of 99.99%. That’s mind-blowing. Even the second-ranked description is very accurate (it is indeed a picture of an animal and a human). The red car description comes after all the animal-related ones, which shows that CLIP has a semantic understanding of animals and cars.

Text-Image Search Engine using CLIP

This is the default kind of search that we are all used to: we type a query and get results based on it. Even though CLIP is not designed for this, it turns out that the embeddings it learns during pre-training capture semantic information so well that we can use simple mathematical operations to build it ourselves.

CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings.
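
To make the numbers concrete: with a batch of N = 4 image-text pairs, the model scores all 4 × 4 = 16 image-text combinations; the 4 diagonal entries are the real pairs whose similarity is pushed up, and the remaining N² − N = 12 off-diagonal combinations are the incorrect pairings whose similarity is pushed down.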

Problem Formulation

Consider that we have a database of images, and we want to search within them, as shown at the start of the article. We are going to have a single text query, and we will match the similarity (cosine) between the embeddings of all the images and the text query.

Step 1: Database of Images

If you want to use this in production, I’d highly recommend indexing your images/embeddings in a good vector database such as Qdrant or Elasticsearch for faster results; that is out of the scope of this article. Here we will store our embeddings in memory and focus on CLIP’s part.
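
If you want a middle ground between a full vector database and re-encoding every image on each run, you could simply persist the embeddings to disk once and reload them at query time. A minimal sketch (the file name is just a placeholder, and image_embeddings is the tensor we compute in Step 2 below):

import torch

# After computing image_embeddings once (see Step 2), cache them on disk
torch.save(image_embeddings, "clip_image_embeddings.pt")

# ...later, load them back instead of re-encoding every image
image_embeddings = torch.load("clip_image_embeddings.pt")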

We are going to use the free Pixabay API to get images.

import requests
from io import BytesIO
from PIL import Image

pixabay_api_key = "YOUR_PIXABAY_API_KEY"  # your own (free) Pixabay API key
words_to_search = ["Giraffe", "Tiger", "Fruits"]
original_api = "https://pixabay.com/api/?key="
no_to_retrieve = 5

all_images = []  # our in-memory image "database"
for pixabay_search_keyword in words_to_search:
    pixabay_api = original_api + pixabay_api_key + "&q=" + pixabay_search_keyword.lower() + "&image_type=photo&safesearch=true&per_page=" + str(no_to_retrieve)
    response = requests.get(pixabay_api)
    output = response.json()

    for each in output["hits"]:
        imageurl = each["webformatURL"]
        response = requests.get(imageurl)
        image = Image.open(BytesIO(response.content)).convert("RGB")
        all_images.append(image)

You can see the images using ipyplot in Python.

! pip install ipyplot

import ipyplot
ipyplot.plot_images(all_images, max_images=50, img_width=150, force_b64=True)  # plot a quick grid of the downloaded images
Images Visualization

Step 2: Calculate the Embeddings for Each Image

We can simply use the image encoder (model.encode_image) from CLIP to calculate the embeddings for each image. We batch all the images together so they are encoded at once instead of one at a time.

input_image = [preprocess(im) for im in all_images]  # preprocess each image

with torch.no_grad():
    image_embeddings = model.encode_image(torch.stack(input_image))  # torch.stack builds one batch so we can leverage batch processing to speed up the calculation

Step 3: Calculate the Similarity Between the Input Text and All Images in Our Database

We now have to calculate the similarity between the input text and all the images. The first step is to compute the embedding of the input text, which we can do with CLIP’s text encoder.

query = "A photo of Apple"  # input query

query_tokens = clip.tokenize([query])  # tokenize the query before embedding it

with torch.no_grad():
    query_embeddings = model.encode_text(query_tokens)

Now we can calculate the similarity between the embeddings of the images in our database and the query embedding. We compute cosine similarity in PyTorch, which takes only a few lines of code.

def calculate_similarity(query_embeddings, input_embeddings):
    # Normalize both sides so the dot product is a true cosine similarity
    query_embeddings = query_embeddings / query_embeddings.norm(dim=-1, keepdim=True)
    input_embeddings = input_embeddings / input_embeddings.norm(dim=-1, keepdim=True)
    similarities = query_embeddings @ input_embeddings.T
    return similarities

Now that we have our image_embeddings and query_embeddings, we can simply call our function to see the results.

from IPython.display import display  # renders PIL images in a notebook

sim = calculate_similarity(query_embeddings, image_embeddings)

sim_dict = dict(zip(range(len(sim[0])), sim[0]))  # map image index -> similarity score
sorted_sim = sorted(sim_dict.items(), key=lambda x: x[1], reverse=True)  # sort by similarity, highest first
top_sim = sorted_sim[:3]  # keep the top 3 results

for i in top_sim:
    display(all_images[i[0]])
Top 3 results (left to right) for the query “A photo of Apple”.

Notice how the next 2 results are also photos of fruits from our database instead of tigers or giraffes. This shows that the learned embeddings carry a semantic understanding of the data, and we can run our own vector operations on them.
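
As a side note, the dictionary-based sorting above works fine, but torch.topk gives you the same top-k indices a little more directly; a small sketch using the sim tensor from the previous step:

values, indices = torch.topk(sim[0], k=3)  # top 3 similarity scores and their positions
for idx in indices:
    display(all_images[idx.item()])  # .item() turns the tensor index into a plain int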

Image-Image Search Engine using CLIP

This is another very interesting area where CLIP can be used. The very first issue on CLIP’s GitHub repo explored exactly this and showed that CLIP performs better than many other open-source image-image search options.

It turns out this is very similar to Text-Image search: all you have to do is replace the input text with an input image and compute the similarity between the query image embedding and the image embeddings in the database.

We are using an image (query) that has both a giraffe and a tiger as we have both in our database. Let’s see how the model performs in this case.

Input Image: Link
import cv2
from PIL import Image

query_image = cv2.imread('/content/download.jpg')
query_image = cv2.cvtColor(query_image, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR; convert to RGB before handing it to PIL/CLIP

query_image = preprocess(Image.fromarray(query_image)).unsqueeze(0).to(device)

with torch.no_grad():
    query_embeddings = model.encode_image(query_image)

sim = calculate_similarity(query_embeddings, image_embeddings)  # image_embeddings is our image database

And the search results are:

Image-Image Search Results (left to right).

We can see the top 6 results here: 3 for the giraffe and 3 for the tiger. It is really interesting to see how CLIP captured the semantics of both animals and found similar images in the database.

Learning Outcomes

In this article, we briefly discussed how CLIP works and what makes it so good. We also saw how you can perform different types of semantic search with CLIP across multiple modalities, accurately and easily. You can find the complete (modular and modifiable) code in my GitHub repo.

Like the article? Let me know your thoughts in the comments section!


Ahmad Anis
Red Buffer

Deep Learning at Roll.ai, Researcher at Data Providence Initiative, Community Lead at Cohere for AI