A dead-simple image search engine for Bangla using CLIP (Contrastive Language–Image Pre-training)

Zabir Al Nazi Nabil
5 min read · Jun 27, 2023

In this article, we will build a very simple image search engine for the Bangla language. The system will take a text query and, given an image database, return the k most relevant images.

Image search with Bangla CLIP

Let’s break down the system into the following steps:

  1. We will collect multiple image-text pair datasets for Bangla.
  2. We will train a CLIP model to learn a joint representation of Bangla text and images. We will use an EfficientNet / ResNet image encoder and a BERT text encoder that was already pre-trained on Bangla text with masked language modeling (MLM).
  3. While performing a search with a text query, we will calculate the text embedding vector for the query.
  4. We will also calculate all the image embedding vectors from the image database.
  5. A cosine similarity score will be calculated between the text query vector and the image embedding vector.
  6. We will return the images with the top-k similarity scores (a short sketch of steps 3–6 follows this list).
  7. We can use vector indexing to speed up the search process.
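To make steps 3–6 concrete, here is a minimal sketch of the query-time search, assuming the trained Bangla CLIP model (defined later in this article), a Hugging Face tokenizer, and a tensor of preprocessed images are already available. The function and argument names here are illustrative placeholders, not code from the original project.

import torch
import torch.nn.functional as F


@torch.no_grad()
def search(query, model, tokenizer, image_tensors, k=5, device="cpu"):
    """Return the indices of the k images most relevant to a Bangla text query."""
    model.eval()

    # Step 3: embed the text query.
    tokens = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    text_vec = model.get_text_embeddings(
        tokens["input_ids"].to(device), tokens["attention_mask"].to(device)
    )

    # Step 4: embed every image in the database
    # (in practice these vectors are precomputed once and cached).
    image_vecs = model.get_image_embeddings(image_tensors.to(device))

    # Step 5: cosine similarity between the query vector and each image vector.
    scores = F.cosine_similarity(text_vec, image_vecs)

    # Step 6: return the indices of the top-k most similar images.
    return torch.topk(scores, k=min(k, scores.numel())).indices

For a large image database, the embeddings from step 4 would be computed once and stored in a vector index (step 7), for example with a library such as FAISS, so each query only needs a single text-encoder forward pass plus a nearest-neighbour lookup.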

CLIP, or Contrastive Language-Image Pre-training, is a framework that learns a joint representation of images and text by training a model on a large dataset containing paired image-text samples. It aims to bridge the gap between visual and textual information, allowing the model to understand and generate text descriptions of images and perform image-based searches using text queries.

The CLIP model consists of three main components:

  1. Image encoders (pre-trained),
  2. Text encoders (pre-trained), and
  3. Projection head.

Image search engine for Bangla with CLIP

Image Encoder:

CLIP uses powerful image encoders to process and extract meaningful representations from images. In this tutorial, we'll use EfficientNet and ResNet as the image encoders. These architectures have been widely adopted in computer vision tasks and have proven effective at capturing visual features. The image encoder transforms the input image into an embedding that encodes its visual content, so that similar images lie close to each other in the embedding space.
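As a quick illustration (not the article's exact code), this is roughly how an EfficientNet-B2 backbone from torchvision can be turned into a feature extractor; the pooled 1408-dimensional output is what the projection head described below consumes:

import torch
from torchvision import models

# Load an ImageNet-pretrained EfficientNet-B2 and drop its classification head,
# so a forward pass returns pooled convolutional features instead of class logits.
encoder = models.efficientnet_b2(weights="DEFAULT")
encoder.classifier = torch.nn.Identity()
encoder.eval()

with torch.no_grad():
    dummy_batch = torch.randn(1, 3, 288, 288)  # placeholder image batch
    features = encoder(dummy_batch)

print(features.shape)  # torch.Size([1, 1408])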

Text Encoder:

For the text encoder, we employ BERT (Bidirectional Encoder Representations from Transformers). BERT is a state-of-the-art model that captures contextual information from text using a transformer architecture. It is pre-trained on a large corpus of text and can generate high-quality text embeddings. Here, we use a BERT model that was already pre-trained on Bangla text. By using BERT as the text encoder, CLIP can learn to associate relevant textual information with images.
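For illustration, a publicly available Bangla BERT checkpoint can be loaded from the Hugging Face Hub roughly as follows; the checkpoint name sagorsarker/bangla-bert-base is just one example of an MLM-pretrained Bangla BERT, not necessarily the one used here. The first ([CLS]) token of the last hidden state is what the CLIP model below uses as the sentence representation:

import torch
from transformers import AutoModel, AutoTokenizer

# Example checkpoint (assumption): any BERT pre-trained on Bangla text with MLM would do.
checkpoint = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
text_encoder = AutoModel.from_pretrained(checkpoint)

tokens = tokenizer("একটি লাল ফুলের ছবি", return_tensors="pt")  # "a picture of a red flower"
with torch.no_grad():
    outputs = text_encoder(**tokens)

# Use the [CLS] token embedding as the sentence-level representation.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])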

Projection Head:

The projection head is a crucial component of the CLIP model that maps the high-dimensional image and text embeddings into a shared lower-dimensional space. It is responsible for transforming the learned representations into a more compact and discriminative form, which facilitates downstream tasks such as image classification and text-based image retrieval.

import torch.nn as nn
from torchvision import models
from transformers import AutoModel

# CFG is assumed to be a small config object holding hyper-parameters:
# CFG.image_embedding (1408 for EfficientNet-B2 once the classifier is removed)
# and CFG.text_encoder_model (the name of a Bangla BERT checkpoint).

class CLIPModel(nn.Module):
    """CLIP model for Bangla"""

    def __init__(self):
        super(CLIPModel, self).__init__()
        # Image encoder: ImageNet-pretrained EfficientNet-B2.
        self.image_encoder = models.efficientnet_b2(weights="EfficientNet_B2_Weights.DEFAULT")
        # Drop the classification head so the encoder returns pooled features
        # (torchvision's EfficientNet uses .classifier; a ResNet would use .fc).
        self.image_encoder.classifier = nn.Identity()

        # Image projection head: CFG.image_embedding -> 256.
        self.image_out = nn.Sequential(
            nn.Linear(CFG.image_embedding, 256), nn.ReLU(), nn.Linear(256, 256)
        )

        # Text encoder: BERT pre-trained on Bangla text.
        self.text_encoder = AutoModel.from_pretrained(CFG.text_encoder_model)
        # Index of the [CLS] token, used as the sentence representation.
        self.target_token_idx = 0

        # Text projection head: 768 -> 256.
        self.text_out = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 256)
        )

    def forward(self, image, text, mask):
        image_vec = self.image_encoder(image)
        image_vec = self.image_out(image_vec)

        text_out = self.text_encoder(text, mask)
        last_hidden_states = text_out.last_hidden_state

        # Keep only the [CLS] token and project it into the joint space.
        last_hidden_states = last_hidden_states[:, self.target_token_idx, :]
        text_vec = self.text_out(last_hidden_states.view(-1, 768))

        return image_vec, text_vec

    def get_image_embeddings(self, image):
        image_vec = self.image_encoder(image)
        image_vec = self.image_out(image_vec)

        return image_vec

    def get_text_embeddings(self, text, mask):
        text_out = self.text_encoder(text, mask)
        last_hidden_states = text_out.last_hidden_state

        last_hidden_states = last_hidden_states[:, self.target_token_idx, :]
        text_vec = self.text_out(last_hidden_states.view(-1, 768))

        return text_vec

In CLIP, the projection head is applied to both the image and text encodings separately. Let’s explore how it works for each modality:

1. Image Projection Head:

The image projection head takes the output from the image encoder (e.g., EfficientNet or ResNet) and applies a linear transformation followed by a non-linear activation function. This linear layer reduces the dimensionality of the image embeddings, typically to a lower-dimensional space such as 128 or 256 dimensions (we use 256). The purpose of the non-linear activation function (e.g., ReLU) is to introduce non-linearity into the projections, enabling the model to learn complex relationships between image features.

The image projection head can be seen as a learnable transformation that maps the high-dimensional image representations onto a more compact space while preserving important discriminative information.

2. Text Projection Head:

Similar to the image projection head, the text projection head also reduces the dimensionality of the text embeddings. It takes the output from the text encoder (e.g., BERT) and applies a linear transformation followed by an activation function.

Since BERT already generates text embeddings with a fixed dimensionality, the text projection head is primarily used to further refine the representations and make them more suitable for downstream tasks. The linear transformation in the projection head helps to capture important semantic information from the text and create a compact representation that can be effectively matched with the image embeddings.

The projection heads for both image and text encodings are jointly trained with the rest of the CLIP model using contrastive learning. The goal is to optimize the projection head parameters to enhance the alignment between the image and text embeddings, making them more semantically meaningful and facilitating effective retrieval and understanding of multimodal information.

Overall, the projection head in CLIP plays a crucial role in reducing the dimensionality of image and text embeddings, capturing discriminative information, and enabling efficient matching and retrieval of multimodal data.

Loss Function:

criterion = nn.CrossEntropyLoss()
...

# Forward pass: project a batch of images and their paired captions
# into the shared 256-dimensional space.
image_vec, text_vec = model(images, texts, masks)

# Pairwise similarity scores: logits[i][j] is the dot product between
# caption i and image j, so matching pairs lie on the diagonal.
logits = torch.matmul(text_vec, image_vec.T)

# The target "class" for row i is simply i (its own diagonal entry).
targets = torch.arange(logits.size(0)).long().to(device)

# Symmetric cross-entropy: text-to-image over rows, image-to-text over columns.
texts_loss = criterion(logits, targets)
images_loss = criterion(logits.T, targets)
loss = (images_loss + texts_loss) / 2.0

loss.backward()
optimizer.step()

During training, CLIP employs a contrastive loss function to align the image and text representations. The goal is to bring the representations of matching image-text pairs closer while pushing apart the representations of non-matching pairs.

Let’s break down how the CLIP loss function works:

  1. Computing Similarity Scores: Once we extract the image embedding and the text embedding from the encoders, the first step is to calculate their similarity score. This is typically done by taking the dot product between the image and text embeddings. The dot product reflects the similarity between the two vectors, with higher values indicating stronger similarity.
  2. Softmax Normalization: To convert the similarity scores into probabilities that sum to 1, softmax normalization is applied. The softmax function transforms the raw similarity scores into a probability distribution over all the samples in the batch. It is defined as:

softmax(x_i) = exp(x_i) / sum(exp(x_j) for j in range(batch_size))

Here, x_i represents the similarity score for a specific image-text pair, and exp() denotes the exponential function. By dividing the exponential of each similarity score by the sum of the exponentials across all pairs, the softmax function ensures that the resulting probabilities represent the relative similarities between pairs.

  3. Cross-Entropy Over Matching Pairs: In a batch of N image-text pairs, the matching pairs sit on the diagonal of the N×N similarity matrix. The cross-entropy loss (which applies the softmax internally) is computed over each row (text-to-image) and each column (image-to-text) with the diagonal index as the target, and the two losses are averaged. This is exactly what the training snippet above does with nn.CrossEntropyLoss and torch.arange targets.
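To make the loss concrete, here is a tiny self-contained example with made-up similarity scores for a batch of two image-text pairs; it mirrors the training snippet above, where the matching pairs sit on the diagonal:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Hypothetical 2x2 similarity matrix: rows are captions, columns are images,
# and the correct (matching) pairs are on the diagonal.
logits = torch.tensor([[4.0, 1.0],
                       [0.5, 3.0]])

targets = torch.arange(logits.size(0))       # tensor([0, 1])

texts_loss = criterion(logits, targets)      # text -> image direction (rows)
images_loss = criterion(logits.T, targets)   # image -> text direction (columns)
loss = (texts_loss + images_loss) / 2.0

print(loss.item())  # small, because the diagonal scores dominate their rows and columns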

Datasets:

I used the following datasets to train the CLIP model.

Overall, by combining the image encoders (EfficientNet and ResNet) and the text encoder (BERT) within the CLIP framework and training them with a contrastive loss function, we can create a powerful model that understands the Bangla language and performs multimodal tasks involving text and images.

Codebase:
