CLIP by OpenAI Explained

Pragyan Subedi
3 min read · Apr 19, 2023

This article is a concise explanation of the CLIP model by OpenAI.

Main Idea

“State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept.

Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision.” — Research Paper

How is the model trained?

The Contrastive Language-Image Pre-training (CLIP) model is trained with the following steps:

  1. Start with a dataset of paired texts and images.
  2. At each iteration, a batch of pairs is randomly sampled from this large dataset of paired texts and images.
  3. The text encoder encodes the text inputs into text feature vectors (T1, T2, …, TN). Similarly, the image encoder encodes the image inputs into image feature vectors (I1, I2, …, IN).
  4. Calculate the cosine similarities between every image vector and every text vector. The similarities for the matching pairs (I1·T1, I2·T2, …, IN·TN) lie on the diagonal of this matrix, highlighted in blue in the figure from the paper.
  5. For contrastive pre-training, we want these matching-pair similarities to be high, indicating that matching texts and images are mapped to nearby regions of the feature space, while keeping the remaining (mismatched) similarities low.
  6. To achieve this, we treat the similarity values, scaled by a temperature, as logits fed into a softmax classifier. This turns the problem into a classification problem, so the loss to be minimized is the cross-entropy loss. (For each row of the similarity matrix, we want the matching image-text pair to receive a high probability.)
  7. We do this for every image and every text in the batch and minimize the average cross-entropy loss, which corresponds to the InfoNCE loss. (A short code sketch of this loss follows the list.)

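To make steps 6 and 7 concrete, below is a minimal PyTorch sketch of the symmetric cross-entropy (InfoNCE) loss computed over the scaled cosine-similarity matrix. The function name and the fixed temperature value are illustrative assumptions; in CLIP the temperature is a learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Sketch of the symmetric contrastive (InfoNCE) loss described above.

    `image_features` and `text_features` are assumed to be (N, d) tensors
    produced by the image and text encoders for N matching image-text pairs.
    The fixed temperature here is illustrative; CLIP learns it as a parameter.
    """
    # L2-normalize so that dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of scaled cosine similarities: logits[i, j] = I_i · T_j / temperature
    logits = image_features @ text_features.t() / temperature

    # The matching pairs lie on the diagonal, so the "correct class" for row i is i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy over each row (image-to-text) and each column (text-to-image)
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)

    # Average the two directions, mirroring the pseudocode in the paper
    return (loss_images + loss_texts) / 2
```

Averaging the image-to-text and text-to-image cross-entropies is what makes the loss symmetric, as in the pseudocode given in the paper.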
As a result of this pre-training, we obtain a set of encoders (text and image) that can encode texts and images into a shared multimodal embedding space.

How does the model make predictions? — Zero-shot image classification

In this example, the task is to predict the class label of the input image (zero-shot image classification).

The Contrastive Language-Image Pre-training (CLIP) model makes predictions with the following steps:

  1. Pass each of the possible class labels (typically wrapped in a prompt such as “a photo of a {label}”) through the pre-trained text encoder to get the text feature vectors (T1, T2, T3, …, TN).
  2. Pass the image to be classified through the pre-trained image encoder to get the image feature vector (I1).
  3. Find the cosine similarity between each of the text feature vectors (T1, T2, T3, …, TN) and the image feature vector (I1).
  4. The class whose text feature vector gives the highest cosine similarity is predicted as the image’s label. In other words, we select the text feature vector that is closest in angular distance to the image feature vector. In the figure from the paper, the highest value corresponds to “dog”.

In this way, the CLIP model can perform predictions on new datasets without having ever been trained on those respective datasets of images and texts.
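As a concrete illustration of these prediction steps, here is a short sketch using the Hugging Face transformers library, which is one common way to run a pre-trained CLIP checkpoint (it is not the paper's own code). The checkpoint name, image path, and label prompts below are placeholder assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint name; substitute the model variant you want to use.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image path and candidate label prompts.
image = Image.open("example.jpg")
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# The processor tokenizes the label texts and preprocesses the image.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# and each label text; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The label with the highest probability is the zero-shot prediction for the image.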

CLIP Details

  • The CLIP model is trained on 400 million image-text pairs from the internet.
  • A very large minibatch size of 32,768 is used.
  • The model is trained for 32 epochs over the dataset.
  • Cosine learning rate decay is applied.
  • The architecture of the image encoder is ResNet-based or ViT-based.
  • The architecture of the text encoder is Transformer-based.

For more information, please refer to the research paper or the slide presentation.
