Understanding OpenAI’s CLIP model

Szymon Palucha
11 min read · Feb 24, 2024


CLIP was released by OpenAI in 2021 and has become one of the building blocks in many multimodal AI systems developed since then. This article is a deep dive into what it is, how it works, how it is used, and how it is implemented.

Introduction

CLIP, which stands for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. It was introduced in 2021 in the paper Learning Transferable Visual Models From Natural Language Supervision.

In summary, CLIP is a joint image and text embedding model trained on 400 million image and text pairs in a self-supervised way. This means that it maps both text and images to the same embedding space. So, for example, an image of a dog and the sentence “an image of a dog” would end up with very similar embeddings and be close to each other in the vector space. This is very significant as you can build many interesting applications with such a model, such as searching an image database with a text description or vice versa.

The authors found that CLIP can be used for a variety of tasks that it was not trained on. For instance, it achieved remarkable zero-shot performance on various benchmarks such as ImageNet, which is an image classification dataset. Zero-shot here means that the model was not explicitly trained on any of the 1.28M training examples in the ImageNet dataset. Nevertheless, CLIP matched the accuracy of the original ResNet-50, which was trained on that data!

But how do you use CLIP to classify images? Using ImageNet as an example, you take each of its 1000 possible classes/objects and embed them with CLIP using the prompt “a photo of a {object}” (for example “a photo of a dog” or “a photo of a cat”). This gives you 1000 different embeddings corresponding to all the possible classes. Next, you take the image that you want to classify, let’s say a photo of a dog, and also embed it with CLIP. Finally, you take the dot product between the image embedding and all the text embeddings. Since CLIP is trained in such a way that both images and text live in the same embedding space, and the dot product measures the similarity between the embeddings, it is highly likely that the dot product with “a photo of a dog” will be the highest. Thus, you can predict that the image is a dog. Note that if you want to turn CLIP into a true classifier you can also pass the dot products through the softmax function to get a predicted probability for each class.

The above process can be seen in steps 2 and 3 of the following figure.

Source: OpenAI’s Blog showing how to use CLIP for zero-shot image classification.
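
To make steps 2 and 3 concrete, here is a minimal sketch in NumPy with random vectors standing in for the CLIP embeddings (in practice these would come from the text and image encoders):

import numpy as np

# hypothetical class prompts; the embeddings below are random placeholders for CLIP's outputs
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
text_embeddings = np.random.randn(3, 512)   # one embedding per class prompt
image_embedding = np.random.randn(512)      # embedding of the image to classify

# normalise to unit length so that the dot product measures cosine similarity
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)
image_embedding /= np.linalg.norm(image_embedding)

# dot product between the image and every class prompt, then softmax for probabilities
similarities = text_embeddings @ image_embedding
probabilities = np.exp(similarities) / np.exp(similarities).sum()
predicted_class = prompts[int(similarities.argmax())]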

Now let’s dive into more detail of how CLIP works.

Model Details

Architecture

The CLIP model has two main components, a text encoder (which embeds the text) and an image encoder (which embeds the images). For the text encoder a Transformer was used. This architecture has revolutionised the field of NLP since 2017, so it is no surprise that it was chosen. For a great visual explanation please see the following blog.

For the image encoder the authors tried two different models, a ResNet-50 and a Vision Transformer (ViT). ResNet-50 is a well-known Convolutional Neural Network (CNN) architecture that was originally the state of the art for image classification. The ViT is a more recent adaptation of the original Transformer to images, where each image is split into a sequence of patches and passed into the model analogously to a sequence of tokens. The authors found that the ViT trained faster.
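
To illustrate the patching idea, here is a minimal sketch that reshapes a 224×224 image into a sequence of 32×32 patches (implementations typically do this with a strided convolution, but the result is the same kind of patch sequence):

import torch

image = torch.randn(3, 224, 224)  # a single RGB image (channels, height, width)
patch_size = 32

# split the image into a 7x7 grid of 32x32 patches and flatten each patch into a vector
patches = image.reshape(3, 7, patch_size, 7, patch_size)
patches = patches.permute(1, 3, 0, 2, 4).reshape(49, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([49, 3072]), a "sequence" of 49 patch tokens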

The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs.

Both the text and image encoder were trained from scratch.

We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.

For all architectures minor modifications were made as described in the paper.

Training

The authors initially tried to train an image captioning model which, given an image, predicts its exact caption/description.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method.

However, they found that this approach did not scale to training on the 400M (image, text) pairs, so they opted instead for a contrastive representation learning approach. The goal of contrastive representation learning is to learn an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart.

In a standard contrastive learning approach you give the model examples of the form (anchor, positive, negative), where the anchor is an image from one class, say a dog, the positive is another image from the same class (a dog), and the negative is an image from a different class, say a bird. You then embed the images and train the model so that the distance between the two embeddings of the same class, distance(anchor, positive), is minimised and the distance between the two embeddings of different classes, distance(anchor, negative), is maximised. This encourages the model to output very similar embeddings for the same objects and dissimilar embeddings for different objects.

A visualisation of contrastive learning. Source: https://www.v7labs.com/blog/contrastive-learning-guide
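
As a minimal illustration of this idea (and not CLIP’s actual loss), here is a sketch of a triplet-style contrastive objective in PyTorch, which pulls the anchor towards the positive and pushes it away from the negative:

import torch
import torch.nn.functional as F

def triplet_contrastive_loss(anchor, positive, negative, margin=0.2):
    # distances between the anchor and the positive/negative embeddings
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    # penalise cases where the positive is not at least `margin` closer than the negative
    return torch.clamp(pos_dist - neg_dist + margin, min=0).mean()

# placeholder embeddings for a batch of (dog image, another dog image, bird image) triplets
anchor, positive, negative = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
loss = triplet_contrastive_loss(anchor, positive, negative)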

The same approach can be applied to text as well as to combinations of text and images. For instance, for CLIP, for a single training example, the anchor could be an image of a dog, the positive could be the caption “an image of a dog”, and the negative could be the caption “an image of a bird”.

CLIP generalises this even further using a multi-class N-pair loss, an extension of the above to the case where there are multiple negatives and positives for each anchor. As described in the paper:

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximise the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimising the cosine similarity of the embeddings of the N² − N incorrect pairings. It optimises a symmetric cross entropy loss over these similarity scores.

The pseudocode below, provided in the paper, encapsulates the core details nicely:

Source: Learning Transferable Visual Models From Natural Language Supervision

The steps include:

  1. Embed the image with the image encoder and embed the text with the text encoder.
  2. The image and text embeddings will come from different models and will have different dimensions, so project them (by multiplying with a learnt projection matrix) into the same joint multimodal embedding space. For instance, np.dot(I_f, W_i) multiplies a matrix of size [n, d_i] with a matrix of size [d_i, d_e] which results in a projected matrix of size [n, d_e].
  3. Normalise the new embedding vectors. This turns them into unit vectors.
  4. Calculate the matrix of dot products.
  5. Calculate the cross entropy loss along each row and each column of this matrix and divide the sum by 2, since the same pairings are scored in both the image-to-text and text-to-image directions. A minimal PyTorch sketch of these steps is shown below.
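
This sketch follows the paper’s pseudocode; the projection matrices and temperature are shown as fixed tensors here, but in training they would be learnt parameters:

import torch
import torch.nn.functional as F

n, d_i, d_t, d_e = 32, 768, 512, 512                     # batch size, encoder dims, joint embedding dim
I_f, T_f = torch.randn(n, d_i), torch.randn(n, d_t)      # step 1: image and text encoder outputs
W_i, W_t = torch.randn(d_i, d_e), torch.randn(d_t, d_e)  # projection matrices (learnt in practice)
temperature = 0.07                                       # temperature (also learnt in practice)

# steps 2 and 3: project into the joint embedding space and normalise to unit vectors
I_e = F.normalize(I_f @ W_i, dim=1)
T_e = F.normalize(T_f @ W_t, dim=1)

# step 4: matrix of scaled dot products between every image and every text in the batch
logits = (I_e @ T_e.T) / temperature

# step 5: symmetric cross entropy, where the correct pairings lie on the diagonal
labels = torch.arange(n)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2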

Prompt Engineering and Ensembling

Since the rise of language models, prompt engineering has become a very common practice for getting good outputs from generative models. Since the text encoder in CLIP is a Transformer, the authors found prompt engineering to be crucial for getting good zero-shot performance as well. The authors found that it was relatively rare in their pre-training dataset for the text paired with an image to be just a single word, for example “dog”, representing a class label. It was more common for the text to be a full sentence like a caption or description of the image. The authors therefore found the prompt “a photo of a {object}” to be a good default, but for certain cases more specialised prompts worked better. For instance, for satellite images they found “a satellite photo of a {object}” to work well.

The authors also experimented with ensembling. Ensembling is where you combine the predictions of a few different models on the same inputs to get the final output, a common machine learning technique for addressing overfitting in high-variance models. In the case of CLIP, the authors construct the ensemble by building classifiers from many different prompts and averaging the resulting text embeddings for each class.

Both prompt engineering and ensembling showed a significant performance improvement on ImageNet.

On ImageNet, we ensemble 80 different context prompts and this improves performance by an additional 3.5% over the single default prompt discussed above. When considered together, prompt engineering and ensembling improve ImageNet accuracy by almost 5%.
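
As a minimal sketch of prompt ensembling (using the HuggingFace Transformers library, which is covered in more detail below, and some hypothetical prompt templates), each class embedding can be obtained by averaging the normalised text embeddings of several prompts:

import torch
import transformers

model = transformers.CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = transformers.CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]  # hypothetical templates
classes = ["dog", "cat", "bird"]

class_embeddings = []
with torch.no_grad():
    for name in classes:
        prompts = [template.format(name) for template in templates]
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        embeddings = model.get_text_features(**inputs)
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)  # unit vectors
        class_embeddings.append(embeddings.mean(dim=0))                  # average over the templates
class_embeddings = torch.stack(class_embeddings)  # one ensembled text embedding per class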

Limitations

Whilst the paper dives into many more experiments and results, it is important to also mention that CLIP is not perfect and has various limitations.

  • Due to the design decision mentioned earlier, CLIP is not a generative model and cannot, for instance, do image captioning.
  • The authors note that CLIP is still far from the state of the art (it is only comparable to a ResNet-50 with a linear layer on top). It generalises very badly to certain tasks, for example achieving only 88% on the easy MNIST handwritten digit recognition dataset. This is likely due to there being no similar images in its training data, and CLIP does little to address that.
  • CLIP is trained on text paired with images from the internet. These image-text pairs are unfiltered and uncurated, which results in CLIP models learning many social biases. (These are similar concerns to those around current LLMs, which techniques like RLHF and Direct Preference Optimisation try to tackle.)
  • The Transformer text encoder’s maximum sequence length (the maximum number of tokens that can be passed in) was capped at 76 in the original implementation, as the dataset consisted mostly of images paired with captions, which are generally short sentences. Therefore, the off-the-shelf pre-trained model would not work well with longer texts: the input would be cut off after 76 tokens, and the model was only trained on short texts.

Implementation Details

Inference with HuggingFace Transformers

You can use CLIP on your own computer with the HuggingFace Transformers library in just a few lines of code! First, import the library and load the pre-trained model and its processor.

import transformers

# load the pre-trained CLIP model and its processor (tokenisation + image pre-processing)
model = transformers.CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = transformers.CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Then create a list of captions/descriptions and a list of images. Each image should be loaded as a PIL image (for example opened from a local file, or downloaded from a URL first).

import PIL.Image

# load the image(s) to classify and list the candidate descriptions
images = [PIL.Image.open("for_example_a_dog_image.jpeg")]
possible_classes = ["an image of a bird", "an image of a dog", "an image of a cat"]

Call the processor, which tokenises the texts, pre-processes the images and prepares them to be passed into the model. This is very similar to calling a tokeniser in the standard text-only use case. Since we have a batch of descriptions, we need padding to “pad” them all to the same length so they can be stored as a single tensor, and truncation to cut any long sentences at the maximum sequence length (which is 76, as discussed earlier). Then pass the processed inputs to the model, which runs them through the text and image encoders.

import torch

inputs = processor(text=possible_classes, images=images, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

Now we can retrieve the matrix of (scaled) dot products from two different output attributes. Use logits_per_image to get the matrix with shape [num_of_images, num_of_texts] and logits_per_text to get the matrix with shape [num_of_texts, num_of_images].

dot_products_per_image = outputs.logits_per_image
dot_products_per_text = outputs.logits_per_text

Finally, we can pass these through a softmax function if we want to get a probability distribution for each image.

probabilities = dot_products_per_image.softmax(dim=1)
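
To read off the predicted class for our single image, we can take the index with the highest probability:

predicted_index = probabilities.argmax(dim=1).item()
print(possible_classes[predicted_index])  # hopefully "an image of a dog"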

Diving Deeper into the Implementation

The transformers CLIP source code can be found on GitHub and I found it to be a very nice, modular implementation. The main model is implemented in the CLIPModel class and you can see the main logic in the forward method, as shown below.

Core implementation of CLIP at the highest level. Source: modeling_clip.py

The vision and text models have some minor differences in their embeddings and layer norms, as shown below,

CLIP text encoder is a Transformer. Source: modeling_clip.py
CLIP image encoder is a Vision Transformer. Source: modeling_clip.py

but they both share the same CLIPEncoder, which is the main Transformer encoder. This comprises many sub-blocks, each called a CLIPEncoderLayer. Recall that in the Transformer architecture the encoder and decoder blocks are each stacked N times. For CLIP we don’t need a decoder as we don’t generate any text.

The transformer architecture from Attention is All You Need.

Each CLIPEncoderLayer is in turn composed of the attention mechanism, normalisation layers and a simple feed-forward network/multi-layer perceptron (MLP).

One of the N encoder layers. Source: modeling_clip.py
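
For illustration, here is a simplified sketch of such an encoder layer in PyTorch (pre-norm attention followed by an MLP, each with a residual connection); the actual CLIPEncoderLayer differs in details such as its attention implementation and activation function:

import torch
import torch.nn as nn

class SimpleEncoderLayer(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, mlp_dim=2048):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, hidden_states):
        # self-attention block with a residual connection
        residual = hidden_states
        normed = self.layer_norm1(hidden_states)
        attn_output, _ = self.self_attn(normed, normed, normed)
        hidden_states = residual + attn_output
        # feed-forward block with a residual connection
        hidden_states = hidden_states + self.mlp(self.layer_norm2(hidden_states))
        return hidden_states

# example: a batch of 2 sequences of 50 tokens, each a 512-dimensional embedding
layer = SimpleEncoderLayer()
output = layer(torch.randn(2, 50, 512))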

Finally, I went through and annotated the implementation for the multi head attention mechanism in the following gist — enjoy!

Further Work

As mentioned at the start, CLIP can be used in a variety of ways, especially in semantic search type applications. For example, we could use CLIP for image retrieval from a database by searching with a text description of the image.
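
As a minimal sketch (reusing the model, processor and images from the earlier example, and a hypothetical search query), text-to-image retrieval amounts to comparing a text embedding against pre-computed image embeddings:

import torch

with torch.no_grad():
    # pre-compute and normalise the embeddings of every image in the "database"
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeddings = model.get_image_features(**image_inputs)
    image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)

    # embed and normalise the search query
    text_inputs = processor(text=["a dog playing in the snow"], return_tensors="pt", padding=True)
    query_embedding = model.get_text_features(**text_inputs)
    query_embedding = query_embedding / query_embedding.norm(dim=-1, keepdim=True)

# cosine similarity between the query and every image, most similar first
similarities = (query_embedding @ image_embeddings.T).squeeze(0)
ranked_indices = similarities.argsort(descending=True)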

CLIP and its alternatives are also a building block for many multimodal models that have emerged since. For example, Flamingo, a Vision Language Model, can take an interleaved sequence of text and images in one go and generate text. Flamingo uses a vision encoder to bring the images into the same embedding space as the text.

Source: Flamingo paper. For full details on how Flamingo works, see my other article!

The authors experimented with both CLIP and their own version trained in a similar manner.

Finally, although we don’t know much about models like Google’s Gemini, they are likely using similar approaches to combine input data from various modalities, including audio and video!

Source: Gemini paper

Conclusion

In summary, CLIP is a joint text and image embedding model that can be used for many applications and as a building block for multimodal AI systems. It is also very easy to run in Python on a CPU in just a few lines of code.

I hope that you found this useful and thank you for reading! If you enjoyed it, you might also want to check out my article on Flamingo, which makes a good follow-up!
