Getting started with OpenAI’s CLIP

Kerry Halupka
6 min read · Jan 29, 2023


The internal workings of CLIP (according to Stable Diffusion)

CLIP has changed my life.

OK, I may be exaggerating slightly. But I can truthfully say that CLIP (Contrastive Language-Image Pre-Training) has changed quite a few of my ML workflows for the better. It’s my go-to for so many different problems now. Once you understand it and the many ways it can be used you’re sure to agree with me.

CLIP is a neural network trained on a variety of image and text pairs. It essentially creates a shared embedding space for images and text, meaning that you could use it to find the most relevant caption (given a selection of captions) for a given image, or, vice versa, find the most relevant image for a given text snippet. An important thing to note is that CLIP is not a generative model, i.e. it does not generate the text snippet or image; you use the embedding space to retrieve a previously embedded item.
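As a quick taste of that shared embedding space, here's a minimal caption-retrieval sketch. It uses the HuggingFace setup we'll walk through in the next section, and the image URL and candidate captions are just placeholders for illustration:

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image and captions, just to illustrate retrieval
image = Image.open(requests.get(
    'http://images.cocodataset.org/val2014/COCO_val2014_000000159977.jpg',
    stream=True).raw)
captions = ['a giraffe standing in a field', 'a plate of food', 'a city street at night']

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs.pixel_values)
    text_emb = model.get_text_features(input_ids=inputs.input_ids,
                                       attention_mask=inputs.attention_mask)

# Normalise and rank the captions by cosine similarity to the image
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).squeeze(0)
print(captions[similarity.argmax()])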

We’ll go into how this is useful in future posts, but for now let’s just learn how to set it up.

Loading and instantiating CLIP

Thanks to HuggingFace integrating CLIP in their transformers library, using it has never been easier (seriously, kids these days with their pre-trained models have no idea what we used to go through to set up a CNN).

Note: You could alternatively use the `clip` library released directly by OpenAI, but the HuggingFace implementation is easier and faster to set up (mostly because of the “processor” module which we’ll explore below), and by using the transformers package you can easily interchange the model with others (as we’ll show below).
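For reference, the equivalent setup with OpenAI’s clip package looks roughly like this (a sketch; it assumes you’ve installed the package from the OpenAI CLIP GitHub repo). We’ll stick with the transformers version for the rest of this post:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # returns the model and an image preprocessing function
text_tokens = clip.tokenize(["a giraffe", "a zebra", "an elephant"]).to(device)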

Setting up CLIP takes just 3 lines of code (beware, this will download a local copy of the model weights, so it will take a while!):

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

The above code instantiates a model and a processor using the CLIPProcessor and CLIPModel classes from the transformers package.

  • Model: it probably comes as no surprise that this is the CLIP model. What might interest you more though is that under the hood it’s actually two models! This is because CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text features; this class wraps up both of these pieces (you can peek at them in the sketch after this list).
  • Processor: The CLIPProcessor also wraps up two pieces: the CLIPFeatureExtractor to prepare the images for the image network, and the CLIPTokenizer, which encodes text ready for the language model.
  • Both Model and Processor require a config to be specified (I’ve specified openai/clip-vit-base-patch32, which uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder). You need to use the same config string for both, otherwise you won’t have a good time.
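If you want to see those pieces for yourself, a quick inspection looks something like this (the attribute names below are from the transformers implementation at the time of writing, so treat this as a sketch):

print(type(model.vision_model))           # the ViT-style image encoder
print(type(model.text_model))             # the transformer text encoder
print(type(processor.tokenizer))          # CLIPTokenizer, for the text side
print(type(processor.feature_extractor))  # CLIPFeatureExtractor, for the image side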

Using a specific CLIP Config

You can very easily test out other model configs by searching the HuggingFace model zoo and filtering for CLIP models, like this.

For example, you could use the new CLIP model trained with the LAION-2B English subset of LAION-5B, like so:

model = CLIPModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

Zero-shot classification with CLIP

OK so we’ve loaded the model and processor, now let’s actually use them. The classic use-case for CLIP is zero-shot classification.

Usually with classification tasks we would need to label a number of images with our target classes (like cat/dog), then train the network using binary or multi-class cross-entropy to predict the correct label. If we wanted to add another label to the set we’d need to re-label our dataset.

However, with zero-shot classification we’re able to classify an image with labels that the model was not specifically trained on, and we can also add new labels without re-training.

In the code below I’ve selected some images from the COCO dataset to test on. First let’s download our images from their URLs and visualise them in a grid:

import requests
from PIL import Image

def image_grid(imgs, cols):
    # Paste the images into a single grid image with the given number of columns
    rows = (len(imgs) + cols - 1) // cols
    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

image_urls = [
    'http://images.cocodataset.org/val2014/COCO_val2014_000000159977.jpg',
    'http://images.cocodataset.org/val2014/COCO_val2014_000000311295.jpg',
    'http://images.cocodataset.org/val2014/COCO_val2014_000000457834.jpg',
    'http://images.cocodataset.org/val2014/COCO_val2014_000000555472.jpg',
    'http://images.cocodataset.org/val2014/COCO_val2014_000000174070.jpg',
    'http://images.cocodataset.org/val2014/COCO_val2014_000000460929.jpg'
]
images = []
for url in image_urls:
    images.append(Image.open(requests.get(url, stream=True).raw))

grid = image_grid(images, cols=3)
display(grid)  # display() is available in Jupyter/IPython notebooks

This displays our images in a grid, like so:

A grid of images from the COCO dataset (two rows by three columns).

The labels I’m going to use are giraffe, zebra, and elephant. You can see that the first four images contain those labels, but the last two don’t contain any of them, which will be interesting for our classification task.

Now we can classify the images using the code below:

classes = ['giraffe', 'zebra', 'elephant']
inputs = processor(text=classes, images=images, return_tensors="pt", padding=True, do_convert_rgb=False)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

In this code we’re using the processor to simultaneously preprocess the list of images and encode the classes. The call to processor returns a dictionary containing everything required by the model:

  • a batch of image tensors (inputs.pixel_values),
  • a tensor of sequence tokens of the class strings in the vocabulary (inputs.input_ids), and
  • a tensor containing an attention mask to avoid performing attention on padding token indices (inputs.attention_mask); all three appear in the quick shape check below.
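To sanity-check what the processor produced, you can print the shape of each item; the exact sequence length depends on the tokenizer’s padding, and 224×224 is the default image size for this config:

for key, value in inputs.items():
    print(key, tuple(value.shape))
# pixel_values   -> (6, 3, 224, 224), the batch of preprocessed images
# input_ids      -> (3, sequence_length), the tokenised class strings
# attention_mask -> (3, sequence_length)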

The output of the model is also a dictionary. It contains a bunch of useful things, including the embeddings for the classes and images, and the scaled dot-product scores between our images and classes, otherwise known as the image-text similarity score. The latter is what we’re interested in, and it can be accessed through outputs.logits_per_image.

To classify each image we perform a softmax over the logits; this provides a “probability” for each of the classes for each image. The softmax assigns decimal probabilities to each class in a multi-class problem, such that the probabilities add up to 1.
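To turn those probabilities into an actual prediction per image, you can take the argmax and map it back to the class names (a small sketch continuing from the code above):

for i, image_probs in enumerate(probs):
    pred = image_probs.argmax().item()
    print(f"image {i}: {classes[pred]} (p={image_probs[pred].item():.2f})")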

The output for our mini dataset is as follows (if you’re interested in the code to visualise this, check out the GitHub link at the bottom of this post):

The class probabilities for each image, as predicted by CLIP.

For each of the images where one of the classes is clearly visible, the correct class is predicted by CLIP. However, there are some issues:

  • In the second image the zebras are quite small, and the probabilities of the three classes end up quite similar.
  • The last two images don’t contain any of the classes, but because the softmax probabilities always sum to 1, the highest-scoring class still gets an artificially inflated probability.

We can help with the second issue by adding more classes (this is where the magic of zero-shot classification shows up). To create the results below I added the following two classes to our list: teddybear and hotdog.
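Re-running the classification is just a repeat of the earlier snippet with the extended label list (reusing the images, processor and model from above):

classes = ['giraffe', 'zebra', 'elephant', 'teddybear', 'hotdog']
inputs = processor(text=classes, images=images, return_tensors="pt", padding=True, do_convert_rgb=False)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probabilities over the five classes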

You can see that by adding classes that are actually visible in the images we have removed the issue of the model predicting the wrong class. This example shows how important it is to select relevant classes when doing zero-shot classification.

This is just one of the great ways you can use CLIP; I’ll explain the other ways I use it in future posts.

Check out the full code on GitHub here, or follow me on LinkedIn here.


Kerry Halupka

I’m a Machine Learning Engineer at Canva. Writing to fight my imposter syndrome and share my love of all things ML.