Unleashing the Potential of Zero-Shot Classification Using OpenAI CLIP

Shashank Vats
AI monks.io
Apr 23, 2023

What is Zero-Shot Classification?

Zero-shot classification is a task in which a model trained on a set of labeled examples is able to classify new examples from classes it has never seen. It can also be thought of as an instance of transfer learning, where the model is used on a task different from the one it was originally trained for. This approach is particularly helpful when the amount of labeled data is small. Zero-shot classification differs from one-shot / few-shot classification in that, unlike the latter, which requires one or a few examples of the task at hand, zero-shot classification requires no examples at all.

Traditional zero-shot learning requires providing some kind of descriptor for an unseen class (such as a set of visual attributes or simply the class name) so that the model can predict that class without training data. For example, CINEMA, ART, and MUSIC could be provided as descriptors to a model such as CLIP, which can then be used for the downstream task of classifying any given text without being fine-tuned on that task directly.
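To make this concrete, here is a minimal sketch of zero-shot text classification using the Hugging Face transformers pipeline. The model choice (an NLI-based model rather than CLIP), the example sentence, and the candidate labels are my own illustrative assumptions, not part of the original example.

from transformers import pipeline

# A zero-shot classifier scores a text against candidate labels it was never
# explicitly trained to predict.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The orchestra's new symphony premieres next week.",
    candidate_labels=["cinema", "art", "music"],
)
print(result["labels"][0])  # the highest-scoring label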

One of the approaches used to implement zero-shot classification is contrastive learning.

Contrastive Learning

The fundamental idea of contrastive learning is instance discrimination. Unlabelled data points are juxtaposed against each other to teach the model which points are similar and which are different, i.e., as the name suggests, samples are contrasted against each other: those belonging to the same distribution are pushed towards each other in the embedding space, whereas those belonging to different distributions are pushed apart.

Contrastive learning is an example of self-supervised learning. Similar to unsupervised learning, the input provided is unlabelled data. However, the model annotates the data on its own and the labels that it has predicted with high confidence are used as ground truths in a future iteration of model training.

Contrastive Learning in Computer Vision

Contrastive learning mimics the way humans learn. For example, we might not know what otters or grizzly bears are, but by looking at their images we can at least infer which pictures show the same animal.

Basic contrastive learning starts by selecting a data sample called the anchor. A data point belonging to the same distribution as the anchor is called a positive sample, and a data point belonging to a different distribution is called a negative sample. The SSL model tries to minimize the distance between the anchor and the positive sample in the latent space while maximizing the distance between the anchor and the negative sample. To achieve this, a contrastive loss function penalizes the model when it fails to distinguish between similar and dissimilar examples.

In other words, two images belonging to the same class should lie close to each other in the embedding space (distance d+), while those belonging to different classes should lie farther apart (distance d-). The contrastive learning model (often denoted by theta) therefore tries to minimize d+ and maximize d-.
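As a rough illustration (my own sketch, not from the original article), a margin-based, triplet-style contrastive loss over made-up embeddings could look like this:

import torch
import torch.nn.functional as F

# Made-up, randomly initialized embeddings standing in for encoder outputs.
anchor   = F.normalize(torch.randn(8, 128), dim=-1)
positive = F.normalize(torch.randn(8, 128), dim=-1)
negative = F.normalize(torch.randn(8, 128), dim=-1)

d_pos = 1 - (anchor * positive).sum(dim=-1)   # distance d+ (1 - cosine similarity)
d_neg = 1 - (anchor * negative).sum(dim=-1)   # distance d-

margin = 0.5  # assumed margin, chosen for illustration
# Penalize the model whenever d+ is not smaller than d- by at least the margin.
loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()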

There are multiple methods for constructing positive and negative samples, but the two popular ones are:

  • Instance Discrimination Method — In this class of contrastive learning, entire images undergo transformations and are used as positive samples for an anchor image (a small sketch follows this list). For example, if we select an image of a leopard as the anchor, we can apply augmentations such as mirroring the image or converting it to grayscale to produce the positive samples. The negative sample can be any other image in the dataset.
  • Image Subsampling Method — This class of Contrastive Learning methods breaks a single image into multiple patches of a fixed dimension (say, 10x10 patch windows). There might be some degree of overlap between the patches. Now, suppose we take the image of a cat and use one of its patches as the anchor while leveraging the rest as the positive samples. Patches from other images (say, one patch each of a raccoon, an owl, and a giraffe) are used as negative samples.
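The sketch below illustrates the instance-discrimination setup with torchvision transforms. The specific augmentations and the file paths are my own arbitrary choices for illustration.

from torchvision import transforms
from PIL import Image

# Two random augmentations of the same anchor image form a positive pair.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),        # e.g. mirroring
    transforms.RandomGrayscale(p=0.2),        # e.g. grayscale conversion
    transforms.ToTensor(),
])

anchor_img = Image.open("leopard.jpg")        # hypothetical file path
view_1 = augment(anchor_img)                  # positive sample
view_2 = augment(anchor_img)                  # another positive view of the anchor
negative = augment(Image.open("other.jpg"))   # any other image acts as a negative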

Contrastive Learning in NLP

Contrastive learning has seen applications in NLP as well, where the goal is to learn an embedding space in which similar sentences are close to each other while dissimilar ones are far apart. In computer vision, however, contrastive learning mostly amounts to generating augmentations of images. Constructing text augmentations is more challenging than image augmentations because we need to keep the sentence's meaning intact. There are a few methods for augmenting text sequences:

  1. Lexical Edits — This type of augmentation takes a sentence as input and randomly applies one of the following simple operations (a short sketch of two of these operations follows this list):
    Random Insertion: Insert a synonym of a randomly selected non-stop word at a random position in the sentence.
    Random Swap: Randomly swap two words, n times.
    Random Deletion: Randomly delete each word in the sentence with probability p.
    Synonym Replacement: Randomly choose n non-stop words from the sentence and replace each of them with one of its synonyms chosen at random.
  2. Cutoff — In this type of augmentation, once a sentence is embedded into a vector representation (say, of size n x m, where n = number of features and m = length of the sentence), one of the following three strategies is used:
    Feature Cutoff: Remove some selected features.
    Token Cutoff: Remove the information of a few selected tokens.
    Span Cutoff: Remove a continuous chunk of text.
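Here is a minimal sketch of two of the lexical edits above, random swap and random deletion, written as plain Python for illustration:

import random

def random_swap(words, n=1):
    # Randomly swap two words, n times.
    words = words.copy()
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Delete each word with probability p (but keep at least one word).
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))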

Contrastive Learning Framework: CLIP

There are many contrastive learning frameworks; the more popular ones are SimCLR and CLIP. For this article, however, I'll focus mainly on CLIP.

CLIP (Contrastive Language–Image Pre-training) is a multimodal model released by OpenAI. By design, the network can be instructed in natural language to perform a wide variety of classification benchmarks without directly optimizing for each benchmark's performance, which is what gives it zero-shot capabilities.

But how does it work?

As discussed earlier, CLIP uses contrastive learning to find the relationship between image-text pairs (or image-image / text-text pairs, as we'll see ahead).

It tries to minimise the difference between the encoding of an image and the encoding of its corresponding text. Encodings are lower-dimensional representations of the data that capture the most important features or information in that image or text. For instance, all images of dogs should have similar encodings and therefore lie close together in the latent space, whereas an image of a cat will have a distinctly different encoding and lie further away. Knowing that similar images have similar embeddings and different images have different embeddings, whenever we feed the model an image whose encoding is close to the dog encodings it has already seen, the model says it's a dog.

Contrary to supervised learning where we have labeled data and can directly minimise the difference between the predicted output and the label, in CLIP we don’t have explicit labels to guide the learning process. Instead, we treat the image encodings of the training images as the model output, and the text encodings of the corresponding captions as the expected output.

The idea is that if the model learns to create good image encodings that are similar to the corresponding text encodings, it should be able to “understand” the content of the image and the associated text better. This is because images and their captions typically contain complementary information, and by making the image and text encodings similar, the model learns to extract and represent important information from both modalities.

By minimizing the difference between the image and text encodings, the model learns to generate similar encodings for similar images and text, which is useful for various downstream tasks. For example, if the model has learned to generate similar encodings for images of cats, it should be able to classify new images of cats with higher accuracy, even if it hasn’t seen those exact images during training.
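This training objective can be sketched roughly as a symmetric cross-entropy over the pairwise similarities of matched image-text pairs in a batch. The snippet below is my own simplified illustration with random tensors standing in for the encoder outputs (and an assumed temperature of 0.07), not OpenAI's actual implementation:

import torch
import torch.nn.functional as F

batch_size, dim = 16, 512
# Stand-ins for the image- and text-encoder outputs of 16 matched pairs.
image_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_emb  = F.normalize(torch.randn(batch_size, dim), dim=-1)

logits = image_emb @ text_emb.T / 0.07          # cosine similarities, scaled by a temperature
targets = torch.arange(batch_size)              # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, targets) +      # image-to-text direction
        F.cross_entropy(logits.T, targets)) / 2 # text-to-image direction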

CLIP as a classifier

Let’s see how we can classify an image fed to the model.

First, we obtain all the possible descriptors (labels). Suppose the candidate labels are plane, car, dog, bird, etc. Each label is then encoded using a pre-trained text encoder into T1, T2, T3, …, TN.

We then take the image we want to classify, feed it through the pre-trained image encoder, and compute how similar the image encoding is to each text label encoding using cosine similarity. The image is classified as the label with the greatest similarity to the image.
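With the Hugging Face transformers implementation, this procedure looks roughly like the sketch below; the image path and the label prompts are placeholders of my own choosing.

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a plane", "a photo of a car", "a photo of a dog", "a photo of a bird"]
image = Image.open("example.jpg")                 # hypothetical image to classify

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarities turned into probabilities
print(labels[probs.argmax().item()])              # label most similar to the image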

Implementing Zero-Shot Classification using CLIP

Now that we know how zero-shot classification works, let's take an example from Kaggle: sentiment analysis on the Amazon Fine Food Reviews dataset.

Importing necessary libraries and loading the dataset

import os
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

from transformers import CLIPProcessor, CLIPModel

review_df = pd.read_csv('./Reviews.csv')
review_df.head()

Preprocess the data — Now, we process the text data by concatenating the Summary and Text columns and converting all the text to lowercase. The next step splits each review into words, joins the first 40 words of each review (since CLIP's text encoder can handle at most 77 tokens), and replaces the original column with a new column containing the first n words. The final part cleans the text by removing punctuation, replacing URLs with empty strings, and dropping reviews that have fewer than 5 or more than 40 words.

review_df['Review'] = review_df['Summary'].astype(str) + " " + review_df['Text'].astype(str)
review_df['Review'] = review_df['Review'].apply(lambda x: x.lower())
words = review_df['Review'].str.split()

# Join the first n words of each sentence together using the apply() function
first_n_words = words.apply(lambda x: ' '.join(x[:40]))

# Replace the original column with the new column containing the first n words
review_df['Review_CLIP'] = first_n_words

review_df['Review_CLIP'] = review_df['Review_CLIP'].replace(r'[^\w\s]|_', '', regex=True)
review_df['Review_CLIP'] = review_df['Review_CLIP'].replace(',', '', regex=True)
review_df['Review_CLIP'] = review_df['Review_CLIP'].replace("'", "", regex=True)

# Keep reviews with between 5 and 39 words; work on an explicit copy to avoid SettingWithCopyWarning
subset = review_df[review_df["Review_CLIP"].apply(lambda x: 4 < len(str(x).split()) < 40)].copy()
subset["Review_CLIP"] = subset["Review_CLIP"].apply(lambda x: " ".join(x.split()))
subset['Review_CLIP'] = subset['Review_CLIP'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Load the CLIP model and processor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

device = "cpu"
model.to(device)

Create sentiment embeddings — This code creates sentiment embeddings by passing a list of sentiment labels through the CLIP processor and model. The embeddings are then normalized using the L2 norm.

sentiment_list = ['positive', 'negative', 'neutral']
sentiment_list = [f'{sentiment} review' for sentiment in sentiment_list]

sentiment_embeddings = processor(
text=sentiment_list,
padding=True,
images=None,
return_tensors='pt'
).to(device)
sentiment_embeddings = model.get_text_features(**sentiment_embeddings)
sentiment_embeddings = sentiment_embeddings.detach().numpy() / np.linalg.norm(sentiment_embeddings.detach().numpy(), ord=2, axis=-1, keepdims=True)

Zero-Shot Classification — After defining the batch size, the list of review texts, and an empty output list, we loop through the review descriptions in batches of 32. For each batch, we obtain the embeddings using CLIPProcessor and CLIPModel and normalize them with L2 normalization. We then compute the dot product of the sentiment embeddings and the review embeddings to obtain a score distribution over the sentiment classes, and predict the sentiment for each review in the batch by taking the class with the highest score. If a batch fails (for example, because it exceeds the token limit), we retry with truncated texts and, as a last resort, encode the reviews one at a time.

batch_size = 32
review_description = subset['Review_CLIP'].tolist()
sentiment_output = []

for i in range(0, len(review_description), batch_size):
    try:
        batch = review_description[i:i+batch_size]
        description_encode = processor(
            text=batch,
            padding=True,
            images=None,
            return_tensors='pt'
        ).to(device)

        description_encode = model.get_text_features(**description_encode)
        description_encode = description_encode.detach().numpy() / np.linalg.norm(description_encode.detach().numpy(), ord=2, axis=-1, keepdims=True)
        predicted_classes_distribution = np.dot(sentiment_embeddings, description_encode.T)
        predicted = [sentiment_list[k] for k in np.argmax(predicted_classes_distribution, axis=0)]

        # check that predicted has same length as batch
        if len(predicted) == len(batch):
            sentiment_output.extend(predicted)
        else:
            # if predicted has different length, fill invalid values
            sentiment_output.extend(['invalid']*len(batch))
    except RuntimeError:
        try:
            # retry with shorter texts in case the batch exceeded the token limit
            batch = [" ".join(desc.split()[:25]) for desc in review_description[i:i+batch_size]]
            description_encode = processor(
                text=batch,
                padding=True,
                images=None,
                return_tensors='pt'
            ).to(device)

            description_encode = model.get_text_features(**description_encode)
            description_encode = description_encode.detach().numpy() / np.linalg.norm(description_encode.detach().numpy(), ord=2, axis=-1, keepdims=True)
            predicted_classes_distribution = np.dot(sentiment_embeddings, description_encode.T)
            predicted = [sentiment_list[k] for k in np.argmax(predicted_classes_distribution, axis=0)]

            # check that predicted has same length as batch
            if len(predicted) == len(batch):
                sentiment_output.extend(predicted)
            else:
                # if predicted has different length, fill invalid values
                sentiment_output.extend(['invalid']*len(batch))
        except RuntimeError:
            # fall back to encoding the reviews in this batch one at a time
            batch = review_description[i:i+batch_size]
            for b in batch:
                try:
                    description_encode = processor(
                        text=b,
                        padding=True,
                        images=None,
                        return_tensors='pt'
                    ).to(device)
                    description_encode = model.get_text_features(**description_encode)
                    description_encode = description_encode.detach().numpy() / np.linalg.norm(description_encode.detach().numpy(), ord=2, axis=-1, keepdims=True)
                    predicted_classes_distribution = np.dot(sentiment_embeddings, description_encode.T)
                    sentiment_output.append(sentiment_list[np.argmax(predicted_classes_distribution, axis=0)[0]])
                except RuntimeError:
                    # give up on this review and mark it as invalid
                    sentiment_output.append('invalid')

Saving the output into the dataset

subset['sentiment'] = sentiment_output

You can also check out the implementation in my GitHub repository.

If you’re looking to fine-tune your CLIP model, do check out my article!

The code to fine-tune it is also available in my GitHub repository.

