CLIP: Creating Image Classifiers Without Data

A hands-on tutorial explaining how to generate a custom Zero-Shot image classifier without training, using a pre-trained CLIP model. Full code included.

Lihi Gur Arie, PhD
Towards Data Science

--

Image generated by the author with Midjourney

Introduction

Imagine you need to classify whether people wear glasses, but you have no data or resources to train a custom model. In this tutorial, you will learn how to use a pre-trained CLIP model to create a custom classifier without any training required. This approach is known as Zero-Shot image classification, and it enables classifying images of classes that were not explicitly seen during the training of the original CLIP model. An easy-to-use Jupyter notebook with the full code is provided below for your convenience.

CLIP: Theoretical Background

The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, is a multi-modal vision and language model. It maps images and text descriptions to the same latent space, allowing it to determine whether an image and a description match. CLIP was trained contrastively to predict which captions correspond to which images in a dataset of 400 million image-text pairs collected from the internet [1]. Remarkably, classifiers built from the pre-trained CLIP model were shown to achieve results competitive with fully supervised baselines, and in this tutorial we will use this pre-trained model to generate a glasses detector.

CLIP contrastive training

The CLIP model consists of an Image Encoder and a Text Encoder (Figure 1). During training, a batch of images is processed through the Image Encoder (a ResNet variant or a ViT) to obtain image representation tensors (embeddings). In parallel, their corresponding descriptions are processed through the Text Encoder (a Transformer) to obtain text embeddings. The CLIP model was trained to predict which image embedding belongs to which text embedding within the batch. This is achieved by jointly training the Image Encoder and the Text Encoder to maximize the cosine similarity [2] between the image and text embeddings of real pairs in the batch (Figure 1, blue squares on the diagonal) while minimizing the cosine similarity between the embeddings of incorrect pairings (Figure 1, white squares). The optimization uses a symmetric cross-entropy loss over these similarity scores.

Figure 1 — illustration of the CLIP training process in a mini-batch. T1 is the embedding vector of class1, I1 is the embedding vector of image1, etc. | Image is taken from Radford et al., 2021 [1]
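
To make the training objective concrete, here is a minimal sketch of the symmetric contrastive loss, adapted from the pseudocode in the CLIP paper [1]. The function and variable names are illustrative, not the authors' code:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    # image_embeddings, text_embeddings: [batch_size, embed_dim]; row i of each is a matching pair
    # L2-normalize so that dot products equal cosine similarities
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Pairwise cosine similarities scaled by a temperature -> [batch_size, batch_size]
    # (in CLIP the temperature is a learned parameter; a fixed value is used here for illustration)
    logits = image_embeddings @ text_embeddings.t() / temperature

    # The correct pairing for row i is column i (the blue diagonal in Figure 1)
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match images to texts and texts to images
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2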

Creating a Custom Classifier

To create a custom classifier using CLIP, the names of the classes are transformed into a text embedding vector by the pre-trained Text Encoder, while the image is embedded using the pre-trained Image Encoder (Figure 2). The cosine similarity between the image embedding and each of the text embeddings is then computed, and the image is assigned to the class with the highest cosine similarity score.

Figure 2 — Zero-shot classification with CLIP | Image from Radford et al., 2021 [1], edited by the author. The face image is taken from the ‘Glasses or No Glasses’ dataset on Kaggle [3].
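
Under the hood, this classification rule is just a cosine-similarity comparison followed by an argmax. A minimal sketch of the idea is shown below (the tensors are placeholders; the concrete CLIP calls follow in the next sections):

import torch.nn.functional as F

def zero_shot_predict(image_embedding, text_embeddings, class_names):
    # image_embedding: [1, embed_dim]; text_embeddings: [num_classes, embed_dim]
    image_embedding = F.normalize(image_embedding, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Cosine similarity between the image and each class prompt -> [1, num_classes]
    similarities = image_embedding @ text_embeddings.t()

    # Assign the class whose prompt embedding is closest to the image embedding
    best = similarities.argmax(dim=-1).item()
    return class_names[best], similarities[0, best].item()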

Code Implementation

Dataset

In this tutorial, we will create an image classifier that detects whether people wear eyeglasses, using the ‘Glasses or No Glasses’ dataset from Kaggle [3] to evaluate the classifier’s performance. Although the dataset contains 5000 images, we will only use the first 100 to expedite the demonstration. The dataset consists of a folder with all the images and a CSV file with the labels. To facilitate the loading of image paths and labels, we will subclass the PyTorch Dataset class to create a CustomDataset() class. You can find the code for this in the provided notebook.

Random images from ‘Glasses or No Glasses’ dataset on Kaggle [3]
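
The notebook contains the author’s CustomDataset() implementation. As a rough illustration of what it needs to provide, a wrapper returning (image_path, label) pairs might look like the sketch below; the folder layout, CSV column names, and file naming are assumptions about the Kaggle dataset, so adjust them to the actual files:

import os
import pandas as pd
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    # Yields (image_path, class_idx) pairs, matching how validate() consumes the dataset later on
    def __init__(self, images_dir, labels_csv, id_column='id', label_column='glasses'):
        self.images_dir = images_dir
        self.labels = pd.read_csv(labels_csv)   # one row per image with its label
        self.id_column = id_column              # assumed column names -- check the actual CSV header
        self.label_column = label_column

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        row = self.labels.iloc[idx]
        image_path = os.path.join(self.images_dir, f"{row[self.id_column]}.jpg")  # assumed file naming
        class_idx = int(row[self.label_column])  # assumed: 0 = no glasses, 1 = glasses
        return image_path, class_idx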

Loading CLIP model

After installing and importing CLIP and the related libraries, we load the model along with the torchvision preprocessing pipeline it requires. The text encoder is a Transformer, and the image encoder can be either a Vision Transformer (ViT) or a ResNet variant such as ResNet50. To see the available image encoders, you can use the command clip.available_models().

print(clip.available_models())
model, preprocess = clip.load("RN50")
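
For reference, installing CLIP and gathering the imports used by the snippets in this tutorial might look roughly like this (a sketch following the openai/CLIP README; the exact setup is in the linked notebook [0]):

# pip install ftfy regex tqdm
# pip install git+https://github.com/openai/CLIP.git

import clip
import numpy as np
import pandas as pd
import torch
from PIL import Image
from sklearn.metrics import accuracy_score
from tqdm import tqdm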

Extracting text embeddings

The text labels are first processed by the text tokenizer (clip.tokenize()), which converts the label words into numerical token ids. This produces a padded tensor of size N x 77 (where N is the number of classes, so 2 x 77 for our binary classification), which serves as input to the Text Encoder. The Text Encoder then transforms this tensor into an N x D tensor of text embeddings, where each class is represented by a single vector and D is the joint embedding dimension of the loaded model (1024 for RN50, 512 for ViT-B/32). To encode the text and retrieve the embeddings, use the model.encode_text() method.

preprocessed_text = clip.tokenize(['no glasses','glasses'])
text_embedding = model.encode_text(preprocessed_text)

Extracting image embeddings

Before being fed into the Image Encoder, each image undergoes preprocessing (resizing, center-cropping, and normalization) to meet the encoder’s input requirements. Once preprocessed, the image is passed to the Image Encoder, which outputs a 1 x 1024 image embedding tensor for the RN50 encoder (the embedding dimension differs for other encoders).

preprocessed_image = preprocess(Image.open(image_path)).unsqueeze(0)
image_embedding = model.encode_image(preprocessed_image)

Similarity results

To measure the similarity between the image embedding and each text label embedding, we’ll use the cosine similarity metric. Calling model() with the preprocessed image and text inputs passes them through the Image and Text Encoders and computes the cosine similarities between the image and text features, multiplied by the model’s logit scale of roughly 100 (image_logits). Softmax is then used to normalize the logits into a probability distribution over the classes. Since we are not training the model, we disable gradient calculation using torch.no_grad().

with torch.no_grad():
    image_logits, _ = model(preprocessed_image, preprocessed_text)
    proba_list = image_logits.softmax(dim=-1).cpu().numpy()[0]

The class with the highest probability is set as the predicted class, and its index, probability, and corresponding token are extracted.

y_pred = np.argmax(proba_list)
y_pred_proba = np.max(proba_list)
y_pred_token = ['no glasses', 'glasses'][y_pred]

Wrapping the code

We can wrap this code in a Python class called CustomClassifier. Upon initialization, the pre-trained CLIP model is loaded and the class prompts are tokenized. We’ll define a classify() method that takes an image path as input and returns the predicted label with its probability score (stored in a DataFrame called df_results). To evaluate the model’s performance, we’ll define a validate() method that uses a PyTorch Dataset instance (CustomDataset()) to retrieve images and labels, predicts results by calling the classify() method, and evaluates the model’s performance. This method returns a DataFrame with the predicted labels and probability scores for all the images. The max_images argument is used to restrict the number of evaluated images to 100.

class CustomClassifier:

    def __init__(self, prompts):
        self.class_prompts = prompts
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model, self.preprocess = clip.load("RN50", device=self.device)  # or "ViT-B/32"
        self.preprocessed_text = clip.tokenize(self.class_prompts).to(self.device)  # tokenize the prompts once
        print(f'Classes Prompts: {self.class_prompts}')

    def classify(self, image_path, y_true=None):
        # Preprocess the image and compute its similarity to each class prompt
        preprocessed_image = self.preprocess(Image.open(image_path)).unsqueeze(0).to(self.device)

        with torch.no_grad():
            image_logits, _ = self.model(preprocessed_image, self.preprocessed_text)
            proba_list = image_logits.softmax(dim=-1).cpu().numpy()[0]

        y_pred = np.argmax(proba_list)      # index of the most probable class
        y_pred_proba = np.max(proba_list)   # its probability
        y_pred_token = self.class_prompts[y_pred]
        results = pd.DataFrame([{'image': image_path, 'y_true': y_true, 'y_pred': y_pred,
                                 'y_pred_token': y_pred_token, 'proba': y_pred_proba}])
        return results

    def validate(self, dataset, max_images):
        # Classify the first `max_images` samples and report accuracy against the true labels
        df_results = pd.DataFrame()
        for sample in tqdm(range(max_images)):
            image_path, class_idx = dataset[sample]
            image_results = self.classify(image_path, class_idx)
            df_results = pd.concat([df_results, image_results])

        accuracy = accuracy_score(df_results.y_true, df_results.y_pred)
        print(f'Accuracy - {round(accuracy, 2)}')
        return accuracy, df_results

A single image can be classified with the classify() method:

prompts = ['no glasses','glasses']
image_results = CustomClassifier(prompts).classify(image_path)

The classifier’s performance can be evaluated by the validate() method:

accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images=100)

Notably, using the original [‘no glasses’, ‘glasses’] class labels, we achieved a decent accuracy of 0.82 without training any model, and we can improve the results even further through prompt engineering.

Prompt Engineering

The CLIP classifier encodes the text labels, known as prompts, into the learned latent space and compares them with the image embedding in that space. Modifying the wording of a prompt changes its text embedding, which can impact the performance of the classifier. To improve prediction accuracy, we’ll explore multiple prompts through trial and error and select the one that yields the best results. For example, using the prompts ‘photo of a man with no glasses’ and ‘photo of a man with glasses’ resulted in an accuracy of 0.94.

prompts = ['photo of a man with no glasses', 'photo of a man with glasses']
accuracy, df_results = CustomClassifier(prompts).validate(glasses_dataset, max_images=100)
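
To compare several candidate prompt pairs systematically, we can simply loop over them and reuse the validate() method (a short sketch; glasses_dataset is the CustomDataset instance from the notebook):

candidate_prompts = [
    ['no glasses', 'glasses'],
    ['face without glasses', 'face with glasses'],
    ['photo of a man with no glasses', 'photo of a man with glasses'],
]

for prompts in candidate_prompts:
    accuracy, _ = CustomClassifier(prompts).validate(glasses_dataset, max_images=100)
    print(f'{prompts} -> accuracy {accuracy:.2f}')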

Analyzing multiple prompts produced the following outcomes:

  • [‘no glasses’, ‘glasses’] — 0.82 accuracy
  • [‘face without glasses’, ‘face with glasses’] — 0.89 accuracy
  • [‘photo of a man with no glasses’, ‘photo of a man with glasses’] — 0.94 accuracy

As we can see, adjusting the wording can significantly enhance performance. By analyzing multiple prompts, we improved the accuracy from the 0.82 baseline to 0.94. However, it’s important to avoid overfitting the prompts to the dataset.

Concluding Remarks

The CLIP model is an incredibly powerful tool for developing zero-shot classifiers across a wide variety of tasks. With CLIP, I was able to effortlessly generate on-the-fly classifiers with highly satisfactory accuracy for my projects. However, CLIP might struggle with fine-grained classification, abstract or systematic tasks such as counting objects, and truly out-of-distribution images that were not covered in its pre-training dataset. Therefore, its performance on a new task should be evaluated beforehand.

Using the Jupyter notebook provided below, you can easily create your own custom classifier. Just follow the instructions, add your data, and you’ll have a personalized classifier up and running in no time.

Thank you for reading!

Full Jupyter Notebook Code

The full code for the tutorial is provided in the first reference [0].

References

[0] Code: https://gist.github.com/Lihi-Gur-Arie/844a4c3e98a7561d4e0ddb95879f8c11

[1] CLIP article: https://arxiv.org/pdf/2103.00020v1.pdf

[2] Cosine similarity review: https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a

[3] ‘Glasses or No Glasses’ dataset from Kaggle, license CC BY-SA 4.0: https://www.kaggle.com/datasets/jeffheaton/glasses-or-no-glasses
