A Guide to Fine-Tuning CLIP Models with Custom Data

Shashank Vats
AI monks.io
Jun 1, 2023

Artificial intelligence and machine learning have come a long way in recent years, with advances in the field allowing researchers and developers to achieve unprecedented results. One of those advances is CLIP (Contrastive Language–Image Pretraining) from OpenAI, a multimodal model with an exceptional ability to comprehend and interrelate text and images. That capability gives CLIP enormous potential across a multitude of applications, especially zero-shot classification, as discussed in my previous post.

So, wouldn’t it be fascinating to take this powerful model and fine-tune it to serve our unique needs?

But before we dive into the code, let’s first lay the groundwork by understanding the concepts of fine-tuning and its significance, particularly in the context of the CLIP model.

What is Fine-Tuning?

In machine learning, fine-tuning is a process of taking a pre-trained model and “tuning” its parameters slightly to adapt to a new, similar task. Why would we want to do this? There are a few reasons:

  1. Saves resources: Training a large model from scratch requires significant computational resources and time. By using a pre-trained model, we can leverage the patterns it has already learned, reducing the resources required.
  2. Leverages transfer learning: This is a big part of fine-tuning. The idea is that the knowledge gained while solving one problem can be applied to a different but related problem. For instance, a model trained on a large dataset of general images (like ImageNet) has learned to recognize various features in images. This knowledge can be transferred to a more specific task, such as recognizing types of clothing in images.
  3. Deals with limited data: In many cases, we might not have a large enough dataset for our specific task. Fine-tuning a pre-trained model on a smaller dataset can help prevent overfitting, as the model has already learned general features from the larger dataset it was initially trained on.

Why Fine-Tuning CLIP?

The CLIP model, as we mentioned before, is trained to understand and correlate images and text simultaneously. This is achieved by training the model on a large corpus of internet text and images. However, this generalized training might not make it an expert in understanding certain specific or specialized types of images or text. While the pre-trained CLIP model is powerful, to truly leverage its capabilities for a specific task or domain, fine-tuning is a crucial step.

The following sections of this article will provide you with a step-by-step guide on how to fine-tune the CLIP model with your own custom dataset using Python.

Importing Necessary Libraries

The initial part of the script is devoted to importing the necessary libraries and modules. This includes json for handling the annotation data, PIL for image processing, torch, clip, and transformers for model loading and fine-tuning, and tqdm for displaying training progress. Following this, the script loads the custom JSON dataset and the corresponding set of images from the defined paths.

import json
from PIL import Image
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import clip
from transformers import CLIPProcessor, CLIPModel

Preparing the Dataset

The dataset that we’ll be using here is the Indo Fashion dataset, available on Kaggle. It consists of 106K images covering 15 categories of Indian ethnic clothing. The classes are equally distributed across the validation and test sets, with 500 samples per class, which makes the dataset well suited to fine-grained classification of Indian ethnic clothes.

Our Python script starts by loading a custom dataset located at ‘path to train_data.json’ and a corresponding set of images at ‘path to training dataset’. These paths should be replaced with your own specific paths. The annotations are stored in JSON Lines format (one JSON object per line), which we read record by record and then use to build the lists of image paths and captions that our dataset class will consume, as sketched after the snippet below.

json_path = 'path to train_data.json'  # replace with your own path

with open(json_path, 'r') as f:
    input_data = []
    for line in f:
        obj = json.loads(line)
        input_data.append(obj)
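
From these records we build two parallel lists, one of image file paths and one of the matching captions, which the dataset class defined later will consume. The snippet below is only a sketch: the directory placeholder and the field names 'image_path' and 'class_label' are assumptions about the Indo Fashion annotation format, so adjust them to whatever keys your JSON actually contains.

import os

image_dir = 'path to training dataset'  # replace with your own path

# NOTE: 'image_path' and 'class_label' are assumed field names; check your JSON.
list_image_path = [os.path.join(image_dir, item['image_path']) for item in input_data]
list_txt = [item['class_label'] for item in input_data]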

Load the CLIP Model and Processor

Install the transformers library provided by the good folks at 🤗 Hugging Face using pip. Since the script also imports the clip package (for its tokenizer, image preprocessing, and fp16 helpers), install it from OpenAI’s repository as well.

pip install transformers
pip install git+https://github.com/openai/CLIP.git

Next, we load the pre-trained CLIP model from 🤗 Hugging Face’s model hub, as well as the corresponding processor for text and image data.

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

These lines load the pre-trained checkpoint and its processor from the Hugging Face hub. Note, however, that the dataset class and training loop in the rest of this guide follow the fine-tuning recipe from the openai/CLIP repository (see reference 3), which works with the model, tokenizer, and image preprocessing exposed by the clip package itself.
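
Concretely, the pieces the later code relies on, the model whose forward pass returns image and text logits, the preprocess transform applied to each image, and the device, all come from clip.load. Below is a minimal setup in the spirit of reference 3; the device check is an assumption rather than something taken from the original script, and the model object returned here (not the Hugging Face one above) is the one that gets fine-tuned later.

device = "cuda" if torch.cuda.is_available() else "cpu"

# clip.load returns the CLIP model and the matching image preprocessing transform;
# jit=False loads the non-TorchScript model so its weights can be fine-tuned.
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)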

Custom Dataset and DataLoader

We then create a custom dataset class, which takes in the paths of our images and the corresponding texts (captions). Within this class, we have methods for preprocessing the images and tokenizing the texts.

class image_title_dataset(Dataset):
    def __init__(self, list_image_path, list_txt):
        # Initialize image paths and corresponding texts
        self.image_path = list_image_path
        # Tokenize text using CLIP's tokenizer (pads each caption to CLIP's 77-token context)
        self.title = clip.tokenize(list_txt)

    def __len__(self):
        return len(self.title)

    def __getitem__(self, idx):
        # Preprocess image using the transform returned by clip.load
        image = preprocess(Image.open(self.image_path[idx]))
        title = self.title[idx]
        return image, title

The DataLoader object in PyTorch then helps us to efficiently load this data in batches during the training process.
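
Putting these pieces together, we instantiate the dataset with the lists built earlier and wrap it in a DataLoader. The batch size and shuffle flag below are illustrative choices rather than values prescribed by the original setup.

train_dataset = image_title_dataset(list_image_path, list_txt)

# batch_size is an illustrative choice; increase or decrease it to fit your GPU memory.
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)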

Model Fine-Tuning and Training

After loading our data, we prepare our model for fine-tuning. We choose an optimizer, in this case the Adam optimizer, with a learning rate of 5e-5, betas of (0.9, 0.98), an epsilon of 1e-6, and a weight decay of 0.2, the hyperparameters suggested in the fine-tuning recipe of reference 3.

Then, we define our loss functions: one nn.CrossEntropyLoss() for the image-to-text direction and one for the text-to-image direction, whose average gives the symmetric contrastive loss computed at each step of the training process.
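
The training loop below also calls a small helper, convert_models_to_fp32, taken from the recipe in reference 3. On GPU, the model returned by clip.load keeps fp16 weights, so the parameters and their gradients are cast back to fp32 before the optimizer step and converted back to fp16 afterwards. A minimal version of the helper looks like this:

def convert_models_to_fp32(model):
    # Cast parameters and their gradients to fp32 so the optimizer update is numerically stable.
    for p in model.parameters():
        p.data = p.data.float()
        p.grad.data = p.grad.data.float()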

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)
loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()

num_epochs = 30
for epoch in range(num_epochs):
    pbar = tqdm(train_dataloader, total=len(train_dataloader))
    for batch in pbar:
        optimizer.zero_grad()

        images, texts = batch

        images = images.to(device)
        texts = texts.to(device)

        # Forward pass: similarity logits between every image and every caption in the batch
        logits_per_image, logits_per_text = model(images, texts)

        # Compute loss: the ground truth is the diagonal, i.e. the i-th image matches the i-th caption
        ground_truth = torch.arange(len(images), dtype=torch.long, device=device)
        total_loss = (loss_img(logits_per_image, ground_truth) + loss_txt(logits_per_text, ground_truth)) / 2

        # Backward pass
        total_loss.backward()
        if device == "cpu":
            optimizer.step()
        else:
            # On GPU the clip model holds fp16 weights: cast to fp32 for the optimizer
            # step, then convert back to fp16 for the next forward pass
            convert_models_to_fp32(model)
            optimizer.step()
            clip.model.convert_weights(model)

        pbar.set_description(f"Epoch {epoch}/{num_epochs}, Loss: {total_loss.item():.4f}")

The training loop itself involves several steps:

  1. We begin each epoch by initializing a progress bar using tqdm to keep track of our progress.
  2. In each iteration, we load a batch of images and their tokenized captions and move them to the training device.
  3. The batch is passed through the model, which produces similarity logits between every image and every caption in the batch.
  4. These logits are compared with the ground truth (the i-th image belongs with the i-th caption) to calculate the cross-entropy loss in both the image-to-text and text-to-image directions.
  5. The averaged loss is then back-propagated through the network to update the model’s parameters.

This fine-tuning process will continue for the number of epochs defined, gradually improving the model’s understanding of the relationship between our specific set of images and their corresponding captions.

The complete code implementation can be found on my github repo!

References:

  1. https://github.com/openai/CLIP
  2. https://www.kaggle.com/datasets/validmodel/indo-fashion-dataset
  3. https://github.com/openai/CLIP/issues/83
