Fine-tuning Small Vision Language Models: Phi-3-vision

Liana Napalkova
13 min read · Jul 12, 2024


Image generated by Copilot — Designer (Microsoft), 2024 (https://www.bing.com/images/create)

In this article, we will explore the process of fine-tuning Phi-3-vision, a small vision-language model developed by Microsoft, using HuggingFace transformers. With 4.2 billion parameters and support for context prompts of up to 128,000 tokens, Phi-3-vision is designed to handle tasks that involve both visual and textual data [1]. As a multimodal model, it takes image and text inputs and generates text outputs. Like other vision language models, its capabilities include chatting about images, instruction-based image recognition, visual question answering, explaining charts or diagrams, image captioning, and more [2].

Content overview:

  1. Practical applications for multimodal fine-tuning
  2. Architecture of Phi-3-vision
  3. Application example: Domain-specific image captioning
  4. Datasets
  5. Inference using base model
  6. Fine-tuning process
  7. Fine-tuning result

1. Practical applications for multimodal fine-tuning

Multimodal fine-tuning of small vision language models like Phi-3-vision opens new opportunities across industries, particularly those relying on edge devices and on-premises systems. Institutions that must keep processing on-premises for data confidentiality reasons can benefit from the model’s ability to be fine-tuned without extensive computational resources, which makes it easier to maintain and secure data within their own infrastructure.

Edge devices can also leverage vision language models for practical use cases. In agriculture, for example, edge devices operating in areas with limited or no internet connectivity can combine photos of crops with textual data, such as pre-uploaded plant health guidelines, to provide real-time analysis.

Fine-tuning offers several advantages, including reduced latency, smaller prompt sizes during inference, and enhanced data security. Reduced latency is achieved by optimizing the model for specific tasks, which decreases processing time. Smaller prompt sizes result from the model’s ability to understand more concise inputs. Data security is enhanced by keeping processing within local infrastructure. Additionally, fine-tuned models provide more precise outputs in specialized domains, increasing their utility.

2. Architecture of Phi-3-vision

The Phi-3-vision architecture integrates both visual and textual data processing to generate text outputs. It leverages a pre-trained language model (Phi-3) and a CLIP Vision Transformer (ViT-L/14) to handle images. The process begins with raw images being fed into the Vision Encoder, which converts them into visual patch and position embeddings. These embeddings are then processed by the Visual Embedding Projector (Multi-Layer Perceptron), transforming them into embeddings compatible with the text feature space.

On the textual side, language instructions are tokenized by the LlamaTokenizer, converting raw text into textual tokens. These tokens are then converted by the Text Encoder into text embeddings. Both the visual and text embeddings are finally fed into the pre-trained small language model (Phi-3), which integrates these inputs to generate the final text output. This architecture ensures that both visual and textual data are processed and embedded into a unified feature space, allowing for multimodal learning and output generation.

The architecture of Phi-3-vision (this diagram is the author’s elaboration based on a review of the model’s structure using print(model) and is not an official diagram)
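
The structure referenced in the caption above can be inspected directly by loading the model and printing its module hierarchy. Below is a minimal sketch (the comments describe what the printout is expected to contain, based on the description above):

import transformers

# Load Phi-3-vision and print its module hierarchy to inspect the architecture
# (trust_remote_code is required because the model ships custom modeling code).
model = transformers.AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    torch_dtype="auto",
    trust_remote_code=True,
)

# The printout includes the CLIP vision encoder and the visual embedding
# projector described above, feeding into the Phi-3 language model layers.
print(model)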

3. Application example: Domain-specific image captioning

The task we’ll focus on in this article is the generation of domain-specific image captions. The aim is to show how fine-tuning can improve the model’s ability to produce detailed descriptions in a specific expected format based on visual inputs. This approach is potentially applicable to various real-world tasks, such as describing car damages, manufacturing defects, animal species, and more.

In particular, we will consider publicly available airplane photos [3]. The goal will be to recognize specific features and details of the aircraft, such as airline logos, airplane model types, and other distinguishing characteristics.

Sample images of various airplanes with their respective captions [3]

4. Datasets

For domain-specific image captioning, we will use two datasets of airplane photos from the HuggingFace hub:

  • Multimodal-Fatima/FGVC_Aircraft_train [3] — used for training and validation
  • its test counterpart from the same collection — used as the held-out test set

These datasets are designed for multimodal tasks and contain various fields that provide detailed information about the images.

HuggingFace dataset “Multimodal-Fatima/FGVC_Aircraft_train” used for fine-tuning

To tailor the datasets for the needs of fine-tuning, we will focus on two fields: image and clip_tags_ViT_L_14. The image field contains the actual visual data (photos of airplanes), while clip_tags_ViT_L_14 provides captions.

We will filter the dataset to include only a few airplane types, such as Boeing 707, Boeing 737, Boeing 777 and Boeing 787. This approach ensures that the training, validation, and testing datasets contain the same airplane types. To accomplish this, we will define a function that checks for the presence of these airplane types in the clip_tags_ViT_L_14 field of each record.

from datasets import load_dataset

# Load the airplane datasets from the HuggingFace hub
raw_train_dataset = load_dataset("Multimodal-Fatima/FGVC_Aircraft_train")
raw_test_dataset = load_dataset("Multimodal-Fatima/FGVC_Aircraft_test")  # assumed test counterpart of the training dataset

def filter_by_values(record, filtering_values, filtering_field):
    # Keep a record if any of the target airplane types appears in its caption field
    return any(model in record[filtering_field] for model in filtering_values)

filtering_values = ["boeing 707", "boeing 737", "boeing 777", "boeing 787"]
filtering_field = "clip_tags_ViT_L_14"

filtered_train_dataset = raw_train_dataset.filter(lambda x: filter_by_values(x, filtering_values, filtering_field))
filtered_test_dataset = raw_test_dataset.filter(lambda x: filter_by_values(x, filtering_values, filtering_field))

# Split the filtered training data into training and validation subsets
split_dataset = filtered_train_dataset["train"].train_test_split(test_size=0.2, seed=42)

train_dataset = split_dataset["train"]
val_dataset = split_dataset["test"]
test_dataset = filtered_test_dataset["test"]

The final dataset sizes for fine-tuning are as follows:

Length of train_dataset: 108
Length of val_dataset: 28
Length of test_dataset: 16

These data sizes are intentionally kept small for demonstration purposes. Despite the limited size, we can still achieve meaningful results.

5. Inference using base model

Establishing a baseline with the base model’s performance is crucial for comparison with the fine-tuned model. This baseline highlights the base model’s limitations and sets a reference point for improvements.

Let’s consider a sample photo of an airplane and the prompt “Generate a concise caption for this image, mentioning specific airplane types”. As we can see, the base Phi-3-vision model returns an answer that lacks the granularity needed for a specialized domain. While base models can generate comprehensive descriptions for domain-specific images, they may not always correctly identify specific details, since they are typically trained on generic datasets that are not always precise or representative of a particular field.

The answer of the base Phi-3-vision given the prompt “Generate a concise caption for this image, mentioning specific airplane types”
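
For reference, a baseline answer like the one above can be produced with an inference sketch along these lines (assuming the model and processor are loaded as in Section 6, and that image holds a test photo from the dataset):

# Minimal inference sketch for the base model
user_prompt = "Generate a concise caption for this image, mentioning specific airplane types"
messages = [{"role": "user", "content": f"<|image_1|>\n{user_prompt}"}]

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
# Strip the prompt tokens before decoding so only the generated answer remains
answer_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])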

While prompt engineering (e.g. few-shot learning) could potentially address this issue, using long prompts is not ideal in environments where the prompt needs to be as concise as possible. Fine-tuning the model on domain-specific data might be a more effective solution in such cases.

6. Fine-tuning process

In this section, we will detail the process of fine-tuning the Phi-3-vision model to generate captions for domain-specific photos. The complete code is available in the GitHub repository linked at the end of this article.

The Phi-3-vision model uses flash attention by default, which requires certain types of GPU hardware to run efficiently. For our experiments, we used the NVIDIA A100 GPU.

First, let’s load the processor and model using bfloat16 as the data type. bfloat16 is a 16-bit floating-point format for deep learning with more exponent bits than fp16. According to studies, it increases training throughput and is significantly less prone to weight growth.

import torch
import transformers

dtype = torch.bfloat16
base_model_id = "microsoft/Phi-3-vision-128k-instruct"

model = transformers.AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=dtype,
    device_map="auto",
    trust_remote_code=True,
    # _attn_implementation="eager",  # uncomment if flash attention is not supported by your GPU
)

processor = transformers.AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

Second, we will define the data collator, which is responsible for batching and preprocessing the input data. Specifically, it takes the input prompts and images and returns the following input tensors ready to be fed into the model:

  • input_ids: These are the tokenized input sequences. They represent the input text as a sequence of integer IDs, each corresponding to a specific token in the model’s vocabulary.
  • attention_mask: The attention mask for the input_ids in the language model, indicating which tokens should be attended to and which should be ignored (typically padding tokens).
  • pixel_values: The pre-processed pixel values that encode the image(s).
  • image_sizes: The dimensions of the images, preserving their aspect ratios and native resolution.
  • labels: The target indices used for computing the loss during training, corresponding to the expected output sequences.

The input prompt and answers (expected captions) are processed separately, with the prompt_input_ids representing the tokenized version of the prompt and the answer_input_ids generated by tokenizing the answer string. The prompt is tokenized according to the chat template of Phi-3-vision. Chat templates, which are part of the tokenizer, specify how to convert conversations (represented as lists of messages) into a single tokenizable string in the format expected by the model. Masking is applied to ensure the model is not penalized for predicting parts of the input that are not meant to be predicted, i.e. the prompt itself. By using an ignore_index value equal to -100, the prompt tokens are masked so the loss function can ignore them and focus solely on the answer tokens.

class DataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        # The collator assumes a batch size of 1
        example = examples[0]

        image = example["image"]

        user_prompt = "Generate a concise caption for this image, mentioning specific airplane types"
        answer = ",".join(example["clip_tags_ViT_L_14"])

        messages = [
            {"role": "user", "content": f"<|image_1|>\n{user_prompt}"}
        ]

        prompt = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        answer = f"{answer}<|end|>\n<|endoftext|>"

        # Tokenize the prompt (with the image) and the answer separately
        batch = self.processor(prompt, [image], return_tensors="pt")
        prompt_input_ids = batch["input_ids"]

        answer_input_ids = self.processor.tokenizer(answer, add_special_tokens=False, return_tensors="pt")["input_ids"]

        # Concatenate prompt and answer tokens; mask the prompt tokens in the labels
        # so the loss is computed only on the answer tokens
        concatenated_input_ids = torch.cat([prompt_input_ids, answer_input_ids], dim=1)
        ignore_index = -100
        labels = torch.cat(
            [
                torch.tensor([ignore_index] * len(prompt_input_ids[0])).unsqueeze(0),
                answer_input_ids,
            ],
            dim=1,
        )

        batch["input_ids"] = concatenated_input_ids
        batch["labels"] = labels
        # Extend the attention mask to cover the appended answer tokens
        batch["attention_mask"] = torch.ones_like(concatenated_input_ids)

        # Ensure only floating-point tensors require gradients
        for key, value in batch.items():
            if isinstance(value, torch.Tensor) and torch.is_floating_point(value):
                batch[key] = value.clone().detach().requires_grad_(True)

        return batch

To optimize the fine-tuning process, we use Parameter Efficient Fine-Tuning (PEFT). Full fine-tuning can be resource-intensive, but PEFT updates only a selected subset of model parameters, reducing memory usage and minimizing the risk of catastrophic forgetting. One effective PEFT method is Low-Rank Adaptation (LoRA), which targets and updates crucial layers in the model without modifying the original weights. LoRA integrates a low-rank product with the existing model parameters during computation. More details about how to configure LoRA can be found in my previous article.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,           # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor for the LoRA updates
    target_modules=[
        "self_attn.q_proj.weight",
        "self_attn.k_proj.weight",
        "self_attn.v_proj.weight",
        "self_attn.qkv_proj.weight",
        "self_attn.out_proj.weight",
        "mlp.gate_up_proj",
        "mlp.down_proj",
        "lora_magnitude_vector"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=False
)

peft_model = get_peft_model(model, lora_config)

train_dataset.start_iteration = 0

Next, we will define the metrics for training and evaluation. Text generation models are typically trained by minimizing the cross-entropy loss (i.e., maximizing the log likelihood), while evaluation uses a separate metric that cannot be optimized directly through gradients. In this case, we will use ROUGE as the evaluation metric.

ROUGE is measured between 0 and 1, where a higher score is better. It is well-suited for image captioning evaluation as it measures the overlap of n-grams, word sequences, and word pairs between the generated captions and a set of reference captions. ROUGE metrics, including ROUGE-N (unigrams, bigrams, etc.), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigrams), provide a comprehensive assessment of how well the generated text matches the references.

When computing the ROUGE metric, we replace the -100 labels with padding tokens. The -100 values mark positions in the sequence (the masked prompt tokens) that should be ignored; replacing them with padding tokens allows the label sequences to be decoded, so the evaluation reflects the model’s performance on the meaningful answer tokens.

import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Greedy decoding: take the most likely token at each position
    predicted = logits.argmax(-1)
    # Replace ignored label positions (-100) with the pad token so they can be decoded
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)

    decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)
    decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True)
    rouge_scores = rouge.compute(predictions=decoded_predictions, references=decoded_labels)
    rouge1_score = rouge_scores["rouge1"]
    return {"rouge": rouge1_score}

Before using the metric in a custom trainer, let’s check it on sample data.

predictions = ["A large commercial airplane with a blue tail and a red logo, possibly a Qantas Boeing 747, is taking off from an airport runway."]
references = ["tupolev sb,707,boeing 707,douglas dc-8,boeing 2707"]
rouge_score = rouge.compute(predictions=predictions, references=references)
print(rouge_score)

# {'rouge1': 0.058823529411764705, 'rouge2': 0.0, 'rougeL': 0.058823529411764705, 'rougeLsum': 0.058823529411764705}

As expected, the values of ROUGE metrics are low for a sample pair of predicted and true captions: “A large commercial airplane with a blue tail and a red logo, possibly a Qantas Boeing 747, is taking off from an airport runway.” and “tupolev sb,707,boeing 707,douglas dc-8,boeing 2707”.

We will use a custom trainer to ensure that the data collator is properly integrated into the training and evaluation processes. Below is the definition of the custom trainer:

from torch.utils.data import DataLoader

class CustomTrainer(transformers.Trainer):
    def get_train_dataloader(self):
        # Ensure the DataLoader uses our custom DataCollator for training
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.train_batch_size,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
        )

    def get_eval_dataloader(self, eval_dataset=None):
        # Ensure the DataLoader uses our custom DataCollator for evaluation
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        return DataLoader(
            eval_dataset,
            batch_size=self.args.eval_batch_size,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=self.args.dataloader_num_workers,
        )

    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        loss = outputs.loss if isinstance(outputs, dict) else outputs[0]
        return (loss, outputs) if return_outputs else loss

Finally, we will set up the training arguments and initialize the custom trainer to start the training process. The training arguments define the configuration for training and evaluation, including the number of epochs, batch sizes, gradient accumulation steps, and various other settings to control the training procedure.

batch_size = 1  # The data collator above processes one example at a time

training_args = transformers.TrainingArguments(
    num_train_epochs=4,  # Number of training epochs
    per_device_train_batch_size=batch_size,  # Batch size for training
    per_device_eval_batch_size=batch_size,  # Batch size for evaluation
    gradient_accumulation_steps=6,  # Number of steps to accumulate gradients before updating
    gradient_checkpointing=True,  # Enable gradient checkpointing to save memory
    do_eval=True,  # Perform evaluation during training
    save_total_limit=2,  # Limit the total number of saved checkpoints
    evaluation_strategy="steps",  # Evaluate at each specified number of steps
    save_strategy="steps",  # Save checkpoints at each specified number of steps
    save_steps=10,  # Number of steps between each checkpoint save
    eval_steps=10,  # Number of steps between each evaluation
    max_grad_norm=1,  # Maximum gradient norm for clipping
    warmup_ratio=0.1,  # Warmup ratio for the learning rate schedule
    weight_decay=0.01,  # Regularization technique to prevent overfitting
    # fp16=True,  # Enable mixed precision training with fp16 (use if an Ampere or newer GPU is unavailable)
    bf16=True,  # Enable mixed precision training with bf16
    logging_steps=10,  # Number of steps between each log
    output_dir="outputs",  # Directory to save the model outputs and checkpoints
    optim="adamw_torch",  # Optimizer to use (AdamW with PyTorch)
    learning_rate=1e-4,  # Learning rate for the optimizer
    lr_scheduler_type="constant",  # Learning rate scheduler type
    load_best_model_at_end=True,  # Load the best model found during training at the end
    metric_for_best_model="rouge",  # Metric used to determine the best model
    greater_is_better=True,  # A higher metric score is better
    push_to_hub=False,  # Whether to push the model to the Hugging Face Hub
    run_name="phi-3-vision-finetuning",  # Name of the run for experiment tracking
    report_to="wandb"  # For experiment tracking (login to Weights & Biases needed)
)

# Ensure the model is in training mode
peft_model.train()

# Instantiate the custom data collator defined earlier
data_collator = DataCollator(processor)

trainer = CustomTrainer(
    model=peft_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    # callbacks=[early_stopping]
)

# Disable the KV cache during training (incompatible with gradient checkpointing)
peft_model.config.use_cache = False

trainer.train()
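
After training completes, the LoRA adapter and the processor can be saved for later inference. A minimal sketch (the output path is illustrative):

# Save the LoRA adapter weights and the processor for later use
adapter_path = "outputs/phi-3-vision-lora-adapter"
peft_model.save_pretrained(adapter_path)
processor.save_pretrained(adapter_path)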

7. Fine-tuning result

Let’s review the results of fine-tuning. The following plots display the metrics for the training experiment conducted with the fine-tuned Phi-3-vision model. The first plot shows the ROUGE score progression, while the second and third plots depict the cross-entropy loss over the evaluation and training steps, respectively. These metrics provide insight into the model’s performance and learning progress.

ROUGE score progression
Cross-entropy loss over the evaluation steps
Cross-entropy loss over the training steps

The steady increase in the ROUGE score, coupled with the decrease in cross-entropy loss, demonstrates that the model is learning to generate meaningful captions for the domain-specific photos.

In the figure below, the first five airplane photos from a testing dataset are used to compare the expected captions (green), base model outcomes (red), and fine-tuned model outcomes (blue). The relevant ROUGE-1 metric values are 0.025 and 0.49 for the base model and fine-tuned model, respectively.

Comparison of expected captions (green) and base model outcomes (red) on 5 examples from testing dataset
Comparison of expected captions (green) and fine-tuned model outcomes (blue) on 5 examples from testing dataset
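
For reference, test-set ROUGE-1 scores like those quoted above can be obtained roughly as follows (a sketch; generate_caption is a hypothetical helper wrapping the inference code shown in Section 5):

# Sketch: corpus-level ROUGE-1 on the test set.
# generate_caption(model, processor, image, prompt) is a hypothetical helper
# wrapping the generation code from Section 5.
predictions, references = [], []
for example in test_dataset:
    predictions.append(generate_caption(model, processor, example["image"], user_prompt))
    references.append(",".join(example["clip_tags_ViT_L_14"]))

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"])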

Finally, let’s revisit the same example we used for evaluating the base model at the beginning of this article to see how the answer changed after fine-tuning. Given a photo of an airplane and the prompt “Generate a concise caption for this image, mentioning specific airplane types”, the fine-tuned model returns an answer that captures more specific details about the airplanes in the expected comma-separated format. While the fine-tuned model shows improvement in recognizing multiple airplane types, there is still room for enhancement. The model correctly identifies “tupolev sb” and “boeing” but incorrectly adds “737” and “next generation”, and it repeats some aircraft types. The expected response includes a broader variety of airplanes, including “boeing 707” and “douglas dc-8”, which the model did not capture.

Answer of Phi-3-vision after fine-tuning
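
For completeness, the saved adapter can be attached to the base model for inference along these lines (a minimal sketch; the adapter path matches the illustrative one used when saving):

import torch
import transformers
from peft import PeftModel

base_model_id = "microsoft/Phi-3-vision-128k-instruct"
model = transformers.AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = transformers.AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

# Attach the fine-tuned LoRA adapter saved earlier (path is illustrative)
model = PeftModel.from_pretrained(model, "outputs/phi-3-vision-lora-adapter")
model.eval()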

Further fine-tuning and data augmentation (we only used 108 examples for training) are required to achieve more accurate results.

For deeper insights and the complete code, refer to the GitHub repository linked below:

GitHub Repository: Link to GitHub Repository

Conclusions

As explored in this article, fine-tuning Phi-3-vision is a nuanced process that requires customization. By leveraging HuggingFace transformers, we can fine-tune such models to handle tasks involving both visual and textual data. To achieve production-ready results, it is important to invest time in data preparation and curation, as well as in customizing the fine-tuning process (hyperparameter optimization, custom metrics, etc.).

Please note that the views expressed in this article are my own and do not necessarily reflect those of Microsoft, my current employer.

References

[1] microsoft/Phi-3-vision-128k-instruct · Hugging Face

[2] Vision Language Models Explained (huggingface.co)

[3] https://huggingface.co/datasets/Multimodal-Fatima/FGVC_Aircraft_train

[4] How to fine-tune Phi-3 Vision on a custom dataset | mlnews3 — Weights & Biases (wandb.ai)

[5] Fine-tune Multi-modal LLaVA Vision and Language Models (youtube.com)
