Fine-Tuning the Multimodal Marvel: Qwen-2 VL with LlamaFactory

Richardson Gunde
6 min read · Sep 8, 2024


Hey there, AI enthusiasts! Today we’re diving deep into the exciting world of multimodality with Qwen-2 VL, a cutting-edge open-source vision-language model developed by Alibaba Cloud. It’s a powerful tool that can understand and process text, images, and video, making it incredibly versatile for a wide range of applications.

In this blog post, we’ll be exploring how to fine-tune this model using the user-friendly LlamaFactory framework. Whether you’re a seasoned AI developer or just starting out, this guide will equip you with the knowledge to customize Qwen-2 VL to your specific needs.

Qwen-2 VL: A Multimodal Champion

Before we jump into the fine-tuning process, let’s take a moment to appreciate the capabilities of this amazing model.

  • Open Source: This means it’s freely available for everyone to use and modify, fostering innovation and collaboration within the AI community.
  • Compact Size: Unlike many large language models (LLMs) that require massive computing resources, Qwen-2 VL is available in a compact 2B-parameter variant (the one we fine-tune in this post), making it accessible for individuals and smaller teams with limited resources.
  • Multimodality: The ability to work with text, images, and video allows Qwen-2 VL to tackle a variety of tasks, from image captioning to visual question answering.

LlamaFactory: The Easy Way to Fine-Tune

Fine-tuning is the process of adapting a pre-trained model to a specific task. This is crucial for enhancing the model’s performance and achieving optimal results. LlamaFactory simplifies this process with its user-friendly interface and powerful functionalities.

LlamaFactory is like having a toolbox full of AI magic tools that let you:

  • Fine-tune various AI models: from LLMs to multimodal models like Qwen-2 VL.
  • Use a “low-code” or “no-code” approach: Meaning you don’t have to be a coding expert to get started.
  • Customize models for specific tasks: Train your model for image captioning, text summarization, or any other task you can dream up.

Two Ways to Fine-Tune: LlamaBoard and LlamaFactory CLI

Now, let’s explore the two main ways to fine-tune your Qwen-2 VL model with LlamaFactory:

1. LlamaBoard: The No-Code Approach

LlamaBoard is a visual, user-friendly interface that lets you fine-tune models without writing a single line of code. It’s perfect for beginners and those who prefer a more intuitive approach.

2. LlamaFactory CLI: Command-Line Flexibility

LlamaFactory CLI offers greater flexibility and control over the fine-tuning process through command-line commands. This is ideal for experienced users who want to experiment with various parameters and settings.

Getting Started: Setting Up Your Environment

Let’s set the stage for our fine-tuning adventure:

  1. Google Colab Pro: You’ll need access to Google Colab Pro for the necessary computing resources. Free Colab won’t cut it for this task!
  2. Clone LlamaFactory: Use git clone to download the LlamaFactory repository from GitHub.
  3. Install Dependencies: Ensure you have all the required packages by running pip install -r requirements.txt.
  4. Prepare Your Data: Gather the text and image data that you’ll use to fine-tune your model.

# Clone the LlamaFactory repository and move into it
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory

# Install the base requirements plus the torch and metrics extras
!pip install -r requirements.txt
!pip install -e ".[torch,metrics]"

# Extra packages: 4-bit quantization support, the latest transformers (needed for Qwen2-VL at the time of writing), and the Liger Kernel
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install liger-kernel
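Step 4 of the setup list above is preparing your data, so it helps to know the shape LlamaFactory expects for multimodal examples. The bundled mllm_demo dataset uses a sharegpt-style JSON layout with a messages list and an images list. The snippet below is a minimal sketch of a custom dataset in that style; the file name my_mllm_data.json and the image path are placeholders, and the exact schema can vary between LlamaFactory versions, so double-check the data README in the repository.

import json

# A minimal sketch of one multimodal training example (sharegpt-style, as in mllm_demo)
my_dataset = [
    dict(
        messages=[
            dict(role="user", content="<image>What is shown in this picture?"),
            dict(role="assistant", content="A golden retriever playing fetch in a park."),
        ],
        images=["my_data/dog.jpg"],  # placeholder image path, relative to the data folder
    ),
]

json.dump(my_dataset, open("data/my_mllm_data.json", "w", encoding="utf-8"), indent=2)

To use a custom file like this, you would also register it with a matching entry in data/dataset_info.json and reference its name in the dataset field of your training configuration.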

LlamaBoard


# Launch the LlamaBoard web UI with a public Gradio share link (handy on Colab)
!GRADIO_SHARE=1 llamafactory-cli webui

The Fine-Tuning Process: A Step-by-Step Guide

For this blog post, we’ll focus on the LlamaFactory CLI method, but the steps are similar for LlamaBoard.

1. Create a Configuration File (JSON):

Start by creating a JSON file that outlines the parameters for your fine-tuning process. This includes things like the model you’re using, the data sets, and the desired training settings.

2. Launch the Fine-Tuning Process:

Use the llamafactory-cli train command, passing the path to your JSON configuration file.

3. Monitor the Training:

Observe the output and progress of your fine-tuning process. This will give you insights into how your model is learning.

4. Merge the Fine-Tuned Model:

Once the training is complete, you can merge the LoRA adapters back into the base model using the llamafactory-cli export command with a merge configuration file.

5. Test and Deploy:

Finally, evaluate the performance of your fine-tuned model and deploy it for use in your applications.

LlamaFactory CLI


import json

args = dict(
    stage="sft",                                      # do supervised fine-tuning
    do_train=True,
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",   # use the Qwen2-VL 2B Instruct model
    dataset="mllm_demo,identity",                     # use the multimodal demo and identity datasets
    template="qwen2_vl",                              # use the Qwen2-VL prompt template
    finetuning_type="lora",                           # use LoRA adapters to save memory
    lora_target="all",                                # attach LoRA adapters to all linear layers
    output_dir="qwen2vl_lora",                        # the path to save LoRA adapters
    per_device_train_batch_size=2,                    # the per-device batch size
    gradient_accumulation_steps=4,                    # the gradient accumulation steps
    lr_scheduler_type="cosine",                       # use a cosine learning rate scheduler
    logging_steps=10,                                 # log every 10 steps
    warmup_ratio=0.1,                                 # warm up the learning rate over the first 10% of steps
    save_steps=1000,                                  # save a checkpoint every 1000 steps
    learning_rate=5e-5,                               # the learning rate
    num_train_epochs=3.0,                             # the number of training epochs
    max_samples=500,                                  # use at most 500 examples from each dataset
    max_grad_norm=1.0,                                # clip the gradient norm to 1.0
    loraplus_lr_ratio=16.0,                           # use the LoRA+ algorithm with lambda=16.0
    fp16=True,                                        # use float16 mixed-precision training
    use_liger_kernel=True,                            # use the Liger Kernel for more efficient training
)

json.dump(args, open("train_qwen2vl.json", "w", encoding="utf-8"), indent=2)
!llamafactory-cli train train_qwen2vl.json
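To monitor how the run went (step 3 in the outline above), you can inspect the logs the trainer writes into the output directory. The sketch below assumes a trainer_log.jsonl file with per-step loss entries, which recent LlamaFactory versions produce; the file name and field names are assumptions and may differ in your version.

import json
import os

# Print the per-logging-step loss records written during training (assumed file name and fields)
log_path = "qwen2vl_lora/trainer_log.jsonl"
if os.path.exists(log_path):
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    for r in records:
        if "loss" in r:
            print(f"step {r.get('current_steps')}: loss={r['loss']}, lr={r.get('learning_rate')}")
else:
    print("No trainer_log.jsonl found - check output_dir for the training logs.")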

args = dict(
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",   # use the original, non-quantized Qwen2-VL 2B Instruct model
    adapter_name_or_path="qwen2vl_lora",              # load the saved LoRA adapters
    template="qwen2_vl",                              # same template as in training
    finetuning_type="lora",                           # same finetuning type as in training
    export_dir="qwen2vl_2b_instruct_lora_merged",     # the path to save the merged model
    export_size=2,                                    # the file shard size (in GB) of the merged model
    export_device="cpu",                              # the device used for export: "cpu" or "cuda"
    # export_hub_model_id="your_id/your_model",       # the Hugging Face Hub ID to upload the model to
)

json.dump(args, open("merge_qwen2vl.json", "w", encoding="utf-8"), indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_qwen2vl.json

final_model_path = "/content/LLaMA-Factory/qwen2vl_2b_instruct_lora_merged"
hf_model_repo = "skuma307/Qwen2-VL-2B-Instruct-LoRA-FT"

# Log in to the Hugging Face Hub from the notebook
from huggingface_hub import notebook_login

notebook_login()

from huggingface_hub import HfApi

# Create an instance of HfApi
api = HfApi()

# Create the target repository if it does not exist yet, then upload the merged model
api.create_repo(repo_id=hf_model_repo, exist_ok=True)

api.upload_folder(
    folder_path=final_model_path,            # the folder containing the merged model files
    repo_id=hf_model_repo,                   # the target repository on the Hub
    commit_message="Initial model upload",   # optional commit message
)

print(f"Model pushed to: {hf_model_repo}")

Troubleshooting: Common Errors and How to Fix Them

  • GPU Memory Issues: If you encounter out-of-memory errors, try clearing the cache, freeing up GPU memory, or reducing the batch size (see the snippet after this list).
  • Missing Dependencies: Double-check that you have all the necessary dependencies installed.
  • Data Format Issues: Ensure your data is properly formatted and compatible with LlamaFactory.
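For the first point, clearing the GPU cache from the notebook looks roughly like the snippet below. For the fine-tuning run itself, the more reliable fix is lowering per_device_train_batch_size (and raising gradient_accumulation_steps to keep the effective batch size) in the training configuration.

import gc
import torch

# Drop unreferenced objects and release cached GPU memory
gc.collect()
torch.cuda.empty_cache()

# Check how much GPU memory is still in use
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")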

Conclusion: The Power of Fine-Tuning

Fine-tuning a multimodal model like Qwen-2 VL with LlamaFactory opens up a world of possibilities. It allows you to customize your model’s capabilities for specific tasks, leading to improved accuracy and performance.

Don’t forget to check out the LlamaFactory GitHub repository: you’ll find comprehensive documentation, code examples, and helpful resources.

So go forth and unleash the power of Qwen-2 VL, and let’s build amazing AI applications together!

Call to Action:

  • Reach out for more insightful AI tutorials: gunderichardson@gmail.com
  • LinkedIn: Richardson Gunde
  • Share this blog post with your network and let’s spark a conversation.

