Fine-Tuning the Multimodal Marvel: Qwen-2 VL with LlamaFactory

Richardson Gunde
6 min read · Sep 8, 2024


Hey there, AI enthusiasts! Today we’re diving deep into the exciting world of multimodality with Qwen-2 VL, a cutting-edge open-source vision-language model developed by Alibaba Cloud. It’s a powerful tool that can understand and process text, images, and video, making it incredibly versatile for a wide range of applications.

In this blog post, we’ll be exploring how to fine-tune this model using the user-friendly LlamaFactory framework. Whether you’re a seasoned AI developer or just starting out, this guide will equip you with the knowledge to customize Qwen-2 VL to your specific needs.

Qwen-2 VL: A Multimodal Champion

Before we jump into the fine-tuning process, let’s take a moment to appreciate the capabilities of this amazing model.

  • Open Source: This means it’s freely available for everyone to use and modify, fostering innovation and collaboration within the AI community.
  • Compact Size: Unlike many large language models (LLMs) that require massive computing resources, Qwen-2 VL is available in a compact 2B-parameter variant (the one we fine-tune in this post), making it accessible for individuals and smaller teams with limited resources.
  • Multimodality: The ability to work with text, images, and video allows Qwen-2 VL to tackle a variety of tasks, from image captioning to visual question answering.

LlamaFactory: The Easy Way to Fine-Tune

Fine-tuning is the process of adapting a pre-trained model to a specific task. This is crucial for enhancing the model’s performance and achieving optimal results. LlamaFactory simplifies this process with its user-friendly interface and powerful functionalities.

LlamaFactory is like having a toolbox full of AI magic tools that let you:

  • Fine-tune various AI models: from LLMs to multimodal models like Qwen-2 VL.
  • Use a “low-code” or “no-code” approach: Meaning you don’t have to be a coding expert to get started.
  • Customize models for specific tasks: Train your model for image captioning, text summarization, or any other task you can dream up.

Two Ways to Fine-Tune: LlamaBoard and LlamaFactory CLI

Now, let’s explore the two main ways to fine-tune your Qwen-2 VL model with LlamaFactory:

1. LlamaBoard: The No-Code Approach

LlamaBoard is a visual, user-friendly interface that lets you fine-tune models without writing a single line of code. It’s perfect for beginners and those who prefer a more intuitive approach.

2. LlamaFactory CLI: Command-Line Flexibility

LlamaFactory CLI offers greater flexibility and control over the fine-tuning process through command-line commands. This is ideal for experienced users who want to experiment with various parameters and settings.

Getting Started: Setting Up Your Environment

Let’s set the stage for our fine-tuning adventure:

  1. Google Colab Pro: You’ll need access to Google Colab Pro for the necessary computing resources. Free Colab won’t cut it for this task!
  2. Clone LlamaFactory: Use git clone to download the LlamaFactory repository from GitHub.
  3. Install Dependencies: Ensure you have all the required packages by running pip install -r requirements.txt.
  4. Prepare Your Data: Gather the text and image data that you’ll use to fine-tune your model.

# Clone the LlamaFactory repository and move into it
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory

# Install the base requirements plus the torch and metrics extras
!pip install -r requirements.txt
!pip install -e ".[torch,metrics]"

# Extra packages: 4-bit quantization support, the latest transformers (needed for Qwen2-VL at the time of writing), and the Liger Kernel
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install liger-kernel
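Step 4 of the setup list above is preparing your data, so it helps to know the shape LlamaFactory expects for multimodal examples. The bundled mllm_demo dataset uses a sharegpt-style JSON layout with a messages list and an images list. The snippet below is a minimal sketch of a custom dataset in that style; the file name my_mllm_data.json and the image path are placeholders, and the exact schema can vary between LlamaFactory versions, so double-check the data README in the repository.

import json

# A minimal sketch of one multimodal training example (sharegpt-style, as in mllm_demo)
my_dataset = [
    dict(
        messages=[
            dict(role="user", content="<image>What is shown in this picture?"),
            dict(role="assistant", content="A golden retriever playing fetch in a park."),
        ],
        images=["my_data/dog.jpg"],  # placeholder image path, relative to the data folder
    ),
]

json.dump(my_dataset, open("data/my_mllm_data.json", "w", encoding="utf-8"), indent=2)

To use a custom file like this, you would also register it with a matching entry in data/dataset_info.json and reference its name in the dataset field of your training configuration.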

LlamaBoard


# Launch the LlamaBoard web UI with a public Gradio share link (handy on Colab)
!GRADIO_SHARE=1 llamafactory-cli webui

The Fine-Tuning Process: A Step-by-Step Guide

For this blog post, we’ll focus on the LlamaFactory CLI method, but the steps are similar for LlamaBoard.

1. Create a Configuration File (JSON):

Start by creating a JSON file that outlines the parameters for your fine-tuning process. This includes things like the model you’re using, the data sets, and the desired training settings.

2. Launch the Fine-Tuning Process:

Use the llamafactory-cli train command, passing the path to your JSON configuration file.

3. Monitor the Training:

Observe the output and progress of your fine-tuning process. This will give you insights into how your model is learning.

4. Merge the Fine-Tuned Model:

Once the training is complete, you can merge the LoRA adapters back into the base model using the llamafactory-cli export command with a merge configuration file.

5. Test and Deploy:

Finally, evaluate the performance of your fine-tuned model and deploy it for use in your applications.

LlamaFactory CLI


import json

args = dict(
    stage="sft",                                      # do supervised fine-tuning
    do_train=True,
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",   # use the Qwen2-VL 2B Instruct model
    dataset="mllm_demo,identity",                     # use the multimodal demo and identity datasets
    template="qwen2_vl",                              # use the Qwen2-VL prompt template
    finetuning_type="lora",                           # use LoRA adapters to save memory
    lora_target="all",                                # attach LoRA adapters to all linear layers
    output_dir="qwen2vl_lora",                        # the path to save LoRA adapters
    per_device_train_batch_size=2,                    # the per-device batch size
    gradient_accumulation_steps=4,                    # the gradient accumulation steps
    lr_scheduler_type="cosine",                       # use a cosine learning rate scheduler
    logging_steps=10,                                 # log every 10 steps
    warmup_ratio=0.1,                                 # warm up the learning rate over the first 10% of steps
    save_steps=1000,                                  # save a checkpoint every 1000 steps
    learning_rate=5e-5,                               # the learning rate
    num_train_epochs=3.0,                             # the number of training epochs
    max_samples=500,                                  # use at most 500 examples from each dataset
    max_grad_norm=1.0,                                # clip the gradient norm to 1.0
    loraplus_lr_ratio=16.0,                           # use the LoRA+ algorithm with lambda=16.0
    fp16=True,                                        # use float16 mixed-precision training
    use_liger_kernel=True,                            # use the Liger Kernel for more efficient training
)

json.dump(args, open("train_qwen2vl.json", "w", encoding="utf-8"), indent=2)
!llamafactory-cli train train_qwen2vl.json
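To monitor how the run went (step 3 in the outline above), you can inspect the logs the trainer writes into the output directory. The sketch below assumes a trainer_log.jsonl file with per-step loss entries, which recent LlamaFactory versions produce; the file name and field names are assumptions and may differ in your version.

import json
import os

# Print the per-logging-step loss records written during training (assumed file name and fields)
log_path = "qwen2vl_lora/trainer_log.jsonl"
if os.path.exists(log_path):
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    for r in records:
        if "loss" in r:
            print(f"step {r.get('current_steps')}: loss={r['loss']}, lr={r.get('learning_rate')}")
else:
    print("No trainer_log.jsonl found - check output_dir for the training logs.")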

args = dict(
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",   # use the original, non-quantized Qwen2-VL 2B Instruct model
    adapter_name_or_path="qwen2vl_lora",              # load the saved LoRA adapters
    template="qwen2_vl",                              # same template as in training
    finetuning_type="lora",                           # same finetuning type as in training
    export_dir="qwen2vl_2b_instruct_lora_merged",     # the path to save the merged model
    export_size=2,                                    # the file shard size (in GB) of the merged model
    export_device="cpu",                              # the device used for export: "cpu" or "cuda"
    # export_hub_model_id="your_id/your_model",       # the Hugging Face Hub ID to upload the model to
)

json.dump(args, open("merge_qwen2vl.json", "w", encoding="utf-8"), indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_qwen2vl.json

final_model_path = "/content/LLaMA-Factory/qwen2vl_2b_instruct_lora_merged"
hf_model_repo = "skuma307/Qwen2-VL-2B-Instruct-LoRA-FT"

# Log in to the Hugging Face Hub from the notebook
from huggingface_hub import notebook_login

notebook_login()

from huggingface_hub import HfApi

# Create an instance of HfApi
api = HfApi()

# Create the target repository if it does not exist yet, then upload the merged model
api.create_repo(repo_id=hf_model_repo, exist_ok=True)

api.upload_folder(
    folder_path=final_model_path,            # the folder containing the merged model files
    repo_id=hf_model_repo,                   # the target repository on the Hub
    commit_message="Initial model upload",   # optional commit message
)

print(f"Model pushed to: {hf_model_repo}")

Troubleshooting: Common Errors and How to Fix Them

  • GPU Memory Issues: If you encounter out-of-memory errors, try clearing the cache, freeing up GPU memory, or reducing the batch size (see the snippet after this list).
  • Missing Dependencies: Double-check that you have all the necessary dependencies installed.
  • Data Format Issues: Ensure your data is properly formatted and compatible with LlamaFactory.
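For the first point, clearing the GPU cache from the notebook looks roughly like the snippet below. For the fine-tuning run itself, the more reliable fix is lowering per_device_train_batch_size (and raising gradient_accumulation_steps to keep the effective batch size) in the training configuration.

import gc
import torch

# Drop unreferenced objects and release cached GPU memory
gc.collect()
torch.cuda.empty_cache()

# Check how much GPU memory is still in use
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")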

Conclusion: The Power of Fine-Tuning

Fine-tuning a multimodal model like Qwen-2 VL with LlamaFactory opens up a world of possibilities. It allows you to customize your model’s capabilities for specific tasks, leading to improved accuracy and performance.

Don’t forget to check out the LlamaFactory GitHub repository: you’ll find comprehensive documentation, code examples, and helpful resources.

So go forth and unleash the power of Qwen-2 VL, and let’s build amazing AI applications together!

Call to Action:

  • Reach out for more insightful AI tutorials: gunderichardson@gmail.com
  • LinkedIn: Richardson Gunde
  • Share this blog post with your network and let’s spark a conversation.

