Qwen2 - Finetuning Qwen2

DhanushKumar
7 min read · Jun 8, 2024


The newest series of language models from Alibaba Cloud quickly rose to the top of the rankings for open-source LLMs after its launch on Friday, due to its improved performance and enhanced safety features.

The Qwen2 series includes various base language models and instruction-tuned language models, ranging in size from 0.5 to 72 billion parameters, along with a Mixture-of-Experts (MoE) model.

These improvements have placed it at the top of the Open LLM Leaderboard on Hugging Face, the collaborative AI platform, where the models are available for both commercial and research use.

Alibaba, the Chinese e-commerce giant, plays a significant role in China’s AI industry. Today, it unveiled its latest AI model, Qwen2, which is considered one of the best open-source options available at the moment.

Developed by Alibaba Cloud, Qwen2 represents the next phase in the firm’s Tongyi Qianwen (Qwen) model series, which includes Tongyi Qianwen LLM (Qwen), the vision AI model Qwen-VL, and Qwen-Audio.

The Qwen model series comes pre-trained on multilingual data across diverse industries and domains, with Qwen-72B being the most robust model in the lineup. This model was trained on a remarkable 3 trillion tokens of data. In comparison, Meta’s most potent Llama-2 variant was trained on 2 trillion tokens, while Llama-3 was trained on over 15 trillion tokens.

Model Details

Qwen2 is a series of language models that includes decoder-only models of various sizes. Alibaba has released base language models and aligned chat models for each size. These models are built on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped-query attention, and a mixture of sliding window attention and full attention, among other features. Furthermore, Alibaba has developed an improved tokenizer that adapts to multiple natural languages and code.

  • Model sizes: Qwen2–0.5B, Qwen2–1.5B, Qwen2–7B, Qwen2–57B-A14B, and Qwen2–72B;
  • Trained on data in 27 languages beyond English and Chinese;
  • Outstanding results in various benchmarks;
  • Enhanced coding and math capabilities;
  • Context length support extended to 128K tokens with Qwen2–7B-Instruct and Qwen2–72B-Instruct.

Model Information

The Qwen2 series consists of base and instruction-tuned models in five sizes: Qwen2–0.5B, Qwen2–1.5B, Qwen2–7B, Qwen2–57B-A14B, and Qwen2–72B. The essential details for each size (hidden size, number of layers, attention setup, and context length) can be read directly from each checkpoint’s configuration.
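For example, the short sketch below pulls only the configuration (no weights) of the Qwen2-7B checkpoint from the Hugging Face Hub and prints a few of these fields; any other size can be inspected the same way.

from transformers import AutoConfig

# Fetch only the config (no weights) for one Qwen2 checkpoint
config = AutoConfig.from_pretrained("Qwen/Qwen2-7B")
print(config.hidden_size, config.num_hidden_layers)
# Grouped-query attention: fewer key/value heads than query heads
print(config.num_attention_heads, config.num_key_value_heads)
print(config.max_position_embeddings)  # native context length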

Performance

Comparative evaluations show significant performance improvements in large-scale models (70B+ parameters) compared to Qwen1.5. In this study, the focus is on assessing the performance of the large-scale model Qwen2–72B. Qwen2–72B and cutting-edge open models are compared in various aspects such as natural language understanding, knowledge acquisition, coding ability, mathematical aptitude, and multilingual skills. Thanks to carefully curated datasets and refined training techniques, Qwen2–72B demonstrates superior performance when pitted against top models like Llama-3–70B. Notably, it outperforms its predecessor, Qwen1.5–110B, despite having fewer parameters.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Load the Qwen2 instruction-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]

# Build the chat-formatted prompt and tokenize it
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Qwen2Config

( vocab_size = 151936, hidden_size = 4096, intermediate_size = 22016, num_hidden_layers = 32, num_attention_heads = 32, num_key_value_heads = 32, hidden_act = 'silu', max_position_embeddings = 32768, initializer_range = 0.02, rms_norm_eps = 1e-06, use_cache = True, tie_word_embeddings = False, rope_theta = 10000.0, use_sliding_window = False, sliding_window = 4096, max_window_layers = 28, attention_dropout = 0.0, **kwargs )

Parameters

  • vocab_size (int, optional, defaults to 151936) — Vocabulary size of the Qwen2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Qwen2Model
  • hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
  • intermediate_size (int, optional, defaults to 22016) — Dimension of the MLP representations.
  • num_hidden_layers (int, optional, defaults to 32) — Number of hidden layers in the Transformer decoder.
  • num_attention_heads (int, optional, defaults to 32) — Number of attention heads for each attention layer in the Transformer decoder.
  • num_key_value_heads (int, optional, defaults to 32) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, it will default to 32.
  • hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
  • max_position_embeddings (int, optional, defaults to 32768) — The maximum sequence length that this model might ever be used with.
  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • rms_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the rms normalization layers.
  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
  • tie_word_embeddings (bool, optional, defaults to False) — Whether the model’s input and output word embeddings should be tied.
  • rope_theta (float, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
  • use_sliding_window (bool, optional, defaults to False) — Whether to use sliding window attention.
  • sliding_window (int, optional, defaults to 4096) — Sliding window attention (SWA) window size. If not specified, will default to 4096.
  • max_window_layers (int, optional, defaults to 28) — The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
  • attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
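To make these parameters concrete, the short sketch below builds a deliberately small Qwen2-style configuration and a randomly initialised model from it. The sizes are illustrative only and do not correspond to any released checkpoint.

from transformers import Qwen2Config, Qwen2ForCausalLM

# Illustrative sizes, not a released checkpoint
config = Qwen2Config(
    vocab_size = 151936,
    hidden_size = 1024,
    intermediate_size = 2816,
    num_hidden_layers = 8,
    num_attention_heads = 16,
    num_key_value_heads = 4,  # GQA: 4 key/value heads shared across 16 query heads
)

# Randomly initialised model, useful only for inspecting the architecture
model = Qwen2ForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")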

Ollama — Qwen2

ollama serve
# You need to keep this service running whenever you are using ollama

To pull a model checkpoint and run the model, use the ollama run command. You can specify a model size by adding a suffix to qwen2, such as :0.5b, :1.5b, :7b, or :72b:

ollama run qwen2:7b
# To exit, type "/bye" and press ENTER

You can also access the ollama service via its OpenAI-compatible API. Please note that you need to (1) keep ollama serve running while using the API, and (2) execute ollama run qwen2:7b before utilizing this API to ensure that the model checkpoint is prepared.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)
chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='qwen2:7b',
)
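The returned object follows the OpenAI chat-completions schema, so the model's reply can be read from the first choice:

# Print the assistant's reply from the completion object
print(chat_completion.choices[0].message.content)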

Finetuning Qwen2 with the Alpaca Dataset

The code below fine-tunes the Qwen2–0.5B language model using the Alpaca dataset. Language models are effective for natural language processing tasks like text generation, summarization, and question answering.

Prerequisites: The required packages and libraries are Unsloth, Xformers (Flash Attention), trl, peft, accelerate, bitsandbytes, transformers, and datasets. This code is designed to run on Google Colab or a compatible environment.

Step 1: Install Dependencies

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

Step 2: Load the Model and Tokenizer

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # maximum context length used during finetuning
dtype = None           # None = auto-detect (float16 or bfloat16)
load_in_4bit = True    # load weights in 4-bit to reduce GPU memory usage

# Pre-quantized 4-bit checkpoints available for faster downloads
fourbit_models = [
    "unsloth/Qwen2-0.5b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2-0.5B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

The FastLanguageModel.from_pretrained() function loads the Qwen2–0.5B model and its tokenizer. max_seq_length sets the context length used for training, dtype = None lets Unsloth auto-detect float16 or bfloat16, and load_in_4bit = True loads the weights in 4-bit precision to reduce GPU memory usage.

Step 3: Apply PEFT (Parameter-Efficient Fine-Tuning)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-efficient checkpointing
    random_state = 3407,
    use_rslora = False,   # rank-stabilized LoRA
    loftq_config = None,  # LoftQ initialization
)

The FastLanguageModel.get_peft_model() function is used to apply PEFT to the loaded model. Various parameters, such as r, target_modules, lora_alpha, lora_dropout, bias, use_gradient_checkpointing, random_state, use_rslora, and loftq_config, are used to configure the PEFT process.
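As a quick sanity check, you can print how many parameters the LoRA adapters actually train; this assumes the returned model exposes PEFT's usual print_trainable_parameters() helper, which Unsloth's PEFT wrapper normally does.

# Only the LoRA adapter weights should be trainable; the 4-bit base model stays frozen
model.print_trainable_parameters()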

Step 4: Define the Alpaca Prompt

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

The alpaca_prompt variable is a template for formatting the input data. The EOS_TOKEN and formatting_prompts_func() function are used to prepare the data for training.
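For illustration, here is the formatting function applied to a single made-up record (the record below is hypothetical and not taken from the dataset):

# Hypothetical single-record batch, just to show what the template produces
sample = {
    "instruction": ["Translate the sentence to French."],
    "input": ["Good morning."],
    "output": ["Bonjour."],
}
print(formatting_prompts_func(sample)["text"][0])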

Step 5: Load and Preprocess the Dataset

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

The Alpaca dataset is loaded using the load_dataset() function from the Hugging Face datasets library. The formatting_prompts_func() function is applied to the dataset using the map() function.
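A quick way to verify the mapping is to print one entry from the new text column:

# Inspect the first formatted training example
print(dataset[0]["text"])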

Step 6: Set up the Training Configuration

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        # Use num_train_epochs = 1 and warmup_ratio for full training runs!
        warmup_steps = 20,
        max_steps = 120,
        learning_rate = 5e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

The SFTTrainer class from the trl library is used for training the language model. The TrainingArguments configure parameters such as batch size, gradient accumulation steps, warmup steps, max steps, learning rate, mixed precision settings, logging steps, optimizer, weight decay, learning rate scheduler, and seed. With per_device_train_batch_size = 2 and gradient_accumulation_steps = 8, the effective batch size per GPU is 16.

Step 7: Train the Model

trainer_stats = trainer.train()

The trainer.train() function is called to start the training process. The trainer_stats variable stores the training statistics; for example, the final loss and runtime are available in trainer_stats.metrics.

The fine-tuned model can be used for various natural language processing tasks, such as text generation, summarization, and question answering.
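To close the loop, here is a minimal sketch of saving the LoRA adapters and generating with the finetuned model. The output directory name and generation settings are illustrative, and the inference helper assumed here is Unsloth's FastLanguageModel.for_inference().

# Save the LoRA adapters and tokenizer (adapter-only, not merged weights)
model.save_pretrained("qwen2-0.5b-alpaca-lora")
tokenizer.save_pretrained("qwen2-0.5b-alpaca-lora")

# Enable Unsloth's faster inference mode and generate from the finetuned model
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [alpaca_prompt.format("Give me a short introduction to large language models.", "", "")],
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])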

