Reproducing Guanaco

Does evaluation loss matter when training chatbots (with QLoRA)?

Geronimo
Aug 5, 2023
The NVIDIA stock chart (Source: Google). This is not how you want your loss curve to look.

tl;dr: QLoRA is a method of finetuning LLMs at home. Also at home, you are confronted with a number of training hyperparameters that affect model quality. Model quality is not easy to judge for chatbots, and the commonly used metric, evaluation loss, starts to increase early in training. Does this mean your chatbot gets worse if you continue training? The answer seems to be ‘No’.

QLoRA is a method of finetuning large language models (LLMs) on consumer hardware. Shortly after it was published, people noticed peculiarities in the training metrics.

Increasing evaluation loss

The input data is split into a training set and an evaluation set. During training, more or less in real time, the trainer reports how well the current checkpoint predicts the evaluation set: the evaluation loss.
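To make this concrete, here is a minimal sketch (not qlora.py, and not the exact setup used in this post) of how the Hugging Face Trainer is told to report the evaluation loss at regular intervals. The dataset name and the use of gpt2 as a stand-in model are my assumptions for illustration.

# Minimal sketch of how evaluation loss gets reported during training.
# This is NOT qlora.py; model and dataset names are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

dataset = load_dataset("timdettmers/openassistant-guanaco")  # OASST1-derived, train/test splits
tokenizer = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",   # evaluate during training ...
    eval_steps=100,                # ... every 100 optimizer steps
    logging_steps=10,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # "eval_loss" shows up in the logs and in wandb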

Ideally, the evaluation loss decreases with training until the point of overfitting: the model stops learning the general problem, starts memorising the training data, and the evaluation loss turns upward. When people tried to reproduce Guanaco, the model published along with QLoRA, an increasing evaluation loss was noted very early in training (one, two, and myself on podcast data).

How long should we train?

The official response by the QLoRA authors on this issue is the following:

You are observing diverging loss and oscillating MMLU for the following reasons.

1. In NLP eval loss is not always directly related to downstream performance (like task accuracy measured by GPT-4 eval).

2. Dev set of MMLU is small explaining swings in MMLU accuracy while finetuning. These values are indicative and you have to compute test set performance to have a more stable result. We use the last checkpoint for this.

3. As shown in our paper, finetuning on the OpenAssistant dataset significantly improves chatbot performance, as measured by Vicuna GPT-4 eval. However, it does not help much on MMLU (performance degrades or stays the same compared to no OA finetuning).

Ultimately, we showed that you should be evaluating on your downstream task when finetuning on a dataset. And you should think very carefully about what target benchmark you are optimizing as this is not always indicative of the desired performance. MMLU dev results were used to tune hyperparameters in our paper but Vicuna eval was much more relevant for chatbot evaluation.

OK, so we cannot observe model quality while training; we should evaluate on the downstream task instead. I did not find anything on performance vs. evaluation loss in the QLoRA paper, and I haven’t seen anyone else do it, so I did it myself. This is the result.

QLoRA finetune of Llama2-7B using the OASST1 dataset = Guanaco, trained for >6 epochs with the standard hyperparameters suggested in the QLoRA paper. Shown are the evaluation loss (bottom) and the chatbot performance of each training checkpoint relative to GPT-3.5 (top), as judged by GPT-4 using LMSYS MT-Bench, a successor of the Vicuna benchmark.

What am I looking at and what does this mean?

  • In the course of finetuning, evaluation loss decreases for two epochs and after that starts increasing with each following epoch.
  • Chatbot performance as judged by GPT-4 increases (despite this increase in eval. loss) for ~4.5 epochs and then flattens out.
  • This means we can ignore the evaluation loss and continue training for some time; the model will still improve.
  • When to stop training is still unclear, but 4–5 epochs seem to be safe and do not hurt the model, at least with the standard hyperparameters.
  • Why on earth did I evaluate after epochs 0.9, 1.8, .. ? The standard hyperparameters suggest setting save_steps to 500, and 500 steps correspond not to 1 epoch but to roughly 0.9 epochs (see the back-of-the-envelope calculation right after this list).
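The back-of-the-envelope calculation, with my own numbers: the OASST1 size is approximate, and I assume qlora.py carves the 1024 evaluation examples out of the training split.

# Why 500 steps correspond to roughly 0.9 epochs with the hyperparameters below.
total_examples = 9_846              # approximate size of the OASST1 data used by qlora.py
eval_examples = 1_024               # --eval_dataset_size, assumed split off the training data
train_examples = total_examples - eval_examples

effective_batch = 1 * 16            # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = train_examples / effective_batch   # ~551 steps

print(500 / steps_per_epoch)        # ~0.9 epochs per checkpoint (--save_steps 500)
print(3500 / steps_per_epoch)       # ~6.3 epochs at the last evaluated checkpoint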

Read on to find out how exactly the data above was generated.

Wait, there is something I don’t understand!

Me too.

  • Why does the loss increase sharply exactly after each epoch? If this were plain overfitting, the evaluation loss should go up smoothly, right? Maybe I am missing something here. It might be something specific to QLoRA.
  • What about batch size, gradient accumulation steps, learning rate, training on the input, and every other hyperparameter: how do these affect performance? I don’t know yet, but I will find out. Maybe. Probably. If OpenAI increases my GPT-4 quota and lets me send them some more of my money.

Update: Jeremy Howard took a close look; it seems this is due to the LLM learning from a single example.

Fine-tuning Llama2-7B using QLoRA

I used the almost unmodified finetune_guanaco_7b.sh from the QLoRA repository. The parameters I changed: max_memory_MB (trained on 2x RTX 4090 with 24 GB VRAM each on RunPod), an increased per_device_eval_batch_size (speed), removed do_mmlu_eval (speed), and custom model paths. I intentionally kept per_device_train_batch_size unchanged since changing it might affect the model’s quality, even though training was painfully slow.

python qlora.py \
--model_name_or_path "/workspace/models/llama2-7b-hf" \
--use_auth \
--output_dir /workspace/loras/llama2-guanaco-7b \
--logging_steps 10 \
--save_strategy steps \
--data_seed 42 \
--save_steps 500 \
--save_total_limit 100 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--per_device_eval_batch_size 16 \
--max_new_tokens 32 \
--dataloader_num_workers 1 \
--group_by_length \
--logging_strategy steps \
--remove_unused_columns False \
--do_train \
--do_eval \
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--warmup_ratio 0.03 \
--lr_scheduler_type constant \
--gradient_checkpointing \
--dataset oasst1 \
--source_max_len 16 \
--target_max_len 512 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--max_steps 6000 \
--eval_steps 100 \
--learning_rate 0.0002 \
--adam_beta2 0.999 \
--max_grad_norm 0.3 \
--lora_dropout 0.1 \
--weight_decay 0.0 \
--seed 0 \
--report_to wandb \
--max_memory_MB 23000

Training metrics (Weights & Biases)

Weird oscillations in the training metrics caused by --group_by_length, see this.
For the nerds: beautiful RunPod cooling. That’s what my GPUs at home look like doing nothing.
4090 hungry. 4090 want bigger batch.

Model evaluation using LMSYS MT-Bench

Chatbot evaluation is hard. Or rather, it was hard: a lot has happened since I wrote this post on automated chatbot evaluation three months ago.

Human evaluation

The gold standard of chatbot performance is how much a human likes talking to it (aka the downstream task). LMSYS came up with the Chatbot Arena, a ChatGPT-like website that shows the answers of two models side by side, anonymised; the identity of the models is not revealed. The user selects which answer they prefer. Many models in the arena and thousands of pairwise comparisons lead to a single human evaluation score for each model. See this for details.
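The arena leaderboard turns these votes into Elo-style ratings. As a rough illustration of how pairwise outcomes collapse into a single score per model (my sketch, not LMSYS code, with made-up battles):

# Minimal Elo-style aggregation of pairwise votes into one score per model.
# Illustration only, not the LMSYS implementation.
from collections import defaultdict

def elo_ratings(battles, k=32, base=1500):
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:   # winner: "a", "b" or "tie"
        expected_a = 1 / (1 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

print(elo_ratings([("guanaco-7b", "alpaca-13b", "a"),
                   ("gpt-4", "guanaco-7b", "a"),
                   ("guanaco-7b", "gpt-4", "tie")]))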

GPT-4 based evaluation: MT-Bench

The LMSYS researchers in Berkeley and others adapted their original Vicuna benchmark, in which each model was asked 80 questions and the answers were passed on to GPT-4, which assigned each answer a rating from 1 to 10.

The successor of the Vicuna benchmark is MT-Bench (code, paper), which evaluates multi-turn conversations.

MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models’ responses.

Having both human evaluation data and a new automated benchmark, they found that MT-Bench performs very well, apparently "achieving over 80% agreement, the same level of agreement between humans".

It was a pleasant surprise to see how easy their code was to use.

Generating model answers

python3 gen_model_answer.py --model-path models/llama-2-7b-guanaco-checkpoint-500 --model-id llama-2-7b-guanaco-checkpoint-500

Evaluation by judge GPT-4

export OPENAI_API_KEY=not_a_real_OpenAI_key

models="alpaca-13b \
gpt-4 \
llama-2-7b-guanaco-checkpoint-500 \
llama-2-7b-guanaco-checkpoint-1000 \
llama-2-7b-guanaco-checkpoint-1500 \
llama-2-7b-guanaco-checkpoint-2000 \
llama-2-7b-guanaco-checkpoint-2500 \
llama-2-7b-guanaco-checkpoint-3000 \
llama-2-7b-guanaco-checkpoint-3500"

python3 gen_judgment.py --mode pairwise-baseline --model-list ${models} --parallel 4

The mode pairwise-baseline compares each model’s answers to a baseline model, by default GPT-3.5. As controls, I added answers by alpaca-13b (negative control) and GPT-4 (positive control).

Raw MT-Bench output:

Mode: pairwise-baseline
Input file: data/mt_bench/model_judgment/gpt-4_pair.jsonl
win loss tie win_rate loss_rate win_rate_adjusted
model
gpt-4 102 17 41 0.63750 0.10625 0.765625
llama-2-7b-guanaco-checkpoint-2500 24 98 38 0.15000 0.61250 0.268750
llama-2-7b-guanaco-checkpoint-3500 22 99 39 0.13750 0.61875 0.259375
llama-2-7b-guanaco-checkpoint-3000 21 100 39 0.13125 0.62500 0.253125
llama-2-7b-guanaco-checkpoint-2000 20 107 33 0.12500 0.66875 0.228125
llama-2-7b-guanaco-checkpoint-1500 12 109 39 0.07500 0.68125 0.196875
llama-2-7b-guanaco-checkpoint-1000 10 119 31 0.06250 0.74375 0.159375
llama-2-7b-guanaco-checkpoint-500 6 129 25 0.03750 0.80625 0.115625
alpaca-13b 6 134 20 0.03750 0.83750 0.100000

.. and as a bar plot:
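The plot is easy to reproduce from the raw counts above. A minimal matplotlib sketch (my code, not part of MT-Bench); note that win_rate_adjusted simply counts a tie as half a win, i.e. (win + tie/2) / 160 judgments (80 questions x 2 turns).

import matplotlib.pyplot as plt

# (model, wins, losses, ties) copied from the raw MT-Bench output above
results = [
    ("gpt-4",           102,  17, 41),
    ("checkpoint-2500",  24,  98, 38),
    ("checkpoint-3500",  22,  99, 39),
    ("checkpoint-3000",  21, 100, 39),
    ("checkpoint-2000",  20, 107, 33),
    ("checkpoint-1500",  12, 109, 39),
    ("checkpoint-1000",  10, 119, 31),
    ("checkpoint-500",    6, 129, 25),
    ("alpaca-13b",        6, 134, 20),
]

names = [name for name, *_ in results]
adjusted = [(w + t / 2) / (w + l + t) for _, w, l, t in results]  # win_rate_adjusted

plt.figure(figsize=(8, 4))
plt.bar(names, adjusted)
plt.ylabel("adjusted win rate vs. GPT-3.5")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("mtbench_win_rates.png")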

If you like this story, have additional ideas or questions, or wonder why in hell anyone would spend time doing this, please leave a comment here or on Twitter.
