Experimenting with fine-tuning: Llama 2, Mistral and Zephyr
By: Benjamin Ye & Rohit Saha
The world of LLMs evolved quickly in 2023. In July, Meta made big news in the LLM world by releasing its open-access Llama 2 model. Three months later, Mistral AI released its 7B model, claiming it to be leaner and smarter than models with greater parameter counts and reporting wins over Llama2–13B on several benchmarks. In November, Hugging Face's H4 team produced Zephyr 7B, a fine-tuned version of Mistral with a reported MT-Bench result that surpasses even that of Llama2–70B, which up to that point had been considered by some to be one of the largest and best-performing open-access models to date.
Each of these models improved on the last, with refinements such as tweaks to the self-attention mechanism and knowledge distillation from larger LLMs. Moreover, these smaller, open-access models allow researchers and hobbyists alike to run them on their own infrastructure, even locally on a consumer-grade GPU. They can also be fine-tuned for specialized downstream tasks, with many anecdotal reports of results comparable to closed-source OpenAI models at a fraction of the cost and inference time.
With so many different models available, it is hard to choose which one to start with. To help with the decision, we systematically benchmarked these models across different tasks. In this blog, we'll cover:
- Model highlights: What sets these models apart from one another?
- Out-of-the-box benchmarks: How well do these models do on their own?
- Fine-tuning benchmarks: How well do these models adapt to downstream tasks?
- Access our experiments in our GitHub repository: How do I fine-tune with my own data?
Model Highlights
Llama 2
Llama 2, released by Meta, is an updated version of its predecessor, Llama 1, with improved performance. The model uses a standard transformer architecture with these key features:
Model varieties: The base model is offered in a range of parameter sizes (7B, 13B and 70B), along with a chat version aligned via supervised fine-tuning and reinforcement learning from human feedback (RLHF).
Pre-training corpus: The model is pre-trained on a corpus with 2 trillion tokens, 40% more than Llama 1.
Context length: The model supports a context length of 4,096 tokens, double that of Llama 1.
We have written an overview and evaluation of Llama 2 in our previous blog post. To learn more, you can access it here.
Mistral-7B
Mistral-7B was released by Mistral AI. The model's key differentiating factor is its use of Sliding Window Attention (SWA), proposed by Beltagy et al. in their Longformer paper. According to the paper, SWA makes the model more efficient in both memory and compute: in contrast to the standard global attention mechanism, where the number of operations grows quadratically with input length, SWA's operation count grows only linearly. Another benefit is that, because information propagates forward across layers one window at a time, the model can still draw on context from beyond the window, handling longer sequences more effectively.
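To make the mechanism concrete, below is a minimal sketch of a sliding-window causal attention mask in PyTorch. The window size is illustrative, and the logic is simplified relative to Mistral's actual implementation:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True marks key positions a query may attend to: itself and the
    # previous `window - 1` tokens, never anything in the future.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    return (j <= i) & (j > i - window)

# Full causal attention lights up the whole lower triangle; with a window
# of 4, each row has at most 4 allowed positions, so compute grows
# linearly with sequence length rather than quadratically.
print(sliding_window_causal_mask(seq_len=8, window=4).int())
```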
As a result, the model showcased impressive benchmark results in its paper, beating Llama-2–7B on all benchmarks and the larger Llama-2–13B on the majority of them.
Zephyr-7B
Zephyr-7B was released by Hugging Face's H4 team. The model is a fine-tuned version of Mistral-7B, and its key contribution is demonstrating the effectiveness of using larger teacher models to generate synthetic datasets on which the smaller model is then fine-tuned.
For its fine-tuning, Zephyr was first trained on the UltraChat dataset, which was generated with ChatGPT, and then aligned on the GPT-4-annotated UltraFeedback dataset via Direct Preference Optimization (DPO).
This combination let Zephyr distill knowledge from its teacher models, indirectly learning the human preferences those larger models had themselves been aligned on via RLHF.
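For intuition, here is a minimal sketch of the DPO objective in PyTorch. The tensor names are illustrative; in practice, libraries such as trl provide a ready-made DPOTrainer:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-probability ratios between the
    # policy being trained and the frozen reference (SFT) model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred completion's reward above the dispreferred one's,
    # with no separately trained reward model required.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```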
In its paper, Zephyr's team reported SOTA MT-Bench results in the 7B model space, beating even most open-access models ten times its size. The model also achieved results comparable to closed-source commercial models like GPT-3.5, demonstrating the effectiveness of its fine-tuning technique.
Out-of-the-Box Benchmarks
To analyze how these models compare, we evaluated their performance on a summarization task. We used the Samsum dataset, a corpus of 16,369 dialogues paired with human-written gold-standard summaries. To evaluate model performance, we generated summaries for Samsum's test set and scored the results with standard ROUGE metrics. In particular, we used ROUGE-1 and ROUGE-2, which measure unigram and bigram overlap, respectively, between the generated output and the gold label.
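As a refresher, ROUGE-1 and ROUGE-2 can be computed with the rouge_score package; the summary pair below is a made-up example:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
gold = "Amanda baked cookies and will bring Jerry some tomorrow."
generated = "Amanda baked cookies and will bring some to Jerry tomorrow."
# score(target, prediction) returns precision/recall/F1 per metric.
scores = scorer.score(gold, generated)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)
```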
Environment Setup
Instance Type: AWS EC2 g5.2xlarge
GPU: 1x A10G GPU (24GB VRAM)
Model Weight Quantization: 4-bit (NF4)
Calculation Precision: 16-bit (bfloat16)
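A minimal sketch of this setup using transformers and bitsandbytes is shown below; the Mistral model ID is just an example and can be swapped for any of the other models we tested:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # 16-bit compute precision
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```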
Summarization Experiment Setup
Dataset: Samsum
Prompt Template:
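The template below is an illustrative stand-in; our exact prompts live alongside the scripts in the repository linked at the end of this post.

```
Summarize the following dialogue:

{dialogue}

Summary:
```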
Summarization Experiment Results (Zero-Shot & Few-Shot)
The results matched our expectations: the newest model, Zephyr, performed better than both Llama 2 and Mistral in zero-shot generation. However, Zephyr's uplift from few-shot learning was not as meaningful as Mistral's. Remarkably, all of the 7B open-access models performed in line with OpenAI's closed-source GPT-3.5-Turbo.
We were surprised to see the 13B Llama 2 model lagging behind the rest. However, ROUGE measures literal n-gram overlap and does not account for nuances such as synonyms, so a model whose summaries are phrased differently from the gold labels can score artificially low.
Fine-Tuning Benchmarks
To gauge how well these models adapt to the style of the gold label summaries after fine-tuning, we used the following fine-tuning setup:
Summarization Fine-Tuning Setup
Fine-Tuning Method: QLoRA
Learning Rate: 2e-4
Dropout: 0.1
Rank: 64
Alpha: 16
Epochs: 1
Target Modules: Attention Modules Only
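Expressed with the peft library, this configuration looks roughly as follows. The target module names (q_proj, k_proj, v_proj and o_proj) are the attention projections used by Llama- and Mistral-family models:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,                    # LoRA rank
    lora_alpha=16,           # scaling factor alpha
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)

# `model` is the 4-bit NF4-quantized base model from the environment setup;
# training then runs for 1 epoch at a 2e-4 learning rate in the trainer.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```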
Summarization Fine-Tuning Results
We saw a marked increase in performance after QLoRA fine-tuning, with every model significantly surpassing its out-of-the-box summarization scores. Mistral-7B performed better than Llama 2 and Zephyr after fine-tuning, and the open-access models continued to hold their own against GPT-3.5-Turbo.
We also noted the drastic improvement in Llama2–13B's ROUGE scores after fine-tuning, supporting our earlier hypothesis that its out-of-the-box scores were artificially low because its summaries did not conform to the gold labels' syntactic style and word choice.
To further evaluate the effectiveness of QLoRA fine-tuning, we devised another experiment, this time with classification as the downstream task. We used the newsgroup dataset: 18,828 messages labeled across 20 classes and split into train and test sets. We then conducted an ablation study to see how quickly each model learns the classification task: we fed each model progressively more training examples and measured its accuracy at each sample fraction.
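In pseudocode, the ablation looks like the sketch below. Here fine_tune and evaluate_accuracy are hypothetical stand-ins for the QLoRA training and evaluation loops described in this post, and the dataset ID and fraction grid are illustrative:

```python
from datasets import load_dataset

train = load_dataset("SetFit/20_newsgroups", split="train")  # illustrative ID
test = load_dataset("SetFit/20_newsgroups", split="test")

accuracy_by_fraction = {}
for fraction in [0.025, 0.05, 0.1, 0.25, 0.5, 1.0]:
    # Subsample the training split at the current fraction.
    n = int(len(train) * fraction)
    subset = train.shuffle(seed=42).select(range(n))
    model = fine_tune(subset)  # hypothetical helper: QLoRA fine-tuning
    accuracy_by_fraction[fraction] = evaluate_accuracy(model, test)  # hypothetical helper
```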
Classification Fine-Tuning Setup
Fine-Tuning Method: QLoRA
Learning Rate: 2e-4
Dropout: 0.1
Rank: 8
Alpha: 16
Epochs: 5
Target Modules: Attention Modules Only
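Relative to the summarization run, only the LoRA rank changes in the peft config (the epoch count is set on the trainer):

```python
lora_config = LoraConfig(
    r=8,                     # lower rank than the summarization run
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```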
Classification Experiment Setup
Dataset: newsgroup
Prompt Template:
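As before, the template shown is an illustrative stand-in; the category names come from the dataset's 20 newsgroup labels.

```
Classify the following message into one of the 20 newsgroup categories,
such as sci.space, rec.autos or talk.politics.misc:

{message}

Category:
```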
Classification Fine-Tuning Results
Our experiment suggests that Llama-2–13B is the most sample-efficient model we tested: it adapted more quickly than the smaller 7B models. Among the 7B models, Llama-2–7B seems to adapt very quickly when the number of training examples is low (2.5%-5.0% of the training set). At higher sample fractions, Mistral and Zephyr quickly catch up, performing similarly to Llama2–7B. At 100%, all the 7B models performed similarly, and, unsurprisingly, the 13B version of Llama 2 performed better than the models with fewer parameters.
We noted rather inconsistent performance from GPT-3.5-Turbo across the different sample fractions. However, we cannot draw conclusions from this, since we have no visibility into OpenAI's internal training method.
Access our experiments in our GitHub repository
All of our experiments and the scripts to fine-tune these LLMs are available in our LLM-Tuning-Hub. The repository is organized by model; each folder contains our benchmark results along with the fine-tuning and inference code. These scripts can be used as a starting point for your own experimentation.
Conclusion
In this blog post, we believe we have demonstrated that:
- Smaller open-access models may be able to compete with commercial offerings out-of-the-box: With the refinement of training and alignment techniques, we believe there is an increasing number of smaller models that show promising results when compared to closed-source commercial LLMs.
- Fine-tuning can substantially enhance performance: Across all models, fine-tuning with techniques like QLoRA appeared to significantly improve performance and domain adaptation. In the case of summarization and classification tasks, we believe all models showed a marked improvement in performance metrics after fine-tuning, indicating that these models can adapt well to specific styles and requirements of a given task.
- Smaller models are viable choices for domain-specific tasks after fine-tuning: For practitioners who seek a balance between performance and resource constraints, our view is that fine-tuning smaller models like Mistral-7B, Zephyr-7B, or Llama-2–7B with specific datasets can yield results that are on par with, and sometimes even surpass, untuned commercial models. This approach also appears to result in faster inference times and reduced computational costs.
Thanks to open-source resources, it is easier than ever to experiment with open-access LLMs: developers can host the models themselves and tweak their behavior using their own data.
When it comes to choosing a model for fine-tuning, we suggest trying smaller models first for simpler natural language processing tasks such as summarization, something many of these models already do well out-of-the-box. For more complex tasks, or when you have limited training samples, bigger models may be a better starting point.
Lastly, if you’d like to know more about what we’ve learned about LoRA parameter settings — levers that can be used to potentially get more performance out of the models — we will share more tips in an upcoming blog post.