Experimenting with fine-tuning: Llama 2, Mistral and Zephyr

Georgian Impact Blog · Jan 10, 2024

By: Benjamin Ye & Rohit Saha

[Cover image: three llamas engaging in a playful battle in a field, generated with DALL-E 3.]

The world of LLMs evolved quickly in 2023. In July, Meta made big news in the LLM world by releasing its open-access Llama 2 model. Three months later, Mistral AI released its 7B model, which it claimed was leaner and smarter than models with greater parameter counts and which purportedly beat Llama2–13B on several benchmarks. In November, Hugging Face's H4 team produced a fine-tuned version of Mistral, Zephyr 7B, with reported MT-Bench results that even surpass those of Llama2–70B, which up to that point had been considered by some to be one of the best-performing models to date.

Each of these models improved on its predecessors with refinements such as tweaks to the self-attention mechanism and knowledge distillation from larger LLMs. Moreover, these smaller, open-access models allow researchers and hobbyists alike to run them on their own infrastructure, even locally on a consumer-grade GPU. They can also be fine-tuned to perform specialized downstream tasks, with many anecdotal reports of results comparable to closed-source OpenAI models at a fraction of the cost and inference time.

With so many different models, it can be hard to choose which one to start experimenting with. To help with the decision, we systematically benchmarked the models on different tasks. In this blog, we'll cover:

  • Model highlights: What sets these models apart from one another?
  • Out-of-the-box benchmarks: How well do these models do on their own?
  • Fine-tuning benchmarks: How well do these models adapt to downstream tasks?
  • Access our experiments in our GitHub repository: How do I fine-tune with my own data?

Model Highlights

Llama 2

Llama 2, released by Meta, is an updated version of its predecessor, Llama 1, with improved performance. The model uses a standard transformer architecture with these key features:

Model varieties: The base model is offered in a range of parameter sizes (7B, 13B and 70B), as well as a version fine-tuned and aligned via supervised fine-tuning and reinforcement learning from human feedback (RLHF).

Pre-training corpus: The model is pre-trained on a corpus with 2 trillion tokens, 40% more than Llama 1.

Context length: The model supports a context length of 4,096 tokens, double that of Llama 1.

We have written an overview and evaluation of Llama 2 in our previous blog post. To learn more, you can access it here.

Mistral-7B

Mistral-7B was released by Mistral AI. The model's key differentiating factor is the use of Sliding Window Attention (SWA), proposed by Beltagy et al. in their Longformer paper. According to the paper, SWA makes the model more efficient in both memory and compute. In contrast to the standard global attention mechanism, where the number of computational operations grows quadratically with the length of the input, SWA's operation count grows only linearly as the input length increases. Another benefit of this attention mechanism is that the model can retain more information from the past, thereby handling longer sequences more effectively.

[Figure: sliding window attention vs. vanilla attention. Source: Jiang et al.]
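To make the idea concrete, here is a minimal sketch of a sliding-window attention mask in PyTorch. This is our own illustration of the masking pattern, not Mistral's actual implementation (which also pairs SWA with a rolling KV cache):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query i may attend to keys max(0, i - window + 1) .. i."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, L)
    causal = j <= i                # no attending to future tokens
    in_window = (i - j) < window   # only the last `window` tokens are visible
    return causal & in_window

# Each query row has at most `window` allowed keys, so attention cost
# grows linearly with sequence length rather than quadratically.
print(sliding_window_mask(seq_len=8, window=3).int())
```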

As a result, the model showcased impressive benchmark results in its paper: beating Llama2–7B on all benchmarks and the larger Llama2–13B on the majority of benchmarks.

[Figure: Mistral-7B benchmark results. Source: Jiang et al.]

Zephyr-7B

Zephyr-7B was released by Hugging Face's H4 team. The model is a fine-tuned version of Mistral-7B. Its key contribution is demonstrating the effectiveness of using larger teacher models to generate synthetic datasets on which the smaller model is fine-tuned.

For its fine-tuning step, Zephyr used the UltraChat dataset, generated with ChatGPT, and was further aligned on the GPT-4-generated UltraFeedback dataset via Direct Preference Optimization (DPO).

[Figure: Zephyr's fine-tuning and alignment pipeline. Source: Tunstall et al.]

This combination allowed Zephyr to essentially distill knowledge from the teacher models, indirectly learning human preferences the larger models were trained on via RLHF.
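As a rough sketch of the DPO objective (Rafailov et al.): the policy is pushed to assign relatively higher likelihood to the response GPT-4 preferred, while a frozen reference model keeps it anchored. The function below is our own simplified illustration, not the H4 team's training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Simplified DPO loss. Each argument is a tensor of summed log-probs
    of the chosen / rejected responses under the policy / frozen reference."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between the chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```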

In its paper, Zephyr's team reported state-of-the-art MT-Bench results in the 7B model space, even beating most open-access models that are 10x bigger. The model also achieved results comparable to closed-source commercial models like GPT-3.5, demonstrating the effectiveness of its fine-tuning technique.

[Figure: MT-Bench results. Source: Tunstall et al.]

Out-of-the-Box Benchmarks

To analyze how these models compare, we evaluated their performance on a summarization task. We used the Samsum dataset, a corpus of 16,369 dialogues paired with human-labeled gold-standard summaries. To evaluate model performance, we generated summaries on Samsum's test set and scored the results with standard ROUGE metrics; in particular, ROUGE-1 and ROUGE-2, which measure unigram and bigram overlap, respectively, between the generated output and the gold label.
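For reference, the scoring loop can be sketched with Hugging Face's datasets and evaluate libraries. Here, generate_summary is a hypothetical stand-in for whichever model is being benchmarked:

```python
import evaluate
from datasets import load_dataset

samsum = load_dataset("samsum", split="test")
rouge = evaluate.load("rouge")

# `generate_summary` is a hypothetical stand-in for the model under test.
predictions = [generate_summary(row["dialogue"]) for row in samsum]
scores = rouge.compute(predictions=predictions, references=samsum["summary"])
print(scores["rouge1"], scores["rouge2"])
```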

Environment Setup

Instance Type: AWS EC2 g5.2xlarge

GPU: 1x A10G GPU (24GB VRAM)

Model Weight Quantization: 4-bit (NF4)

Calculation Precision: 16-bit (bfloat16)
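As a sketch, this loading configuration can be reproduced with the transformers and bitsandbytes integration; the checkpoint name below is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weights with bfloat16 compute, matching the setup above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint; swap as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```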

Summarization Experiment Setup

Dataset: Samsum

Prompt Template:
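The exact templates we used live in our repository; an illustrative template for dialogue summarization might look like:

```
Summarize the following dialogue:

{dialogue}

Summary:
```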

Summarization Experiment Results (Zero-Shot & Few-Shot)

[Table: zero-shot and few-shot ROUGE results. Source: LLM Finetuning Hub]

The results matched our expectations: the newest model, Zephyr, performed better than both Llama 2 and Mistral in zero-shot generation. However, with few-shot learning, Zephyr's performance uplift was less meaningful than Mistral's. Remarkably, all of the 7B open-access models performed in line with OpenAI's closed-source GPT-3.5-Turbo.

We were surprised to see the performance of the 13B Llama 2 model lagging behind the rest. However, we noted that ROUGE metrics do not account for nuances such as synonyms: "purchased" and "bought" contribute zero overlap despite meaning the same thing. A model whose outputs diverge stylistically from the gold labels can therefore score artificially low even when its summaries are good.

Fine-Tuning Benchmarks

To gauge how well these models adapt to the style of the gold-label summaries after fine-tuning, we used the following fine-tuning setup (sketched in code after the list below):

Summarization Fine-Tuning Setup

Fine-Tuning Method: QLoRA

Learning Rate: 2e-4

Dropout: 0.1

Rank: 64

Alpha: 16

Epochs: 1

Target Modules: Attention Modules Only
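In code, this setup corresponds roughly to the following peft configuration, continuing from the 4-bit loading sketch above. The target module names are an assumption on our part (the attention projections used by Llama 2 and Mistral); the learning rate and epoch count are passed to the trainer separately:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# LoRA hyperparameters matching the setup above; the target module names
# are our assumption (the attention projections in Llama 2 and Mistral).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)  # 4-bit base model from earlier
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```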

Summarization Fine-Tuning Result

[Table: ROUGE results after QLoRA fine-tuning. Source: LLM Finetuning Hub]

We saw a marked increase in performance after QLoRA fine-tuning, with scores significantly surpassing the zero-shot and few-shot summarization results. Mistral-7B performed better than Llama 2 and Zephyr after fine-tuning, and the open-access models continued to hold their own against GPT-3.5-Turbo.

We noted the drastic improvement in ROUGE metrics for Llama2–13B after fine-tuning, affirming our earlier hypothesis that its out-of-the-box scores were artificially low because its outputs did not conform to the gold labels' syntactic style and word choice.

To further evaluate the effectiveness of QLoRA fine-tuning, we devised another experiment, this time with classification as the downstream task. We used the 20 Newsgroups dataset: 18,828 messages with 20 class labels, split into train and test sets. We conducted an ablation study to see how quickly each model learns the classification task: we fed each model progressively more training examples and measured its accuracy at each sample fraction.
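A sketch of that ablation loop follows. Here, finetune and evaluate_accuracy are hypothetical stand-ins for the QLoRA training and evaluation code in our repository, and the Hugging Face Hub mirror name and sample fractions are illustrative:

```python
from datasets import load_dataset

# `finetune` and `evaluate_accuracy` are hypothetical placeholders; the
# dataset mirror and the list of fractions are illustrative assumptions.
train = load_dataset("SetFit/20_newsgroups", split="train")
test = load_dataset("SetFit/20_newsgroups", split="test")

results = {}
for fraction in [0.025, 0.05, 0.1, 0.25, 0.5, 1.0]:
    n = int(len(train) * fraction)
    subset = train.shuffle(seed=42).select(range(n))
    model = finetune(subset)                        # one QLoRA run per slice
    results[fraction] = evaluate_accuracy(model, test)
```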

Classification Fine-Tuning Setup

Fine-Tuning Method: QLoRA

Learning Rate: 2e-4

Dropout: 0.1

Rank: 8

Alpha: 16

Epochs: 5

Target Modules: Attention Modules Only

Classification Experiment Setup

Dataset: newsgroup

Prompt Template:
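Again, the exact template is in our repository; an illustrative classification prompt might look like:

```
Classify the following message into one of the 20 newsgroup categories:

{message}

Category:
```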

Classification Fine-Tuning Results

[Figure: classification accuracy by sample fraction. Source: LLM Finetuning Hub]

We believe our experiment shows that Llama2–13B is the most sample-efficient model among those we tested: it adapted more quickly than the smaller 7B models. Among the 7B models, Llama2–7B seems to adapt very quickly when only a small fraction (2.5%-5.0%) of the training examples is available. At higher sample fractions, Mistral and Zephyr quickly catch up, performing similarly to Llama2–7B. At 100%, all the 7B models appeared to perform similarly, and, unsurprisingly, the 13B version of Llama 2 appeared to perform better than the models with lower parameter counts.

We noted the rather inconsistent performance of GPT-3.5-Turbo across different sample fractions. However, we are unable to draw conclusions, as we have no visibility into OpenAI's internal training method.

Access our experiments in our GitHub repository

All of our experiments and the scripts to fine-tune these LLMs are available in our LLM-Tuning-Hub repository, organized by model. Each model's folder contains our benchmark results along with the fine-tuning and inference code, which can serve as a starting point for custom experimentation.

Conclusion

In this blog post, we believe we have demonstrated that:

  1. Smaller open-access models may be able to compete with commercial offerings out-of-the-box: With the refinement of training and alignment techniques, we believe there is an increasing number of smaller models that show promising results when compared to closed-source commercial LLMs.
  2. Fine-tuning can substantially enhance performance: Across all models, fine-tuning with techniques like QLoRA appeared to significantly improve performance and domain adaptation. In the case of summarization and classification tasks, we believe all models showed a marked improvement in performance metrics after fine-tuning, indicating that these models can adapt well to specific styles and requirements of a given task.
  3. Smaller models are viable choices for domain-specific tasks after fine-tuning: For practitioners who seek a balance between performance and resource constraints, our view is that fine-tuning smaller models like Mistral-7B, Zephyr-7B, or Llama-2–7B with specific datasets can yield results that are on par with, and sometimes even surpass, untuned commercial models. This approach also appears to result in faster inference times and reduced computational costs.

Thanks to open-source tooling, it is easier than ever to experiment with open-source LLMs: developers can host the models on their own infrastructure or tweak their behavior using their own data.

When it comes to choosing a model for fine-tuning, we suggest trying smaller models first for simpler natural language processing tasks such as summarization, something many of these models already do well out of the box. For more complex tasks, or when you have limited training samples, bigger models may be a helpful first step.

Lastly, if you’d like to know more about what we’ve learned about LoRA parameter settings — levers that can be used to potentially get more performance out of the models — we will share more tips in an upcoming blog post.
