Llama 2 — the new best open-source LLM is here, and it’s better than ChatGPT
Meta recently released a new open-source LLM called Llama 2 that achieves state-of-the-art results. Moreover, they also released Llama 2-Chat, a model that is similar to ChatGPT but can achieve even better results and can be self-hosted.
In this article, we will discuss the most important differences from the previous version, look into the results, and explain how the model works.
We will also show how you can use the model for just $9 a month.
😒 TLDR; and why should I care?
The new version of Llama, called Llama 2, includes a family of large language models. These models vary in size, from smaller ones with 7 billion parameters to larger ones with 70 billion parameters. The new models are better than the old ones because they have been trained on more data, can understand longer pieces of text, and, especially in the case of the largest model, can process information faster.
The most exciting part of Llama 2 is the set of models that have been specially trained for conversations. These models, called Llama 2-Chat, have been trained using feedback from humans. They perform really well in tests of helpfulness and safety, doing better than most other models and as well as ChatGPT. You can find more details in the research paper here.
Llama 2-Chat is an open-source model, which gives you complete control over the system prompt in chat applications. This control is crucial for defining the behaviour of your chat assistant and even infusing it with a unique personality, a feature that is not accessible in models served behind APIs.
Moreover, as the model is open-source, it can be self-hosted. OpenAI collects a lot of user information, so you should be careful about what data you send them. In some cases, you might even violate GDPR or other privacy laws while using ChatGPT. Self-hosting the model gives you complete control over your data.
Table of contents
· 😒 TLDR; and why should I care?
· 🦙Llama 2
∘ What's new?
∘ What’s not new?
· 📊Performance
∘ Best open-source model
∘ Not quite the best model overall
∘ What do the results tell us?
∘ Other benchmarks in the test set
· 🦙Llama 2-Chat
· 🕵️A deeper dive: how Llama 2-Chat was trained
∘ Pre-training
∘ Supervised fine-tuning
∘ Reinforcement Learning with Human Feedback (RLHF)
∘ System Message for Multi-Turn Consistency
· 🦺 Safety
· 👨💻How can you use the model?
🦙Llama 2
What's new?
- Larger training dataset: the new model is trained on a dataset that is 40% larger than the previous version’s, containing 2 trillion tokens.
- Bigger model size: 7B, 13B, and 70B parameters
- Bigger context length: the new model uses a 2x larger context length of 4k tokens, which means it can take longer prompts and generate longer responses.
- Chat version: Meta also introduced a Chat model that is similar to ChatGPT and achieves even better results while being open-source.
What’s not new?
The model has a similar architecture and is trained in a similar fashion to the previous version and to ChatGPT: causal language modelling for the base model and Reinforcement Learning with Human Feedback (RLHF) for the chat model. The training process, and how it differs, is explained in the next parts of the article.
📊Performance
Best open-source model
As we can see, Llama 2 is currently the best open-source model.
Not quite the best model overall
Llama 2 currently falls short of GPT-4 and PaLM-2-L in all of the above-mentioned categories. Those models are, however, closed source and can only be accessed through an API, whereas the Llama 2 weights can be easily downloaded from the Meta website.
What do the results tell us?
Llama 2 is the best open-source model, so if privacy and diversity are your concerns, it should be your first choice. The model is much better than the previous iteration on the BBH, AGIEval, and MMLU benchmarks. Those benchmarks involve zero-shot and few-shot settings, so you should consider Llama 2 if you are looking for a model to solve custom tasks.
When compared to ChatGPT, Llama 2 is worse at every task; however, the differences are not large, except on the GSM8K and HumanEval benchmarks. The first dataset covers math problems while the second covers programming tasks, so you should be careful when trying to use Llama 2 for code generation or for solving equations.
Other benchmarks in the test set
Code
- HumanEval — 164 original programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions.
- MBPP — around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers.
Commonsense Reasoning
- PIQA — 3k physical commonsense knowledge questions focusing on everyday situations with a preference for atypical solutions
- WinoGrande — 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset
- ARC — 3k multiple-choice question-answering dataset, containing questions from science exams from grade 3 to grade 9.
- CommonsenseQA — 12k questions with 5 choices each. The dataset was generated by Amazon Mechanical Turk workers.
World Knowledge
- NaturalQuestions — questions from Google Search users paired with relevant Wikipedia articles.
- TriviaQA — 95k question-answer pairs authored by trivia enthusiasts, with independently gathered evidence documents.
Reading Comprehension
- SQuAD — 88k reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text.
- QuAC — 14k crowdsourced Question Answering dialogues with 98k question-answer pairs in total. Data instances consist of an interactive dialogue between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.
Math
- GSM8K — 8.5K high-quality linguistically diverse grade school math word problems created by human problem writers
- MATH — 12.5k challenging competition mathematics problems
Popular Aggregated Benchmarks
- MMLU — a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. It covers 57 subjects across STEM, the humanities, the social sciences, and more.
- BBH — a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models.
- AGIEval — a benchmark derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., the Chinese College Entrance Exam (Gaokao) and the American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams.
🦙Llama 2-Chat
Meta also introduced Llama 2-Chat, which is similar to ChatGPT.
Evaluation results show that the model is able to achieve better results than other assistant models such as ChatGPT, PaLM, Falcon, and Vicuna.
Let’s take a closer look at how the model works.
🕵️A deeper dive: how Llama 2-Chat was trained
Pre-training
In simple terms, pre-training is like teaching a model the basics of the language by exposing it to a vast amount of text data. The model learns to predict the next word in a sentence, given all the previous words. This process helps the model to understand the structure of the language and recognize patterns in the text.
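To make this concrete, here is a minimal sketch of the causal language modeling objective using the Hugging Face transformers library. This is illustrative only (and assumes you have been granted access to the gated Llama 2 weights): when the labels equal the inputs, the library shifts them internally and computes a cross-entropy loss on next-token prediction.

```python
# Minimal sketch of the causal language modeling objective (illustrative,
# not Meta's actual training code). Requires approved access to the gated
# meta-llama/Llama-2-7b-hf repository on the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# With labels equal to the input ids, the library computes cross-entropy
# on predicting each next token from the preceding ones.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token prediction loss: {outputs.loss.item():.3f}")
```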
The model was pre-trained on a large corpus of data, around 2 trillion tokens, from publicly available sources. This data does not include any user data from Meta. The data used for pre-training has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.
The pre-training process involved a significant amount of computational resources: a cumulative 3.3 million GPU hours of computation performed on A100-80GB hardware. The carbon emissions resulting from the pre-training of Llama 2 models were estimated and offset by Meta’s sustainability program.
The result of the pre-training process is the Llama 2 base model, which is later fine-tuned to follow instructions and become Llama 2-Chat.
Supervised fine-tuning
With the introduction of ChatGPT, OpenAI highlighted a few major flaws in language models. Those models, trained to predict the next word in a sequence, are not good at following instructions. And why should they be? Language models are trained to predict the most probable sequence, not the most helpful one. This can lead to the models exhibiting unintended behaviours such as making up facts, generating biased or toxic text, or simply not following user instructions.
In order to obtain a model that follows instructions, the model has to be fine-tuned. The first step in the fine-tuning process was supervised fine-tuning, where the model was presented with pairs of prompts and desired responses created by humans.
The supervised fine-tuning process is akin to a student learning from a teacher. The model, like the student, is given a problem (the prompt) and the correct solution (the desired response). By learning from these pairs, the model begins to understand not just the structure of language, but also the context and intent behind different prompts. This process helps the model to generate responses that are more in line with what a human would expect or desire.
As a starting point, the authors used the publicly available dataset behind the Flan-PaLM model, which contains 1.8K diverse tasks. However, they emphasized the importance of not just the quantity but the quality of the data used for fine-tuning. They found that using fewer but higher-quality examples from their own vendor-based annotation efforts led to improved results. This suggests that careful data curation and quality control are crucial for the success of the fine-tuning process.
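To make the mechanics concrete, here is a rough sketch of how one prompt/response pair can be turned into a supervised training example. The tokenization and masking details are illustrative assumptions, not taken from the paper: the idea is simply that the prompt tokens are excluded from the loss, so the model learns to produce the response rather than to repeat the prompt.

```python
# Sketch of building one supervised fine-tuning example (illustrative).
# Tokens belonging to the prompt get the label -100, which the cross-entropy
# loss in Hugging Face causal LMs ignores, so only the response is learned.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Explain what a llama is in one sentence.\n"
response = "A llama is a domesticated South American camelid."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([prompt_ids + response_ids])
labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])

# These tensors would then be fed to the model exactly as in the
# pre-training sketch above: model(input_ids=input_ids, labels=labels)
print(input_ids.shape, labels.shape)
```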
Reinforcement Learning with Human Feedback (RLHF)
Supervised fine-tuning, however, does not scale well because it requires a lot of manually annotated data. That’s why, as the next step, the authors used an approach similar to the one used in ChatGPT, called Reinforcement Learning with Human Feedback (RLHF).
Reinforcement Learning with Human Feedback (RLHF) is a model training procedure that is applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. Here’s how it works:
Collecting Data: The first step involves collecting data that shows empirically how the model should behave. This data is used to guide the model’s learning process. Meta collected a large dataset of over 1 million binary comparisons based on humans applying their specified guidelines.
Reward Modeling: The role of the reward model is to learn what humans consider a desirable response. The reward model takes a model response and its corresponding prompt (including context from previous turns) as inputs and outputs a scalar score indicating the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, the model can be optimized during RLHF for better human preference alignment and improved helpfulness and safety.
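Here is a toy sketch of what a reward head and its pairwise ranking loss can look like. This is a simplification under my own assumptions: in the paper, the reward model is initialized from the chat model checkpoint and the ranking loss additionally includes a preference-strength margin, neither of which is shown here.

```python
# Toy reward model sketch (illustrative, not Meta's implementation).
# A backbone produces hidden states; a linear head maps them to a scalar
# score. Training uses a pairwise ranking loss: the chosen response should
# score higher than the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Score the final token's hidden state (one common convention).
        return self.score(last_hidden_state[:, -1, :]).squeeze(-1)

head = TinyRewardHead(hidden_size=64)

# Pretend hidden states for (prompt + chosen) and (prompt + rejected).
h_chosen = torch.randn(2, 10, 64)
h_rejected = torch.randn(2, 10, 64)

r_chosen, r_rejected = head(h_chosen), head(h_rejected)

# Binary ranking loss: push the chosen score above the rejected score.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(f"ranking loss: {loss.item():.3f}")
```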
Separate Reward Models for Helpfulness and Safety: A notable difference between Llama 2-Chat and ChatGPT is the use of two different reward objectives. The authors found that helpfulness and safety sometimes trade off against each other, which can make it challenging for a single reward model to perform well on both. To address this, they trained two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM). This might be a controversial approach, as one can argue that the need for safety limits the capabilities of the model. Because this behaviour is built into the model, it can’t be removed or turned off.
Fine-Tuning the Model: Next, the model is fine-tuned (in a similar fashion to supervised fine-tuning) to generate responses preferred by the reward model. Once the reward model is trained, we can use an essentially unlimited amount of unlabeled data. For each prompt, the model is asked to generate several responses. The scores produced by the reward model act like probabilities: the model is taught that completions with higher scores should appear more often. This process doesn’t require any completions written by humans, so it scales nicely.
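A rough sketch of this best-of-N idea is shown below. The helper functions are hypothetical stand-ins, not Meta’s implementation; the point is only the control flow of sampling several completions and keeping the one the reward model prefers (rejection sampling).

```python
# Illustrative best-of-N selection with a reward model (not Meta's code).
# The chat model proposes several completions, the reward model scores them,
# and the highest-scoring completion is kept as a fine-tuning target.
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],       # hypothetical sampler
              reward: Callable[[str, str], float],  # hypothetical reward model
              n: int = 4) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# Dummy stand-ins just to demonstrate the control flow.
dummy_generate = lambda p: p + " response" + "!" * random.randint(1, 5)
dummy_reward = lambda p, c: -abs(len(c) - 25)  # pretend a target length is best

print(best_of_n("Explain llamas.", dummy_generate, dummy_reward))
```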
Iterative Model Updates: The authors hypothesize that iterative model updates may help to prevent divergence in the model’s performance. As a last verification step to ensure no regression between the new model and the previous one, they use both for sampling during the next annotation iteration. This enables a model comparison “for free” on new prompts and can help to increase diversity when sampling.
Progression of Models: The authors reported the progress of their different SFT and then RLHF versions for both Safety and Helpfulness axes, measured by their in-house Safety and Helpfulness reward models. They found that they outperformed ChatGPT on both axes after RLHF-V3.
System Message for Multi-Turn Consistency
Another notable difference from the ChatGPT model is the introduction of Ghost Attention (GAtt). It is a method used to improve the consistency of AI responses over multiple turns of dialogue. It’s particularly useful when there are instructions that should apply throughout the entire conversation. For example, if the AI is instructed to respond succinctly or to “act as” a certain public figure, it should respect this instruction in all subsequent responses. However, initial models often forgot these instructions after a few turns of dialogue.
Here’s how the GAtt method works in simple terms:
Data Setup: Assume we have a multi-turn dialogue dataset between two participants (e.g., a user and an assistant), with a list of messages [u1, a1, …, un, an], where un and an correspond to the user and assistant messages for turn n, respectively. Meta also defines an instruction, inst, that should be respected throughout the dialogue. For example, inst could be “act as.”
Instruction Concatenation: This instruction is synthetically concatenated to all the user messages of the conversation.
Sampling and Fine-Tuning: We can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling.
Loss Adjustment: To avoid a mismatch at training time between the system message (i.e., all the intermediate assistant messages that come before the last turn) and our sample, the authors set the loss to 0 for all the tokens from the previous turns, including assistant messages.
Training Instructions: For the training instructions, a few synthetic constraints are created to sample from, such as Hobbies (“You enjoy e.g. Tennis”), Language (“Speak in e.g. French”), or Public Figure (“Act as e.g. Napoleon”). To make the instructions more complex and diverse, the final instruction is constructed by randomly combining the above constraints.
In essence, the GAtt method helps the model maintain focus on the initial instruction throughout the conversation, leading to more consistent and contextually appropriate responses.
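Below is a simplified sketch of this data construction. The turn formatting and masking convention are illustrative assumptions rather than the paper’s exact recipe: the instruction is kept only in the first turn, and the loss is zeroed on everything except the final assistant answer.

```python
# Simplified sketch of Ghost Attention (GAtt) data construction (illustrative).
# 1) When sampling synthetic data, the instruction is concatenated to the
#    user turns of the dialogue.
# 2) For training, the instruction is kept only in the first turn, and the
#    loss is masked on all tokens except the final assistant answer.

inst = "Act as Napoleon."
dialogue = [
    ("user", "Who are you?"),
    ("assistant", "I am the Emperor of the French."),
    ("user", "What is your favourite battle?"),
    ("assistant", "Austerlitz, without question."),
]

def build_training_turns(inst, dialogue):
    turns, loss_mask = [], []
    for i, (role, msg) in enumerate(dialogue):
        if role == "user" and i == 0:
            msg = f"{inst} {msg}"          # keep the instruction only once
        turns.append(f"{role}: {msg}")
        # Only the final assistant message contributes to the loss.
        loss_mask.append(i == len(dialogue) - 1)
    return turns, loss_mask

turns, loss_mask = build_training_turns(inst, dialogue)
for text, keep_loss in zip(turns, loss_mask):
    print(("train " if keep_loss else "mask  ") + text)
```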
🦺 Safety
The authors have taken extensive measures to ensure the safety of the AI model. They have developed separate reward models for safety and helpfulness, used diverse and balanced data for training, and rigorously evaluated the model’s performance. However, they also acknowledge that safety is an ongoing concern that requires continuous monitoring and adjustment. Here are highlights of the model's safety evaluation:
Safety and Helpfulness Reward Models: The authors developed two separate reward models, one for safety and one for helpfulness. The safety reward model is trained to recognize and reward safe responses, while the helpfulness reward model rewards responses that are helpful to the user. These models are trained on data collected from human annotators, who rate different model-generated responses.
Data Mixing: The authors experimented with different combinations of data for training the reward models. They found that mixing safety and helpfulness data in a certain proportion resulted in the best performance. This highlights the importance of using diverse and balanced data for training AI models.
Training Details: The authors trained the reward models for one epoch over the training data, using a specific learning rate and batch size. They found that training for longer can lead to overfitting, which is when a model performs well on the training data but poorly on new, unseen data.
Reward Model Results: The authors evaluated the performance of the reward models on a diverse set of human preference benchmarks. They found that their reward models outperformed other models, including GPT-4. This suggests that their approach to safety and helpfulness is effective.
False Refusal Rate: The authors also examined the false refusal rate, which is when the model incorrectly refuses to generate a response. They found that the false refusal rate increases with the percentage of safety data, indicating a trade-off between safety and response generation.
Safety Evaluation Prompts: The authors provided examples of prompts used for safety evaluation. These prompts cover various categories, including illicit & criminal activities, hateful & harmful activities, and unqualified advice. The model’s responses to these prompts are evaluated to ensure they are safe and appropriate.
👨💻How can you use the model?
Llama 2 and Llama 2-Chat are open-source models, meaning (almost) anyone can download them and use them for free. All you need to do is download the weights from the Meta site. There are, however, several challenges, as running the models requires substantial computing power, with 36–38GB of VRAM.
If you don’t have that much GPU power, there is an alternative approach: you can run a quantized version of the model on the CPU.
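As an example, a quantized checkpoint can be run on a CPU with the llama-cpp-python bindings. This is a sketch under my own assumptions: the model file name below is a placeholder for whatever quantized file you have downloaded or converted, and the generation settings are just examples.

```python
# Rough sketch of CPU inference with a quantized Llama 2 checkpoint,
# using the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is a placeholder; point it at your own quantized file.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

output = llm("Q: What is a llama? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```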
If you want to play with the model and don’t care about the self-hosting part, there is another option: you can subscribe to a Hugging Face Pro account and use the model for just $9 a month.
Once you have an HF Pro subscription, you can use their Inference API. All you need to do is provide your token and your input to the model and send it to the right endpoint.
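A minimal sketch of such a request is shown below. The model id and generation parameters are just examples, and the token is a placeholder for your own access token; the prompt follows the [INST] / <<SYS>> format that the Llama 2 chat models expect.

```python
# Minimal sketch of calling the Hugging Face Inference API (illustrative).
# Replace HF_TOKEN with your own Hugging Face access token.
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"
HF_TOKEN = "hf_..."  # placeholder token

# Llama 2 chat models expect the [INST] / <<SYS>> prompt format, which is
# also where you set the system prompt that controls the assistant's behaviour.
prompt = (
    "<s>[INST] <<SYS>>\nYou are a concise, helpful assistant.\n<</SYS>>\n\n"
    "Explain what Llama 2 is in one sentence. [/INST]"
)

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={"inputs": prompt, "parameters": {"max_new_tokens": 100}},
)
print(response.json())
```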
Don’t forget to leave your thoughts
If you liked the article don’t forget to give it a 👏 If you have any thoughts about Llama2 or want to share your perspective on open-source LLMs, leave a comment!
About the author
Hello, my name is Aleksander Obuchowski, I am passionate about Natural Language Processing and AI in Medicine. Follow me on LinkedIn if you like my stories.