An Overview of Llama 2: Open Foundation and Fine-Tuned Chat Models
Meta has released the Llama 2 family of open-source models, along with a technical paper discussing various details of how the models were trained. Here is a brief overview of what the paper shares.
Introduction
Meta has released pretrained and fine-tuned large language models, Llama 2 and Llama 2-Chat. The accompanying paper describes pretraining of the Llama 2-7B, Llama 2-13B, and Llama 2-70B models and fine-tuning of Llama 2-Chat for all three variants. It also provides information on the pretraining corpus, the pretraining process, fine tuning of these models, and the safety training used to improve their behavior. Llama 2 is an updated version of Llama 1 trained on a 40% larger pretraining corpus, and Llama 2-Chat is fine tuned for dialog use cases.
Pretraining Preparation
Pretraining data
Large language models need a large corpus of text for their training. The training corpus was sourced from publicly available online sources, and care was taken to remove data from sites known to contain a high volume of personal information about private individuals. Additionally, no data from Meta's products or services was used. The training corpus has two trillion tokens, 40% more than the Llama 1 pretraining corpus.
Neural Network Parameters
The parameter count defines the size of the neural network. A larger network generally yields better results, but its computing requirements also grow with its size.
Tokenization
The training corpus is converted into tokens before training; a token is often a whole word but can also be a subword piece. The byte-pair encoding (BPE) algorithm was used to tokenize the pretraining corpus for the Llama 2 models, with the same tokenizer as Llama 1 and a vocabulary of 32k tokens.
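As a rough illustration of the BPE idea, the toy sketch below repeatedly merges the most frequent pair of adjacent symbols to build subword tokens. This is not the actual SentencePiece tokenizer used for Llama 2; the tiny corpus and merge count are made up.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter({tuple(w): c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # merge the pair into one token
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += count
        vocab = merged_vocab
    return merges, vocab

# Tiny made-up corpus: frequent substrings such as "lo" and "er" become single tokens.
merges, vocab = bpe_merges(["low", "lower", "lowest", "newer", "wider"], num_merges=6)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('e', 'r'), ...]
```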
Pretraining Process
All models were pretrained using most of the pretraining settings and underlying architecture from Llama 1. The context window was doubled from 2,048 tokens in Llama 1 to 4,096 tokens; the context window determines how much text can be fed to the model at a time. A larger context window is preferred because it allows longer texts to be summarized and longer conversations to be held with a chat-tuned model, but it also increases the memory cost of the KV cache in the larger models. To improve inference scalability, Grouped-Query Attention (GQA) is used in the 34B and 70B variants: groups of query heads share a smaller set of key/value heads, which shrinks the KV cache.
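A minimal sketch of the grouped-query attention idea is shown below; the head counts, sequence length, and head dimension are illustrative only and do not match the actual Llama 2 configuration.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of query heads attends using one shared key/value head."""
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    group_size = n_q_heads // n_kv_heads
    # Repeat each KV head so it is shared by group_size query heads.
    k = k.repeat_interleave(group_size, dim=0)    # (n_q_heads, seq, d)
    v = v.repeat_interleave(group_size, dim=0)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (n_q_heads, seq, seq)
    return F.softmax(scores, dim=-1) @ v          # (n_q_heads, seq, d)

# Illustrative sizes only: 8 query heads sharing 2 KV heads over a short sequence.
q = torch.randn(8, 16, 64)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([8, 16, 64])
```

Because only 2 KV heads are cached instead of 8, the KV cache is four times smaller in this toy setup, which is the inference benefit GQA targets.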
Computing requirements
Pretraining any large language model is a compute-intensive task. Meta's Research Super Cluster (RSC) as well as internal production clusters equipped with NVIDIA A100 GPUs were used. The computation requirement scaled almost linearly with model size: the 7B model took 184,320 GPU hours while the 70B model took 1,720,320 GPU hours. All four variants together took 3,311,616 GPU hours, with total emissions of 539 tCO2eq.
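A quick back-of-the-envelope check of that roughly linear scaling; the 13B and 34B figures are taken from the paper's compute table and are included only so the total can be reproduced.

```python
# GPU-hours reported in the Llama 2 paper for the four pretrained variants.
gpu_hours = {"7B": 184_320, "13B": 368_640, "34B": 1_038_336, "70B": 1_720_320}

params_ratio = 70 / 7                              # 10x more parameters
hours_ratio = gpu_hours["70B"] / gpu_hours["7B"]   # ~9.3x more GPU hours
print(f"params: {params_ratio:.1f}x, GPU hours: {hours_ratio:.1f}x")
print(f"total: {sum(gpu_hours.values()):,} GPU hours")  # 3,311,616
```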
Pretrained model evaluation
Pretrained models were evaluated on Code, Commonsense Reasoning, Reading Comprehension, World Knowledge, and Math, along with the popular aggregated benchmarks MMLU, BBH, and AGIEval. Llama 2 outperforms Llama 1 and the open-source MPT and Falcon models on these benchmarks.
Performance of Llama 2-70B was also compared with the closed-source models GPT-3.5, GPT-4, PaLM, and PaLM-2-L. Llama 2-70B came close to GPT-3.5 on MMLU (5-shot) and GSM8K (8-shot) and was on par with or better than PaLM (540B), but a large gap remained with PaLM-2-L and GPT-4.
Fine Tuning
Llama 2-Chat is the fine-tuned version of the pretrained model. Fine tuning starts with Supervised Fine Tuning (SFT) and is then completed with Reinforcement Learning from Human Feedback (RLHF).
Supervised Fine Tuning
In this process, a set of prompts and their answers is used to fine tune the model. Publicly available instruction-tuning data was used to bootstrap the fine tuning, but third-party SFT data was found lacking in diversity and quality. Focusing on fewer, higher-quality examples improved results, and tens of thousands of SFT annotations proved sufficient; in total, 27,540 annotations were used at this stage. Further, for a set of 180 held-out examples, model responses were compared with human annotations, and the SFT model's responses were found competitive with those of human annotators.
Other details
For each training example, the prompt and the response were concatenated, separated by a special token. The loss was backpropagated only on answer tokens (prompt tokens were zeroed out), and models were fine tuned for 2 epochs.
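A minimal sketch of that setup, assuming the common convention of marking masked positions with a label of -100; the separator ID, token IDs, and logits below are made up, and the next-token shift is omitted for brevity.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def build_sft_example(prompt_ids, answer_ids, sep_id):
    """Concatenate prompt and answer with a separator token; mask the prompt in the
    labels so the loss is backpropagated only on answer tokens."""
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + answer_ids
    return torch.tensor(input_ids), torch.tensor(labels)

# Hypothetical token IDs, purely for illustration.
input_ids, labels = build_sft_example(prompt_ids=[5, 8, 13], answer_ids=[21, 34, 55], sep_id=2)

# Cross-entropy over made-up logits: only the answer positions count toward the loss.
logits = torch.randn(len(input_ids), 32_000)  # (seq_len, vocab_size)
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)
print(loss)
```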
Reinforcement Learning from Human Feedback
RLHF is applied to the fine-tuned model to further align it with human preferences and instruction following. A reward model is trained from binary comparisons: for a given prompt, human annotators indicate which of two responses they prefer.
Reward Model
Human preference data collection for reward modeling
A human annotator is presented with two different responses to a prompt and asked to pick one over the other. The two responses are generated from two different model variants and by varying the temperature hyperparameter. The annotator also labels how much better the chosen response is: significantly better, better, slightly better, or negligibly better/unsure. Responses are evaluated separately for helpfulness and for safety. Human preference data was collected on a weekly basis as the models were fine tuned.
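One way to picture a single preference record is sketched below; the field names and example content are hypothetical, not the paper's actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    prompt: str
    chosen: str     # response the annotator preferred
    rejected: str   # the other response
    degree: str     # "significantly better" | "better" | "slightly better" | "negligibly better/unsure"
    category: str   # "helpfulness" or "safety"

record = PreferenceRecord(
    prompt="How do I reset a forgotten password?",
    chosen="Use the account's 'forgot password' link and follow the verification steps ...",
    rejected="Just try guessing common passwords until one works ...",
    degree="significantly better",
    category="safety",
)
print(record.degree)
```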
Reward modeling
A reward model takes a prompt and a model response as input and outputs a scalar score indicating the quality of the generation in terms of helpfulness and safety.
Two reward models were trained, one optimized for helpfulness and the other for safety, to avoid the tradeoff that can arise when a single model is trained for both. Training uses a binary ranking loss that pushes the chosen response to score higher than the rejected one, with a margin component that reflects how strongly the annotator preferred it.
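That ranking objective can be sketched as follows; the reward scores and margin values below are placeholders, not outputs of the actual reward models.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_chosen, r_rejected, margin):
    """Binary ranking loss with a margin: the chosen response should score higher
    than the rejected one by at least the margin tied to the annotator's rating."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Placeholder scalar rewards, and a margin that grows with how strongly
# the annotator preferred the chosen response.
r_chosen = torch.tensor([2.1, 0.4])
r_rejected = torch.tensor([0.3, 0.2])
margin = torch.tensor([1.0, 0.0])  # e.g. "significantly better" vs "negligibly better"
print(ranking_loss(r_chosen, r_rejected, margin))
```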
The reward models were bootstrapped with open-source preference datasets. Subsequently, the helpfulness reward model was trained on the Meta Helpfulness dataset mixed with the Meta Safety dataset and open-source data, while the safety reward model was trained on the Meta Safety and Anthropic Harmless datasets mixed with the Meta Helpfulness dataset and open-source helpfulness data.
RLHF Fine Tuning
Two algorithms, Proximal Policy Optimization (PPO) and rejection sampling fine tuning, were used for iterative fine tuning. In PPO, one response is generated and evaluated; in rejection sampling, K responses are generated, each is scored by the reward model, and the best one is used for the next round of fine tuning, as sketched below. Rejection sampling was done only with the Llama 2-Chat 70B model, and the smaller models were fine tuned on the rejection-sampled data from it.
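A minimal sketch of the rejection-sampling step, with `generate` and `reward_model` as hypothetical stand-ins for the actual chat model and reward model:

```python
import random

def rejection_sample(prompt, generate, reward_model, k=4):
    """Sample k candidate responses, score each with the reward model,
    and keep the highest-scoring one for the next fine-tuning round."""
    candidates = [generate(prompt, temperature=1.0) for _ in range(k)]
    scored = [(reward_model(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score

# Toy stand-ins so the sketch runs end to end.
responses = ["answer A", "answer B", "answer C", "answer D"]
best, score = rejection_sample(
    "Explain KV caching.",
    generate=lambda p, temperature: random.choice(responses),
    reward_model=lambda p, r: len(r),  # placeholder scoring function
    k=4,
)
print(best, score)
```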
In a multi-turn dialog, an instruction given at the beginning, such as "act as ...", should apply to every subsequent turn. During RLHF training, the model tended to forget this instruction after a few turns. To overcome this, the Ghost Attention (GAtt) method was used, after which such instructions were respected throughout the dialog.
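Roughly, GAtt constructs its training data by attaching the instruction to every user turn when sampling responses, then keeping it only in the first turn of the data used for fine tuning. The simplified sketch below shows only that data-construction step; the instruction and dialog content are made up.

```python
def build_gatt_dialog(instruction, user_turns):
    """Attach the persistent instruction to every user turn (used when sampling
    responses), then keep it only in the first turn for the fine-tuning data."""
    augmented = [f"{instruction}\n{turn}" for turn in user_turns]  # used for sampling
    training = [augmented[0]] + user_turns[1:]                     # used for fine tuning
    return augmented, training

augmented, training = build_gatt_dialog(
    "Always answer as a pirate.",
    ["Hi, who are you?", "What's the weather like?", "Tell me a joke."],
)
print(training)
```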
Model evaluation
RLHF training results were evaluated with both model-based evaluation and human evaluation. Human evaluation is preferred but does not scale, so the reward models were used as a proxy; to check their robustness, a set of prompts covering helpfulness and safety was selected, three annotators judged the quality of the responses on a 7-point Likert scale, and these judgments were compared against the reward scores.
For human evaluation, Llama 2-Chat models were compared with open-source as well as closed-source models. Llama 2-Chat models did well against all the open-source models; for example, Llama 2-Chat 7B performed better than MPT-7B-chat. The Llama 2-Chat 70B model was found competitive with ChatGPT (gpt-3.5-turbo-0301).
Safety
Safety is an important aspect of a large language model: an LLM should generate responses that are truthful, non-toxic, and free of bias. This requires both understanding the properties of the training corpus and training the model to give safe yet helpful responses.
Data toxicity and language in the pretraining corpus
A random sample of 10% of the English-language corpus was scored for toxicity using a HateBERT classifier fine-tuned on the ToxiGen dataset. The corpus was not aggressively filtered for toxic content, both because such filtering could accidentally exclude certain demographic groups and because leaving the data in keeps the base model usable for downstream safety applications such as hate-speech classifiers.
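A hedged sketch of how such a toxicity pass over a corpus sample might look using the Hugging Face `pipeline` API; the checkpoint path is a placeholder to be replaced with a HateBERT-style classifier fine-tuned on ToxiGen, not necessarily the exact classifier Meta used.

```python
from transformers import pipeline

# Placeholder checkpoint: substitute a HateBERT-style classifier fine-tuned on ToxiGen.
toxicity = pipeline("text-classification", model="path/to/hatebert-toxigen")

sample_docs = [
    "The weather was lovely in the park today.",
    "Here is a recipe for lentil soup.",
]
for doc in sample_docs:
    result = toxicity(doc, truncation=True)[0]
    print(result["label"], round(result["score"], 3))
```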
Pre-trained models were also benchmarked for truthfulness, toxicity, and bias by using TruthfulQA, ToxiGen, and BOLD benchmarks.
English makes up most of the pretraining corpus (89.70%), so the models are best suited for English-language use cases.
Safety Fine-Tuning
In supervised safety fine tuning, adversarial prompts paired with safe example responses were used to train the pretrained models.
Safety RLHF Training
A safety-specific reward model was developed. Annotators wrote prompts that could elicit unsafe behavior, and then compared multiple model responses to the prompts, selecting the response that was safest according to a set of guidelines. After gathering only a few thousand supervised demonstrations, training was switched entirely to RLHF.
Context Distillation
Context distillation is another technique that makes LLMs safer. A safety pre-prompt, such as "You are a safe and responsible assistant", is prepended to the prompt so the model produces a safer response, and the model is then fine tuned on that response without the pre-prompt. Context distillation helps a lot in generating safe responses, but it can also degrade response quality by over-emphasizing generic concerns. The safety reward model was therefore used to decide, per prompt, whether to apply context distillation: the context-distilled output was kept only if the reward model scored it higher than the original answer.
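A minimal sketch of that gating step, with `generate` and `safety_reward_model` as hypothetical stand-ins for the chat model and the safety reward model:

```python
def distill_if_better(prompt, generate, safety_reward_model,
                      preprompt="You are a safe and responsible assistant."):
    """Generate with and without the safety pre-prompt and keep the context-distilled
    answer only when the safety reward model scores it higher."""
    original = generate(prompt)
    distilled = generate(f"{preprompt}\n{prompt}")
    if safety_reward_model(prompt, distilled) > safety_reward_model(prompt, original):
        return distilled  # use the safer, context-distilled target
    return original       # keep the original answer; distillation didn't help

# Toy stand-ins so the sketch runs.
answer = distill_if_better(
    "How do I pick a strong password?",
    generate=lambda p: f"response to: {p}",
    safety_reward_model=lambda p, r: len(r),  # placeholder scoring function
)
print(answer)
```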
Red Teaming
Red teaming included various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cybersecurity, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing.
Conclusion
Llama 2 has become a popular open-source large language model. The technical paper provides many more details and can be referred to as needed. Meta continues to build on Llama 2 and recently released Code Llama, with 7, 13, and 34 billion parameter variants fine tuned for coding tasks.