Understanding LLaMA-2 Architecture & its Ginormous Impact on GenAI

A ‘human’ summary of the 77-page LLaMA-2 paper by Meta and what you need to know to fine-tune it on your dataset.

Kunal Sawarkar
Towards Generative AI

--

The greatest thing since sliced bread dropped last week in the form of Llama-2. Meta released it with an open license for both research and commercial purposes. A closer look at the license terms shows that it is not exactly “open source” but more like open innovation. It’s heartening to see Meta (once hated for its AI practices) now being the biggest contributor to open innovation in AI, compared to the company whose name has “Open” in it but whose AI models are actually closed.

LLaMA-2 Architecture (Credit: Meta)

Why is it the biggest leap forward in AI since the legendary AlexNet paper on image recognition or the “Attention Is All You Need” paper on Transformers?

  • The LLaMA-2 paper describes the architecture in enough detail to help data scientists recreate and fine-tune the models (unlike OpenAI papers, where you have to deduce it indirectly).
  • It’s trained on 2 trillion tokens, beats all open-source models on benchmarks by a huge margin, and is comparable to GPT-3.5 in terms of performance on human evaluation.
  • The biggest novelty is the improvement over the OpenAI approach to the safety vs. helpfulness tradeoff, with model performance not degrading as the model becomes safer. The paper provides copious detail on alignment with human evaluation, which is the most expensive part of the LLM pipeline. It is a ginormous step forward in making LLMs safer for enterprise adoption.
  • New advancements for LLMs such as Grouped-Query Attention, Ghost Attention, In-Context Temperature Rescaling, and Temporal Perception.
  • It’s available on HuggingFace, WatsonX, and Azure, easing the cost of adoption. Now you can even fine-tune a 70B LLM on a single GPU (unthinkable just six months ago). A minimal loading sketch follows this list.
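Since the weights are on Hugging Face, trying the chat model takes only a few lines. This is a minimal sketch, assuming you have requested access to the gated meta-llama checkpoints and have enough GPU memory; the single-turn [INST] prompt format is shown for simplicity:

```python
# Minimal inference sketch for the 7B chat variant from Hugging Face.
# Assumes access to the gated "meta-llama" repo and sufficient GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LLaMA-2-Chat expects the [INST] ... [/INST] template for a single user turn.
prompt = "[INST] Summarize the LLaMA-2 paper in two sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```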

Here is a detailed review of LLaMA-2’s 77-page paper, describing how the model is trained, fine-tuned, and refined using RLHF, with results comparing it to open-source models.

Which Models are released?

Meta is releasing LLaMA-2 with 7B, 13B, and 70B parameters. It is also releasing instruction-tuned versions of the same, called LLaMA-2-Chat, in the same three sizes.

The key differences from the previous Llama-1 models are the license terms, a pretraining corpus that is 40% larger, a doubled context length of 4K tokens, and grouped-query attention for the 70B variant. The part I found most impactful was the new approach to safety, with two reward models for Safety and Helpfulness, which outperforms most other models on human evaluation benchmarks, as seen below.

The instruction-tuned LLaMA-2-Chat clearly beats other open-source models on the above benchmarks by a huge margin of about 60–75% and is competitive with ChatGPT. Hence it’s a big deal for open innovation.

Pre-Training Details

It is trained on 2 trillion tokens of data. The tokenizer uses the byte-pair encoding (BPE) algorithm. The model uses the standard Transformer architecture, applies pre-normalization using RMSNorm, the SwiGLU activation function, and rotary positional embeddings (RoPE). The key difference from LLaMA-1 is the increased context length (and, for the 70B model, grouped-query attention).
Hyperparameters: the AdamW optimizer with a cosine learning rate schedule, a warmup of 2,000 steps, and decay of the final learning rate down to 10% of the peak learning rate, plus a weight decay of 0.1 and gradient clipping. The model performs well on various tasks such as coding, in-context Q&A, commonsense reasoning, and knowledge benchmarks. Details below.
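To make those hyperparameters concrete, here is a minimal PyTorch sketch of the optimizer and schedule described above; it is not Meta’s actual code, and the peak learning rate, Adam betas, total step count, and the stand-in model are illustrative assumptions.

```python
# Sketch of the pre-training optimizer setup described above: AdamW, 2,000
# warmup steps, cosine decay to 10% of the peak learning rate, weight decay
# of 0.1, and gradient clipping. Peak LR, betas, and step count are assumed.
import math
import torch

def lr_lambda(step, warmup_steps=2000, total_steps=500_000, min_ratio=0.1):
    """Linear warmup, then cosine decay down to `min_ratio` of the peak LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_ratio + (1.0 - min_ratio) * cosine

model = torch.nn.Linear(4096, 4096)  # stand-in for the real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```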

Fine-Tuning

The approach to fine-tuning is shown in the architecture diagram above, with a Supervised Fine-Tuning (SFT) stage followed by a Reinforcement Learning with Human Feedback (RLHF) stage.

SFT (Supervised Fine-Tuning) Details

Meta uses a novel approach here, segmenting the fine-tuning set along the lines of helpfulness prompts and safety prompts.

To initiate the process, they began the SFT stage by utilizing publicly available instruction tuning data (Chung et al., 2022) and meticulously annotating approximately 27,540 instances, with a strong emphasis on data quality. For the supervised fine-tuning phase, they employed a cosine learning rate schedule with an initial learning rate of 2 × 10⁻⁵, a weight decay of 0.1, a batch size of 64, and a sequence length of 4,096 tokens. The model was fine-tuned with these hyperparameters for 2 epochs. The training objective followed an auto-regressive approach, wherein the loss on tokens from the user prompt was zeroed out and back-propagation was performed only on answer tokens.
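That prompt-masking objective is easy to sketch with Hugging Face Transformers. This is a minimal illustration, assuming access to the gated LLaMA-2 checkpoint; the toy prompt/answer pair and the naive tokenization (which ignores padding and exact token-boundary handling) are my own:

```python
# Sketch of the SFT objective: standard causal-LM loss with the loss on the
# user-prompt tokens masked out, so gradients flow only through answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumes access to the gated repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain grouped-query attention in one sentence."
answer = " It shares key/value heads across groups of query heads to save memory."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # -100 is ignored by the cross-entropy loss

loss = model(input_ids=full_ids, labels=labels).loss  # loss computed on answer tokens only
loss.backward()
```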

RLHF Human Data Collection

Meta instructed the annotators to follow a specific process: first, they had to create a prompt, and then they were presented with two model-generated responses, which they had to evaluate based on given criteria. To ensure enhanced diversity, the two responses to each prompt were sampled from two distinct model variants, using different temperature hyper-parameters. As depicted earlier, the gathered data was categorized along the dimensions of safety and helpfulness. This collected data served as the basis for the Reward Model.
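Each comparison boils down to a small record. A hypothetical schema for one annotation, with field names of my own choosing (the paper also records how strongly the annotator preferred one response), might look like this:

```python
# Hypothetical schema for one human-preference annotation as described above.
# Field names are illustrative, not taken from Meta's internal format.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str             # annotator-written prompt
    response_chosen: str    # the response the annotator preferred
    response_rejected: str  # the other model-generated response
    dimension: str          # "helpfulness" or "safety"
    margin: str             # degree of preference, e.g. "significantly better"

example = PreferencePair(
    prompt="How do I politely decline a meeting?",
    response_chosen="You could say: 'Thanks for the invite, but I can't make it...'",
    response_rejected="Just ignore the invite.",
    dimension="helpfulness",
    margin="significantly better",
)
```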

Reward Model

The reward model is designed to take both a model-generated response and its corresponding prompt (including previous context) as inputs and then produce a scalar score, indicating the quality of the generated output, such as its helpfulness and safety. The most significant breakthrough introduced by LLAMA2 is overcoming the commonly observed tradeoff between safety and helpfulness, achieving superior performance on both criteria.

To accomplish this, Meta trained two distinct reward models: one optimized for helpfulness, referred to as the Helpfulness RM, and another for safety, referred to as the Safety RM. The model architecture and hyper-parameters remain the same as those of the pre-trained language models, except for the classification head for next-token prediction, which is replaced with a regression head for generating the scalar reward.

For training the reward model, human preference data was structured into a binary ranking label format, with responses categorized as chosen and rejected, and it was ensured that the chosen response always received a higher score than its counterpart. Meta conducted extensive data mixing, combining its Helpfulness data with other open-source datasets in roughly a 90/10 composition. The reward model was trained for one epoch with a learning rate of 1 × 10⁻⁵.
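Conceptually, the reward model is the pre-trained LM with its next-token head swapped for a scalar regression head, trained with a binary ranking loss so the chosen response scores higher than the rejected one. A minimal sketch, with a simplified margin term and a generic Hugging Face-style backbone standing in for the actual model:

```python
# Minimal sketch of a reward model: a causal-LM backbone with a scalar head,
# trained so that score(chosen) > score(rejected). Simplified; Meta's version
# also scales the margin by how strongly the annotator preferred the response.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # HF-style causal-LM backbone
        self.score_head = nn.Linear(hidden_size, 1)   # regression head instead of LM head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask,
                               output_hidden_states=True).hidden_states[-1]
        last_token = attention_mask.sum(dim=1) - 1    # index of final non-pad token (right padding)
        pooled = hidden[torch.arange(hidden.size(0)), last_token]
        return self.score_head(pooled).squeeze(-1)    # one scalar reward per sequence

def ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    # Binary ranking loss: push the chosen response's score above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected - margin).mean()
```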

Overall, the reward models outperform all other baselines, including GPT-4, and their accuracy does not saturate on Meta’s own preference datasets.

RLHF IFT (Iterative Fine-Tuning)

Meta created various versions of RLHF models, from V1 to V5, using iterative fine-tuning with two algorithms:
  • Proximal Policy Optimization (PPO): the same as the OpenAI approach, which uses the reward model as an estimate of the true reward function (human preference) and the pre-trained language model as the policy to optimize.
  • Rejection Sampling fine-tuning: sample K outputs from the model, select the best candidate according to the reward model, and use the selected outputs for a gradient update. The output with the highest reward score is treated as the new gold standard, and the model is then fine-tuned on this new set of ranked samples, reinforcing the reward.
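Rejection sampling is simple enough to sketch in a few lines. In this illustration, generate_fn, reward_fn, and sft_step are hypothetical stand-ins for the model’s sampling function, the reward model, and a supervised update on the winning samples:

```python
# Hedged sketch of rejection-sampling fine-tuning: sample K outputs per prompt,
# keep the highest-reward candidate, then do a supervised update on the winners.
# generate_fn, reward_fn, and sft_step are hypothetical stand-ins.

def rejection_sampling_round(prompts, generate_fn, reward_fn, sft_step, k=8):
    best_samples = []
    for prompt in prompts:
        candidates = [generate_fn(prompt, temperature=1.0) for _ in range(k)]
        scored = [(reward_fn(prompt, c), c) for c in candidates]
        _, best = max(scored, key=lambda pair: pair[0])
        best_samples.append((prompt, best))   # new "gold standard" sample
    sft_step(best_samples)                    # fine-tune on the ranked winners
    return best_samples
```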

I personally found this rejection-sampling approach quite intuitive and easier to interpret. It is performed only on the 70B model, and all smaller models are distilled from it. The end result is that the gap between the median and the maximum reward keeps growing, showing a net gain.

Two distinct models, namely the Safety reward model (R_s) and the helpfulness reward model (R_h), were trained. To ensure safety, Meta identified prompts in the dataset that could potentially elicit unsafe responses and prioritized the scores generated by the safety model. A threshold of 0.15 was chosen to filter out unsafe responses, resulting in a precision of 0.89 and a recall of 0.55, as evaluated on the Meta Safety test set.
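The way the two reward models are combined during RLHF can be read as a simple gate: if the prompt is tagged as potentially unsafe, or the safety score falls below the 0.15 threshold, the safety reward is used; otherwise the helpfulness reward is. This is a simplified sketch of that logic; the paper additionally rescales the scores, and safety_rm, helpfulness_rm, and is_safety_prompt are stand-ins:

```python
# Simplified sketch of combining the two reward models during RLHF: prefer the
# safety score for risky prompts or low-safety responses, otherwise use the
# helpfulness score. Score rescaling from the paper is omitted here.

SAFETY_THRESHOLD = 0.15  # below this, the response is treated as unsafe

def combined_reward(prompt, response, safety_rm, helpfulness_rm, is_safety_prompt):
    r_safety = safety_rm(prompt, response)
    r_helpful = helpfulness_rm(prompt, response)
    if is_safety_prompt or r_safety < SAFETY_THRESHOLD:
        return r_safety
    return r_helpful
```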

For the training process, the AdamW optimizer with a weight decay of 0.1 was employed, and gradient clipping of 1.0 was applied. A constant learning rate of 1 × 10⁻⁶ was used during training. Each Proximal Policy Optimization (PPO) iteration used a batch size of 512, a PPO clip threshold of 0.2, and a mini-batch size of 64, with one gradient step taken per mini-batch.

Ghost Attention (GAtt)

The loss of context in multi-turn conversations is a known issue. To address it, Meta implemented the GAtt (Ghost Attention) method by artificially concatenating the instruction to all user messages within the conversation. They then sampled from this augmented dataset using the latest RLHF (Reinforcement Learning with Human Feedback) model. The result is a context-rich dialogue and a corresponding sample, which they used for fine-tuning the model, akin to the process of rejection sampling. The overall outcome showed improved attention to the instruction compared to the existing model. However, it’s important to note that this approach was evaluated on the 70B models exclusively.
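The data side of GAtt can be sketched simply: attach a persistent instruction to every user turn, then sample a dialogue from the latest RLHF model to use as fine-tuning data. The helper names and the example instruction below are illustrative, and the paper’s additional step of zeroing the loss on the repeated instruction tokens is omitted:

```python
# Hedged sketch of GAtt-style data augmentation: concatenate a persistent
# instruction to every user turn, then sample assistant replies from the latest
# RLHF model on the augmented dialogue. sample_dialogue_fn is a stand-in.

def build_gatt_dialogue(instruction, user_turns, sample_dialogue_fn):
    augmented = []
    for turn in user_turns:
        # Artificially prepend the instruction so it is "visible" at every turn.
        augmented.append({"role": "user", "content": f"{instruction}\n{turn}"})
    return sample_dialogue_fn(augmented)

# Illustrative usage, with a placeholder in place of the real RLHF model call:
dialogue = build_gatt_dialogue(
    instruction="Always answer as a pirate.",
    user_turns=["Who wrote Hamlet?", "And when was it first performed?"],
    sample_dialogue_fn=lambda turns: turns,
)
```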

Model Results

Meta asked three annotators to judge the quality of the answers based on a 7-point Likert scale (the higher the better) and calculated IRR (inter-rater reliability) to ensure consistency in quality.

LLaMA-2-Chat outperforms open-source models by a significant margin (60–75%) on both single-turn and multi-turn prompts and is comparable to ChatGPT.

Safety Model

Meta examined the data for demographic bias (such as the over-representation of male pronouns and particular sexual-orientation norms), balanced the training dataset, and addressed toxicity. It used benchmarks such as TruthfulQA for truthfulness, ToxiGen for hateful content, and BOLD for social bias.

They followed a process similar to general fine-tuning, which encompassed three main steps:

  1. Supervised Safety Fine-Tuning: Initially, they initiated the process by collecting adversarial prompts and safe demonstrations, which were then incorporated into the general supervised fine-tuning procedure. This step ensured that the model adhered to their safety guidelines even before RLHF (Reinforcement Learning with Human Feedback) and laid the groundwork for obtaining high-quality human preference data annotations.
  2. Safety RLHF: After the supervised safety fine-tuning, they proceeded to integrate safety into the general RLHF pipeline. This involved training a safety-specific reward model and acquiring more challenging adversarial prompts for the fine-tuning process using rejection sampling-style fine-tuning and Proximal Policy Optimization (PPO) optimization.
  3. Safety Context Distillation: Finally, they refined the RLHF pipeline by employing context distillation for safety. This step entailed generating safer model responses by adding a safety preprompt, such as “You are a safe and responsible assistant,” to the prompts. The model was then fine-tuned on the safer responses without the preprompt, essentially distilling the safety preprompt (context) into the model; a minimal sketch of this step follows the list. A targeted approach was used, allowing the safety reward model to decide whether to use context distillation for each sample.
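Here is a minimal sketch of that context-distillation step, assuming hypothetical generate_fn and safety_score_fn helpers; the preprompt wording is taken from the example above, and the gating rule is a simplification of what the paper describes:

```python
# Hedged sketch of safety context distillation: generate with a safety preprompt,
# then keep (prompt without preprompt, safe answer) pairs as fine-tuning targets.
# generate_fn and safety_score_fn are hypothetical stand-ins.

SAFETY_PREPROMPT = "You are a safe and responsible assistant. "

def distill_safety_context(adversarial_prompts, generate_fn, safety_score_fn):
    distilled = []
    for prompt in adversarial_prompts:
        plain_answer = generate_fn(prompt)
        safe_answer = generate_fn(SAFETY_PREPROMPT + prompt)
        # Targeted: keep the distilled answer only if the safety RM scores it
        # higher than the answer produced without the preprompt.
        if safety_score_fn(prompt, safe_answer) > safety_score_fn(prompt, plain_answer):
            distilled.append({"prompt": prompt, "response": safe_answer})
    return distilled  # fine-tune on these pairs *without* the safety preprompt
```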

Safety is of course a hard problem to solve as not all edge cases can be envisioned. The most interesting result is its impact on Helpfulness alignment.

The mean helpfulness score remains constant, which is a great leap forward in making LLMs safer. As with any ML model, there is a risk of false negatives, i.e. false refusals, if we make the model too safe. Meta measured this as well and found it to be only 0.05%.

Meta also performed Red Teaming with teams of over 350 people, including domain experts in cybersecurity, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing.

Overall the model performs really well on the safety benchmarks compared to all other LLMs.

New Findings

Meta also shared some interesting learnings from their research with the AI community.

  1. Supervised data may no longer be the gold standard. The model’s output quality is capped by the writing abilities of the most skilled annotator; preference feedback is perhaps a more apt signal.
  2. In-Context Temperature Rescaling: with RLHF, the diversity of responses gets reduced at higher temperatures, but not uniformly; the reduction affects factual prompts far more than creative ones.

As temperature rises, the model learns to consistently provide the same response to factual prompts, while creative prompts retain their diversity (see the probe sketch after this list).

3. Temporal Perception: Perhaps the most fascinating aspect is the LLM’s ability to understand the temporal nature of questions, which has been a long-running problem in language models.

The observation suggests that LLMs have internalized the concept of time to a greater extent than previously assumed, despite their training being based solely on next-token prediction over data that is randomly shuffled without regard to its chronological context.
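The temperature-rescaling observation (item 2 above) is easy to probe yourself. Here is an illustrative sketch; the model choice, prompts, temperatures, and sample counts are my own, not taken from the paper:

```python
# Illustrative probe of in-context temperature rescaling: sample completions at
# rising temperatures for a factual vs. a creative prompt and count distinct
# answers. Model, prompts, and settings are assumptions, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumes access to the gated repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16,
                                             device_map="auto")

prompts = {
    "factual": "[INST] What is the capital of France? [/INST]",
    "creative": "[INST] Write the first line of a poem about the sea. [/INST]",
}

for kind, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    for temperature in (0.4, 0.7, 1.0, 1.3):
        outputs = model.generate(**inputs, do_sample=True, temperature=temperature,
                                 max_new_tokens=30, num_return_sequences=10)
        # Strip the prompt tokens and count how many distinct completions we got.
        texts = {tokenizer.decode(o[inputs.input_ids.shape[1]:],
                                  skip_special_tokens=True).strip() for o in outputs}
        print(f"{kind} @ T={temperature}: {len(texts)} distinct answers out of 10")
```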

Why Is LLaMA-2 a Seismic Shift?

  • The approach and scale beat everything that has come before it in open innovation, such as Falcon or Vicuna.
  • Meta also takes a dedicated approach to human evaluation, comparable to OpenAI’s. To compare the models, it collected a diverse set of over 4,000 single- and multi-turn prompts spanning the following categories: factual questions, writing and content creation, language assistance, recommendations, and dialogue.
  • The tension between safety and helpfulness in reward modeling is finally tackled head-on with data, training, and a dedicated model, and, to much delight, solved to a degree acceptable for adoption. They also used a scaling approach to safety which, I feel, will have a much larger impact on the future direction of research.
  • The appendix of the paper provides enough technical detail on evaluation and fine-tuning to advance the state of LLMs. This is true to the spirit of “Open Innovation”.

Meta Vs OpenAI

Meta and OpenAI took two different paths. OpenAI started out enthusiastic about ethics, with ambitious ideas to change the world, but eventually became overly self-confident and closed the door on open innovation, drawing criticism and disapproval for its closed approach. Meta, on the other hand, started out widely disliked for its AI practices, yet has ended up having a significant open impact on the field of AI, most notably through its contributions to the development of PyTorch.

While a week is a long time in GenAI, I believe this paper will still be considered one of the most influential in the advancement of GenAI years from now.

Citation

Link to the original paper: https://arxiv.org/abs/2307.09288

All images credit to Meta AI.

Follow Towards Generative AI for more content related to latest in AI.

Subscribe to the 3 min newsletter to learn about 3 most impactful things in Generative AI every week.
