Llama 2: Explained in a simple step-by-step process 🤟

Pratik
8 min read · Oct 13, 2023

In the rapidly evolving landscape of artificial intelligence (AI), the relentless pursuit of mimicking human cognitive abilities has fueled groundbreaking advancements. From the inception of simple rule-based systems to the modern era of complex neural networks and deep learning, AI has transformed from an abstract concept into a pervasive and influential force across various domains.

How does Llama 2 work? 🤔

Llama 2's functioning involves training its neural network with an extensive dataset of 2 trillion "tokens" sourced from publicly available materials such as Common Crawl, Wikipedia, and public domain books from Project Gutenberg. Each token, representing a word or semantic fragment, enables the model to comprehend text, anticipate subsequent content, and establish connections between related concepts like "Apple" and "iPhone," distinguishing them from "apple," "banana," and "fruit."
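
To make the idea of tokens concrete, here is a minimal sketch of inspecting how text is split into sub-word pieces, assuming the Hugging Face transformers library and access to the gated meta-llama/Llama-2-7b-hf checkpoint (Meta requires accepting its license first). The printed pieces are illustrative, not exact.

```python
# A minimal sketch of how Llama 2's BPE tokenizer splits text into tokens.
# Assumes the Hugging Face `transformers` library and that you have been
# granted access to the gated "meta-llama/Llama-2-7b-hf" checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "Apple released a new iPhone."
token_ids = tokenizer.encode(text)            # list of integer token ids
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)      # sub-word pieces, e.g. ['<s>', '▁Apple', '▁released', ...]
print(token_ids)   # the integer ids the model is actually trained on
```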

Recognizing the potential pitfalls of training an AI model on the open internet, the developers employed additional techniques, including reinforcement learning with human feedback (RLHF), to refine the model's capacity for producing safe and constructive responses. Human testers played a role in ranking different AI model responses to guide it toward generating more appropriate outputs. Moreover, the chat versions underwent fine-tuning using specific data to enhance their ability to engage in natural conversations.

In Meta's human evaluations, Llama 2-Chat, the instruction-tuned variant, outperforms other open-source chat models by a significant margin, with win rates in the range of roughly 60% to 75%, and it is competitive with ChatGPT. This is a major development in the realm of open innovation.

Pre-Training:

The model is trained on a vast dataset of 2 trillion tokens, using a byte-pair encoding (BPE) algorithm for tokenization. It employs the standard Transformer architecture with pre-normalization via RMSNorm, the SwiGLU activation function, and rotary positional embeddings. Notably, it offers an increased context length of 4,096 tokens, double that of LLaMA 1.

In terms of hyperparameters, the model uses the AdamW optimizer with a cosine learning rate schedule, a warm-up period of 2,000 steps, and decay of the final learning rate to 10% of the peak learning rate. It applies a weight decay of 0.1 and gradient clipping at 1.0. The model exhibits strong performance across various tasks, including coding, in-context Q&A, commonsense reasoning, and knowledge benchmarks.
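
To make the schedule concrete, below is a minimal sketch of a warm-up plus cosine-decay learning rate of the shape described above. The peak learning rate and total step count are illustrative assumptions, not Meta's exact values.

```python
import math

# A minimal sketch of the schedule described above: linear warm-up for
# 2,000 steps, then cosine decay down to 10% of the peak learning rate.
# The peak value and total step count are illustrative placeholders.
PEAK_LR = 3e-4
MIN_LR = 0.1 * PEAK_LR
WARMUP_STEPS = 2_000
TOTAL_STEPS = 500_000  # hypothetical training length

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # linear warm-up from 0 to the peak learning rate
        return PEAK_LR * step / WARMUP_STEPS
    # cosine decay from PEAK_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

print(learning_rate(0), learning_rate(2_000), learning_rate(500_000))
```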

Fine-Tuning:

The approach to fine-tuning consists of two components: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF).

SFT (Supervised Fine-Tuning):

In this phase, Meta introduces an approach that categorizes prompts along helpfulness and safety dimensions. The process began with publicly available instruction-tuning data (Chung et al., 2022) and was then supplemented with roughly 27,540 meticulously annotated examples, with a strong focus on data quality. During supervised fine-tuning, a cosine learning rate schedule was used with an initial learning rate of 2 × 10⁻⁵, a weight decay of 0.1, a batch size of 64, and a sequence length of 4,096 tokens, trained for 2 epochs. The training objective was auto-regressive: the loss on tokens from the user prompt was zeroed out, so back-propagation was applied only to the answer tokens.
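
Here is a minimal sketch of that loss-masking idea, assuming a PyTorch causal language model: the prompt and answer are concatenated, but prompt positions are set to the ignore index so only answer tokens contribute to the loss. The model object and token tensors are placeholders.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by cross_entropy

def sft_loss(model, prompt_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    # Full input the model sees: prompt followed by answer.
    input_ids = torch.cat([prompt_ids, answer_ids], dim=-1).unsqueeze(0)

    # Labels mirror the input, but prompt positions are masked out so they
    # contribute nothing to the loss.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = IGNORE_INDEX

    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Standard next-token prediction: shift logits and labels by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```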

RLHF:

Meta established a precise procedure for annotators during the data collection process. Annotators first wrote a prompt and were then shown two model-generated responses, which they ranked against predefined criteria. To increase diversity, the two responses for each prompt were drawn from two different model variants, each using a distinct temperature hyperparameter. As described earlier, the collected data was categorized along safety and helpfulness dimensions, forming the foundation for the reward models.

Meta developed multiple iterations of RLHF, spanning from V1 to V5, refined with two distinct algorithms:

1. Proximal Policy Optimization (PPO): This method aligns with OpenAI’s approach, employing the reward model as an estimate for the genuine reward function, which reflects human preferences. The pre-trained language model serves as the policy, subject to optimization.

2. Rejection Sampling Fine-Tuning: This approach involves sampling K outputs from the model and selecting the most promising candidate based on a reward score. The chosen outputs form a new gold standard for further model fine-tuning. This process reinforces the reward mechanism, iteratively enhancing model performance.

Rejection sampling was applied only to the 70B model; smaller models were fine-tuned on data sampled from it. The approach is seen as intuitive and easier to understand for learning purposes: sampling more candidates widens the gap between the median and maximum reward scores, and that gap is the headroom each round of fine-tuning can exploit.
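
Below is a minimal sketch of the best-of-K selection at the heart of rejection sampling fine-tuning, under the assumption of a generic generation function and reward model; both names are hypothetical stand-ins, not Meta's actual interfaces.

```python
# A minimal sketch of rejection sampling fine-tuning: sample K candidate
# responses per prompt, score each with the reward model, and keep the
# highest-scoring one as a new fine-tuning target. `generate` and
# `reward_model` are hypothetical placeholders.

def best_of_k(prompt: str, generate, reward_model, k: int = 8) -> str:
    # Draw K stochastic samples from the current model.
    candidates = [generate(prompt, temperature=1.0) for _ in range(k)]

    # Score every candidate with the reward model and keep the best one.
    scores = [reward_model(prompt, c) for c in candidates]
    best_index = max(range(k), key=lambda i: scores[i])
    return candidates[best_index]

# The selected (prompt, best response) pairs then form a new supervised
# fine-tuning set, so the next model iteration is trained to imitate its
# own highest-reward outputs.
```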

Meta trained two distinct reward models: a safety reward model (R_s) and a helpfulness reward model (R_h). To prioritize safety, prompts with potential for unsafe responses were identified, and responses whose safety reward score fell below a threshold of 0.15 were filtered as unsafe, corresponding to a precision of 0.89 and a recall of 0.55 on the Meta Safety test set.
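
As a rough illustration of how the two reward models and the 0.15 threshold might interact during RLHF, here is a minimal sketch; the gating rule and function names are a plausible reading of the description above, not Meta's exact implementation.

```python
# A minimal sketch of combining the two reward models: for prompts flagged
# as potentially unsafe, or whenever the safety score falls below the 0.15
# threshold, the safety reward takes priority; otherwise the helpfulness
# reward is used. All names here are illustrative placeholders.

SAFETY_THRESHOLD = 0.15

def combined_reward(prompt: str, response: str,
                    safety_rm, helpfulness_rm,
                    is_safety_prompt: bool) -> float:
    r_s = safety_rm(prompt, response)       # safety reward score
    r_h = helpfulness_rm(prompt, response)  # helpfulness reward score

    # Prioritize safety on flagged prompts or clearly unsafe responses.
    if is_safety_prompt or r_s < SAFETY_THRESHOLD:
        return r_s
    return r_h
```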

The RLHF training process employed the AdamW optimizer with a weight decay of 0.1 and gradient clipping at 1.0. A constant learning rate of 10⁻⁶ was used during training. Proximal Policy Optimization (PPO) iterations used a batch size of 512, a PPO clip threshold of 0.2, and a mini-batch size of 64, with one gradient step per mini-batch.
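
For readers unfamiliar with PPO, here is a minimal sketch of the clipped policy objective using the 0.2 clip threshold mentioned above; it is an illustrative PyTorch fragment, not Meta's training code.

```python
import torch

# A minimal sketch of the clipped PPO objective. `log_probs` come from the
# current policy, `old_log_probs` from the policy that generated the
# responses, and `advantages` are derived from the reward model scores.

CLIP_EPS = 0.2  # the PPO clip threshold mentioned in the text

def ppo_policy_loss(log_probs: torch.Tensor,
                    old_log_probs: torch.Tensor,
                    advantages: torch.Tensor) -> torch.Tensor:
    # Probability ratio between the new and old policies.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective: take the pessimistic (minimum) estimate.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantages
    return -torch.min(unclipped, clipped).mean()
```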

Ghost Attention (GAtt):

The issue of context loss in multi-turn conversations has been acknowledged and addressed by Meta through the GAtt (Ghost Attention) method. This method artificially concatenates a persistent instruction to all user messages in the conversation. Meta then used the latest RLHF (Reinforcement Learning with Human Feedback) model to sample from this augmented data, yielding context-rich dialogues that were used for fine-tuning the model, somewhat similar to the concept of rejection sampling. The overall outcome demonstrated better attention to the instruction across turns compared to the existing model. It’s worth noting that this approach was specifically evaluated on 70B models.
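
A minimal sketch of that concatenation step is shown below, assuming a simple list-of-dicts chat format; the dialogue, instruction, and helper name are purely illustrative.

```python
# A minimal sketch of the Ghost Attention idea: an instruction that should
# persist across the whole dialogue (e.g. a persona or constraint) is
# synthetically concatenated to every user turn before sampling training
# data, so the model learns to keep respecting it in later turns.

def apply_gatt_instruction(instruction: str, dialogue: list[dict]) -> list[dict]:
    augmented = []
    for turn in dialogue:
        if turn["role"] == "user":
            # Prepend the persistent instruction to each user message.
            augmented.append({
                "role": "user",
                "content": f"{instruction}\n\n{turn['content']}",
            })
        else:
            augmented.append(turn)
    return augmented

dialogue = [
    {"role": "user", "content": "Who wrote Hamlet?"},
    {"role": "assistant", "content": "William Shakespeare."},
    {"role": "user", "content": "Summarize its plot."},
]
print(apply_gatt_instruction("Always answer as a pirate.", dialogue))
```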

Conclusion:

These models serve as a foundation for customization. Users can train Llama 2 to create article summaries in their company's unique style or voice by providing it with numerous examples. Similarly, they can further enhance chat-optimized models to better respond to customer support requests by providing relevant information like FAQs and chat logs.

Many well-known large language models (LLMs), such as OpenAI's GPT-3 and GPT-4, Google's PaLM and PaLM 2, and Anthropic's Claude, are closed source. While researchers and businesses can access these models through official APIs and fine-tune them for specific responses, they lack transparency into the models' inner workings.

However, Llama 2 stands out by offering openness. Interested individuals can access a detailed research paper explaining how the model was created and trained. They can download the model and, with the necessary technical expertise, run it on their computers or delve into its code, although it's important to note that even the smallest version requires over 13 GB of storage.
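
As an example of what running it locally can look like, here is a minimal sketch using the Hugging Face transformers library and the gated meta-llama/Llama-2-7b-chat-hf checkpoint; it assumes you have accepted Meta's license, have the accelerate package installed for device_map="auto", and have enough memory for roughly 13 GB of weights.

```python
# A minimal sketch of running a downloaded Llama 2 chat model locally with
# Hugging Face `transformers`. Requires accepting Meta's license for the
# gated checkpoint and the `accelerate` package for device_map="auto".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Explain what a token is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```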

Furthermore, users can deploy Llama 2 on cloud infrastructures like Microsoft Azure and Amazon Web Services via platforms such as Hugging Face. This enables them to train the model on their own data to generate tailored text. It's essential to follow Meta's guidelines for responsible use when working with Llama.

Meta's open approach with Llama gives companies building AI-powered applications greater control. The primary restriction is that companies with more than 700 million monthly active users must request special permission from Meta to use Llama, which effectively puts it out of reach for tech giants like Apple, Google, and Amazon.

This openness in AI development is significant as it aligns with the historical trend of advancements in computing built upon open research and experimentation. While companies like Google and OpenAI will remain key players in the field, the release of Llama ensures the existence of credible alternatives to closed-source AI systems, reducing the potential for monopolies and promoting innovation.

Meta AI's inaugural large language model, LLaMA 1, was introduced in February 2023. It is a remarkable collection of foundation models, with parameter counts spanning from 7 billion to 65 billion.

What sets LLaMA 1 apart is its remarkable training on trillions of tokens, demonstrating that achieving state-of-the-art language models is possible solely through publicly available datasets, without relying on proprietary or inaccessible data sources.

Remarkably, the LLaMA-13B model outperformed GPT-3 on most benchmark datasets, despite having only 13 billion parameters compared to GPT-3's 175 billion. This achievement underscores LLaMA's efficiency in achieving top-tier performance with a considerably reduced number of parameters.

Even the largest model in the LLaMA collection, LLaMA-65B, holds its own against other prominent models in the field of natural language processing (NLP), such as Chinchilla-70B and PaLM-540B.

LLaMA's distinguishing feature lies in its strong commitment to openness and accessibility. Meta AI, the creators of LLaMA, have demonstrated their dedication to advancing the AI field through collaborative efforts by making their models available to the research community, an approach that stands in clear contrast to OpenAI's closed releases of GPT-3 and GPT-4.

Llama 2-Chat is a specialized variant of Llama 2 tailored for dialogue-oriented applications. It has undergone fine-tuning to enhance its performance, ensuring it provides more contextually relevant responses during conversations.

While Llama 2 was initially pretrained using openly accessible online data sources, Llama 2-Chat has been fine-tuned using publicly available instruction datasets and incorporates over 1 million human annotations to refine its dialogue capabilities.

Meta’s researchers have introduced multiple versions of Llama 2 and Llama 2-Chat with diverse parameter sizes, including 7 billion, 13 billion, and 70 billion. These options are designed to accommodate a range of computational needs and application scenarios, empowering researchers and developers to select the most appropriate model for their specific tasks. This accessibility enables startups to leverage Llama 2 models for developing their machine learning products, encompassing various generative AI applications and AI chatbots similar to Google’s Bard and OpenAI’s ChatGPT.
