Brief Introduction to Llama 2

Florian June
3 min read · Sep 21, 2023

Llama 2 is Meta’s latest family of large language models (LLMs); it has wide applications and considerable influence.

However, the information available online is quite scattered. This article focuses on providing a concise introduction to the technical details of Llama 2.

Model Architecture

Llama 2 adopts most of the pre-training settings and model architecture of Llama 1. It uses the standard Transformer architecture, applies RMSNorm for pre-normalization, uses the SwiGLU activation function, and employs rotary position embeddings (RoPE). The main architectural differences from Llama 1 are the context length, extended from 2048 to 4096 tokens, and the use of Grouped Query Attention (GQA).

Llama 2 vs Llama 1
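
To make the pre-normalization concrete, here is a minimal sketch of RMSNorm in PyTorch. The class and parameter names (dim, eps, weight) are illustrative assumptions, not taken from Meta’s released code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scales by the RMS of the features,
    with a learnable per-feature weight, no mean subtraction, and no bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); normalize over the last dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# Example: pre-normalize a batch of hidden states before the attention/FFN blocks
x = torch.randn(2, 16, 4096)   # (batch, seq_len, hidden_dim)
y = RMSNorm(dim=4096)(x)
```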

Grouped Query Attention (GQA)

This attention mechanism improves the inference scalability of large models. It works by sharing the key and value projections across multiple query heads, without significant performance degradation. Either the original multi-query attention (MQA) format, with a single KV projection, or the grouped-query attention (GQA) variant, with 8 KV projections, can be used; Llama 2 adopts GQA for its larger 34B and 70B models.
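
As a rough illustration, the sketch below shows how grouped-query attention lets groups of query heads share a smaller set of key/value heads by repeating the KV projections. Shapes and names (n_heads, n_kv_heads, repeat_kv) are illustrative assumptions, not Meta’s implementation.

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, n_kv_heads, seq, head_dim) so that each KV head is shared
    by n_rep query heads, giving (batch, n_kv_heads * n_rep, seq, head_dim)."""
    b, h_kv, s, d = kv.shape
    return kv[:, :, None, :, :].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

batch, seq, head_dim = 1, 16, 128
n_heads, n_kv_heads = 64, 8                 # e.g. 8 KV projections shared by 64 query heads
q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

k = repeat_kv(k, n_heads // n_kv_heads)     # each KV head serves 8 query heads
v = repeat_kv(v, n_heads // n_kv_heads)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = attn @ v                              # (batch, n_heads, seq, head_dim)
```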

Model Size

Llama 2 comes in four model sizes: 7B, 13B, 34B, and 70B parameters (the 34B version has not been released).

Pre-training

The pre-training data for Llama 2 comprises 2 trillion tokens, about 40% more than Llama 1. The AdamW optimizer is used for training, with β1 = 0.9, β2 = 0.95, and eps = 10^-5. A cosine learning rate schedule is used, with a warm-up of 2000 steps and decay of the final learning rate to 10% of the peak learning rate. A weight decay of 0.1 and gradient clipping at 1.0 are applied.
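
To illustrate these optimizer and schedule settings, here is a minimal PyTorch sketch of AdamW plus a cosine schedule with linear warm-up; the peak learning rate and total-step count are placeholders, not values from the paper.

```python
import math
import torch

peak_lr, warmup_steps, total_steps = 3e-4, 2000, 500_000  # placeholder values

model = torch.nn.Linear(10, 10)   # stand-in for the actual Transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr,
    betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    # Linear warm-up over the first 2000 steps...
    if step < warmup_steps:
        return step / warmup_steps
    # ...then cosine decay down to 10% of the peak learning rate.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradients are clipped at 1.0:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step()
```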

Fine-tuning

An initial version of Llama-2-chat is created through supervised fine-tuning (SFT). Llama-2-chat is then iteratively refined using reinforcement learning from human feedback (RLHF), which includes rejection sampling and proximal policy optimization (PPO).
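
To give a conceptual picture of the rejection-sampling step, the sketch below samples several candidate responses per prompt and keeps the one a reward model scores highest; `generate_response` and `reward_model` are hypothetical stand-ins, not Meta’s actual training code.

```python
from typing import Callable, List, Tuple

def rejection_sample(
    prompts: List[str],
    generate_response: Callable[[str], str],    # hypothetical: samples one response from the current policy
    reward_model: Callable[[str, str], float],  # hypothetical: scores a (prompt, response) pair
    k: int = 4,
) -> List[Tuple[str, str]]:
    """For each prompt, sample K responses and keep the highest-reward one.
    The selected pairs can then be used for a further fine-tuning step."""
    best_pairs = []
    for prompt in prompts:
        candidates = [generate_response(prompt) for _ in range(k)]
        best = max(candidates, key=lambda resp: reward_model(prompt, resp))
        best_pairs.append((prompt, best))
    return best_pairs
```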

Tokenizer

Llama 2 uses the same tokenizer as Llama 1. Both employ the Byte Pair Encoding (BPE) algorithm implemented with SentencePiece. Similar to Llama 1, all numbers are split into separate digits, and unknown UTF-8 characters are decomposed into bytes. The total vocabulary size is 32k tokens.
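
As an example of how such a tokenizer might be used, the snippet below loads a SentencePiece BPE model and encodes a string; the `tokenizer.model` path is an assumption (the file ships with the Llama weights), and the exact pieces printed depend on that model file.

```python
import sentencepiece as spm

# Assumes the SentencePiece model distributed alongside the Llama weights.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.get_piece_size())                   # 32000 tokens for Llama 1 / Llama 2
pieces = sp.encode("Llama 2 in 2023", out_type=str)
print(pieces)  # numbers are split into individual digit tokens, e.g. '2', '0', '2', '3'
```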

Evaluation

Evaluation in the paper shows that Llama 2 outperforms other open source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.

Safety

The safety of Llama 2 is evaluated using three commonly used benchmarks, covering three key dimensions: truthfulness, i.e., whether the language model produces false information, measured with the TruthfulQA benchmark; toxicity, i.e., whether the language model generates toxic, rude, or harmful content, measured with the ToxiGen benchmark; and bias, i.e., whether the language model produces biased content, measured with the BOLD benchmark.

Furthermore, the latest AI-related content can be found in my newsletter.

Lastly, if there are any errors or omissions in this article, please kindly point them out.

Reference Material

Llama 2: Open Foundation and Fine-Tuned Chat Models

https://ai.meta.com/llama/
