LLAMA FROM SCRATCH

Rania Hossam
13 min read · Oct 22, 2023

Meta AI and Microsoft have joined forces to introduce Llama 2, the next generation of Meta’s open-source large language model.

The best part? Llama 2 is available for free, both for research and commercial use.

LLaMA: Large Language Model Meta AI
Large Language Model Meta AI (LLaMA 1) is the first version of the state-of-the-art foundational large language model released by Meta in February 2023. It is an impressive collection of foundation models, with parameter sizes ranging from 7 billion to 65 billion.

LLaMA 1 stands out due to its extensive training on trillions of tokens, showcasing that state-of-the-art models can be attained solely through publicly available datasets, without the need for proprietary or inaccessible data.

💡 Read the published paper LLaMA: Open and Efficient Foundation Language Models.
Notably, the LLaMA-13B model outperformed GPT-3, which has a significantly larger parameter count of 175 billion, on most benchmark datasets. This accomplishment highlights LLaMA’s efficiency in delivering top-tier performance with significantly fewer parameters.

The largest model in the collection, LLaMA-65B, holds its own against other leading models in natural language processing (NLP) such as Chinchilla-70B and PaLM-540B.

LLaMA stands out for its strong emphasis on openness and accessibility. Meta AI, the creators of LLaMA, demonstrated their dedication to advancing the field of AI through collaborative efforts by releasing their models to the research community, in notable contrast to OpenAI’s GPT-3 and GPT-4.

You can have a look at the full code on my GitHub.

LLaMA 2:

Llama 2 is a family of pre-trained and fine-tuned large language models (LLMs), ranging in scale from 7B to 70B parameters, from Meta AI. According to Meta, the Llama 2 Chat LLMs are optimized for dialogue use cases and outperform open-source chat models on most of the benchmarks they tested. Based on Meta’s human evaluations for helpfulness and safety, the company says Llama 2 may be “a suitable substitute for closed source models.”

Llama 2, like the original Llama model, is based on the transformer architecture, with several improvements: RMSNorm pre-normalization (inspired by GPT-3), the SwiGLU activation function (inspired by Google’s PaLM), and rotary positional embeddings (RoPE, inspired by GPT-Neo). Training used the AdamW optimizer. Llama 2’s primary differences from Llama are an increased context length (4096 vs. 2048 tokens) and grouped-query attention (GQA) instead of standard multi-head attention in the two larger models.

Now we are going to discuss the architecture from scratch:

To dive right into action, our first step is installing the necessary libraries and importing the required packages. I’ll begin by downloading a compact dataset from Hugging Face that provides a set of text sentences. These sentences will be transformed into tokens using the pretrained tokenizer from ‘daryl149/llama-2-7b-chat-hf’, the very same tokenizer used during LLaMA’s pre-training.

!pip install transformers datasets SentencePiece

import random
import math
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from transformers import LlamaTokenizer
from datasets import load_dataset

model_id = "daryl149/llama-2-7b-chat-hf"
tokenizer = LlamaTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

config = {
    'vocab_size': tokenizer.vocab_size,
    'n_layers': 1,
    'embed_dim': 2048,
    'n_heads': 32,
    'n_kv_heads': 8,
    'multiple_of': 64,
    'ffn_dim_multiplier': None,
    'norm_eps': 1e-5,
    'max_batch_size': 16,
    'max_seq_len': 64,
    'device': 'cuda',
}

dataset = load_dataset('glue', 'ax', split='test')
dataset = dataset.select_columns(['premise', 'hypothesis'])

test_set = tokenizer(
    random.sample(dataset['premise'], config['max_batch_size']),
    truncation=True,
    max_length=config['max_seq_len'],
    padding='max_length',
    return_tensors='pt',
)
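As a quick sanity check (hypothetical usage, just to confirm the shapes implied by the config above), the tokenized batch should be (max_batch_size, max_seq_len):

print(test_set['input_ids'].shape)                     # torch.Size([16, 64])
print(tokenizer.decode(test_set['input_ids'][0][:10])) # first few tokens of one premise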


  1. RMSNorm Pre-normalization

RMSNorm: Root Mean Square Layer Normalization

LLaMA normalizes the input of each transformer sub-layer, instead of normalizing the output.

Inspiration of including pre-normalization is taken from GPT3.

RMSNorm is an extension of Layer Normalization (LayerNorm). The motivation for using RMSNorm is the computational overhead of LayerNorm, which makes improvements slow and expensive. RMSNorm achieves performance comparable to LayerNorm while reducing the running time by 7%–64%.

Let’s first understand LayerNorm. It has two properties:

a. Re-centering: it makes the model insensitive to shift noise in both inputs and weights.

b. Re-scaling: it keeps the output representations intact when both inputs and weights are randomly scaled.

The RMSNorm paper argues that most of the benefit comes from re-scaling.

RMSNorm keeps only the re-scaling invariance and regularizes the summed inputs simply according to the root mean square (RMS) statistic.

Intuitively, RMSNorm simplifies LayerNorm by removing the mean statistic entirely, at the cost of sacrificing the invariance that mean normalization affords. When the mean of the summed inputs is zero, RMSNorm is exactly equal to LayerNorm. Although RMSNorm does not re-center, the original paper reports that this has little effect on performance in practice.
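Concretely, following the notation of the RMSNorm paper, each summed input a_i is divided by the RMS statistic and re-scaled by a learned gain g_i:

\bar{a}_i = \frac{a_i}{\operatorname{RMS}(\mathbf{a})}\, g_i, \qquad \operatorname{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^{2}}

This is exactly what the module below implements, with self.weight playing the role of the gain g.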

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x: torch.Tensor):
        # (m, seq_len, dim) * (m, seq_len, 1) = (m, seq_len, dim)
        # rsqrt: 1 / sqrt(x)
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor):
        # weight is a gain parameter used to re-scale the standardized summed inputs
        # (dim) * (m, seq_len, dim) = (m, seq_len, dim)
        return self.weight * self._norm(x.float()).type_as(x)

This custom module first standardizes the input x by dividing it by its root mean square, making it invariant to scaling. The learned gain self.weight is then applied element-wise to the standardized tensor, adjusting the magnitude of the values based on the learned scaling factor.
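As a minimal sanity check (hypothetical usage, relying on the config defined earlier), the module should preserve the input shape:

rms_norm = RMSNorm(config['embed_dim'])
x = torch.randn(2, 10, config['embed_dim'])
print(rms_norm(x).shape)   # torch.Size([2, 10, 2048])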

Rotary Embeddings (RoPE)

What’s the difference between the absolute positional encodings and the relative ones?

1. Absolute positional encodings are fixed vectors that are added to the embedding of a token to represent its absolute position in the sentence, so they deal with one token at a time. You can think of it as the pair (latitude, longitude) on a map: each point on Earth has a unique pair. (A minimal sketch of the classic sinusoidal version follows after this list.)

2. Relative positional encodings, on the other hand, deal with two tokens at a time and come into play when we compute attention: since the attention mechanism captures the “intensity” of how much two words are related to each other, relative positional encodings tell the attention mechanism the distance between the two words involved. So, given two tokens, we create a vector that represents their distance.
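For contrast, here is a minimal sketch (not part of Llama) of the classic sinusoidal absolute positional encodings from the original Transformer, using the torch and math imports from the setup above:

def sinusoidal_positional_encoding(seq_len, dim):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    position = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # one fixed vector per absolute position, added to the token embeddings

Each row is tied to a single absolute position, which is exactly the limitation that relative schemes such as RoPE address.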

So What are Rotary Embeddings?
In simple terms, Rotary Position Embedding, or RoPE, is a way to encode positional information in natural language processing models. This type of position embedding uses a rotation matrix to include explicit relative position dependency in self-attention formulation. RoPE has many valuable properties, such as being flexible enough to work with any sequence length, decaying inter-token dependency with increasing relative distances, and the ability to equip linear self-attention with relative position encoding.

Positional embeddings are important for natural language processing because they allow models to better understand the context in which words are used. When a model has a better idea of the position of the input tokens, it can produce more accurate predictions. For example, a language model that uses RoPE may better distinguish “I love pizza” from “Pizza is what I love” because the same words appear in different positions. With a better understanding of relative positioning, a model can make more nuanced predictions.

So, what problems with absolute positional encodings does this approach address?

  1. Absolute position vectors cover only a bounded range (e.g. positions [1–512]), so the maximum sequence length is bounded.
  2. Each position can instead be given a unique vector via a sinusoidal function, but learned and sinusoidal encodings perform similarly.
  3. Every positional embedding is independent of the others, so the distance between two positions is not directly encoded.

If you want to have a look at how rotary embeddings work, here they are written out as explicit rotation matrices:
def get_rotary_matrix(context_window, embedding_dim):
    # Build one (embedding_dim x embedding_dim) block-diagonal rotation matrix per position.
    R = torch.zeros((context_window, embedding_dim, embedding_dim), requires_grad=False)
    for position in range(context_window):
        for i in range(embedding_dim // 2):
            # theta_i = 10000^(-2(i-1)/dim) for i = 1..dim/2; here i is 0-indexed
            theta = 10000. ** (-2. * i / embedding_dim)
            m_theta = position * theta
            R[position, 2 * i, 2 * i] = np.cos(m_theta)
            R[position, 2 * i, 2 * i + 1] = -np.sin(m_theta)
            R[position, 2 * i + 1, 2 * i] = np.sin(m_theta)
            R[position, 2 * i + 1, 2 * i + 1] = np.cos(m_theta)
    return R
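A quick property check of this rotation-matrix formulation (hypothetical usage, not from the original post): the query/key dot product should depend only on the relative offset between positions.

R = get_rotary_matrix(context_window=16, embedding_dim=8)
q, k = torch.randn(8), torch.randn(8)
score_a = (R[5] @ q) @ (R[9] @ k)   # positions 5 and 9, offset 4
score_b = (R[2] @ q) @ (R[6] @ k)   # positions 2 and 6, offset 4
print(torch.allclose(score_a, score_b, atol=1e-5))   # expected: True

Llama’s reference implementation expresses the same per-pair rotations more efficiently with complex multiplication, which is what the next two functions do.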
def precompute_theta_pos_frequencies(head_dim, seq_len, device, theta=10000.0):
    # theta_i = 10000^(-2(i-1)/dim) for i = [1, 2, ... dim/2]
    # (head_dim / 2)
    theta_numerator = torch.arange(0, head_dim, 2).float()
    theta = 1.0 / (theta ** (theta_numerator / head_dim)).to(device)

    # (seq_len)
    m = torch.arange(seq_len, device=device)

    # (seq_len, head_dim / 2)
    freqs = torch.outer(m, theta).float()

    # complex numbers in polar form, c = R * exp(i * m * theta), where R = 1:
    # (seq_len, head_dim / 2)
    freqs_complex = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_complex
def apply_rotary_embeddings(x, freqs_complex, device):
    # In the last dimension, pairs of two values represent real and imaginary parts;
    # two consecutive values become a single complex number.

    # (m, seq_len, n_heads, head_dim) --> (m, seq_len, n_heads, head_dim/2, 2)
    x_pairs = x.float().reshape(*x.shape[:-1], -1, 2)

    # (m, seq_len, n_heads, head_dim/2)
    x_complex = torch.view_as_complex(x_pairs)

    # (seq_len, head_dim/2) --> (1, seq_len, 1, head_dim/2)
    freqs_complex = freqs_complex.unsqueeze(0).unsqueeze(2)

    # rotate each complex number
    # (m, seq_len, n_heads, head_dim/2)
    x_rotated = x_complex * freqs_complex

    # convert back to real numbers
    # (m, seq_len, n_heads, head_dim/2, 2)
    x_out = torch.view_as_real(x_rotated)

    # flatten the last two dimensions back to the original input shape
    # (m, seq_len, n_heads, head_dim)
    x_out = x_out.reshape(*x.shape)

    return x_out.type_as(x).to(device)
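Here is a quick end-to-end check of the two functions above (hypothetical usage on CPU, with head_dim taken from the config):

head_dim = config['embed_dim'] // config['n_heads']   # 2048 // 32 = 64
freqs = precompute_theta_pos_frequencies(head_dim, config['max_seq_len'], device='cpu')
q = torch.randn(2, config['max_seq_len'], config['n_heads'], head_dim)
q_rot = apply_rotary_embeddings(q, freqs, device='cpu')
print(q_rot.shape)   # torch.Size([2, 64, 32, 64]) -- same shape as the input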

The rotary position embeddings are only applied to the query and the keys, but not the values.

  • The rotary position embeddings are applied after the vectors q and k have been multiplied by the W matrices in the attention mechanism, whereas in the vanilla transformer the positional encodings are applied before.

KV Caching

KV caching is used in autoregressive decoder models such as GPT and Llama. In these models, tokens are generated one at a time, which is computationally expensive because certain calculations are repeated at every step. To address this, KV caching stores the previously computed Keys and Values so we don’t need to recalculate them for each new token. This significantly reduces the size of the matrices involved, making the matrix multiplications faster. The only trade-off is that KV caching requires more GPU memory (or CPU memory if no GPU is used) to store the Key and Value states.

class KVCache:
    def __init__(self, max_batch_size, max_seq_len, n_kv_heads, head_dim, device):
        self.cache_k = torch.zeros((max_batch_size, max_seq_len, n_kv_heads, head_dim)).to(device)
        self.cache_v = torch.zeros((max_batch_size, max_seq_len, n_kv_heads, head_dim)).to(device)

    def update(self, batch_size, start_pos, xk, xv):
        self.cache_k[:batch_size, start_pos:start_pos + xk.size(1)] = xk
        self.cache_v[:batch_size, start_pos:start_pos + xv.size(1)] = xv

    def get(self, batch_size, start_pos, seq_len):
        keys = self.cache_k[:batch_size, :start_pos + seq_len]
        values = self.cache_v[:batch_size, :start_pos + seq_len]
        return keys, values

During inference, the process operates on one token at a time, so the sequence length is one. This means that, for Key, Value, and Query, the linear layers and rotary embedding act on a single token at a specific position. The Key and Value projections are written into the cache so these calculations happen only once per position. The get method then retrieves the cached Keys and Values up to the current position, so their length extends beyond 1. During the scaled dot-product operation, the output size matches the query size, which generates only a single token.
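A small usage sketch (hypothetical, on CPU) of how the cache behaves during step-by-step decoding:

cache = KVCache(max_batch_size=16, max_seq_len=64, n_kv_heads=8, head_dim=64, device='cpu')
xk = torch.randn(16, 1, 8, 64)   # key projection of one new token per sequence
xv = torch.randn(16, 1, 8, 64)
cache.update(batch_size=16, start_pos=5, xk=xk, xv=xv)            # write at position 5
keys, values = cache.get(batch_size=16, start_pos=5, seq_len=1)
print(keys.shape)   # torch.Size([16, 6, 8, 64]) -- everything cached up to position 5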

Grouped Query Attention

Llama incorporates a technique called grouped-query attention (GQA) to address memory bandwidth challenges during the autoregressive decoding of Transformer models. The primary issue stems from the need to load decoder weights and attention keys/values at each processing step, which consumes excessive memory.

In response, two strategies are introduced: multi-query attention (MQA) and grouped-query attention (GQA).

  • Multi-query attention (MQA) involves utilizing multiple query heads with a single key/value head, which speeds up decoder inference. However, it has drawbacks such as quality degradation and training instability.
  • Grouped-query attention (GQA) is an evolution of MQA that strikes a balance by using an intermediate number of key-value heads (more than one, but fewer than the query heads). The GQA model splits the query into n_heads segments, as in the original multi-head attention, while the key and value are divided into n_kv_heads groups, so multiple query heads share the same key-value head.
  • By repeating key-value pairs for computational efficiency, the GQA approach optimizes performance while maintaining quality, as evidenced by the code implementation.
  • The code below implements grouped-query attention (GQA) in the context of an autoregressive decoder Transformer. Note that during inference, the sequence length (seq_len) is always 1.
def repeat_kv(x, n_rep):
    batch_size, seq_len, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    else:
        # (m, seq_len, n_kv_heads, 1, head_dim)
        # --> (m, seq_len, n_kv_heads, n_rep, head_dim)
        # --> (m, seq_len, n_kv_heads * n_rep, head_dim)
        return (
            x[:, :, :, None, :]
            .expand(batch_size, seq_len, n_kv_heads, n_rep, head_dim)
            .reshape(batch_size, seq_len, n_kv_heads * n_rep, head_dim)
        )
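A one-line shape check (hypothetical values matching the config above): repeating 8 key-value heads 4 times yields the 32 heads expected by the queries.

kv = torch.randn(2, 1, 8, 64)          # (batch, seq_len, n_kv_heads, head_dim)
print(repeat_kv(kv, n_rep=4).shape)    # torch.Size([2, 1, 32, 64])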

class SelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.n_heads = config['n_heads']
        self.n_kv_heads = config['n_kv_heads']
        self.dim = config['embed_dim']
        self.n_kv_heads = self.n_heads if self.n_kv_heads is None else self.n_kv_heads
        self.n_heads_q = self.n_heads
        self.n_rep = self.n_heads_q // self.n_kv_heads
        self.head_dim = self.dim // self.n_heads

        self.wq = nn.Linear(self.dim, self.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(self.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(self.n_heads * self.head_dim, self.dim, bias=False)

        self.cache = KVCache(
            max_batch_size=config['max_batch_size'],
            max_seq_len=config['max_seq_len'],
            n_kv_heads=self.n_kv_heads,
            head_dim=self.head_dim,
            device=config['device']
        )

    def forward(self, x, start_pos, freqs_complex):
        # seq_len is always 1 during inference
        batch_size, seq_len, _ = x.shape

        # (m, seq_len, dim)
        xq = self.wq(x)

        # (m, seq_len, h_kv * head_dim)
        xk = self.wk(x)
        xv = self.wv(x)

        # (m, seq_len, n_heads, head_dim)
        xq = xq.view(batch_size, seq_len, self.n_heads_q, self.head_dim)

        # (m, seq_len, h_kv, head_dim)
        xk = xk.view(batch_size, seq_len, self.n_kv_heads, self.head_dim)
        xv = xv.view(batch_size, seq_len, self.n_kv_heads, self.head_dim)

        # (m, seq_len, n_heads, head_dim)
        xq = apply_rotary_embeddings(xq, freqs_complex, device=x.device)

        # (m, seq_len, h_kv, head_dim)
        xk = apply_rotary_embeddings(xk, freqs_complex, device=x.device)

        # replace the entry in the cache
        self.cache.update(batch_size, start_pos, xk, xv)

        # (m, seq_len, h_kv, head_dim)
        keys, values = self.cache.get(batch_size, start_pos, seq_len)

        # (m, seq_len, h_kv, head_dim) --> (m, seq_len, n_heads, head_dim)
        keys = repeat_kv(keys, self.n_rep)
        values = repeat_kv(values, self.n_rep)

        # (m, n_heads, seq_len, head_dim)
        # seq_len is 1 for xq during inference
        xq = xq.transpose(1, 2)

        # (m, n_heads, seq_len, head_dim)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        # (m, n_heads, seq_len_q, head_dim) @ (m, n_heads, head_dim, seq_len) -> (m, n_heads, seq_len_q, seq_len)
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)

        # (m, n_heads, seq_len_q, seq_len)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)

        # (m, n_heads, seq_len_q, seq_len) @ (m, n_heads, seq_len, head_dim) -> (m, n_heads, seq_len_q, head_dim)
        output = torch.matmul(scores, values)

        # (m, n_heads, seq_len_q, head_dim) -> (m, seq_len_q, dim)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)

        # (m, seq_len_q, dim)
        return self.wo(output)

SelfAttention is a class that combines the mechanisms we have discussed. The key components of this class are as follows:

  • Linear transformations are applied to the input tensor for queries (xq), keys (xk), and values (xv). These transformations project the input data into a form suitable for processing.
  • The rotary embedding is applied to the query and key tensors using the precomputed complex frequencies. This step injects positional information before the attention computation.
  • The key-value pairs (k and v) are cached for efficient memory usage. The cached key-value pairs are retrieved up to the current position (start_pos + seq_len).
  • The query, key, and value tensors are prepared for Grouped-Query attention calculation by repeating key-value pairs n_rep times, where n_rep corresponds to the number of query heads that share the same key-value pair.
  • Scaled dot-product attention computation: the attention scores are computed by taking the dot product of the query and key, followed by scaling, and softmax is applied to obtain the final attention weights. The output size matches the query length, which is 1 during inference.
  • Finally, the module applies a linear transformation (wo) to the output, and the processed output is returned. A quick shape check of the block follows below.
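As a smoke test of the whole block (hypothetical usage; it assumes the ‘cuda’ device from the config is available, otherwise switch the config to ‘cpu’):

attn = SelfAttention(config).to(config['device'])
x = torch.randn(config['max_batch_size'], 1, config['embed_dim'], device=config['device'])
freqs = precompute_theta_pos_frequencies(config['embed_dim'] // config['n_heads'],
                                         config['max_seq_len'], config['device'])
out = attn(x, start_pos=0, freqs_complex=freqs[0:1])   # one token at position 0
print(out.shape)   # torch.Size([16, 1, 2048])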

SwiGLU

SwiGLU, as utilized in LLaMA2 models, is an activation function designed to enhance the performance of the position-wise feed-forward network (FFN) layers in the Transformer architecture.
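In the notation of the FeedForward block defined below (w1 and w2 project up, w3 projects back down), the full SwiGLU feed-forward network computes:

\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \big(\mathrm{Swish}(x W_1) \otimes x W_2\big)\, W_3, \qquad \mathrm{Swish}_\beta(x) = x \cdot \sigma(\beta x)

The helper functions below implement only the Swish part; the element-wise gating with the second projection happens inside FeedForward.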

def sigmoid(x, beta=1):
    return 1 / (1 + torch.exp(-x * beta))

def swiglu(x, beta=1):
    # This is the Swish (SiLU) activation, x * sigmoid(beta * x); the gating with
    # a second linear projection that completes SwiGLU happens in FeedForward below.
    return x * sigmoid(x, beta)

Feedforward

In the Transformer architecture, the feedforward layer plays a crucial role, typically following the attention layer and normalization. The feedforward layer consists of three linear transformations.

class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()

        hidden_dim = 4 * config['embed_dim']
        hidden_dim = int(2 * hidden_dim / 3)

        if config['ffn_dim_multiplier'] is not None:
            hidden_dim = int(config['ffn_dim_multiplier'] * hidden_dim)

        # Round hidden_dim up to the nearest multiple of the multiple_of parameter
        hidden_dim = config['multiple_of'] * ((hidden_dim + config['multiple_of'] - 1) // config['multiple_of'])

        self.w1 = nn.Linear(config['embed_dim'], hidden_dim, bias=False)
        self.w2 = nn.Linear(config['embed_dim'], hidden_dim, bias=False)
        self.w3 = nn.Linear(hidden_dim, config['embed_dim'], bias=False)

    def forward(self, x: torch.Tensor):
        # (m, seq_len, dim) --> (m, seq_len, hidden_dim)
        swish = swiglu(self.w1(x))
        # (m, seq_len, dim) --> (m, seq_len, hidden_dim)
        x_V = self.w2(x)

        # (m, seq_len, hidden_dim)
        x = swish * x_V

        # (m, seq_len, hidden_dim) --> (m, seq_len, dim)
        return self.w3(x)

During the forward pass, the input tensor x passes through multiple linear transformations. The SwiGLU gating, applied after the first transformation, enhances the expressive power of the model: the Swish-activated projection w1(x) is multiplied element-wise with the parallel projection w2(x). The final transformation w3 maps the tensor back to its original dimension.
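A quick shape check of the block (hypothetical usage with the config above; hidden_dim works out to 5504 after the rounding logic):

ffn = FeedForward(config)
x = torch.randn(2, config['max_seq_len'], config['embed_dim'])
print(ffn(x).shape)   # torch.Size([2, 64, 2048]) -- projected up to 5504 and back down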

Ultimate Transformer Model

The final culmination of Llama2, a powerful Transformer model, brings together the array of advanced techniques we’ve discussed so far. The DecoderBlock, a fundamental building block of this model, combines the knowledge of KV caching, Grouped Query Attention, SwiGLU activation, and Rotary Embedding to create a highly efficient and effective solution.

class DecoderBlock(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.n_heads = config['n_heads']
        self.dim = config['embed_dim']
        self.head_dim = self.dim // self.n_heads

        self.attention = SelfAttention(config)
        self.feed_forward = FeedForward(config)

        # RMSNorm before the attention block
        self.attention_norm = RMSNorm(self.dim, eps=config['norm_eps'])

        # RMSNorm before the feed-forward block
        self.ffn_norm = RMSNorm(self.dim, eps=config['norm_eps'])

    def forward(self, x, start_pos, freqs_complex):
        # (m, seq_len, dim)
        h = x + self.attention.forward(
            self.attention_norm(x), start_pos, freqs_complex)
        # (m, seq_len, dim)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out

class Transformer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.vocab_size = config['vocab_size']
        self.n_layers = config['n_layers']
        self.tok_embeddings = nn.Embedding(self.vocab_size, config['embed_dim'])
        self.head_dim = config['embed_dim'] // config['n_heads']

        self.layers = nn.ModuleList()
        for layer_id in range(config['n_layers']):
            self.layers.append(DecoderBlock(config))

        self.norm = RMSNorm(config['embed_dim'], eps=config['norm_eps'])
        self.output = nn.Linear(config['embed_dim'], self.vocab_size, bias=False)

        self.freqs_complex = precompute_theta_pos_frequencies(
            self.head_dim, config['max_seq_len'] * 2, device=config['device'])

    def forward(self, tokens, start_pos):
        # (m, seq_len)
        batch_size, seq_len = tokens.shape

        # (m, seq_len) -> (m, seq_len, embed_dim)
        h = self.tok_embeddings(tokens)

        # (seq_len, (embed_dim / n_heads) / 2)
        freqs_complex = self.freqs_complex[start_pos:start_pos + seq_len]

        # Consecutively apply all the decoder layers
        # (m, seq_len, dim)
        for layer in self.layers:
            h = layer(h, start_pos, freqs_complex)
        h = self.norm(h)

        # (m, seq_len, vocab_size)
        output = self.output(h).float()
        return output

model = Transformer(config).to(config['device'])
res = model.forward(test_set['input_ids'].to(config['device']), 0)
print(res.size())

The Transformer model stacks DecoderBlocks to form the complete architecture. The accompanying code shows how each DecoderBlock, with its SelfAttention, FeedForward, and RMSNorm layers, processes the hidden states, and how the surrounding Transformer ties together token embeddings, layer stacking, final normalization, and the output projection. Together with the precomputed rotary frequencies and the custom configuration, this gives a compact, from-scratch reproduction of the Llama 2 architecture.
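The print statement above should report a logits tensor of shape (max_batch_size, max_seq_len, vocab_size), i.e. torch.Size([16, 64, 32000]) for this tokenizer’s 32,000-token vocabulary. As a final, purely illustrative step (the weights here are untrained, so the decoded tokens are meaningless), a greedy next-token prediction simply takes the argmax over the vocabulary at the last position:

with torch.no_grad():
    logits = model(test_set['input_ids'].to(config['device']), 0)   # (16, 64, 32000)
    next_token_ids = logits[:, -1, :].argmax(dim=-1)                 # (16,)
    print(tokenizer.batch_decode(next_token_ids.unsqueeze(1)))       # one (random) token per sequence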

https://github.com/rania-hossam/LLAMA_FROM_SCRATCH_PYTORCH
