“Mastering Llama Math (Part-1): A Step-by-Step Guide to Counting Parameters in Llama-2”

Sarat Chinni
Oct 22, 2023


Since Meta's release of the Llama model in February 2023, we've witnessed a surge in Llama-based fine-tuned open-source models such as Alpaca and Vicuna. As of the time of writing this blog in October 2023, the landscape is dominated by popular open-source models from the Falcon, MPT, and Llama-2 series.

Notably, all of these models share a common architectural foundation: the decoder-only Transformer. Their key distinctions arise from variations in positional embeddings and attention mechanisms.

Table-1: Key Differences in Architecture between MPT, Falcon, and Llama-2

In this series of articles, I will walk you through, in complete detail, how to calculate parameters, FLOPs (floating-point operations), and memory requirements for both inference and training.

In this article, the first in the series, I'll guide you through calculating the parameter count for the Llama-2-13B model. We'll take a two-fold approach: first, we will break down the math step by step for a deep understanding; then, we will confirm the calculation using PyTorch in Google Colab.

Downloading and Loading the Llama-2 model

To access the Llama-2 model, you'll need to follow a simple two-step process. First, register your email address and accept the license agreement by clicking on this link. Once your registration with Meta is confirmed, log in to the Hugging Face website using the same email address. The Hugging Face platform will verify your Meta registration and grant you the access you need to download the model.

The following code snippet demonstrates how to load the Llama-2 model in a Google Colab environment:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Create an access token for your account here: https://huggingface.co/settings/tokens
token = "hf_..."  # replace with your Hugging Face access token

model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
model = AutoModelForCausalLM.from_pretrained(model_name, token=token)

The figure below shows the Llama-2–13B model architecture. In the next section, we will delve deeper into each component and analyze the number of parameters.

Figure-1: Llama-2-13B model architecture

A Closer Look into the Model Architecture

In this section, we will understand each line of the model architecture from Figure 1 and calculate the number of parameters in each block.

Embedding Block

(embed_tokens): Embedding(32000, 5120)

Language models see text in terms of tokens, which are sub-word units. Llama-2 uses the Byte-Pair Encoding (BPE) algorithm to define these sub-word units, with a vocabulary size of 32,000. Once the model tokenizes the text, it represents each token with a fixed-dimensional embedding of size d. Llama-2-13B, in particular, uses an embedding dimension of 5,120.

As a result, the number of parameters in the Embedding block (embed_parameters) totals to 32,000 x 5,120 = 163,840,000.
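Assuming the tokenizer and model loaded earlier are still in memory, these numbers can be read directly off the model (a minimal sanity check; the attribute names follow the module printout shown in Figure-1):

# Vocabulary size and embedding dimension
print(len(tokenizer))            # 32000
print(model.config.hidden_size)  # 5120

# Parameters in the embedding block
embed_parameters = model.model.embed_tokens.weight.numel()
print(embed_parameters)          # 32000 * 5120 = 163,840,000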

Attention block

(self_attn): LlamaAttention(
(q_proj): Linear(in_features=5120, out_features=5120, bias=False)
(k_proj): Linear(in_features=5120, out_features=5120, bias=False)
(v_proj): Linear(in_features=5120, out_features=5120, bias=False)
(o_proj): Linear(in_features=5120, out_features=5120, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)

As outlined in Table 1, Llama-2 adopts Grouped-query attention (GQA) for the 70B version, while the 13B version uses Multi-head attention (MHA). In GQA, the query heads within each group share a single Key and Value projection, which reduces the size of the Key-Value (KV) cache that must be stored for preceding tokens during inference.

In this blog post, our focus will be on calculating the parameter count for the Multi-head attention (MHA) block. As an exercise, you can work out the parameter count for the Grouped-query attention (GQA) block implemented in Llama-2-70B.

Llama-2-70B uses GQA with num_groups = 8, Llama-2-13B uses MHA, and Falcon uses Multi-Query Attention (MQA).

Within the MHA block of Llama-2-13B, there are 40 attention heads, each of dimension 128. Consequently, the W_Q matrix has shape 5120 x (128 x 40), which amounts to 26,214,400 parameters. The W_K, W_V, and W_O matrices have the same dimensions as W_Q within the MHA block. For a deeper understanding of MHA, you can refer to this excellent blog.

Hence, the parameter count for the entire attention block (attn_block_parameters) is 4 x 5120 x (128 x 40) = 104,857,600.
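With the model from earlier still loaded, the same number can be read off the first decoder layer (a quick sanity check using the module names from Figure-1):

attn = model.model.layers[0].self_attn
attn_block_parameters = sum(
    proj.weight.numel()
    for proj in (attn.q_proj, attn.k_proj, attn.v_proj, attn.o_proj)
)
print(attn_block_parameters)  # 4 * 5120 * 5120 = 104,857,600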

MLP Block

(mlp): LlamaMLP(
(gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
(up_proj): Linear(in_features=5120, out_features=13824, bias=False)
(down_proj): Linear(in_features=13824, out_features=5120, bias=False)
(act_fn): SiLUActivation()
)

Llama-2 uses a distinct Multi-Layer Perceptron (MLP) block, setting it apart from the conventional up_proj and down_proj structure commonly found in transformer models. In the traditional form, the standard MLP block is expressed as follows:

out = down_proj(actn_fn(up_proj(input)))

However, in the case of Llama-2, the MLP block comprises three layers: gate_proj, up_proj, and down_proj, combined as follows (where x denotes element-wise multiplication):

out = down_proj( act_fn(gate_proj(input)) x up_proj(input) )

The up_proj layer has size 5,120 x 13,824, resulting in 70,778,880 parameters. Similarly, the down_proj layer measures 13,824 x 5,120, and the gate_proj layer is 5,120 x 13,824.

In total, the mlp_block_parameters = 3 x 5120 x 13,824 = 212,336,640
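As before, this can be verified against the first decoder layer of the loaded model:

mlp = model.model.layers[0].mlp
mlp_block_parameters = sum(
    proj.weight.numel()
    for proj in (mlp.gate_proj, mlp.up_proj, mlp.down_proj)
)
print(mlp_block_parameters)  # 3 * 5120 * 13824 = 212,336,640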

RMS Norm layers

(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()

Llama-2 uses RMSNorm instead of the LayerNorm introduced in the Attention Is All You Need paper. RMSNorm normalizes activations by their root mean square and scales them with learnable parameters.

Figure-2: RMSNorm equation, g_i is a learnable parameter.
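For reference, the RMSNorm operation (in the notation of Figure-2, where a_i are the activations, g_i the learnable gains, and n the hidden dimension) can be written as:

$$\bar{a}_i = \frac{a_i}{\mathrm{RMS}(\mathbf{a})}\, g_i, \qquad \mathrm{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^2}$$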

The dimension of g_i in the above equation is the same as a_i, which is 5,120 for Llama-2–13B. The RMSNorm is applied before the Attention block and MLP block in each layer. RMSNorm is also used before the LM head.

In total, per transformer layer, per_layer_rms_norm_parameters = 2 x 5,120 = 10,240, and pre_lm_head_rms_norm_parameters = 5,120.
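A quick check on the loaded model (each RMSNorm's weight is its learnable gain vector):

layer = model.model.layers[0]
per_layer_rms_norm_parameters = (
    layer.input_layernorm.weight.numel()
    + layer.post_attention_layernorm.weight.numel()
)
print(per_layer_rms_norm_parameters)    # 2 * 5120 = 10,240
print(model.model.norm.weight.numel())  # pre-LM-head RMSNorm: 5,120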

LM Head

  (lm_head): Linear(in_features=5120, out_features=32000, bias=False)

The final LM classification head takes a feature vector of size 5,120 and maps it to 32,000 classes, one per vocabulary token.

In total, the lm_head_parameters = 5,120 x 32,000 = 163,840,000
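On the loaded model, this is simply the lm_head weight; note that Llama-2 does not tie the LM head to the input embedding, so these parameters are counted separately from the embedding block:

lm_head_parameters = model.lm_head.weight.numel()
print(lm_head_parameters)  # 5120 * 32000 = 163,840,000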

Putting everything together

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)

In the transformer architecture, the Attention block and the MLP block together form one transformer layer, and this layer is repeated multiple times (40 times in Llama-2-13B). To compute the total number of parameters, we use the equation below.

Total parameters = embed_parameters + num_layers x (attn_block_parameters + mlp_block_parameters + per_layer_rms_norm_parameters) + pre_lm_head_rms_norm_parameters + lm_head_parameters

Substituting the respective values:

Total parameters = 163,840,000 + 40 x (104,857,600 + 212,336,640 + 2 x 5,120) + 5,120 + 163,840,000 = 13,015,864,320
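The same arithmetic, written out as a few lines of Python before we ask PyTorch itself:

# Component counts derived in the sections above
embed_parameters = 32000 * 5120            # embedding block
attn_block_parameters = 4 * 5120 * 5120    # q, k, v, o projections
mlp_block_parameters = 3 * 5120 * 13824    # gate, up, down projections
per_layer_rms_norm_parameters = 2 * 5120   # two RMSNorms per decoder layer
lm_head_parameters = 5120 * 32000          # LM head

total_parameters = (
    embed_parameters
    + 40 * (attn_block_parameters + mlp_block_parameters + per_layer_rms_norm_parameters)
    + 5120  # RMSNorm before the LM head
    + lm_head_parameters
)
print(total_parameters)  # 13,015,864,320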

Verifying the count with PyTorch

To determine the number of parameters in the model you loaded above, you can use the following code snippet:

num_parameters = sum(p.numel() for p in model.parameters())
print(num_parameters)

# Number of parameters in Llama-2-13B: 13015864320

Hurray! Our count is exactly right!!

You can find the Llama-2 model counting notebook here.

As we conclude our exploration of Llama-2-13B's parameter counting, you're now equipped to tackle parameter counts for other models such as Falcon or whatever model is released next.

If you found this post valuable, please consider liking and subscribing for more insightful content. Thank you for being part of our community!
