DeepSeek-V3 is a cutting-edge model boasting 671 billion parameters, yet it cleverly activates only 37 billion per token, achieving remarkable efficiency. Its architecture is built on three core innovations: Multi-head Latent Attention, DeepSeekMoE, and a multi-token prediction training objective. These are further complemented by rule-based Group Relative Policy Optimization (GRPO) for reinforcement learning, advanced quantization techniques, and more. The result? The full training process required just 2.788 million H800 GPU hours, producing state-of-the-art (SOTA) performance. This architecture also serves as the foundation for DeepSeek-R1.
In my previous blog post, I thoroughly explained DeepSeek-R1's reinforcement-learning approach and its training pipeline. In this post, I'll delve into the key components of DeepSeek-V3's architecture, which strike a remarkable balance between computational efficiency and benchmark performance.
1. Multi-Head Latent Attention (MLA): Compressed Attention with Reconstructive Upsampling
Motivation
Transformer attention uses KV-caching to avoid redundant re-computation by storing the intermediate Key and Value vectors of all previous tokens. Storing separate Key (K) and Value (V) vectors for each attention head leads to a KV-cache memory footprint of O(n⋅d) for sequence length n and model dimension d, creating memory bottlenecks during long-context and batched inference. MLA reduces this to O(n⋅d_kv), where d_kv ≪ d, via low-rank latent projections while preserving attention fidelity.
Standard multi-head attention recap, from the DeepSeek-V2 paper:
Key and Value Compression and up-sampling:
Meaning: MLA projects the input hidden states into a low-rank latent K & V space via the down-projection matrix W_{DKV}, which compresses the hidden state h_t into the latent vector c_{KV}.
Up-projection matrices W_{UK} and W_{UV} reconstruct the keys and values for attention.
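To make the shapes concrete, here is a minimal NumPy sketch of the down- and up-projections, using the dimensions from the hyperparameter list further below; the variable names are my own, not DeepSeek's:

```python
import numpy as np

# Illustrative dimensions (see the hyperparameter list below).
d, n_h, d_h, d_c = 7168, 128, 128, 512

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d)) * 0.02           # down-projection: h_t -> c_KV
W_UK  = rng.standard_normal((n_h * d_h, d_c)) * 0.02   # up-projection to per-head keys
W_UV  = rng.standard_normal((n_h * d_h, d_c)) * 0.02   # up-projection to per-head values

h_t = rng.standard_normal(d)                 # hidden state of one token

c_KV = W_DKV @ h_t                           # cached latent vector: 512 elements
k_t  = (W_UK @ c_KV).reshape(n_h, d_h)       # reconstructed keys,   128 heads x 128 dims
v_t  = (W_UV @ c_KV).reshape(n_h, d_h)       # reconstructed values, 128 heads x 128 dims

print(c_KV.shape, k_t.shape, v_t.shape)      # (512,) (128, 128) (128, 128)
```

Only c_KV needs to be cached; the full keys and values can be reconstructed (or absorbed away, see next subsection) whenever attention is computed.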
Matrix Absorption for Efficiency
During inference, the up-projection matrices W^{UK}, W^{UV} are absorbed into the query and output projection weights to avoid explicit recomputation:
This eliminates the need to store full K/V vectors, converting the attention into a latent-space computation.
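Per head, and ignoring the RoPE part and scaling factors, the algebra behind the absorption looks like this (a sketch using the notation above):

```latex
q_t^{\top} k_s
  = \left( W^{UQ} c_t^{Q} \right)^{\top} \left( W^{UK} c_s^{KV} \right)
  = c_t^{Q\,\top} \underbrace{\left( W^{UQ\,\top} W^{UK} \right)}_{\text{precomputed once}} \, c_s^{KV}
```

Analogously, W^{UV} can be absorbed into the output projection, so attention over past tokens only ever touches the cached latent vectors c^{KV}.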
Decoupled Keys with Rotary Positional Embeddings (RoPE):
Rotary Positional Embeddings (RoPE) pose a challenge because they introduce position-dependent rotations to K and Q, which are incompatible with low-rank KV compression. To address this, MLA separates positional information from position-agnostic information by generating decoupled keys and queries through RoPE transformations. As a result, a small subset of query/key dimensions (d_r = 64) is reserved for RoPE, while the rest (d_c = 512) are position-agnostic:
The RoPE component is computed separately and added to the attention score:
This preserves positional sensitivity without disrupting the low-rank structure.
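A toy sketch of how an attention logit then splits into a position-agnostic part and a small RoPE part (the `rope` helper is a simplified stand-in for the real rotary embedding, and the dimension split follows the framing above):

```python
import numpy as np

d_c, d_r = 512, 64   # position-agnostic dims vs. decoupled RoPE dims

def rope(x, pos, base=10000.0):
    """Simplified rotary embedding: rotate dimension pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
q_c, k_c = rng.standard_normal(d_c), rng.standard_normal(d_c)  # position-agnostic parts
q_r, k_r = rng.standard_normal(d_r), rng.standard_normal(d_r)  # decoupled RoPE parts

pos_q, pos_k = 10, 3
# The RoPE contribution is computed separately and added to the content score.
score = (q_c @ k_c + rope(q_r, pos_q) @ rope(k_r, pos_k)) / np.sqrt(d_c + d_r)
print(score)
```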
Low-Rank Query Compression:
Queries undergo a similar compression-and-reconstruction process: they are projected to a latent space of dimension 1536. This does not shrink the inference-time KV cache, but it reduces intermediate activation memory during training.
Conclusion:
Hyperparameters — key Dimensions in MLA:
- Embedding dimension d = 7168
- Number of attention heads n_h = 128
- Per-head dimension d_h = 128
- KV compression dimension d_c = 512 (significantly smaller than n_h × d_h = 128 × 128 = 16,384)
- Query compression dimension d’_c = 1536
- Decoupled rotary position embedding dimension dʳ_h = 64
Memory Savings Analysis: Without MLA, for each token you would need to cache:
- Keys: n_h × d_h = 128 × 128 = 16,384 elements
- Values: n_h × d_h = 128 × 128 = 16,384 elements
- Total: 32,768 elements per token
With MLA, you only cache:
- Compressed latent vector cᵗᴷᵛ: d_c = 512 elements
- Decoupled rotary key kᵗᴿ: dʳ_h = 64 elements
- Total: 576 elements per token
This represents a ~57x reduction in KV cache size (32,768/576 ≈ 57).
The approach requires about 1.8 TFLOPs per layer vs. 4.2 TFLOPs for standard attention, while retaining more than 90% of standard attention's accuracy on perplexity benchmarks.
Up-sampling occurs on the fly during attention computation, adding minimal overhead (about 3% of layer FLOPs).
2. DeepSeekMoE: Dynamic Bias for Load Balancing
Motivation
LLMs tend to use Mixture-of-Experts (MoE) architectures to lower the number of active parameters per token while still delivering accurate predictions. In an MoE layer, the MLP module is divided into a set of smaller MLP modules (experts) and a small router module that directs each incoming token to a subset of the experts. To keep the load balanced across the experts, standard MoE approaches use auxiliary losses, which introduce conflicting gradients and computational overhead. DeepSeekMoE replaces them with token-level dynamic bias adjustments, yielding a load-balanced mixture of experts (almost) without auxiliary losses.
DeepSeekMoE elements:
Each MoE layer contains 1 shared expert (always active, ensuring baseline capacity) and 256 routed experts, among which 8 are activated per token via top-k selection.
Routing Mechanism: Each expert i is represented by a centroid vector. The centroid vectors reside in the same embedding space as the input tokens' hidden states and are learned during training via backpropagation. For token t, the affinity score s_{i,t} for expert i is computed from the token-centroid similarity through a sigmoid gate. A dynamic bias b_i per expert is added to s_{i,t} before selecting the top-K experts, in order to balance expert utilization.
The bias b_i is updated after each training step (see the sketch after this list):
- Overloaded experts: b_i←b_i−γ
- Underloaded experts: b_i←b_i+γ
- During training, 𝛾 is set to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens.
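A minimal NumPy sketch of this selection-only bias mechanism (the overload criterion here, comparing each expert's load to the mean load, is my simplification; the sizes are toy values, whereas the real model uses 256 routed experts with 8 activated per token):

```python
import numpy as np

def route_tokens(scores, bias, k=8):
    """Select top-k experts per token using bias-adjusted affinities.
    `scores`: (T, N_r) sigmoid affinities; `bias` influences only which experts are selected."""
    adjusted = scores + bias                         # (T, N_r)
    return np.argsort(-adjusted, axis=-1)[:, :k]     # indices of chosen experts per token

def update_bias(bias, topk, n_experts, gamma=0.001):
    """Nudge biases after each step: overloaded experts go down, underloaded experts go up."""
    loads = np.bincount(topk.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(loads - loads.mean())

# Toy example (real model: 256 routed experts, 8 per token, gamma = 0.001).
rng = np.random.default_rng(0)
N_r, T = 16, 1024
scores = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, N_r))))   # sigmoid affinities
bias = np.zeros(N_r)

topk = route_tokens(scores, bias, k=4)
bias = update_bias(bias, topk, N_r)
```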
Sequence-Wise Auxiliary Loss: To prevent extreme imbalance within individual sequences, a complementary auxiliary loss is applied:
where N_r is the number of routed experts, f_i measures the sequence-wise usage of expert i, P_i denotes the sequence-wise relative affinity (normalized as probabilities), 1(·) is the indicator function, and T denotes the number of tokens per sequence. The balance factor α is a hyperparameter assigned an extremely small value, so that the loss prevents extreme imbalance within any single sequence while minimally disturbing the gradients.
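For reference, reconstructing the loss from the definitions above (with K_r denoting the number of activated routed experts), it has the form:

```latex
\mathcal{L}_{\mathrm{Bal}}
  = \alpha \sum_{i=1}^{N_r} f_i \, P_i,
\qquad
f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T}
      \mathbb{1}\!\left( s_{i,t} \in \mathrm{Topk}\big( \{ s_{j,t} \mid 1 \le j \le N_r \}, K_r \big) \right),
\qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s'_{i,t}
```

where s'_{i,t} = s_{i,t} / Σ_j s_{j,t} are the affinities normalized into probabilities.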
Benefits:
- (Almost) No Auxiliary Loss: Eliminates performance degradation caused by balancing loss.
- Improved Efficiency: Maintains balanced utilization across experts with minimal overhead.
- Scalability: Supports fine-grained experts across distributed GPUs with near-zero communication overhead using techniques like node-limited routing. DeepSeekMoE achieves 95.4% expert utilization on 2048 GPUs, reducing communication by 22% vs. vanilla MoE.
Hyperparameters:
- Routed Experts/Layer: 256 experts with intermediate hidden dimension of 2048 for each expert.
- Activated Experts/Token: 8
- Bias update speed: γ = 0.001 for most of training, reduced to 0.0 for the final 500B tokens (as noted above).
3. Multi-Token Prediction (MTP): Dense Training Signals
Next-token prediction limits a model's ability to "plan ahead." Multi-token prediction (MTP) (Gloeckle et al., 2024) trains the model to predict D future tokens per position, improving sample efficiency and enabling faster inference via speculative decoding. In addition, MTP may encourage the model to pre-plan its representations for better prediction of future tokens. Unlike Gloeckle et al. (2024), who predict the D additional tokens in parallel using independent output heads, DeepSeek uses D sequential modules that predict the D additional tokens one after another, keeping the complete causal chain at each prediction depth. In DeepSeek-V3, D is set to 1.
Architecture
Sequential Prediction Modules: The k-th MTP module consists of a shared embedding layer, a shared output head, an unshared Transformer block, and an unshared projection matrix. For each input position, the MTP modules predict the D additional future tokens beyond the next token predicted by the main model. Causal masking forces predictions to depend only on prior tokens.
Cross-entropy losses are computed for each prediction depth, averaged, and added to the main model loss, weighted by a hyperparameter λ. λ is set to 0.3 for the first 10T tokens and decreased to 0.1 for the remaining 4.8T tokens.
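A PyTorch-style sketch of how the depth-wise losses could be combined (module wiring is omitted; `mtp_loss` and its arguments are hypothetical names, not DeepSeek's API):

```python
import torch
import torch.nn.functional as F

def mtp_loss(main_logits, mtp_logits_per_depth, targets, lam=0.3):
    """Combine the next-token loss with MTP losses averaged over depths.
    main_logits: (B, T, V) from the base model (predicts token t+1).
    mtp_logits_per_depth: list of (B, T, V) tensors; depth k predicts token t+1+k.
    lam: MTP weight (0.3 for the first 10T tokens, 0.1 afterwards)."""
    B, T, V = main_logits.shape
    loss_main = F.cross_entropy(main_logits[:, :-1].reshape(-1, V),
                                targets[:, 1:].reshape(-1))
    mtp_losses = []
    for k, logits in enumerate(mtp_logits_per_depth, start=1):
        shift = 1 + k                          # depth k predicts token t+1+k
        mtp_losses.append(F.cross_entropy(logits[:, :-shift].reshape(-1, V),
                                          targets[:, shift:].reshape(-1)))
    return loss_main + lam * torch.stack(mtp_losses).mean()
```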
Inference Acceleration
MTP enables speculative decoding:
- Draft D tokens in parallel using MTP modules.
- Verify drafts in a single pass with the base model.
This yields roughly 1.8× faster decoding.
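Below is a greedy, simplified sketch of the verify step (real speculative decoding uses a probabilistic acceptance test; `base_predict` is a hypothetical helper that returns the base model's greedy next-token choice at every position of its input in one pass):

```python
def speculative_step(base_predict, draft_tokens, prefix):
    """Verify a list of draft tokens with one pass of the base model (greedy variant)."""
    candidate = prefix + draft_tokens
    verified = base_predict(candidate)            # base model's next-token choice at every position
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Draft token i must match what the base model emits after prefix + accepted drafts.
        if verified[len(prefix) - 1 + i] == tok:
            accepted.append(tok)
        else:
            break
    # Always gain at least one token: the base model's own prediction after the accepted drafts.
    bonus = verified[len(prefix) - 1 + len(accepted)]
    return accepted + [bonus]

# Toy usage with a dummy "base model" that always continues with token 7.
out = speculative_step(lambda seq: [7] * len(seq), [7, 7, 5], prefix=[1, 2, 3])
print(out)   # [7, 7, 7]: two drafts accepted, third rejected, plus the base model's own token
```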
Hyperparameters of DeepSeek-V3
- Parameter count and high-level architecture: 671 billion parameters (37 billion active per token); 61 Transformer layers with a hidden dimension of 7168 (MLA dimensions are smaller, as mentioned) and 128 attention heads with per-head dimension d_h = 128. All FFNs except those in the first three layers were replaced with MoE layers. The number of attention heads balances parallelizability (more heads) with attention resolution (fewer heads risk losing fine-grained patterns).
- Context Length: 128,000 tokens.
- MoE expert counts: 256 routed experts and 1 shared expert per MoE layer, with 8 routed experts activated per token; each token is sent to at most 4 computational nodes. The number of experts per token was chosen to optimize H800 GPU utilization while keeping results accurate.
- MTP Depth: D=1.
- Training volume: in the pre-training stage, DeepSeek-V3 was trained on 14.8T tokens with a maximum sequence length of 4K. Later, the model was refined via SFT on selected data and responses and via Group Relative Policy Optimization (GRPO). I've elaborated on this here.
- Optimizer: AdamW (Loshchilov and Hutter, 2017), with hyper-parameters set to 𝛽1 = 0.9, 𝛽2 = 0.95, and weight_decay = 0.1.
- Learning-rate scheduling: the LR is first linearly increased from 0 to 2.2 × 10⁻⁴ during the first 2K steps, then kept constant at 2.2 × 10⁻⁴ until the model has consumed 10T training tokens. Subsequently, it is gradually decayed to 2.2 × 10⁻⁵ over 4.3T tokens, following a cosine decay curve. During training on the final 500B tokens, the LR is kept constant at 2.2 × 10⁻⁵ for the first 333B tokens and then lowered to 7.3 × 10⁻⁶ for the remaining 167B tokens (a sketch of this schedule appears after this list).
- Gradient clipping norm: 1.0.
- Batch-size scheduling: the batch size is gradually increased from 3072 to 15360 over the first 469B training tokens, then kept at 15360 for the remainder of training.
- Training framework: DeepSeek-V3 is trained with the HAI-LLM framework, an efficient and lightweight training framework crafted by DeepSeek's engineers from the ground up. Overall, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).
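As referenced in the learning-rate item above, the schedule can be written down directly as a function of the step count and tokens consumed (a sketch under the token counts quoted above; the bookkeeping is simplified):

```python
import math

def lr_schedule(step, tokens_seen,
                warmup_steps=2000,
                peak=2.2e-4, low=2.2e-5, final=7.3e-6,
                constant_until=10.0e12,    # 10T tokens at the peak LR
                decay_until=14.3e12,       # cosine decay finishes 4.3T tokens later
                final_const=0.333e12):     # first 333B of the last 500B tokens stay at `low`
    """Piecewise LR schedule as described above (token counts are approximate)."""
    if step < warmup_steps:                          # linear warmup over the first 2K steps
        return peak * step / warmup_steps
    if tokens_seen < constant_until:                 # constant at the peak LR until 10T tokens
        return peak
    if tokens_seen < decay_until:                    # cosine decay from `peak` to `low`
        progress = (tokens_seen - constant_until) / (decay_until - constant_until)
        return low + 0.5 * (peak - low) * (1.0 + math.cos(math.pi * progress))
    if tokens_seen < decay_until + final_const:      # constant at `low` for 333B tokens
        return low
    return final                                     # remaining ~167B tokens at the final LR
```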
4. FP8 Training: Precision-Aware Optimization
Strategy
- FP8 for Compute-Intensive Ops: GEMM operations (e.g., attention, FFN) use FP8 with E4M3 format (4 exponent, 3 mantissa bits).
- High-Precision Critical Ops: LayerNorm, embeddings, and MoE gating use BF16/FP32 to preserve stability.
- Tile/Block-Wise Quantization: activations are quantized per tile (1×128 elements, i.e., per token per 128 channels) and weights per block (128×128 elements), minimizing error propagation (see the sketch below).
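A minimal NumPy sketch of the block-wise scaling idea for weights (the FP8 rounding here is a crude software stand-in; real casting is done by hardware FP8 kernels):

```python
import numpy as np

E4M3_MAX = 448.0  # largest magnitude representable in the FP8 E4M3 format

def fake_fp8(x):
    """Crude stand-in for FP8 rounding: clip to the E4M3 range and keep ~4 mantissa bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(x)                        # x = m * 2**e with |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_blockwise(w, block=128):
    """Scale each (block x block) tile of a weight matrix into the FP8 range,
    then run a simulated round trip back to full precision."""
    r, c = w.shape
    tiles = w.reshape(r // block, block, c // block, block)
    scales = np.abs(tiles).max(axis=(1, 3), keepdims=True) / E4M3_MAX   # one scale per tile
    return (fake_fp8(tiles / scales) * scales).reshape(r, c)

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
err = np.abs(w - quantize_blockwise(w)).max()
print(f"max round-trip error: {err:.4f}")
```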
Results
- 40% reduction in GPU memory usage vs. BF16.
- 1.7× higher training throughput on NVIDIA H800 GPUs.
5. Distributed Training Infrastructure
Parallelism Strategy
- Pipeline Parallelism (PP): 16-way with DualPipe overlap.
- Expert Parallelism (EP): 64-way for MoE layers.
- Data Parallelism (DP): ZeRO-1 (optimizer states sharded).
Cluster Configuration:
- Cluster Setup: 2048 H800 GPUs interconnected via NVLink (intra-node) and InfiniBand (inter-node).
6. Benchmarks
Conclusion
DeepSeek-V3 demonstrates that architectural co-design — not merely scaling — drives efficiency in modern LLMs. Key takeaways:
- MLA reduces KV-cache memory by roughly 57× (per the analysis above) via latent projections, with minimal accuracy loss.
- Dynamic MoE Routing achieves near-perfect load balancing without auxiliary losses.
- MTP improves data efficiency while enabling 1.8× faster inference.
- The model's total training cost of 2.788M H800 GPU-hours (vs. ~12M for comparable dense models) sets a new precedent for sustainable large-scale AI.
Open Challenges:
- Scaling latent attention to million-token contexts.
- Tight coupling of MTP with reinforcement learning.
- Deeper MTP.