Demystifying DeepSeek AI, LLaMA, and OpenAI:
Neural network architectures used by LLaMA, DeepSeek, and OpenAI models:
LLaMA Architecture:
Architecture: LLaMA (Large Language Model Meta AI) employs a transformer-based, decoder-only architecture. It focuses on efficiency and accessibility, aiming to provide high performance with reduced computational requirements.
Key Features:
- Efficient Training: Designed to achieve strong performance with less computational overhead compared to some contemporaries.
- Accessibility: Intended to be more accessible for research and practical applications.
- LLaMA models are based on the transformer architecture. They use pre-normalization, where the input of each transformer sub-layer is normalized with RMSNorm. The models employ the SwiGLU activation function in the feed-forward blocks, with a hidden dimension of 2/3 · 4d rather than 4d (see the sketch after this list).
- LLaMA uses rotary positional embeddings (RoPE) at each layer. The models are trained with the AdamW optimizer under a cosine learning rate schedule, where the final learning rate is 10% of the maximum learning rate.
- They use an efficient implementation of causal multi-head attention to reduce memory usage. LLaMA models range from 7B to 65B parameters.
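To ground the architectural details above, here is a minimal sketch of RMSNorm pre-normalization and a SwiGLU feed-forward block in PyTorch. The module names, tensor shapes, and residual wiring are illustrative assumptions, not the actual LLaMA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization layer: rescales by the root mean square of the features."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with a SwiGLU gate; hidden size is roughly 2/3 * 4 * dim."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W_gate) gated elementwise against x @ W_up
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-normalization: normalize the sub-layer *input*, then add a residual connection.
x = torch.randn(2, 16, 512)
ffn, norm = SwiGLUFeedForward(512), RMSNorm(512)
y = x + ffn(norm(x))
```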
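Similarly, the cosine learning-rate schedule that decays to 10% of the peak rate can be sketched as follows. The AdamW hyperparameters and step count are placeholders, not LLaMA's actual training recipe.

```python
import math
import torch

model = torch.nn.Linear(512, 512)     # stand-in for the full model
max_lr, max_steps = 3e-4, 10_000
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)

def cosine_lr(step: int, min_ratio: float = 0.1) -> float:
    """Cosine decay from max_lr down to min_ratio * max_lr (10% of the peak)."""
    progress = min(step / max_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr(step)
    # ... forward pass, loss.backward(), and optimizer.step() would go here ...
```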
DeepSeek-R1 Architecture:
Architecture: DeepSeek has developed models like DeepSeek-V3 and DeepSeek-R1, which incorporate innovative techniques to enhance efficiency. Notably, they utilize a Mixture-of-Experts (MoE) architecture.
Key Features:
- Mixture-of-Experts (MoE): This approach activates only a subset of the model’s parameters for each token during inference, significantly reducing computational load (a minimal routing sketch follows this list). For instance, DeepSeek’s MoE models achieve performance comparable to dense models while using only a fraction of the computational resources.
- Cost-Effective Training: By leveraging MoE and other optimization techniques, DeepSeek has managed to train models at a fraction of the cost and time compared to traditional methods.
- DeepSeek-R1 is based on the DeepSeek-V3-Base model. It utilizes a multi-stage training pipeline incorporating reinforcement learning (RL). DeepSeek-R1-Zero is trained with RL directly on the base model, without supervised fine-tuning (SFT).
- DeepSeek-R1 incorporates cold-start data and a multi-stage training pipeline before RL. The training includes two RL stages, for improved reasoning and alignment with human preferences, and two SFT stages to seed reasoning and non-reasoning capabilities.
- A rule-based reward system is used, including accuracy and format rewards (a toy example follows this list). A language consistency reward is added during RL training to mitigate language mixing. Smaller dense models are distilled from DeepSeek-R1; the distilled models range from 1.5B to 70B parameters and are based on the Qwen and Llama series.
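To make the Mixture-of-Experts routing mentioned above concrete, here is a minimal top-k routing sketch in PyTorch. The expert count, gating scheme, and layer sizes are simplified assumptions, far smaller than DeepSeek's production MoE layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Routes each token to its top-k experts; only those experts run."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                 # (num_tokens, dim)
        scores = F.softmax(self.router(tokens), dim=-1)     # routing probabilities
        weights, chosen = scores.topk(self.top_k, dim=-1)   # top-k experts per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (chosen == e)                            # tokens routed to expert e
            if mask.any():
                rows = mask.any(dim=-1).nonzero(as_tuple=True)[0]
                gate = (weights * mask)[rows].sum(dim=-1, keepdim=True)
                out[rows] += gate * expert(tokens[rows])
        return out.reshape(x.shape)

layer = SimpleMoELayer(dim=64)
y = layer(torch.randn(2, 10, 64))   # only 2 of 8 experts run per token
```

Only the experts selected by the router run for a given token, which is where the computational savings of MoE come from.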
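The rule-based rewards can likewise be illustrated with a toy example: an accuracy reward that checks the final answer and a format reward that checks the expected tag structure. The tag names and scoring values here are assumptions for illustration, not DeepSeek-R1's actual reward specification.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the content of the <answer> tag matches the reference exactly."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

completion = "<think>2 + 2 = 4</think> <answer>4</answer>"
total = accuracy_reward(completion, "4") + format_reward(completion)  # 2.0
```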
OpenAI Architecture:
The GPT models utilize a transformer-based architecture, specifically a decoder-only structure. They employ multi-head self-attention mechanisms and position-wise feedforward networks. The models are densely activated, meaning all layers and neurons are active during inference.
Key Features:
- Dense Activation: Every part of the model is utilized for each input, leading to high computational costs.
- High Resource Consumption: Training and inference require substantial computational resources.
- OpenAI’s o1 series models introduced inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. OpenAI-o1-1217 serves as the reasoning baseline that DeepSeek-R1 aims to match, with GPT-4o-0513 and OpenAI-o1-mini also used as comparison baselines.
Key Differences in Neural Layers
Training Approach:
LLaMA focuses on pre-training with architectural improvements such as RMSNorm, SwiGLU, and RoPE.
DeepSeek-R1 uses a multi-stage training process that includes reinforcement learning and supervised fine-tuning from the base model (DeepSeek-V3-Base).
OpenAI’s Reinforcement Learning from Human Feedback (RLHF) stands out as a key differentiator, enabling its models to generate more aligned, user-friendly responses compared to LLaMA and DeepSeek.
Training Approach of OpenAI Models
Pre-Training on Large Text Datasets
- OpenAI’s GPT models are primarily trained using a decoder-only transformer architecture on massive, diverse, and high-quality text datasets. These datasets include publicly available internet text, books, articles, and other curated sources.
- The pre-training objective is causal language modeling, where the model predicts the next token in a sequence given the preceding tokens. This allows the model to learn patterns, structure, and linguistic nuances.
Objective Function: L = −∑_{i=1}^{N} log P(x_i | x_{<i}), where x_i is the token at position i and x_{<i} are the preceding tokens.
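The objective above is plain next-token cross-entropy. Here is a minimal sketch in PyTorch, where randomly generated logits stand in for a decoder-only model's output P(x_i | x_{<i}); the shapes and tensors are placeholders, not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 50_000, 128, 4
logits = torch.randn(batch, seq_len, vocab_size)        # stand-in for model output
tokens = torch.randint(0, vocab_size, (batch, seq_len)) # input token ids

# Predict token i from tokens < i: shift logits left and targets right.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# L = -sum_i log P(x_i | x_<i), averaged over tokens here.
loss = F.cross_entropy(pred, target)
```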
Fine-Tuning with Human Feedback (RLHF)
OpenAI’s later models, like GPT-3.5 and GPT-4, use Reinforcement Learning from Human Feedback (RLHF) to align the model’s behavior with user expectations.
- Supervised Fine-Tuning: The model is fine-tuned on labeled datasets curated by humans, where responses are explicitly annotated to teach the model preferred behaviors.
- Reward Model Training: Human annotators rank multiple responses generated by the model, creating a dataset for training a reward model (see the pairwise ranking sketch after this list).
- Reinforcement Learning: The model is fine-tuned using Proximal Policy Optimization (PPO), maximizing the reward signal from the reward model. This aligns the outputs with human preferences and ethical guidelines.
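As a concrete illustration of the reward-model step, the following sketch shows the common pairwise (Bradley-Terry style) ranking loss; the linear scorer and pooled embeddings are placeholders, not OpenAI's actual reward model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder reward model: maps a pooled response representation to a scalar score.
reward_model = nn.Linear(768, 1)

# Hypothetical pooled embeddings for a preferred and a rejected response.
chosen_repr = torch.randn(8, 768)
rejected_repr = torch.randn(8, 768)

r_chosen = reward_model(chosen_repr).squeeze(-1)
r_rejected = reward_model(rejected_repr).squeeze(-1)

# Pairwise ranking loss: push the chosen response's reward above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()   # the trained reward model then supplies the signal for PPO fine-tuning
```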
Why RLHF?
- Helps the model generate more helpful, truthful, and harmless responses.
- Reduces undesirable behaviors, such as generating biased or toxic content.
Key Architectural Optimizations
OpenAI has introduced improvements in its architecture over successive versions:
- Layer Normalization: Applied to stabilize training.
- Sparse Attention: Optimizations such as sparse attention patterns (GPT-3, for example, alternates dense and locally banded sparse attention layers) make processing longer contexts more efficient.
- Mixed-Precision Training: Uses FP16 alongside FP32 for faster training with lower memory consumption (see the sketch after this list).
- Parallelism: Implements data and model parallelism for scaling across large clusters of GPUs.
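A minimal sketch of mixed-precision training using PyTorch's automatic mixed precision (autocast plus gradient scaling); the model, data, and hyperparameters are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)            # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

for _ in range(10):
    optimizer.zero_grad()
    # Forward pass runs in FP16 where safe, FP32 elsewhere.
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = nn.functional.mse_loss(model(x), target)
    # Scale the loss to avoid FP16 gradient underflow, then unscale before stepping.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```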
Dataset Curation in OpenAI Models
- Datasets are meticulously curated to ensure high-quality training data. This involves removing low-quality, repetitive, or harmful content while ensuring a broad coverage of topics and styles.
Specific Layers and Techniques:
LLaMA uses RMSNorm for normalization, SwiGLU for activation, and RoPE for positional embeddings. DeepSeek-R1 adds a rule-based reward system, a language consistency reward, and distillation. Publicly available sources provide few comparable architectural details for OpenAI’s models.
Comparative Summary
Activation Strategy:
- OpenAI’s GPT: Dense activation; all parameters are active during inference.
- Meta’s LLaMA: Dense activation with a focus on efficiency.
- DeepSeek’s Models: Utilize Mixture-of-Experts, activating only necessary parameters, leading to significant computational savings.
Resource Efficiency:
- OpenAI’s GPT: High computational and energy requirements.
- Meta’s LLaMA: More efficient than GPT but still relies on dense activation.
- DeepSeek’s Models: Achieve high performance with lower computational costs due to MoE architecture.
Focus:
LLaMA aims for efficient training and strong performance using public data. DeepSeek-R1 prioritizes reasoning capabilities through a complex RL and SFT approach. OpenAI models, particularly the o1 series, are known for inference-time scaling via Chain-of-Thought reasoning.
In summary, while OpenAI’s GPT and Meta’s LLaMA models utilize dense transformer architectures, DeepSeek distinguishes itself by employing a Mixture-of-Experts approach, leading to more efficient utilization of computational resources.