Anil Prasad
5 min read · Jan 28, 2025

Demystifying DeepSeek AI, LLaMA, and OpenAI

Neural network architectures used by LLaMA, DeepSeek, and OpenAI models:

LLaMA Architecture:

Architecture: LLaMA (Large Language Model Meta AI) employs a transformer-based, decoder-only architecture. It focuses on efficiency and accessibility, aiming to provide high performance with reduced computational requirements.

Key Features:

  • Efficient Training: Designed to achieve strong performance with less computational overhead compared to some contemporaries.
  • Accessibility: Intended to be more accessible for research and practical applications.
  • LLaMA models are based on the transformer architecture. They use pre-normalization, where the input of each transformer sub-layer is normalized using the RMSNorm function. The models employ the SwiGLU activation function in the feed-forward network, with a hidden dimension of 2/3 · 4d instead of the usual 4d.
  • LLaMA uses rotary positional embeddings (RoPE) at each layer. The models are trained with the AdamW optimizer and a cosine learning rate schedule, where the final learning rate is 10% of the maximum learning rate.
  • They use an efficient implementation of causal multi-head attention to reduce memory usage. LLaMA models range from 7B to 65B parameters. A minimal sketch of the core building blocks follows this list.
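To make these building blocks concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block wired in the pre-normalization style described above. The class names, dimensions, and example tensor are illustrative rather than Meta’s actual implementation; RoPE and the attention sub-layer are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale activations by the reciprocal RMS along the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation: (SiLU(x W1) * x W3) W2."""
    def __init__(self, dim: int):
        super().__init__()
        # LLaMA uses a hidden size of roughly 2/3 * 4d instead of 4d.
        hidden = int(2 * 4 * dim / 3)
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Pre-normalization: the sub-layer input is normalized before the FFN,
# and the result is added back through a residual connection.
dim = 512
norm, ffn = RMSNorm(dim), SwiGLUFeedForward(dim)
x = torch.randn(2, 16, dim)   # (batch, sequence, features)
out = x + ffn(norm(x))        # one pre-norm residual sub-layer
```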

DeepSeek-R1 Architecture:

Architecture: DeepSeek has developed models like DeepSeek-V3 and DeepSeek-R1, which incorporate innovative techniques to enhance efficiency. Notably, they utilize a Mixture-of-Experts (MoE) architecture.

Key Features:

  • Mixture-of-Experts (MoE): This approach activates only a subset of the model’s parameters during inference, significantly reducing computational load. For instance, DeepSeek’s MoE models achieve performance comparable to dense models while using only a fraction of the computational resources (a routing sketch follows this list).
  • Cost-Effective Training: By leveraging MoE and other optimization techniques, DeepSeek has managed to train models at a fraction of the cost and time compared to traditional methods.
  • DeepSeek-R1 is based on the DeepSeek-V3-Base model. It utilizes a multi-stage training pipeline incorporating reinforcement learning (RL). DeepSeek-R1-Zero is trained using RL directly on the base model without supervised fine-tuning (SFT).
  • DeepSeek-R1 incorporates cold-start data and a multi-stage training pipeline before RL. The training includes two RL stages for improved reasoning and human preference alignment, and two SFT stages to seed reasoning and non-reasoning capabilities.
  • A rule-based reward system is used, including accuracy and format rewards. A language consistency reward is used during RL training to mitigate language mixing. Smaller dense models are distilled from DeepSeek-R1; the distilled models range from 1.5B to 70B parameters and are based on the Qwen and Llama series.
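To illustrate the core idea behind Mixture-of-Experts, here is a minimal routing sketch in PyTorch: a router scores every expert for each token, but only the top-k experts actually run. The expert count, layer sizes, and `top_k` value are illustrative and do not reflect DeepSeek’s real configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Mixture-of-Experts layer: a learned router picks top-k experts per token."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Every expert is scored, but only top-k run per token.
        scores = self.router(x)                              # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # best k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer(dim=64)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

Because only k experts execute per token, the per-token compute stays close to that of a much smaller dense model even though the total parameter count is large.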

OpenAI Architecture:

The GPT models utilize a transformer-based architecture, specifically a decoder-only structure. They employ multi-head self-attention mechanisms and position-wise feedforward networks. The models are densely activated, meaning all layers and neurons are active during inference.
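As a concrete illustration of this dense, decoder-only setup, here is a minimal sketch of causal multi-head self-attention using PyTorch’s built-in scaled dot-product attention. The head count and dimensions are arbitrary placeholders, not OpenAI’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Decoder-only self-attention: each token may attend only to earlier tokens."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for multi-head attention.
        shape = (b, t, self.num_heads, d // self.num_heads)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        # is_causal=True applies the lower-triangular mask used in autoregressive decoding.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

attn = CausalSelfAttention(dim=256, num_heads=8)
x = torch.randn(1, 10, 256)
print(attn(x).shape)  # torch.Size([1, 10, 256])
```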

Key Features:

  • Dense Activation: Every part of the model is utilized for each input, leading to high computational costs.
  • High Resource Consumption: Training and inference require substantial computational resources.
  • OpenAI’s o1 series models introduced inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. OpenAI-o1-1217 serves as a benchmark for reasoning tasks that DeepSeek-R1 aims to match, and GPT-4o-0513 and OpenAI-o1-mini are also used as baselines for comparison.

Key Differences in Neural Layers

Training Approach:

LLaMA focuses on pre-training with architectural improvements such as RMSNorm, SwiGLU, and RoPE.

DeepSeek-R1 uses a multi-stage training process which includes reinforcement learning and supervised fine-tuning from the base model (DeepSeek-V3-Base).

OpenAI’s Reinforcement Learning from Human Feedback (RLHF) stands out as a key differentiator, enabling its models to generate more aligned, user-friendly responses compared to LLaMA and DeepSeek.

Training Approach of OpenAI Models

Pre-Training on Large Text Datasets

  • OpenAI’s GPT models are primarily trained using a decoder-only transformer architecture on massive, diverse, and high-quality text datasets. These datasets include publicly available internet text, books, articles, and other curated sources.
  • The pre-training objective is causal language modeling, where the model predicts the next token in a sequence given the preceding tokens. This allows the model to learn patterns, structure, and linguistic nuances.

Objective Function:

L = -\sum_{i=1}^{N} \log P(x_i \mid x_{<i})

where x_i is the token at position i and x_{<i} are the preceding tokens.
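In code, this objective reduces to token-level cross-entropy between the model’s predictions and the input shifted by one position. A minimal sketch, with random logits standing in for a real model’s output:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))   # a toy token sequence
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model output

# Predict token i+1 from positions <= i: shift logits and targets by one.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# Cross-entropy is the negative log-likelihood averaged over positions.
loss = F.cross_entropy(pred, target)
print(loss.item())
```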

Fine-Tuning with Human Feedback (RLHF)

OpenAI’s later models, like GPT-3.5 and GPT-4, use Reinforcement Learning from Human Feedback (RLHF) to align the model’s behavior with user expectations.

  • Supervised Fine-Tuning: The model is fine-tuned on labeled datasets curated by humans, where responses are explicitly annotated to teach the model preferred behaviors.
  • Reward Model Training: Human annotators rank multiple responses generated by the model, creating a dataset for training a reward model (a minimal loss sketch follows this list).
  • Reinforcement Learning: The model is fine-tuned using Proximal Policy Optimization (PPO), maximizing the reward signal from the reward model. This aligns the outputs with human preferences and ethical guidelines.
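The reward-model step can be sketched with the standard pairwise ranking loss: the score of the human-preferred (“chosen”) response is pushed above the rejected one. The scalar rewards below are placeholders for a real reward model’s outputs.

```python
import torch
import torch.nn.functional as F

# Stand-ins for reward-model scores over a batch of (chosen, rejected) response pairs.
chosen_rewards = torch.tensor([1.2, 0.4, 0.9])
rejected_rewards = torch.tensor([0.3, 0.6, -0.1])

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).
# Minimizing it widens the margin between preferred and rejected responses.
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
print(loss.item())
```

The subsequent PPO stage then optimizes the language model against this learned reward, typically with a KL penalty toward the supervised fine-tuned model so the policy does not drift too far.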

Why RLHF?

  • Helps the model generate more helpful, truthful, and harmless responses.
  • Reduces undesirable behaviors, such as generating biased or toxic content.

Key Architectural Optimizations

OpenAI has introduced improvements in its architecture over successive versions:

  • Layer Normalization: Applied to stabilize training.
  • Sparse Attention: Sparse attention patterns (used in GPT-3’s alternating dense and locally banded sparse layers) make processing longer contexts more efficient.
  • Mixed-Precision Training: Uses FP16 and FP32 for faster training with lower memory consumption (a minimal sketch follows this list).
  • Parallelism: Implements data and model parallelism for scaling across large clusters of GPUs.
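For the mixed-precision item specifically, a minimal PyTorch training step with `torch.cuda.amp` looks roughly like this; the tiny linear model and random data are placeholders.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

optimizer.zero_grad()
# The forward pass runs in FP16 where it is safe, FP32 elsewhere.
with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    loss = nn.functional.mse_loss(model(x), target)
# The scaler rescales gradients so small FP16 values do not underflow.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```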

Dataset Curation in OpenAI Models

  • Datasets are meticulously curated to ensure high-quality training data. This involves removing low-quality, repetitive, or harmful content while ensuring broad coverage of topics and styles (a toy filtering sketch follows).
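As a toy illustration of that kind of filtering, the sketch below drops exact duplicates and very short documents. Real curation pipelines rely on much more sophisticated quality classifiers, fuzzy deduplication, and safety filtering.

```python
def curate(documents, min_words=20):
    """Drop exact duplicates and very short documents (a toy quality filter)."""
    seen = set()
    kept = []
    for doc in documents:
        normalized = " ".join(doc.split()).lower()
        if len(normalized.split()) < min_words:
            continue                      # too short to be useful training text
        if normalized in seen:
            continue                      # exact duplicate
        seen.add(normalized)
        kept.append(doc)
    return kept

corpus = ["A long enough example document ..."] * 3 + ["short"]
print(len(curate(corpus, min_words=3)))   # 1: duplicates and the short doc removed
```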

Specific Layers and Techniques:

LLaMA uses RMSNorm for normalization, SwiGLU for activation, and RoPE for positional embeddings. DeepSeek-R1 uses a rule-based reward system, a language consistency reward, and distillation (a generic distillation sketch follows). OpenAI has not publicly disclosed comparable architectural details for its recent models.
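DeepSeek’s distilled models are reportedly fine-tuned on reasoning data generated by DeepSeek-R1 rather than by matching logits, but the classic soft-label form of distillation is a useful reference point. The sketch below uses random logits to stand in for real teacher and student models.

```python
import torch
import torch.nn.functional as F

vocab_size, temperature = 100, 2.0
teacher_logits = torch.randn(4, vocab_size)   # stand-in for the large teacher model
student_logits = torch.randn(4, vocab_size)   # stand-in for the small student model

# Soft-label distillation: the student matches the teacher's softened distribution.
teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
print(loss.item())
```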

Comparative Summary

Activation Strategy:

  • OpenAI’s GPT: Dense activation; all parameters are active during inference.
  • Meta’s LLaMA: Dense activation with a focus on efficiency.
  • DeepSeek’s Models: Utilize Mixture-of-Experts, activating only necessary parameters, leading to significant computational savings.

Resource Efficiency:

  • OpenAI’s GPT: High computational and energy requirements.
  • Meta’s LLaMA: More efficient than GPT but still relies on dense activation.
  • DeepSeek’s Models: Achieve high performance with lower computational costs due to MoE architecture.

Focus:

LLaMA aims for efficient training and strong performance using public data. DeepSeek-R1 prioritizes reasoning capabilities through a complex RL and SFT approach. OpenAI models, particularly the o1 series, are known for their inference-time scaling via Chain-of-Thought reasoning.

In summary, while OpenAI’s GPT and Meta’s LLaMA models utilize dense transformer architectures, DeepSeek distinguishes itself by employing a Mixture-of-Experts approach, leading to more efficient utilization of computational resources.
