Inside Apple’s AI: Understanding the Architecture and Innovations of AFM Models

Van Tuan DANG
14 min read · Aug 2, 2024


Overview

In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have become the cornerstone of numerous applications, from natural language processing to complex decision-making systems. The state of the art in LLMs is marked by significant advancements in model architecture, training methodologies, and deployment strategies, aimed at enhancing both performance and efficiency. Apple's recent introduction of its Foundation Language Models (AFM) represents a pivotal moment in this landscape, promising to redefine how AI integrates into everyday technology.

Apple’s AFM models, showcased at the 2024 Worldwide Developers Conference, are designed to power the Apple Intelligence system deeply embedded into iOS 18, iPadOS 18, and macOS Sequoia. These models, namely the AFM-on-device and AFM-server, are tailored for efficiency and versatility, supporting a wide range of user-centric tasks such as text refinement, notification summarization, and visual content creation.

Figure 1: Modeling overview for the Apple foundation models.

The AFM development process follows a structured pipeline that includes data collection, preprocessing, pre-training, post-training, and optimization. Each stage is guided by Responsible AI principles to ensure ethical and effective AI solutions. Now, let’s decode what AFM has to offer.

I. Architecture of AFM Models

The architecture of Apple’s Foundation Language Models (AFM) is designed to provide high performance and efficiency. These models are based on the Transformer architecture, incorporating several innovative design choices that enhance stability, memory usage, and processing power. The architecture is optimized for both on-device and server-based applications, ensuring seamless integration across various platforms. Below is a table describing the components of an AFM:

Figure 2: Components that make up AFM

1. Transformer Architecture

  • Mechanism: Uses self-attention mechanisms to weigh the influence of different words in a sentence, capturing contextual relationships effectively.
  • Advantage: Handles long-range dependencies better than RNNs or LSTMs, making it ideal for language understanding and generation tasks.

2. Shared Input/Output Embedding Matrix

  • Mechanism: The same set of embeddings is used for both the input and the output, reducing the number of parameters.
  • Advantage: Memory efficiency without sacrificing the ability to learn rich representations, allowing the model to be more compact and efficient.
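
As a rough illustration of weight tying, here is a minimal PyTorch sketch (the class name and the vocabulary size are placeholders, not Apple's code): the same vocabulary-by-dimension matrix embeds input tokens and produces output logits.

```python
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Minimal sketch of input/output embedding tying; illustrative only."""
    def __init__(self, vocab_size=32000, d_model=3072):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie the output projection to the input embedding: one matrix serves both roles,
        # saving vocab_size * d_model parameters.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids, hidden_states):
        # token_ids -> embeddings on the way in; hidden_states -> logits on the way out.
        return self.embed(token_ids), self.lm_head(hidden_states)
```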

3. Pre-Normalization with RMSNorm

  • Mechanism: Applies RMSNorm to the input of each sub-layer (attention and feed-forward) rather than to its output.
  • Advantage: More stable and consistent training, particularly in deep networks, by preventing issues like exploding or vanishing gradients.
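
A minimal PyTorch sketch of RMSNorm as used in pre-normalization (illustrative, not Apple's implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization, applied to a sub-layer's input in a pre-norm block."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Unlike LayerNorm, RMSNorm does not subtract the mean; it only rescales,
        # which is cheaper and tends to be stable in deep Transformers.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```

In a pre-norm block, the residual stream is normalized before each sub-layer, e.g. x = x + attention(norm(x)).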

4. Query/Key Normalization

  • Mechanism: Ensures the scale of the query and key vectors remains consistent throughout the training process.
  • Advantage: Reduces training instability, leading to better model performance and more reliable convergence.
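
One common way to implement query/key normalization is to normalize each query and key vector before the attention dot product; the exact variant used in AFM may differ, so treat this PyTorch sketch as illustrative:

```python
import torch
import torch.nn.functional as F

def qk_normalized_scores(q, k, scale):
    """Attention logits with query/key normalization (illustrative variant).

    q, k: (batch, heads, seq, head_dim). Normalizing each query and key vector
    keeps the scale of the logits bounded, which helps training stability.
    """
    q = F.normalize(q, dim=-1)  # unit-norm queries
    k = F.normalize(k, dim=-1)  # unit-norm keys
    return torch.matmul(q, k.transpose(-2, -1)) * scale
```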

5. Grouped-Query Attention (GQA)

  • Mechanism: Groups of query heads share a single key/value head, so fewer key/value projections need to be computed and cached.
  • Advantage: Lower memory usage, particularly for the key/value cache during inference, while preserving attention quality, making the model more efficient at serving long contexts.
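
The PyTorch sketch below shows the core idea of GQA: several query heads share one key/value head, so the key/value cache shrinks by the grouping factor. Shapes and the attention math are generic, not Apple's implementation.

```python
import torch

def grouped_query_attention(q, k, v):
    """Illustrative grouped-query attention: many query heads share fewer KV heads.

    q: (batch, n_q_heads, seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim), with n_q_heads a multiple of n_kv_heads.
    """
    group_size = q.shape[1] // k.shape[1]
    # Repeat each key/value head so every group of query heads attends to its shared KV head.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.matmul(torch.softmax(scores, dim=-1), v)
```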

6. SwiGLU Activation

  • Mechanism: A combination of Swish and GLU (Gated Linear Units), providing a smoother gradient flow.
  • Advantage: Enhances model efficiency, speeding up training and inference times, and improving overall performance.
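
A minimal PyTorch sketch of a SwiGLU feed-forward block (layer names and the hidden size are placeholders, not Apple's values):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with a Swish-gated linear unit (illustrative sizes)."""
    def __init__(self, d_model=3072, d_hidden=8192):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # SiLU (Swish) on the gate path, multiplied element-wise with the up projection,
        # then projected back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```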

7. RoPE Positional Embeddings

  • Mechanism: Encodes position by rotating query and key vectors through position-dependent angles, which exposes relative positions to attention and supports long sequences.
  • Advantage: Better handling of long contexts, crucial for tasks requiring understanding of long text passages, such as document summarization or long-form content generation.
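
A compact PyTorch sketch of RoPE in the rotate-half convention; the base of 500,000 matches the pre-training setting mentioned later in this article, but the function itself is a generic illustration rather than Apple's code.

```python
import torch

def apply_rope(x, base=500_000.0):
    """Rotary positional embedding (rotate-half convention) for queries or keys.

    x: (batch, heads, seq, head_dim). Pairs of channels are rotated by a
    position-dependent angle, so relative position falls out of the q.k dot product.
    """
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```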

Model Dimensions

  • Mechanism: Specific dimensions for the on-device model, such as a model dimension of 3072, a head dimension of 128, and 24 query heads.
  • Advantage: Provides the necessary complexity and capacity to perform a wide range of tasks efficiently, ensuring a balanced approach to performance and resource utilization.
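
For reference, these dimensions can be collected in a small configuration object. The model dimension, head dimension, and query-head count come from the list above; the key/value head count of 8 is taken from Apple's technical report for AFM-on-device and is included here as an assumption.

```python
from dataclasses import dataclass

@dataclass
class AFMOnDeviceDims:
    """Container for the dimensions listed above (illustrative, not Apple's code)."""
    d_model: int = 3072      # model (hidden) dimension
    head_dim: int = 128      # per-head dimension
    n_query_heads: int = 24  # query heads
    n_kv_heads: int = 8      # shared key/value heads for GQA (per the AFM technical report)
```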

By breaking down the architecture into these components and explaining each one, we can see how Apple’s AFM models combine cutting-edge design and practical efficiency to deliver high performance across various applications.

II. Pre-training Process

The pre-training process for Apple’s Foundation Language Models (AFM) is designed to ensure that the models are highly capable and efficient. This process involves several stages, each focusing on different aspects of data quality and model training to build a robust foundation for further fine-tuning and application-specific adaptations. The following table describes the stages of the pre-training process:

Figure 3: The stages of the pre-training process

  1. Data Sources: Data is sourced from diverse channels including licensed publishers, publicly available datasets, and web crawls by Applebot, which respects robots.txt directives and skips sites that opt out:
  • Web Pages: Filtered to exclude profanity and unsafe content.
  • Licensed Datasets: High-quality long-context data from publishers.
  • Code: Data from open-source repositories, covering 14 common programming languages.
  • Math: Data from math-rich web domains and math forums.
  • Public Datasets: Filtered to remove PII.

This diversity ensures the model can generalize well across different language tasks, providing a broad knowledge base.

2. Quality Control: Rigorous filtering to exclude unsafe material and personally identifiable information (PII). Uses heuristics and model-based classifiers for safety and quality filtering:

  • Safety Filtering: Heuristics and model-based classifiers for filtering profanity and unsafe material.
  • De-duplication: Global fuzzy de-duplication using locality-sensitive n-gram hashing (a toy sketch of the idea follows below).
  • Decontamination: Filtering against 811 common pre-training benchmarks.

This filtering ensures the training data is clean and safe, which is crucial for developing reliable and ethically sound models.
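
The fuzzy de-duplication step above relies on locality-sensitive hashing over n-grams. The toy Python sketch below shows the general MinHash idea: documents whose signatures mostly agree are treated as near-duplicates. It is not Apple's pipeline, and the n-gram size and signature length are arbitrary.

```python
import hashlib

def ngram_minhash_signature(text, n=8, num_hashes=64):
    """Toy MinHash signature over word n-grams (not Apple's pipeline)."""
    words = text.lower().split()
    ngrams = {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}
    signature = []
    for seed in range(num_hashes):
        # The min over salted hashes approximates a random permutation of the n-gram set.
        signature.append(min(
            int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in ngrams
        ))
    return signature

def estimated_overlap(sig_a, sig_b):
    """Fraction of matching slots approximates the n-gram overlap of two documents."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated overlap exceeds a chosen threshold would then be collapsed to a single copy.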

3. Core Pre-Training: The model undergoes initial training on a broad mixture of data, consuming the majority of the compute budget at this stage:

  • AFM-server: Trained from scratch for 6.3 trillion tokens on 8192 TPUv4 chips.
  • AFM-on-device: Distilled and pruned from a larger model, initialized from a pruned 6.4B model, and trained for 188 billion tokens.

This stage establishes a strong foundation of general language understanding and knowledge, preparing the model for specialized training stages.

4. Continued Training: Focuses on refining the model with specific domains such as mathematics and programming code, incorporating high-quality licensed data to enhance context understanding:

  • Training Tokens: Additional 1 trillion tokens at a sequence length of 8192.
  • Data Mixture: Math and code data are upweighted and the bulk web-crawl is downweighted, with high-quality licensed data included.

This improves the model's performance in specialized areas, making it more capable of handling complex and domain-specific tasks.

5. Context-Lengthening: Extends the sequence length the model can handle, integrating synthetic long-context data to train the model for better handling of extended text sequences:

  • Training Tokens: Further 100 billion tokens at a sequence length of 32768.
  • RoPE Base Frequency: Increased from 500k to 6,315,089 to support long-context generalization (see the sketch below).

This allows the model to manage long-form content effectively, which is essential for applications like document summarization and detailed content generation.
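
To see why raising the RoPE base frequency helps long contexts, note that a larger base lowers every non-trivial rotation frequency, so positional angles advance more slowly with position and stay well separated across 32,768-token sequences. A small illustrative computation (the values 500,000 and 6,315,089 come from the report; the rest is generic):

```python
def rope_frequencies(base, head_dim=128):
    """Per-channel-pair rotation frequencies used by RoPE for a given base."""
    half = head_dim // 2
    return [base ** (-i / half) for i in range(half)]

# The larger base lowers every non-trivial frequency, so rotation angles grow more
# slowly with position; here a mid-range channel rotates several times more slowly.
original = rope_frequencies(500_000)
lengthened = rope_frequencies(6_315_089)
print(original[32], lengthened[32])
```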

By implementing these stages with specific techniques and data management strategies, Apple ensures that their AFM models are both powerful and versatile, capable of performing a wide range of tasks with high efficiency and reliability.

III. Post-Training Techniques

Post-training techniques refine Apple's Foundation Language Models (AFM) to enhance their instruction-following capabilities, conversational skills, and task-specific performance. These techniques involve supervised fine-tuning and reinforcement learning from human feedback, ensuring the models are aligned with user expectations and perform efficiently in real-world applications. The table below describes the key components of the post-training process:

Figure 4: the key components of Post-Training

1. Supervised Fine-Tuning (SFT): a critical technique used to enhance Apple’s Foundation Language Models (AFM). By using high-quality human-annotated data, SFT refines the model’s ability to follow instructions and perform specific tasks accurately. This process involves collecting detailed datasets and applying specific fine-tuning methodologies.
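
Conceptually, SFT minimizes next-token cross-entropy on human-written demonstrations while masking the prompt out of the loss. The PyTorch sketch below is a generic illustration of that objective, not Apple's training code; `model` is assumed to return per-token logits.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Next-token cross-entropy on the human-written response, with the prompt masked out.

    model: callable returning (batch, seq, vocab) logits; prompt_ids / response_ids:
    (batch, prompt_len) and (batch, response_len) token id tensors.
    """
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids)
    shift_logits = logits[:, :-1, :]          # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].clone()
    # Only the demonstration response supervises the model; prompt positions are ignored.
    shift_labels[:, : prompt_ids.shape[1] - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```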

2. Reinforcement Learning from Human Feedback (RLHF): refines Apple’s Foundation Language Models (AFM) by incorporating human feedback to improve model performance and alignment with user expectations. This process involves several key components and two main algorithms: Iterative Training with Expert Critique (iTeC) and Mirror Descent Leave-One-Out (MDLOO).

  • Feedback Collection
  • Reward Function
  • Policy Gradient
  • Iterative Training with Expert Critique (iTeC)
  • Mirror Descent Leave-One-Out (MDLOO)

These processes illustrate how RLHF, combined with iTeC and MDLOO, ensures that Apple's Foundation Language Models are finely tuned to align with human preferences and perform optimally across various tasks. This approach enhances the models' responsiveness, accuracy, and reliability, making them more effective in real-world applications.
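
Apple's iTeC and MDLOO algorithms are specific to its report, and their details are beyond a short sketch. As the "leave-one-out" in MDLOO suggests, a common ingredient in such methods is a leave-one-out baseline: when several responses are sampled for the same prompt, each response's reward is compared against the average reward of the others. The Python sketch below illustrates only that generic idea, not Apple's algorithm.

```python
def leave_one_out_advantages(rewards):
    """Leave-one-out baseline for k >= 2 sampled responses to one prompt.

    Each response's advantage is its reward minus the mean reward of the other
    samples; responses better than their peers get positive advantages.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: four responses to the same prompt, scored by a reward model.
print(leave_one_out_advantages([0.2, 0.9, 0.4, 0.7]))
```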

Powering Apple Intelligence Features

The AFM models are integrated into Apple devices and services to enhance various functionalities. This section explains how these models are adapted and optimized for specific applications within Apple’s ecosystem, ensuring high performance and user satisfaction:

  1. Runtime-Swappable Adapters (LoRA)
  2. Inference Optimization
  3. Feature-Specific Enhancements

Figure 5 is a visual representation of how Apple Intelligence features are powered using AFM models. Here is a detailed breakdown of the diagram in the context of powering Apple Intelligence features:

Figure 5: Architecture of Apple Intelligence with adapters for the language on-device and server models and the image models. In this report we are only describing the text models.

Mechanism Details for On-device Models

  • Low-Rank Adaptation (LoRA): Decomposes the weight update into two low-rank matrices: W′ = W + ΔW with ΔW = A·B, where A and B are low-rank matrices and W is the original weight matrix (a minimal sketch follows this list).
  • Quantization Process: Converts weights from floating-point precision to lower precision to reduce computational load: Q(w) = round((w − w_min) / s), where s is the scale factor and w_min is the minimum weight value.
  • Early Exit Strategy: Allows the model to make predictions at intermediate layers when confidence is high, reducing computation time: if max(softmax(x)) > threshold, exit early.
  • Model Pruning: Removes less important weights and neurons to streamline the model: W′ = {w ∈ W : |w| > threshold}.
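
Below is a minimal PyTorch sketch covering the first two mechanisms: a LoRA wrapper around a frozen linear layer and the min-max quantization formula quoted above. Rank, alpha, and bit-width values are placeholders, and this is an illustration of the general techniques rather than Apple's adapter or compression code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W' = W + A @ B (illustrative)."""
    def __init__(self, base: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # the base weights stay frozen
        self.A = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.in_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the low-rank delta; only A and B are trained, so a small
        # adapter can be swapped at runtime without touching the shared base weights.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

def quantize_weights(w: torch.Tensor, bits=4):
    """Toy min-max quantization matching Q(w) = round((w - w_min) / s) above."""
    w_min, w_max = w.min(), w.max()
    s = (w_max - w_min) / (2 ** bits - 1)             # scale factor
    codes = torch.round((w - w_min) / s)              # low-precision integer codes
    return codes, s, w_min                            # s and w_min are kept for dequantization
```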

Mechanism Details for Feature-Specific Enhancements:

  • Text Summarization: Uses sequence-to-sequence models with attention mechanisms to generate summaries: Summary = Attention(H_encoder, H_decoder), where H_encoder and H_decoder are the hidden states of the encoder and decoder, respectively.
  • Visual Content Generation: Utilizes generative models like GANs or VAEs to create images from text descriptions: x^=G(z), where G is the generator network and z is a latent vector sampled from a distribution.

Powering Apple Intelligence features in this way demonstrates how AFM models are integrated into Apple's ecosystem, enhancing functionality and user experience across various devices and services. By using runtime-swappable adapters, optimizing inference, and tailoring features for specific applications, Apple ensures that these models deliver high performance and efficiency.

Evaluation

Evaluation of Apple’s Foundation Language Models (AFM) is critical to ensure they meet high standards of performance, reliability, and safety. This section covers the methods and metrics used to assess the models’ capabilities across various tasks and scenarios:

  1. Pre-training Evaluation

The pre-training evaluation of Apple’s Foundation Language Models (AFM) is essential to ensure the models meet high performance standards before fine-tuning. This involves using standardized benchmarks to assess the models’ baseline capabilities, providing a foundation for further improvements.

  • Benchmark Tests
  • Evaluation Metrics
  • Pre-training Results

The pre-training evaluation demonstrates that Apple’s Foundation Language Models are robust and capable across various benchmarks, highlighting their potential for further enhancement through post-training and fine-tuning. The detailed metrics and results provide a clear picture of the models’ strengths, ensuring they meet high standards of performance and reliability.

2. Post-training Evaluation

The post-training evaluation of Apple’s Foundation Language Models (AFM) is conducted to ensure that the models have improved significantly after fine-tuning and can perform well in real-world applications. This involves using both human assessments and automated benchmarks to measure the models’ performance on various tasks.

A. Human Assessments

Figure 6: Side-by-side evaluation of AFM-on-device and AFM-server against comparable models. We find that our models are often preferred over competitor models by human graders.

Figure 6 presents a comparative evaluation of AFM-on-device and AFM-server models against various competitor models, as assessed by human graders. The key findings are:

  • AFM-on-device: Demonstrates competitive performance, often preferred over models like Llama-3-8B, Gemma-7B, Phi-3-mini, Mistral-7B, and Gemma-2B. Notably, AFM-on-device is preferred 63.8% of the time over Gemma-2B.
  • AFM-server: Shows strong performance compared to high-end models like GPT-4, Llama-3-70B, Mixtral-8x22B, GPT-3.5, and DBRX-Instruct. AFM-server is preferred 56.4% of the time over DBRX-Instruct and 51.5% over GPT-3.5.

Overall, AFM models are frequently favored by human graders, indicating superior or competitive performance in quality, relevance, and safety of outputs compared to other leading models.

B. Automated Benchmarks

Figure 7: Instruction-following capability (measured with IFEval) for AFM models and relevant comparison models (higher is better). The AlpacaEval 2.0 LC results for Mistral 7B, Llama3 8B, Llama3 70B, DBRX-Instruct, and Mixtral 8x22B are obtained from the AlpacaEval leaderboard [Taori et al., 2023]. The Arena Hard results for comparison models are from the Arena-Hard-Auto leaderboard [Li et al., 2024b].

Figure 7 displays the instruction-following capability of AFM models compared to several other models using the IFEval metric, where higher scores indicate better performance. Both AFM-on-device and AFM-server models exhibit strong instruction-following abilities, achieving high IFEval scores in their evaluations.

Comparison Models:

  • Mistral 7B, Llama3 8B, Llama3 70B, DBRX-Instruct, and Mixtral 8x22B: Their scores are sourced from the AlpacaEval leaderboard, providing a benchmark for instruction-following performance.
  • Other Comparison Models: Their scores are taken from the Arena Hard leaderboard, ensuring a comprehensive evaluation across different models.
  • Results: AFM models generally outperform or match the performance of competitor models, highlighting their superior ability to follow instructions accurately.

The evaluation indicates that AFM models are highly effective in understanding and executing complex instructions, often surpassing other leading models in this critical capability.

Figure 8: Berkeley Function Calling Leaderboard Benchmark evaluation results on Function Calling API, alongside relevant sampled comparisons. Numbers were collected from the Gorilla leaderboard [Patil et al., 2023].

Figure 8 presents the results of the Berkeley Function Calling Leaderboard Benchmark evaluation, comparing the tool use capabilities of AFM models against other relevant models. The benchmarks assess the models’ abilities to issue tool calls accurately based on user requests and provided tool descriptions, adhering to the OpenAPI specification.

  • AFM-server achieves the highest scores in most categories, demonstrating superior performance in function calling tasks.
  • AFM-on-device also performs well, particularly in the Simple and Multiple categories, indicating strong on-device capabilities.
  • Comparison Models: While GPT-4 and Gemini-1.5-Pro-0514 show competitive results, AFM models often surpass them, especially in average and relevance scores.

Overall, the AFM-server excels in tool use applications, outperforming other models in accuracy and efficiency, as evaluated by the Berkeley Function Calling Leaderboard. This highlights the robustness of AFM models in handling complex function calling tasks effectively.

Figure 9: Writing ability on internal summarization and composition benchmarks (higher is better) for AFM-on-device and AFM-server alongside relevant sampled comparisons. We find that our models perform better or similar to related models.

Figure 9 compares the writing abilities of AFM-on-device and AFM-server models with other relevant models, focusing on summarization and composition tasks. The benchmarks used assign scores from 1 to 10 for model responses, with higher scores indicating better performance.

AFM-on-device: Demonstrates strong performance in both summarization and composition tasks, scoring comparably or better than models like Mistral-7B, Gemma-7B, and Phi-3-mini.

  • On-Device Summarization: AFM-on-device scores 9.1, indicating strong performance, while comparison models Mistral-7B and Gemma-7B both score 8.9, Phi-3-mini scores 8.8, and Gemma-2B scores 7.6.
  • On-Device Composition: AFM-on-device scores 9.0, showcasing excellent composition abilities, while comparison models Mistral-7B and Gemma-7B both score 9.1, Phi-3-mini scores 9.0, and Gemma-2B scores 8.0.

AFM-server: Shows excellent summarization and composition abilities, matching or outperforming models such as GPT-4 and Mixtral-8x22B, and significantly surpassing DBRX-Instruct and GPT-3.5.

  • Server Summarization: AFM-server scores 9.5, matching GPT-4 and Mixtral-8x22B, while comparison models DBRX-Instruct scores 9.2 and GPT-3.5 scores 8.6.
  • Server Composition: AFM-server scores 9.6, slightly below GPT-4 (9.7), while Mixtral-8x22B scores 9.5, DBRX-Instruct scores 9.2, and GPT-3.5 scores 8.9 in comparison models.

Figure 10: Math benchmarks for AFM-on-device and AFM-server alongside relevant sampled comparisons. GSM8K is 8-shot and MATH is 4-shot. All results are collected with an internal automated evaluation pipeline.

Figure 10 presents the performance of AFM-on-device and AFM-server models on math benchmarks, specifically GSM8K and MATH, compared with other relevant models. The evaluations are conducted using an internal automated evaluation pipeline with an 8-shot prompt for GSM8K and a 4-shot prompt for MATH.

  • AFM-on-device: Performs well in both GSM8K and MATH benchmarks, significantly outperforming models like Gemma-7B and Mistral-7B, even though it is smaller in size.
  • AFM-server: Exhibits excellent performance, especially in the MATH benchmark, where it closely matches GPT-4 and outperforms models like Llama-3-70B and Mixtral-8x22B.

C. Summarization Feature Evaluation

The summarization feature evaluation focuses on the performance of AFM models in generating concise and accurate summaries for various types of content, including emails, messages, and notifications. This evaluation is tailored to specific guidelines and uses specialized graders to assess summarization quality against a range of datasets.

Figure 11: Ratio of “good” and “poor” responses for three summarization use cases relative to all responses. Summaries are classified as “good”, “neutral”, or “poor” along five dimensions. A result is classified as “good” if all of the dimensions are good (higher is better). A result is classified as “poor” if any of the dimensions are poor (lower is better). Overall, our AFM-on-device adapter generates better summaries than comparable models.

Figure 11 presents the evaluation of human satisfaction with the summarization feature for three different use cases: Email, Message, and Notification. The responses are classified as “good” or “poor” relative to all responses, measured across five dimensions.

Good Result Ratio: This ratio represents the percentage of summaries that were classified as “good” across all five dimensions (composition, comprehensiveness, groundedness, following instructions, and readability). A higher ratio indicates better overall performance in generating high-quality summaries.

  • Email: AFM-on-device + Adapter performs slightly better than Gemma-7B and significantly better than Phi-3-mini and Llama-3-8B.
  • Message: AFM-on-device + Adapter outperforms all comparison models, with a notable margin over Gemma-7B and a substantial lead over Phi-3-mini and Llama-3-8B.
  • Notification: AFM-on-device + Adapter excels, achieving the highest good result ratio, significantly outperforming all comparison models.

Poor Result Ratio: This ratio indicates the percentage of summaries classified as “poor” if any of the five dimensions were rated poorly. A lower ratio is better, as it signifies fewer poor quality summaries.

  • Email: AFM-on-device + Adapter has a slightly higher poor result ratio than Gemma-7B, but a much lower one than Phi-3-mini and Llama-3-8B.
  • Message: AFM-on-device + Adapter shows a lower poor result ratio compared to all comparison models.
  • Notification: AFM-on-device + Adapter has the lowest poor result ratio, indicating the least number of poor quality summaries.
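
The grading rule described above, "good" only when every dimension is good and "poor" when any dimension is poor, can be written out directly; the dimension names follow the five listed earlier, and the example ratings are invented.

```python
def classify_summary(ratings):
    """Apply the grading rule: 'good' only if every dimension is good,
    'poor' if any dimension is poor, otherwise 'neutral'."""
    if all(r == "good" for r in ratings.values()):
        return "good"
    if any(r == "poor" for r in ratings.values()):
        return "poor"
    return "neutral"

# Example with invented ratings along the five dimensions used in the evaluation.
ratings = {
    "composition": "good",
    "comprehensiveness": "good",
    "groundedness": "neutral",
    "following_instructions": "good",
    "readability": "good",
}
print(classify_summary(ratings))  # one neutral dimension -> overall "neutral"
```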

Overall, the AFM-on-device adapter generates better summaries than comparable models, with higher ratios of good responses and lower ratios of poor responses across all three use cases. This demonstrates the effectiveness of AFM-on-device in producing high-quality summaries that meet user expectations.

IV. Responsible AI

Apple's Foundation Language Models (AFM) are built on Responsible AI principles designed to empower users, authentically represent diverse populations, design with care to avoid harm, and protect user privacy. Apple's Responsible AI approach ensures AFM models are safe, ethical, and user-centric. By embedding ethical considerations and safety measures at every step, Apple creates AI that is trustworthy, fair, and privacy-focused. This commitment underscores Apple's dedication to empowering users while maintaining high standards of integrity and responsibility in AI development.

V. Conclusion

The Apple Intelligence Foundation Language Models (AFM) exemplify a blend of advanced AI technology and ethical standards. Through rigorous pre-training, post-training, and evaluation processes, AFM models are designed to be powerful, responsible, and user-centric.

Apple’s commitment to Responsible AI ensures that these models empower users with intelligent tools while safeguarding privacy and promoting fairness. The careful approach to data filtering, legal compliance, and continuous improvement reflects Apple’s dedication to ethical AI.

AFM models set new benchmarks in AI development by achieving state-of-the-art performance without compromising safety and fairness. As Apple continues to innovate, the influence of AFM models will likely shape the broader AI landscape.

Apple's mission to democratize AI through open source and open science emphasizes inclusivity and transparency. This vision aims to ensure that AI serves humanity, driven by principles of openness, fairness, and responsibility.

In summary, Apple’s Foundation Language Models are a testament to the company’s vision of balancing innovation with responsibility, ensuring technology evolves to empower and protect users ethically.
