Intuition Machine

Artificial Intuition, Artificial Fluency, Artificial Empathy, Semiosis Architectonic

Comparison of Large Reasoning Models (LRMs)

--

1. OpenAI O1/O3

  • Architecture: Dense Transformer, RLHF-optimized
  • Training Data Transparency: Proprietary, Limited Details
  • Strengths: High-quality Reasoning, STEM, Coding
  • Reasoning Approach: Adjustable Effort (Low/Med/High), Step-by-Step
  • Efficiency Considerations: Adjustable Reasoning Mode, 200K Context
  • Best Use Case: General AI, STEM Research, API Function Calling

2. DeepSeek R1

  • Architecture: Mixture of Experts (MoE, 671B total, 37B active)
  • Training Data Transparency: Somewhat Open (14.8T tokens)
  • Strengths: Expert Coding & Debugging, Efficient MoE
  • Reasoning Approach: Multi-Expert Collaboration, Self-Improvement
  • Efficiency Considerations: High Token Efficiency (MoE, 312 tokens/sec)
  • Best Use Case: Enterprise Coding Assistant, Code Debugging

3. Gemini 2.0 (Flash Thinking)

  • Architecture: Multimodal Transformer (Text+Image+Speech)
  • Training Data Transparency: Proprietary, Multimodal (Text+Images+APIs)
  • Strengths: Multimodal (Images, Text, Speech), Deep Reasoning
  • Reasoning Approach: Flash Thinking Mode (Transparent Thought Process)
  • Efficiency Considerations: 1M Token Context (Experimental), Optimized for Speed
  • Best Use Case: AI Assistant (Multimodal, Research, Planning)

4. QwQ (Alibaba)

  • Architecture: Dense Transformer, Fine-tuned for Reasoning
  • Training Data Transparency: Limited Transparency, Strong Math Focus
  • Strengths: Math, Logical Puzzles, Chain-of-Thought Self-Verification
  • Reasoning Approach: Self-Reflection, Chain-of-Thought
  • Efficiency Considerations: Small Model (32B), Inference-Efficient
  • Best Use Case: Mathematical Proofs, Advanced Reasoning

5. Sky-T1

  • Architecture: Dense Transformer, Knowledge Distillation from QwQ
  • Training Data Transparency: Fully Open, 17K High-Quality Reasoning Examples
  • Strengths: Math & Coding, Structured Responses, Open
  • Reasoning Approach: Knowledge Distillation, Structured Reasoning
  • Efficiency Considerations: 32B, Optimized for Few-GPU Running
  • Best Use Case: Open Research, Custom Reasoning AI

6. Marco-o1

  • Architecture: 7B Transformer + MCTS-based Search
  • Training Data Transparency: Limited Details, Focus on Open-Ended Reasoning
  • Strengths: Creative Reasoning, Multi-path Exploration
  • Reasoning Approach: Tree Search (MCTS) + Chain-of-Thought
  • Efficiency Considerations: Small (7B), but Heavy Computation on Hard Tasks
  • Best Use Case: Creative Thinking, Brainstorming Complex Ideas

7. Claude 3.5

  • Architecture: Dense Transformer, Fine-tuned for Dialogue & Ethics
  • Training Data Transparency: Partially Open (some RLHF data available)
  • Strengths: Long-form Dialogue, Ethical AI, Context Memory
  • Reasoning Approach: Contextual Understanding, Ethical Guardrails
  • Efficiency Considerations: Long Context (200K), RLHF for Alignment
  • Best Use Case: Long-form Assistant, Decision-Making Support

8. Codestral (Mistral)

  • Architecture: Dense Transformer, Code-focused
  • Training Data Transparency: Proprietary, Open Benchmarks Used
  • Strengths: Cost-Effective Code Generation, Small but Optimized
  • Reasoning Approach: Optimized for Coding Efficiency & Debugging
  • Efficiency Considerations: High Speed (Small Model, Transformer Optimized)
  • Best Use Case: Cost-Effective Coding LLM

Exploring Diverse Approaches to Large Reasoning Models

Large reasoning models are a new breed of artificial intelligence that move beyond fluent text generation. They “think” through complex problems by breaking them into intermediate steps, thereby improving reliability and interpretability on tasks like coding, mathematics, and multimodal analysis. In recent years, industry giants and the open-source community alike have developed LRMs with distinct philosophies. Let’s explore these differences in depth.

1. Architectures and Training Techniques

OpenAI’s O1 and O3 Series

  • Architecture:
    OpenAI’s O-series models (O1, O3, and the smaller O3-mini) use a dense transformer backbone. The key innovation is not a new architecture but rather the addition of extensive reinforcement learning (RL) to teach the model to generate internal “chains-of-thought.”
  • Training Methodology:
    These models are fine-tuned with RLHF (Reinforcement Learning from Human Feedback), allowing them to plan and “think” during inference. A standout feature is the adjustable reasoning effort — users can choose low, medium, or high computational “thinking” to trade off speed for depth (a code sketch follows this list).
  • Resulting Behavior:
    They excel at producing step-by-step explanations on STEM problems, coding tasks, and logical puzzles, all while being optimized to handle very long contexts (up to 100K–200K tokens).
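Below is a minimal sketch of how that adjustable effort might be requested, assuming the OpenAI Python SDK and the reasoning-effort parameter OpenAI has publicly described for o3-mini; the exact model id and parameter availability depend on your account and API version.

```python
# Hedged sketch: requesting different "reasoning effort" levels from an
# o-series model via the OpenAI Python SDK. The model id and the
# reasoning_effort parameter follow OpenAI's published o3-mini API, but
# treat both as assumptions that may vary by release.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for effort in ("low", "medium", "high"):
    response = client.chat.completions.create(
        model="o3-mini",              # assumed reasoning-capable model id
        reasoning_effort=effort,      # trade response speed for deeper thinking
        messages=[{"role": "user",
                   "content": "Prove that the sum of two odd integers is even."}],
    )
    print(effort, "->", response.choices[0].message.content[:80])
```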

DeepSeek R1

  • Architecture:
    DeepSeek R1 takes a different route by employing a Mixture-of-Experts (MoE) design. Although the overall parameter count is huge (671 billion parameters), only a fraction (roughly 37B) is active per token thanks to selective expert activation (a minimal routing sketch follows this list).
  • Training Methodology:
    Like the O-series, R1 is refined with reinforcement learning. It’s trained on an enormous dataset — about 14.8 trillion tokens — with a strong emphasis on coding, technical documents, and math problems.
  • Resulting Behavior:
    This design yields impressive throughput and energy efficiency. R1 stands out particularly in coding tasks, debugging, and even automated refactoring, as it is tuned to trace through complex codebases with a long (128K token) context window.
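To make the MoE idea concrete, here is an illustrative top-k routing layer in PyTorch. The dimensions, expert count, and k are invented for the example and do not reflect R1's actual configuration; the point is only that each token pays for k experts rather than all of them.

```python
# Illustrative Mixture-of-Experts routing: a gate scores all experts per
# token, and only the top-k experts actually run, so the active parameter
# count per token is a small fraction of the total.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (n_tokens, d_model)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)        # mix the k selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique().tolist():  # run only chosen experts
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)  # torch.Size([10, 512]); 2 of 16 experts ran per token
```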

Google’s Gemini 2.0 (Fast & Flash Thinking Modes)

  • Architecture:
    Gemini 2.0 uses a dense transformer similar to models like PaLM but augments it to natively process multiple modalities (text, images, speech).
  • Training Methodology:
    Its training involves both supervised chain-of-thought datasets and RL fine-tuning that teaches it to “think out loud.” The Flash Thinking mode explicitly generates intermediate reasoning steps during inference.
  • Resulting Behavior:
    Gemini not only provides step-by-step reasoning but can also call external tools (like calculators or search APIs) to enhance accuracy. Its multimodal capability makes it adaptable across domains — from analyzing graphs to writing code and even generating speech.
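A minimal sketch of such a multimodal call, assuming the google-generativeai SDK and the experimental Flash Thinking model id Google publicized; both the id and the preview's exact capabilities may change as the release evolves.

```python
# Hedged sketch: a text-plus-image request to a Gemini "thinking" model
# through the google-generativeai SDK. The model id below matches the
# experimental id Google publicized for Flash Thinking; treat it as an
# assumption that may change.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

chart = PIL.Image.open("quarterly_sales.png")   # hypothetical local image
response = model.generate_content(
    ["Explain the trend in this chart step by step.", chart])
print(response.text)   # includes the model's visible reasoning and conclusion
```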

Open-Source Models: QwQ and Sky-T1

  • QwQ (Alibaba’s 32B Model):
    Built on the Qwen series, QwQ is fine-tuned to emphasize reasoning. It employs additional inference-time “reflection” to review and refine answers. Although it benefits from chain-of-thought training, detailed disclosures about its dataset remain limited.
  • Sky-T1 (UC Berkeley’s Model):
    Designed as an accessible, “open O1 analog,” Sky-T1 distills knowledge from QwQ. With just 17K curated examples focusing on math and coding, it demonstrates that targeted training can yield impressive reasoning skills even with a smaller dataset. Its training methodology — complete with transparent data generation scripts and rejection sampling — is fully open for inspection and replication.

2. Dataset Composition and Transparency

Closed-Source Models (OpenAI, DeepSeek, Gemini)

  • Data Sources:
    These models are trained on vast and varied corpora, including code repositories, textbooks, web articles, technical documents, and — especially in Gemini’s case — multimodal data such as images and paired text.
  • Transparency:
    While OpenAI and Google share high-level insights into their data composition, the detailed breakdowns remain proprietary. DeepSeek offers more technical details (e.g., 14.8 trillion tokens) but still keeps its exact corpus content under wraps.

Open-Source Models (QwQ and Sky-T1)

  • Data Sources:
    QwQ is believed to have been trained on a mix of math puzzles, coding examples, and chain-of-thought datasets, whereas Sky-T1’s entire training set (about 17K high-quality examples) is public.
  • Transparency:
    Open-source efforts shine in this area. Sky-T1, for instance, provides full access to its training scripts, data curation methods, and fine-tuning processes, ensuring that its reasoning pathways can be audited and improved by the community.

3. Efficiency and Optimization Techniques

OpenAI’s O-Series

  • Inference-Time Optimization:
    Instead of simply scaling model size, OpenAI’s approach focuses on allocating additional computation during inference for difficult queries. This “reasoning effort” can be dialed up or down based on the complexity of the task.
  • Context Handling:
    Specialized attention mechanisms and context window optimizations allow these models to process extremely long documents without sacrificing performance.

DeepSeek R1

  • MoE Efficiency:
    By activating only a fraction of its massive parameter set per token, R1 drastically reduces per-token computation, achieving high throughput (over 300 tokens per second on an A100 GPU) and energy efficiency.
  • Caching:
    Built-in context caching ensures that repeated queries are served almost instantly, further boosting efficiency in enterprise scenarios.
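As a rough illustration of the caching idea, here is a client-side analogue that memoizes identical queries. DeepSeek's production context caching is server-side and considerably more sophisticated; this only sketches the principle that a repeated query should not pay for a second model call.

```python
# Illustrative client-side analogue of context caching: memoize responses
# for repeated (system, prompt) pairs so identical queries skip the model
# call entirely. call_model is a hypothetical stand-in for an API call.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(call_model, system: str, prompt: str) -> str:
    key = hashlib.sha256(f"{system}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:              # only pay for the first occurrence
        _cache[key] = call_model(system, prompt)
    return _cache[key]

# Usage: cached_completion(my_api_call, "You are a code reviewer.", diff_text)
```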

Google Gemini 2.0

  • Tool Delegation:
    Gemini leverages external tool calls to handle subtasks (such as arithmetic or real-time searches), thus avoiding unnecessary internal computation.
  • Multimodal Parallelism:
    The system is designed to process different modalities concurrently, balancing the computational load while maintaining responsiveness even in its more deliberative Flash Thinking mode.

Open-Source Models

  • Lightweight Optimization:
    QwQ and Sky-T1, with their 32B-parameter scales, are engineered to run efficiently on more modest hardware. They benefit from modern quantization techniques (e.g., 4-bit or 8-bit quantization) and optimized inference libraries (a loading sketch follows this list).
  • Targeted Training:
    Sky-T1’s success via knowledge distillation illustrates that with carefully curated, domain-specific data, high-quality reasoning can be achieved without the need for massive-scale training.
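For instance, loading a 32B-class open model in 4-bit with Hugging Face transformers and bitsandbytes might look like the sketch below. The checkpoint id is an assumption (substitute whichever release you actually use), and a 32B model at 4-bit needs roughly 16 to 20 GB of GPU memory.

```python
# Minimal sketch: 4-bit quantized loading of a 32B-class open model with
# transformers + bitsandbytes. The model id is an assumed Hugging Face id
# for QwQ's preview release; swap in the checkpoint you actually want.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(load_in_4bit=True,
                           bnb_4bit_compute_dtype=torch.bfloat16)
name = "Qwen/QwQ-32B-Preview"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, quantization_config=quant, device_map="auto")

inputs = tok("Solve: if 3x + 5 = 20, what is x?",
             return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=256)[0]))
```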

4. Unique Features and Domain Strengths

General-Purpose Reasoning vs. Specialization

  • OpenAI O1/O3:
    These models are designed to be versatile workhorses, capable of handling everything from solving complex math problems and writing code to tackling spatial reasoning with vision inputs. Their chain-of-thought mechanism produces well-reasoned answers that inspire trust.
  • DeepSeek R1:
    R1 is tailored for technical tasks — particularly coding. Its debugging proficiency, multi-hop code analysis, and contextual code completion make it an excellent AI pair-programmer, though it may be less adapted for creative or conversational tasks.
  • Google Gemini 2.0:
    Gemini’s multimodal and agentic capabilities allow it to seamlessly integrate text, images, and even speech. Its Flash Thinking mode makes it not only a powerful problem solver but also an engaging tutor that explains its reasoning transparently.
  • Alibaba QwQ and Sky-T1:
    QwQ’s strengths lie in mathematics and analytical problem-solving, with a bilingual edge that supports multiple languages. Sky-T1 builds on this by combining math and coding prowess with highly structured outputs, making it especially attractive for academic research and domain-specific applications.

5. Practical Behavior and User Experience

User Interaction and Output Styles

  • OpenAI Models:
    When deployed via interfaces like ChatGPT, O1 and O3 deliver answers that are both reasoned and concise. Although they internally generate detailed chains-of-thought, these are usually hidden from the end user unless explicitly requested, striking a balance between transparency and usability.
  • DeepSeek R1:
    Users interact with R1 much like they would with a technical consultant. It not only provides code completions but also walks through its debugging process. Context caching ensures rapid responses for repeated queries, and its straightforward, technical tone is ideal for enterprise environments.
  • Google Gemini 2.0:
    Gemini offers a rich, multimodal user experience. For instance, you might upload an image of a graph, ask for an analysis, and then receive both a textual explanation and visual annotations. Its ability to switch between fast-response and in-depth reasoning modes makes it versatile for both everyday queries and challenging problems.
  • Open-Source Experiences (QwQ and Sky-T1):
    These models often reveal their internal reasoning processes, which is especially valuable for learning and research. However, their raw output may require additional parsing by the user. Their openness allows researchers to modify, fine-tune, and integrate them into custom workflows — a flexibility that commercial models typically do not offer.

6. Mainstream Versus Open Approaches

Capability and Performance

  • Mainstream Models:
    With massive datasets, extensive RL fine-tuning, and dedicated infrastructure, models like OpenAI’s O-series, DeepSeek R1, and Gemini 2.0 generally lead in overall performance across diverse tasks.
  • Open-Source Models:
    Although typically smaller and more specialized, models like QwQ and Sky-T1 are rapidly closing the gap in niche areas — especially in mathematics and coding — by focusing on targeted, high-quality training data.

Transparency and Customization

  • Closed Models:
    Proprietary systems sacrifice some transparency in exchange for a polished, turnkey user experience. Their internal processes and training datasets remain hidden, which may limit auditability and customizability.
  • Open Models:
    Transparency is a defining strength here. Open-source projects provide complete access to training data, code, and even internal reasoning traces. This level of openness not only builds trust but also enables researchers to tailor the models to specific domains and requirements.

Cost and Deployment

  • Enterprise-Grade Services:
    Mainstream models often require significant compute resources and come with per-use costs that reflect their scale and performance. They offer fully managed infrastructure and support, making them ideal for large-scale commercial applications.
  • Lightweight Solutions:
    Open-source models can often be run on commodity hardware. Their lower computational requirements and the possibility of fine-tuning for specific tasks can result in a far more cost-effective solution, particularly for niche applications.

Conclusion

The world of large reasoning models is marked by a vibrant diversity of approaches:

  • OpenAI’s O1/O3 series showcase how reinforcement learning and dynamic inference-time computation can turn a standard dense transformer into a versatile problem solver.
  • DeepSeek R1 leverages a Mixture-of-Experts architecture to optimize efficiency and excel in technical domains, particularly in coding and debugging.
  • Google’s Gemini 2.0 pushes the envelope with multimodal inputs and an explicit “thinking out loud” mode, blending general reasoning with domain-specific tool integration.
  • Open-source efforts such as QwQ and Sky-T1 demonstrate that with carefully curated data and innovative fine-tuning strategies, advanced reasoning can be achieved in an accessible, transparent, and customizable form.

Ultimately, the choice between these models hinges on the specific application: whether one needs a broadly capable, commercial-grade system or a specialized, open, and adaptable tool for research and domain-specific tasks. As the field continues to evolve, we can expect further cross-pollination of ideas — ensuring that the benefits of deep reasoning are accessible to both industry and academia, and driving the next wave of innovation in artificial intelligence.

OpenAI Deep Research Results

Comparison of Large Reasoning Models (LRMs)

This comparison examines OpenAI’s “o1” and “o3” models, DeepSeek R1, Google’s Gemini 2.0 Flash Thinking, and open-source QwQ and Sky-T1. We focus on their architecture/training, strengths & weaknesses, and availability/customization.

OpenAI o1 (First-Generation Reasoning Model)

Architecture and Training Methodology

  • Architecture: OpenAI’s o1 (introduced in late 2024) is a proprietary large language model built to perform extensive chain-of-thought (CoT) reasoning during inference. It’s based on a GPT-4-level transformer architecture but with the ability to generate lengthy hidden reasoning sequences before final answers (Deepseek R1 vs OpenAI o1 — DEV Community). This represented a shift in LLM design — o1 adds a new “test-time compute” dimension (the model effectively thinks longer by generating many internal tokens that the user doesn’t see) (AI Reasoning — What is It? — by James Wang — Weighty Thoughts) (Deepseek R1 vs OpenAI o1 — DEV Community).
  • Training: OpenAI has not fully disclosed o1’s training details, leading to much community speculation (Deepseek R1 vs OpenAI o1 — DEV Community). It was likely fine-tuned on a vast dataset of complex problems with step-by-step solutions, and further optimized with reinforcement learning from human feedback (RLHF) to follow instructions. The model’s “reasoning mode” was tuned to use CoT prompts internally, enabling it to solve problems that GPT-4 and Claude 3.5 previously couldn’t (Deepseek R1 vs OpenAI o1 — DEV Community). Early versions like o1-preview demonstrated the approach (e.g. surprising ability in tasks like chess and logic puzzles), and a smaller variant o1-mini was also introduced for efficiency (Deepseek R1 vs OpenAI o1 — DEV Community).

Strengths and Weaknesses

  • Strengths: o1 showed a dramatic leap in reasoning capability over earlier models. For example, whereas GPT-4 or Claude 3.5 failed at certain strategy tasks, o1-preview could win ~47% of games against a random chess opponent (Deepseek R1 vs OpenAI o1 — DEV Community), a feat previous models never achieved. It set new state-of-the-art scores on benchmarks like the ARC-AGI reasoning test (capturing headlines for its high score) (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). In areas of math, logical reasoning, and coding, o1 often finds correct solutions via its internal CoT process where others get stuck.
    However, o1 is not infallible — it can still “think” itself into wrong answers on surprisingly simple queries that require common sense or pattern recognition (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). Its chain-of-thought approach incurs a heavy computational cost (generating thousands of tokens per answer in some cases) (AI Reasoning — What is It? — by James Wang — Weighty Thoughts), making it slower and more expensive for trivial questions. Also, the model’s knowledge and skills are limited by its training data (e.g. it might struggle with very recent events or niche topics not covered during training).
  • Weaknesses: A notable downside of o1 is efficiency — giving the model more “thinking time” improves accuracy but uses far more compute, which is costly (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). Users have reported extremely verbose explanations due to the hidden reasoning tokens, and these invisible tokens still count against API usage limits. There were even controversies about OpenAI’s secrecy around o1’s hidden CoT mechanism and users being banned for probing it (Deepseek R1 vs OpenAI o1 — DEV Community). In terms of generalization, while o1 excels at tasks it was tuned for (math, logic puzzles, etc.), it doesn’t necessarily exhibit true human-like reasoning; it can fail on tasks that require real-world understanding or visual context (since it’s text-only). Finally, o1 is heavily aligned (filtered) by OpenAI — this makes it safe for general use but also means it avoids certain creative or sensitive topics where open models might be more flexible (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA); for some users, this constraint is a weakness.

Availability and Customization

  • Availability: OpenAI’s o1 is proprietary. It was made accessible through OpenAI’s ChatGPT interface and API for select users (particularly ChatGPT Enterprise or Pro subscribers) in late 2024, though the full-power version was not broadly advertised. OpenAI did not release the model weights or architecture details publicly (Deepseek R1 vs OpenAI o1 — DEV Community). This closed approach meant that developers could only use o1 via OpenAI’s services, and its “reasoning mode” was essentially a black box to outsiders. In early use, o1 had tiers: e.g. o1-preview (an early test model), o1-pro (the full version), and o1-mini (a scaled-down version with fewer reasoning steps) (Deepseek R1 vs OpenAI o1 — DEV Community). The o1-mini variant was offered to allow faster, cheaper inference with some reasoning ability, but still only through OpenAI’s API/ChatGPT.
  • Customization: Because o1 is closed-source, customization is very limited. Developers cannot fine-tune o1 on their own data or modify its behavior beyond what the API parameters allow. The main way to customize o1’s output is through prompting — e.g. providing detailed instructions or system messages to guide its reasoning. OpenAI’s platform might allow adjusting the “reasoning depth” indirectly (for example, by choosing o1-mini vs the full model, or by a parameter to limit how many reasoning tokens it uses), but users cannot alter the model’s weights or train it for specialized domains. In summary, one must rely on OpenAI for updates or improvements to o1. This lack of transparency and hackability spurred other AI labs to create open alternatives (Deepseek R1 vs OpenAI o1 — DEV Community), since researchers and developers couldn’t directly build on o1 for their own needs.

OpenAI o3 (Next-Generation Reasoning Model)

Architecture and Training Methodology

  • Architecture: o3 is OpenAI’s latest large reasoning model (successor to o1) that further pushes inference-time reasoning capabilities (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). It likely uses an architecture similar to o1 (a GPT-derived transformer with very large context length), but augmented to allow even longer and more elaborate reasoning chains. OpenAI scaled up test-time compute dramatically with o3 — in fact, on some benchmark runs, o3 generated such long chains-of-thought that a single query cost over $1,000 in cloud compute (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). This suggests o3 can iterate or branch on its reasoning internally (possibly using techniques like self-refinement or tree-of-thought exploration). The model also supports an enormous context window (OpenAI’s o1-Pro context was ~200k tokens; o3 reportedly can handle similar or greater lengths). This allows feeding o3 very large problems or datasets as input.
  • Training: OpenAI built o3 on the foundation of o1’s approach, likely incorporating more advanced training tricks. They would have used an even larger or more diverse dataset of multi-step problems, and possibly new reinforcement learning algorithms to fine-tune reasoning. OpenAI hasn’t published specifics, but hints can be gleaned from its performance: o3 set record-high scores on difficult benchmarks like GPQA Diamond (a graduate-level science Q&A test) with 87.7% (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat), far above typical LLMs. Achieving this may have involved ensemble or search-based techniques at inference (for example, generating multiple candidate reasoning paths and choosing the best). The model likely retains extensive RLHF tuning for alignment, and OpenAI might have introduced configurable reasoning levels — indeed, a later “o3-mini” variant allows selecting low/medium/high reasoning effort per query (OpenAI Challenges DeepSeek with New Features | Perigon). Overall, o3’s training focused on maximizing reasoning accuracy, even if it meant using far more computation per answer.

Strengths and Weaknesses

  • Strengths: o3 represents the cutting edge in LLM reasoning. It can solve extraordinarily complex problems that stumped earlier models, thanks to its ability to “think” in depth before responding. On scientific and mathematical benchmarks, o3 has achieved state-of-the-art results (e.g. ~88% on a graduate-level science QA, and similarly high on advanced math contests) (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). It also benefits from a huge context window, enabling it to consider vast amounts of information in one go — reportedly up to 5× more tokens than o1-Pro, on the order of a million tokens in its experimental mode (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). This means o3 can ingest entire research papers or lengthy reports and reason across them, a unique capability. Additionally, OpenAI improved the model’s flexibility: o3-mini (a distilled version) allows faster responses when full reasoning isn’t needed, giving users a trade-off between speed and rigor (OpenAI Challenges DeepSeek with New Features | Perigon). In summary, o3’s key strength is extreme reasoning performance — it’s arguably the most powerful model on reasoning-heavy tasks as of early 2025.
  • Weaknesses: The very feature that makes o3 powerful also brings weaknesses: efficiency and practicality. Running o3 at full capacity is extremely resource-intensive — its best performance requires thousands of inference tokens (and thus lots of GPU time), impractical for everyday use (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). OpenAI themselves have not widely deployed the full o3 due to cost; instead they offer the scaled-down o3-mini for general users. Another limitation is that o3, like o1, can still err on simpler tasks or have blind spots. For example, even after its ARC-AGI triumphs, o3 was shown to fail at some visual-pattern puzzles that a human finds trivial (AI Reasoning — What is It? — by James Wang — Weighty Thoughts). This highlights that o3’s reasoning is still synthetic and can miss common-sense shortcuts. Moreover, as a closed model, o3 inherits the lack of transparency: we often don’t know why it made a certain decision in its hidden chain-of-thought (beyond the final answer). Alignment and filtering remain in place, so o3 might avoid certain queries or produce cautious answers by design. Finally, generalization beyond its training distribution could be an issue — if a problem requires a type of reasoning o3 wasn’t exposed to, it might not spontaneously invent a solution strategy (whereas humans could try analogies or new approaches). In short, o3 is extremely powerful but not omni-capable, and using it requires significant computational investment.

Availability and Customization

  • Availability: o3 is currently proprietary and limited-access. As of early 2025, OpenAI has not released o3 widely; the full model is likely used internally and for select enterprise partners or evaluations. However, recognizing competitive pressure, OpenAI launched o3-mini in January 2025 as a free-tier model for ChatGPT users (OpenAI Challenges DeepSeek with New Features | Perigon). o3-mini is a smaller or restricted version of the reasoning model integrated into ChatGPT (accessible to anyone on the free plan, with certain limits). It offers the flavor of o3’s reasoning but on a more efficient scale — users can even choose between three reasoning effort levels (low, medium, high) to balance speed vs accuracy (OpenAI Challenges DeepSeek with New Features | Perigon). This move was a strategic response to rivals like DeepSeek R1 making advanced reasoning freely available (OpenAI Challenges DeepSeek with New Features | Perigon). For the full o3, OpenAI is expected to roll it into its paid offerings (e.g. ChatGPT Pro or an enterprise API) once they manage cost and stability. There is no open model download; one must use OpenAI’s interface or API to access o3’s capabilities.
  • Customization: Because no weights are released, direct customization of o3 by outsiders isn’t possible. You cannot fine-tune o3 on your own data or run it on your own servers. The only customization is via API settings and prompts. Developers can instruct the model with system messages (e.g. provide context or constraints in the prompt) and, in the case of o3-mini, select how much reasoning the model should apply. This can help tailor the output format or the degree of detail in explanations. OpenAI might also allow some configurable parameters (for instance, a reasoning_depth flag or similar, as implied by the low/medium/high options) (OpenAI Challenges DeepSeek with New Features | Perigon). But ultimately, o3’s behavior is fixed by OpenAI’s training; users can’t modify the core model. Any improvements or domain specializations would have to come from OpenAI’s side. This lack of customizability is a trade-off for using a cutting-edge model; many developers therefore consider open alternatives (like DeepSeek, QwQ, Sky-T1) if they need more control over the model’s training or deployment.

DeepSeek R1 (Open-Source Reasoning LLM)

Architecture and Training Methodology

  • Architecture: DeepSeek R1 is a large reasoning model released by the startup (or lab) DeepSeek, and it stands out for its Mixture-of-Experts (MoE) architecture. It effectively has 671 billion parameters spread across experts (Deepseek R1 vs OpenAI o1 — DEV Community) — far larger than most dense models — but uses gating so that only a subset of those parameters are active for a given query. This design gives R1 a huge capacity for knowledge and reasoning diversity while keeping inference tractable. The model architecture likely builds on a transformer backbone, with expert layers that specialize in different types of reasoning subtasks (the MoE approach). Such a structure helps it match the performance of extremely large models without requiring all 671B parameters to fire simultaneously.
  • Training: DeepSeek employed a multi-phase training pipeline with heavy reinforcement learning to instill reasoning abilities in R1. According to descriptions, they started with a base model (“DeepSeek-V3”) and did pure RL training (using a method called GRPO) on it to produce an intermediate model r1-zero (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). They then applied supervised fine-tuning (SFT) on a “cold start” dataset to improve r1-zero’s coherence and readability. Next, they did another round of RL (with added constraints for language consistency) to get a stronger checkpoint (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). Uniquely, the team generated additional training data by having the model itself produce reasoning traces and answers, using rejection sampling to filter for quality (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA); a code sketch of this step follows this list. The base model was again fine-tuned on this newly generated data (plus other collected data), and finally, a last phase of RL was done optimizing for reasoning correctness and preference alignment (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). This intricate process — alternating RL and SFT, and leveraging the model’s own “thoughts” to create training data — was key to R1’s success.
  • Unique Techniques: DeepSeek R1 introduced some novel tricks. For example, they mention using “Aha” moments as pivot tokens during chain-of-thought (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). In practice, this means the model learns to mark points in its reasoning where it has a breakthrough or needs to reconsider, allowing it to revise its approach mid-solution. This is akin to a human saying “Wait, let me rethink that step.” Such techniques improved R1’s ability to correct itself on the fly. Also, because R1 is open, the team and community have distilled its knowledge into smaller models (like fine-tuning Qwen or LLaMA on R1’s outputs) (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA), spreading the reasoning techniques to models of various sizes. In summary, R1’s training combined large-scale RL (uncommon for most LLMs due to complexity), model self-improvement via generated data, and MoE scaling to create a top-tier reasoning model at a fraction of the usual cost.
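A minimal sketch of the rejection-sampling step described above, with generate() and check_answer() as hypothetical stand-ins for the model call and an answer verifier (exact-match grading, unit tests, or similar):

```python
# Hedged sketch of self-generated SFT data: sample several reasoning traces
# per problem, keep only those whose final answer passes a verifier, and use
# the survivors as supervised fine-tuning examples.
def build_sft_dataset(problems, generate, check_answer, samples_per_problem=8):
    dataset = []
    for prob in problems:
        for _ in range(samples_per_problem):
            trace, answer = generate(prob)       # model writes CoT + answer
            if check_answer(prob, answer):       # rejection sampling: keep
                dataset.append({"prompt": prob,  # only verified solutions
                                "completion": trace})
                break                            # one good trace is enough
    return dataset
```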

Strengths and Weaknesses

  • Strengths: Despite being open-source, DeepSeek R1 reaches performance levels close to the best closed models. On complex reasoning benchmarks (e.g. ARC-AGI or challenging puzzles), R1 is only a notch below OpenAI’s o1 — it even surpassed the early o1-preview model in many cases (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). It has strong mathematical and coding abilities; testers found it to be “killer” in math (though still slightly under o1 in absolute terms) (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA), and very competent in coding tasks, essentially on par with o1 in initial coding tests (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). One area R1 outperforms others is in creative writing and open-ended responses: users describe its answers as having a more personal, free-wheeling style, with richer creativity and easier steerability than even OpenAI’s tuned models (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). This is partly because it has fewer censorship filters — R1’s outputs can feel like a genuine human internal monologue, which can be very engaging (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). Another major strength is efficiency and cost: R1 was heralded for offering o1-level reasoning at about 1/20th the cost (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). Its MoE design means you don’t need the entire 671B parameters active, saving inference cost. And since it’s open-source, users avoid API fees altogether. Finally, R1’s open license (MIT) means anyone can inspect and improve it — giving the community a foundation to build upon without starting from scratch. This openness has already led to distilled smaller models that retain much of its reasoning prowess (Deepseek R1 vs OpenAI o1 — DEV Community), making the technology more accessible.
  • Weaknesses: While R1 is excellent, it isn’t the absolute top performer in every category. OpenAI’s best (o1-pro and o3) still have an edge in pure reasoning accuracy and consistency (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). For example, on the hardest reasoning puzzles or competition math problems, R1 might get slightly fewer correct answers than o1. It’s also observed that R1 can be verbosely thoughtful to a fault — it may produce very long solutions that occasionally go in circles or include irrelevant details. In tests like LLM chess simulations, R1 made more mistakes in following the rules compared to o1 (18.6 mistakes per game on average vs ~3.7 for o1 in one analysis) (Deepseek R1 vs OpenAI o1 — DEV Community), suggesting it can lose focus or accuracy in extended tasks. The flip side of its lighter filtering is that alignment is a concern: R1 might generate content that OpenAI’s models would normally refuse or moderate. This means users have to be cautious and perhaps implement their own safety layers if using R1 in production. Additionally, R1’s huge size (MoE or not) means running it is non-trivial — not everyone has the hardware to deploy a 671B model, so practical use might involve relying on DeepSeek’s API or using the smaller distilled versions. In terms of generalization, R1 was heavily trained on math, coding, and language tasks; if you ask it something outside that realm (say, interpret an image or do a physical reasoning task beyond text), it won’t have special skills there (no multimodal capability built-in). Overall, R1’s weaknesses are minor trade-offs given its strengths: a bit less accuracy than the absolute leader, some verbosity/errors, and the usual open-model caveats of needing user oversight.

Availability and Customization

  • Availability: DeepSeek R1 is open-source and widely available. The model weights were released under an MIT license, meaning anyone can download and use them (Notes on Deepseek r1: Just how good it is compared to OpenAI o1 : r/LocalLLaMA). DeepSeek provided access in multiple forms: via a free web chat (so users can try R1 online easily) and through an API endpoint (Deepseek R1 vs OpenAI o1 — DEV Community). This makes R1 one of the most accessible cutting-edge LLMs. The only limiting factor is its size — running the full 671B MoE model requires serious computing power. To address that, the developers also released distilled versions (smaller models that approximate R1’s behavior) that can run on local machines with far less compute (Deepseek R1 vs OpenAI o1 — DEV Community). This tiered availability (from cloud API for full model to local smaller models) ensures a wide range of users can experiment with R1.
  • Customization: Because R1 is open-source, it is highly customizable. Developers and researchers can fine-tune R1’s weights on new data to adapt it to specific domains (though fine-tuning the full model is expensive, it’s possible, and fine-tuning the smaller distilled models is much easier). The open license encourages integration into other projects — you can embed R1 into your applications without legal barriers. The fact that all training details and data for R1 (and even the intermediate checkpoints) are available or described means the community can replicate or modify the training process (Sky-T1: Train your own O1 preview model within $450) (OpenAI Challenges DeepSeek with New Features | Perigon). For example, one could take R1 and apply further RLHF to enforce certain behaviors, or distill it down to an even smaller model for mobile use. R1’s openness has indeed spurred a wave of innovation; notably, OpenAI alleged that DeepSeek may have leveraged OpenAI’s own model outputs during training (OpenAI Challenges DeepSeek with New Features | Perigon), highlighting how R1 blurs the competitive line between open and closed models. From a user perspective, customization can be as simple as tweaking the “reasoning style” via prompts (since R1 will follow instructions and even adopt different reasoning strategies if asked). In summary, R1 offers maximal flexibility — you can use it as-is via API, run it yourself, retrain it, or slice it up for parts, with the blessing of its MIT license.

Google Gemini 2.0 Flash Thinking (Fast/“Thinking Mode”)

Architecture and Training Methodology

  • Architecture: Gemini 2.0 Flash Thinking is Google DeepMind’s entrant into reasoning-optimized LLMs. It is essentially a specialized mode of the broader Gemini 2.0 model (which is a large multimodal model). In Flash Thinking mode, Gemini operates as a reasoning engine that produces a transparent chain-of-thought. The architecture remains a large transformer, likely on the order of hundreds of billions of parameters (Google hasn’t published the exact size yet), and importantly it’s multimodal — it can take images as input in addition to text (Gemini 2.0 Flash “Thinking mode”). One headline feature of Gemini Flash is an extremely large context window: it can handle inputs up to one million tokens (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). This is orders of magnitude above most models (for comparison, OpenAI’s o1 Pro handles ~200k tokens). Such capacity likely required architectural innovations like segmented attention, retrieval augmentation, or an advanced memory management system to process so many tokens efficiently.
  • “Thinking Mode” Training: The Flash Thinking variant was trained to generate its own step-by-step reasoning as part of its output (Gemini 2.0 Flash “Thinking mode”). This means during training, the model was probably given problems along with human-crafted or model-assisted reasoning traces, and learned to produce reasoning text followed by the final answer. According to Google’s documentation, this yields stronger reasoning performance than the base model without such traces (Gemini 2.0 Flash “Thinking mode”). Google also leveraged their experience in planning algorithms: Demis Hassabis noted they combined ideas from systems like AlphaGo (which plans moves) with large-scale models (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). This could imply some reinforcement learning or search was used at inference time for certain tasks (though the details are not public). Another notable aspect is integrated tools: Gemini Flash has native code execution abilities (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat), meaning the model can internally run Python code (or similar) to check its calculations or simulate outcomes as it reasons (a code sketch follows this list). For instance, if faced with a math problem, it might generate some code, run it, and use the result to inform its answer — all within the model’s pipeline. This was likely achieved by training the model with examples of problems that require calculation and showing it how to incorporate the results of code.
  • Multimodal and Others: Gemini 2.0 was designed from the ground up to be multimodal, so Flash Thinking can interpret images, diagrams, or possibly other data formats. An example from Google’s demo is solving a geometry problem with an image of a shape — the model can “see” the image and include it in its reasoning process (Gemini 2.0 Flash “Thinking mode”). Training for that capability would have involved image+text pairs and perhaps vision models integrated into the architecture. The “Flash” in the name suggests fast processing, and indeed users report that even with its long context, the model can respond relatively quickly. This could be due to efficient model optimization or Google’s infrastructure. In sum, Gemini 2.0 Flash Thinking’s training married LLM reasoning with external problem-solving tools and multimodal understanding, underpinned by Google’s significant compute resources and research (it made rapid progress, as one update in Jan 2025 showed substantial improvement just weeks after the initial release) (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat).
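A hedged sketch of enabling that code-execution tool through the google-generativeai SDK follows; tools="code_execution" is the SDK's documented switch for Gemini 2.0 models, though whether the Flash Thinking preview accepts it, and the model id itself, are assumptions here.

```python
# Hedged sketch: enabling Gemini's built-in code-execution tool via the
# google-generativeai SDK, so the model can write and run Python to check
# its own arithmetic before answering.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp",
                              tools="code_execution")
response = model.generate_content(
    "What is the sum of the first 200 prime numbers? "
    "Write and run Python code to verify your answer.")
print(response.text)  # interleaves reasoning, generated code, and its output
```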

Strengths and Weaknesses

  • Strengths: Gemini 2.0 Flash Thinking has several standout strengths:
  • Advanced Reasoning Performance: It achieves very high marks on challenging benchmarks. For example, it scored 73.3% on the AIME math competition and 74.2% on the GPQA Diamond science benchmark (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat), which are state-of-the-art levels that surpass most models (and approach OpenAI’s o3 on the science benchmark). It demonstrates prowess in mathematical problem solving, scientific reasoning, and other tasks requiring multi-step logic.
  • Massive Context Handling: The ability to ingest up to 1,000,000 tokens of text is unprecedented (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). This means Gemini can analyze extremely large documents or even multiple documents jointly. Researchers can feed it entire libraries of information and have it draw conclusions — something not feasible with other models. This context size gives Gemini a unique advantage for use cases like literature review, lengthy legal analysis, or cross-document reasoning.
  • Transparent Chain-of-Thought: Because it’s trained to show its thinking, Gemini provides an explanation along with answers (Gemini 2.0 Flash “Thinking mode”). This transparency is useful for users to follow the logic and increases trust in the results. It also helps in debugging: if the model errs, one can see where its reasoning went astray.
  • Tool Use and Multimodality: Gemini Flash has built-in code execution, so it can handle tasks like calculations, data analysis, or code writing with higher reliability (it can test code it writes) (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). It can also reason about images and integrate visual information into solutions (e.g. solving geometry from a diagram) (Gemini 2.0 Flash “Thinking mode”), which most text-only models cannot do. This makes it extremely versatile — bridging text and vision, and seamlessly using tools — akin to having a model that can ‘think’ and also ‘act’ (by running code) as needed.
  • Free (Beta) Access: Google has made Gemini Flash available free of charge during its experimental preview (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). This lowers the barrier for developers and researchers to try it out, unlike OpenAI’s top models which usually require payment or subscription. Even though usage is capped, the free availability is a strategic strength to gather user adoption and feedback quickly.
  • Weaknesses: Despite its capabilities, Gemini Flash has some limitations:
  • Beta Quirks and Verbosity: Early users note that the model can be extremely verbose. It often produces very long explanations (e.g. thousands of tokens) (Gemini 2.0 Flash “Thinking mode”), which, while thorough, might be overkill for some applications and can slow down getting the final answer. Being in an experimental phase, it might also have the occasional glitch — for instance, sometimes the reasoning output can include stray artifacts or minor formatting issues when it’s interleaving tool outputs (one example showed a bit of corrupted SVG output in its explanation) (Gemini 2.0 Flash “Thinking mode”).
  • Not Always the Top Performer: On certain benchmarks, Gemini is excellent but not the absolute best. For example, OpenAI’s o3 edged it out on the GPQA science benchmark (o3 scored ~87.7% vs Gemini’s 74.2%) (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). So, while Gemini Flash is close to state-of-the-art, OpenAI’s most expensive model still has a lead in some areas. This suggests there are still some reasoning tasks or knowledge domains where Gemini might fall short or need further improvement.
  • Closed Model: Although accessible via API, Gemini 2.0 is not open-source. The weights and training data are not publicly released. This means the community cannot directly inspect how it works or fine-tune it independently. We rely on Google to update and maintain it. For those who prefer self-hosted solutions or need to customize the model’s training, Gemini is not an option.
  • Usage Limits and Uncertain Future Cost: The free beta comes with usage limits (e.g. number of tokens per day), and it’s temporary. Down the line, Google may put Gemini behind a paid service or restrict certain features. This uncertainty can be a weakness if one were planning a product around it — the pricing and availability might change. Also, using the 1M-token context, while possible, may be subject to stricter limits or slower speeds.
  • Generalization and Safety: It’s early to fully judge, but large chain-of-thought models sometimes encounter issues with common-sense reasoning or hallucination if the prompts fall outside their tested scenarios. Gemini Flash’s transparency helps mitigate hallucination by showing logic, but it could still follow a wrong premise confidently. Safety-wise, Google likely applied some alignment, but with the model’s expanded abilities (code execution, etc.), there might be novel risks (e.g. if a malicious user tries to get it to execute harmful code, Google probably sandboxes this heavily). These are areas to watch as the model matures.

Availability and Customization

  • Availability: Google’s Gemini 2.0 Flash Thinking model is available as a cloud service. It can be accessed through the Google AI Studio (and Vertex AI) as an experimental model (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). Developers can sign up for the Gemini API or use it via Google’s PaLM/Vertex AI endpoints. During the experimental phase (early 2025), Google has made it free to use (with limits) to gather usage and feedback (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat). One can obtain an API key and interact with the model (as demonstrated by community integrations that simply require the key) (Gemini 2.0 Flash “Thinking mode”). The model runs on Google’s infrastructure, and there’s no offline version — you send your requests to Google’s servers and get the response. Google has also provided a web interface (Google AI Studio) where users can try the model in a playground environment.
  • Customization: As a closed model offered via API, direct customization of Gemini Flash’s internals isn’t possible for external developers. You cannot fine-tune the model weights or add training data yourself. However, Google provides some levers for customization through the API. For instance, you can choose whether to use the “Flash Thinking” mode or the regular mode of Gemini 2.0 (the API likely has a flag for the reasoning mode). You can also feed the model very specific prompts that set context or style, effectively programming its behavior with instructions. Since it can handle huge contexts, you could even supply a few-shot chain-of-thought example in the prompt to nudge its reasoning style in a certain direction (though it usually doesn’t need that to produce CoT). Google’s platform might also allow combining the model with other tools — for example, using the code execution feature or plugging in retrieval for long contexts — but those are part of the model’s capability rather than user modifications. In terms of integration, developers can embed the API in their applications and adjust parameters like temperature (for randomness) or the level of detail in answers. Google has emphasized improving reliability and contradiction safeguards (Google releases free Gemini 2.0 Flash Thinking model, pressuring OpenAI’s premium strategy | VentureBeat), but if a user wanted a stricter or looser model, they would have to wait for Google to implement such changes. In summary, you can utilize Gemini Flash with flexible prompting and take advantage of its multi-step outputs, but you cannot alter the model’s fundamental training or host it yourself. Customization is thus limited to what the API offers (which is primarily prompt-based control). The advantage is that it’s plug-and-play — if the default model fits your needs, you get cutting-edge performance without having to train anything.

Alibaba QwQ (Qwen-with-Questions, Open-Source)

Architecture and Training Methodology

  • Architecture: QwQ-32B is a 32-billion-parameter dense transformer from Alibaba’s Qwen family, fine-tuned to emphasize reasoning rather than built on a novel architecture.
  • Training: Alibaba has disclosed few dataset details, but QwQ was evidently fine-tuned on chain-of-thought data with a strong math and coding emphasis, and trained to perform inference-time “reflection” — reviewing and questioning its own intermediate steps before committing to an answer.

Strengths and Weaknesses

  • Strengths: QwQ’s primary strengths lie in structured problem-solving domains:
  • It excels in mathematics and programming tasks, frequently outperforming much larger models in these areas. On benchmarks like MATH-500 and AIME (American Invitational Math Exam), QwQ achieved top-tier results, even surpassing many state-of-the-art models of its time (Alibaba Cloud Unveils Open-Source AI Reasoning Model QwQ and New Image Editing Tool — Alibaba Cloud Community). Its ability to carry out long, precise calculations and logical deductions step-by-step is a major advantage.
  • The model’s chain-of-thought clarity is a strength. QwQ not only arrives at answers, but does so by laying out a reasoning process. This makes it easier for users to follow how it solved a problem and to trust the solution (or catch mistakes if they occur). For example, it will check its work and question its assumptions as it goes (Alibaba Cloud Unveils Open-Source AI Reasoning Model QwQ and New Image Editing Tool — Alibaba Cloud Community), which is behavior similar to an expert human problem-solver.
  • Open-source accessibility is another strength. Since QwQ’s weights are freely available, anyone can inspect the model, run it locally, or fine-tune it. This means researchers can learn from QwQ’s approach to reasoning and even incorporate that into other models. It also means applications that require on-premises AI (for privacy or cost reasons) can use QwQ without relying on a third-party API.
  • QwQ’s moderate size (32B) makes it more resource-friendly than something like GPT-4 or DeepSeek R1. It’s feasible to run on a single high-end GPU or a small cluster, which is a strength for adoption in the open-source community.
  • In terms of reasoning specialization, QwQ’s “eternal student” mindset means it’s less likely to jump to a conclusion. This careful approach yields a high accuracy on problems that need multi-step thinking, and it’s less prone to hallucinating a quick but wrong answer in those scenarios.
  • Weaknesses: QwQ is a specialized model with several notable weaknesses:
  • It has uneven capabilities — while brilliant in math and code, it is less adept at common-sense reasoning and general knowledge. The developers themselves noted it needs improvement in understanding everyday situations and nuanced language (Alibaba Cloud Unveils Open-Source AI Reasoning Model QwQ and New Image Editing Tool — Alibaba Cloud Community). So, if asked a question like “Why do people celebrate birthdays?”, QwQ might not shine as much as it would on a calculus problem.
  • The model can suffer from overthinking. One reported limitation is a tendency to enter recursive reasoning loops (QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen) — essentially, it might keep questioning itself in circles without reaching a conclusion, especially if not properly prompted to eventually stop and answer. This can lead to very lengthy, somewhat rambling outputs in edge cases.
  • QwQ may mix languages or styles unexpectedly (QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen). As an experimental model, it wasn’t fully polished for consistency. Users might get answers that switch language mid-way or combine formality with casual tone in odd ways. This is a side effect of the model not being fine-tuned to the same degree as a production system like ChatGPT.
  • Lack of extensive alignment/safety tuning: being a research preview, QwQ likely did not undergo the kind of rigorous RLHF that ensures it refuses improper requests or avoids biased language. It requires the user to exercise caution. In a production setting, this is a weakness as the model might output something unfiltered or erroneous if prompted maliciously or accidentally.
  • While 32B is smaller than giant models, it’s still computationally heavy for long chains: QwQ often uses a lot of tokens to explain itself. For instance, solving a complex math problem might involve hundreds or thousands of tokens of reasoning. This makes inference slower and more costly in real-time applications, somewhat negating the advantage of the smaller parameter count.
  • Finally, QwQ’s knowledge is based on what it was trained on (likely data up to 2023/2024). It might not have the breadth of knowledge of a model trained on the entire web. This is a general weakness of many open models that don’t have the same massive training corpus as OpenAI or Google models. It could potentially miss some trivia or domain-specific info outside math/code unless fine-tuned further.

Availability and Customization

  • Availability: QwQ is fully open-source and available to the public. Alibaba released QwQ-32B under a permissive license (Apache 2.0, if consistent with their other Qwen models) and hosted the model files on platforms like Hugging Face and ModelScope (Alibaba Cloud Unveils Open-Source AI Reasoning Model QwQ and New Image Editing Tool — Alibaba Cloud Community). This means anyone can download the model for free. The release includes the model weights and documentation, and there’s even an online demo where one can test QwQ’s reasoning on a web interface. Because it’s open, multiple community members have likely converted it for different frameworks (PyTorch, TensorFlow, etc.) and optimized it (e.g. quantizing it for 16-bit or 8-bit to run on smaller GPUs). In short, QwQ is as accessible as it gets — no login or API key required, just download and run.
  • Customization: The open nature of QwQ allows extensive customization:
  • Developers can fine-tune QwQ on new datasets. For example, if you wanted QwQ to be better at common-sense, you could gather a dataset of common-sense problems with explanations and fine-tune the model weights. This ability to further train the model is a huge advantage over closed models.
  • One can also modify the model’s code/architecture if needed. Since the code is available, if a researcher wanted to experiment with, say, adding an MoE layer or integrating QwQ with a retrieval system, they could do so.
  • Integration into custom applications is straightforward. You can deploy QwQ on your own server and have full control over how it’s used (you’re not bound by someone else’s rate limits or content policies). It can be incorporated into pipelines — for instance, you could have QwQ think through a problem and then pass its solution to another system, or vice versa.
  • Alibaba’s release likely includes an instruct-tuned version (so it responds well to prompts). But if needed, you could even fine-tune QwQ for a different style, like more terse answers or a specific persona.
  • The only “cost” to this freedom is the requirement of technical know-how. Running a 32B model and fine-tuning it requires expertise in machine learning engineering and access to GPUs. For many hobbyists or small companies, using a hosted API is easier. But for those who have the means, QwQ can be molded to their needs without external restrictions.
  • In summary, QwQ offers maximum accessibility and modifiability — it’s an open foundation that the community can build upon. Already, QwQ has served as a teacher model (e.g., in training Sky-T1, discussed next) and as a benchmark for what open models can do in reasoning. Its availability is a big win for open AI development, and customization is limited only by the imagination and resources of the user.
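
As a concrete illustration of that accessibility, here is a minimal sketch (referenced in the availability bullet above) of loading QwQ with 8-bit quantization via Hugging Face transformers and bitsandbytes. The repo id `Qwen/QwQ-32B-Preview`, the prompt, and the generation settings are assumptions to adapt to your own setup, not an official recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/QwQ-32B-Preview"  # assumed Hugging Face repo id

# 8-bit weights roughly halve the memory footprint versus fp16, bringing a
# 32B model within reach of a single large GPU or a couple of mid-range ones.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # shard layers across whatever GPUs are visible
)

prompt = "Prove that the square root of 2 is irrational. Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```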

Sky-T1 (Berkeley Sky Lab’s Open-Source Reasoning Model)

Architecture and Training Methodology

  • Architecture: Sky-T1–32B is an open-source reasoning model developed by the Sky Computing Lab at UC Berkeley. Architecturally, Sky-T1 started from an existing base model, specifically Qwen 2.5-Instruct (32B), a high-quality transformer from Alibaba’s family and essentially the successor to the base that QwQ was built on (Sky-T1: Train your own O1 preview model within $450). Sky-T1 inherits the 32B transformer architecture, presumably with an 8k or 16k context window (the exact figure isn’t stated, but Qwen 2.5 supports at least 8k tokens). There’s nothing exotic in the architecture itself; the key is what the team did with training.
  • Training: The creation of Sky-T1 is a compelling story of efficient fine-tuning:
  • The team wanted to achieve o1-preview-level reasoning performance within a $450 budget (Sky-T1: Train your own O1 preview model within $450). To do this, they employed distillation and targeted fine-tuning rather than training a huge model from scratch. They used QwQ-32B (preview) as a teacher model (Sky-T1: Train your own O1 preview model within $450). Specifically, they generated a training dataset by having QwQ solve a variety of problems and recording its chain-of-thought and answers.
  • They curated about 17,000 problems covering diverse reasoning domains (with emphasis on math and coding) and used rejection sampling to ensure quality: if QwQ’s answer to a problem wasn’t good, that example was discarded or regenerated (Sky-T1: Train your own O1 preview model within $450). They also did some editing (“rewrite QwQ traces”) to clean up the solutions. (The rejection-sampling step is sketched in code after this list.)
  • Using this synthesized high-quality dataset of question -> CoT -> answer triples, they fine-tuned the base Qwen2.5–32B-Instruct model for 3 epochs (Sky-T1: Train your own O1 preview model within $450). The training took only 19 hours on 8×H100 GPUs, costing under $450 in cloud compute (Sky-T1: Train your own O1 preview model within $450), which is incredibly low-cost by LLM standards. They leveraged the DeepSpeed library (ZeRO-3) to handle memory efficiently, squeezing as much performance as possible from limited resources (Sky-T1: Train your own O1 preview model within $450).
  • They paid attention to domain balance: initially, using only the teacher data gave good math results but lagging coding performance, so they incorporated more coding problems (from the APPS and TACO datasets) as well as challenging math problems (from the Numina math dataset) to ensure Sky-T1 became competent in both domains (Sky-T1: Train your own O1 preview model within $450). This balanced mix allowed the final model to “excel in both domains”, math and coding, simultaneously (Sky-T1: Train your own O1 preview model within $450).
  • No mention is made of RLHF or preference tuning; it appears Sky-T1’s alignment comes naturally from inheriting the instruct-tuning of Qwen 2.5 and from the fact that QwQ’s outputs were generally helpful and thorough. The team published a technical report and all of their training code and scripts open-source (Sky-T1: Train your own O1 preview model within $450), underlining the academic rigor and openness of the project.
  • In summary, Sky-T1’s methodology is: take a solid base model (Qwen 2.5), feed it a diet of excellent reasoning examples from a teacher (QwQ), and do it cheaply. The result is a model that performs on par with a much more expensively trained model, OpenAI’s o1-preview (Sky-T1: Train your own O1 preview model within $450). This approach demonstrates the power of transfer learning and open collaboration: using the outputs of one open model to train another.
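
The rejection-sampling step above lends itself to a short sketch. The following is schematic only, under stated assumptions: `query_teacher` and `extract_answer` are hypothetical stand-ins for real inference and parsing code, and the Sky team’s open-sourced scripts remain the authoritative version of this recipe.

```python
import json

def query_teacher(problem: str) -> str:
    """Run the teacher (e.g., QwQ) on a problem; return its full CoT + answer."""
    raise NotImplementedError  # hypothetical hook into vLLM, transformers, etc.

def extract_answer(trace: str) -> str:
    """Pull the final answer out of a chain-of-thought trace (naive parse)."""
    return trace.rsplit("Final answer:", 1)[-1].strip()

def build_distillation_set(problems, out_path="distill_data.jsonl",
                           samples_per_problem=4):
    """Keep only teacher traces whose answer matches the known ground truth."""
    kept = 0
    with open(out_path, "w") as f:
        for prob in problems:  # each item: {"question": ..., "gold_answer": ...}
            for _ in range(samples_per_problem):
                trace = query_teacher(prob["question"])
                if extract_answer(trace) == prob["gold_answer"]:  # rejection step
                    f.write(json.dumps({"prompt": prob["question"],
                                        "response": trace}) + "\n")
                    kept += 1
                    break  # one verified trace per problem suffices
    print(f"kept {kept} of {len(problems)} problems")
```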

Strengths and Weaknesses

  • Strengths:
  • Cost-Effective Excellence: The most touted strength of Sky-T1 is that it achieved high-level reasoning ability very cheaply and quickly (Sky-T1: Train your own O1 preview model within $450). This is a proof of concept that you don’t need tens of millions of dollars to train a competitive LLM. For the AI community, this is a huge positive, as it lowers the barrier to entry for research. It showcases efficient use of resources, a win for open-source methodology.
  • Competitive Reasoning & Coding Performance: Sky-T1 was shown to perform on par with OpenAI’s o1-preview on popular reasoning and coding benchmarks (Sky-T1: Train your own O1 preview model within $450). That means it can solve complex math problems and programming challenges about as well as the initial version of one of OpenAI’s reasoning models. For a 32B model, that’s very impressive. It indicates strong general reasoning capabilities within those domains. If o1-preview could handle, say, challenging logical puzzles or code-generation tasks, Sky-T1 can handle them too.
  • Balanced Skills: Thanks to the curated training data, Sky-T1 is good at both mathematics and coding in one model (Sky-T1: Train your own O1 preview model within $450). Some models excel only in one (e.g., a math-specialized model or a code model), but Sky-T1 manages both. This makes it versatile for technical problem-solving: you could ask it to prove a theorem or debug a piece of code and expect reasonable performance on each.
  • Open and Reproducible: Like QwQ and R1, Sky-T1 is fully open. The model weights are available for anyone to use (Sky-T1: Train your own O1 preview model within $450), and even the training data (17k problems) and scripts are released (Sky-T1: Train your own O1 preview model within $450). This transparency is a strength because it allows others to validate and build upon the work. If someone doubts its performance, they can replicate the training themselves with the provided code. If they want to extend it (say, by adding more training data or fine-tuning on a new domain), the roadmap and tools are all there.
  • Foundation for Further Research: Sky-T1, being open and relatively small, can act as a foundation for further experimentation. It’s much easier to try new ideas on a 32B model than on a 170B model. Researchers could attempt to add new reasoning techniques or multi-modality to Sky-T1 without starting from zero. This modularity and ease of experimentation is a strength in the research context.
  • Weaknesses:
  • Limited Scope of Mastery: Sky-T1’s expertise is largely in math and coding, the areas its training focused on. It might not generalize as well to unrelated tasks. For instance, its performance on general trivia QA, creative writing, or common-sense reasoning might be mediocre compared to models that were trained on more diverse data. In other words, it lacks the breadth that a model like GPT-4 or even DeepSeek R1 (which went through multiple broad training phases) has.
  • 32B Parameter Constraints: At the end of the day, Sky-T1 is still a 32B model. There’s a limit to what it can store in its weights in terms of knowledge and complexity of reasoning. OpenAI’s o1 (and certainly o3) are much larger and can thus handle more complexity internally. Sky-T1 might struggle with extremely complex, nuanced tasks that push the limits of its capacity. It achieved parity with o1-preview, but OpenAI’s full o1 and o3 are beyond its reach. So, there is a gap to the absolute top performance.
  • “Preview” Maturity: The name Sky-T1–32B-Preview suggests it’s an initial version. It likely has some rough edges. Perhaps it wasn’t exhaustively fine-tuned for instruction-following outside of the reasoning context, or maybe it lacks polish in conversational ability. It might also carry over some quirks from its teacher (QwQ) — for example, if QwQ had a tendency to loop or use certain phrases, Sky-T1 might mimic that to some degree. Essentially, as a first release, it might need further tuning to be as reliable as a commercial product.
  • No RLHF Alignment Pass: Sky-T1 was trained on synthetic data, not explicitly on human feedback or refusal styles. Therefore, it might not know how to refuse inappropriate requests or moderate its outputs. This could be a weakness in a deployment scenario: it might say things that are unfiltered or take problematic instructions too literally. Any user of Sky-T1 would need to be careful and possibly add their own moderation layer (a naive sketch of which follows this list).
  • Evaluation Limitations: The benchmarks where Sky-T1 shines (math, code tests) are impressive, but they are just a subset of possible tasks. We haven’t seen public results of Sky-T1 on, say, the ARC reasoning test or a wide NLP task suite. So its weaknesses might become apparent with broader evaluation — perhaps it’s weaker in understanding long narratives or in multilingual tasks compared to more extensively trained models.
  • In summary, Sky-T1’s weaknesses are mostly the flip side of its focused training and smaller size: it’s not a generalist and it’s not the absolute best, but that’s expected. For what it was designed to do, it performs excellently; one just shouldn’t expect it to replace GPT-4 for everything. It’s a powerful tool in a narrower toolbox.
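
To make the “moderation layer” point concrete, here is a deliberately naive sketch (referenced in the alignment bullet above): screen the model’s raw output before returning it. The deny-list and the helper names are placeholders; a real deployment would use a trained safety classifier rather than keyword matching.

```python
from typing import Callable

DENY_TERMS = {"example banned phrase", "another banned phrase"}  # placeholders

def moderated(generate_fn: Callable[[str], str], prompt: str) -> str:
    """Wrap any text-generation function with a trivial output filter."""
    text = generate_fn(prompt)
    if any(term in text.lower() for term in DENY_TERMS):
        return "[response withheld by moderation layer]"
    return text

# Usage: moderated(my_sky_t1_generate, "Explain RSA key generation"),
# where my_sky_t1_generate is whatever inference wrapper you deploy.
```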

Availability and Customization

  • Availability: Sky-T1 is released under an open-source license (very likely Apache 2.0 or similar, given the Berkeley origin) and is freely available. The model weights (Sky-T1–32B-Preview) are downloadable from Hugging Face (Sky-T1: Train your own O1 preview model within $450). All supporting code (data-generation scripts, fine-tuning scripts) and the training dataset itself have been open-sourced on GitHub (Sky-T1: Train your own O1 preview model within $450). This means anyone in the world can obtain Sky-T1 and use it without restriction. It became available in January 2025 as a public release. There’s no official API from the Sky team (it’s a lab release, not a company offering a service), but interested parties could easily host an API themselves, and given the buzz, some community members or smaller companies may already host Sky-T1 on cloud platforms for others to try. Fundamentally, though, availability is unrestricted: you can run it locally if you have the hardware, or spin it up on a cloud VM.
  • Customization: Sky-T1’s open nature allows full customization. You can:
  • Fine-tune it on additional data. For example, if you want to teach Sky-T1 some common-sense reasoning, you can create a dataset for that and continue training from the released checkpoint. The team’s code release will help you do this properly.
  • Prune or modify the model. If 32B is too large for your use case, you might try techniques like knowledge distillation or quantization to compress it. Since all the weights and training data are there, one could even train a smaller model (say, 7B) using Sky-T1 as a teacher, similar to how Sky-T1 itself was made from QwQ: a cycle of distillation. (A minimal distillation training step is sketched after this list.)
  • Embed it in applications without worrying about licenses. For instance, a startup could integrate Sky-T1 into their software and even fine-tune it for their domain (medical, legal, etc.) as long as they have the expertise. There’s no external dependency; it’s self-contained.
  • Combine it with other systems. Because you have access to the model, you could merge it with a retrieval system for up-to-date knowledge, or hook it up to tools (like a calculator or a database) by fine-tuning it to call those tools when needed.
  • The community could also contribute improvements: since it’s on GitHub, if someone finds a better set of hyperparameters or an error in the data, they can suggest changes. This collaborative aspect can rapidly improve the model.
  • In short, Sky-T1 offers the same degree of freedom as QwQ or DeepSeek R1 — the user is in control. The difference is Sky-T1 is smaller and easier to experiment with, which is great for academic researchers or indie developers. The only limitation to note is that customizing a model still requires ML know-how and computing resources. But given the low-cost recipe demonstrated (they did it in 19 hours on 8 GPUs), reproducing or tweaking that is within reach of many university labs or even well-resourced hobbyists. Sky-T1 democratizes the idea of an “OpenAI-level” reasoning model, making it something the community can play with and improve on their own.
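
For the distillation route in the list above, here is a minimal PyTorch sketch of a single knowledge-distillation training step that blends hard-label cross-entropy with a soft KL term against the teacher’s logits. The loss weighting, the temperature, and the assumption that student and teacher share a tokenizer/vocabulary are illustrative choices, not the Sky team’s recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, labels, alpha=0.5, T=2.0):
    """One KD step: alpha * hard-label CE + (1 - alpha) * soft KL to teacher.

    Assumes both models are causal LMs returning .logits over a shared vocab.
    """
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits

    # Hard loss: standard next-token prediction against the gold labels.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,  # positions masked out of the loss
    )

    # Soft loss: match the teacher's temperature-smoothed distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature-squared rescaling

    return alpha * ce + (1 - alpha) * kl
```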

In conclusion, each of these Large Reasoning Models has its unique approach and niche.

This landscape shows a spectrum from proprietary, highly optimized LRMs to open, community-driven ones. The architecture and training differences (dense vs. MoE, RL-heavy vs. supervised, monolithic vs. tool-integrated) lead to different strengths and weaknesses, and the question of availability often dictates who can use a model and how. For developers and researchers, the choice may boil down to maximum performance and convenience (OpenAI/Google APIs) versus transparency and customizability (open-source models like R1, QwQ, and Sky-T1). Each approach has its trade-offs, and having all of these options ultimately benefits the progress of AI.
