<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Luhui Dev on Medium]]></title>
        <description><![CDATA[Stories by Luhui Dev on Medium]]></description>
        <link>https://medium.com/@luhuidev?source=rss-65692e5d4fe5------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*2VWrKuJMC_TasMl7mCZlxg.jpeg</url>
            <title>Stories by Luhui Dev on Medium</title>
            <link>https://medium.com/@luhuidev?source=rss-65692e5d4fe5------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 10 May 2026 15:08:38 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@luhuidev/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Dino-GSP Major Update: Algeo SDK 2.0 embedded editing mode is now available]]></title>
            <link>https://luhuidev.medium.com/dino-gsp-major-update-algeo-sdk-2-0-embedded-editing-mode-is-now-available-79cbadafb57e?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/79cbadafb57e</guid>
            <category><![CDATA[luhuidev]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[math]]></category>
            <category><![CDATA[dino-gsp]]></category>
            <category><![CDATA[ai-tools]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Sun, 10 May 2026 15:08:21 GMT</pubDate>
            <atom:updated>2026-05-10T15:08:21.890Z</atom:updated>
            <content:encoded><![CDATA[<p>Videos can be embedded. Documents can be embedded. Spreadsheets can be embedded.</p><p>But what about <strong>geometry</strong>?</p><p>For the past decade, whenever a product needed users to draw a geometry problem, edit a dynamic figure, or save an interactive geometry asset, the workflow usually broke in the same place: leave the product, use a separate tool, take a screenshot, and paste it back. That fractured workflow has sat in the middle of education platforms, teaching research systems, and AI math products for years.</p><p>Today, <a href="https://open.dajiaoai.com/?utm_source=luhuidev"><strong>Algeo SDK 2.0 embedded editing mode</strong></a> is officially available. Geometry is no longer the missing embeddable format. It can now live inside your product like a standard component, with data flowing back into your business system, UI matching your product design, and permissions staying under your own control.</p><p>Here are five common scenarios we see. If any of them sounds like your product, this release is worth a closer look.</p><h3>Scenario 1: online education platforms can let teachers create geometry problems in place</h3><p>A high school math teacher is preparing tomorrow’s geometry lesson on your platform. She needs an example problem about angle proofs in a circle.</p><p><strong>Before</strong>: she opened a separate geometry tool, finished the diagram, took a screenshot, and pasted it back into your question bank. The text lived in one place and the image in another. Students saw a static picture that could not be dragged, edited, or reused after the test.</p><p><strong>Now</strong>: she clicks “insert geometry board” in your question bank admin, and the Algeo editor opens in place. Circles, points, and auxiliary lines are created in the same workflow. When she saves, the board data enters your question bank and is bound to her account, school, and textbook chapter.</p><p>When students open the problem, they can drag a point on the circle and see the angle change directly. Throughout the whole process, <strong>your product stays in control</strong>: the data is yours, the permissions are yours, the content rights are yours, and the user behavior logs are yours.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mh8ZkGV4MpkG57aSZCjB1w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ojhfMilCMqDqVt9WyCrqyQ.png" /></figure><h3>Scenario 2: AI math products can let AI and students work on the same board</h3><p>This is one of the fastest-growing customer categories we have seen over the past year.</p><p>A student uploads a photo of a geometry problem. Your AI parses the problem and generates a solution path. But text alone is not enough. The student needs to <strong>see</strong> why an auxiliary line is drawn that way, and needs to <strong>test by hand</strong> whether an equality still holds when a point starts moving.</p><p>Algeo embedded editing closes that loop for the first time:</p><ul><li>After AI parsing, code can generate board content and load it into the editor automatically</li><li>Students interact directly inside your product by dragging, modifying, and trying alternatives</li><li>Every student edit can be sent back to your system as an event and used in the next AI analysis round</li><li>AI can respond to the student’s specific change instead of giving generic explanation</li></ul><p>Education is a <strong>feedback loop</strong>. 
Text plus static diagrams can no longer carry that loop for geometry. The missing piece is a board that can be driven by code while still giving students hands-on control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N2ClY_HU-sjQPTateyJQrw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vPfkQRr7frdsfGCrBYB2Mg.png" /></figure><h3>Scenario 3: educational publishing can turn geometry assets into a managed production workflow</h3><p>In many publishing workflows, geometry illustrations used to operate like a separate workshop: an author drew the figure, a designer remade it as vector art, an editor reviewed it, and a layout designer processed it again. One geometry asset for one problem could pass through four tools and five people.</p><p>After embedding Algeo into a content management system, that pipeline becomes much flatter:</p><ul><li>Authors write problems and draw figures directly in the CMS, with assets stored as structured geometry data rather than images</li><li>Editors can open the original board and revise it directly instead of asking the author to recreate it</li><li>The same geometry data can export to PDF, web, print, and interactive courseware: <strong>draw once, reuse everywhere</strong></li><li>Version control stays inside the CMS, so geometry boards stop being external unmanaged files</li></ul><p>For content organizations, this is not just about saving one tool. It is about turning geometry into a managed asset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f1wTmv_jY5t7EwDUh1Y3fw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2tQMNnCdGZUFssS6fzWkmQ.png" /></figure><h3>Scenario 4: schools and institutions can finally build a shared geometry asset library</h3><p>Teaching research has an old pain point: Chinese language groups have material libraries, English groups have corpora, math teams have question banks, but <strong>geometry</strong> often remains scattered. Every teacher has dozens of local geometry source files. They leave with the teacher, disappear with an old computer, and are hard for new teachers to inherit.</p><p>When an institution embeds Algeo into its collaborative teaching research platform:</p><ul><li>Geometry assets enter the institutional asset library and can be organized by subject, grade, and knowledge point</li><li>Teachers can remix the same board while keeping a complete revision history</li><li>New teachers can receive accumulated geometry resources on day one</li><li>Permissions and approvals follow the institution’s own rules, including what can be shared broadly and what stays inside a subject group</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*92TQi0n_rYFYqmKFJB5ohw.png" /></figure><h3>Scenario 5: question banks and homework systems can make geometry a first-class format</h3><p>Many question bank systems have structured templates for multiple choice, fill-in-the-blank, and written-response questions. <strong>Geometry is often still just an image</strong>. 
That creates three limits:</p><ul><li>Similar-question recommendation is weak because the system cannot tell whether two geometry problems share the same mathematical structure</li><li>Fine-grained grading is hard because the student’s answer often comes back as another image</li><li>Learning analytics are shallow because the system cannot see which construction step caused the student to get stuck</li></ul><p>Once Algeo turns geometry problems into structured data, these workflows become possible:</p><ul><li>Both the problem and the solving process are structured, so the question bank can handle geometry more like algebra</li><li>Every student operation can be reported back, allowing the grading system to locate which point was moved at which step</li><li>Learning analytics can tell a teacher that 70% of a class did not think to draw a specific auxiliary line</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nP-Lxqhql9lA8PhNTPQ-sg.png" /></figure><h3>What is ready at the technical level</h3><p>The scenarios are compelling, but production adoption is always an engineering problem. Algeo SDK 2.0 is designed to be production-ready in several core areas.</p><h3>Bidirectional communication with clear data ownership</h3><p>Every edit, board switch, and save request can be sent back to the host application through postMessage. <strong>You control the save button</strong>. The iframe does not bypass your business system to persist anything directly. When to save, where to save, and which permissions are required are all decided by your backend. The SDK only maintains the UI state for saved and unsaved changes.</p><h3>Fully configurable UI that fits into your product</h3><p>The navigation bar, board list, toolbox, algebra panel, and document panel can each be toggled independently at runtime. In an AI-assisted scenario, the editor can be reduced to a clean canvas. In a professional authoring scenario, the full toolchain can be shown. In advanced integrations, you can even <strong>replace our board list with your own UI</strong> and drive it through the SDK capability APIs.</p><h3>Engineered capability layers</h3><p>The SDK separates editor capabilities into four clear units: board file document, multi-board slides, history, and display mode. Each unit can be called independently, which also gives us room to improve each one over time without breaking the others.</p><h3>Versioned protocol for long-term evolution</h3><p>Every handshake between the SDK and iframe carries a protocol version. That means an integration you build today can continue to work after future upgrades, while still allowing us to deliver new capabilities without asking you to rewrite the integration every time.</p><h3>Production-oriented robustness</h3><p>The SDK includes a 30-second initialization timeout, standardized error codes, a clean destroy lifecycle, and self-hosted base URL support through baseUrl. These details matter when a real product faces network jitter, CSP rules, and complex route changes in single-page applications. We have already validated the approach in multiple production customer environments.</p><h3>Why choose Dino-GSP and Algeo</h3><p>There are very few teams in China that can build a <strong>dynamic geometry</strong> editor at this level. We spent a year making it production-ready, then another release cycle turning it from a product into a component. 
Geometry as a category really opens up only when it can be installed inside any product.</p><p>If your product contains the word “geometry”, whether in K12, higher education, AI math, educational publishing, or teaching research, we would be glad to talk.</p><p>Docs: <a href="https://open.dajiaoai.com/?utm_source=luhuidev">open.dajiaoai.com</a></p><p>Repository: <a href="https://github.com/dajiaoai/algeo-sdk">github.com/dajiaoai/algeo-sdk</a></p><p>Put a geometry board inside your product, starting today.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=79cbadafb57e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AHE Deep Dive: How Coding Agent Harnesses Automatically Evolve]]></title>
            <link>https://luhuidev.medium.com/ahe-deep-dive-how-coding-agent-harnesses-automatically-evolve-a0736ae5594c?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/a0736ae5594c</guid>
            <category><![CDATA[luhuidev]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Mon, 04 May 2026 14:49:21 GMT</pubDate>
            <atom:updated>2026-05-04T14:49:21.674Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*uTzTIPRH49Kg-vWF" /></figure><p>When building a coding agent, the capability of your base model is only part of the equation. In real production scenarios, what matters just as much is the <strong>harness</strong> wrapped around that model — the prompt, tools, middleware, memory, execution environment, trace, and evaluation pipeline.</p><p>This is exactly what the AHE paper addresses: <strong>how to make a coding agent’s harness continuously observable, modifiable, testable, rollback-able, and even self-iterating — just like software engineering.</strong></p><p>The full paper title is <strong>“Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses”</strong>, authored by researchers from Fudan University, Peking University, and Shanghai Qiji Zhifeng Co., Ltd. The academic teams bring methodological design, while the industry team contributes experience from Agent/LLM infrastructure and Nex AGI systems.</p><p>Even better, AHE is open source: china-qijizhifeng/agentic-harness-engineering.</p><p>This makes it more than just a paper concept — you can directly examine the seed coding agent, evolve agent, experiment configs, traces, manifests, and rollback structures. For anyone building coding agents, agent infrastructure, or broader agent products, this repository is worth dissecting.</p><p>This article explores three questions: why AHE works, how it evolves harnesses, and how to start your own small experiment with the repository.</p><h3>Part 1: A Quick Intro to Harness Engineering</h3><p>A harness is the external engineering shell that makes a model actually work. In a coding agent, it typically includes:</p><ul><li><strong>System prompt</strong>: defines the agent’s basic working mode</li><li><strong>Tools</strong>: file I/O, shell, search, test execution, code modification, etc.</li><li><strong>Tool descriptions</strong>: what the model sees about tool usage and parameter schemas</li><li><strong>Middleware</strong>: interception, validation, correction, and logging before/after tool calls</li><li><strong>Memory</strong>: short-term, long-term, and experience accumulation</li><li><strong>Context management</strong>: compression, pruning, and retrieval</li><li><strong>Execution environment</strong>: sandbox, permissions, runtime isolation</li><li><strong>Evaluation/observability</strong>: testing, trace, logs, rewards, failure reports, regression tracking</li></ul><p>This structure determines how the model approaches tasks, invokes tools, handles failures, and judges completion.</p><p>For example, when a shell command hangs in production, the solution isn’t to keep adding “don’t use interactive commands” to the prompt. A more robust approach: add timeout to the shell tool, use middleware to detect high-risk commands, truncate long outputs at the response layer, and enforce state checks before task completion.</p><p>This is the essence of Harness Engineering: putting agent capabilities into a maintainable runtime system.</p><p>I won’t dive deeper into the Harness concept here. 
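A tiny sketch may still help make the “more robust approach” above concrete: below is a hypothetical shell tool wrapper with a timeout and output truncation (the function and constant names are illustrative, not the AHE repository’s actual implementation).</p><pre>import subprocess<br><br>MAX_OUTPUT_CHARS = 4000  # truncate long output before it pollutes the context<br><br>def run_shell(command: str, timeout_s: int = 60) -&gt; str:<br>    &quot;&quot;&quot;Hypothetical shell tool: non-interactive, time-limited, truncated output.&quot;&quot;&quot;<br>    try:<br>        result = subprocess.run(<br>            command, shell=True, capture_output=True, text=True, timeout=timeout_s<br>        )<br>        output = (result.stdout + result.stderr)[:MAX_OUTPUT_CHARS]<br>        return f&quot;exit_code={result.returncode}\n{output}&quot;<br>    except subprocess.TimeoutExpired:<br>        # Return an explicit failure reason instead of letting the agent hang<br>        return f&quot;ERROR: command timed out after {timeout_s}s; avoid interactive commands&quot;</pre><p>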
If you want to learn more, search for keywords like: Harness Engineering, Agent Harness, Agent Runtime, Tool-use Agent, Agent Observability, Agent Evaluation, Coding Agent Infrastructure.</p><p>Let’s move to the main focus of this article.</p><h3>Part 2: AHE’s Core Positioning — Self-Iterating Coding Agent Harnesses</h3><p>AHE stands for <strong>Agentic Harness Engineering</strong>.</p><p>The paper’s subtitle contains the key phrase: <strong>Observability-Driven Automatic Evolution of Coding-Agent Harnesses</strong>.</p><p>This breaks down into three layers:</p><p>First, AHE targets <strong>coding agent harnesses</strong>. It doesn’t train new models or modify base model parameters.</p><p>Second, it performs <strong>automatic evolution</strong>. The goal isn’t a one-time manual prompt tweak, but continuous harness evolution across multiple runs.</p><p>Third, it relies on <strong>observability</strong>. Changes come from traces, logs, rewards, failure analysis, change manifests — not from vague “self-reflection” in a prompt.</p><p>So AHE’s precise positioning is:</p><p><strong>An automatic evolution framework for coding agent harnesses. Through observable runtime evidence, it continuously improves the agent’s surrounding prompt, tools, middleware, memory, skills, and sub-agents.</strong></p><p>This is the key difference from ordinary prompt optimization. AHE does modify prompts, but its <strong>action space is much larger — it includes tools, middleware, and memory as evolvable structures</strong>.</p><h3>Part 3: AHE’s Experimental Results</h3><p>AHE’s main experiments ran on Terminal-Bench 2. The paper reports that after 10 iterations, AHE improved the seed harness’s pass @1 from <strong>69.7% to 77.0%</strong>. This shows that on the target benchmark, AHE found effective harness modifications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IhNvMxa-rVrdc_dO" /></figure><p>The ablation study is even more revealing. The paper replaced different components in full AHE back to the seed harness individually, with roughly these results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/908/0*2hpKsuRLvzJG2ufS" /></figure><p>This result is highly informative.</p><p>If gains mainly came from better system prompts, prompt-only should improve. But in the experiment, prompt-only actually decreased, while memory, tools, and middleware showed more significant improvements.</p><p>This means AHE’s key benefits come from structural harness modifications. It also suggests that in complex tasks, many agent failures require harder (more engineering-focused) mechanisms: tool behavior, runtime interception, state recording, long-term experience, regression testing.</p><p>The paper also conducted transfer experiments. When the evolved harness transferred to SWE-bench-verified, success rate gains were small, but token usage dropped more noticeably. This suggests AHE’s evolved structures may be better at reducing ineffective exploration and context waste.</p><p>Cross-model transfer is also noteworthy. When AHE-generated harnesses were applied to multiple base models, the paper reports positive gains across the board. This indicates the learned components contain some transferable engineering structures.</p><p>My assessment: AHE’s prediction of “which changes will fix problems” is significantly better than random, but its prediction of “which changes will cause regressions” is still relatively weak. 
It does prove that harnesses can be continuously evolved in a file-based, evidence-based, version-controlled manner.</p><h3>Part 4: AHE’s Key Workflow — Evaluate, Diagnose, Modify, Verify, Rollback</h3><p>AHE’s main loop:</p><pre>graph TD<br>    A[Current Harness] --&gt; B[Run Code Agent on benchmark]<br>    B --&gt; C[Collect trace, log, reward]<br>    C --&gt; D[Analyze failure patterns]<br>    D --&gt; E[Evolve Agent modifies Harness files]<br>    E --&gt; F[Write change_manifest]<br>    F --&gt; G[Re-evaluate next round]<br>    G --&gt; H[Verify if changes work, rollback if needed]<br>    H -.-&gt; A</pre><p>This closed loop has three main actors.</p><p>First is the <strong>Code Agent</strong>.</p><p>This is the actual agent completing coding tasks, and the object being optimized. In the AHE repository, the seed agent is quite simple — basically a bash-only coding agent.</p><p>Second is the <strong>Agent Debugger</strong>.</p><p>It reads the Code Agent’s execution traces and compresses massive traces into readable failure reports. After a benchmark run, raw traces can be extremely long, making direct model reading too costly. Agent Debugger converts these traces into overviews and per-task analyses, providing evidence for subsequent modifications.</p><p>Third is the <strong>Evolve Agent</strong>.</p><p>It reads the previous round’s results, failure analysis, and historical modification records, then modifies harness files in the workspace. Its modification targets include prompts, tools, middleware, memory, skills, sub-agent configs, etc.</p><p>AHE adds strong engineering constraints to this process:</p><p>Every modification must land in files. Every modification requires a manifest. The next round must verify predictions in the manifest. Poor results must be rollback-able. The entire process should leave an auditable evidence chain.</p><p>The self-reflection agent must answer more specific questions: which file was changed, why, which tasks are expected to be fixed, which tasks might be harmed, and whether the next round’s results validate this judgment.</p><h3>Part 5: What Evolvable Components Does AHE Break the Harness Into?</h3><p>AHE’s first step is breaking the harness into explicit components.</p><p>The paper emphasizes several evolvable object types:</p><p><strong>System Prompt</strong>: Defines the Code Agent’s basic behavior, like executing shell non-interactively, checking state before task completion, not exiting prematurely.</p><p><strong>Tool Descriptions</strong>: What the model sees about tools. The tool itself might not change, but if the description changes, so does how the model calls it.</p><p><strong>Tool Implementations</strong>: The actual tool implementation. For example, how the shell tool executes commands, handles timeouts, truncates output, returns error messages.</p><p><strong>Middleware</strong>: Runtime interception layer. It can check before/after tool calls, like detecting dangerous commands, reminding about unverified tasks, blocking premature endings, recording risk states.</p><p><strong>Skills</strong>: Reusable experience. Think of these as operation manuals for certain task patterns.</p><p><strong>Sub-agents</strong>: Sub-agent configurations. Complex tasks can be split to different roles.</p><p><strong>Long-term Memory</strong>: For accumulating experience across tasks and rounds.</p><p>This decomposition gives the Evolve Agent a richer action space. 
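To make the middleware idea tangible, here is a hedged sketch of what a pre-call check could look like (the hook name and return shape are hypothetical, not the repository’s actual interface); a small file like this is exactly the kind of component the Evolve Agent can rewrite.</p><pre>RISKY_PATTERNS = (&quot;rm -rf&quot;, &quot;git push --force&quot;, &quot;shutdown&quot;)<br><br>def before_tool_call(tool_name: str, arguments: dict) -&gt; dict:<br>    &quot;&quot;&quot;Hypothetical middleware hook: inspect a tool call before it runs.&quot;&quot;&quot;<br>    if tool_name == &quot;shell&quot;:<br>        command = arguments.get(&quot;command&quot;, &quot;&quot;)<br>        if any(pattern in command for pattern in RISKY_PATTERNS):<br>            # Refuse to execute and tell the model why, instead of failing silently<br>            return {&quot;allow&quot;: False, &quot;reason&quot;: f&quot;blocked risky command: {command}&quot;}<br>    return {&quot;allow&quot;: True}</pre><p>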
It can choose the right place to intervene based on failure evidence.</p><p>Example: Code Agent keeps hanging in shell. The least efficient approach is adding more prompt reminders. AHE’s path is more engineering-focused: add timeout to shell tool; middleware checks for obviously interactive commands; return messages explicitly state failure reasons; system prompt adds behavioral constraints.</p><p>These structural modifications are more stable and easier to reuse and rollback.</p><p>The key is understanding the positioning: <strong>prompts are behavioral suggestions; tools, middleware, and memory are execution mechanisms.</strong></p><p>AHE’s value lies in bringing these execution mechanisms into the evolution scope.</p><h3>Part 6: Three Layers of Observability — How AHE Avoids Blind Search</h3><p>Just having an agent randomly modify files and rerun benchmarks has limited value. AHE’s core design is three layers of observability.</p><h3>1. Component Observability</h3><p>Component observability means the system knows what parts the harness has, where each part is, how to modify it, and how to register it.</p><p>In the AHE repository, prompts, tool descriptions, tool implementations, middleware, memory, etc., all appear as files. New tools need YAML descriptions and Python implementations, plus config registration; new middleware needs explicit integration; new skills or sub-agents also need config exposure.</p><h3>2. Experience Observability</h3><p>Experience observability means after an agent runs, the system records how it succeeded or failed.</p><p>AHE collects each task’s trace, runtime log, reward, etc. Then Agent Debugger compresses these raw traces into analysis reports.</p><p>When a coding agent fails, simply knowing “it failed” isn’t very useful. What you really need to locate is the failure level: command execution failure, dependency installation failure, test not run, file path error, output too long causing context pollution, agent prematurely judging task complete, losing previous state in long tasks.</p><p>Through traces and analysis, AHE turns failures into readable, summarizable, actionable evidence.</p><h3>3. Decision Observability</h3><p>After each modification, the Evolve Agent must write a change_manifest.json. This manifest records which files were changed, what failure pattern they address, why this component was chosen, which tasks are expected to be fixed, which might regress, and the modification&#39;s constraint strength.</p><p>After the next evaluation round, the system checks this manifest to see if predictions came true.</p><p>This step turns every modification into a verifiable hypothesis. Even without using AHE’s full automatic evolution pipeline, just introducing the change manifest habit into your own agent team will immediately improve engineering transparency.</p><p>Many agent projects struggle with long-term maintenance precisely because of this: lots of prompt changes, lots of tool adjustments, but nobody knows what each change actually solved, and nobody knows if it introduced new problems. AHE’s manifest mechanism at least makes this process auditable.</p><h3>Part 7: AHE’s Engineering Organization from the Repository</h3><p>The main entry point for the AHE repository is evolve.py. 
It orchestrates the entire evolution workflow, including initializing workspace, running evaluations, handling iteration directories, doing attribution, recovery, and rollback.</p><p>The seed agent being evolved is agents/code_agent_simple/, which includes:</p><p>code_agent.yaml describes how this agent loads prompts, which tools it uses, what tracer to use.</p><p>systemprompt.md is the initial system prompt.</p><p>LongTermMEMORY.md and ShortTermMEMORY.md correspond to long-term and short-term memory interfaces. tool_descriptions/ holds tool descriptions, tools/ holds tool implementations.</p><p>The Evolve Agent is in agents/evolve_agent/. Key files worth examining:</p><p>evolve_agent.yaml defines what tools, middleware, and skills the Evolve Agent itself can use.</p><p>evolve_prompt.md is an evolution contract: it specifies that Evolve Agent can only modify workspace, must make evidence-based changes, must write summaries and manifests, must follow registration rules.</p><p>Config files are in configs/ and configs/experiments/. configs/base.yaml is the base config, configs/experiments/exp-simple-code-gpt54.yaml is a config overlay close to the paper experiments.</p><p>Launch scripts are in scripts/, like scripts/evolve.sh for starting long experiments, scripts/build_templates.py for building task templates for E2B.</p><p>If you just want to understand the project, you don’t need to read all files at once. I recommend this reading order:</p><pre>README<br>  ↓<br>agents/code_agent_simple/code_agent.yaml<br>  ↓<br>agents/code_agent_simple/systemprompt.md<br>  ↓<br>agents/evolve_agent/evolve_prompt.md<br>  ↓<br>configs/base.yaml<br>  ↓<br>configs/experiments/exp-simple-code-gpt54.yaml<br>  ↓<br>evolve.py</pre><p>This sequence helps you build concepts first, then see execution details.</p><h3>Part 8: Getting Started with the Repository — Run a Small Experiment First</h3><p>AHE is not a lightweight SDK. You can’t expect to pip install and immediately embed it in production systems.</p><p>It’s more like a research experiment framework. Running full paper-level experiments requires LLM API, E2B sandbox, SERPER API, benchmark data, concurrent scheduling, and considerable token costs.</p><p>So a more realistic onboarding approach is to run a minimal closed loop first.</p><p>Set the goal as: get AHE’s core pipeline running.</p><p>That is:</p><pre>graph LR<br>    A[Task execution] --&gt; B[Trace generation]<br>    B --&gt; C[Analysis generation]<br>    C --&gt; D[change_manifest written]<br>    D --&gt; E[Next round re-evaluation]<br>    E --&gt; F[change_evaluation&lt;br&gt;judges modification effect]</pre><p>Once this pipeline works, you understand AHE’s practical value.</p><h3>1. Clone the Repository</h3><p>Official repository:</p><pre>git clone https://github.com/china-qijizhifeng/agentic-harness-engineering.git<br>cd agentic-harness-engineering</pre><h3>2. Install Dependencies</h3><p>The project uses uv to manage Python dependencies.</p><pre>uv sync</pre><h3>3. Configure Environment Variables</h3><p>Copy the environment variable template:</p><pre>cp .env.example .env</pre><p>At minimum, pay attention to these variables:</p><pre>LLM_API_KEY<br>LLM_BASE_URL<br>E2B_API_KEY<br>SERPER_API_KEY<br>GITHUB_TOKEN</pre><p>Agent Debugger can also configure model endpoints separately. Refer to .env.example for specifics.</p><p>One important note: AHE’s task execution depends on E2B sandbox. Much code execution happens in isolated remote environments. 
This helps with security and reproducibility, but also means you need an E2B account and credits.</p><h3>4. Prepare Benchmark Task Templates</h3><p>The official workflow requires building task templates first. Example command:</p><pre>uv run python scripts/build_templates.py --dataset-dir /path/to/dataset -j 16</pre><p>Replace /path/to/dataset with your actual task data path.</p><p>If you’re just doing a small experiment, I don’t recommend preparing full Terminal-Bench 2 at the start. Select a few tasks and get the pipeline working first — that’s more important.</p><h3>5. Start with a Small Config</h3><p>For paper experiment config, refer to:</p><pre>configs/experiments/exp-simple-code-gpt54.yaml</pre><p>Running the full config is quite costly. Copy a small config, for example:</p><pre>cp configs/experiments/exp-simple-code-gpt54.yaml configs/experiments/exp-mini.yaml</pre><p>Then reduce the parameters:</p><pre>max_iterations: 2<br>harbor:<br>  k: 2<br>  n_concurrent: 4</pre><p>If the config supports specifying task subsets, use only 3 to 5 tasks. The point of a small experiment is validating the workflow, not chasing scores.</p><h3>6. Launch the Evolution Experiment</h3><p>You can use the script:</p><pre>./scripts/evolve.sh configs/experiments/exp-mini.yaml</pre><p>Or look inside the script to see how it calls evolve.py, then manually launch as needed.</p><p>Full experiments can run for a long time. Even small experiments require attention to API costs, E2B concurrency limits, and network stability.</p><h3>7. Look at Experiment Artifacts, Not Just Scores</h3><p>After running, don’t just look at pass rate.</p><p>What’s more worth examining are these artifacts:</p><pre>runs/iteration_*/<br>analysis/overview.md<br>analysis/detail/*.md<br>change_manifest.json<br>change_evaluation.json<br>agent/nexau_in_memory_tracer.cleaned.json<br>verifier/reward.txt</pre><p>After running, focus on observing and answering these questions:</p><ul><li>What patterns were this round’s failures attributed to?</li><li>Which files did Evolve Agent change?</li><li>Why did it choose to change these files?</li><li>Which tasks does the manifest predict will be fixed?</li><li>Did the next round verify this prediction?</li><li>Were there cases where fixing one task broke another?</li></ul><p>If you can find answers to all these questions in the artifacts, it means AHE’s core closed loop is working.</p><h3>Part 9: What AHE Hasn’t Solved Yet</h3><p>AHE is valuable, but its boundaries should be clear too.</p><p>First, it’s still a research framework. Full runs aren’t cheap, requiring benchmarks, sandboxes, LLM APIs, and fairly complex experiment configs.</p><p>Second, the effectiveness evidence in the paper needs more replication experiments. The improvement on Terminal-Bench 2 is clear, but for strong statistical conclusions, more seeds, more campaigns, and more confidence intervals are needed.</p><p>Third, its prediction of regression risk isn’t strong enough. The system is better at explaining what a modification might fix, but not as good at judging what it might harm. 
This is a hard problem for automatic evolution systems.</p><h3>Part 10: AHE’s Inspiration for Agent Product Teams</h3><p>AHE’s biggest inspiration for product-focused agent teams is pulling agent improvement processes from “mystical prompt tuning” back into the engineering world.</p><p>A real agent product will eventually face these questions:</p><ul><li>After a user reports an error, how do you reproduce it?</li><li>How do you aggregate failure causes?</li><li>Did a certain prompt modification actually help?</li><li>Did a tool change regress other scenarios?</li><li>Is there regression testing before release?</li><li>Can you rollback if production performance degrades?</li><li>How do you distill effective experience into memory or skills?</li></ul><p>No single model can solve these problems for you.</p><p>They belong to the scope of harness engineering work.</p><p>If you’re also building your own agent, this repository is worth thoroughly dissecting. Even without running it completely, you can learn a lot about harness organization, trace design, modification attribution, and regression verification engineering methods.</p><h3>References</h3><ul><li>Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses<br>arXiv: <a href="https://arxiv.org/abs/2604.25850">https://arxiv.org/abs/2604.25850</a></li><li>AHE Official Code Repository<br>GitHub: <a href="https://github.com/china-qijizhifeng/agentic-harness-engineering">https://github.com/china-qijizhifeng/agentic-harness-engineering</a></li><li>Harness engineering: leveraging Codex in an agent-first world<br>OpenAI Engineering Blog: <a href="https://openai.com/index/harness-engineering/">https://openai.com/index/harness-engineering/</a></li></ul><p>🙋‍<br><em>I’m Luhui Dev, a developer who has been breaking down Agent engineering and exploring how AI can be applied in education.<br>I focus on Agent Harness, LLM application engineering, AI for Math, and the productization of education SaaS.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a0736ae5594c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DSPy Tutorial: Why Signatures Are Easier to Optimize Than Raw Prompts]]></title>
            <link>https://luhuidev.medium.com/dspy-tutorial-why-signatures-are-easier-to-optimize-than-raw-prompts-b22b9663f05d?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/b22b9663f05d</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Wed, 22 Apr 2026 10:09:19 GMT</pubDate>
            <atom:updated>2026-04-22T10:09:19.317Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FHC1fpp0n2ck5lR5" /></figure><p>One useful thing I ran into while building recently: <strong>DSPy</strong>.</p><p>While building Canviz’s content generation pipeline, I kept running into the same engineering problem: explanation quality and whiteboard-script reliability were both important, but it was hard to keep both stable with prompt text alone. As soon as I switched models or added a new grade level, I had to retune the string again. DSPy gave me a more systematic way to think about it.</p><p>If you are looking for a practical DSPy tutorial, a cleaner prompt engineering workflow, or a better way to optimize LLM pipelines, the core idea is simple: define the task as a Signature first, then let DSPy optimize the prompting layer around it.</p><h3>The Core Tension in Prompt Engineering</h3><p>Before talking about DSPy, it is worth stating one thing clearly: why is handwritten prompting an engineering problem rather than just a craft problem?</p><p>Traditional prompts have a structural flaw: <strong>they mix together “what the task is” and “how to tell the model to do it.”</strong></p><p>That one natural-language string is doing two jobs at once:</p><ol><li>It describes the task logic: what the inputs are and what outputs should come back.</li><li>It acts as a model-specific incantation tuned for the current model.</li></ol><p>Take a math-teaching example. The task logic of “explain a chicken-and-rabbit cage problem to a student” is stable. But the spell that works well for GPT may not be the one that works well for Claude Sonnet. Once you switch models, or move from grade 3 to grade 5, the spell may break. Worse, there is usually no systematic way to repair it beyond trial and error.</p><p>That is just hardcoding by another name. In normal software engineering, we already know not to freeze core logic into brittle literals. In LLM pipelines, though, we often lock core behavior into a fragile string.</p><p>DSPy’s author, Stanford researcher Omar Khattab, describes the problem like this:</p><blockquote>LM pipelines are often implemented as hard-coded prompt templates discovered by trial and error, and they are unusually brittle.</blockquote><h3>What Is DSPy, and What Is the Key Insight?</h3><p><strong>DSPy (Declarative Self-improving Python)</strong> is a framework open-sourced by Stanford NLP in 2023 and published at ICLR 2024. Its core claim is:</p><blockquote><strong><em>Programming language models, not prompting them.</em></strong></blockquote><p>The solution is elegant: <strong>separate the task interface from the concrete prompt implementation.</strong></p><p>You tell DSPy:</p><ul><li>what each step takes in and returns,</li><li>what the pipeline structure is,</li><li>and how success should be evaluated.</li></ul><p>Then DSPy’s compiler and optimizer search for a better prompt strategy for your chosen model, data, and metric.</p><p>The official analogy is useful here: it feels a bit like moving from assembly to a higher-level language, or from handwritten SQL to an ORM.</p><h3>Three Core Ideas That Explain DSPy</h3><h3>1. Signature: the task type signature</h3><p>A Signature is DSPy’s interface description. 
You specify what the step does, not how to word it:</p><pre>import dspy<br>class ExplainMathProblem(dspy.Signature):<br>    &quot;&quot;&quot;Explain a math problem to a student at a specified grade level.&quot;&quot;&quot;<br>    problem: str = dspy.InputField(desc=&quot;Original math problem&quot;)<br>    grade: int = dspy.InputField(desc=&quot;Student grade, e.g. 3 means third grade&quot;)<br>    explanation: str = dspy.OutputField(desc=&quot;A step-by-step explanation suitable for the grade&quot;)<br>    key_concept: str = dspy.OutputField(desc=&quot;The main concept tested by the problem&quot;)</pre><p>There is no handwritten prompt here. There is only <strong>interface semantics</strong>, not roleplay wording like “You are a patient and caring math teacher.”</p><h3>2. Module: composable functional blocks</h3><p>Modules are DSPy’s execution units, inspired by PyTorch’s nn.Module. You can compose them into a full pipeline:</p><pre>class MathLessonPipeline(dspy.Module):<br>    def __init__(self):<br>        super().__init__()<br>        # Step 1: explain the problem<br>        self.explain = dspy.ChainOfThought(ExplainMathProblem)<br>        # Step 2: generate a matching DinoGSP visualization script<br>        self.generate_diagram = dspy.Predict(<br>            &quot;problem, explanation -&gt; dinogsp_script: str&quot;<br>        )<br>        # Step 3: create a similar practice exercise<br>        self.make_exercise = dspy.Predict(<br>            &quot;problem, key_concept, grade -&gt; exercise: str, answer: str&quot;<br>        )<br><br>    def forward(self, problem, grade):<br>        # Explain<br>        step1 = self.explain(problem=problem, grade=grade)<br>        # Generate diagram<br>        step2 = self.generate_diagram(<br>            problem=problem,<br>            explanation=step1.explanation<br>        )<br>        # Create exercise<br>        step3 = self.make_exercise(<br>            problem=problem,<br>            key_concept=step1.key_concept,<br>            grade=grade<br>        )<br>        return dspy.Prediction(<br>            explanation=step1.explanation,<br>            dinogsp_script=step2.dinogsp_script,<br>            exercise=step3.exercise,<br>            answer=step3.answer<br>        )</pre><p>Across the whole three-step pipeline, no prompt string is manually written. What you write is the <strong>logic structure</strong>.</p><p>DSPy also ships several common reasoning strategies:</p><table><thead><tr><th>Module</th><th>Reasoning style</th><th>Example use in teaching</th></tr></thead><tbody><tr><td>dspy.Predict</td><td>direct prediction</td><td>difficulty grading, concept labeling</td></tr><tr><td>dspy.ChainOfThought</td><td>chain-of-thought</td><td>step-by-step explanations</td></tr><tr><td>dspy.ReAct</td><td>think-act loop</td><td>tool-based script validation</td></tr><tr><td>dspy.ProgramOfThought</td><td>programmatic reasoning</td><td>executable math code generation</td></tr></tbody></table><h3>3. 
Optimizer: the auto-tuning engine</h3><p>This is the most distinctive part of DSPy.</p><p>You provide:</p><ul><li>an evaluation dataset, such as 100 problems with human-annotated references,</li><li>and a metric function that decides whether the output is good enough.</li></ul><p>Then the optimizer searches for stronger prompt instructions and better few-shot examples:</p><pre># Define the metric: is the explanation age-appropriate, and does the script parse?<br>def lesson_quality_metric(example, prediction, trace=None):<br>    explanation_ok = len(prediction.explanation) &gt; 50  # minimum length<br>    script_parseable = validate_dinogsp(prediction.dinogsp_script)  # valid script<br>    grade_appropriate = check_vocabulary_level(<br>        prediction.explanation, example.grade<br>    )  # age-appropriate wording<br>    return explanation_ok and script_parseable and grade_appropriate</pre><pre># Optimize with MIPROv2<br>optimizer = dspy.MIPROv2(metric=lesson_quality_metric, auto=&quot;medium&quot;)<br>optimized_pipeline = optimizer.compile(<br>    MathLessonPipeline(),<br>    trainset=annotated_lessons<br>)<br># Save the result and load it directly in production later<br>optimized_pipeline.save(&quot;./optimized_math_lesson.json&quot;)</pre><p>A medium run costs time and money, but the return is a content-generation system tuned to a specific model, dataset, and metric rather than a lucky handwritten prompt.</p><h3>A Data Point Worth Looking At</h3><p>One official DSPy result that stuck with me is from HotPotQA, a multi-hop reasoning benchmark.</p><p>Using dspy.ReAct with a gpt mini series model:</p><ul><li>accuracy was around <strong>24%</strong> before optimization,</li><li>and reached around <strong>51%</strong> after MIPROv2 optimization on 500 examples.</li></ul><p>The important point is not that a more expensive model was used. It is that the same class of model became much better at the task through optimization.</p><h3>How It Differs from LangChain and LlamaIndex</h3><p>A reasonable question is whether DSPy matters if you already use LangChain.</p><p><strong>LangChain / LlamaIndex</strong> are orchestration frameworks. They are good at wiring together LLMs, vector stores, retrieval, and tool calls, but the prompts are still usually human-written strings. When the model changes, you often still have to go back and edit prompts by hand.</p><p><strong>DSPy</strong> is closer to a compiler framework for AI programs. It does not just connect components. It tries to take over prompt generation and optimization as well. 
The developer writes the logic; DSPy searches for a better natural-language realization of that logic for a given model.</p><p>The difference becomes obvious in a math-education pipeline:</p><ul><li>With LangChain, if you built a “third-grade explanation” flow and tomorrow need fifth-grade support, you usually revisit the prompt strings manually.</li><li>With DSPy, you are more likely to change the inputs, dataset, or evaluation target, then recompile and let the framework search again.</li></ul><p>If I had to compress it into one analogy: LangChain is an automation assembly line; DSPy is a higher-level language with a compiler.</p><h3>My Developer View: What It Solves, and What It Still Does Not</h3><h3>What DSPy genuinely solves</h3><ul><li><strong>Model migration pain</strong>: when moving from GPT-5.4 to a cheaper model, you can recompile instead of rewriting all prompts.</li><li><strong>Joint optimization across steps</strong>: explanation quality and diagram-script usability can be optimized together instead of separately.</li><li><strong>Experiment reproducibility</strong>: optimized results can be saved as JSON and shared across the team.</li></ul><h3>Where it is still hard</h3><ul><li><strong>Metrics are the hardest part</strong>: a function like validate_dinogsp() has to be designed carefully, or the optimizer will exploit loopholes.</li><li><strong>Optimization is not free</strong>: as datasets, model costs, and optimization rounds grow, the bill grows too.</li><li><strong>Debugging is still maturing</strong>: when the optimized pipeline is still not good enough, it is often hard to tell whether the bottleneck is the dataset, the metric, or the model itself.</li></ul><h3>When Should You Use DSPy?</h3><h3>Good fit</h3><ul><li>You are building a multi-step LLM pipeline.</li><li>You need to switch between different models.</li><li>You have an evaluation dataset and measurable quality targets.</li><li>You are tired of vibe-based prompt tuning.</li><li>You need something maintainable in production.</li></ul><h3>Probably not a good fit</h3><ul><li>You are only validating an idea quickly.</li><li>The task has no clear metric, so the optimizer has nothing reliable to optimize against.</li></ul><h3>Closing Thought</h3><p>What I like most about DSPy is not just that it can auto-optimize prompts. It pushes a more reliable engineering mindset:</p><p><strong>In an AI pipeline, prompts are closer to parameters than source code.</strong></p><p>Just as I would not hardcode neural-network weights into source files, I should not treat a prompt tuned for one model as the program logic itself. Those prompts are better treated as artifacts that can be learned, optimized, saved, and migrated.</p><p>The <strong>logic</strong> of teaching content is stable: step-by-step explanation, visual support, age-appropriate wording. But <strong>how to get a model to deliver that</strong> changes with model upgrades, new grades, and new problem types. 
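When the model changes, the compiled artifact can simply be reloaded, or recompiled against the new backend, rather than rewriting prompt strings by hand. A minimal sketch, assuming the MathLessonPipeline class from earlier and the JSON file saved above (the model name is only an example):</p><pre>import dspy<br><br># Point DSPy at whatever backend you are migrating to<br>dspy.configure(lm=dspy.LM(&quot;openai/gpt-4o-mini&quot;))<br><br># Rebuild the pipeline structure in code, then load the compiled prompts<br>pipeline = MathLessonPipeline()<br>pipeline.load(&quot;./optimized_math_lesson.json&quot;)<br><br>result = pipeline(problem=&quot;A cage holds chickens and rabbits: 35 heads, 94 legs.&quot;, grade=3)<br>print(result.explanation)</pre><p>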
DSPy separates those two layers, which is what makes an AI teaching system actually maintainable.</p><p>🙋‍♀️ <em>If you’re also working on AI education, feel free to connect.</em></p><h3>References</h3><ul><li>DSPy docs: <a href="https://dspy.ai/">dspy.ai</a></li><li>Paper: <a href="https://arxiv.org/abs/2310.03714">DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, ICLR 2024</a></li><li>GitHub: <a href="https://github.com/stanfordnlp/dspy">stanfordnlp/dspy</a></li><li>Optimizer guide: <a href="https://dspy.ai/learn/optimization/optimizers/">dspy.ai/learn/optimization/optimizers</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b22b9663f05d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Struggling with Research Figures? Here’s How Multi-Agent Collaboration Gets It Right]]></title>
            <link>https://luhuidev.medium.com/struggling-with-research-figures-heres-how-multi-agent-collaboration-gets-it-right-ea7bc2c8f608?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/ea7bc2c8f608</guid>
            <category><![CDATA[multi-agent-systems]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Sat, 11 Apr 2026 08:41:23 GMT</pubDate>
            <atom:updated>2026-04-11T08:41:23.327Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Problem Every Researcher Knows Too Well</h3><p>Anyone who’s done research knows this pain: creating a single figure from concept to completion can be more exhausting than writing the actual paper. You need logical structure, data precision, and style compliance — miss any one of these, and you’re back to the drawing board.</p><p>Single-model AI generation tools often produce beautiful images with broken logic, or logically sound diagrams that look terrible, or worst of all — figures where all the proportions are completely off.</p><p>PaperBanana solved this problem, and it works remarkably well. The key insight? <strong>Break the task into multiple roles and let an AI team collaborate.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kitg7fSP46xbS-Y0" /></figure><h3>Why Traditional AI Falls Short</h3><p>Many assume that throwing a large language model at the problem should work. But research figures aren’t ordinary illustrations — they need to <strong>accurately express logic</strong>, <strong>ensure data precision</strong>, and ultimately meet academic journal aesthetics.</p><p>A single model can’t nail all three at once. The result? Either gorgeous images with completely wrong logic, or logically correct diagrams that look like they’re from the ’90s, and almost always with numerical proportions that make no sense.</p><p>This is the core pain point of research figure generation, and exactly why solutions like PaperBanana emerged.</p><h3>PaperBanana’s Five-Role Collaboration</h3><p>PaperBanana’s design philosophy is simple: <strong>Split the generation task into five specialized roles, let each focus on what they do best, then collaborate iteratively.</strong></p><h3>The Visual Workflow</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e4qIyTV2mrtySdok.jpeg" /></figure><h3>1. Retriever — The Inspiration Board</h3><p>The Retriever searches through a curated reference database to find the most relevant examples.</p><p>It focuses on <strong>visual structure matching</strong>, ensuring that subsequent generation has reliable layout references to work from.</p><p>Think of it like a designer browsing templates before starting to sketch — that’s what the Retriever does.</p><h3>2. Planner — The Skeleton Designer</h3><p>The Planner is the core brain. It transforms paper descriptions and figure objectives into detailed figure plans, including:</p><ul><li>Figure components (nodes/modules)</li><li>Logical relationships and arrow directions between components</li><li>Spatial layout suggestions</li><li>Labels, annotations, etc.</li></ul><p>The Planner’s core job is to provide the skeleton, preventing the generation from going off the rails.</p><h3>3. Stylist — The Aesthetic Director</h3><p>With the skeleton in place, the Stylist handles the aesthetics.</p><p>It extracts colors, fonts, line weights, and shapes from reference examples, optimizing the Planner’s output to meet journal standards.</p><p>NeurIPS and Nature have different figure styles — the Stylist ensures generated figures comply with academic norms.</p><h3>4. 
Visualizer — The Executor</h3><p>The Visualizer generates figures based on the standardized plan:</p><ul><li><strong>Method figures</strong> → Rendered using high-quality image generation models</li><li><strong>Data charts</strong> → Outputs <strong>reproducible Matplotlib code</strong></li></ul><p>This means generated figures aren’t just pretty — they’re directly usable as research materials, reproducible and modifiable.</p><h3>5. Critic — The QA/Feedback Loop</h3><p>The Critic is key to closing the loop. It checks whether the figure faithfully reflects the text, whether it’s clear, and whether it meets style specifications.</p><p>If unsatisfied, it provides revision suggestions, prompting the Planner/Visualizer to iterate. Usually 2–3 rounds produce high-quality figures.</p><h3>Why Multi-Role Collaboration Works</h3><p>Compared to single-model end-to-end generation, PaperBanana has three major advantages:</p><ol><li><strong>Reference-driven</strong>: The Retriever provides structural and stylistic examples, making generation more reliable</li><li><strong>Clear division of labor</strong>: Logic, style, and rendering are separated, avoiding the chaos of black-box generation</li><li><strong>Closed-loop self-checking</strong>: Critic + iteration makes figure quality controllable</li></ol><p>In other words, this is a <strong>process innovation</strong> for AI-assisted research figure creation. In experiments, PaperBanana significantly outperformed baselines in fidelity, readability, and aesthetics.</p><p>If you’re interested in the design of this scenario, I’ve compiled <a href="https://luhuidev.com/zh-cn/essays/paperbanana-ai-academic-method-figure-collaboration">the complete Prompt set</a> — grab it below 👇</p><h3>Beyond Academic Figures</h3><p>This multi-role collaboration pattern isn’t limited to academic illustrations.</p><p>For flowcharts, experimental design diagrams, teaching demonstrations, automated data visualization, and even complex tasks like code generation and decision planning, multi-agent collaboration proves more reliable.</p><h3>References</h3><ul><li><a href="https://arxiv.org/abs/2601.23265">PaperBanana: Automating Academic Illustration for AI Scientists (arXiv)</a></li><li><a href="https://paper-banana.ai/">PaperBanana Official Site</a></li><li><a href="https://hyper.ai/en/papers/2601.23265">PaperBananaBench Dataset and Evaluation</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ea7bc2c8f608" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dino-GSP Major Update: dynamic geometry demos, geometry embeds, and AI drawing upgrades]]></title>
            <link>https://luhuidev.medium.com/dino-gsp-major-update-dynamic-geometry-demos-geometry-embeds-and-ai-drawing-upgrades-f2c690d03161?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/f2c690d03161</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[dino-gsp]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 12:34:34 GMT</pubDate>
            <atom:updated>2026-04-07T12:34:34.282Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>Dino-GSP 2.4.0 was released on March 23, 2026.</strong> This update is not just a list of extra features. It connects <strong>dynamic geometry demos, online geometry embeds, region area calculation, and AI geometry drawing</strong> into a more complete workflow.</p><p>If you are comparing <strong>dynamic geometry software, online geometry tools, math teaching tools, or interactive geometry platforms</strong> for lessons, content, or websites, this release deserves attention.</p><h3>Dino-GSP 2.4.0 at a glance</h3><p>This release focuses on four high-frequency needs:</p><ul><li><strong>Slider-based dynamic demos</strong> that make geometry figures actually move</li><li><strong>Geometry embed mode</strong> for blogs, course pages, and product sites</li><li><strong>Boolean region operations and area calculation</strong> for more complex analysis</li><li><strong>Broader AI geometry assistance</strong> that fits real creation workflows</li></ul><h3>1. Dynamic geometry demos upgraded: sliders are now a first-class feature</h3><p>The point of dynamic geometry is not just drawing figures. It is showing parameter changes, geometric relationships, and reasoning processes in motion. The latest Dino-GSP release fully rounds out slider support and makes it much closer to a real <strong>dynamic geometry software</strong> workflow for classrooms and content creation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lnlVWQSg2-Nhkpz_" /></figure><p>This upgrade includes:</p><ol><li><strong>Create and edit dynamic parameters</strong>: sliders can directly control lengths, angles, and point positions, with figures updating in real time.</li><li><strong>Text-linked values</strong>: slider values can be inserted into explanatory text so teaching copy updates together with the figure.</li><li><strong>Autoplay support</strong>: presentation and sharing modes support autoplay, speed adjustment, and looping for lessons and recorded demos.</li><li><strong>More complete exports</strong>: sliders can be exported to SVG and TikZ while preserving labels and control styles for papers, handouts, and blogs.</li></ol><p>This pushes Dino-GSP beyond a static geometry board and makes it more suitable for <strong>interactive geometry demos</strong>, classroom walkthroughs, and parameter-driven explanations.</p><h3>2. Geometry embed mode arrives: the online geometry tool can now live inside web pages</h3><p>For course builders, bloggers, and documentation teams, the ability to embed geometry into a page is a practical requirement. 
The latest Dino-GSP release adds a full <strong>geometry embed mode</strong>.</p><h3>2.1 Where this helps</h3><ul><li>Embedding interactive geometry into teaching blogs</li><li>Showing manipulable math demos inside online courses</li><li>Adding interactive diagrams to product sites or knowledge bases</li><li>Preserving parameter control and geometry state in shared pages</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VJXce4xGi0r1qj_s" /></figure><h3>2.2 What is included</h3><ol><li><strong>A complete embed architecture</strong>: dedicated routing, state synchronization, and communication bridging.</li><li><strong>iframe export</strong>: exportable iframe links with configurable aspect ratios for different layouts.</li><li><strong>REPL integration</strong>: embedded surfaces can load and edit geometry content, so the experience goes beyond passive viewing.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sWAIBzW3eyMVuf_n" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rCNyb_6ZSv9i7wLI" /></figure><h3>3. Region area calculation and boolean operations improved: analysis is more complete</h3><p>If you need to work with overlapping shapes, composite figures, or region logic, this release strengthens the analytical layer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YTL67zYALZ60iJz5" /></figure><p>The update includes:</p><ol><li><strong>Boolean path operations</strong>: intersection, union, and difference for more complex region construction.</li><li><strong>Region area calculation</strong>: direct area calculation plus contains checks.</li><li><strong>Precision fixes</strong>: better handling of boundary precision issues, negative radii, and undefined dependencies.</li></ol><p>This matters for:</p><ul><li>Solving geometry problems involving overlapping areas</li><li>Verifying region relationships in teaching contexts</li><li>Building composite paths for cleaner exports</li><li>Running more stable geometry computation workflows</li></ul><h3>4. Master management is now available: keep diagram styles consistent at scale</h3><p>If you produce many teaching diagrams or worksheet visuals, repeated style setup quickly becomes inefficient. The latest release adds <strong>master management</strong> to improve content production efficiency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V8_xrFMIyvDZEgsp" /></figure><p>You can now:</p><ol><li>Open the master panel directly from the editor tabs</li><li>Create, update, apply, and delete masters</li><li>Set default styles and preview them in real time</li></ol><p>For teachers, geometry creators, and worksheet teams, this improves batch production more than one-off drawing speed.</p><h3>5. AI geometry drawing keeps improving: a smarter geometry assistant</h3><p>Dino-GSP has been pushing AI toward an executable geometry assistant, not just a chat box. 
This AI update is part of that broader workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*shuC_TKa3z-VHqx6" /></figure><p>The main AI improvements include:</p><ol><li><strong>Usage and credit records</strong>: clearer tracking for AI costs and consumption.</li><li><strong>Image upload entry points</strong>: users can upload sketches or images and be routed to image-capable models.</li><li><strong>Better conversation tools</strong>: copy, reaction, and feedback support for a more stable interaction loop.</li><li><strong>Clearer instruction display</strong>: formatting, truncation, and expansion improve readability for complex prompts.</li><li><strong>Animation support</strong>: AI can help create geometry animations and assist with keyframes and motion paths.</li></ol><h3>6. Axes, grids, and algebra definitions continue to improve</h3><p>Beyond the larger features, this release also includes lower-level upgrades that affect daily use.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eCeForsa-RMSzAnB" /></figure><h3>6.1 Coordinate system and grid</h3><ul><li>Custom grid ranges are supported</li><li>Axis point selection can lock intelligently</li><li>pi and pi/2 spacing are supported</li><li>X and Y ranges, labels, and intervals are more configurable</li></ul><h3>6.2 Automatic algebra definition reordering</h3><ul><li>Object order is adjusted automatically when algebra definitions change</li><li>Circular dependency detection and error prompts are supported</li></ul><h3>7. More upgrades across drawing and sharing workflows</h3><h3>7.1 Geometry and drawing</h3><ul><li>New orthogonal drawing mode</li><li>Better ellipse arc editing</li><li>Added arrow styles</li><li>Dynamic anchor support for labels</li><li>Formula editor symbols better aligned with classroom math notation</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2uW8kXyc8a_aCVIH" /></figure><h3>7.2 Interaction and interface</h3><ul><li>Floating toolbar for union selection, color settings, and hover hints</li><li>More line and point styling options</li><li>Clearer property panel structure</li><li>Input width adjusts dynamically with expression count</li></ul><h3>7.3 Sharing and SEO</h3><ul><li>Community sharing can control whether AI chat records are public</li><li>Shared works can restrict saving and remixing</li><li>Shared pages support dynamic titles and descriptions</li></ul><p>This makes Dino-GSP better not just for drawing, but also for <strong>distribution, discoverability, and search visibility</strong>.</p><h3>8. 
Which day-to-day issues were fixed</h3><p>This release also fixes a large number of practical issues, including:</p><ul><li><strong>Region computation</strong>: negative area, path restoration, arc judgment, and precision flicker</li><li><strong>Sliders</strong>: style copying, step and speed defaults, snapping, previews, and history behavior</li><li><strong>Selection</strong>: deselect with Shift, incorrect select-all behavior, and function graph box selection</li><li><strong>Exports</strong>: inconsistencies across SVG, LaTeX, and Canvas, plus font embedding and clipping offsets</li><li><strong>Tool compatibility</strong>: grid snapping, compass and transform tool errors, file jumps, and copy/paste</li></ul><h3>Try Dino-GSP</h3><p>If you are comparing geometry software, math teaching tools, or embeddable dynamic geometry options, this version is now a much stronger reference point.</p><p>👉 <a href="https://dajiaoai.com/?utm_source=luhuidev">Try Dino-GSP now</a></p><h3>About Dino-GSP</h3><p>Dino-GSP is a tool for math teaching, geometry creation, and online sharing. It combines a geometry engine, AI assistance, and professional export capabilities into a more modern geometry workflow.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f2c690d03161" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Embed a Geometry Canvas in Your Webpage with One Line of Code]]></title>
            <link>https://luhuidev.medium.com/embed-a-geometry-canvas-in-your-webpage-with-one-line-of-code-c225edc7abcc?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/c225edc7abcc</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[geometry]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 13:36:57 GMT</pubDate>
            <atom:updated>2026-03-17T13:37:13.544Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fSnu_ROvvoMDQx0B.png" /></figure><h3>Introduction</h3><p>Many products actually need <strong>geometry capabilities</strong>.</p><p>For example:</p><ul><li>Online education platforms need to display geometric shapes in courses</li><li>Question bank systems need to create diagrams for math problems</li><li>AI Tutors need to draw diagrams dynamically when explaining problems</li><li>Lesson plan and courseware tools need to generate mathematical graphics</li></ul><p>But here’s the problem:</p><p><strong>A geometry canvas is actually a very complex software system.</strong></p><p>If you develop it yourself, you’ll quickly find yourself dealing with a pile of problems:</p><ul><li>Geometric object management (points, lines, circles, angles, curves)</li><li>Intersection calculation and constraint computation</li><li>Graphics rendering and drag-and-drop interaction</li><li>Multi-canvas management</li><li>File format and sharing system</li></ul><p>All these capabilities combined basically constitute a complete product.</p><p>The final choice for many teams is either to use <strong>static images</strong> or integrate an <strong>existing geometry system</strong>.</p><p>Recently, we did something interesting: <strong>We turned a geometry canvas into a component that can be directly embedded in webpages.</strong></p><p>Developers only need one line of code to put a complete geometry canvas into their own products.</p><h3>A Geometry Canvas That Can Be Embedded in Webpages</h3><p>The Dino-GSP（大角几何）Open Platform provides an <strong>embeddable geometry canvas SDK</strong>.</p><p>Developers can embed the geometry canvas into their own web applications just like using a frontend component.</p><p>The core concept is actually quite simple:</p><pre>Your webpage<br>   ↓<br>Embed geometry canvas<br>   ↓<br>Gain complete geometry capabilities</pre><p>This means:</p><ul><li>No need to develop your own geometry engine</li><li>No need to implement geometry calculations yourself</li><li>No need to write complex interaction logic yourself</li></ul><p>Just embed it and use it.</p><p>In the official capability design, Dino-GSP（大角几何）aims to become “geometry capability infrastructure”: through SDK, API, REPL, and other methods, making geometry capabilities embeddable in more products and systems.</p><h3>The Simplest Way: Direct Embedding</h3><p>If you just want to display a geometric figure, the simplest method is <strong>iframe embedding</strong>.</p><p>For example:</p><pre>&lt;iframe src=&quot;https://dajiaoai.com/e/33TA3484&quot; width=&quot;800&quot; height=&quot;600&quot; allow=&quot;fullscreen&quot;&gt;&lt;/iframe&gt;</pre><p>This way you can directly embed a geometry canvas into a webpage.</p><p>Suitable scenarios include:</p><ul><li>Displaying geometric figures on teaching pages</li><li>Embedding mathematical graphics in blog articles</li><li>Showing dynamic figures in online textbooks</li></ul><p>No additional development work required.</p><h3>Developer Approach: Using the SDK</h3><p>If you want deeper control over the canvas, such as:</p><ul><li>Dynamically loading graphics</li><li>Switching canvases</li><li>Importing files</li><li>Calling geometry operations</li></ul><p>You can use the <strong>SDK integration approach</strong>.</p><p>First, install the SDK:</p><pre>npm install @dajiaoai/algeo-sdk</pre><p>Then create a canvas on the page:</p><pre>import { AlgeoSdk } from 
&#39;@dajiaoai/algeo-sdk&#39;</pre><pre>const container = document.getElementById(&#39;algeo-container&#39;)</pre><pre>const sdk = await AlgeoSdk.create(container, {<br>  initialId: &#39;33TA3484&#39;<br>})</pre><p>This creates a geometry canvas instance.</p><p>You can then operate it through the API, for example:</p><p>Load shared content:</p><pre>await sdk.loadShareById(&#39;33TA3484&#39;)</pre><p>Get canvas count:</p><pre>const { count } = await sdk.getSlideCount()</pre><p>Switch canvas:</p><pre>await sdk.switchSlide(2)</pre><p>Developers can use the geometry canvas as a <strong>programmable component</strong>.</p><h3>A Very Interesting Capability: REPL</h3><p>In addition to regular APIs, Dino-GSP（大角几何）also provides a <strong>REPL interface</strong>.</p><p>Simply put, it means using commands to directly control the geometry system.</p><p>For example:</p><ul><li>Define geometric objects</li><li>Query graphic states</li><li>Execute geometry operations</li></ul><p>The REPL output is in structured text format, making it convenient for AI or Agent systems to call.</p><p>This means that in the future, not only humans can operate the canvas, <strong>but AI can also directly call geometry capabilities.</strong></p><p>This is why we call it: <strong>AI-native geometry capability interface.</strong></p><h3>Which Products Is This Suitable For?</h3><p>The embeddable geometry canvas is actually suitable for many products.</p><h3>1. Online Education Platforms</h3><p>Directly embed geometric figures in course pages, supporting drag-and-drop and dynamic demonstrations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Wq9jtVU_KpDWUWn0" /></figure><h3>2. Question Bank Systems</h3><p>Automatically generate or load geometric figures for math problems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GWOkmAixzhdC7eO2" /></figure><h3>3. AI Tutors</h3><p>Draw diagrams dynamically when explaining geometry problems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rmow1BCMrFuLn98j" /></figure><h3>4. Math Content Platforms</h3><p>Directly embed geometric figures in articles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*h2wJO0y_RHenbCHA" /></figure><h3>5. Independent Developer Tools</h3><p>Quickly build a math tool without developing your own geometry engine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MlUbeH2cDHKoFay9" /></figure><h3>Why We Built This Open Platform</h3><p>Over the past year, while working on the geometry system, I’ve had a deep realization: <strong>geometry capability is actually a fundamental capability for many products.</strong></p><p>But there aren’t many solutions available on the market currently — either complete software (like GeoGebra) or simple graphics libraries.</p><p>There’s a lack of a way <strong>to call geometry capabilities like an API.</strong></p><p>So what the Dino-GSP（大角几何）Open Platform hopes to do is enable more products to directly use geometry capabilities without having to reinvent the wheel.</p><p>👉 Dino-GSP（大角几何）Open Platform: <a href="https://open.dajiaoai.com/en/">open.dajiaoai.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c225edc7abcc" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AlphaGeometry2 Deep Dive: How Does Google AI Solve IMO Geometry Problems?]]></title>
            <link>https://luhuidev.medium.com/alphageometry2-deep-dive-how-does-google-ai-solve-imo-geometry-problems-2a8662f64014?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/2a8662f64014</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 12:40:07 GMT</pubDate>
            <atom:updated>2026-03-06T12:40:07.800Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>Lately I have been building products around AI for Math, not the kind of tools that just let a large model explain solutions, but systems more focused on structured geometric expression, constraint relations, diagram structure, and reasoning capability.</p><p><strong>As you work on this long enough, one thing becomes obvious: it is not hard to make AI talk about math, but it is very hard to make AI actually do math.</strong></p><p><strong>Geometry is especially hard.</strong></p><p>I can ask a model to explain why alternate interior angles are equal, but if I ask it to find the key auxiliary line on its own inside a complex diagram, it starts making things up.</p><p>When you look back at Google’s path from that angle, you realize they started addressing this systematically much earlier.</p><p>AlphaGeometry (AG1) was the first generation.</p><p>AlphaGeometry2 (AG2) is the upgraded version.</p><p>If you look at the two generations together, they read like a very clear field report on what breaks inside a math AI system.</p><p>I wrote this essay because I want to stand on Google’s shoulders and see:</p><ul><li>where they got stuck</li><li>how they solved it</li><li>which parts are engineering problems</li><li>which parts are cognitive misunderstandings</li></ul><h3>1. First-Generation AlphaGeometry: What Did It Actually Solve?</h3><p>The core idea behind AG1 was actually pretty pragmatic: <strong>do not expect a large model to prove geometry problems by itself. Let the model do the “guessing,” let the symbolic system do the “reasoning,” and let search do the “finding.”</strong></p><p>Concretely:</p><ul><li>the LLM proposes auxiliary constructions</li><li>the DDAR symbolic engine handles angle and ratio reasoning</li><li>the search system traverses possible paths</li></ul><p>This architecture is highly rational. It admits that language models are not good at rigorous deduction, but they are good at pattern matching and generating constructions that might be useful. The actual proof work is handed over to the symbolic system.</p><p>The key component here is the <strong>DDAR reasoning engine</strong>.</p><p>DDAR stands for Deductive Database of Angle and Ratio.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2a8662f64014" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Google DeepMind Aletheia: A Deep Dive into a Fully Autonomous Math Research Agent]]></title>
            <link>https://luhuidev.medium.com/google-deepmind-aletheia-a-deep-dive-into-a-fully-autonomous-math-research-agent-ec36c258aa09?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec36c258aa09</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[google]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 16:27:53 GMT</pubDate>
            <atom:updated>2026-02-25T16:27:53.539Z</atom:updated>
            <content:encoded><![CDATA[<p>Google DeepMind Aletheia leads the IMO-ProofBench Advanced benchmark with an impressive <strong>~91.9%</strong> score.</p><p>It also significantly outperforms baseline systems on hard USAMO 2025 problems. On harder internal benchmarks, it surpasses earlier reasoning models as well; while gaps still remain, it is clearly ahead of prior baselines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vo_SH03lSDN1vCI0.png" /></figure><p>Recent discussion around Aletheia feels familiar.</p><p>Headlines say “AI mathematician,” and comments ask: “Is it going to replace mathematicians? Can it already do autonomous research?”</p><p>I carefully reviewed the Aletheia paper and dataset, then organized the key architecture and practical implications I learned. That is exactly what this essay is about.</p><h3>1) The Road to DeepMind Aletheia</h3><p>Looking at the timeline, it is clear that Google DeepMind has been building toward this for a long time.</p><p>When <strong>AlphaGo</strong> appeared in 2016, the underlying question was already there: <strong>How do you optimize decision trajectories inside a system with complete rules and a clear evaluation function?</strong></p><p>A board game is discrete, outcomes are decidable, and the search space is huge but structured. That is an ideal environment for strategy optimization.</p><p>DeepMind’s “neural networks + search” was never just about Go. It tested a broader hypothesis: if a problem can be strictly described and each step can be judged as correct or incorrect, “talent” can be partially replaced by computation.</p><p>With <strong>AlphaGeometry</strong> in 2024, the question shifted: <strong>Can mathematical reasoning also be placed inside a rule-closed system like this?</strong></p><p>AlphaGeometry’s key design:</p><ul><li>LLM proposes auxiliary construction candidates</li><li>A symbolic geometry system verifies constraints</li><li>Search handles backtracking and expansion</li></ul><p>For the first time in this context, the LLM does not decide truth; it proposes possibilities, and structural systems guarantee logical validity.</p><p>That transition matters: Google had begun to place math reasoning inside a verifiable loop.</p><p>In late 2024, <strong>AlphaProof</strong> moved the battlefield into formal systems such as Lean. 
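</p><p>If you have never used a proof assistant, this is what “machine-checkable” looks like at its most minimal. The snippet below is purely illustrative Lean 4 and has nothing to do with AlphaProof itself; the point is that the checker either accepts it or rejects it, with no room for vague language:</p><pre>-- Illustrative only: a trivial statement the Lean 4 checker accepts.<br>example (a b : Nat) : a + b = b + a := Nat.add_comm a b</pre><p>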
The question became: <strong>If geometry can be structured, can all of mathematics be formalized to machine level?</strong></p><p>By entering Lean-like systems, AlphaProof sharply narrows expression:</p><ul><li>every step must be machine-checkable,</li><li>the type system imposes strong constraints,</li><li>vague language no longer works,</li><li>proofs are not “plausible”; they must pass verification.</li></ul><p>It also adds reinforcement learning to optimize strategy paths, so the system does not just write proofs; it learns tactic selection, goal decomposition, and branch value estimation.</p><p>From this point on, DeepMind’s direction becomes explicit: turn mathematical behavior into a schedulable search problem.</p><p><strong>Aletheia is the continuation of that path, and currently its strongest result.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LWE0SN2psKjHAKfr.png" /></figure><h3>2) What Is Actually Worth Discussing About Aletheia</h3><p>Saying “it can autonomously propose and prove conjectures” is still too shallow.</p><p>The hardest core of Aletheia has three parts: <strong>closed loop, structure, and scheduling</strong>.</p><p>If that loop runs stably, mathematical research truly begins to move beyond purely human time scales.</p><h3>2.1 It really built a runnable research loop</h3><p>Most “math AI” systems are still input problem -&gt; output answer.</p><p>Aletheia looks more like a laboratory pipeline. A minimal loop looks like this:</p><ul><li><strong>Conjecture proposal</strong>: generate statements from existing theory, failed paths, or structural patterns</li><li><strong>Proof attempt</strong>: generate drafts, choose lemmas, decompose goals</li><li><strong>Formal verification</strong>: send to a proof assistant; accepted results are stored, failures return hard errors</li><li><strong>Error-driven repair</strong>: rollback, add lemmas, change decompositions, rewrite conjectures</li><li><strong>Knowledge/strategy update</strong>: feed newly proved theorems/lemmas back into the system for the next round of generation/search</li></ul><p>The key is that failure is not “weak quality”; it is a <strong>hard error signal</strong>. 
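</p><p>As a structural sketch, the loop fits in a few lines of Python-style pseudocode. Every function name below is a hypothetical stand-in rather than an actual Aletheia interface; the shape of the loop, not the names, is the point:</p><pre>def research_loop(knowledge, rounds=10, repairs=3):<br>    for _ in range(rounds):<br>        conjecture = propose_conjecture(knowledge)    # LLM proposes a candidate statement<br>        proof = attempt_proof(conjecture, knowledge)  # draft proof, lemma choices, goal decomposition<br>        for _ in range(repairs):<br>            result = formally_verify(proof)           # proof assistant: accept, or return a hard error<br>            if result.ok:<br>                knowledge.add(conjecture, proof)      # accepted results feed later rounds<br>                break<br>            proof = repair(proof, result.error)       # rollback, add lemmas, change the decomposition<br>    return knowledge</pre><p>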
That makes the feedback loop engineering-grade.</p><p>You can think of it as LLM for creative candidate generation, while formal systems decide whether anything actually hits the target.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wDDIzx_SB8xO98XY.png" /></figure><h3>2.2 The core is not the model; it is IR and verification interfaces</h3><p>When people see a new SOTA math result, the default reaction is often “bigger and stronger model again.”</p><p><strong>But in formal math, system ceilings are usually set less by parameter count and more by representation: how do you encode a theorem or proof state, map “ideas” into checkable syntax trees, and turn proof-assistant feedback into learnable signals?</strong></p><p>In that sense, Aletheia is closer to a “math compiler + debugger + search engine” stack.</p><p>This requires a heavy middle layer:</p><ul><li><strong>Theorem Graph / Lemma Graph</strong>: dependency graph across results</li><li><strong>Goal-state representation</strong>: structured encoding of current proof state (goal, hypotheses, type constraints)</li><li><strong>Tactic/step representation</strong>: executable action space for proving (similar to AlphaProof action spaces)</li></ul><p>Without this, even a strong model can only produce “essay-style proofs” that do not land in formal systems.</p><h3>2.3 Why the engineering meaning is bigger than the score</h3><p>Scores are outcomes. Engineering value is reusable method.</p><p>If Aletheia really has those three layers, then:</p><ul><li>mathematical research can be decomposed into an <strong>action space + feedback + policy optimization</strong> paradigm,</li><li>formal systems move <strong>correctness</strong> from human review to machine adjudication,</li><li>LLMs move from judge to candidate generator, reducing hallucination blast radius.</li></ul><p>The value of this route is that it turns “research” from an abstract human process into something software systems can implement.</p><p>Put differently: <strong>it gives research a CI/CD-like pipeline flavor — propose, verify, fail, repair, merge.</strong></p><h3>3) What Happens After Research Behavior Is Engineered?</h3><p>One long-standing bottleneck in mathematics is verification cost.</p><p><strong>Complex proofs can take months or years for peer confirmation. Human review time is scarce.</strong></p><p>Formal systems shift correctness from human judgment to machine judgment. Once verification is no longer the bottleneck, generation speed becomes the main variable.</p><p>Imagine a system that expands theorem graphs daily, outputs large volumes of intermediate lemmas, and continuously reorganizes dependencies.</p><p>It may not solve grand open problems immediately, but it will keep filling theoretical space.</p><p>What changes under scaled research output?</p><p>My guess: first, rhythm.</p><p>The field’s rhythm has long been constrained by human verification throughput. If verification is machine-managed, <strong>the speed of theoretical expansion will rise sharply</strong>. 
Then the scarce resource is no longer proving ability, but problem selection and theory organization.</p><p><strong>When proposition-generation speed exceeds human reading speed, disciplinary rhythm breaks.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YflaoxZ3OJhimNxy.png" /></figure><h3>Final Notes</h3><p>If you are also working in education + AI, I am fairly confident about this: purely text-based “solution explanation” AI products will become harder to sustain.</p><p>As formal systems integrate and verification gets standardized, products that only explain steps will be pushed to the margins.</p><p>Future defensibility will likely require three things:</p><ol><li><strong>A structural intermediate layer</strong>: not just text output, but executable objects</li><li><strong>Built-in verification</strong>: machine checking as a default capability</li><li><strong>Exploration mode support</strong>: let learners propose conjectures, test hypotheses, and observe failure feedback</li></ol><p>Teaching systems will increasingly resemble small theorem environments, not chat-only bots.</p><p>This path is not easy. It likely requires at least DSL or formal representation capabilities, plus executable constraint systems and interfaces to provers/verification engines.</p><p>But if the Aletheia direction continues, this is likely a long-term trend.</p><h3>Further Reading</h3><ul><li>Google DeepMind. <a href="https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think"><strong>Accelerating Mathematical and Scientific Discovery with Gemini Deep Think.</strong></a> Official Blog Post, 2026.</li><li>Google DeepMind. <a href="https://deepmind.google/blog/alphageometry-an-olympiad-level-ai-system-for-geometry"><strong>AlphaGeometry: An Olympiad-Level AI System for Geometry.</strong></a> Official Blog Post, 2024.</li><li>Google DeepMind. <a href="https://arxiv.org/abs/2602.10177"><strong>Towards Autonomous Mathematical Research.</strong></a> arXiv preprint, 2026.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ec36c258aa09" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[HKU CodePlot-CoT Deep Dive: Visual Reasoning or Geometric Reasoning?]]></title>
            <link>https://luhuidev.medium.com/hku-codeplot-cot-deep-dive-visual-reasoning-or-geometric-reasoning-92e1a61be8cd?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/92e1a61be8cd</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 02:01:01 GMT</pubDate>
            <atom:updated>2026-02-18T02:01:01.305Z</atom:updated>
            <content:encoded><![CDATA[<h3>Preface</h3><p>In my previous analysis of <strong>MathCanvas</strong>, I argued:</p><blockquote><em>The instability of LLMs in geometry is not because they cannot see the diagram — <br>it’s because they lack a stable intermediate structure to operate on.</em></blockquote><p>Some recent work tries to solve this by making models <em>draw before thinking</em>.</p><p>MathCanvas lets the model generate internal sketches and reason over them.</p><p>After publishing that article, a reader asked:</p><blockquote><em>If visual intermediate states matter so much, why not let the model actually draw the diagram?</em></blockquote><p>It turns out someone already tried that.</p><p>HKU’s <strong>CodePlot-CoT</strong> does exactly this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sZ1Ao2_mViMp2Irscr_fzw.png" /></figure><p>Instead of imagining auxiliary lines, the model writes Python matplotlib code, renders the diagram, then continues solving.</p><p>Sounds reasonable:<br>if visual reasoning is unstable, give the model an executable world.</p><p>But this raises a deeper question:</p><blockquote><em>When the model writes plotting code, is it doing geometry — or merely testing a numerical example?</em></blockquote><p>To answer that, we need to understand what problem the paper is really addressing.</p><h3>What Problem CodePlot-CoT Actually Solves</h3><p>CodePlot-CoT targets a fundamental phenomenon:</p><blockquote><em>Multimodal models have unstable spatial working memory in math problems.</em></blockquote><p>Concretely, the model may understand the question and even produce a correct reasoning chain — <br>but once the reasoning depends on diagram state, it drifts.</p><p>Typical failures:</p><ul><li>Auxiliary lines change across steps</li><li>Spatial relations are forgotten</li><li>Later reasoning depends on structures that never existed</li></ul><p>MathCanvas responds by creating an <strong>internal visual CoT</strong>.</p><p>CodePlot-CoT chooses a different path:</p><blockquote><em>Instead of imagining a diagram, let the model manipulate a real one.</em></blockquote><p>The diagram is outsourced to Python.</p><h3>The Core Technique: The Model Writes Matplotlib</h3><p>In the paper, when the model reasons:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UfYGEA4cKuTwUPBL8SzeQg.png" /></figure><blockquote><em>“Connect C and D”</em></blockquote><p>it doesn’t continue in natural language.</p><p>It outputs:</p><pre>ax.plot([C[0], D[0]], [C[1], D[1]])</pre><p>The pipeline becomes:</p><blockquote><em>text reasoning → generate plotting code → render image → feed back → continue reasoning</em></blockquote><p>The model’s intermediate thoughts no longer live in tokens or latent space — they live in an executable environment.</p><p>This brings immediate benefits:</p><h3>1. Stable spatial state</h3><p>The model relies on the environment instead of memory (agent-style tool use)</p><h3>2. Visual consistency</h3><p>Typical multimodal “diagram drift” disappears</p><h3>3. 
Scalable supervision</h3><p>The paper constructs the Math-VR dataset (178K problems) where<br>diagram → code → reasoning becomes supervision signal</p><p>This is a classic computer-vision approach:<br>don’t make the model imagine the world — give it a world.</p><p>So far, the idea is elegant.</p><p>But it has a very direct limitation.</p><h3>Does the Model Actually Understand Geometry?</h3><p>Consider the code:</p><pre>ax.plot([C[0], D[0]], [C[1], D[1]])</pre><p>This means: draw a line segment in coordinates.</p><p>But in geometry, “connect CD” is not drawing — it is a <strong>construction</strong>.</p><p>CD could be:</p><ul><li>a chord</li><li>an angle bisector</li><li>a perpendicular</li><li>a radical axis</li><li>a locus constraint</li></ul><p>These are the sources of reasoning.</p><p>Matplotlib describes appearance.<br>Geometry requires relational constraints.</p><p>So the intermediate structure is still:</p><blockquote><em>looks correct, not logically necessary</em></blockquote><h3>Loss of Causality</h3><p>Geometry depends on <em>why</em>, not <em>whether</em>.</p><p>Construct angle bisector → angles equal<br>This is logical implication.</p><p>But in a rendered diagram it becomes:</p><p>measured angles ≈ equal</p><p>These are fundamentally different:</p><table><tr><th>Type</th><th>Nature</th></tr><tr><td>Geometric construction</td><td>Necessary</td></tr><tr><td>Numerical instance</td><td>Accidental</td></tr></table><p>CodePlot-CoT reasoning is:</p><blockquote><em>generate a coordinate instance → inspect → conclude</em></blockquote><p>Mathematically this is single-instance verification.</p><p>Geometry theorems require validity across all configurations.</p><h3>Visual Verification vs Mathematical Proof</h3><p>We can abstract two paradigms:</p><table><tr><th></th><th>CodePlot-CoT</th><th>Geometry</th></tr><tr><td>Evidence</td><td>Looks correct</td><td>Must be correct</td></tr><tr><td>Method</td><td>Experiment</td><td>Deduction</td></tr><tr><td>Nature</td><td>Empirical reasoning</td><td>Formal reasoning</td></tr></table><p>CodePlot-CoT is essentially a <strong>geometry experiment AI</strong>, not a proof AI.</p><p>It answers:<br>“Does this diagram support the conclusion?”</p><p>Not:<br>“Is the conclusion derivable in the system?”</p><h3>Why It Can’t Reach AlphaGeometry</h3><p>HKU gives an executable diagram.<br>Google <strong>AlphaGeometry</strong> gives a derivable proof.</p><p>Between them lies a missing layer:</p><blockquote><em>geometric objects themselves</em></blockquote><p>Questions that matter:</p><ul><li>Is a point free or constructed?</li><li>Is a line a bisector or arbitrary?</li><li>Is a circle defined by three points or sampled?</li></ul><p>This is not vision — it is mathematical modeling.</p><p>We can place approaches on an axis:</p><blockquote><em>image → rendering code → geometric objects → logical proof</em></blockquote><p>CodePlot-CoT stops at layer 2<br>AlphaGeometry operates at layer 4</p><p>Humans mostly work at layer 3: construction.</p><p>We don’t start with proofs, and we don’t just look at pictures.</p><p>We manipulate objects:</p><p>draw perpendiculars<br>take intersections<br>construct circles<br>create constraints</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rUN3rAbJFXpoylkeUNFdlA.jpeg" /></figure><h3>Closing Thoughts</h3><p>In my own project <a href="https://dajiaoai.com/?inviter=685b4c2009d5753784ebe7df"><strong>Dino-GSP</strong></a>, I’m trying to isolate exactly this layer.</p><p>Not drawing, not proving — operating on geometric objects.</p><p>The model outputs neither:</p><pre>ax.plot(...)</pre><p>nor:</p><pre>Therefore AB ⟂ CD</pre><p>Instead:</p><pre>PerpLine(&lt;Line&gt;, 
&lt;Point&gt;)<br>Intersection(Circle(O,2), Line(2,3))</pre><p>Once the intermediate structure becomes constraints:</p><ul><li>diagrams become stable</li><li>relations become verifiable</li><li>reasoning becomes chainable</li></ul><p>CodePlot-CoT matters not because it solves geometry — <br>but because it proves visual reasoning needs an external workspace.</p><p>LLMs don’t fail math because they can’t reason.</p><p>They fail because they have nowhere to build a mathematical world.</p><p>CodePlot-CoT provides a workspace.<br>It’s just not yet a mathematical one.</p><p>Geometry likely requires a manipulable geometric language.</p><h3>References</h3><p>CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images<br><a href="https://arxiv.org/abs/2510.11718">https://arxiv.org/abs/2510.11718</a></p><p>AlphaGeometry: An Olympiad-level AI system for geometry<br><a href="https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/">https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=92e1a61be8cd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A New Direction for AI for Math: What MathCanvas Actually Changes]]></title>
            <link>https://luhuidev.medium.com/a-new-direction-for-ai-for-math-what-mathcanvas-actually-changes-71053c9c874c?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/71053c9c874c</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 08:59:48 GMT</pubDate>
            <atom:updated>2026-02-13T08:59:48.583Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>Over the past two years, progress in AI math solving has been striking.</p><p>GSM8K is nearly saturated.<br>Algebra problems are stable.<br>Competition benchmarks keep getting refreshed.</p><p>So a natural assumption emerged:</p><blockquote><em>Larger models + longer chains of thought → better math ability.</em></blockquote><p>But working on a geometry product for a long time reveals something very different.</p><p><strong>Large models are still highly unstable on geometric construction problems.</strong></p><p>In a simple isosceles triangle proof, a model may produce a perfectly standard reasoning outline — yet forget to construct the auxiliary line that the logic depends on.</p><p>The failure is not numerical.<br>It is structural.</p><ul><li>The angle bisector should be constructed — but isn’t</li><li>The perpendicular should be dropped — but is missing</li><li>The diagram looks plausible but violates constraints</li><li>The generated figure cannot support the reasoning</li></ul><p>Geometry has never primarily been about language understanding.</p><p>It is about <strong>constructing structure</strong>.</p><p>After reading <em>MathCanvas</em>, this became very clear to me:</p><blockquote><em>The breakthrough for AI geometry lies in the intermediate structure.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aFm9RKn-FQzF7J-0hYO64A.png" /></figure><p>MathCanvas is not a visual enhancement paper.<br>It incorporates <strong>construction behavior itself into the reasoning process</strong>.</p><p>This article does three things:</p><ol><li>Deconstruct the architecture and training logic of MathCanvas</li><li>Analyze its real technical contribution</li><li>Explain why I believe this points to the future</li></ol><h3>The Core Idea of MathCanvas: Turning Drawing into a Reasoning Operator</h3><p>MathCanvas does not propose another Visual Chain-of-Thought.</p><p>It proposes:</p><blockquote><strong><em>Intrinsic Visual Chain-of-Thought</em></strong></blockquote><p>The key word is <em>intrinsic</em>.</p><p>Traditional multimodal reasoning works like this:</p><blockquote><em>read image → explain → answer</em></blockquote><p>The image is only an input feature.</p><p>MathCanvas changes something fundamental:</p><p>The model can actively generate diagrams during reasoning, and those diagrams become conditions for subsequent reasoning steps.</p><p>After each reasoning segment, the model must decide whether to draw.</p><p><strong>Drawing becomes a learnable strategic action.</strong></p><p>This matters because geometric reasoning actually works like this:</p><blockquote><em>write a few steps → get stuck → add an auxiliary line → reasoning resumes</em></blockquote><p>MathCanvas models this behavior inside the inference process.</p><h3>Technical Breakdown</h3><p>The real contribution of the paper lies in the training pipeline.</p><h3>Stage I — Visual Manipulation</h3><p><strong>Goal: teach the model how to construct</strong></p><p>They train on:</p><ul><li>10M caption → diagram pairs</li><li>5.2M step-by-step editing trajectories</li></ul><p>The key is not producing a final diagram — <br>it is learning <em>incremental edits</em>.</p><p>The model learns to:</p><ul><li>build primitives</li><li>add auxiliary lines</li><li>modify geometry</li><li>maintain consistency</li></ul><p>During this phase, the reasoning pathway is frozen.<br>Only the generation ability is trained.</p><p>This avoids damaging existing reasoning 
capability.</p><h3>Stage II — Strategic Visual-Aided Reasoning</h3><p>This is the core stage.</p><p>After each text generation segment, the model predicts whether to emit a &lt;vision_start&gt; token. Drawing becomes part of next-token prediction.</p><p>The model learns:</p><ul><li>when to construct</li><li>what to construct</li><li>how reasoning changes after construction</li></ul><p>Loss = text CE + image flow loss</p><p>The diagram becomes an intermediate state in the reasoning chain.</p><h3>The Real Moat: Data Design</h3><p>MathCanvas builds a geometry primitive + relation system:</p><ul><li>geometric objects</li><li>constructive relations (bisector, perpendicular, circumcenter, tangent, parallel…)</li></ul><p>Large-scale editing trajectories are synthesized automatically.</p><p>This is effectively an <strong>implicit geometry DSL</strong>, rendered as images for training.</p><p>Ablations show removing edit trajectories significantly hurts performance.</p><p>Which suggests the model learns not just geometry — <br>it learns the <em>rhythm of construction</em>.</p><h3>What Actually Matters in This Paper</h3><h3>1. Structure operations enter the reasoning chain</h3><p>This is a shift from language CoT → structural CoT.</p><h3>2. Intermediate visual states stabilize reasoning</h3><p>Performance improves across planar and spatial geometry tasks.</p><p>The model is not just explaining better — it is reasoning differently.</p><h3>3. Scalable synthetic data generation</h3><p>Primitive + relation generation is more valuable than manual annotation.</p><h3>4. Limitations remain</h3><ul><li>constraints are implicit in pixels</li><li>correctness cannot be formally verified</li><li>structures cannot be exported</li><li>editing is not persistent</li></ul><h3>Why This Points to the Future</h3><p>After reading MathCanvas, my conclusion strengthened:</p><blockquote><em>The future of AI geometry lies in operable intermediate representations.</em></blockquote><p>Visual state is one form.<br>Constraint graphs are another.<br>DSLs are another.</p><p>The format does not matter.</p><p>What matters is whether the intermediate state is:</p><ul><li>constructible</li><li>persistent</li><li>monitorable</li><li>reusable</li></ul><h3>Where Dino-GSP Fits</h3><p>The system I’m building — <strong>Dino-GSP (Dynamic Geometry System)</strong> — follows this pipeline:</p><blockquote><em>natural language → geometry DSL → constraint graph → rendering → continuous editing</em></blockquote><p>Every object has dependency relations.<br>Every construction step is traceable.<br>Edits update constraints.<br>Structures can be verified and exported.</p><p>It is a white-box geometry system.</p><p>The alignment with MathCanvas is clear:</p><table><tr><th>System</th><th>Intermediate State</th></tr><tr><td>MathCanvas</td><td>visual intermediate state</td></tr><tr><td>Dino-GSP</td><td>executable constraint state</td></tr></table><p>If MathCanvas proves construction behavior improves reasoning,</p><p>Dino-GSP attempts to turn construction into a computable system.</p><p>You can use it at <em>dajiaoai.com</em>.</p><h3>The Three-Layer Future Architecture</h3><p>I increasingly believe future systems will contain three layers:</p><ol><li><strong>Language planning layer</strong> — thinking</li><li><strong>Structural construction layer</strong> — constraint</li><li><strong>Visual feedback layer</strong> — perception</li></ol><p>Only when these connect does AI truly gain geometric reasoning ability.</p><p>MathCanvas is not a benchmark paper.<br>It is a direction paper.</p><p>And one conclusion feels unavoidable:</p><blockquote><em>If the 
intermediate state is only an image, the system cannot become engineering-grade.</em></blockquote><p>The next stage of AI for Math is not smarter models.</p><p>It is models that can construct.</p><p>And in geometry — construction determines everything.</p><h3>References</h3><ul><li>MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning<br><a href="https://arxiv.org/abs/2510.14958">https://arxiv.org/abs/2510.14958</a></li><li>AlphaGeometry: Solving Olympiad Geometry without Human Demonstrations<br><a href="https://arxiv.org/abs/2401.05492">https://arxiv.org/abs/2401.05492</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=71053c9c874c" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>