<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Luhui Dev on Medium]]></title>
        <description><![CDATA[Stories by Luhui Dev on Medium]]></description>
        <link>https://medium.com/@luhuidev?source=rss-65692e5d4fe5------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*2VWrKuJMC_TasMl7mCZlxg.jpeg</url>
            <title>Stories by Luhui Dev on Medium</title>
            <link>https://medium.com/@luhuidev?source=rss-65692e5d4fe5------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 10 May 2026 15:08:38 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@luhuidev/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Dino-GSP Major Update: Algeo SDK 2.0 embedded editing mode is now available]]></title>
            <link>https://luhuidev.medium.com/dino-gsp-major-update-algeo-sdk-2-0-embedded-editing-mode-is-now-available-79cbadafb57e?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/79cbadafb57e</guid>
            <category><![CDATA[luhuidev]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[math]]></category>
            <category><![CDATA[dino-gsp]]></category>
            <category><![CDATA[ai-tools]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Sun, 10 May 2026 15:08:21 GMT</pubDate>
            <atom:updated>2026-05-10T15:08:21.890Z</atom:updated>
            <content:encoded><![CDATA[<p>Videos can be embedded. Documents can be embedded. Spreadsheets can be embedded.</p><p>But what about <strong>geometry</strong>?</p><p>For the past decade, whenever a product needed users to draw a geometry problem, edit a dynamic figure, or save an interactive geometry asset, the workflow usually broke in the same place: leave the product, use a separate tool, take a screenshot, and paste it back. That fractured workflow has sat in the middle of education platforms, teaching research systems, and AI math products for years.</p><p>Today, <a href="https://open.dajiaoai.com/?utm_source=luhuidev"><strong>Algeo SDK 2.0 embedded editing mode</strong></a> is officially available. Geometry is no longer the missing embeddable format. It can now live inside your product like a standard component, with data flowing back into your business system, UI matching your product design, and permissions staying under your own control.</p><p>Here are five common scenarios we see. If any of them sounds like your product, this release is worth a closer look.</p><h3>Scenario 1: online education platforms can let teachers create geometry problems in place</h3><p>A high school math teacher is preparing tomorrow’s geometry lesson on your platform. She needs an example problem about angle proofs in a circle.</p><p><strong>Before</strong>: she opened a separate geometry tool, finished the diagram, took a screenshot, and pasted it back into your question bank. The text lived in one place and the image in another. Students saw a static picture that could not be dragged, edited, or reused after the test.</p><p><strong>Now</strong>: she clicks “insert geometry board” in your question bank admin, and the Algeo editor opens in place. Circles, points, and auxiliary lines are created in the same workflow. When she saves, the board data enters your question bank and is bound to her account, school, and textbook chapter.</p><p>When students open the problem, they can drag a point on the circle and see the angle change directly. Throughout the whole process, <strong>your product stays in control</strong>: the data is yours, the permissions are yours, the content rights are yours, and the user behavior logs are yours.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Mh8ZkGV4MpkG57aSZCjB1w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ojhfMilCMqDqVt9WyCrqyQ.png" /></figure><h3>Scenario 2: AI math products can let AI and students work on the same board</h3><p>This is one of the fastest-growing customer categories we have seen over the past year.</p><p>A student uploads a photo of a geometry problem. Your AI parses the problem and generates a solution path. But text alone is not enough. The student needs to <strong>see</strong> why an auxiliary line is drawn that way, and needs to <strong>test by hand</strong> whether an equality still holds when a point starts moving.</p><p>Algeo embedded editing closes that loop for the first time:</p><ul><li>After AI parsing, code can generate board content and load it into the editor automatically</li><li>Students interact directly inside your product by dragging, modifying, and trying alternatives</li><li>Every student edit can be sent back to your system as an event and used in the next AI analysis round</li><li>AI can respond to the student’s specific change instead of giving generic explanation</li></ul><p>Education is a <strong>feedback loop</strong>. 
Text plus static diagrams can no longer carry that loop for geometry. The missing piece is a board that can be driven by code while still giving students hands-on control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N2ClY_HU-sjQPTateyJQrw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vPfkQRr7frdsfGCrBYB2Mg.png" /></figure><h3>Scenario 3: educational publishing can turn geometry assets into a managed production workflow</h3><p>In many publishing workflows, geometry illustrations used to operate like a separate workshop: an author drew the figure, a designer remade it as vector art, an editor reviewed it, and a layout designer processed it again. One geometry asset for one problem could pass through four tools and five people.</p><p>After embedding Algeo into a content management system, that pipeline becomes much flatter:</p><ul><li>Authors write problems and draw figures directly in the CMS, with assets stored as structured geometry data rather than images</li><li>Editors can open the original board and revise it directly instead of asking the author to recreate it</li><li>The same geometry data can export to PDF, web, print, and interactive courseware: <strong>draw once, reuse everywhere</strong></li><li>Version control stays inside the CMS, so geometry boards stop being external unmanaged files</li></ul><p>For content organizations, this is not just about saving one tool. It is about turning geometry into a managed asset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f1wTmv_jY5t7EwDUh1Y3fw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2tQMNnCdGZUFssS6fzWkmQ.png" /></figure><h3>Scenario 4: schools and institutions can finally build a shared geometry asset library</h3><p>Teaching research has an old pain point: Chinese language groups have material libraries, English groups have corpora, math teams have question banks, but <strong>geometry</strong> often remains scattered. Every teacher has dozens of local geometry source files. They leave with the teacher, disappear with an old computer, and are hard for new teachers to inherit.</p><p>When an institution embeds Algeo into its collaborative teaching research platform:</p><ul><li>Geometry assets enter the institutional asset library and can be organized by subject, grade, and knowledge point</li><li>Teachers can remix the same board while keeping a complete revision history</li><li>New teachers can receive accumulated geometry resources on day one</li><li>Permissions and approvals follow the institution’s own rules, including what can be shared broadly and what stays inside a subject group</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*92TQi0n_rYFYqmKFJB5ohw.png" /></figure><h3>Scenario 5: question banks and homework systems can make geometry a first-class format</h3><p>Many question bank systems have structured templates for multiple choice, fill-in-the-blank, and written-response questions. <strong>Geometry is often still just an image</strong>. 
That creates three limits:</p><ul><li>Similar-question recommendation is weak because the system cannot tell whether two geometry problems share the same mathematical structure</li><li>Fine-grained grading is hard because the student’s answer often comes back as another image</li><li>Learning analytics are shallow because the system cannot see which construction step caused the student to get stuck</li></ul><p>Once Algeo turns geometry problems into structured data, these workflows become possible:</p><ul><li>Both the problem and the solving process are structured, so the question bank can handle geometry more like algebra</li><li>Every student operation can be reported back, allowing the grading system to locate which point was moved at which step</li><li>Learning analytics can tell a teacher that 70% of a class did not think to draw a specific auxiliary line</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nP-Lxqhql9lA8PhNTPQ-sg.png" /></figure><h3>What is ready at the technical level</h3><p>The scenarios are compelling, but production adoption is always an engineering problem. Algeo SDK 2.0 is designed to be production-ready in several core areas.</p><h3>Bidirectional communication with clear data ownership</h3><p>Every edit, board switch, and save request can be sent back to the host application through postMessage. <strong>You control the save button</strong>. The iframe does not bypass your business system to persist anything directly. When to save, where to save, and which permissions are required are all decided by your backend. The SDK only maintains the UI state for saved and unsaved changes.</p><h3>Fully configurable UI that fits into your product</h3><p>The navigation bar, board list, toolbox, algebra panel, and document panel can each be toggled independently at runtime. In an AI-assisted scenario, the editor can be reduced to a clean canvas. In a professional authoring scenario, the full toolchain can be shown. In advanced integrations, you can even <strong>replace our board list with your own UI</strong> and drive it through the SDK capability APIs.</p><h3>Engineered capability layers</h3><p>The SDK separates editor capabilities into four clear units: board file document, multi-board slides, history, and display mode. Each unit can be called independently, which also gives us room to improve each one over time without breaking the others.</p><h3>Versioned protocol for long-term evolution</h3><p>Every handshake between the SDK and iframe carries a protocol version. That means an integration you build today can continue to work after future upgrades, while still allowing us to deliver new capabilities without asking you to rewrite the integration every time.</p><h3>Production-oriented robustness</h3><p>The SDK includes a 30-second initialization timeout, standardized error codes, a clean destroy lifecycle, and self-hosted base URL support through baseUrl. These details matter when a real product faces network jitter, CSP rules, and complex route changes in single-page applications. We have already validated the approach in multiple production customer environments.</p><h3>Why choose Dino-GSP and Algeo</h3><p>There are very few teams in China that can build a <strong>dynamic geometry</strong> editor at this level. We spent a year making it production-ready, then another release cycle turning it from a product into a component. 
Geometry as a category really opens up only when it can be installed inside any product.</p><p>If your product contains the word “geometry”, whether in K12, higher education, AI math, educational publishing, or teaching research, we would be glad to talk.</p><p>Docs: <a href="https://open.dajiaoai.com/?utm_source=luhuidev">open.dajiaoai.com</a></p><p>Repository: <a href="https://github.com/dajiaoai/algeo-sdk">github.com/dajiaoai/algeo-sdk</a></p><p>Put a geometry board inside your product, starting today.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=79cbadafb57e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AHE Deep Dive: How Coding Agent Harnesses Automatically Evolve]]></title>
            <link>https://luhuidev.medium.com/ahe-deep-dive-how-coding-agent-harnesses-automatically-evolve-a0736ae5594c?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/a0736ae5594c</guid>
            <category><![CDATA[luhuidev]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Mon, 04 May 2026 14:49:21 GMT</pubDate>
            <atom:updated>2026-05-04T14:49:21.674Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*uTzTIPRH49Kg-vWF" /></figure><p>When building a coding agent, the capability of your base model is only part of the equation. In real production scenarios, what matters just as much is the <strong>harness</strong> wrapped around that model — the prompt, tools, middleware, memory, execution environment, trace, and evaluation pipeline.</p><p>This is exactly what the AHE paper addresses: <strong>how to make a coding agent’s harness continuously observable, modifiable, testable, rollback-able, and even self-iterating — just like software engineering.</strong></p><p>The full paper title is <strong>“Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses”</strong>, authored by researchers from Fudan University, Peking University, and Shanghai Qiji Zhifeng Co., Ltd. The academic teams bring methodological design, while the industry team contributes experience from Agent/LLM infrastructure and Nex AGI systems.</p><p>Even better, AHE is open source: china-qijizhifeng/agentic-harness-engineering.</p><p>This makes it more than just a paper concept — you can directly examine the seed coding agent, evolve agent, experiment configs, traces, manifests, and rollback structures. For anyone building coding agents, agent infrastructure, or broader agent products, this repository is worth dissecting.</p><p>This article explores three questions: why AHE works, how it evolves harnesses, and how to start your own small experiment with the repository.</p><h3>Part 1: A Quick Intro to Harness Engineering</h3><p>A harness is the external engineering shell that makes a model actually work. In a coding agent, it typically includes:</p><ul><li><strong>System prompt</strong>: defines the agent’s basic working mode</li><li><strong>Tools</strong>: file I/O, shell, search, test execution, code modification, etc.</li><li><strong>Tool descriptions</strong>: what the model sees about tool usage and parameter schemas</li><li><strong>Middleware</strong>: interception, validation, correction, and logging before/after tool calls</li><li><strong>Memory</strong>: short-term, long-term, and experience accumulation</li><li><strong>Context management</strong>: compression, pruning, and retrieval</li><li><strong>Execution environment</strong>: sandbox, permissions, runtime isolation</li><li><strong>Evaluation/observability</strong>: testing, trace, logs, rewards, failure reports, regression tracking</li></ul><p>This structure determines how the model approaches tasks, invokes tools, handles failures, and judges completion.</p><p>For example, when a shell command hangs in production, the solution isn’t to keep adding “don’t use interactive commands” to the prompt. A more robust approach: add timeout to the shell tool, use middleware to detect high-risk commands, truncate long outputs at the response layer, and enforce state checks before task completion.</p><p>This is the essence of Harness Engineering: putting agent capabilities into a maintainable runtime system.</p><p>I won’t dive deeper into the Harness concept here. 
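A tiny sketch may still help make the “more robust approach” above concrete: below is a hypothetical shell tool wrapper with a timeout and output truncation (the function and constant names are illustrative, not the AHE repository’s actual implementation).</p><pre>import subprocess<br><br>MAX_OUTPUT_CHARS = 4000  # truncate long output before it pollutes the context<br><br>def run_shell(command: str, timeout_s: int = 60) -&gt; str:<br>    &quot;&quot;&quot;Hypothetical shell tool: non-interactive, time-limited, truncated output.&quot;&quot;&quot;<br>    try:<br>        result = subprocess.run(<br>            command, shell=True, capture_output=True, text=True, timeout=timeout_s<br>        )<br>        output = (result.stdout + result.stderr)[:MAX_OUTPUT_CHARS]<br>        return f&quot;exit_code={result.returncode}\n{output}&quot;<br>    except subprocess.TimeoutExpired:<br>        # Return an explicit failure reason instead of letting the agent hang<br>        return f&quot;ERROR: command timed out after {timeout_s}s; avoid interactive commands&quot;</pre><p>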
If you want to learn more, search for keywords like: Harness Engineering, Agent Harness, Agent Runtime, Tool-use Agent, Agent Observability, Agent Evaluation, Coding Agent Infrastructure.</p><p>Let’s move to the main focus of this article.</p><h3>Part 2: AHE’s Core Positioning — Self-Iterating Coding Agent Harnesses</h3><p>AHE stands for <strong>Agentic Harness Engineering</strong>.</p><p>The paper’s subtitle contains the key phrase: <strong>Observability-Driven Automatic Evolution of Coding-Agent Harnesses</strong>.</p><p>This breaks down into three layers:</p><p>First, AHE targets <strong>coding agent harnesses</strong>. It doesn’t train new models or modify base model parameters.</p><p>Second, it performs <strong>automatic evolution</strong>. The goal isn’t a one-time manual prompt tweak, but continuous harness evolution across multiple runs.</p><p>Third, it relies on <strong>observability</strong>. Changes come from traces, logs, rewards, failure analysis, change manifests — not from vague “self-reflection” in a prompt.</p><p>So AHE’s precise positioning is:</p><p><strong>An automatic evolution framework for coding agent harnesses. Through observable runtime evidence, it continuously improves the agent’s surrounding prompt, tools, middleware, memory, skills, and sub-agents.</strong></p><p>This is the key difference from ordinary prompt optimization. AHE does modify prompts, but its <strong>action space is much larger — it includes tools, middleware, and memory as evolvable structures</strong>.</p><h3>Part 3: AHE’s Experimental Results</h3><p>AHE’s main experiments ran on Terminal-Bench 2. The paper reports that after 10 iterations, AHE improved the seed harness’s pass @1 from <strong>69.7% to 77.0%</strong>. This shows that on the target benchmark, AHE found effective harness modifications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IhNvMxa-rVrdc_dO" /></figure><p>The ablation study is even more revealing. The paper replaced different components in full AHE back to the seed harness individually, with roughly these results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/908/0*2hpKsuRLvzJG2ufS" /></figure><p>This result is highly informative.</p><p>If gains mainly came from better system prompts, prompt-only should improve. But in the experiment, prompt-only actually decreased, while memory, tools, and middleware showed more significant improvements.</p><p>This means AHE’s key benefits come from structural harness modifications. It also suggests that in complex tasks, many agent failures require harder (more engineering-focused) mechanisms: tool behavior, runtime interception, state recording, long-term experience, regression testing.</p><p>The paper also conducted transfer experiments. When the evolved harness transferred to SWE-bench-verified, success rate gains were small, but token usage dropped more noticeably. This suggests AHE’s evolved structures may be better at reducing ineffective exploration and context waste.</p><p>Cross-model transfer is also noteworthy. When AHE-generated harnesses were applied to multiple base models, the paper reports positive gains across the board. This indicates the learned components contain some transferable engineering structures.</p><p>My assessment: AHE’s prediction of “which changes will fix problems” is significantly better than random, but its prediction of “which changes will cause regressions” is still relatively weak. 
It does prove that harnesses can be continuously evolved in a file-based, evidence-based, version-controlled manner.</p><h3>Part 4: AHE’s Key Workflow — Evaluate, Diagnose, Modify, Verify, Rollback</h3><p>AHE’s main loop:</p><pre>graph TD<br>    A[Current Harness] --&gt; B[Run Code Agent on benchmark]<br>    B --&gt; C[Collect trace, log, reward]<br>    C --&gt; D[Analyze failure patterns]<br>    D --&gt; E[Evolve Agent modifies Harness files]<br>    E --&gt; F[Write change_manifest]<br>    F --&gt; G[Re-evaluate next round]<br>    G --&gt; H[Verify if changes work, rollback if needed]<br>    H -.-&gt; A</pre><p>This closed loop has three main actors.</p><p>First is the <strong>Code Agent</strong>.</p><p>This is the actual agent completing coding tasks, and the object being optimized. In the AHE repository, the seed agent is quite simple — basically a bash-only coding agent.</p><p>Second is the <strong>Agent Debugger</strong>.</p><p>It reads the Code Agent’s execution traces and compresses massive traces into readable failure reports. After a benchmark run, raw traces can be extremely long, making direct model reading too costly. Agent Debugger converts these traces into overviews and per-task analyses, providing evidence for subsequent modifications.</p><p>Third is the <strong>Evolve Agent</strong>.</p><p>It reads the previous round’s results, failure analysis, and historical modification records, then modifies harness files in the workspace. Its modification targets include prompts, tools, middleware, memory, skills, sub-agent configs, etc.</p><p>AHE adds strong engineering constraints to this process:</p><p>Every modification must land in files. Every modification requires a manifest. The next round must verify predictions in the manifest. Poor results must be rollback-able. The entire process should leave an auditable evidence chain.</p><p>The self-reflection agent must answer more specific questions: which file was changed, why, which tasks are expected to be fixed, which tasks might be harmed, and whether the next round’s results validate this judgment.</p><h3>Part 5: What Evolvable Components Does AHE Break the Harness Into?</h3><p>AHE’s first step is breaking the harness into explicit components.</p><p>The paper emphasizes several evolvable object types:</p><p><strong>System Prompt</strong>: Defines the Code Agent’s basic behavior, like executing shell non-interactively, checking state before task completion, not exiting prematurely.</p><p><strong>Tool Descriptions</strong>: What the model sees about tools. The tool itself might not change, but if the description changes, so does how the model calls it.</p><p><strong>Tool Implementations</strong>: The actual tool implementation. For example, how the shell tool executes commands, handles timeouts, truncates output, returns error messages.</p><p><strong>Middleware</strong>: Runtime interception layer. It can check before/after tool calls, like detecting dangerous commands, reminding about unverified tasks, blocking premature endings, recording risk states.</p><p><strong>Skills</strong>: Reusable experience. Think of these as operation manuals for certain task patterns.</p><p><strong>Sub-agents</strong>: Sub-agent configurations. Complex tasks can be split to different roles.</p><p><strong>Long-term Memory</strong>: For accumulating experience across tasks and rounds.</p><p>This decomposition gives the Evolve Agent a richer action space. 
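To make the middleware idea tangible, here is a hedged sketch of what a pre-call check could look like (the hook name and return shape are hypothetical, not the repository’s actual interface); a small file like this is exactly the kind of component the Evolve Agent can rewrite.</p><pre>RISKY_PATTERNS = (&quot;rm -rf&quot;, &quot;git push --force&quot;, &quot;shutdown&quot;)<br><br>def before_tool_call(tool_name: str, arguments: dict) -&gt; dict:<br>    &quot;&quot;&quot;Hypothetical middleware hook: inspect a tool call before it runs.&quot;&quot;&quot;<br>    if tool_name == &quot;shell&quot;:<br>        command = arguments.get(&quot;command&quot;, &quot;&quot;)<br>        if any(pattern in command for pattern in RISKY_PATTERNS):<br>            # Refuse to execute and tell the model why, instead of failing silently<br>            return {&quot;allow&quot;: False, &quot;reason&quot;: f&quot;blocked risky command: {command}&quot;}<br>    return {&quot;allow&quot;: True}</pre><p>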
It can choose the right place to intervene based on failure evidence.</p><p>Example: Code Agent keeps hanging in shell. The least efficient approach is adding more prompt reminders. AHE’s path is more engineering-focused: add timeout to shell tool; middleware checks for obviously interactive commands; return messages explicitly state failure reasons; system prompt adds behavioral constraints.</p><p>These structural modifications are more stable and easier to reuse and rollback.</p><p>The key is understanding the positioning: <strong>prompts are behavioral suggestions; tools, middleware, and memory are execution mechanisms.</strong></p><p>AHE’s value lies in bringing these execution mechanisms into the evolution scope.</p><h3>Part 6: Three Layers of Observability — How AHE Avoids Blind Search</h3><p>Just having an agent randomly modify files and rerun benchmarks has limited value. AHE’s core design is three layers of observability.</p><h3>1. Component Observability</h3><p>Component observability means the system knows what parts the harness has, where each part is, how to modify it, and how to register it.</p><p>In the AHE repository, prompts, tool descriptions, tool implementations, middleware, memory, etc., all appear as files. New tools need YAML descriptions and Python implementations, plus config registration; new middleware needs explicit integration; new skills or sub-agents also need config exposure.</p><h3>2. Experience Observability</h3><p>Experience observability means after an agent runs, the system records how it succeeded or failed.</p><p>AHE collects each task’s trace, runtime log, reward, etc. Then Agent Debugger compresses these raw traces into analysis reports.</p><p>When a coding agent fails, simply knowing “it failed” isn’t very useful. What you really need to locate is the failure level: command execution failure, dependency installation failure, test not run, file path error, output too long causing context pollution, agent prematurely judging task complete, losing previous state in long tasks.</p><p>Through traces and analysis, AHE turns failures into readable, summarizable, actionable evidence.</p><h3>3. Decision Observability</h3><p>After each modification, the Evolve Agent must write a change_manifest.json. This manifest records which files were changed, what failure pattern they address, why this component was chosen, which tasks are expected to be fixed, which might regress, and the modification&#39;s constraint strength.</p><p>After the next evaluation round, the system checks this manifest to see if predictions came true.</p><p>This step turns every modification into a verifiable hypothesis. Even without using AHE’s full automatic evolution pipeline, just introducing the change manifest habit into your own agent team will immediately improve engineering transparency.</p><p>Many agent projects struggle with long-term maintenance precisely because of this: lots of prompt changes, lots of tool adjustments, but nobody knows what each change actually solved, and nobody knows if it introduced new problems. AHE’s manifest mechanism at least makes this process auditable.</p><h3>Part 7: AHE’s Engineering Organization from the Repository</h3><p>The main entry point for the AHE repository is evolve.py. 
It orchestrates the entire evolution workflow, including initializing workspace, running evaluations, handling iteration directories, doing attribution, recovery, and rollback.</p><p>The seed agent being evolved is agents/code_agent_simple/, which includes:</p><p>code_agent.yaml describes how this agent loads prompts, which tools it uses, what tracer to use.</p><p>systemprompt.md is the initial system prompt.</p><p>LongTermMEMORY.md and ShortTermMEMORY.md correspond to long-term and short-term memory interfaces. tool_descriptions/ holds tool descriptions, tools/ holds tool implementations.</p><p>The Evolve Agent is in agents/evolve_agent/. Key files worth examining:</p><p>evolve_agent.yaml defines what tools, middleware, and skills the Evolve Agent itself can use.</p><p>evolve_prompt.md is an evolution contract: it specifies that Evolve Agent can only modify workspace, must make evidence-based changes, must write summaries and manifests, must follow registration rules.</p><p>Config files are in configs/ and configs/experiments/. configs/base.yaml is the base config, configs/experiments/exp-simple-code-gpt54.yaml is a config overlay close to the paper experiments.</p><p>Launch scripts are in scripts/, like scripts/evolve.sh for starting long experiments, scripts/build_templates.py for building task templates for E2B.</p><p>If you just want to understand the project, you don’t need to read all files at once. I recommend this reading order:</p><pre>README<br>  ↓<br>agents/code_agent_simple/code_agent.yaml<br>  ↓<br>agents/code_agent_simple/systemprompt.md<br>  ↓<br>agents/evolve_agent/evolve_prompt.md<br>  ↓<br>configs/base.yaml<br>  ↓<br>configs/experiments/exp-simple-code-gpt54.yaml<br>  ↓<br>evolve.py</pre><p>This sequence helps you build concepts first, then see execution details.</p><h3>Part 8: Getting Started with the Repository — Run a Small Experiment First</h3><p>AHE is not a lightweight SDK. You can’t expect to pip install and immediately embed it in production systems.</p><p>It’s more like a research experiment framework. Running full paper-level experiments requires LLM API, E2B sandbox, SERPER API, benchmark data, concurrent scheduling, and considerable token costs.</p><p>So a more realistic onboarding approach is to run a minimal closed loop first.</p><p>Set the goal as: get AHE’s core pipeline running.</p><p>That is:</p><pre>graph LR<br>    A[Task execution] --&gt; B[Trace generation]<br>    B --&gt; C[Analysis generation]<br>    C --&gt; D[change_manifest written]<br>    D --&gt; E[Next round re-evaluation]<br>    E --&gt; F[change_evaluation&lt;br&gt;judges modification effect]</pre><p>Once this pipeline works, you understand AHE’s practical value.</p><h3>1. Clone the Repository</h3><p>Official repository:</p><pre>git clone https://github.com/china-qijizhifeng/agentic-harness-engineering.git<br>cd agentic-harness-engineering</pre><h3>2. Install Dependencies</h3><p>The project uses uv to manage Python dependencies.</p><pre>uv sync</pre><h3>3. Configure Environment Variables</h3><p>Copy the environment variable template:</p><pre>cp .env.example .env</pre><p>At minimum, pay attention to these variables:</p><pre>LLM_API_KEY<br>LLM_BASE_URL<br>E2B_API_KEY<br>SERPER_API_KEY<br>GITHUB_TOKEN</pre><p>Agent Debugger can also configure model endpoints separately. Refer to .env.example for specifics.</p><p>One important note: AHE’s task execution depends on E2B sandbox. Much code execution happens in isolated remote environments. 
This helps with security and reproducibility, but also means you need an E2B account and credits.</p><h3>4. Prepare Benchmark Task Templates</h3><p>The official workflow requires building task templates first. Example command:</p><pre>uv run python scripts/build_templates.py --dataset-dir /path/to/dataset -j 16</pre><p>Replace /path/to/dataset with your actual task data path.</p><p>If you’re just doing a small experiment, I don’t recommend preparing full Terminal-Bench 2 at the start. Select a few tasks and get the pipeline working first — that’s more important.</p><h3>5. Start with a Small Config</h3><p>For paper experiment config, refer to:</p><pre>configs/experiments/exp-simple-code-gpt54.yaml</pre><p>Running the full config is quite costly. Copy a small config, for example:</p><pre>cp configs/experiments/exp-simple-code-gpt54.yaml configs/experiments/exp-mini.yaml</pre><p>Then reduce the parameters:</p><pre>max_iterations: 2<br>harbor:<br>  k: 2<br>  n_concurrent: 4</pre><p>If the config supports specifying task subsets, use only 3 to 5 tasks. The point of a small experiment is validating the workflow, not chasing scores.</p><h3>6. Launch the Evolution Experiment</h3><p>You can use the script:</p><pre>./scripts/evolve.sh configs/experiments/exp-mini.yaml</pre><p>Or look inside the script to see how it calls evolve.py, then manually launch as needed.</p><p>Full experiments can run for a long time. Even small experiments require attention to API costs, E2B concurrency limits, and network stability.</p><h3>7. Look at Experiment Artifacts, Not Just Scores</h3><p>After running, don’t just look at pass rate.</p><p>What’s more worth examining are these artifacts:</p><pre>runs/iteration_*/<br>analysis/overview.md<br>analysis/detail/*.md<br>change_manifest.json<br>change_evaluation.json<br>agent/nexau_in_memory_tracer.cleaned.json<br>verifier/reward.txt</pre><p>After running, focus on observing and answering these questions:</p><ul><li>What patterns were this round’s failures attributed to?</li><li>Which files did Evolve Agent change?</li><li>Why did it choose to change these files?</li><li>Which tasks does the manifest predict will be fixed?</li><li>Did the next round verify this prediction?</li><li>Were there cases where fixing one task broke another?</li></ul><p>If you can find answers to all these questions in the artifacts, it means AHE’s core closed loop is working.</p><h3>Part 9: What AHE Hasn’t Solved Yet</h3><p>AHE is valuable, but its boundaries should be clear too.</p><p>First, it’s still a research framework. Full runs aren’t cheap, requiring benchmarks, sandboxes, LLM APIs, and fairly complex experiment configs.</p><p>Second, the effectiveness evidence in the paper needs more replication experiments. The improvement on Terminal-Bench 2 is clear, but for strong statistical conclusions, more seeds, more campaigns, and more confidence intervals are needed.</p><p>Third, its prediction of regression risk isn’t strong enough. The system is better at explaining what a modification might fix, but not as good at judging what it might harm. 
This is a hard problem for automatic evolution systems.</p><h3>Part 10: AHE’s Inspiration for Agent Product Teams</h3><p>AHE’s biggest inspiration for product-focused agent teams is pulling agent improvement processes from “mystical prompt tuning” back into the engineering world.</p><p>A real agent product will eventually face these questions:</p><ul><li>After a user reports an error, how do you reproduce it?</li><li>How do you aggregate failure causes?</li><li>Did a certain prompt modification actually help?</li><li>Did a tool change regress other scenarios?</li><li>Is there regression testing before release?</li><li>Can you rollback if production performance degrades?</li><li>How do you distill effective experience into memory or skills?</li></ul><p>No single model can solve these problems for you.</p><p>They belong to the scope of harness engineering work.</p><p>If you’re also building your own agent, this repository is worth thoroughly dissecting. Even without running it completely, you can learn a lot about harness organization, trace design, modification attribution, and regression verification engineering methods.</p><h3>References</h3><ul><li>Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses<br>arXiv: <a href="https://arxiv.org/abs/2604.25850">https://arxiv.org/abs/2604.25850</a></li><li>AHE Official Code Repository<br>GitHub: <a href="https://github.com/china-qijizhifeng/agentic-harness-engineering">https://github.com/china-qijizhifeng/agentic-harness-engineering</a></li><li>Harness engineering: leveraging Codex in an agent-first world<br>OpenAI Engineering Blog: <a href="https://openai.com/index/harness-engineering/">https://openai.com/index/harness-engineering/</a></li></ul><p>🙋‍<br><em>I’m Luhui Dev, a developer who has been breaking down Agent engineering and exploring how AI can be applied in education.<br>I focus on Agent Harness, LLM application engineering, AI for Math, and the productization of education SaaS.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a0736ae5594c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[DSPy Tutorial: Why Signatures Are Easier to Optimize Than Raw Prompts]]></title>
            <link>https://luhuidev.medium.com/dspy-tutorial-why-signatures-are-easier-to-optimize-than-raw-prompts-b22b9663f05d?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/b22b9663f05d</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Wed, 22 Apr 2026 10:09:19 GMT</pubDate>
            <atom:updated>2026-04-22T10:09:19.317Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FHC1fpp0n2ck5lR5" /></figure><p>One useful thing I ran into while building recently: <strong>DSPy</strong>.</p><p>While building Canviz’s content generation pipeline, I kept running into the same engineering problem: explanation quality and whiteboard-script reliability were both important, but it was hard to keep both stable with prompt text alone. As soon as I switched models or added a new grade level, I had to retune the string again. DSPy gave me a more systematic way to think about it.</p><p>If you are looking for a practical DSPy tutorial, a cleaner prompt engineering workflow, or a better way to optimize LLM pipelines, the core idea is simple: define the task as a Signature first, then let DSPy optimize the prompting layer around it.</p><h3>The Core Tension in Prompt Engineering</h3><p>Before talking about DSPy, it is worth stating one thing clearly: why is handwritten prompting an engineering problem rather than just a craft problem?</p><p>Traditional prompts have a structural flaw: <strong>they mix together “what the task is” and “how to tell the model to do it.”</strong></p><p>That one natural-language string is doing two jobs at once:</p><ol><li>It describes the task logic: what the inputs are and what outputs should come back.</li><li>It acts as a model-specific incantation tuned for the current model.</li></ol><p>Take a math-teaching example. The task logic of “explain a chicken-and-rabbit cage problem to a student” is stable. But the spell that works well for GPT may not be the one that works well for Claude Sonnet. Once you switch models, or move from grade 3 to grade 5, the spell may break. Worse, there is usually no systematic way to repair it beyond trial and error.</p><p>That is just hardcoding by another name. In normal software engineering, we already know not to freeze core logic into brittle literals. In LLM pipelines, though, we often lock core behavior into a fragile string.</p><p>DSPy’s author, Stanford researcher Omar Khattab, describes the problem like this:</p><blockquote>LM pipelines are often implemented as hard-coded prompt templates discovered by trial and error, and they are unusually brittle.</blockquote><h3>What Is DSPy, and What Is the Key Insight?</h3><p><strong>DSPy (Declarative Self-improving Python)</strong> is a framework open-sourced by Stanford NLP in 2023 and published at ICLR 2024. Its core claim is:</p><blockquote><strong><em>Programming language models, not prompting them.</em></strong></blockquote><p>The solution is elegant: <strong>separate the task interface from the concrete prompt implementation.</strong></p><p>You tell DSPy:</p><ul><li>what each step takes in and returns,</li><li>what the pipeline structure is,</li><li>and how success should be evaluated.</li></ul><p>Then DSPy’s compiler and optimizer search for a better prompt strategy for your chosen model, data, and metric.</p><p>The official analogy is useful here: it feels a bit like moving from assembly to a higher-level language, or from handwritten SQL to an ORM.</p><h3>Three Core Ideas That Explain DSPy</h3><h3>1. Signature: the task type signature</h3><p>A Signature is DSPy’s interface description. 
You specify what the step does, not how to word it:</p><pre>import dspy<br>class ExplainMathProblem(dspy.Signature):<br>    &quot;&quot;&quot;Explain a math problem to a student at a specified grade level.&quot;&quot;&quot;<br>    problem: str = dspy.InputField(desc=&quot;Original math problem&quot;)<br>    grade: int = dspy.InputField(desc=&quot;Student grade, e.g. 3 means third grade&quot;)<br>    explanation: str = dspy.OutputField(desc=&quot;A step-by-step explanation suitable for the grade&quot;)<br>    key_concept: str = dspy.OutputField(desc=&quot;The main concept tested by the problem&quot;)</pre><p>There is no handwritten prompt here. There is only <strong>interface semantics</strong>, not roleplay wording like “You are a patient and caring math teacher.”</p><h3>2. Module: composable functional blocks</h3><p>Modules are DSPy’s execution units, inspired by PyTorch’s nn.Module. You can compose them into a full pipeline:</p><pre>class MathLessonPipeline(dspy.Module):<br>    def __init__(self):<br>        super().__init__()<br>        # Step 1: explain the problem<br>        self.explain = dspy.ChainOfThought(ExplainMathProblem)<br>        # Step 2: generate a matching DinoGSP visualization script<br>        self.generate_diagram = dspy.Predict(<br>            &quot;problem, explanation -&gt; dinogsp_script: str&quot;<br>        )<br>        # Step 3: create a similar practice exercise<br>        self.make_exercise = dspy.Predict(<br>            &quot;problem, key_concept, grade -&gt; exercise: str, answer: str&quot;<br>        )<br><br>    def forward(self, problem, grade):<br>        # Explain<br>        step1 = self.explain(problem=problem, grade=grade)<br>        # Generate diagram<br>        step2 = self.generate_diagram(<br>            problem=problem,<br>            explanation=step1.explanation<br>        )<br>        # Create exercise<br>        step3 = self.make_exercise(<br>            problem=problem,<br>            key_concept=step1.key_concept,<br>            grade=grade<br>        )<br>        return dspy.Prediction(<br>            explanation=step1.explanation,<br>            dinogsp_script=step2.dinogsp_script,<br>            exercise=step3.exercise,<br>            answer=step3.answer<br>        )</pre><p>Across the whole three-step pipeline, no prompt string is manually written. What you write is the <strong>logic structure</strong>.</p><p>DSPy also ships several common reasoning strategies:</p><table><thead><tr><th>Module</th><th>Reasoning style</th><th>Example use in teaching</th></tr></thead><tbody><tr><td>dspy.Predict</td><td>direct prediction</td><td>difficulty grading, concept labeling</td></tr><tr><td>dspy.ChainOfThought</td><td>chain-of-thought</td><td>step-by-step explanations</td></tr><tr><td>dspy.ReAct</td><td>think-act loop</td><td>tool-based script validation</td></tr><tr><td>dspy.ProgramOfThought</td><td>programmatic reasoning</td><td>executable math code generation</td></tr></tbody></table><h3>3. 
Optimizer: the auto-tuning engine</h3><p>This is the most distinctive part of DSPy.</p><p>You provide:</p><ul><li>an evaluation dataset, such as 100 problems with human-annotated references,</li><li>and a metric function that decides whether the output is good enough.</li></ul><p>Then the optimizer searches for stronger prompt instructions and better few-shot examples:</p><pre># Define the metric: is the explanation age-appropriate, and does the script parse?<br>def lesson_quality_metric(example, prediction, trace=None):<br>    explanation_ok = len(prediction.explanation) &gt; 50  # minimum length<br>    script_parseable = validate_dinogsp(prediction.dinogsp_script)  # valid script<br>    grade_appropriate = check_vocabulary_level(<br>        prediction.explanation, example.grade<br>    )  # age-appropriate wording<br>    return explanation_ok and script_parseable and grade_appropriate</pre><pre># Optimize with MIPROv2<br>optimizer = dspy.MIPROv2(metric=lesson_quality_metric, auto=&quot;medium&quot;)<br>optimized_pipeline = optimizer.compile(<br>    MathLessonPipeline(),<br>    trainset=annotated_lessons<br>)<br># Save the result and load it directly in production later<br>optimized_pipeline.save(&quot;./optimized_math_lesson.json&quot;)</pre><p>A medium run costs time and money, but the return is a content-generation system tuned to a specific model, dataset, and metric rather than a lucky handwritten prompt.</p><h3>A Data Point Worth Looking At</h3><p>One official DSPy result that stuck with me is from HotPotQA, a multi-hop reasoning benchmark.</p><p>Using dspy.ReAct with a gpt mini series model:</p><ul><li>accuracy was around <strong>24%</strong> before optimization,</li><li>and reached around <strong>51%</strong> after MIPROv2 optimization on 500 examples.</li></ul><p>The important point is not that a more expensive model was used. It is that the same class of model became much better at the task through optimization.</p><h3>How It Differs from LangChain and LlamaIndex</h3><p>A reasonable question is whether DSPy matters if you already use LangChain.</p><p><strong>LangChain / LlamaIndex</strong> are orchestration frameworks. They are good at wiring together LLMs, vector stores, retrieval, and tool calls, but the prompts are still usually human-written strings. When the model changes, you often still have to go back and edit prompts by hand.</p><p><strong>DSPy</strong> is closer to a compiler framework for AI programs. It does not just connect components. It tries to take over prompt generation and optimization as well. 
The developer writes the logic; DSPy searches for a better natural-language realization of that logic for a given model.</p><p>The difference becomes obvious in a math-education pipeline:</p><ul><li>With LangChain, if you built a “third-grade explanation” flow and tomorrow need fifth-grade support, you usually revisit the prompt strings manually.</li><li>With DSPy, you are more likely to change the inputs, dataset, or evaluation target, then recompile and let the framework search again.</li></ul><p>If I had to compress it into one analogy: LangChain is an automation assembly line; DSPy is a higher-level language with a compiler.</p><h3>My Developer View: What It Solves, and What It Still Does Not</h3><h3>What DSPy genuinely solves</h3><ul><li><strong>Model migration pain</strong>: when moving from GPT-5.4 to a cheaper model, you can recompile instead of rewriting all prompts.</li><li><strong>Joint optimization across steps</strong>: explanation quality and diagram-script usability can be optimized together instead of separately.</li><li><strong>Experiment reproducibility</strong>: optimized results can be saved as JSON and shared across the team.</li></ul><h3>Where it is still hard</h3><ul><li><strong>Metrics are the hardest part</strong>: a function like validate_dinogsp() has to be designed carefully, or the optimizer will exploit loopholes.</li><li><strong>Optimization is not free</strong>: as datasets, model costs, and optimization rounds grow, the bill grows too.</li><li><strong>Debugging is still maturing</strong>: when the optimized pipeline is still not good enough, it is often hard to tell whether the bottleneck is the dataset, the metric, or the model itself.</li></ul><h3>When Should You Use DSPy?</h3><h3>Good fit</h3><ul><li>You are building a multi-step LLM pipeline.</li><li>You need to switch between different models.</li><li>You have an evaluation dataset and measurable quality targets.</li><li>You are tired of vibe-based prompt tuning.</li><li>You need something maintainable in production.</li></ul><h3>Probably not a good fit</h3><ul><li>You are only validating an idea quickly.</li><li>The task has no clear metric, so the optimizer has nothing reliable to optimize against.</li></ul><h3>Closing Thought</h3><p>What I like most about DSPy is not just that it can auto-optimize prompts. It pushes a more reliable engineering mindset:</p><p><strong>In an AI pipeline, prompts are closer to parameters than source code.</strong></p><p>Just as I would not hardcode neural-network weights into source files, I should not treat a prompt tuned for one model as the program logic itself. Those prompts are better treated as artifacts that can be learned, optimized, saved, and migrated.</p><p>The <strong>logic</strong> of teaching content is stable: step-by-step explanation, visual support, age-appropriate wording. But <strong>how to get a model to deliver that</strong> changes with model upgrades, new grades, and new problem types. 
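When the model changes, the compiled artifact can simply be reloaded, or recompiled against the new backend, rather than rewriting prompt strings by hand. A minimal sketch, assuming the MathLessonPipeline class from earlier and the JSON file saved above (the model name is only an example):</p><pre>import dspy<br><br># Point DSPy at whatever backend you are migrating to<br>dspy.configure(lm=dspy.LM(&quot;openai/gpt-4o-mini&quot;))<br><br># Rebuild the pipeline structure in code, then load the compiled prompts<br>pipeline = MathLessonPipeline()<br>pipeline.load(&quot;./optimized_math_lesson.json&quot;)<br><br>result = pipeline(problem=&quot;A cage holds chickens and rabbits: 35 heads, 94 legs.&quot;, grade=3)<br>print(result.explanation)</pre><p>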
DSPy separates those two layers, which is what makes an AI teaching system actually maintainable.</p><p>🙋‍♀️ <em>If you’re also working on AI education, feel free to connect.</em></p><h3>References</h3><ul><li>DSPy docs: <a href="https://dspy.ai/">dspy.ai</a></li><li>Paper: <a href="https://arxiv.org/abs/2310.03714">DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, ICLR 2024</a></li><li>GitHub: <a href="https://github.com/stanfordnlp/dspy">stanfordnlp/dspy</a></li><li>Optimizer guide: <a href="https://dspy.ai/learn/optimization/optimizers/">dspy.ai/learn/optimization/optimizers</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b22b9663f05d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Struggling with Research Figures? Here’s How Multi-Agent Collaboration Gets It Right]]></title>
            <link>https://luhuidev.medium.com/struggling-with-research-figures-heres-how-multi-agent-collaboration-gets-it-right-ea7bc2c8f608?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/ea7bc2c8f608</guid>
            <category><![CDATA[multi-agent-systems]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Sat, 11 Apr 2026 08:41:23 GMT</pubDate>
            <atom:updated>2026-04-11T08:41:23.327Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Problem Every Researcher Knows Too Well</h3><p>Anyone who’s done research knows this pain: creating a single figure from concept to completion can be more exhausting than writing the actual paper. You need logical structure, data precision, and style compliance — miss any one of these, and you’re back to the drawing board.</p><p>Single-model AI generation tools often produce beautiful images with broken logic, or logically sound diagrams that look terrible, or worst of all — figures where all the proportions are completely off.</p><p>PaperBanana solved this problem, and it works remarkably well. The key insight? <strong>Break the task into multiple roles and let an AI team collaborate.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kitg7fSP46xbS-Y0" /></figure><h3>Why Traditional AI Falls Short</h3><p>Many assume that throwing a large language model at the problem should work. But research figures aren’t ordinary illustrations — they need to <strong>accurately express logic</strong>, <strong>ensure data precision</strong>, and ultimately meet academic journal aesthetics.</p><p>A single model can’t nail all three at once. The result? Either gorgeous images with completely wrong logic, or logically correct diagrams that look like they’re from the ’90s, and almost always with numerical proportions that make no sense.</p><p>This is the core pain point of research figure generation, and exactly why solutions like PaperBanana emerged.</p><h3>PaperBanana’s Five-Role Collaboration</h3><p>PaperBanana’s design philosophy is simple: <strong>Split the generation task into five specialized roles, let each focus on what they do best, then collaborate iteratively.</strong></p><h3>The Visual Workflow</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e4qIyTV2mrtySdok.jpeg" /></figure><h3>1. Retriever — The Inspiration Board</h3><p>The Retriever searches through a curated reference database to find the most relevant examples.</p><p>It focuses on <strong>visual structure matching</strong>, ensuring that subsequent generation has reliable layout references to work from.</p><p>Think of it like a designer browsing templates before starting to sketch — that’s what the Retriever does.</p><h3>2. Planner — The Skeleton Designer</h3><p>The Planner is the core brain. It transforms paper descriptions and figure objectives into detailed figure plans, including:</p><ul><li>Figure components (nodes/modules)</li><li>Logical relationships and arrow directions between components</li><li>Spatial layout suggestions</li><li>Labels, annotations, etc.</li></ul><p>The Planner’s core job is to provide the skeleton, preventing the generation from going off the rails.</p><h3>3. Stylist — The Aesthetic Director</h3><p>With the skeleton in place, the Stylist handles the aesthetics.</p><p>It extracts colors, fonts, line weights, and shapes from reference examples, optimizing the Planner’s output to meet journal standards.</p><p>NeurIPS and Nature have different figure styles — the Stylist ensures generated figures comply with academic norms.</p><h3>4. 
Visualizer — The Executor</h3><p>The Visualizer generates figures based on the standardized plan:</p><ul><li><strong>Method figures</strong> → Rendered using high-quality image generation models</li><li><strong>Data charts</strong> → Outputs <strong>reproducible Matplotlib code</strong></li></ul><p>This means generated figures aren’t just pretty — they’re directly usable as research materials, reproducible and modifiable.</p><h3>5. Critic — The QA/Feedback Loop</h3><p>The Critic is key to closing the loop. It checks whether the figure faithfully reflects the text, whether it’s clear, and whether it meets style specifications.</p><p>If unsatisfied, it provides revision suggestions, prompting the Planner/Visualizer to iterate. Usually 2–3 rounds produce high-quality figures.</p><h3>Why Multi-Role Collaboration Works</h3><p>Compared to single-model end-to-end generation, PaperBanana has three major advantages:</p><ol><li><strong>Reference-driven</strong>: The Retriever provides structural and stylistic examples, making generation more reliable</li><li><strong>Clear division of labor</strong>: Logic, style, and rendering are separated, avoiding the chaos of black-box generation</li><li><strong>Closed-loop self-checking</strong>: Critic + iteration makes figure quality controllable</li></ol><p>In other words, this is a <strong>process innovation</strong> for AI-assisted research figure creation. In experiments, PaperBanana significantly outperformed baselines in fidelity, readability, and aesthetics.</p><p>If you’re interested in the design of this scenario, I’ve compiled <a href="https://luhuidev.com/zh-cn/essays/paperbanana-ai-academic-method-figure-collaboration">the complete Prompt set</a> — grab it below 👇</p><h3>Beyond Academic Figures</h3><p>This multi-role collaboration pattern isn’t limited to academic illustrations.</p><p>For flowcharts, experimental design diagrams, teaching demonstrations, automated data visualization, and even complex tasks like code generation and decision planning, multi-agent collaboration proves more reliable.</p><h3>References</h3><ul><li><a href="https://arxiv.org/abs/2601.23265">PaperBanana: Automating Academic Illustration for AI Scientists (arXiv)</a></li><li><a href="https://paper-banana.ai/">PaperBanana Official Site</a></li><li><a href="https://hyper.ai/en/papers/2601.23265">PaperBananaBench Dataset and Evaluation</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ea7bc2c8f608" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Dino-GSP Major Update: dynamic geometry demos, geometry embeds, and AI drawing upgrades]]></title>
            <link>https://luhuidev.medium.com/dino-gsp-major-update-dynamic-geometry-demos-geometry-embeds-and-ai-drawing-upgrades-f2c690d03161?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/f2c690d03161</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[dino-gsp]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Tue, 07 Apr 2026 12:34:34 GMT</pubDate>
            <atom:updated>2026-04-07T12:34:34.282Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>Dino-GSP 2.4.0 was released on March 23, 2026.</strong> This update is not just a list of extra features. It connects <strong>dynamic geometry demos, online geometry embeds, region area calculation, and AI geometry drawing</strong> into a more complete workflow.</p><p>If you are comparing <strong>dynamic geometry software, online geometry tools, math teaching tools, or interactive geometry platforms</strong> for lessons, content, or websites, this release deserves attention.</p><h3>Dino-GSP 2.4.0 at a glance</h3><p>This release focuses on four high-frequency needs:</p><ul><li><strong>Slider-based dynamic demos</strong> that make geometry figures actually move</li><li><strong>Geometry embed mode</strong> for blogs, course pages, and product sites</li><li><strong>Boolean region operations and area calculation</strong> for more complex analysis</li><li><strong>Broader AI geometry assistance</strong> that fits real creation workflows</li></ul><h3>1. Dynamic geometry demos upgraded: sliders are now a first-class feature</h3><p>The point of dynamic geometry is not just drawing figures. It is showing parameter changes, geometric relationships, and reasoning processes in motion. The latest Dino-GSP release fully rounds out slider support and makes it much closer to a real <strong>dynamic geometry software</strong> workflow for classrooms and content creation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lnlVWQSg2-Nhkpz_" /></figure><p>This upgrade includes:</p><ol><li><strong>Create and edit dynamic parameters</strong>: sliders can directly control lengths, angles, and point positions, with figures updating in real time.</li><li><strong>Text-linked values</strong>: slider values can be inserted into explanatory text so teaching copy updates together with the figure.</li><li><strong>Autoplay support</strong>: presentation and sharing modes support autoplay, speed adjustment, and looping for lessons and recorded demos.</li><li><strong>More complete exports</strong>: sliders can be exported to SVG and TikZ while preserving labels and control styles for papers, handouts, and blogs.</li></ol><p>This pushes Dino-GSP beyond a static geometry board and makes it more suitable for <strong>interactive geometry demos</strong>, classroom walkthroughs, and parameter-driven explanations.</p><h3>2. Geometry embed mode arrives: the online geometry tool can now live inside web pages</h3><p>For course builders, bloggers, and documentation teams, the ability to embed geometry into a page is a practical requirement. 
The latest Dino-GSP release adds a full <strong>geometry embed mode</strong>.</p><h3>2.1 Where this helps</h3><ul><li>Embedding interactive geometry into teaching blogs</li><li>Showing manipulable math demos inside online courses</li><li>Adding interactive diagrams to product sites or knowledge bases</li><li>Preserving parameter control and geometry state in shared pages</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VJXce4xGi0r1qj_s" /></figure><h3>2.2 What is included</h3><ol><li><strong>A complete embed architecture</strong>: dedicated routing, state synchronization, and communication bridging.</li><li><strong>iframe export</strong>: exportable iframe links with configurable aspect ratios for different layouts.</li><li><strong>REPL integration</strong>: embedded surfaces can load and edit geometry content, so the experience goes beyond passive viewing.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sWAIBzW3eyMVuf_n" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rCNyb_6ZSv9i7wLI" /></figure><h3>3. Region area calculation and boolean operations improved: analysis is more complete</h3><p>If you need to work with overlapping shapes, composite figures, or region logic, this release strengthens the analytical layer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YTL67zYALZ60iJz5" /></figure><p>The update includes:</p><ol><li><strong>Boolean path operations</strong>: intersection, union, and difference for more complex region construction.</li><li><strong>Region area calculation</strong>: direct area calculation plus contains checks.</li><li><strong>Precision fixes</strong>: better handling of boundary precision issues, negative radii, and undefined dependencies.</li></ol><p>This matters for:</p><ul><li>Solving geometry problems involving overlapping areas</li><li>Verifying region relationships in teaching contexts</li><li>Building composite paths for cleaner exports</li><li>Running more stable geometry computation workflows</li></ul><h3>4. Master management is now available: keep diagram styles consistent at scale</h3><p>If you produce many teaching diagrams or worksheet visuals, repeated style setup quickly becomes inefficient. The latest release adds <strong>master management</strong> to improve content production efficiency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V8_xrFMIyvDZEgsp" /></figure><p>You can now:</p><ol><li>Open the master panel directly from the editor tabs</li><li>Create, update, apply, and delete masters</li><li>Set default styles and preview them in real time</li></ol><p>For teachers, geometry creators, and worksheet teams, this improves batch production more than one-off drawing speed.</p><h3>5. AI geometry drawing keeps improving: a smarter geometry assistant</h3><p>Dino-GSP has been pushing AI toward an executable geometry assistant, not just a chat box. 
This AI update is part of that broader workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*shuC_TKa3z-VHqx6" /></figure><p>The main AI improvements include:</p><ol><li><strong>Usage and credit records</strong>: clearer tracking for AI costs and consumption.</li><li><strong>Image upload entry points</strong>: users can upload sketches or images and be routed to image-capable models.</li><li><strong>Better conversation tools</strong>: copy, reaction, and feedback support for a more stable interaction loop.</li><li><strong>Clearer instruction display</strong>: formatting, truncation, and expansion improve readability for complex prompts.</li><li><strong>Animation support</strong>: AI can help create geometry animations and assist with keyframes and motion paths.</li></ol><h3>6. Axes, grids, and algebra definitions continue to improve</h3><p>Beyond the larger features, this release also includes lower-level upgrades that affect daily use.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eCeForsa-RMSzAnB" /></figure><h3>6.1 Coordinate system and grid</h3><ul><li>Custom grid ranges are supported</li><li>Axis point selection can lock intelligently</li><li>pi and pi/2 spacing are supported</li><li>X and Y ranges, labels, and intervals are more configurable</li></ul><h3>6.2 Automatic algebra definition reordering</h3><ul><li>Object order is adjusted automatically when algebra definitions change</li><li>Circular dependency detection and error prompts are supported</li></ul><h3>7. More upgrades across drawing and sharing workflows</h3><h3>7.1 Geometry and drawing</h3><ul><li>New orthogonal drawing mode</li><li>Better ellipse arc editing</li><li>Added arrow styles</li><li>Dynamic anchor support for labels</li><li>Formula editor symbols better aligned with classroom math notation</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2uW8kXyc8a_aCVIH" /></figure><h3>7.2 Interaction and interface</h3><ul><li>Floating toolbar for union selection, color settings, and hover hints</li><li>More line and point styling options</li><li>Clearer property panel structure</li><li>Input width adjusts dynamically with expression count</li></ul><h3>7.3 Sharing and SEO</h3><ul><li>Community sharing can control whether AI chat records are public</li><li>Shared works can restrict saving and remixing</li><li>Shared pages support dynamic titles and descriptions</li></ul><p>This makes Dino-GSP better not just for drawing, but also for <strong>distribution, discoverability, and search visibility</strong>.</p><h3>8. 
Which day-to-day issues were fixed</h3><p>This release also fixes a large number of practical issues, including:</p><ul><li><strong>Region computation</strong>: negative area, path restoration, arc judgment, and precision flicker</li><li><strong>Sliders</strong>: style copying, step and speed defaults, snapping, previews, and history behavior</li><li><strong>Selection</strong>: deselect with Shift, incorrect select-all behavior, and function graph box selection</li><li><strong>Exports</strong>: inconsistencies across SVG, LaTeX, and Canvas, plus font embedding and clipping offsets</li><li><strong>Tool compatibility</strong>: grid snapping, compass and transform tool errors, file jumps, and copy/paste</li></ul><h3>Try Dino-GSP</h3><p>If you are comparing geometry software, math teaching tools, or embeddable dynamic geometry options, this version is now a much stronger reference point.</p><p>👉 <a href="https://dajiaoai.com/?utm_source=luhuidev">Try Dino-GSP now</a></p><h3>About Dino-GSP</h3><p>Dino-GSP is a tool for math teaching, geometry creation, and online sharing. It combines a geometry engine, AI assistance, and professional export capabilities into a more modern geometry workflow.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f2c690d03161" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Embed a Geometry Canvas in Your Webpage with One Line of Code]]></title>
            <link>https://luhuidev.medium.com/embed-a-geometry-canvas-in-your-webpage-with-one-line-of-code-c225edc7abcc?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/c225edc7abcc</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[geometry]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Tue, 17 Mar 2026 13:36:57 GMT</pubDate>
            <atom:updated>2026-03-17T13:37:13.544Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fSnu_ROvvoMDQx0B.png" /></figure><h3>Introduction</h3><p>Many products actually need <strong>geometry capabilities</strong>.</p><p>For example:</p><ul><li>Online education platforms need to display geometric shapes in courses</li><li>Question bank systems need to create diagrams for math problems</li><li>AI Tutors need to draw diagrams dynamically when explaining problems</li><li>Lesson plan and courseware tools need to generate mathematical graphics</li></ul><p>But here’s the problem:</p><p><strong>A geometry canvas is actually a very complex software system.</strong></p><p>If you develop it yourself, you’ll quickly find yourself dealing with a pile of problems:</p><ul><li>Geometric object management (points, lines, circles, angles, curves)</li><li>Intersection calculation and constraint computation</li><li>Graphics rendering and drag-and-drop interaction</li><li>Multi-canvas management</li><li>File format and sharing system</li></ul><p>All these capabilities combined basically constitute a complete product.</p><p>The final choice for many teams is either to use <strong>static images</strong> or integrate an <strong>existing geometry system</strong>.</p><p>Recently, we did something interesting: <strong>We turned a geometry canvas into a component that can be directly embedded in webpages.</strong></p><p>Developers only need one line of code to put a complete geometry canvas into their own products.</p><h3>A Geometry Canvas That Can Be Embedded in Webpages</h3><p>The Dino-GSP（大角几何）Open Platform provides an <strong>embeddable geometry canvas SDK</strong>.</p><p>Developers can embed the geometry canvas into their own web applications just like using a frontend component.</p><p>The core concept is actually quite simple:</p><pre>Your webpage<br>   ↓<br>Embed geometry canvas<br>   ↓<br>Gain complete geometry capabilities</pre><p>This means:</p><ul><li>No need to develop your own geometry engine</li><li>No need to implement geometry calculations yourself</li><li>No need to write complex interaction logic yourself</li></ul><p>Just embed it and use it.</p><p>In the official capability design, Dino-GSP（大角几何）aims to become “geometry capability infrastructure”: through SDK, API, REPL, and other methods, making geometry capabilities embeddable in more products and systems.</p><h3>The Simplest Way: Direct Embedding</h3><p>If you just want to display a geometric figure, the simplest method is <strong>iframe embedding</strong>.</p><p>For example:</p><pre>&lt;iframe src=&quot;https://dajiaoai.com/e/33TA3484&quot; width=&quot;800&quot; height=&quot;600&quot; allow=&quot;fullscreen&quot;&gt;&lt;/iframe&gt;</pre><p>This way you can directly embed a geometry canvas into a webpage.</p><p>Suitable scenarios include:</p><ul><li>Displaying geometric figures on teaching pages</li><li>Embedding mathematical graphics in blog articles</li><li>Showing dynamic figures in online textbooks</li></ul><p>No additional development work required.</p><h3>Developer Approach: Using the SDK</h3><p>If you want deeper control over the canvas, such as:</p><ul><li>Dynamically loading graphics</li><li>Switching canvases</li><li>Importing files</li><li>Calling geometry operations</li></ul><p>You can use the <strong>SDK integration approach</strong>.</p><p>First, install the SDK:</p><pre>npm install @dajiaoai/algeo-sdk</pre><p>Then create a canvas on the page:</p><pre>import { AlgeoSdk } from 
&#39;@dajiaoai/algeo-sdk&#39;</pre><pre>const container = document.getElementById(&#39;algeo-container&#39;)</pre><pre>const sdk = await AlgeoSdk.create(container, {<br>  initialId: &#39;33TA3484&#39;<br>})</pre><p>This creates a geometry canvas instance.</p><p>You can then operate it through the API, for example:</p><p>Load shared content:</p><pre>await sdk.loadShareById(&#39;33TA3484&#39;)</pre><p>Get canvas count:</p><pre>const { count } = await sdk.getSlideCount()</pre><p>Switch canvas:</p><pre>await sdk.switchSlide(2)</pre><p>Developers can use the geometry canvas as a <strong>programmable component</strong>.</p><h3>A Very Interesting Capability: REPL</h3><p>In addition to regular APIs, Dino-GSP（大角几何）also provides a <strong>REPL interface</strong>.</p><p>Simply put, it means using commands to directly control the geometry system.</p><p>For example:</p><ul><li>Define geometric objects</li><li>Query graphic states</li><li>Execute geometry operations</li></ul><p>The REPL output is in structured text format, making it convenient for AI or Agent systems to call.</p><p>This means that in the future, not only humans can operate the canvas, <strong>but AI can also directly call geometry capabilities.</strong></p><p>This is why we call it: <strong>AI-native geometry capability interface.</strong></p><h3>Which Products Is This Suitable For?</h3><p>The embeddable geometry canvas is actually suitable for many products.</p><h3>1. Online Education Platforms</h3><p>Directly embed geometric figures in course pages, supporting drag-and-drop and dynamic demonstrations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Wq9jtVU_KpDWUWn0" /></figure><h3>2. Question Bank Systems</h3><p>Automatically generate or load geometric figures for math problems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GWOkmAixzhdC7eO2" /></figure><h3>3. AI Tutors</h3><p>Draw diagrams dynamically when explaining geometry problems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rmow1BCMrFuLn98j" /></figure><h3>4. Math Content Platforms</h3><p>Directly embed geometric figures in articles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*h2wJO0y_RHenbCHA" /></figure><h3>5. Independent Developer Tools</h3><p>Quickly build a math tool without developing your own geometry engine.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MlUbeH2cDHKoFay9" /></figure><h3>Why We Built This Open Platform</h3><p>Over the past year, while working on the geometry system, I’ve had a deep realization: <strong>geometry capability is actually a fundamental capability for many products.</strong></p><p>But there aren’t many solutions available on the market currently — either complete software (like GeoGebra) or simple graphics libraries.</p><p>There’s a lack of a way <strong>to call geometry capabilities like an API.</strong></p><p>So what the Dino-GSP（大角几何）Open Platform hopes to do is enable more products to directly use geometry capabilities without having to reinvent the wheel.</p><p>👉 Dino-GSP（大角几何）Open Platform: <a href="https://open.dajiaoai.com/en/">open.dajiaoai.com</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c225edc7abcc" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AlphaGeometry2 Deep Dive: How Does Google AI Solve IMO Geometry Problems?]]></title>
            <link>https://luhuidev.medium.com/alphageometry2-deep-dive-how-does-google-ai-solve-imo-geometry-problems-2a8662f64014?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/2a8662f64014</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Fri, 06 Mar 2026 12:40:07 GMT</pubDate>
            <atom:updated>2026-03-06T12:40:07.800Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>Lately I have been building products around AI for Math, not the kind of tools that just let a large model explain solutions, but systems more focused on structured geometric expression, constraint relations, diagram structure, and reasoning capability.</p><p><strong>As you work on this long enough, one thing becomes obvious: it is not hard to make AI talk about math, but it is very hard to make AI actually do math.</strong></p><p><strong>Geometry is especially hard.</strong></p><p>I can ask a model to explain why alternate interior angles are equal, but if I ask it to find the key auxiliary line on its own inside a complex diagram, it starts making things up.</p><p>When you look back at Google’s path from that angle, you realize they started addressing this systematically much earlier.</p><p>AlphaGeometry (AG1) was the first generation.</p><p>AlphaGeometry2 (AG2) is the upgraded version.</p><p>If you look at the two generations together, they read like a very clear field report on what breaks inside a math AI system.</p><p>I wrote this essay because I want to stand on Google’s shoulders and see:</p><ul><li>where they got stuck</li><li>how they solved it</li><li>which parts are engineering problems</li><li>which parts are cognitive misunderstandings</li></ul><h3>1. First-Generation AlphaGeometry: What Did It Actually Solve?</h3><p>The core idea behind AG1 was actually pretty pragmatic: <strong>do not expect a large model to prove geometry problems by itself. Let the model do the “guessing,” let the symbolic system do the “reasoning,” and let search do the “finding.”</strong></p><p>Concretely:</p><ul><li>the LLM proposes auxiliary constructions</li><li>the DDAR symbolic engine handles angle and ratio reasoning</li><li>the search system traverses possible paths</li></ul><p>This architecture is highly rational. It admits that language models are not good at rigorous deduction, but they are good at pattern matching and generating constructions that might be useful. The actual proof work is handed over to the symbolic system.</p><p>The key component here is the <strong>DDAR reasoning engine</strong>.</p><p>DDAR stands for Deductive Database of Angle and Ratio.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2a8662f64014" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Google DeepMind Aletheia: A Deep Dive into a Fully Autonomous Math Research Agent]]></title>
            <link>https://luhuidev.medium.com/google-deepmind-aletheia-a-deep-dive-into-a-fully-autonomous-math-research-agent-ec36c258aa09?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec36c258aa09</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[google]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 16:27:53 GMT</pubDate>
            <atom:updated>2026-02-25T16:27:53.539Z</atom:updated>
            <content:encoded><![CDATA[<p>Google DeepMind Aletheia leads the IMO-ProofBench Advanced benchmark with an impressive <strong>~91.9%</strong> score.</p><p>It also significantly outperforms baseline systems on hard USAMO 2025 problems. On harder internal benchmarks, it surpasses earlier reasoning models as well; while gaps still remain, it is clearly ahead of prior baselines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vo_SH03lSDN1vCI0.png" /></figure><p>Recent discussion around Aletheia feels familiar.</p><p>Headlines say “AI mathematician,” and comments ask: “Is it going to replace mathematicians? Can it already do autonomous research?”</p><p>I carefully reviewed the Aletheia paper and dataset, then organized the key architecture and practical implications I learned. That is exactly what this essay is about.</p><h3>1) The Road to DeepMind Aletheia</h3><p>Looking at the timeline, it is clear that Google DeepMind has been building toward this for a long time.</p><p>When <strong>AlphaGo</strong> appeared in 2016, the underlying question was already there: <strong>How do you optimize decision trajectories inside a system with complete rules and a clear evaluation function?</strong></p><p>A board game is discrete, outcomes are decidable, and the search space is huge but structured. That is an ideal environment for strategy optimization.</p><p>DeepMind’s “neural networks + search” was never just about Go. It tested a broader hypothesis: if a problem can be strictly described and each step can be judged as correct or incorrect, “talent” can be partially replaced by computation.</p><p>With <strong>AlphaGeometry</strong> in 2024, the question shifted: <strong>Can mathematical reasoning also be placed inside a rule-closed system like this?</strong></p><p>AlphaGeometry’s key design:</p><ul><li>LLM proposes auxiliary construction candidates</li><li>A symbolic geometry system verifies constraints</li><li>Search handles backtracking and expansion</li></ul><p>For the first time in this context, the LLM does not decide truth; it proposes possibilities, and structural systems guarantee logical validity.</p><p>That transition matters: Google had begun to place math reasoning inside a verifiable loop.</p><p>In late 2024, <strong>AlphaProof</strong> moved the battlefield into formal systems such as Lean. 
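</p><p>If you have never used a proof assistant, this is what “machine-checkable” looks like at its most minimal. The snippet below is purely illustrative Lean 4 and has nothing to do with AlphaProof itself; the point is that the checker either accepts it or rejects it, with no room for vague language:</p><pre>-- Illustrative only: a trivial statement the Lean 4 checker accepts.<br>example (a b : Nat) : a + b = b + a := Nat.add_comm a b</pre><p>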
The question became: <strong>If geometry can be structured, can all of mathematics be formalized to machine level?</strong></p><p>By entering Lean-like systems, AlphaProof sharply narrows expression:</p><ul><li>every step must be machine-checkable,</li><li>the type system imposes strong constraints,</li><li>vague language no longer works,</li><li>proofs are not “plausible”; they must pass verification.</li></ul><p>It also adds reinforcement learning to optimize strategy paths, so the system does not just write proofs; it learns tactic selection, goal decomposition, and branch value estimation.</p><p>From this point on, DeepMind’s direction becomes explicit: turn mathematical behavior into a schedulable search problem.</p><p><strong>Aletheia is the continuation of that path, and currently its strongest result.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LWE0SN2psKjHAKfr.png" /></figure><h3>2) What Is Actually Worth Discussing About Aletheia</h3><p>Saying “it can autonomously propose and prove conjectures” is still too shallow.</p><p>The hardest core of Aletheia has three parts: <strong>closed loop, structure, and scheduling</strong>.</p><p>If that loop runs stably, mathematical research truly begins to move beyond purely human time scales.</p><h3>2.1 It really built a runnable research loop</h3><p>Most “math AI” systems are still input problem -&gt; output answer.</p><p>Aletheia looks more like a laboratory pipeline. A minimal loop looks like this:</p><ul><li><strong>Conjecture proposal</strong>: generate statements from existing theory, failed paths, or structural patterns</li><li><strong>Proof attempt</strong>: generate drafts, choose lemmas, decompose goals</li><li><strong>Formal verification</strong>: send to a proof assistant; accepted results are stored, failures return hard errors</li><li><strong>Error-driven repair</strong>: rollback, add lemmas, change decompositions, rewrite conjectures</li><li><strong>Knowledge/strategy update</strong>: feed newly proved theorems/lemmas back into the system for the next round of generation/search</li></ul><p>The key is that failure is not “weak quality”; it is a <strong>hard error signal</strong>. 
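</p><p>As a structural sketch, the loop fits in a few lines of Python-style pseudocode. Every function name below is a hypothetical stand-in rather than an actual Aletheia interface; the shape of the loop, not the names, is the point:</p><pre>def research_loop(knowledge, rounds=10, repairs=3):<br>    for _ in range(rounds):<br>        conjecture = propose_conjecture(knowledge)    # LLM proposes a candidate statement<br>        proof = attempt_proof(conjecture, knowledge)  # draft proof, lemma choices, goal decomposition<br>        for _ in range(repairs):<br>            result = formally_verify(proof)           # proof assistant: accept, or return a hard error<br>            if result.ok:<br>                knowledge.add(conjecture, proof)      # accepted results feed later rounds<br>                break<br>            proof = repair(proof, result.error)       # rollback, add lemmas, change the decomposition<br>    return knowledge</pre><p>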
That makes the feedback loop engineering-grade.</p><p>You can think of it as LLM for creative candidate generation, while formal systems decide whether anything actually hits the target.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wDDIzx_SB8xO98XY.png" /></figure><h3>2.2 The core is not the model; it is IR and verification interfaces</h3><p>When people see a new SOTA math result, the default reaction is often “bigger and stronger model again.”</p><p><strong>But in formal math, system ceilings are usually set less by parameter count and more by representation: how do you encode a theorem or proof state, map “ideas” into checkable syntax trees, and turn proof-assistant feedback into learnable signals?</strong></p><p>In that sense, Aletheia is closer to a “math compiler + debugger + search engine” stack.</p><p>This requires a heavy middle layer:</p><ul><li><strong>Theorem Graph / Lemma Graph</strong>: dependency graph across results</li><li><strong>Goal-state representation</strong>: structured encoding of current proof state (goal, hypotheses, type constraints)</li><li><strong>Tactic/step representation</strong>: executable action space for proving (similar to AlphaProof action spaces)</li></ul><p>Without this, even a strong model can only produce “essay-style proofs” that do not land in formal systems.</p><h3>2.3 Why the engineering meaning is bigger than the score</h3><p>Scores are outcomes. Engineering value is reusable method.</p><p>If Aletheia really has those three layers, then:</p><ul><li>mathematical research can be decomposed into an <strong>action space + feedback + policy optimization</strong> paradigm,</li><li>formal systems move <strong>correctness</strong> from human review to machine adjudication,</li><li>LLMs move from judge to candidate generator, reducing hallucination blast radius.</li></ul><p>The value of this route is that it turns “research” from an abstract human process into something software systems can implement.</p><p>Put differently: <strong>it gives research a CI/CD-like pipeline flavor — propose, verify, fail, repair, merge.</strong></p><h3>3) What Happens After Research Behavior Is Engineered?</h3><p>One long-standing bottleneck in mathematics is verification cost.</p><p><strong>Complex proofs can take months or years for peer confirmation. Human review time is scarce.</strong></p><p>Formal systems shift correctness from human judgment to machine judgment. Once verification is no longer the bottleneck, generation speed becomes the main variable.</p><p>Imagine a system that expands theorem graphs daily, outputs large volumes of intermediate lemmas, and continuously reorganizes dependencies.</p><p>It may not solve grand open problems immediately, but it will keep filling theoretical space.</p><p>What changes under scaled research output?</p><p>My guess: first, rhythm.</p><p>The field’s rhythm has long been constrained by human verification throughput. If verification is machine-managed, <strong>the speed of theoretical expansion will rise sharply</strong>. 
Then the scarce resource is no longer proving ability, but problem selection and theory organization.</p><p><strong>When proposition-generation speed exceeds human reading speed, disciplinary rhythm breaks.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YflaoxZ3OJhimNxy.png" /></figure><h3>Final Notes</h3><p>If you are also working in education + AI, I am fairly confident about this: purely text-based “solution explanation” AI products will become harder to sustain.</p><p>As formal systems integrate and verification gets standardized, products that only explain steps will be pushed to the margins.</p><p>Future defensibility will likely require three things:</p><ol><li><strong>A structural intermediate layer</strong>: not just text output, but executable objects</li><li><strong>Built-in verification</strong>: machine checking as a default capability</li><li><strong>Exploration mode support</strong>: let learners propose conjectures, test hypotheses, and observe failure feedback</li></ol><p>Teaching systems will increasingly resemble small theorem environments, not chat-only bots.</p><p>This path is not easy. It likely requires at least DSL or formal representation capabilities, plus executable constraint systems and interfaces to provers/verification engines.</p><p>But if the Aletheia direction continues, this is likely a long-term trend.</p><h3>Further Reading</h3><ul><li>Google DeepMind. <a href="https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think"><strong>Accelerating Mathematical and Scientific Discovery with Gemini Deep Think.</strong></a> Official Blog Post, 2026.</li><li>Google DeepMind. <a href="https://deepmind.google/blog/alphageometry-an-olympiad-level-ai-system-for-geometry"><strong>AlphaGeometry: An Olympiad-Level AI System for Geometry.</strong></a> Official Blog Post, 2024.</li><li>Google DeepMind. <a href="https://arxiv.org/abs/2602.10177"><strong>Towards Autonomous Mathematical Research.</strong></a> arXiv preprint, 2026.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ec36c258aa09" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[HKU CodePlot-CoT Deep Dive: Visual Reasoning or Geometric Reasoning?]]></title>
            <link>https://luhuidev.medium.com/hku-codeplot-cot-deep-dive-visual-reasoning-or-geometric-reasoning-92e1a61be8cd?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/92e1a61be8cd</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Wed, 18 Feb 2026 02:01:01 GMT</pubDate>
            <atom:updated>2026-02-18T02:01:01.305Z</atom:updated>
            <content:encoded><![CDATA[<h3>Preface</h3><p>In my previous analysis of <strong>MathCanvas</strong>, I argued:</p><blockquote><em>The instability of LLMs in geometry is not because they cannot see the diagram — <br>it’s because they lack a stable intermediate structure to operate on.</em></blockquote><p>Some recent work tries to solve this by making models <em>draw before thinking</em>.</p><p>MathCanvas lets the model generate internal sketches and reason over them.</p><p>After publishing that article, a reader asked:</p><blockquote><em>If visual intermediate states matter so much, why not let the model actually draw the diagram?</em></blockquote><p>It turns out someone already tried that.</p><p>HKU’s <strong>CodePlot-CoT</strong> does exactly this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sZ1Ao2_mViMp2Irscr_fzw.png" /></figure><p>Instead of imagining auxiliary lines, the model writes Python matplotlib code, renders the diagram, then continues solving.</p><p>Sounds reasonable:<br>if visual reasoning is unstable, give the model an executable world.</p><p>But this raises a deeper question:</p><blockquote><em>When the model writes plotting code, is it doing geometry — or merely testing a numerical example?</em></blockquote><p>To answer that, we need to understand what problem the paper is really addressing.</p><h3>What Problem CodePlot-CoT Actually Solves</h3><p>CodePlot-CoT targets a fundamental phenomenon:</p><blockquote><em>Multimodal models have unstable spatial working memory in math problems.</em></blockquote><p>Concretely, the model may understand the question and even produce a correct reasoning chain — <br>but once the reasoning depends on diagram state, it drifts.</p><p>Typical failures:</p><ul><li>Auxiliary lines change across steps</li><li>Spatial relations are forgotten</li><li>Later reasoning depends on structures that never existed</li></ul><p>MathCanvas responds by creating an <strong>internal visual CoT</strong>.</p><p>CodePlot-CoT chooses a different path:</p><blockquote><em>Instead of imagining a diagram, let the model manipulate a real one.</em></blockquote><p>The diagram is outsourced to Python.</p><h3>The Core Technique: The Model Writes Matplotlib</h3><p>In the paper, when the model reasons:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UfYGEA4cKuTwUPBL8SzeQg.png" /></figure><blockquote><em>“Connect C and D”</em></blockquote><p>it doesn’t continue in natural language.</p><p>It outputs:</p><pre>ax.plot([C[0], D[0]], [C[1], D[1]])</pre><p>The pipeline becomes:</p><blockquote><em>text reasoning → generate plotting code → render image → feed back → continue reasoning</em></blockquote><p>The model’s intermediate thoughts no longer live in tokens or latent space — they live in an executable environment.</p><p>This brings immediate benefits:</p><h3>1. Stable spatial state</h3><p>The model relies on the environment instead of memory (agent-style tool use)</p><h3>2. Visual consistency</h3><p>Typical multimodal “diagram drift” disappears</p><h3>3. 
Scalable supervision</h3><p>The paper constructs the Math-VR dataset (178K problems) where<br>diagram → code → reasoning becomes supervision signal</p><p>This is a classic computer-vision approach:<br>don’t make the model imagine the world — give it a world.</p><p>So far, the idea is elegant.</p><p>But it has a very direct limitation.</p><h3>Does the Model Actually Understand Geometry?</h3><p>Consider the code:</p><pre>ax.plot([C[0], D[0]], [C[1], D[1]])</pre><p>This means: draw a line segment in coordinates.</p><p>But in geometry, “connect CD” is not drawing — it is a <strong>construction</strong>.</p><p>CD could be:</p><ul><li>a chord</li><li>an angle bisector</li><li>a perpendicular</li><li>a radical axis</li><li>a locus constraint</li></ul><p>These are the sources of reasoning.</p><p>Matplotlib describes appearance.<br>Geometry requires relational constraints.</p><p>So the intermediate structure is still:</p><blockquote><em>looks correct, not logically necessary</em></blockquote><h3>Loss of Causality</h3><p>Geometry depends on <em>why</em>, not <em>whether</em>.</p><p>Construct angle bisector → angles equal<br>This is logical implication.</p><p>But in a rendered diagram it becomes:</p><p>measured angles ≈ equal</p><p>These are fundamentally different:</p><table><tr><th>Type</th><th>Nature</th></tr><tr><td>Geometric construction</td><td>Necessary</td></tr><tr><td>Numerical instance</td><td>Accidental</td></tr></table><p>CodePlot-CoT reasoning is:</p><blockquote><em>generate a coordinate instance → inspect → conclude</em></blockquote><p>Mathematically this is single-instance verification.</p><p>Geometry theorems require validity across all configurations.</p><h3>Visual Verification vs Mathematical Proof</h3><p>We can abstract two paradigms:</p><table><tr><th></th><th>CodePlot-CoT</th><th>Geometry</th></tr><tr><td>Evidence</td><td>Looks correct</td><td>Must be correct</td></tr><tr><td>Method</td><td>Experiment</td><td>Deduction</td></tr><tr><td>Nature</td><td>Empirical reasoning</td><td>Formal reasoning</td></tr></table><p>CodePlot-CoT is essentially a <strong>geometry experiment AI</strong>, not a proof AI.</p><p>It answers:<br>“Does this diagram support the conclusion?”</p><p>Not:<br>“Is the conclusion derivable in the system?”</p><h3>Why It Can’t Reach AlphaGeometry</h3><p>HKU gives an executable diagram.<br>Google <strong>AlphaGeometry</strong> gives a derivable proof.</p><p>Between them lies a missing layer:</p><blockquote><em>geometric objects themselves</em></blockquote><p>Questions that matter:</p><ul><li>Is a point free or constructed?</li><li>Is a line a bisector or arbitrary?</li><li>Is a circle defined by three points or sampled?</li></ul><p>This is not vision — it is mathematical modeling.</p><p>We can place approaches on an axis:</p><blockquote><em>image → rendering code → geometric objects → logical proof</em></blockquote><p>CodePlot-CoT stops at layer 2<br>AlphaGeometry operates at layer 4</p><p>Humans mostly work at layer 3: construction.</p><p>We don’t start with proofs, and we don’t just look at pictures.</p><p>We manipulate objects:</p><p>draw perpendiculars<br>take intersections<br>construct circles<br>create constraints</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rUN3rAbJFXpoylkeUNFdlA.jpeg" /></figure><h3>Closing Thoughts</h3><p>In my own project <a href="https://dajiaoai.com/?inviter=685b4c2009d5753784ebe7df"><strong>Dino-GSP</strong></a>, I’m trying to isolate exactly this layer.</p><p>Not drawing, not proving — operating on geometric objects.</p><p>The model outputs neither:</p><pre>ax.plot(...)</pre><p>nor:</p><pre>Therefore AB ⟂ CD</pre><p>Instead:</p><pre>PerpLine(&lt;Line&gt;, 
&lt;Point&gt;)<br>Intersection(Circle(O,2), Line(2,3))</pre><p>Once the intermediate structure becomes constraints:</p><ul><li>diagrams become stable</li><li>relations become verifiable</li><li>reasoning becomes chainable</li></ul><p>CodePlot-CoT matters not because it solves geometry — <br>but because it proves visual reasoning needs an external workspace.</p><p>LLMs don’t fail math because they can’t reason.</p><p>They fail because they have nowhere to build a mathematical world.</p><p>CodePlot-CoT provides a workspace.<br>It’s just not yet a mathematical one.</p><p>Geometry likely requires a manipulable geometric language.</p><h3>References</h3><p>CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images<br><a href="https://arxiv.org/abs/2510.11718">https://arxiv.org/abs/2510.11718</a></p><p>AlphaGeometry: An Olympiad-level AI system for geometry<br><a href="https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/">https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=92e1a61be8cd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A New Direction for AI for Math: What MathCanvas Actually Changes]]></title>
            <link>https://luhuidev.medium.com/a-new-direction-for-ai-for-math-what-mathcanvas-actually-changes-71053c9c874c?source=rss-65692e5d4fe5------2</link>
            <guid isPermaLink="false">https://medium.com/p/71053c9c874c</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Luhui Dev]]></dc:creator>
            <pubDate>Fri, 13 Feb 2026 08:59:48 GMT</pubDate>
            <atom:updated>2026-02-13T08:59:48.583Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>Over the past two years, progress in AI math solving has been striking.</p><p>GSM8K is nearly saturated.<br>Algebra problems are stable.<br>Competition benchmarks keep getting refreshed.</p><p>So a natural assumption emerged:</p><blockquote><em>Larger models + longer chains of thought → better math ability.</em></blockquote><p>But working on a geometry product for a long time reveals something very different.</p><p><strong>Large models are still highly unstable on geometric construction problems.</strong></p><p>In a simple isosceles triangle proof, a model may produce a perfectly standard reasoning outline — yet forget to construct the auxiliary line that the logic depends on.</p><p>The failure is not numerical.<br>It is structural.</p><ul><li>The angle bisector should be constructed — but isn’t</li><li>The perpendicular should be dropped — but is missing</li><li>The diagram looks plausible but violates constraints</li><li>The generated figure cannot support the reasoning</li></ul><p>Geometry has never primarily been about language understanding.</p><p>It is about <strong>constructing structure</strong>.</p><p>After reading <em>MathCanvas</em>, this became very clear to me:</p><blockquote><em>The breakthrough for AI geometry lies in the intermediate structure.</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aFm9RKn-FQzF7J-0hYO64A.png" /></figure><p>MathCanvas is not a visual enhancement paper.<br>It incorporates <strong>construction behavior itself into the reasoning process</strong>.</p><p>This article does three things:</p><ol><li>Deconstruct the architecture and training logic of MathCanvas</li><li>Analyze its real technical contribution</li><li>Explain why I believe this points to the future</li></ol><h3>The Core Idea of MathCanvas: Turning Drawing into a Reasoning Operator</h3><p>MathCanvas does not propose another Visual Chain-of-Thought.</p><p>It proposes:</p><blockquote><strong><em>Intrinsic Visual Chain-of-Thought</em></strong></blockquote><p>The key word is <em>intrinsic</em>.</p><p>Traditional multimodal reasoning works like this:</p><blockquote><em>read image → explain → answer</em></blockquote><p>The image is only an input feature.</p><p>MathCanvas changes something fundamental:</p><p>The model can actively generate diagrams during reasoning, and those diagrams become conditions for subsequent reasoning steps.</p><p>After each reasoning segment, the model must decide whether to draw.</p><p><strong>Drawing becomes a learnable strategic action.</strong></p><p>This matters because geometric reasoning actually works like this:</p><blockquote><em>write a few steps → get stuck → add an auxiliary line → reasoning resumes</em></blockquote><p>MathCanvas models this behavior inside the inference process.</p><h3>Technical Breakdown</h3><p>The real contribution of the paper lies in the training pipeline.</p><h3>Stage I — Visual Manipulation</h3><p><strong>Goal: teach the model how to construct</strong></p><p>They train on:</p><ul><li>10M caption → diagram pairs</li><li>5.2M step-by-step editing trajectories</li></ul><p>The key is not producing a final diagram — <br>it is learning <em>incremental edits</em>.</p><p>The model learns to:</p><ul><li>build primitives</li><li>add auxiliary lines</li><li>modify geometry</li><li>maintain consistency</li></ul><p>During this phase, the reasoning pathway is frozen.<br>Only the generation ability is trained.</p><p>This avoids damaging existing reasoning 
capability.</p><h3>Stage II — Strategic Visual-Aided Reasoning</h3><p>This is the core stage.</p><p>After each text generation segment, the model predicts whether to emit a &lt;vision_start&gt; token. Drawing becomes part of next-token prediction.</p><p>The model learns:</p><ul><li>when to construct</li><li>what to construct</li><li>how reasoning changes after construction</li></ul><p>Loss = text CE + image flow loss</p><p>The diagram becomes an intermediate state in the reasoning chain.</p><h3>The Real Moat: Data Design</h3><p>MathCanvas builds a geometry primitive + relation system:</p><ul><li>geometric objects</li><li>constructive relations (bisector, perpendicular, circumcenter, tangent, parallel…)</li></ul><p>Large-scale editing trajectories are synthesized automatically.</p><p>This is effectively an <strong>implicit geometry DSL</strong>, rendered as images for training.</p><p>Ablations show removing edit trajectories significantly hurts performance.</p><p>Which suggests the model learns not just geometry — <br>it learns the <em>rhythm of construction</em>.</p><h3>What Actually Matters in This Paper</h3><h3>1. Structure operations enter the reasoning chain</h3><p>This is a shift from language CoT → structural CoT.</p><h3>2. Intermediate visual states stabilize reasoning</h3><p>Performance improves across planar and spatial geometry tasks.</p><p>The model is not just explaining better — it is reasoning differently.</p><h3>3. Scalable synthetic data generation</h3><p>Primitive + relation generation is more valuable than manual annotation.</p><h3>4. Limitations remain</h3><ul><li>constraints are implicit in pixels</li><li>correctness cannot be formally verified</li><li>structures cannot be exported</li><li>editing is not persistent</li></ul><h3>Why This Points to the Future</h3><p>After reading MathCanvas, my conclusion strengthened:</p><blockquote><em>The future of AI geometry lies in operable intermediate representations.</em></blockquote><p>Visual state is one form.<br>Constraint graphs are another.<br>DSLs are another.</p><p>The format does not matter.</p><p>What matters is whether the intermediate state is:</p><ul><li>constructible</li><li>persistent</li><li>monitorable</li><li>reusable</li></ul><h3>Where Dino-GSP Fits</h3><p>The system I’m building — <strong>Dino-GSP (Dynamic Geometry System)</strong> — follows this pipeline:</p><blockquote><em>natural language → geometry DSL → constraint graph → rendering → continuous editing</em></blockquote><p>Every object has dependency relations.<br>Every construction step is traceable.<br>Edits update constraints.<br>Structures can be verified and exported.</p><p>It is a white-box geometry system.</p><p>The alignment with MathCanvas is clear:</p><table><tr><th>System</th><th>Intermediate State</th></tr><tr><td>MathCanvas</td><td>visual intermediate state</td></tr><tr><td>Dino-GSP</td><td>executable constraint state</td></tr></table><p>If MathCanvas proves construction behavior improves reasoning,</p><p>Dino-GSP attempts to turn construction into a computable system.</p><p>You can use it at <em>dajiaoai.com</em>.</p><h3>The Three-Layer Future Architecture</h3><p>I increasingly believe future systems will contain three layers:</p><ol><li><strong>Language planning layer</strong> — thinking</li><li><strong>Structural construction layer</strong> — constraint</li><li><strong>Visual feedback layer</strong> — perception</li></ol><p>Only when these connect does AI truly gain geometric reasoning ability.</p><p>MathCanvas is not a benchmark paper.<br>It is a direction paper.</p><p>And one conclusion feels unavoidable:</p><blockquote><em>If the 
intermediate state is only an image, the system cannot become engineering-grade.</em></blockquote><p>The next stage of AI for Math is not smarter models.</p><p>It is models that can construct.</p><p>And in geometry — construction determines everything.</p><h3>References</h3><ul><li>MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning<br><a href="https://arxiv.org/abs/2510.14958">https://arxiv.org/abs/2510.14958</a></li><li>AlphaGeometry: Solving Olympiad Geometry without Human Demonstrations<br><a href="https://arxiv.org/abs/2401.05492">https://arxiv.org/abs/2401.05492</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=71053c9c874c" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>