Stories by Takuma Yamaguchi (Kumon) on Medium

Beyond Titans: What Has Changed Since the MIRAS Framework

Takuma Yamaguchi (Kumon) — Sat, 09 May 2026 09:37:23 GMT

Notes on adaptive weights, long context, and what the 2025–2026 papers actually say

My previous post, From Transformers to Titans: A Look at the MIRAS Framework, covered how Google’s Titans architecture and the MIRAS framework changed how we think about memory in neural networks. Instead of treating memory as a fixed storage state, these systems treat it as a continuous optimization problem — updating model weights during inference as new input arrives.

Since then, the field has moved quickly. At NeurIPS 2025 and ICLR 2026, several papers extended this idea in different directions. This post covers the most important ones.

The common thread across all of them is simple: models no longer have to keep their weights frozen at inference time. How far you can take that idea — and where it breaks — is what each paper explores differently.

1. TNT: Training Deep Memory Models Without Killing Your GPU

Paper: TNT: Improving Chunkwise Training for Test-Time Memorization (ICLR 2026)

Titans and similar models update their weights during inference using small gradient descent steps. This is powerful, but it creates a problem at training time: the sequential nature of weight updates makes parallelization very difficult. Hardware utilization is low, and training is slow.
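To make the inference-time update concrete, here is a minimal sketch. It assumes a plain linear memory matrix and a squared-error recall loss; Titans itself uses a deeper memory module with extra terms such as momentum and forgetting, which are left out here. The names memory_step, recall_error, and lr are illustrative and not taken from the paper.

```python
def matvec(M, k):
    """Multiply memory matrix M by key vector k."""
    return [sum(m * kj for m, kj in zip(row, k)) for row in M]

def recall_error(M, k, v):
    """Squared error between the recalled value M k and the target v."""
    return sum((p - t) ** 2 for p, t in zip(matvec(M, k), v))

def memory_step(M, k, v, lr=0.1):
    """One gradient step on 0.5 * ||M k - v||^2 with respect to M."""
    err = [p - t for p, t in zip(matvec(M, k), v)]
    # The gradient with respect to M is the outer product of err and k.
    return [[m - lr * e * kj for m, kj in zip(row, k)] for row, e in zip(M, err)]

M = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [2.0, -1.0]

before = recall_error(M, key, value)
for _ in range(20):          # each incoming token triggers one small update
    M = memory_step(M, key, value)
after = recall_error(M, key, value)
print(before, after)
```

Each step depends on the weights produced by the previous step, which is exactly the sequential dependency that makes training hard to parallelize.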

TNT (Two-stage Non-linear Training) solves this by separating training efficiency from inference quality. In the first stage, it uses a hierarchical memory structure — a global module handles large chunks of context for efficiency, while multiple local modules handle fine-grained details in parallel. A periodic reset of the local memory states removes sequential dependencies and enables parallel computation. In the second stage, a short fine-tuning pass adapts the local modules to smaller chunk sizes for higher inference accuracy.
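The reason a periodic reset enables parallelism can be shown with a toy sketch. It assumes a scalar local memory with a simple running update, which is a stand-in and not TNT's actual module; local_memory and split are hypothetical helper names. Because every chunk starts from the same reset state, each chunk can be processed independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor

def local_memory(chunk, lr=0.5):
    """Process one chunk with a toy local memory that is reset at the chunk start."""
    state = 0.0                      # reset: no dependency on earlier chunks
    for x in chunk:
        state += lr * (x - state)    # updates are sequential only inside the chunk
    return state

def split(tokens, size):
    """Cut the token stream into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = [float(i) for i in range(12)]
chunks = split(tokens, 4)

sequential = [local_memory(c) for c in chunks]
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(local_memory, chunks))
print(sequential == parallel)
```

Without the reset, the second chunk would need the final state of the first, and the chunks could not be dispatched at the same time.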

The result is up to 17× faster training than baseline Titans configurations, with better accuracy. This matters because it removes a practical barrier that was quietly blocking the adoption of deep memory models at scale.

2. Nested Learning and Hope: A Model That Learns Its Own Update Rules

Paper: Nested Learning: The Illusion of Deep Learning Architectures (NeurIPS 2025) | Google Research Blog

Google Research introduced the Nested Learning paradigm at NeurIPS 2025. The core idea is that a model’s architecture and its optimization algorithm are not fundamentally different things — they are just optimization problems operating at different timescales. By making this structure explicit, you can design models that update at multiple frequencies simultaneously.

The practical result of this framework is Hope, a variant of Titans. Instead of the original two-speed memory (long-term and short-term), Hope uses a Continuum Memory System (CMS) — a set of memory modules, each updating at a different frequency. Hope also uses this system to optimize its own memory through a self-referential in-context learning loop.
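The multi-frequency idea can be sketched as follows. This is only an illustration of modules updating at different rates, assuming update periods of 1, 4, and 16 tokens and a toy scalar state; the MemoryLevel class and its update rule are hypothetical and are not Hope's implementation.

```python
class MemoryLevel:
    """A toy memory module that updates once every `period` tokens."""

    def __init__(self, period):
        self.period = period
        self.buffer = []     # tokens seen since the last update
        self.state = 0.0     # stand-in for the module's weights
        self.updates = 0

    def observe(self, x, step):
        self.buffer.append(x)
        if step % self.period == 0:
            # Fold the buffered chunk into the state (placeholder for a gradient update).
            chunk_mean = sum(self.buffer) / len(self.buffer)
            self.state += 0.5 * (chunk_mean - self.state)
            self.buffer.clear()
            self.updates += 1

levels = [MemoryLevel(p) for p in (1, 4, 16)]
for step in range(1, 33):
    for level in levels:
        level.observe(float(step), step)

update_counts = [level.updates for level in levels]
print(update_counts)
```

Fast levels track recent input closely, while slow levels change rarely and so retain information over longer spans.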

On long-context Needle-in-a-Haystack (NIAH) benchmarks, Hope consistently outperforms standard Transformers, Titans, and Mamba-2. That said, experiments are at the 340M to 1.3B parameter scale. Whether this holds at production scale is still an open question.

3. MesaNet: Always Solve the Optimization Problem Exactly

Paper: MesaNet: Sequence Modeling by Locally Optimal Test-Time Training (ICLR 2026)

Titans uses gradient descent to approximate the optimal memory update at each step. MesaNet asks: what if we just solve the optimization problem exactly every time?

It does this using a fast conjugate gradient (CG) solver inside a chunkwise-parallel layer. Instead of taking a gradient step, MesaNet computes the exact optimal fast weights for the current context. This costs more compute per step, but because the weights are always fully informed by the current context, the model achieves lower perplexity and stronger performance on several benchmarks compared to other linear-time models.
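As a rough illustration of solving for the fast weights rather than stepping toward them, the sketch below runs textbook conjugate gradient on a small ridge-regression system. It assumes scalar values and a regularizer lam = 0.1 chosen for the example; the real MesaNet layer runs its solver chunkwise over running statistics, which this does not attempt to reproduce.

```python
def cg_solve(A, b, iters=50, tol=1e-20):
    """Textbook conjugate gradient for a symmetric positive definite system A x = b."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # residual b - A x for x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Context so far: three key vectors and the scalar values they should recall.
keys = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
values = [1.0, 3.0, 4.0]
lam, dim = 0.1, 2

# Normal equations of ridge regression: (K^T K + lam * I) w = K^T v.
A = [[sum(k[i] * k[j] for k in keys) + (lam if i == j else 0.0) for j in range(dim)]
     for i in range(dim)]
b = [sum(k[i] * v for k, v in zip(keys, values)) for i in range(dim)]

w = cg_solve(A, b)
residual = max(abs(sum(A[i][j] * w[j] for j in range(dim)) - b[i]) for i in range(dim))
print(w, residual)
```

A single gradient step would only move w part of the way toward this solution; the solver returns the minimizer for the current context directly.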

One thing worth being honest about: on long-context global benchmarks, MesaNet still trails full-attention Transformers. The paper acknowledges this directly — “the performance of all RNNs drops severely” at extended lengths. MesaNet is a step forward within linear-time models, but the gap with full attention at long context has not been closed.

4. InftyThink and TTT-E2E: Two Approaches to Unlimited Context

InftyThink: Summarize as You Think

Paper: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models (ICLR 2026)

InftyThink does not target long input context — it targets long reasoning chains that hit context limits before reaching an answer. The approach is a protocol rather than an architecture change: the model generates a short reasoning segment, then produces a concise summary of its progress, then continues from the summary.

This creates a sawtooth pattern where the active context expands during reasoning and resets after each summary. Reasoning depth becomes unbounded without any growth in memory cost.
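The sawtooth is easy to simulate. The sketch below assumes fixed segment and summary lengths of 1,000 and 100 tokens, values picked for illustration and not taken from the paper; sawtooth_context is a hypothetical helper.

```python
def sawtooth_context(total_reasoning_tokens, segment_len=1000, summary_len=100):
    """Return the active-context length after each generated reasoning token."""
    trace = []
    context = 0
    produced = 0
    while produced < total_reasoning_tokens:
        # Reason for one segment; the context grows token by token.
        for _ in range(min(segment_len, total_reasoning_tokens - produced)):
            context += 1
            produced += 1
            trace.append(context)
        # Replace everything so far with a short summary; the context resets.
        context = summary_len
    return trace

trace = sawtooth_context(10_000)
peak = max(trace)
print(len(trace), peak)
```

Ten thousand reasoning tokens are generated, yet the active context never exceeds one summary plus one segment.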

Models fine-tuned on InftyThink-style data show 3–11% improvements on MATH500, AIME24, and GPQA benchmarks. A follow-up paper, InftyThink+, trains models with end-to-end reinforcement learning to decide when to summarize rather than relying on fixed intervals. It shows a 21% improvement on AIME24 over standard chain-of-thought RL.

TTT-E2E: Compress Context Directly Into Weights

Paper: End-to-End Test-Time Training for Long Context (arXiv:2512.23675) | NVIDIA Blog

TTT-E2E takes a different direction. Rather than designing a new memory architecture, it treats the context itself as training data and runs gradient updates during inference via standard next-token prediction loss.

In practice: as context arrives, the model compresses it into the weights of a designated subset of MLP layers (the final 25% of transformer blocks). Sliding-window attention handles the immediate context. The modified MLP weights hold longer-term memory. No custom memory cells — just standard training infrastructure reused at inference time.

At 128K tokens on an H100, TTT-E2E runs 2.7× faster than full-attention Transformers with constant inference latency, while matching full-attention loss scaling where other linear-time models diverge.

One limitation is worth noting, though it is not specific to TTT-E2E: on NIAH retrieval tasks at 128K tokens, TTT-E2E scores 6% versus 99% for full attention. Mamba-2 and Gated DeltaNet score similarly at 7%. This is a shared failure across all non-full-attention approaches — exact fact retrieval at long context remains unsolved for any compression-based method.

5. TTT3R: The Same Idea Applied to 3D Vision

Paper: TTT3R: 3D Reconstruction as Test-Time Training (ICLR 2026) | Project Page

TTT3R shows that test-time training is not limited to language. It applies the same principle to 3D reconstruction and Structure-from-Motion: treating continuous video frame ingestion as an online learning problem.

Recurrent 3D reconstruction models generalize poorly to long sequences — they train on 64-frame sequences and degrade on longer videos. TTT3R’s fix is to compute the alignment confidence between the current memory state and each incoming frame, and use this confidence to derive a per-token learning rate. When memory aligns well with the observation, the update is small. When it does not, the model adapts more aggressively.
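A minimal sketch of the confidence-gated update as described above. The gating function here (cosine similarity mapped to the range 0 to 1, with the learning rate set to one minus that confidence) is an assumption made for illustration; TTT3R derives its confidence from the model's own attention statistics, and gated_update is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def gated_update(state_tokens, observation):
    """Move each state token toward the observation with its own learning rate."""
    new_state, rates = [], []
    for token in state_tokens:
        confidence = (cosine(token, observation) + 1.0) / 2.0   # map [-1, 1] to [0, 1]
        lr = 1.0 - confidence        # well-aligned tokens barely move
        new_state.append([s + lr * (o - s) for s, o in zip(token, observation)])
        rates.append(lr)
    return new_state, rates

state = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
obs = [1.0, 0.0]
new_state, rates = gated_update(state, obs)
print(rates)
```

The first token already matches the observation and is left alone, while the third disagrees completely and is overwritten.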

This training-free change delivers a 2× improvement in global pose estimation, runs at 20 FPS on 6GB of VRAM, and handles thousands of frames without performance collapse.

6. Mamba-3: A Disciplined Upgrade to State Space Models

Paper: Mamba-3: Improved Sequence Modeling using State Space Principles (ICLR 2026)

While test-time training approaches have attracted most of the attention, the State Space Model (SSM) camp made a more conservative but technically rigorous step forward with Mamba-3.

Three specific changes distinguish it from Mamba-2, all grounded in classical control theory:

Exponential-Trapezoidal Discretization: Mamba-1 and Mamba-2 used first-order (Euler) discretization when converting continuous SSMs to discrete sequences. Mamba-3 uses a second-order approximation, which is more accurate and has the side effect of making the short causal convolution that other models carry unnecessary.

Complex-Valued States via Data-Dependent RoPE: Real-valued SSMs cannot represent oscillatory dynamics — they perform no better than random on tasks like parity checking or modular arithmetic. Complex values fix this but at 4× compute cost. Mamba-3 proves that a complex-valued SSM is mathematically equivalent to a real-valued SSM with data-dependent Rotary Positional Embeddings (RoPE). This gives the expressivity of complex values at the speed of real arithmetic.

MIMO (Multi-Input Multi-Output) State Updates: Standard SSM decoding has very low arithmetic intensity (~2.5 ops/byte), making it memory-bound and hardware-inefficient. Switching to a rank-R state update increases arithmetic intensity and improves GPU utilization without increasing decode latency.
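To see why the discretization change in the first item matters, the sketch below compares a first-order rule with a trapezoidal rule on a scalar system h' = a h + x(t). It assumes a = -1, a step size of 0.5, and input x(t) = t, all chosen for the example; the update formulas are a generic illustration in the spirit of the description above, and the exact parameterization in Mamba-3 may differ.

```python
import math

def simulate(a, dt, steps, x, rule):
    """Discretize h' = a*h + x(t) starting from h(0) = 0."""
    decay = math.exp(a * dt)              # exact state transition over one step
    h = 0.0
    for k in range(1, steps + 1):
        x_prev, x_cur = x((k - 1) * dt), x(k * dt)
        if rule == "euler":
            # First order: the input is sampled only at the end of the step.
            h = decay * h + dt * x_cur
        else:
            # Trapezoidal: average the (decayed) start and end inputs.
            h = decay * h + 0.5 * dt * (decay * x_prev + x_cur)
    return h

a, dt, steps = -1.0, 0.5, 4
T = dt * steps
exact = T - 1.0 + math.exp(-T)            # closed form for h' = -h + t, h(0) = 0

euler_err = abs(simulate(a, dt, steps, lambda t: t, "euler") - exact)
trap_err = abs(simulate(a, dt, steps, lambda t: t, "trapezoid") - exact)
print(euler_err, trap_err)
```

The trapezoidal rule mixes the previous input with the current one, which is one way to read the claim that a separate short convolution over recent inputs becomes unnecessary.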

At 1.5B parameters, Mamba-3 achieves comparable perplexity to Mamba-2 using half the state size.
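The complex-state claim in the second change above rests on a simple identity: multiplying a complex number by a unit-modulus factor is the same as rotating a pair of real numbers. The sketch below checks that identity and uses it for parity, assuming a rotation of pi per 1-bit. It illustrates the principle only; the paper's construction applies data-dependent rotations inside the SSM projections.

```python
import cmath
import math

def complex_step(h, theta, x):
    """Complex-valued state update: rotate h by theta, then add the input."""
    return cmath.exp(1j * theta) * h + x

def rotation_step(h2, theta, x2):
    """The same update written with real numbers and a 2x2 rotation."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * h2[0] - s * h2[1] + x2[0], s * h2[0] + c * h2[1] + x2[1])

thetas = [0.3, 1.1, -0.7, 2.0]
inputs = [1.0 + 0.5j, -0.2j, 0.7 + 0j, -1.0 + 1.0j]
hc, hr = 0j, (0.0, 0.0)
for theta, x in zip(thetas, inputs):
    hc = complex_step(hc, theta, x)
    hr = rotation_step(hr, theta, (x.real, x.imag))
max_gap = max(abs(hc.real - hr[0]), abs(hc.imag - hr[1]))

def parity_via_rotation(bits):
    """Track parity by rotating the state half a turn for every 1-bit."""
    h = (1.0, 0.0)
    for b in bits:
        h = rotation_step(h, math.pi * b, (0.0, 0.0))
    return 0 if h[0] > 0 else 1

print(max_gap, parity_via_rotation([1, 0, 1, 1]))
```

A state that can only decay by a positive real factor never changes sign, so it has no way to track parity like this.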

What These Papers Are Really About

Looking across all six developments, the underlying shift is the same: the question is no longer whether a model’s weights are static at inference time, but how and when they should change.

TTT-E2E compresses entire contexts into weights via gradient descent. TNT makes that update mechanism practical to train. Hope and Nested Learning generalize it across multiple timescales. MesaNet solves the update problem exactly. InftyThink extends the idea to reasoning chains. TTT3R applies it to video. Mamba-3 achieves similar gains through classical SSM theory without touching weights at inference at all.

A few honest caveats to close with:

Exact retrieval is still a weakness. Gradient-based memory compression is good at capturing general patterns, not at preserving specific facts. NIAH benchmarks make this visible. Applications that need precise retrieval — code navigation, legal documents, structured data — are not yet well served by these approaches.

Long-context quality gaps remain. MesaNet, Mamba-3, and similar models have closed the gap with full-attention Transformers on many tasks, but the hardest long-context benchmarks still favor quadratic attention.

Scale is mostly untested. Most experiments here are at 300M–1.5B parameters. Whether the results transfer to 70B+ models in production settings is still unknown.

Why Context Compression Sometimes Fails

Takuma Yamaguchi (Kumon) — Fri, 27 Feb 2026 11:04:39 GMT

AI models are growing longer memories — but longer doesn’t always mean better. When context compression fails, most engineers blame the algorithm. A new paper argues the real problem is the data itself.

The Problem: AI Memories Are Getting Too Long

Modern AI assistants can process entire books, long conversations, and massive documents. But this comes at a cost. Longer inputs mean higher computing bills, slower responses, and more energy use.

Context compression offers a solution. The idea is simple: before feeding a long document to an AI, compress it — like zipping a file — so the model only has to process the key information.

Compression doesn’t always work well. Sometimes the compressed version loses important meaning. Sometimes the AI gives worse answers than if you had just skipped compression entirely.

Researchers assumed the problem was in the compression algorithm. A new paper, “Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Models”, argues the real problem is somewhere else entirely: the data itself.

Data Complexity Is the Key Variable

The researchers from Alibaba investigated context compression from a fresh angle. Instead of asking “which compression method is best?”, they asked: “what properties of the input data make compression succeed or fail?”

Their answer: entropy matters most.

Entropy, in this context, measures how unpredictable or complex a piece of text is. High entropy text is dense with information — every word carries weight and removing any of it risks losing something important. Low entropy text is more repetitive and predictable, making it easier to compress without losing meaning.
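As a rough way to build intuition, the sketch below computes Shannon entropy over whitespace-separated tokens. This unigram estimate is a crude proxy assumed for illustration; the paper's own entropy measure may be defined differently, and token_entropy is a hypothetical helper.

```python
import math
from collections import Counter

def token_entropy(text):
    """Shannon entropy in bits per token of the text's unigram distribution."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

low = token_entropy("the cat sat the cat sat the cat sat")
high = token_entropy(
    "quantum lattice entropy decoder mismatch autoencoder gradient tokens vary"
)
print(low, high)
```

Both strings have nine tokens, but the repetitive one carries about 1.58 bits per token while the one with nine distinct tokens carries about 3.17.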

The researchers found a clear pattern: higher input entropy leads to worse compression quality. When text is complex and information-dense, compression algorithms struggle to preserve meaning. When text is simpler and more structured, compression works much better.

Think of it like packing a suitcase. A suitcase full of clothes can be compressed by rolling and folding. But a suitcase full of delicate electronics cannot. The contents determine how well packing works, not the packing technique.

Entropy measures how unpredictable text is. High-entropy text (left) is dense with unique tokens — compress it and meaning is lost. Low-entropy text (right) has patterns that compress cleanly.

The Second Finding: Mismatched Knowledge Breaks Compression

The research used an autoencoder-based framework to study compression. In this setup, one part of the system (the encoder) compresses the text, and another part (the decoder) reads it back.

This revealed a second key finding: when the encoder and decoder have different background knowledge, compression quality drops significantly.

Context compression doesn’t just squeeze text mechanically — it relies on shared knowledge between the compressor and the reader. If the encoder “knows” that certain words can be inferred from context and removes them, but the decoder doesn’t have that same background knowledge, meaning gets lost.

The researchers found this knowledge gap is hard to fix. Unlike entropy, which you can potentially address by preprocessing your data, a mismatch in background knowledge between encoder and decoder is a structural problem that cuts into compression quality in ways that are difficult to work around.

Context compression relies on shared knowledge. When the encoder and decoder are trained on different data, the encoder removes details the decoder needed — and cannot recover them.

What Makes This Research Different

Most context compression research focuses on designing better algorithms. This paper takes a step back to ask a more fundamental question: even if you had a perfect algorithm, what properties of the data would limit it?

The answer has practical implications. Before choosing a compression method, you should consider:

How complex is your text? Technical documents, legal texts, and scientific papers have high information density. Compression will be harder and riskier for them than for simpler conversational text.
Do your encoder and decoder share knowledge? If you’re compressing with one model and reading with another (a common setup), their differences in training data could silently degrade quality.

The model-centric approach applies the same compression algorithm to everything. The data-centric approach analyzes input first, then selects the method that fits.

The Broader Picture: Data-Centric AI

This paper is part of a broader shift in AI research toward data-centric thinking. For years, the focus was on building better models. More recently, researchers have started asking: what if the data itself is the bottleneck?

In the context compression space, this means accepting that no algorithm can fully compensate for fundamentally difficult input. A better compression method helps, but understanding your data distribution matters just as much.

For engineers building AI applications that use long contexts, this is a useful reminder: don’t just benchmark your compression algorithm on average cases. Test it on the hardest, most information-dense text you expect to see. That’s where it will fail first.

Conclusion

Context compression is a promising way to make AI faster and cheaper. But it’s not a silver bullet. How well it works depends heavily on the complexity of your input text and whether your encoder and decoder share enough background knowledge.

The paper's framework gives researchers a clearer way to evaluate and predict compression quality, not just in general but for specific types of data. The result is a more honest accounting of where compression helps, and where it still falls short.

AI Hit a Wall

Takuma Yamaguchi (Kumon) — Mon, 16 Feb 2026 12:42:42 GMT

Why the industry is shifting from bigger models to smarter agents

For years, making AI smarter meant making it bigger. That era is ending. A new survey of over 50 AI models reveals two major shifts happening right now. First, the industry is pivoting to six new approaches that build better AI without burning through data and money. Second, AI is evolving from systems that just answer questions into agents that can think, plan, and use tools to solve real problems.

The Wall Everyone Hit

Building smarter AI used to be simple. You made the model bigger. You fed it more data. The results got better. But this approach is breaking down fast.

Projections suggest the supply of internet text available for training will run out between 2026 and 2028, once an estimated 9 to 27 trillion tokens have been used up. Training costs jumped from $3 million to over $300 million in just five years. Energy use increased 22 times.

The industry calls this the scaling wall. It’s the point where throwing more resources at AI stops making it better.

A major survey called LLMOrbit (Badri et al., 2026) analyzed over 50 models across 15 organizations from 2019 to 2025. It documents a remarkable paradigm shift already happening across the field. The old approach is dying, but two new paths forward are emerging.

Six Ways Around the Wall

The first path is about efficiency. Six new paradigms are emerging across the industry to make AI better without making it bigger.

Test-time compute lets models think longer when they answer. Instead of responding instantly, they spend more time reasoning. Models like o1 and DeepSeek-R1 use 10x more compute during inference. The result? They match GPT-4 performance without needing massive training budgets.

Quantization compresses models to 4–8x smaller sizes. Think of it like zipping a file. A related technique, Multi-head Latent Attention, compresses the KV cache by 8x. Together, techniques like these enable GPT-4-level performance at under $0.30 per million tokens. The cost drops dramatically.

Distributed edge computing spreads work across many small devices instead of one supercomputer. This approach cuts costs by 10x. You don’t need a massive data center anymore.

Model merging combines strengths from different models. It’s like mixing the best parts of several recipes to create something better.

Efficient training methods reduce waste. ORPO cuts memory use by 50%. Mixture of Experts (MoE) routing delivers 18x efficiency gains. The models learn faster and cheaper.

Small specialized models can match giants. Phi-4 has only 14 billion parameters. It performs as well as much larger models. Size isn’t everything anymore.
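Of the six approaches above, quantization is the easiest to show in a few lines. The sketch below assumes symmetric per-tensor int8 quantization with a single scale factor, which is one common scheme among several; the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # fall back to 1.0 if all zeros
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Storing 8-bit integers in place of 32-bit floats gives the 4x end of the range mentioned above, and 4-bit schemes give the 8x end, at the cost of a rounding error of at most half the scale per weight.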

The results speak for themselves. DeepSeek-R1 scored 79.8% on the difficult AIME 2024 math benchmark. Llama 3 hit 88.6% on the MMLU knowledge test. GPT-4 scored 86.4% on the same test. Open-source models are now competitive with frontier closed models on standardized benchmarks.

The Bigger Shift: From Passive to Active

But something even more important is happening. AI is evolving from passive tools into active agents.

Think about traditional AI. You ask a question. It gives an answer. That’s it. The conversation ends.

Agentic AI is different. It can sense what’s needed. It thinks through the problem. It takes action. It uses tools. It checks if the action worked. Then it tries again if needed.

This is a fundamental change. We’re moving through three nested stages.

Stage 1: LLM Foundation. Models that understand and generate text.

Stage 2: GenAI. Models that create content on demand.

Stage 3: Agentic AI. Models that act autonomously to achieve goals.

Most people are still thinking about Stage 2. But Stage 3 is already here.

How Agentic Systems Think

Agentic AI follows a core cycle: Sense-Think-Act.

Sense means perceiving what’s happening. The agent reads the situation. It understands the context. It knows what tools are available.

Think means reasoning about what to do. This isn’t just pattern matching. It’s genuine problem solving.

Act means taking action in the world. The agent uses tools. It calls APIs. It retrieves information. It changes things.

This cycle repeats until the goal is achieved.
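The cycle can be sketched as a plain loop. The environment here is a toy dictionary with a counter and the "thinking" is a single comparison, both assumptions made to keep the example short; a real agent would call a model at the Think step.

```python
def run_agent(goal, environment, max_steps=10):
    """Repeat Sense-Think-Act until the counter reaches the goal."""
    history = []
    for _ in range(max_steps):
        observation = environment["counter"]       # Sense: read the current state
        if observation >= goal:                    # Think: is the goal achieved?
            break
        action = "increment"                       # Think: choose the next action
        environment["counter"] += 1                # Act: change the environment
        history.append((observation, action))
    return history

env = {"counter": 0}
steps = run_agent(goal=3, environment=env)
print(steps, env)
```

The max_steps cap is a practical safeguard so that an agent that never reaches its goal still stops.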

The Building Blocks

Several key techniques make this possible.

ReAct is a framework that bridges reasoning and action. It combines three things: internal deliberation, external tool interaction, and feedback loops. The agent thinks out loud about what to do. Then it does it. Then it checks the result and adjusts.

RAG (Retrieval-Augmented Generation) lets agents pull fresh information from external sources. Instead of relying only on training data, they query knowledge bases in real time. This grounds their responses in current facts.

Chain-of-Thought (CoT) makes reasoning visible. The model generates intermediate steps before giving a final answer. You can see how it got there.

Tree-of-Thoughts (ToT) goes further. Instead of one chain of reasoning, the model explores multiple paths at once. It considers different approaches. It picks the best one.

Tool use is the game changer. Agents can recognize when they need a tool. They format the request correctly. They call the tool. They process the results. They continue with the task.

This is like the difference between thinking about hammering a nail and actually picking up a hammer.
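The tool-use steps above (recognize, format, call, process) can be sketched as a small dispatcher. The JSON request shape and the two tools are hypothetical and chosen for illustration; real frameworks each define their own schemas.

```python
import json

# Hypothetical tool registry: a name mapped to a plain Python callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def call_tool(request_json):
    """Parse a model-produced request, run the tool, and return a structured result."""
    request = json.loads(request_json)
    name, args = request["tool"], request.get("args", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**args)}
    except TypeError as exc:               # wrong or missing arguments
        return {"ok": False, "error": str(exc)}

good = call_tool('{"tool": "add", "args": {"a": 2, "b": 3}}')
bad = call_tool('{"tool": "search", "args": {}}')
print(good, bad)
```

Returning errors as data, rather than raising, lets the agent read the failure and try a different request on the next turn.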

Memory Makes It Real

For agents to work across time, they need memory.

Modern systems have three types:

Episodic memory stores specific experiences. “Last time I tried this approach, it failed.”

Semantic memory stores general knowledge. “API keys go in headers, not URLs.”

Procedural memory stores how to do things. “To authenticate, first get a token, then include it in requests.”
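One way to picture the three memory types is as separate stores on a single object. This is a sketch of a possible data structure, not how any particular agent framework implements memory; AgentMemory and its methods are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    episodic: list = field(default_factory=list)     # specific past experiences
    semantic: dict = field(default_factory=dict)     # general facts
    procedural: dict = field(default_factory=dict)   # named step-by-step recipes

    def remember_episode(self, action, outcome):
        self.episodic.append({"action": action, "outcome": outcome})

    def failed_before(self, action):
        """Check episodic memory for an earlier failure of the same action."""
        return any(
            e["action"] == action and e["outcome"] == "failed" for e in self.episodic
        )

memory = AgentMemory()
memory.remember_episode("call API without token", "failed")
memory.semantic["api_key_location"] = "headers"
memory.procedural["authenticate"] = ["get a token", "include it in requests"]
print(memory.failed_before("call API without token"))
```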

With context windows now exceeding 128,000 tokens, agents can remember long conversations and complex tasks.

Planning Gets Sophisticated

Early agents were reactive. They responded to immediate inputs.

Newer agents are deliberative. They plan multiple steps ahead.

The most advanced are adaptive. They adjust plans when things change.

This requires scale. The paper identifies three requirements for reasoning emergence: training on over 100 billion tokens (10¹¹), reinforcement learning with verifiable feedback, and test-time search to explore options.

These requirements are steep. But models like o1 and DeepSeek-R1 are meeting them.

Multi-Agent Systems

The next frontier is multiple agents working together.

Why? Because specialization works.

One agent handles research. Another writes code. A third reviews for errors. They work in parallel. They hand off tasks. Emergent capabilities appear that no single agent has.

The Model Context Protocol (MCP) standardizes how agents connect to tools and data sources. It’s like HTTP for AI agents. Everyone speaks the same language.

What This Means for Us

We’re witnessing two paradigm shifts at once.

The first shift is efficiency. AI is getting better without getting bigger. It’s getting cheaper. It’s using less energy. Open-source models are now competitive with expensive private ones. This democratizes access.

The second shift is agency. The entire field is moving from answering questions to solving problems. From generating text to taking action. From tools we use to agents that work alongside us.

These changes compound. Efficient models make agents affordable. Agentic capabilities make AI useful for complex real-world tasks.

The field is evolving from passive AI toward agentic systems. This shift is still early, but the direction is clear.

And unlike the scaling race, this evolution doesn’t require billion-dollar budgets. It requires smart design. That means more people can participate. More innovations will emerge.

The wall that stopped scaling isn’t stopping progress. It’s redirecting the entire industry toward something more sustainable and more powerful.

Conclusion

AI progress isn’t slowing down. It’s growing up. Instead of building bigger models that consume more resources, the industry has shifted to six new paradigms that work smarter. Instead of passive systems that only respond, the field is moving toward active agents that think, plan, and act.

The combination is powerful. Efficient models make agents practical. Agentic capabilities make AI genuinely useful. And with open-source models becoming increasingly competitive, access to capable AI is democratizing. The scaling wall isn’t the end of AI progress. It’s the beginning of something better.

AI Regulation Developments in 2026

Takuma Yamaguchi (Kumon) — Tue, 27 Jan 2026 14:09:13 GMT

A collection of notes on US state laws and global frameworks as of Jan 2026

I’ve been trying to keep up with AI regulation lately, and it seems like things are moving pretty fast in different parts of the world. This is mainly focused on what’s happening in the US — especially California, Colorado, and Texas where new laws are already in effect or coming soon. But I also wanted to include some notes on other countries because honestly, some of them are moving faster than the US in terms of having clear rules. The EU has a pretty comprehensive law now, and countries like China, South Korea, and Japan all have their own frameworks. Singapore and India are taking a softer approach with guidelines instead of strict laws.

United States

In recent years, there has been increasing discussion about how to regulate AI in the United States. While there is still no comprehensive federal law on AI, some states like California, Colorado, and Texas have started introducing their own regulations. As a non-expert, I tried to collect and summarize some of the latest developments, focusing on the laws that are already in force or will take effect soon as of 2026. I also included some notes on how companies are reacting.

Who These Laws Apply To (Important Distinction)

One thing that confused me at first was that not all these laws apply to the same type of organizations. After reading more carefully, I noticed there are basically two groups:

Developers or providers of AI models (like OpenAI, Anthropic, Google DeepMind)

These are responsible for transparency, safety, and sometimes publishing training data details.
Laws that apply: California SB 53, AB 2013

Businesses that use AI systems in their operations (for example, companies using OpenAI API for customer service or loan decisions)

These are required to inform users, explain decisions, prevent bias, and offer appeal processes.
Laws that apply: Colorado AI law, Texas TRAIGA

California SB 53 — Transparency in Frontier AI Act

Who is covered: AI model developers (large-scale frontier models)

Status: Enacted in 2025, effective January 1, 2026

Main points:

Applies to companies with more than $500M in annual revenue that develop powerful models (trained with more than 10²⁶ floating-point operations).
These companies need to assess catastrophic risks and publish a “Frontier AI Framework.”
Must publish transparency reports before releasing or significantly updating such models.
Must report critical safety incidents within 15 days, or within 24 hours when there is an imminent risk of death or serious injury.
Internal whistleblower protections are also required.

Reference:

California SB 53 Overview: https://www.brookings.edu/articles/what-is-californias-ai-safety-law/
California SB 53 Bill Text: https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240SB53

California AB 2013 — Generative AI Training Data Transparency Act

Who is covered: Developers/providers of generative AI systems

Status: Effective January 1, 2026

Main points:

Requires providers to publish a high-level summary of training data on their websites.
Applies to any model released on or after January 1, 2022 and accessible to people in California.
Includes items like data sources, copyright content, personal data, etc.
xAI (Elon Musk’s company) is suing the state, claiming the law violates freedom of speech.

Reference:

California AB 2013 Bill Text: https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202320240AB2013
xAI Lawsuit Coverage: https://law-ai.org/xais-challenge-to-californias-ai-training-data-transparency-law-ab2013/

Colorado SB 24–205 — AI Consumer Protection Law

Who is covered: Both AI developers and deployers (businesses that use AI systems)

Status: Effective June 30, 2026 (delayed from original February plan)

Main points:

Applies only to “high-risk AI systems” (e.g., hiring, loans, insurance)
Developers must document design, risk, and bias mitigation.
Deployers must notify users, allow appeals, and report discovered algorithmic discrimination to the state Attorney General within 90 days.
NIST AI RMF can provide safe harbor.

Reference:

Colorado SB 24–205 Bill Text: https://leg.colorado.gov/bills/sb24-205

Texas HB 149 — Texas Responsible AI Governance Act (TRAIGA)

Who is covered: AI developers and deployers in Texas

Status: Effective January 1, 2026

Main points:

Prohibits AI designed to incite harm or criminal activity
Requires disclosure when interacting with AI systems
Establishes a regulatory sandbox for testing innovations

Reference:

Texas HB 149 Bill Text: https://capitol.texas.gov/BillLookup/History.aspx?LegSess=89R&Bill=HB149

How Companies Are Responding

OpenAI: Formed a Safety & Security Committee; pre-release risk assessment and transparency reports.

https://openai.com/index/openai-board-forms-safety-and-security-committee/

Anthropic: Responsible Scaling Policy with safety levels based on model capability.

https://www.anthropic.com/rsp-updates

Google: Publishes an annual Responsible AI Report.

https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf

What About Businesses That Use AI?

If your company uses AI for decision-making (e.g., hiring, lending), you may be subject to new legal obligations:

Notify users when AI is used
Keep records of decisions
Allow appeals or human review
Monitor bias and fairness

Recommended practices include:

Reviewing provider documentation
Keeping internal risk assessments
Updating internal policies (e.g., using NIST AI RMF)

Developments Outside the US

EU AI Act

Comprehensive risk-based regulation for the EU
Unacceptable-risk systems banned
High-risk AI must pass conformity assessments and include oversight and documentation
General-purpose AI models (foundation models) face transparency, energy use, and safety obligations
Enforced by the European AI Office together with national authorities

Reference:

EU AI Act Summary: https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

Japan

AI Promotion Law (enacted 2025), fully enforced from 2025
Transparency and documentation encouraged, not required
Guidance includes data disclosure and explanation of AI decision processes

Reference:

Japan AI Promotion Law: https://www8.cao.go.jp/cstp/ai/ai_act/ai_act.html

China

Generative AI regulations enforced since 2023
Requires registration and audit of models, labeling of content (including watermarks)
Strict government enforcement (CAC)

Reference:

China Generative AI Interim Rules: https://www.chinalawtranslate.com/en/generative-ai-interim/

South Korea

AI Basic Act (2025), effective from 2026
High-impact AI systems must be risk-assessed and auditable
Human oversight and content labeling are mandatory

Reference:

South Korea AI Basic Law: https://www.onetrust.com/blog/south-koreas-new-ai-law-what-it-means-for-organizations-and-how-to-prepare/

Singapore

No binding AI law; uses voluntary frameworks
Notable initiatives: Model AI Governance Framework, Agentic AI guidance
Strong emphasis on explainability, human oversight

Reference:

Singapore AI Governance Framework: https://www.pdpc.gov.sg/help-and-resources/2020/01/model-ai-governance-framework

India

AI Governance Guidelines (2025), no law yet
Encourages transparency reports, impact assessments, and complaint channels
Legislation being considered

Reference:

India AI Governance Guidelines: https://static.pib.gov.in/WriteReadData/specificdocs/documents/2025/nov/doc2025115685601.pdf

Conclusion

As AI regulation continues to evolve globally, the regulatory landscape is becoming increasingly complex and varied across jurisdictions. While the United States has adopted a decentralized, state-level approach, regions like the EU, China, and South Korea are implementing comprehensive national frameworks with clear enforcement mechanisms. Understanding whether your organization acts as an AI developer, provider, or deployer is critical, as obligations differ significantly across these roles.

The convergence of global standards around transparency, risk assessment, and human oversight suggests that proactive preparation is no longer optional. Organizations should begin documenting their AI systems, conducting regular bias and risk assessments, and establishing clear governance processes — regardless of whether specific regulations currently apply in their jurisdiction. The trajectory is clear: AI accountability is becoming a fundamental business requirement worldwide.

Prompt Repetition for Non-Reasoning LLMs: A Reproduction Study

Takuma Yamaguchi (Kumon) — Sat, 03 Jan 2026 13:40:19 GMT

Experiments with Gemini 2.0 Flash Lite on Name Index Tasks

Recently, many large language models focus on reasoning. Chain-of-Thought, step-by-step reasoning, and deliberate reasoning are now common.

Against that backdrop, the paper “Prompt Repetition Improves Non-Reasoning LLMs”, written by researchers at Google Research, is easy to overlook at first.

After reading the paper and trying to reproduce its results, I realized that it offers a very useful and simple perspective. This perspective is still helpful for understanding LLM behavior today.

The figure below, taken from the original paper, summarizes the main effect reported by the authors. This figure shows that prompt repetition can significantly improve performance for non-reasoning models.

In this post, I focus on the following points:

what the paper means by non-reasoning LLMs
why prompt repetition can work
why the effect is limited for recent models
how I reproduced the result using Gemini 2.0 Flash Lite

What Is a Non-Reasoning LLM in This Paper?

In this paper, non-reasoning LLM does not mean a model that cannot reason.

It means a model that:

is trained mainly with next-token prediction
is not trained to output reasoning steps explicitly
does not rely on Chain-of-Thought during inference

These models can still solve many tasks, but they do not intentionally generate reasoning processes.

On the other hand, reasoning models:

are trained with Chain-of-Thought or similar data
learn to output intermediate steps
are more robust for complex logical tasks

Why Prompt Repetition Works

Prompt repetition does not add reasoning ability. Instead, it changes how the model attends to the input.

In causal language models:

earlier tokens have stronger influence
long inputs reduce attention stability
important details can be missed

By repeating the same prompt:

important tokens appear multiple times
their probability becomes higher
attention becomes more stable

In short, repetition helps the model read the input more carefully. It does not make the model smarter, but it makes mistakes less likely.

This is why repetition works well for:

long lists
position-based questions
tasks that are deterministic but attention-sensitive

Why the Effect Is Limited for Recent LLMs

When I tested many examples with recent models, most of them were already correct without repetition. This does not mean the paper is wrong.

Recent LLMs:

are often trained as reasoning models
handle attention better
internally re-check information even without explicit CoT prompts

Because of this:

baseline accuracy is already high
there is little room for improvement
repetition does not show a clear effect

To reproduce the paper today, it is often necessary to use:

lightweight models
speed-optimized models
models that focus less on reasoning

Reproducing the Result with Gemini 2.0 Flash Lite

The paper reports strong effects with Gemini 2.0 Flash Lite, so I used this model.

One task where I could reproduce the effect is the Name Index task (25th of 50) with noise.

Prompt Used

Here is a list of names. Comments in parentheses are NOT part of the list.

Arden, Bexley, Corin, Daxter, Elric,
Fenna, Garrick, Helia, Ivor, Jessa, 
Kael, Liora, Merek, Nyra, Orin,
(Pause here)
Pryce, Quill, Rhea, Soren, Talia, Ulric, Vanna, Wren, Xander, Yara,
Ziven, Alric, Brisa, Cato, Delia, Eamon, Freya, Galen, Hilda, Isen,
Joran, Kiera, Lucan, Mira, Nolen, Ophra, Perrin, Quinlan, Riven, Selah,
Torin, Una, Veska, Wyeth, Zora

What is the 25th name in the list?
Answer with only the name.

Correct Answer

Yara

Without prompt repetition, Gemini 2.0 Flash Lite gave a wrong answer. The error was usually an off-by-one mistake. With prompt repetition, the answer became stable and correct.
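To make the setup concrete, here is a minimal Python sketch that builds the Name Index prompt, derives the expected answer, and produces a repeated variant. It assumes that "prompt repetition" simply means sending the same prompt text twice in one message; the helper names (build_prompt, repeat_prompt, expected_answer) and the blank-line separator are my own choices for illustration, not something taken from the paper.

```python
# Sketch: build the Name Index prompt and a repeated variant.
# Assumption: repetition = the same prompt text sent twice in one message.

NAMES = [
    "Arden", "Bexley", "Corin", "Daxter", "Elric",
    "Fenna", "Garrick", "Helia", "Ivor", "Jessa",
    "Kael", "Liora", "Merek", "Nyra", "Orin",
    "Pryce", "Quill", "Rhea", "Soren", "Talia",
    "Ulric", "Vanna", "Wren", "Xander", "Yara",
    "Ziven", "Alric", "Brisa", "Cato", "Delia",
    "Eamon", "Freya", "Galen", "Hilda", "Isen",
    "Joran", "Kiera", "Lucan", "Mira", "Nolen",
    "Ophra", "Perrin", "Quinlan", "Riven", "Selah",
    "Torin", "Una", "Veska", "Wyeth", "Zora",
]


def build_prompt(names, index, noise_after=15):
    """Return the question text with a noise comment inserted into the list."""
    head = ", ".join(names[:noise_after])
    tail = ", ".join(names[noise_after:])
    return (
        "Here is a list of names. "
        "Comments in parentheses are NOT part of the list.\n\n"
        f"{head},\n(Pause here)\n{tail}\n\n"
        # The "th" suffix is only correct for indices such as 25; kept simple.
        f"What is the {index}th name in the list?\n"
        "Answer with only the name."
    )


def repeat_prompt(prompt, times=2):
    """Concatenate the same prompt several times (separator is an assumption)."""
    return "\n\n".join([prompt] * times)


def expected_answer(names, index):
    """Ground truth for a 1-based position in the list."""
    return names[index - 1]


prompt = build_prompt(NAMES, 25)
repeated = repeat_prompt(prompt)
print(expected_answer(NAMES, 25))
```

Computing the ground truth in code rather than counting by hand is useful here, because the task is exactly the kind where an off-by-one slip is easy to make.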

The screenshots below show example outputs from Vertex AI Studio Chat. They illustrate how the answer changes with and without prompt repetition. For consistency, I set the temperature to 0 in all experiments. This removes sampling randomness from the output, so any difference should come from the prompt itself.

w/o prompt repetition

w/ prompt repetition

Reproduction Was Possible, but Not Always Easy

I could reproduce the effect in several cases. However, in many other cases:

the model already gave the correct answer
repetition did not change the result

This is especially true for recent reasoning models.

This matches the paper’s assumption. The paper focuses on non-reasoning behavior, which is becoming less common. In that sense, the difficulty of reproduction is also an important observation.

What I Found Useful in This Paper

While the effect of prompt repetition is limited for many recent models, I still found this paper useful.

In particular, it helped me notice a few points:

the method is extremely simple
it does not introduce any obvious downside
it provides a clear example of how attention-related errors can happen in LLMs

Prompt repetition is not meant to replace reasoning models. Instead, it works as a reminder that small changes in prompt design can sometimes improve model behavior in simple but meaningful ways.

Conclusion

This post looked at the paper “Prompt Repetition Improves Non-Reasoning LLMs” and my attempt to reproduce some of its results.

Through this process, I learned more about:

how non-reasoning models can fail in position-based tasks
why off-by-one errors are common in such cases
how simple prompt design can sometimes reduce these errors

Even today, this perspective is helpful when thinking about how LLM outputs are influenced by attention and prompt structure.

Becoming AI‑Native at Mercari: Group Strategy and a US Case Study

Takuma Yamaguchi (Kumon) — Thu, 11 Dec 2025 00:36:07 GMT

How we’re transforming our products, culture, and ways of working with AI

We at Mercari US Engineering previously shifted to publishing on our corporate site, but to once again share our technological advancements and engineering culture more widely and quickly with the global community, we’re returning to this blog.

At the Group level, we are embedding AI into product development and operations to improve velocity, quality, and customer experience. With this post, we aim to introduce our Mercari Group’s transformation toward becoming AI-Native, highlight the US business’s strategic positioning, and showcase a recent contribution from the US team at Mercari GEARS 2025.

We first outline the Group’s AI‑Native strategy, then show how Mercari US applies it through an end‑to‑end testing initiative.

Mercari Group’s AI-Native Transformation

This transformation is already producing significant results Group-wide:

Engineering Output: We achieved a dramatic 64% year-over-year increase in output per engineer involved in development.
AI Adoption: 95% of Mercari employees are already using AI tools.
Code Generation: 70% of code generated for product development involves AI.

We see this as an opportunity for exponential growth, aiming for 10x or 100x gains, ultimately believing that AI will “unleash us from the bottlenecks and resource limitations that previously held us back.”

(Source: “Back to Startup” and “AI-Native” — The Next Chapter in Mercari’s Journey at 12 Years)

AI Strategy and Product Innovation

The Group’s strategy focuses on transforming core processes and enhancing user experience and platform reliability:

Product Experience: Our core vision is achieving lower friction for users listing or purchasing items. For example, in Japan, we introduced AI Listing, which automatically generates item descriptions from uploaded photos. We also leverage image recognition to suggest optimal categories and prices, simplifying the entire listing process.

Platform Security: To manage the new risks associated with generative AI/LLM technology, Mercari formed an AI security team in May 2025. This specialized team is dedicated to handling issues such as those outlined in frameworks like the OWASP 2025 Top 10 Risks for Generative AI. (Source: How Mercari’s AI Security Team is Securing AI Native)

These initiatives embody our AI-Native philosophy where every improvement, from user experience to infrastructure, leverages AI at its core.

Mercari US: Building a Hassle-Free Marketplace

At Mercari US, we apply the Group’s AI‑Native principles to improve unit economics and customer experience. We translate these principles into day‑to‑day product decisions and operational improvements for sustainable growth in the US market.

Our core mission in the US is: To build the go-to marketplace for hassle-free selling and discovering deals.

To realize sustainable growth, our strategy focuses on:

Enhancing the product’s core experience.
Using AI to innovate our UI/UX.
Pursuing category‑specific strategies to better serve distinct buyer and seller needs.

Mercari US at Mercari GEARS 2025

Mercari GEARS 2025, our Group’s technical conference, recently took place in Japan. The event showcased our technology, organization, and culture — including the evolution of our AI-Native initiatives.

At the event, Gleb Bahmutov from the US team delivered a presentation titled “Running 1000 End-to-End Web Tests Daily.” He shared how Mercari US scaled its end-to-end testing pipeline to efficiently run thousands of E2E tests per day by parallelizing workloads, tagging and selectively running tests, and even using AI to help determine which tests to execute.

This approach not only accelerated our release cycle but also demonstrated how AI and automation can drive smarter, faster, and more reliable development workflows: an embodiment of our AI-Native mindset in action.

https://medium.com/media/9e06889db92efeba189801ec686a008c/href

Our Commitment to the AI-Native Future

The shift to AI-Native means moving towards an exciting future where AI becomes deeply integrated into our daily work.

The challenge ahead isn’t just about using AI; it’s about shifting how we think and work. We need to stay curious, take on new challenges, and not be limited by what we’ve already built.

We will continue to share outcomes and methods as we scale these practices.

Becoming AI‑Native at Mercari: Group Strategy and a US Case Study was originally published in Making Mercari on Medium.

From Transformers to Titans: A Look at the MIRAS Framework

Takuma Yamaguchi (Kumon) — Tue, 09 Dec 2025 09:28:27 GMT

Notes on long-term memory, test-time learning, and the future of LLMs

It has been quite a while since I last wrote anything here. I wanted to share something that caught my attention: Google recently released new work on Titans and a related framework called MIRAS. These ideas introduce different ways of thinking about long-term memory and test-time learning in language models. This article provides a simple overview of the concepts behind both Titans and MIRAS, and how they relate to the broader discussion about handling very large contexts.

The Transformer’s Wall: Quadratic Complexity

The Transformer architecture built most of the modern AI boom. GPT models, Llama, Claude, and many others rely on it. At the same time the Transformer has a clear limitation. Its computation and memory usage grow quadratically with the length of the input.

Why this happens

Attention compares every new token with all previous tokens.
If the sequence becomes twice as long, computation becomes four times heavier. If the sequence grows ten times longer, cost becomes one hundred times higher.
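The scaling can be checked with a few lines of arithmetic. This is only a counting sketch: attention_pairs is a hypothetical helper that counts query-key score entries for full self-attention and ignores constants such as head count and hidden size. Causal masking roughly halves the count, but the growth stays quadratic.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries in full self-attention."""
    return seq_len * seq_len


base_len = 1_000  # arbitrary reference length, chosen for illustration
for factor in (1, 2, 10):
    ratio = attention_pairs(base_len * factor) // attention_pairs(base_len)
    print(f"{factor}x longer sequence -> {ratio}x more attention scores")
```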

The KV cache grows as well, though linearly rather than quadratically with context length. When the context window goes beyond one million tokens, GPU memory usage still becomes very large and eventually unsustainable.

This problem encouraged researchers to search for alternatives. For more details, see “Attention Is All You Need” (Vaswani et al., 2017).

Linear-Time Models: Hawk, Griffin, and Mamba

To move past the quadratic wall, the field explored models whose computation grows only in a linear way. These models compress past information into a fixed state instead of keeping everything around.

Examples include:

Mamba (Gu and Dao, 2023):
A linear state-space model that selectively keeps important information.
Hawk & Griffin (De et al., 2024):
Two related models introduced in the same paper. Hawk provides a stable recurrent architecture, and Griffin extends it by combining recurrence with local attention.

These models are fast and efficient. They can handle long sequences better than Transformers. However, they have a different limitation. Compressing everything into a fixed-size state eventually causes information loss. No matter how good the compression is, a single fixed-size state is not enough for very long contexts.

Titans: When Memory Becomes Learning

Titans takes a different approach. Instead of storing memory as data, Titans treats memory as a form of learning during inference.

The key idea appears in “Titans: Learning to Memorize at Test Time” (Behrouz et al., 2025). Titans introduces a module called Neural Memory. This is a small neural network that updates its weights while the model is reading input.

How this works

During inference Titans performs the following loop:

It reads a segment of input.
It measures how surprising that segment is.
It updates Neural Memory with gradient descent.

Only the Neural Memory is updated, while the rest of the model weights stay the same. Because of this design Titans can internalize information without increasing memory usage. This makes extremely long contexts more manageable.
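The loop above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the Titans implementation: it assumes a single linear memory matrix instead of a deep network, a squared-error associative loss between a key and a value, plain gradient descent without the momentum term used in the paper, and an optional decay factor standing in for forgetting. The function name memory_step and all hyperparameters are hypothetical.

```python
import numpy as np


def memory_step(M, k, v, lr=0.1, decay=0.0):
    """One test-time update of a linear associative memory.

    Loss: ||M @ k - v||^2. The gradient norm plays the role of "surprise".
    """
    err = M @ k - v                    # how badly the memory recalls v from k
    grad = 2.0 * np.outer(err, k)      # gradient of the loss with respect to M
    surprise = float(np.linalg.norm(grad))
    M_new = (1.0 - decay) * M - lr * grad  # optional decay, then a gradient step
    return M_new, surprise


rng = np.random.default_rng(0)
d = 8
M = np.zeros((d, d))           # empty memory
k = rng.normal(size=d)
k /= np.linalg.norm(k)         # unit-norm key keeps the step size well behaved
v = rng.normal(size=d)

surprises = []
for _ in range(20):            # the same association is seen repeatedly
    M, s = memory_step(M, k, v)
    surprises.append(s)

print(f"first surprise: {surprises[0]:.3f}, last surprise: {surprises[-1]:.3f}")
print(f"recall error: {np.linalg.norm(M @ k - v):.4f}")
```

The first time the pair is seen the gradient is large, so the update is large. As the memory absorbs the association, the surprise shrinks and the memory changes less, while its size stays fixed at d x d no matter how many inputs are read.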

MIRAS: A Unified Memory Framework

After learning about Titans, I started to wonder why updating memory at test time makes sense in the first place. The answer appears in the MIRAS framework published in “It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization” (Behrouz et al., 2025).

MIRAS offers a unified way to understand memory in Transformers, RNNs, SSMs, and Titans. It connects several ideas such as associative memory, test-time updates, retention, and optimization.

The MIRAS perspective

MIRAS suggests that memory is better viewed as an optimization process.
In this perspective:

memorization is parameter adjustment
forgetting is regularization
recall is applying the adjusted parameters
surprise is the size of the gradient

With this view Titans becomes a natural extension of previous models. Instead of storing past information as data, Titans stores it by performing small optimization steps.

Why Titans and MIRAS Matter

If I try to explain this in everyday language:

The Transformer keeps every book on the desk to remember things.
Linear models like Mamba or Griffin take short notes in a notebook.
Titans actually learns the material while reading.

It is not only about efficiency. It is about giving AI the ability to adapt within a session. Titans can keep long-term memory without increasing context length. MIRAS explains why this behavior is consistent with broader ideas in machine learning.

This combination feels like a shift from static models to models that learn continuously.

Closing Notes

Titans and the MIRAS framework present a set of ideas for handling long-term memory and test-time learning in language models. Titans focuses on updating only its memory module during inference, and MIRAS offers a unified way to think about memory, retention, and optimization across different architectures.

These approaches do not define the future of model design, but they suggest possible directions for systems that work with increasingly large contexts. It is interesting to consider how ideas like neural memory and test-time updates might influence future developments in long-context modeling.

Instant NeRF on Google Compute Engine via Chrome Remote Desktop

Takuma Yamaguchi (Kumon) — Sun, 14 Aug 2022 11:47:15 GMT

Rendering a 3D NERF Toy Gun with Neural Radiance Fields (NeRF) on a Google Cloud VM

NERF Toy Gun Generated with Instant-NeRF

Introduction

Neural radiance field (NeRF) synthesizes novel views of complex scenes using a simple fully connected neural network based on a collection of 2D images.

The paper, Representing Scenes as Neural Radiance Fields for View Synthesis, was presented in ECCV 2020 and won best paper honorable mention. Their project web page is https://www.matthewtancik.com/nerf.

NeRF shows impressive view synthesis, but it’s slow, like 1 to 2 days to train for every single scene and tens of seconds to synthesize one frame on a single NVIDIA V100 GPU. So some studies have been conducted to reduce the computation times.

A SIGGRAPH 2022 paper, Instant Neural Graphics Primitives with a Multiresolution Hash Encoding, has reduced the time for training and frame rendering significantly, like a few seconds for training and a few milliseconds for frame rendering. The dramatic improvement caught a lot of attention. Their project web page is https://nvlabs.github.io/instant-ngp/ and their GUI tool is also available in https://github.com/NVlabs/instant-ngp.

I got interested in using the instant-ngp/instant-nerf as it’s fast, but I didn’t have a development environment with GUI and GPUs on my local machine. So I built such an environment on Google Cloud/GCP.

It can run on Google Colab and an example notebook is available in the repository, but using the GUI tool is fun and allows us to understand the behavior easily.

Build a GUI environment on Google Cloud

Create a VM instance

The first step in creating a VM is to select a machine type. Building some packages requires a fair amount of RAM, so a machine type with 4 vCPUs and 26 GB of memory is used. A GPU is also needed, so the cheapest one, NVIDIA T4, is selected. Even with a T4, you don’t have to wait long to train scenes.

Machine Type Selection

As for machine image, Debian 10 based Deep Learning VM for TensorFlow Enterprise 2.9 with CUDA 11.3 is used. The simpler image, Debian 10 based Deep Learning VM with CUDA 11.0, should work, but I got some errors while I was trying the instant-ngp.

Machine Image Selection

Using a preemptible VM, the hourly cost of the instance was $0.17 in us-central1.

VM Instance Cost

Setup a GUI environment

When you ssh to the instance, you will see the following message. Type y to install NVIDIA drivers automatically.

This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n] y

Install Chrome Remote Desktop: The next step is to install Chrome Remote Desktop on the instance. Here is the official document, https://cloud.google.com/architecture/chrome-desktop-remote-on-compute-engine

sudo apt update
sudo apt install --assume-yes wget tasksel

wget https://dl.google.com/linux/direct/chrome-remote-desktop_current_amd64.deb
sudo apt-get install --assume-yes ./chrome-remote-desktop_current_amd64.deb

sudo DEBIAN_FRONTEND=noninteractive apt install --assume-yes xfce4 desktop-base dbus-x11 xscreensaver

sudo bash -c 'echo "exec /etc/X11/Xsession /usr/bin/xfce4-session" > /etc/chrome-remote-desktop-session'

sudo systemctl disable lightdm.service

Go to the remote desktop site, https://remotedesktop.google.com/headless, from your local machine. Then, move to: Set up another computer > Begin > Next > Authorize.

Copy the command for Debian Linux.

DISPLAY= /opt/google/chrome-remote-desktop/start-host --code="xxxxxxxxxx" --redirect-url="https://remotedesktop.google.com/_/oauthredirect" --name=$(hostname)

Paste the command onto the VM instance and enter your PIN.

On the remote access page, you will see your VM. Click the link and enter your PIN.

Setup Instant-NGP

Install Dependent Packages

sudo apt install -y \
    build-essential libatlas-base-dev libboost-filesystem-dev \
    libboost-graph-dev libboost-program-options-dev \
    libboost-system-dev libboost-test-dev libcgal-dev \
    libeigen3-dev libfreeimage-dev libgflags-dev libglew-dev \
    libglfw3-dev libgoogle-glog-dev libmetis-dev libomp-dev \
    libopenexr-dev libqt5opengl5-dev libsuitesparse-dev \
    libxcursor-dev libxi-dev libxinerama-dev qtbase5-dev

Upgrade cmake

sudo apt remove --purge cmake
pip install cmake
hash -r
cmake --version
cmake version 3.24.0

Install Vulkan

Here is the official document, https://vulkan.lunarg.com/doc/sdk/1.3.216.0/linux/getting_started.html.

cd ~
mkdir vulkan
cd vulkan

wget https://sdk.lunarg.com/sdk/download/latest/linux/vulkan-sdk.tar.gz
tar xf vulkan-sdk.tar.gz
source ./1.*/setup-env.sh  # the SDK extracts into a directory named after its version (1.x.y.z)

Copy files to system directories

sudo cp -r $VULKAN_SDK/include/vulkan/ /usr/local/include/
sudo cp -P $VULKAN_SDK/lib/libvulkan.so* /usr/local/lib/
sudo cp $VULKAN_SDK/lib/libVkLayer_*.so /usr/local/lib/
sudo mkdir -p /usr/local/share/vulkan/explicit_layer.d
sudo cp $VULKAN_SDK/etc/vulkan/explicit_layer.d/VkLayer_*.json /usr/local/share/vulkan/explicit_layer.d

sudo ldconfig # You can ignore some warnings for now

Build Instant-NGP

cd ~
git clone --recursive https://github.com/nvlabs/instant-ngp
cd instant-ngp

cmake . -B build
cmake --build build --config RelWithDebInfo -j

Test Instant-NGP with the Fox Images

On the remote desktop, you can run instant-ngp on the fox images. You can see high resolution outputs by setting the target FPS to 2.0.

cd ~/instant-ngp
./build/testbed --scene data/nerf/fox

https://medium.com/media/1c98ca3e23e20b080342219cbdcf123c/href

It works, but our goal is to render a 3D NERF toy gun from our own images. NeRF requires the camera poses of the input images. For the fox images, camera poses are included in the transforms.json file in data/nerf/fox. The next section describes how to estimate camera poses.

Setup Instant-NGP for Any Images

COLMAP is a widely used general-purpose Structure-from-Motion (SfM) tool. We can predict camera poses with this tool.

Install Ceres Solver

COLMAP depends on Ceres Solver.

cd ~
git clone --depth 1 -b 2.1.0 https://github.com/ceres-solver/ceres-solver.git

cd ceres-solver
mkdir build
cd build

cmake .. -DBUILD_TESTING=OFF -DBUILD_EXAMPLES=OFF
make -j
sudo make install

Install COLMAP

cd ~
git clone --depth 1 -b 3.7 https://github.com/colmap/colmap

cd colmap
mkdir build
cd build

cmake ..
make -j3  # updated to -j3 from -j as 26GB RAM is not enough
sudo make install

pip install opencv-python

Test Instant-NGP with the Fox Images From Scratch

First, move or remove the original transforms.json

cd ~/instant-ngp/data/nerf/fox

# Move or remove transforms.json
mkdir backup
mv transforms.json backup/

# Output directory
mkdir colmap_text

Launch COLMAP via the remote desktop.

colmap gui

COLMAP GUI

Create a new project through the menu, File > New project

Extract feature points of the images with Processing > Feature extraction > Extract
Match feature points with Processing > Feature matching > Run
Estimate camera poses with Reconstruction > Start reconstruction
Save files with File > Export model as text. Select the colmap_text directory created earlier
Terminate COLMAP

https://medium.com/media/6af408db5c732b4861169d8e8fc94bc8/href

Generate transforms.json by running the following script

cd ~/instant-ngp/data/nerf/fox
python ~/instant-ngp/scripts/colmap2nerf.py --colmap_matcher exhaustive --aabb_scale 4
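Before launching the testbed, it can help to check that the generated file contains a pose for each image, since COLMAP may fail to register some photos. The snippet below is a small sketch that assumes transforms.json has a top-level "frames" list whose entries carry "file_path" and a 4x4 "transform_matrix", plus an "aabb_scale" value. The helper names summarize_transforms and load_and_summarize are hypothetical and not part of instant-ngp.

```python
import json


def summarize_transforms(transforms: dict) -> dict:
    """Count frames and flag entries whose pose is not a 4x4 matrix."""
    frames = transforms.get("frames", [])
    bad = [
        frame.get("file_path")
        for frame in frames
        if len(frame.get("transform_matrix", [])) != 4
        or any(len(row) != 4 for row in frame["transform_matrix"])
    ]
    return {
        "n_frames": len(frames),
        "aabb_scale": transforms.get("aabb_scale"),
        "bad_frames": bad,
    }


def load_and_summarize(path: str) -> dict:
    """Read a transforms.json file from disk and summarize it."""
    with open(path) as fh:
        return summarize_transforms(json.load(fh))


# Synthetic example so the sketch runs without a real dataset.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
example = {
    "aabb_scale": 4,
    "frames": [
        {"file_path": "images/0001.jpg", "transform_matrix": identity},
        {"file_path": "images/0002.jpg", "transform_matrix": [[1, 0, 0, 0]]},
    ],
}
summary = summarize_transforms(example)
print(summary)
```

On a real dataset, calling load_and_summarize("transforms.json") and comparing n_frames with the number of input images shows how many photos COLMAP managed to place.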

Run instant-ngp the same as before

cd ~/instant-ngp
./build/testbed --scene data/nerf/fox

Instant-NeRF with the Fox Images from Scratch

Instant-NGP for Any Images

Conference Room

Some datasets for NeRF are available from the NeRF project page. https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1?usp=sharing. Let’s use nerf_llff_data/room/images. The data consists of 41 images.

Conference Room Images

Estimate camera poses with COLMAP. Save the COLMAP outputs in ~/instant-ngp/data/room/colmap_text.

Camera Pose Estimation for the Conference Room Images

Generate transforms.json

cd ~/instant-ngp/data/room
python ~/instant-ngp/scripts/colmap2nerf.py --colmap_matcher exhaustive --aabb_scale 2

Run instant-ngp

cd ~/instant-ngp
./build/testbed --scene data/room

https://medium.com/media/bfa369dccaf15987c16d78fde1d1ecc6/href

As we can see, impressively, the light reflection on the display and ambient occlusion are rendered very well.

NERF Toy Gun

The next target is a NERF toy gun. I borrowed it from my son and took 26 photos using my cellphone.

NERF Toy Gun Images Taken with a Cellphone

COLMAP result

Final result

https://medium.com/media/1e149040886075d53f0bd0f129c50e7c/href

The output is not perfect, but it’s still amazing as it’s generated based on only 26 images.

Conclusion

NeRF is an impressive technology to generate 3D scenes from a collection of 2D images. Instant-NGP / Instant-NeRF enables very fast model training and rendering novel views. It’s better to have a GUI development environment with a GPU to try it. Setting up a remote desktop environment allows cloud service users to enjoy NeRF easily.

Similarity Search: ScaNN and 4-bit PQ

Takuma Yamaguchi (Kumon) — Fri, 17 Sep 2021 10:59:59 GMT

ScaNN is a vector similarity search algorithm. This blog post introduces the relationship between ScaNN and 4-bit PQ.

Introduction

Around 1 year ago (2020), Google published a very impressive blog post and paper.

Scalable Nearest Neighbors (ScaNN) is a vector similarity search algorithm and it is used in Vertex Matching Engine (GCP), which is a managed similarity search service. The vector similarity search field has been studied for many years, so usually the latest state-of-the-art algorithm is slightly better than the previous one.

However, in this case, ScaNN achieved significantly different results: its QPS is almost double that of the next-best algorithm.

https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html

Maximum Inner Product Search (MIPS)

The similarity search is to find the most similar vector to a given query vector. There are some ways to calculate similarities, like inner product, cosine similarity and Euclidean distance.

Intuitively, vector a or b looks closest to the query vector q in the figure above, but under inner product maximization vector c is the most similar one. ScaNN was developed for MIPS.
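A tiny numeric example shows how the choice of similarity changes the answer. The vectors below are made up for illustration (they are not the ones in the figure): a and b lie close to q, while c is far away but has a large norm.

```python
import numpy as np

q = np.array([1.0, 1.0])
db = {
    "a": np.array([1.0, 0.9]),   # close to q
    "b": np.array([0.7, 1.0]),   # also close to q
    "c": np.array([4.0, 0.5]),   # far from q, but with a large norm
}


def inner(query, x):
    return float(query @ x)


def cosine(query, x):
    return float(query @ x / (np.linalg.norm(query) * np.linalg.norm(x)))


def neg_euclidean(query, x):
    # Negated so that "larger is better" holds for all three scores.
    return -float(np.linalg.norm(query - x))


def nearest(query, vectors, score):
    """Name of the database vector with the highest score."""
    return max(vectors, key=lambda name: score(query, vectors[name]))


best_inner = nearest(q, db, inner)
best_cosine = nearest(q, db, cosine)
best_euclidean = nearest(q, db, neg_euclidean)
print(best_inner, best_cosine, best_euclidean)
```

Because the inner product rewards vector length as well as direction, c wins under MIPS even though a is the nearest neighbor under both cosine similarity and Euclidean distance.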

Vector Quantization in ScaNN

There are 2 major findings in the ScaNN vector quantization.

https://arxiv.org/pdf/1908.10396.pdf

Score Aware Loss: Not all pairs of q and x are equally important. For x, it is more important to accurately quantize the inner product of <q1, x> than <q2, x> or <q3, x>.

https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html

Anisotropic Loss: Quantization error can be decomposed into a component parallel to the datapoint and a component orthogonal to it. The parallel component is penalized more heavily than the orthogonal component.
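The decomposition can be written down directly. In this sketch the residual x - x_quantized is split into the part along the datapoint x and the remainder, and the two parts get different weights. The weights 4.0 and 1.0 are hypothetical values picked for illustration; in the paper they are derived from the score-aware loss rather than set by hand.

```python
import numpy as np


def decompose_residual(x, x_quantized):
    """Split the quantization residual into parts parallel and orthogonal to x."""
    r = x - x_quantized
    unit = x / np.linalg.norm(x)
    r_parallel = (r @ unit) * unit
    r_orthogonal = r - r_parallel
    return r_parallel, r_orthogonal


def anisotropic_loss(x, x_quantized, w_parallel=4.0, w_orthogonal=1.0):
    """Weighted squared error; the weights here are illustrative assumptions."""
    r_par, r_orth = decompose_residual(x, x_quantized)
    return float(w_parallel * (r_par @ r_par) + w_orthogonal * (r_orth @ r_orth))


x = np.array([3.0, 4.0])
# Two candidate quantizations with the same Euclidean error (0.5):
along_x = 0.9 * x                              # error points along x
across_x = x + 0.5 * np.array([-0.8, 0.6])     # error is orthogonal to x

loss_along = anisotropic_loss(x, along_x)
loss_across = anisotropic_loss(x, across_x)
print(loss_along, loss_across)
```

Both candidates are equally far from x in the Euclidean sense, yet the one whose error lies along x receives the larger loss, so a codebook trained this way prefers errors that are orthogonal to the datapoint.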

PQ with Score Aware Loss and Anisotropic Loss

Standard vector quantization is not very practical for high dimensional or large scale databases. PQ (product quantization) is a widely used scalable vector quantization method.

https://speakerdeck.com/matsui_528/cvpr20-tutorial-billion-scale-approximate-nearest-neighbor-search?slide=79

In standard vector quantization, a single codebook is generated from the original vectors. PQ, on the other hand, divides vectors into multiple subspaces and generates a codebook for each subspace. This approach makes it possible to handle high dimensional vectors and large scale databases.
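The following NumPy sketch illustrates the idea of per-subspace codebooks and how a query can then be scored through a small lookup table. To keep it short, the codebooks are built by sampling data points rather than by running k-means, which is an assumption made only for this example. All function names are hypothetical and unrelated to the ScaNN or Faiss APIs.

```python
import numpy as np


def train_codebooks(data, n_subspaces, n_codes, rng):
    """One codebook per subspace. Assumption: sample codewords, no k-means."""
    sub_dim = data.shape[1] // n_subspaces
    books = []
    for m in range(n_subspaces):
        sub = data[:, m * sub_dim:(m + 1) * sub_dim]
        picked = rng.choice(len(sub), size=n_codes, replace=False)
        books.append(sub[picked])
    return np.stack(books)  # shape: (n_subspaces, n_codes, sub_dim)


def encode(data, codebooks):
    """Replace each subvector by the index of its nearest codeword."""
    n_subspaces, _, sub_dim = codebooks.shape
    codes = np.empty((len(data), n_subspaces), dtype=np.uint8)
    for m in range(n_subspaces):
        sub = data[:, m * sub_dim:(m + 1) * sub_dim]
        dist2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = dist2.argmin(axis=1)
    return codes


def table_scores(query, codes, codebooks):
    """Approximate inner products via a per-subspace lookup table."""
    n_subspaces, _, sub_dim = codebooks.shape
    # table[m, k] = <m-th query subvector, k-th codeword of subspace m>
    table = np.einsum("mkd,md->mk", codebooks, query.reshape(n_subspaces, sub_dim))
    return table[np.arange(n_subspaces), codes].sum(axis=1)


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))
query = rng.normal(size=16)

codebooks = train_codebooks(data, n_subspaces=4, n_codes=16, rng=rng)
codes = encode(data, codebooks)
scores = table_scores(query, codes, codebooks)
print(codes.shape, scores.shape)
```

Each 16-dimensional vector is stored as 4 small integers, and scoring a database vector only needs 4 table lookups and additions instead of a full 16-dimensional dot product.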

ScaNN also uses PQ, so we can say that ScaNN is PQ with score aware loss and anisotropic loss.

Why so fast?

I had a big question why ScaNN is very fast. Using score aware loss and anisotropic loss doesn’t reduce computational complexity. The reason was not in the blog post or the paper. The paper simply said “SIMD based ADC”. SIMD is commonly used in the similarity search field and ADC is asymmetric distance computation, which is described in the PQ paper.

https://github.com/google-research/google-research/blob/master/scann/scann/hashes/internal/lut16_avx512.inc

The answer was in the code. SIMD in-register lookup tables are implemented.

SIMD in-register lookup tables

Using lookup tables for distance computation with PQ is not special, but since the tables do not fit into SIMD registers (128 to 512 bits), realizing in-register lookup tables is not straightforward. I found a paper, “André et al., Accelerated Nearest Neighbor Search with Quick ADC”, which addresses the issue.

In PQ, an 8-bit sub-quantizer is widely used and distances are represented as 32-bit floats, so the table size is 2⁸ * 32 = 8,192 bits. To minimize the table size, Quick ADC uses a 4-bit sub-quantizer and 8-bit integers for the distance representation. As a result, the table size becomes 2⁴ * 8 = 128 bits, which lets a lookup table fit in a SIMD register.
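The table sizes follow from simple arithmetic, reproduced below. The helper names lut_bits and fits_in_register are hypothetical, and the 128-bit default is the smallest register width mentioned above.

```python
def lut_bits(code_bits: int, value_bits: int) -> int:
    """Size of one lookup table: one entry per possible code."""
    return (2 ** code_bits) * value_bits


def fits_in_register(table_bits: int, register_bits: int = 128) -> bool:
    return table_bits <= register_bits


standard_pq = lut_bits(code_bits=8, value_bits=32)  # 8-bit codes, float32 entries
four_bit_pq = lut_bits(code_bits=4, value_bits=8)   # 4-bit codes, 8-bit entries

print(standard_pq, fits_in_register(standard_pq))
print(four_bit_pq, fits_in_register(four_bit_pq))
```

With 8-bit codes and 32-bit floats, the table is larger than even a 512-bit register, which is why both the code width and the value width have to shrink.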

https://arxiv.org/pdf/1704.07355.pdf

Register access is much faster than main memory access, by more than 10 times, and 16 lookups are performed in 1 cycle. That’s why Quick ADC is quick. Since the sub-quantizer is 4 bits, it’s also called 4-bit PQ.

Benchmarks

Some benchmark results for 4-bit PQ and ScaNN are available in the Faiss wiki. Faiss is a similarity search library, which is developed and maintained by Facebook Research.

https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors#results-on-sift1m

The reorder means that instead of querying the top 10 results in one shot, the top 100 vectors are retrieved and reordered with more accurate distance computation. Judging from these results, 4-bit PQ looks better than ScaNN.

https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors#results-on-glove

On the GloVe dataset, ScaNN performs better than 4-bit PQ. The dataset is used in the ScaNN paper, so the anisotropic loss quantization works well for this dataset. Interestingly, with the reordering, the performance is almost the same.

Conclusion

ScaNN is a vector quantization algorithm for maximum inner product search. The algorithm is a combination of product quantization, score aware loss and anisotropic loss. To accelerate the search speed, ScaNN is implemented with SIMD in-register lookup tables. The performance difference between ScaNN and 4-bit PQ is limited, but a similarity search engine as a managed service is very beneficial for many use cases.

CVPR 2020 Tutorial “Image Retrieval in the Wild”

Takuma Yamaguchi (Kumon) — Tue, 23 Jun 2020 00:16:10 GMT

The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) is one of the world’s top conferences in computer vision. We organized a half day tutorial “Image Retrieval in the Wild” at CVPR 2020.

Introduction

Content-based image retrieval is one of the most essential techniques for interacting with visual collections. Significant progress has been made in the last decade thanks to advances in deep learning and similarity search. Although commercial applications of these technologies are increasing, there has not been enough discussion about how to build a practical, large-scale visual search system.

The organizers of this tutorial

This tutorial covered several important components of building an image retrieval system for real-world applications. The organizers were Yusuke Matsui (The University of Tokyo), Zheng Wang (National Institute of Informatics) and Takuma Yamaguchi (Mercari, Inc).

All the presentation slides and videos are available at our project site: https://matsui528.github.io/cvpr2020_tutorial_retrieval/

Sessions

Billion-scale Approximate Nearest Neighbor Search

https://medium.com/media/27d5ef99656642892fffbcf3ddb3f2bb/href

Yusuke Matsui introduced state-of-the-art algorithms for approximate nearest neighbor search. Many algorithms and libraries have been proposed in the field, and the design of the search algorithm is critical for application performance, so choosing among them is time-consuming. To make this easier, he provided a practical guide for selecting the best algorithm and similarity search library for a given task, based on database size and vector dimensionality.

A Large-scale Visual Search System in the C2C Marketplace App Mercari

https://medium.com/media/63683dd06426b86f5d09bcdc6b3e71ac/href

Takuma Yamaguchi presented an example of how such algorithms are used in an online C2C marketplace app with over one billion listings and over 16 million monthly active users. He showed how to productionize a highly scalable and available visual search system on Kubernetes for the app. In addition, because general deep-learning-based feature extraction did not work well due to an issue specific to C2C marketplaces, he introduced a technique to handle it.

Beyond Intra-modality Discrepancy: A Survey of Heterogeneous Person Re-identification

https://medium.com/media/1ace4c328e97782f0983e095f9c85297/href

Zheng Wang presented a systematic review of heterogeneous person re-identification, where inter-modality discrepancy is the main challenge. The survey covered four cross-modality application scenarios: low-resolution (LR), infrared (IR), sketch, and text. It also included the latest work presented at CVPR 2020. In addition, he introduced the available datasets in each category and compared and summarized the representative approaches.

Live-coding Demo to Implement an Image Search Engine from Scratch

https://medium.com/media/7d52d6c762012ea96487c297b634043c/href

Yusuke Matsui gave a live-coding demo, implementing an image search engine from scratch within 30 minutes without copying and pasting code. Only 100 lines of Python code were needed to build an image search web API on top of a pre-trained deep learning model. It will be very useful for those building their own image search system for the first time.

Conclusion

CVPR 2020 was a virtual conference. All of us ran the tutorial from Japan, where it started at 12:30am. The timing was a minor matter; we had bigger concerns before the tutorial, such as network or machine trouble, microphone quality, and the number of participants.

Fortunately, the tutorial finished successfully. We had many participants, lively discussions, and no network or machine trouble. We were also very happy that some participants were satisfied with the contents.

Thank you to all the participants and the CVPR 2020 organizers.

CVPR 2020 Tutorial “Image Retrieval in the Wild” was originally published in Making Mercari on Medium, where people are continuing the conversation by highlighting and responding to this story.