Prosus AI Tech Blog - Medium

Counterpart Modeling Bias: Agents Haggle Differently When Told They’re Facing a Human vs. an AI

Gaurav Maheshwari — Wed, 01 Apr 2026 15:00:46 GMT

We believe the future of commerce will be shaped by AI agents negotiating on behalf of buyers and sellers. Not just recommending products or comparing prices, but actively haggling, making concessions, and closing deals. When that happens, a question emerges that sounds philosophical but turns out to be economic: Does it matter if the agent knows the other side is also an AI?

We ran 2,000 LLM-vs-LLM negotiations to find out. One line in the system prompt shifts the price by up to $220 on the same item. And the direction of the effect flips depending on how the rest of the prompt is written.

The Bicycle

A 3-year-old Trek road bicycle. True market value: $618.51. Listed at $1,113.32. We had the same model (GPT-5.4-mini) negotiate for this bike five times, once for each condition in our experiment. Same item, same starting price, same round limit. The only thing that changed: a single line in each agent’s system prompt telling it whether the other side was AI, human, or saying nothing at all.

The spread is $220. When both sides believe they’re talking to AI, the buyer pays less than the bike is actually worth. When the buyer thinks it’s talking to a human and the seller thinks it’s facing an AI, the price lands 35% above true value. The numbers are striking on their own, but the transcripts show what’s really going on. Look at how the buyer opens in each condition.

When told it’s facing an AI, the buyer opens at $350. When told it’s facing a human, $650. Same model, same bike, same instructions. That $300 gap in the first offer carries through every round that follows. The language tracks the behavior too: “keep the deal moving quickly” when facing a perceived AI, “I may be able to come up a bit” when facing a perceived human. Below is the negotiation curve for the bicycle example.

Experiment

The bicycle was one item out of 200 in a controlled experiment. Two LLM agents negotiate a price for a consumer item, one buying and one selling, both running on GPT-5.4-mini (OpenAI, March 2026) with no personality traits or prior context. Each agent receives a role, the item description, and pricing parameters. The only variable across conditions is a single line in each agent’s system prompt describing the counterpart:

AI frame: IMPORTANT: You are negotiating with an AI agent. There is no human on the other side.
HUMAN frame: IMPORTANT: You are negotiating with a human seller/buyer.
Neutral: No belief line at all.

Crossing buyer belief with seller belief gives us four conditions, and adding the neutral baseline makes five. We ran all five across 200 items each, producing 1,000 negotiations in total.

Previous work has shown that framing shifts LLM strategic behavior as much as payoff structure itself (Lorè and Heydari, 2024), that prompt-embedded tactics can shift negotiation payoffs by 20% (Bianchi et al., 2024), and that 75% of advanced models differentiate sharply between perceived human and AI opponents (Kim, 2025). But all of these studies use one-shot or single-frame designs. We were curious if the effect holds across multiple rounds of actual negotiation, or whether it survives a change in prompt architecture. That is what the two economies below are designed to answer:

In the simple economy, both sides negotiate freely with no floor, cap, or hard constraints. This is the clean version, designed to isolate pure framing effects with nothing else in the way. The bicycle negotiation came from this setup.
In the anchor economy, both sides have hard limits and explicit instructions to fight for every dollar. This version is closer to how a real procurement agent might be configured, with boundaries and optimization goals baked into the prompt.

Results

We ran both economies across 200 items per condition, 1,000 negotiations each. The simple economy tells a clean story: every negotiation reached agreement, 1,000 out of 1,000.

The chart above shows both economies side by side. In the simple economy, deals close between 1.05x and 1.12x true value depending on the framing condition, with the strongest individual comparison (Buyer:AI/Seller:HUMAN vs Buyer:HUMAN/Seller:HUMAN) reaching p=0.008. Both buyer and seller beliefs pull on the final price, with the widest gap appearing when both sides share the same frame.

So far, a straightforward story. Now look at the right column.

The Reversal

In the simple economy, telling the seller “the buyer is human” raised the average price by about $10. The seller held firmer against a perceived human. In the anchor economy, the same line lowers the price by $47. Same model, same one-line manipulation, opposite direction. Both comparisons are statistically significant at the 5% level.

What changed is the prompt surrounding the belief line. In the simple economy, the seller has no floor and no explicit goal. “Your buyer is human” gets interpreted as confidence: humans will pay more, I can hold my price. In the anchor economy, the seller has a hard minimum and explicit instructions to fight for every dollar above it. Now “your buyer is human” gets interpreted as an invitation to cooperate: this is a person, I should work toward a reasonable deal rather than grinding them into my limit. The same words, filtered through a different prompt context, produce opposite behavior.

Failed Negotiations: The anchor economy also introduces something the simple economy didn’t: deals that fall apart. Agreement rates range from 80% to 88.5%, which means the hardest negotiations in each condition are dropped from the price comparison. The simple economy, with 100% agreement, is the methodologically cleaner test. The anchor economy numbers should be read with this selection effect in mind.

Penny Pinching: The anchor economy also triples negotiation length, from ~5 rounds to ~13, and the longer negotiations surface an amusing limitation. Here’s the Trek bicycle again, this time under the Buyer:HUMAN/Seller:AI condition. By round 11, the two agents have converged to within a few cents of each other:

Round 12, Buyer [$888.10]: "My current offer is $888.10. That's only a small step up…"
Round 12, Seller [$888.20]: "$888.20 is my current price…"
Round 13, Buyer [$888.15]: "My current offer is $888.15…"
Round 13, Seller [$888.19]: "$888.19 is my current price…"
Round 14, Buyer [$888.17]: "My current offer is $888.17…"
Round 14, Seller [$888.18]: "$888.18 is my current price. If you can meet that, I can close this quickly."
Round 15, Buyer [$888.18]: "$888.18 works for me."

What This Means

Belief framing is a real lever, but it’s not portable. You cannot discover an effect in one prompt architecture and assume it transfers. Our data shows the seller belief effect reverses direction between two reasonable prompt designs. The interaction between the belief line and the surrounding prompt (constraints, goals) is the dominant force. If you’re building negotiation agents, test framing within your specific agent design. Borrowing findings from other architectures, including ours, is not safe.

This has implications beyond individual agent design. If AI agents increasingly negotiate on behalf of buyers and sellers in real estate, procurement, automotive, and freelancing, the question of counterpart disclosure becomes economically consequential. Telling an agent “your counterpart is AI” versus “your counterpart is human” materially shifts the final price. In our simple economy, the spread between the best buyer condition and the worst is roughly $60 per item, about 10% of true value. Whether to require AI disclosure in agent-mediated transactions is a policy question with dollar signs attached.

There’s also something more fundamental going on. LLMs carry what you might call counterpart modeling bias: they systematically change behavior based on perceived counterpart identity, even when the actual counterpart is identical in every way. The model isn’t probably responding to real differences in how the other side negotiates. It’s most likely responding to a label in its system prompt. Somewhere in the training data is a folk psychology of “how humans negotiate” and “how AIs negotiate,” and that folk psychology produces measurably different outcomes. This extends beyond negotiation to anywhere an LLM interacts with an entity it models as human versus machine.

Limitations And Next Steps

This is a single model (GPT-5.4-mini). Existing work suggests the effect will look different on other models. Lorè and Heydari (2024) found that the weight of contextual framing versus game structure varies dramatically across families, and Kim (2025) found that counterpart-identity differentiation is itself emergent, present in advanced models but absent in older ones. Our effect will almost certainly shift in magnitude or direction on other architectures. That is a reason to replicate, not a reason to discount.

The negotiation task is deliberately simple: one item, one price, two parties. The obvious next steps are agents that can browse product listings and ground their positions in external data, multi-issue bargaining where the framing effect might distribute across trade-offs, and combining belief framing with the systematic prompt optimization. We isolated one variable cleanly. The next step is to make the rest of the system more complex.

Counterpart Modeling Bias: Agents Haggle Differently When Told They’re Facing a Human vs. an AI was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Give a Conversational Layer to a B2B Travel API — Multi-Agent Architecture on A2A

G. Balonga — Tue, 24 Mar 2026 15:03:23 GMT

TL;DR — I built a multi-agent system on top of a B2B travel API using the A2A protocol. Each component — Router, Decomposer, specialists, Synthesizer — is a separate LLM call. The architecture lets you offer three integration levels to partners with different technical maturity, without rewriting anything.

I work at Despegar, the leading OTA in Latin America and part of the Prosus ecosystem. I’ve been working on a vision: an AI suite that handles the full travel experience — search, booking, after-sales — through conversation, for the hundreds of B2B partners that connect to our API.

My first attempt was an MCP flight server I built for an internal AI Olympics in December 2025. It worked, but it showed me how much I was missing — a single agent can only go so far.

In February 2026 I traveled to Amsterdam for an AI bootcamp organized by Prosus (a week of intensive sessions with AI teams from across the group and guests from companies like Google, Anthropic, Cursor…). I came back with everything I needed to build it properly.

The problem with a single agent

An agent with access to the flight API works for straightforward cases. But the real problem appears quickly: a user might ask about flights, cancel an existing booking, and request itinerary information — all in the same message.

A generalist agent handling all of that has two options: become bloated and hard to maintain, or get lost in the details. Neither is good.

The obvious solution — though not so obvious to implement — is specialization. Different agents for different domains. But then another problem appears: who coordinates?

The architecture I ended up building

After several iterations, the system settled into five well-defined components:

User
  ↓
Orchestrator
  ├── Router      → decides which agents the message needs
  ├── Decomposer  → splits the task if multiple agents are needed
  ├── Specialists → flights-agent, aftersales-agent, hotels-agent
  └── Synthesizer → unifies everything into a single voice

Each of these — Router, Decomposer, specialists, Synthesizer — is a separate LLM call with its own prompt and responsibility. Each piece has a single responsibility and knows nothing about the others. The Router doesn’t know how specialists respond. The specialists don’t know the Synthesizer exists. The Synthesizer doesn’t know which tools the specialists used.

That separation wasn’t an elegant initial design — it was the result of weeks of debugging.

The hardest part: memory across turns

The hardest problem wasn’t the architecture. It was that the flights agent “forgot” previous search results on each turn and called the API again even though it already had the data in context.

The solution was having the agent mutate the messages array in-place, appending tool messages to the shared history that persists across turns. Simple in retrospect. It took me longer than I’d like to admit to find it.

That kind of problem doesn’t appear in tutorials. It shows up when your system runs real multi-turn conversations with external tools.

The A2A protocol: first-mover advantage

Google published the A2A — Agent-to-Agent — protocol in mid 2025. In February 2026, at the Amsterdam bootcamp, someone from Google walked us through it in detail. The protocol had been out for months — but almost no one in LATAM was building on it yet. That was the window.

A2A defines a standard contract for agent communication: each agent receives a Task and returns an Artifact. The Task has an ID, session context, the message, and history. The Artifact has the result, a state (completed, input-required, failed), and metadata. Agents announce themselves to the orchestrator with an AgentCard describing their capabilities.

The logic is simple but powerful: if all agents speak the same language, they’re interchangeable. You can replace a specialist, add a new one, or connect an external agent without touching the orchestrator.

My implementation follows the standard with one deviation: AgentCards live centralized in the Router instead of being distributed across each agent. An optimization to avoid discovery latency in local development. When the system scales to independent microservices, each agent will serve its own card at /.well-known/agent.json and the difference will disappear.

The Router and Decomposer: two design decisions that matter

The Router receives each user message and decides which specialists are needed. A key decision: it doesn’t receive the full conversation history. It only gets the last message plus the last 3 conversation turns.

Why? Because the full history includes JSON responses from external APIs with thousands of tokens. Passing all of that to the Router just so it can decide “flights or aftersales?” is expensive and unnecessary. The Router doesn’t need to know the details of past responses — it just needs to understand the conversational context.

The Decomposer kicks in when the Router selects more than one agent. Its job is to split the original message into specific sub-tasks for each specialist, with only the information relevant to that domain.

An example from one trace:

This is called, context pruning. Each specialist receives only the context within its area of responsibility — no noise from the rest of the conversation. The result is threefold: fewer tokens per call, lower latency, and an agent that isn’t distracted by irrelevant information.

It’s a difference that seems small until you watch a generalist agent confuse a cancellation booking number with a seat number on a new flight. Then it becomes obvious.

Parallel execution

When there are multiple specialists, they run in parallel with asyncio.gather(). Total latency equals the slowest agent, not the sum of all of them.

results = await asyncio.gather(
    *[handler(task) for handler, task in zip(active_agents, tasks)],
    return_exceptions=True
)

One line with a huge impact on user experience when two or three specialists are involved. And it scales — when agents move to independent microservices in production, the same pattern applies: asyncio.gather() parallelizes HTTP calls instead of local coroutines. The concept doesn't change.

Traceability: the system I built to stay sane

Debugging a multi-agent system without visibility is impossible. So I built a traceability panel where each conversation turn shows, next to the response:

What the Router decided and why
How the Decomposer split the task. (shown previously)
Which tools each specialist called and what the API returned

What the Synthesizer received and how it built the final response

The token cost of each agent exchange — and the total cost per turn

The cost per turn isn’t a minor detail either. A multi-agent turn involves more LLM calls than a simple one — Router, Decomposer, two specialists in parallel, Synthesizer. Without measuring, you don’t know how much more expensive it is or whether it’s worth it. With that data you can make concrete decisions: does it make sense to use a cheaper model for the Router? Does the Decomposer justify its cost when only one agent is involved?

A dynamically generated Mermaid sequence diagram

When the POC was done and I was moving on to another project, I read about Langfuse — a tool that does exactly this. I had built it without knowing it existed, for the same reason it exists: without traceability, iterating on an agent system is a leap of faith.

What I learned about iterating with AI

One thing worth mentioning: this project wasn’t “wrote the prompt, it worked”. It was weeks of iteration where code edits far outnumbered new generations. The value of working with AI tools on a project like this isn’t in generating — it’s in debugging, understanding why something doesn’t work, asking the right question.

The judgment of what to change and why is irreplaceable. AI accelerates the execution of that judgment.

What’s next

The system is designed to grow in two dimensions:

Vertically — more tools per agent. The flights-agent today has search and pre-booking. It can add historical price comparison, alerts, recommendations.

Horizontally — more agents. hotels-agent, packages-agent, transfers-agent. Each one registers with its AgentCard and the orchestrator incorporates it without changes to the rest of the system.

The most interesting next step is automated evaluation: a pipeline that runs a synthetic conversation dataset against the system, measures the quality of each component separately, and detects regressions before they reach production. Essentially, CI/CD for agent behavior.

Why the layered architecture is a business strategy

So far I’ve talked about architecture. But there’s a business point worth making explicit.

The fact that the system is built on open standards and independent layers isn’t just a technical decision. It’s a go-to-market strategy.

Depending on each B2B partner’s technical maturity, you can offer three different integration levels — without rewriting anything:

Level 1 — MCP: the partner has their own AI team and just wants the tools. They connect to the flight MCP server and build their own experience on top. Minimum friction, maximum control for the partner.

Level 2 — A2A Agents: the partner has their own orchestration but wants ready-made specialized agents. You give them flights-agent or aftersales-agent as A2A-compatible agents. They plug into what they already have without touching their architecture.

Level 3 — Complete stack + fine-tuning: the partner wants the turnkey solution, personalized for their brand and customer base. You give them the full stack with a fine-tuned model for their particular context — their tone, their products, their specific use cases.

This covers everything from a fintech with a mature AI team to a regional travel agency with nothing. Three distinct value propositions, one single architecture.

And the key point: because each layer speaks an open standard (MCP, A2A), partners aren’t locked into you. They can migrate, combine with other providers, or evolve their integration over time. That builds trust — especially in the B2B enterprise segment where vendor lock-in is a frequent objection.

Why I’m writing this

Because I believe multi-agent systems for vertical industries like travel, banking, or others are a real opportunity, not a lab experiment. And because the path of “build it from scratch, understand every decision” teaches things no framework gives you.

If you’re building something similar, or if you have a B2B platform where your partners could benefit from a conversational layer on top of your API, I’d love to connect.

The project is under active development. The full architecture — Router, Decomposer, specialists, Synthesizer, traceability — is documented and extensible. Built in Python 3.12 with OpenAI, and the flight API of the OTA where I work.

How to Give a Conversational Layer to a B2B Travel API — Multi-Agent Architecture on A2A was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building with ASP: An Open Protocol for Agent-to-Service Transactions

Chiara Caratelli — Mon, 16 Mar 2026 14:36:56 GMT

Authors: Pranav Srivastava, Chiara Caratelli

AI agents are getting good at conversation, but they still can’t order you dinner. The moment an agent needs to discover a restaurant, browse a menu, place an order, track a delivery, and leave a review, it hits a wall: every marketplace has its own bespoke API, and there’s no shared contract for how a service transaction should work end to end.

The Agentic Services Protocol (ASP) fills that gap. It defines structured contracts for the entire service transaction lifecycle — discovery, catalogs, fulfillment, real-time tracking, and reviews — so that any agent can transact with any marketplace that speaks ASP.

ASP is designed as an extension to Google’s Universal Commerce Protocol (UCP). Where UCP covers the checkout and payment primitives for retail commerce, ASP layers on the capabilities that live services need: real-time order tracking, delivery streaming, item customization, and domain-specific fulfillment stages. Any valid UCP transaction remains valid with ASP on top.

Today we’re open-sourcing the protocol under Apache 2.0. This post walks through the architecture and then serves as a quickstart: you’ll run a complete food delivery transaction against our sample server in about ten minutes.

Architecture

Protocol discovery

A marketplace advertises its ASP support through a /.well-known/asp endpoint. This returns a manifest listing the specific capabilities the marketplace supports:

{
  "version": "2026-02-19",
  "capabilities": [
    { "name": "dev.asp.services.discovery",       "version": "2026-02-19" },
    { "name": "dev.asp.services.catalog",          "version": "2026-02-19" },
    { "name": "dev.asp.services.fulfillment",      "version": "2026-02-19" },
    { "name": "dev.asp.services.order_tracking",   "version": "2026-02-19" },
    { "name": "dev.asp.services.personalization",  "version": "2026-02-19" },
    { "name": "dev.asp.services.reviews",          "version": "2026-02-19" },
    { "name": "dev.asp.services.live_streaming",   "version": "2026-02-19" }
  ],
  "transport_bindings": ["rest", "mcp", "a2a"],
  "domain_profiles": ["food_delivery"]
}

By reading this manifest, an agent can dynamically determine how to interact with the marketplace without requiring hard-coded integrations.

Capabilities and extensions

ASP defines two types of building blocks:

Standalone capabilities are self-contained — they define their own schemas and don’t depend on an external protocol. Discovery, catalog, personalization, and reviews are standalone.
Extensions compose onto existing checkout and payment schemas using JSON Schema’s allOf. Fulfillment, order tracking, and live streaming are extensions — they add fields for live-service needs (fees, tips, time slots, granular statuses) without replacing the base checkout primitives.

This matters because it means ASP doesn’t reinvent checkout. It extends UCP’s existing schemas, so a marketplace that already speaks UCP can adopt ASP incrementally.

For example, ASP’s fulfillment extension layers onto the checkout schema:

{
  "checkout_extension": {
    "allOf": [
      { "$ref": "https://ucp.dev/schemas/shopping/types/line_item.json" },
      { "$ref": "./types/item_customization.json" }
    ]
  },
  "fulfillment": {
    "allOf": [
      { "$ref": "https://ucp.dev/schemas/shopping/types/fulfillment.json" },
      {
        "type": "object",
        "properties": {
          "fees": { "type": "array" },
          "tip": { "$ref": "./types/tip.json" },
          "available_time_slots": { "type": "array" }
        }
      }
    ]
  },
  "loyalty": { "$ref": "./types/loyalty_discount.json" }
}

The base checkout fields (line items, buyer, currency, payment) come from UCP. ASP adds fulfillment, per-item customization, and loyalty on top.

Transport bindings

The protocol is transport-agnostic. Marketplaces can implement the data contracts using the stack that best fits their infrastructure:

REST — Standard HTTP endpoints defined in an OpenAPI spec. POST /discovery/search, GET /catalog/{provider_id}/catalog, POST /checkouts, etc.
MCP — JSON-RPC tool definitions for LLM function calling. Each capability maps to a method.
A2A — Agent-to-agent task delegation.
WebSocket — Persistent connections for real-time tracking, defined in an AsyncAPI spec. The WebSocket channel streams location_update, status_changed, and heartbeat events.

Domain profiles

ASP’s base schemas are vertical-agnostic. Domain profiles layer vertical-specific fields onto them using the same allOf composition.

For example, the food delivery profile extends the base fulfillment status with granular kitchen stages:

{
  "food_fulfillment_status": {
    "allOf": [
      { "$ref": "../../services/types/fulfillment_status.json" },
      {
        "type": "object",
        "properties": {
          "granular_status": {
            "type": "string",
            "enum": [
              "accepted", "preparing", "ready_for_pickup",
              "assigning_driver", "driver_assigned", "driver_at_restaurant",
              "en_route", "driver_arriving", "delivered", "cancelled"
            ]
          }
        }
      }
    ]
  }
}

Each granular status maps to one of the five base statuses (accepted, in_progress, en_route, completed, cancelled), so agents that only understand the base schema still work correctly.

Quickstart: food delivery happy path

Let’s run through a complete transaction against the sample server. We’ll discover a restaurant, browse its menu, place an order with modifiers, track delivery over WebSocket, and submit a review.

Set up

Clone the samples repo and start the food delivery server:

git clone https://github.com/ProsusAI/asp-samples.git
cd asp-samples
pip install -e .
./run.sh food_delivery

The server starts at http://localhost:8000.

Step 1: Protocol discovery

curl -s http://localhost:8000/.well-known/asp | jq

This returns the capabilities manifest shown above — the agent now knows this marketplace supports discovery, catalog, fulfillment, order tracking, personalization, live streaming, and reviews.

Step 2: Search for restaurants

curl -s -X POST http://localhost:8000/discovery/search \
  -H "Content-Type: application/json" \
  -d '{
    "filters": { "category": "italian" },
    "page": 1,
    "page_size": 5
  }' | jq

Response:

{
  "providers": [
    {
      "id": "bella-napoli",
      "name": "Bella Napoli",
      "categories": ["italian", "pizza"],
      "rating": 4.7,
      "rating_count": 2340,
      "estimated_service_minutes": 35,
      "service_fee_cents": 299,
      "price_level": "moderate",
      "is_open_now": true,
      "image": {
        "url": "/static/img/bella-napoli.png",
        "alt_text": "Bella Napoli"
      }
    }
  ],
  "total": 1,
  "page": 1,
  "page_size": 5
}

The discovery response includes everything an agent needs to help the user pick a restaurant: ratings, price level, estimated delivery time, and whether it’s currently open.

Step 3: Browse the menu

curl -s http://localhost:8000/catalog/bella-napoli/catalog | jq

Response (trimmed):

{
  "provider_id": "bella-napoli",
  "sections": [
    {
      "id": "bn-pizzas",
      "title": "Pizzas",
      "items": [
        {
          "id": "bn-margherita",
          "name": "Margherita",
          "description": "Fresh mozzarella, tomato sauce, basil",
          "price_cents": 1299,
          "modifier_groups": [
            {
              "id": "bn-pizza-size",
              "title": "Size",
              "required": true,
              "max_selections": 1,
              "options": [
                { "id": "size-small", "label": "Small", "price_delta_cents": 0 },
                { "id": "size-medium", "label": "Medium", "price_delta_cents": 300 },
                { "id": "size-large", "label": "Large", "price_delta_cents": 500 }
              ]
            },
            {
              "id": "bn-pizza-extras",
              "title": "Extras",
              "required": false,
              "max_selections": 3,
              "options": [
                { "id": "extra-cheese", "label": "Extra Cheese", "price_delta_cents": 150 },
                { "id": "mushrooms", "label": "Mushrooms", "price_delta_cents": 100 }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Modifier groups let the agent handle item customization (size, extras, dietary preferences) in a structured way, rather than trying to parse free-text menus.

Step 4: Create checkout

curl -s -X POST http://localhost:8000/checkouts \
  -H "Content-Type: application/json" \
  -d '{
    "provider_id": "bella-napoli",
    "line_items": [
      {
        "item_id": "bn-margherita",
        "quantity": 1,
        "modifiers": [
          { "group_id": "bn-pizza-size", "option_id": "size-medium" },
          { "group_id": "bn-pizza-extras", "option_id": "extra-cheese" }
        ]
      }
    ],
    "delivery_address": "123 Main St, Amsterdam",
    "tip_percentage": 15.0
  }' | jq

Response (key fields):

{
  "id": "chk-abc123",
  "status": "pending",
  "line_items": [ ... ],
  "fulfillment": {
    "type": "delivery",
    "estimated_service_minutes": 30,
    "fees": [
      { "type": "delivery", "label": "Delivery fee", "amount_cents": 299 },
      { "type": "service", "label": "Service fee", "amount_cents": 199 }
    ],
    "tip": {
      "type": "percentage",
      "percentage": 15.0,
      "amount_cents": 292
    },
    "available_time_slots": [
      { "start": "2026-02-19T12:30:00Z", "end": "2026-02-19T12:45:00Z", "is_available": true },
      { "start": "2026-02-19T13:00:00Z", "end": "2026-02-19T13:15:00Z", "is_available": true }
    ]
  },
  "subtotal_cents": 1949,
  "discount_cents": 292,
  "total_cents": 2447
}

The fulfillment extension adds itemized fees, tip calculation, and available time slots — none of which exist in a standard checkout schema. The agent can update the checkout (change delivery address, adjust tip, select a time slot) via PATCH /checkouts/{id} before completing.

Step 5: Complete the order

curl -s -X POST http://localhost:8000/checkouts/chk-abc123/complete \
  -H "Content-Type: application/json" \
  -d '{
    "payment_method": "card_ending_4242"
  }' | jq

Response:

{
  "order_id": "ord-x7y8z9",
  "status": "accepted",
  "granular_status": "accepted",
  "estimated_service_minutes": 30,
  "tracking_url": "ws://localhost:8000/orders/ord-x7y8z9/tracking/stream"
}

The response includes a WebSocket URL for real-time tracking.

Step 6: Track delivery over WebSocket

This step uses Python to demonstrate both WebSocket streaming and the REST tracking endpoint running concurrently:

import asyncio, json, aiohttp, websockets

ORDER_ID = "ord-x7y8z9"
BASE_URL = "http://localhost:8000"
async def advance_tracking(session):
    """Advance tracking stage every 5 seconds."""
    await asyncio.sleep(2)
    for _ in range(13):
        async with session.post(
            f"{BASE_URL}/orders/{ORDER_ID}/tracking/_advance"
        ) as resp:
            data = await resp.json()
            print(f"  >> Advanced to: {data.get('status', '?')} / {data.get('granular_status', '?')}")
            await asyncio.sleep(5)
async def listen_ws():
    """Listen for WebSocket events."""
    async with websockets.connect(
        f"ws://localhost:8000/orders/{ORDER_ID}/tracking/stream"
    ) as ws:
        while True:
            event = json.loads(await ws.recv())
            print(f"{event['event_type']}: {event.get('data', {})}")
            if event["event_type"] == "stream_ended":
                break
async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            listen_ws(),
            advance_tracking(session),
        )
asyncio.run(main())

You’ll see a stream of events as the order progresses through kitchen stages:

status_changed: {'status': 'accepted', 'granular_status': 'accepted', ...}
status_changed: {'status': 'in_progress', 'granular_status': 'preparing', ...}
status_changed: {'status': 'in_progress', 'granular_status': 'ready_for_pickup', ...}
location_update: {'latitude': 52.3676, 'longitude': 4.9041, ...}
status_changed: {'status': 'en_route', 'granular_status': 'driver_at_restaurant', ...}
...
status_changed: {'status': 'completed', 'granular_status': 'delivered', ...}
stream_ended: {}

Step 7: Submit a review

curl -s -X POST http://localhost:8000/reviews \
  -H "Content-Type: application/json" \
  -d '{
    "order_id": "ord-x7y8z9",
    "provider_id": "bella-napoli",
    "rating": 4.5,
    "comment": "Great pizza, fast delivery!",
    "scores": { "food": 5.0, "delivery": 4.0, "value": 4.5 }
  }' | jq

{
  "id": "rev-abc123de",
  "order_id": "ord-x7y8z9",
  "provider_id": "bella-napoli",
  "rating": 4.5,
  "comment": "Great pizza, fast delivery!",
  "scores": { "food": 5.0, "delivery": 4.0, "value": 4.5 },
  "created_at": "2026-02-19T12:50:00Z"
}

Reviews support category-specific scores so agents can give structured feedback beyond a single rating.

The full transaction flow

Discovery, catalog, and reviews are ASP standalone capabilities. Checkout uses ASP’s fulfillment extension on top of the base checkout schema. Tracking and streaming are ASP extensions for real-time fulfillment.

Get involved

ASP is open source under the Apache 2.0 license.

Documentation — Full specification, schema authoring guides, and implementation checklist
GitHub — Source schemas, OpenAPI/AsyncAPI specs, MCP tool definitions, and type generation
Samples — Working demos for food delivery, ride-hailing, and travel

The protocol is extensible by design. Adding a new domain profile (groceries, pharmacy, home services) means defining a new allOf schema that layers onto the base types. Adding a new capability means defining a schema and transport binding. See the schema authoring guide for details.

Contributions, feedback, and issues are welcome on GitHub.

Building with ASP: An Open Protocol for Agent-to-Service Transactions was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why Multi-Tenant AI Is Harder Than You Think

Pavel Bugneac — Mon, 16 Mar 2026 14:28:00 GMT

And how we built a governance layer for OpenClaw.

Contributors: Pavel Bugneac & Chiara Caratelli

Put an AI assistant (like OpenClaw) in front of a team and the problems start immediately. Someone burns through the API budget in a day. Anyone who finds the webhook URL can talk to your model. There’s no audit trail, no access control, no way to throttle a single user. The AI itself works fine. Everything around it is the problem.

At Prosus, we hit exactly this with OpenClaw. We wanted our team to use it without each person needing their own setup, and without one runaway conversation eating the monthly budget. But we didn’t want to fork OpenClaw or rebuild the agent framework. We just wanted to keep the runtime focused on what it does well, and handle governance separately.

So we built ClawHive, a thin router layer on top of OpenClaw, and we’re open sourcing it.

The problems nobody warns you about

1. Identity

A message arrives from Slack. Another from Telegram. Same person? Different person? Which AI agent should handle each one? What if someone signs up on Telegram and later messages from Slack, do they get two separate agents with two separate memories, or should those be linked?

There’s no built-in answer. You need a canonical identity model that maps channel + channel_user_id to a single user, provisions an agent for them, and handles cross-channel linking without race conditions. When you link two channels (e.g. Slack and Telegram), both use the same agent, so conversation history and tool state are shared; identity and memory stay in sync.

2. Access control

The moment your webhook URL exists, anyone who finds it can talk to your AI. You need a way to gate access: pending users who haven’t been approved yet, invite codes for controlled onboarding, the ability to suspend someone who’s abusing the system. And you need it without building an entire IAM system from scratch.

3. Cost runaway

LLM calls cost real money. One enthusiastic user running complex multi-turn conversations with tool use can burn through your monthly budget in a day. You need per-user token and cost quotas and they need to be enforced before the message ever reaches the model, not after the bill arrives.

4. Prompt injection

Users will, intentionally or not, send messages that try to manipulate the AI: “ignore previous instructions,” “you are now a…”, delimiter injection with <|system|> tokens, base64-encoded payloads. If you’re running a multi-user system, you need to catch these before they reach the LLM.

We use multi-pattern regex as a first line of defense against known techniques; it’s not a complete solution, and we expect to add more layers.

5. Auditability

When something goes wrong (and it will) you need to know who did what, when, and what the system decided. Not by grepping scattered log files, but from a structured, queryable audit trail that records every security-relevant event: access decisions, injection blocks, forwarding outcomes, quota breaches, admin actions.

6. Execution isolation

An AI agent with tool use isn’t just generating text, it’s running code, reading files, browsing the web. In a multi-user system, one person’s agent execution can’t be allowed to see another person’s files, leak environment variables, or interfere with other sessions. You need per-user sandboxing.

The naive approach of spinning up a full copy of the AI runtime per user, solves isolation but wastes resources. A single OpenClaw instance can orchestrate dozens of agents simultaneously; the expensive part is tool execution, not orchestration. So instead of replicating the entire runtime, each agent gets a lightweight Docker sandbox container for code execution, file access, and browser use. The orchestration process stays shared; the execution environment is isolated per user.

This gives you process-level isolation where it matters most, untrusted code execution, without paying the cost of N full runtime instances. This covers accidental cross-tenant access and bug containment; it does not protect against determined attackers or container escapes. VM-level isolation is future work.

How ClawHive works

ClawHive is a thin Hono HTTP service (TypeScript, ~6.5K lines) that intercepts all inbound traffic and runs it through a strict seven-stage pipeline before anything reaches the AI runtime. Every stage either passes the message through or stops it, with an audit event either way.

Here’s what that pipeline looks like for a real request such as a user messaging the Telegram bot for the first time:

The webhook arrives and ClawHive verifies the signature against the bot secret (1). It’s the first time seeing this event ID, so deduplication passes (2). The message text is clean of injection patterns (3). ClawHive looks up telegram + user_id in the channel links table. No match, so it auto-provisions a new user with pending status (4). The access gate sees the invite code in the message, redeems it, and flips the user to approved (5). Quota check passes: fresh user, no usage yet (6). ClawHive returns 200 to Telegram immediately, ensures an OpenClaw agent exists for this user via WebSocket RPC, and forwards the webhook (7). OpenClaw processes the message, runs tools if needed, and replies directly to Telegram. ClawHive records token usage and writes an audit event.

From the user’s perspective, they sent a message and got a reply. ClawHive doesn’t touch LLM orchestration, tool execution, or reply formatting. OpenClaw remains the brain.

What surprised us

The hard part was everything around the model. OpenClaw already handles the AI well. Identity, policy, onboarding, and auditability took most of the effort.

Isolation isn’t all-or-nothing. We initially assumed we’d need either full container-per-user runtimes (expensive) or accept zero isolation (risky). The middle ground, shared orchestration with sandboxed execution, turned out to cover the actual threat model well. All agents share one OpenClaw process for LLM calls and routing; tool execution (code, file access, browser) runs in per-agent Docker containers with separate filesystems. The orchestration layer doesn’t handle untrusted input directly; the sandbox layer does. Isolate where the risk is.

This is process-level isolation, not VM-level. It covers curious or careless users and bug containment, not determined attackers or container escapes. Fully independent runtimes per user are future work.

Fast webhook acknowledgement matters more than you think. If you hold the webhook open until the full agent run finishes, platforms like Telegram may resend the same event. That means you stop debugging the product and start debugging retries. Our answer was a fast-ack pattern: ClawHive performs its governance checks up front, quickly acknowledges receipt to the platform, and then lets OpenClaw handle the slower agent work asynchronously in the background.

This also means governance is inbound-only. ClawHive controls what goes in, but for Slack and Telegram, OpenClaw replies directly to the channel and ClawHive doesn’t inspect outbound responses.

One final backstop: our quotas catch most cost overruns, but you should still set hard spend limits with your LLM provider.

The technical stack

We chose a thin stack on purpose:

Runtime & Storage

Hono on Node.js for a small, fast HTTP layer.
PostgreSQL for identity, quotas, and audit.
OpenClaw as the AI backend.

Admin

Server-rendered dashboard with zero build step
Tenant management, invite codes, quota controls, audit log viewer.

Admin dashboard

Quotas

Per tenant: limits per minute, per day, and per month.
All defaults, all overridable, all enforced before the model call.

Observability

Structured audit logging plus an in-memory metrics API and admin dashboard.

Deployment & Testing

Docker Compose for local dev. Terraform for AWS (ECS + RDS + EFS + API Gateway + Secrets Manager).
16 test files covering security, identity, quota, OpenClaw integration, and webhook flows.

Open-sourcing it

We’re releasing the full source under MIT. The repo includes everything: source code, tests, migrations, Docker Compose, Terraform deployment, admin dashboard, and operational documentation.

This pattern applies beyond OpenClaw. In a lot of AI systems, the runtime should stay focused on assistant behavior, while identity, policy, cost control, and audit live in a thin control layer around it.

What we’d love from you:

Try it. Break it. Tell us what’s missing.
If you’re running OpenClaw for a team, tell us how your setup differs. We’d love to learn from your workarounds.
If you’ve solved multi-tenant AI problems in other runtimes, we’d love to hear how.

You can find the code at https://github.com/ProsusAI/ClawHive

Keep the runtime focused on being an assistant; put identity, policy, cost control, and audit into a thin layer around it.

Why Multi-Tenant AI Is Harder Than You Think was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Murphy: Let Agents Break Your Website Before Your Users Do

Isha Agrawal — Wed, 11 Mar 2026 09:01:49 GMT

Github: https://github.com/ProsusAI/Murphy
Contributors: Isha Agrawal, Kevin Ngo, Ambra Mihu, Magdalena Konstanty

If you have built a product, you have tested it. Unit tests, integration tests, end-to-end tests — your CI is green, your coverage looks great, and everything works. Ship it.

Except, your shiny new agent creation wizard? A first-time user lands on the page, completely misses the “New” button in the sidebar, clicks something random, submits an empty form, gets zero feedback, and quietly leaves. Your tests never caught it because they were never asking the right question.

Traditional tests ask: “Does the button work when clicked?”
Nobody asks: “Will anyone find the button?”

That is the gap. Your code works. Your product might not, at least not for the confused or the impatient.

What is Murphy?

Murphy is an open-source evaluation agent that uses your product the way real people do, which is not always the way you intended.

It clicks buttons, fills forms, navigates pages, and tries to accomplish tasks. But it does this through the lens of different user personas. A patient expert who breezes through your happy path. A confused novice who does not read labels and submits empty forms. An impatient user who clicks three times before the page loads.

It does not just check if things work. It checks whether your product communicates. A disabled button with no tooltip? Failure. A form that silently swallows bad input? Failure. An error that only makes sense to your backend engineer? Also failure.

Murphy catches what functional tests were never designed to find.

Who is it for?

Anyone who builds something with a UI and wants honest feedback about it. Solo developers, product teams, QA engineers. If you have a URL, Murphy can tell you how real humans would experience it, without recruiting a single tester or asking a friend to “just try this real quick.”

How Murphy works

Murphy runs in three phases:

Phase 1: Explore.

An AI agent navigates your site like a curious new user, clicking through pages and discovering features. When you provide a goal, the exploration is focused on that area. Without a goal, it does a broad sweep and saves an editable features.md for you to review.

Phase 2: Plan.

An LLM turns the exploration into concrete test scenarios with steps, personas, and success criteria. Murphy saves an editable test_plan.yaml and pauses for you to review, add, remove, or tweak scenarios before anything runs.

Phase 3: Execute.

Tests run in parallel across isolated browser sessions via Browser Use (https://github.com/browser-use/browser-use), an open-source browser automation framework for AI agents. A separate AI judge evaluates pass/fail based on what actually happened (action traces, screenshots, URLs visited), not what the agent says happened. No self-grading allowed.

The human-in-the-loop pauses are deliberate. Murphy generates; you curate. Add scenarios, toss irrelevant ones, tweak success criteria, then let it loose.

Here is what the output looks like. Murphy produces a structured evaluation report with results at a glance, per-test verdicts across personas, an executive summary, and detailed breakdowns for every failure — including why it failed, what UI gaps were observed, and a suggested fix.

Capabilities

Intent-based tests: scenarios describe what to accomplish, not which button to click, so they survive redesigns
Feedback quality scoring: every test is rated on whether feedback was present, clear, and actionable
Failure triage: automatically distinguishes genuine website issues from test limitations (CAPTCHAs, third-party walls)
Auth support: opens a browser for you to log in manually, then picks up the authenticated session

Persona

Murphy does not pretend all users are the same. Each persona is defined as a static trait vector across five axes:

Each trait is set to low, medium, or high, and intent is one of benign, exploratory, or adversarial. These vectors are defined in a persona registry. For example, confused_novice is defined as technical_literacy=low, patience=medium, intent=benign, exploration=medium, reading_comprehension=low with test type ux. The adversarial persona flips to technical_literacy=high, patience=high, intent=adversarial with test type security.

Here is the important part: these traits do not just change how a test runs. They change what counts as passing.

For a security persona, silently sanitizing a malicious input is correct behavior. For a confused novice, that same silent handling is a failure, because the user got no feedback and has no idea what happened. The adversarial user passed because the WAF blocked malicious input cleanly. But the confused novice failed on the same site because a modal with no close button left them stranded with no guidance.

Same interaction pattern. Different persona. Different verdict. That is the point.

Seven personas ship out of the box:

happy_path — the ideal user who follows instructions and reads labels
confused_novice — low technical literacy, skips instructions, needs clear visual cues
adversarial — actively tries to break things with injection, overflow, and unexpected input
edge_case — enters unusual but legitimate data: unicode names, 500-character bios, negative quantities
explorer — ignores the intended flow, clicks everything, opens menus just to see what is there
impatient_user — clicks before the page loads, skips confirmations, abandons slow flows
angry_user — frustrated from the start, low tolerance for friction, quick to give up

Why this works

Three design choices that matter:

The agent does not grade itself. The one running the test is not the one judging it. A separate AI judge reviews action traces, screenshots, and URLs against the success criteria. No “student grading their own homework” situation.
The judge adapts to the persona. It does not apply one rubric to every test. For a low-patience user it asks: “Did the site respond immediately?” For a low-literacy user: “Was the error communicated visually, not buried in paragraph text?” Same page, same interaction, different standards, because different users experience it differently.
Tests describe intent, not elements. “Submit the form (via Submit button, Enter key, or any visible control)” instead of “click #submit-btn.” Tests do not shatter when you move a button. The agent figures it out, the way a user would.

Open source and available now

Murphy is fully open source. Get going in a few minutes:

git clone https://github.com/ProsusAI/Murphy.git
cd Murphy && uv sync
uv run playwright install chromium
uv run murphy — url https://your-site.com

Want to focus on a specific flow? — goal “test the checkout flow”.
Site needs login? — auth.
Want a dashboard? — ui.

What’s next

Evaluation framework: Murphy judges your product, but what judges Murphy? We’re building an evaluation harness that measures Murphy itself. This includes how often it correctly identifies real UI issues vs. false positives, whether its persona-driven verdicts align with actual user feedback, and how reliably it reproduces findings across runs
Richer metrics: beyond pass/fail, Murphy will report task completion rates per persona and priority level, steps-to-completion as an efficiency proxy, and feedback quality breakdowns (present, timely, clear, actionable).
Custom personas: define your own trait vectors for your specific user base
Broader model support: bring your own LLM provider
Deeper UX evaluation: not just “did the user get feedback” but “was the information hierarchy right,” “did the visual design guide attention where it needed to go,” and “does the flow match established UX heuristics.”

Murphy: Let Agents Break Your Website Before Your Users Do was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

MemEval: Benchmarking Memory for AI Agents

Asad Ismail — Mon, 09 Mar 2026 15:53:27 GMT

As AI agents move from single session chatbots to persistent, multi-session assistants, memory has become a critical capability. Comparing agentic memory systems is hard e.g. on LoCoMo dataset, judge accuracy ranges from 58% to 92%, not only because some systems are better, but also because each evaluation uses different LLMs, different embedding models, adifferent token budgets and different scoring methods. The numbers are not measuring the same thing.

MemEval fixes this by standardizing the entire pipeline: same LLM, same embedding and same scoring method across every system. We evaluated 9 memory systems across two benchmarks, LoCoMo [1] and LongMemEval [2]. This systematic evaluation also revealed where and why these systems fail, which led us to build:

PropMem a factual memory system designed around those failure modes.

This post covers both: the benchmark results, and the engineering decisions behind PropMem.

Benchmark Results

We selected 9 memory systems spanning the main architectural approaches to agent memory: chunk-and-search [9], fact extraction [4][11], knowledge graphs [5], full context, multi-round reflection [7], and fine-tuned models [8]. The goal was to cover different design philosophies so the benchmark reveals which approaches actually work better and more efficiently under the same conditions.

Different memory systems used in the benchmark

LoCoMo is the primary benchmark we used in our benchmark. It consists of 10 multi-party conversations between two human speakers, each spanning multiple sessions over an extended time period, simulating the kind of long-term memory a real agent would need to maintain.

Questions are split across 5 categories.

Factual: direct recall of stated information.
Temporal: awareness of when something happened relative to something else.
Multi-hop: connecting information across different sessions.
Inferential: reasoning beyond what was explicitly stated.
Adversarial: questions designed to exploit naive retrieval, rephrased queries, contradictory information, distractor facts.

Each category tests a different failure mode, making LoCoMo a diagnostic tool rather than a single-number test. Most published results reference it, which makes it the right anchor for comparison.

All systems run on gpt-4.1-mini, text-embedding-3-small, and the same scoring pipeline. Scoring is token F1 plus LLM-as-judge. We also track end-to-end LLM token usage across ingestion, retrieval, and answering.

We also ran a subset of systems on LongMemEval [2], a single-user conversation benchmark of up to 500 turns.

A few things to notice PropMem has the best overall quality-to-token performance. Full Context is placed third on LoCoMo but it has the worst token consumption and it does not scale and on LongMemEval with a very large context of 500-turn conversations it drops to last place (0.222 F1) at 37.5M tokens. Token cost also varies by 12x across different systems.

The rest of this post focuses on what we learned from the evaluation and how we used it to build PropMem.

PropMem: Fixing Each Failure Mode

PropMem is based on top of hybrid search, specifically OpenClaw’s [9] chunk-and-search. OpenClaw is our baseline RAG memory system included in the benchmark. PropMem is based on three main ideas on top of OpenClaw’s memory; each one directly addresses a failure mode from the error analysis.

Idea 1: Atomic Propositions Instead of Chunks

Instead of retrieving 400-token chunks, PropMem extracts atomic facts at ingestion time. Each proposition is a single fact about a single entity, with a date stamp: e.g Caroline gave a speech at a local school [May 7, 2024]

Each proposition is around 25 words versus 100 words per raw chunk. System top 30 retrieval slots now hold 30 distinct facts instead of 3 chunks with mixed content.

The extraction prompt enforces two important rules

Date resolution at extraction time. “Last week” becomes “the week of May 8, 2023”. By resolving relative dates at extraction time, temporal retrieval stays consistent.
One fact per entry. Without this constraint, the LLM combines related facts: “Caroline went to the support group and felt nervous but was glad she attended.” those are three facts. If the question asks when she went, the emotional detail wastes a retrieval slot.

Idea 2: Entity-Filtered Retrieval

At query time, PropMem identifies which entity the question is about and scopes all search to that entity only.

This single change was the largest accuracy improvement in the system. Raw chunks are kept as a fallback (3 chunks, entity-filtered when possible). Propositions carry the main retrieval signal; chunks provide broader conversational context when propositions alone are insufficient.

Idea 3: Question-Aware Answer Generation

Not all questions should be answered the same way. Some have a definite answer in the data: a date, a name, a value. Others require reasoning from what is known to reach a conclusion that isn’t stated anywhere explicitly.

For factual questions, the prompt is strict: only answer if the evidence clearly supports it, otherwise say nothing. A plausible sounding hallucination is worse than silence.

For inferential questions, the opposite applies. If someone asks whether a person would enjoy something based on their known interests, “I don’t have enough information” is a cop-out. The answer can be reasoned to, and the prompt reflects that.

Routing between these modes uses a lightweight classifier, not an extra LLM call.

The Details That Add Up

Beyond the three main ideas, several smaller decisions contribute measurably:

Deduplication. After ranking, propositions are deduplicated by normalized text (case-insensitive, punctuation-stripped). If the same fact appears from two sessions, only the highest-scored version takes a retrieval slot.

Knowledge updates. When two propositions about the same entity have cosine similarity above 0.85 but different dates, the older one gets a 30% score penalty. This is not deletion, the old fact still exists for “where did you live before?” questions. It is a soft recency signal that pushes newer facts higher without destroying history.

Clustering for fallback retrieval. When entity matching fails (single-user conversations, ambiguous questions), K-means clusters over proposition embeddings narrow the search space to the 5 most relevant topic clusters. Cluster count scales as sqrt(n), capped between 5 and 40. This only activates for large proposition sets (500+).

What We Chose Not to Build

Equally important to what PropMem does is what it does not do:

No knowledge graph. Full temporal knowledge graphs with entity nodes and relationship edges are appealing in theory, they capture relationships, not just facts. In practice, they are slow and expensive to build. The complexity does not pay off at the single user memory scope.

No multi-stage pipeline. Some memory systems have multi stage pipeline e.g SimpleMem [7] has semantic compression, online synthesis, intent-aware retrieval with 5+ LLM calls per question. PropMem makes 1 LLM call per question (plus 1 at ingestion per session).

The Cost Picture

Each PropMem answer prompt runs about 2,500 tokens: 30 propositions, 3 raw context chunks, and the instruction template. Total across the full benchmark: 5.9M tokens, 65% fewer than OpenClaw (16.4M). The proposition extraction at ingestion is the main cost. But it is a one-time cost per conversation, and it pays for itself many times over by making retrieval more precise and reducing the tokens needed per question.

Getting Started

MemEval is open source. You can reproduce results using:

uv run python scripts/run_full_benchmark.py --systems all --num-samples 10 --llm-model gpt-4.1-mini

Or use PropMem directly in your agent:

from agents_memory import PropMemMemory

memory = PropMemMemory(
    user_name="John",
    assistant_name="Assistant",
    llm_model="gpt-4.1-mini",
)

memory.add_session(
    [
        {"speaker": "John", "text": "I prefer quiet coffee shops for work."},
        {"speaker": "Assistant", "text": "Noted. You prefer quiet coffee shops."},
    ],
    session_date="2026-03-01 10:30:00",
)

answer = memory.ask("Where does John prefer to work?")

Adding your own memory system takes one adapter function. Adding a new benchmark dataset is similarly straightforward.

GitHub: github.com/ProsusAI/MemEval

References

[1] LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents, Maharana et al., 2024 https://arxiv.org/abs/2402.17753
[2] LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, Wu et al., 2024 https://arxiv.org/abs/2410.10813
[3] Memory in the Age of AI Agents: A Survey, Zhang et al., 2025 https://arxiv.org/abs/2512.13564
[4] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory , Chhikara et al., 2025 https://arxiv.org/abs/2504.19413
[5] Zep: A Temporal Knowledge Graph Architecture for Agent Memory, Rasmussen et al., 2025 https://arxiv.org/abs/2501.13956 Independent re-evaluation: https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/
[6] MemU Benchmark Reports 92% accuracy on LoCoMo using LLM-judge binary accuracy. https://github.com/MemTensor/MemOS
[7] SimpleMem: Efficient Lifelong Memory for LLM Agents, Liu et al., 2026 https://arxiv.org/abs/2601.02553 https://github.com/aiming-lab/SimpleMem
[8] Memory-R1 (https://arxiv.org/abs/2508.19828)
[9] OpenClaw, open-source AI agent framework https://openclaw.im
[10] LangMem, LangChain long-term memory SDK https://github.com/langchain-ai/langmem
[11] Memobase, user profile-based long-term memory https://github.com/memodb-io/memobase
[12] OpenAI Memory https://platform.openai.com/docs/guides/memory
[13] MemGPT: Towards LLMs as Operating Systems, Packer et al., 2023 https://arxiv.org/abs/2310.08560
[14] MemEval + PropMem https://github.com/ProsusAI/MemEval

MemEval: Benchmarking Memory for AI Agents was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Turning Your Model’s Errors into Golden Training Data

Ramonkaspar — Wed, 25 Feb 2026 13:53:47 GMT

What if you could ask your model “Where are you confused and why?” and then automatically generate training data to fix exactly that? That’s what we built at the Prosus AI Team, in collaboration with iFood.

TL;DR

The pipeline (1) finds where a fine-tuned model fails on a held-out set, (2) uses a reasoning LLM to analyze why each error happens, and (3) generates targeted synthetic training samples to fix those specific weaknesses. On our taxonomy classification task, this yielded gains up to +13.7% F1 on categories the model previously struggled with, improving 72% of targeted categories overall. The pipeline is generalizable and can be reused for any classification task.

At Prosus AI, we work with iFood — the leading food delivery platform in Latin America — on ML systems that process millions of product listings. One of those systems classifies food items into a hierarchical taxonomy (top-level category → subcategory → leaf category) based on minimal metadata: a product name, a short description, and sometimes an image.
We fine-tuned a vision language model (VLM) that generates a structured item profile from the listing metadata. Taxonomy classification is one component of that profile. It worked well, with solid accuracy across most categories. But it still had clear weak spots, specifically on items that sit right on the decision boundary between two similar categories. The errors were systematic: the model had learned strong priors from certain keywords, and those priors were overriding the actual context. It also wasn’t robust against marketing language. Promotional phrases like “2 for 1” in the product name would push the model toward categories like hamburgers, simply because most promotions in the training data happened to be hamburgers. Instead of throwing more generic data at the problem, we built a targeted approach: a small number of carefully generated synthetic samples improved specific categories by over 13% F1.

The Problem with “Just Add More Data”

In our case, the model learned to handle hundreds of taxonomy categories with good accuracy. But the remaining errors weren’t random. They were concentrated in specific areas where categories are semantically close. In the food domain, this happens constantly. Is a product a cookie or a chocolate? Is it a cut of beef or a prepared meat dish? Is it a juice or a smoothie?

Adding more generic training data doesn’t solve this. If the model has already seen 500 examples of cookies and 500 examples of chocolates, seeing 50 more of each won’t teach it the difference between a chocolate bar that contains cookie pieces and an actual cookie. The model needs to understand why one is a chocolate and the other is a cookie — and that requires examples specifically designed to teach that distinction.

This insight led us to build what we call the Active Learning Pipeline (ALP).

Why “Active Learning”?

In classical active learning, a model queries a human oracle to label the samples it would benefit from most. The core insight: not all samples contribute equally to learning, and the ones near the decision boundary are the most informative.
Our pipeline borrows this principle but diverges in two important ways. First, instead of selecting real unlabeled samples for annotation, we generate entirely new synthetic samples, so we’re not limited by what exists in our data pool. Second, instead of relying on model uncertainty scores to find hard cases, we use a reasoning LLM to perform an explicit error analysis on misclassified items. This tells us not only which samples are hard, but why they’re hard, and that reasoning directly informs how we generate the synthetic data. To ensure diversity, we define explicit generation strategies that each target a different aspect of the confusion.
In that sense, ALP sits somewhere between active learning, hard example mining (selecting the samples your model gets wrong and retraining on those), and targeted synthetic data generation. We kept the name because the core loop is the same: evaluate the model, find its weak spots, focus your data effort there, retrain.

At a high level, the pipeline has three steps:

Find where the model fails
Use a reasoning LLM to understand why it fails
Generate entirely new synthetic samples that target that specific failure mode

The pipeline itself is completely generic. We built it for taxonomy classification, but the same approach works for any classification task where a model has systematic blind spots. That’s why we’re writing about it.

The Oreo Problem: A Walkthrough

Before diving into the details of each step, let’s look at the full pipeline in action on a real example.

The item:

Name: Oreo
Description: Cocoa black dough with Laka Oreo filling.
An image is available showing the product

In reality, this is a chocolate bar with Oreo-flavored filling — not a cookie.

The model classified it as Biscuits/Cookies (incorrect), but the ground truth is Filled Chocolate.

The Oreo problem: easy cases on both sides have clear signals, but the item in the middle has conflicting features — biscuit ingredients in a chocolate bar format — causing the model to misclassify it. (Image generated with Nano Banana Pro)

Step 1 — Find the error: We ran the model on our held-out validation set. It predicted Biscuits/Cookies for this item. The ground truth says Filled Chocolate. This disagreement is flagged.

Step 2 — Analyze the confusion: We send the item to a reasoning LLM along with the full taxonomy context. Its analysis: “The word ‘Oreo’ is a very strong signal for cookies. The merchant category ‘Especiais’ doesn’t help disambiguate. The description’s ‘cocoa dough’ is ambiguous — it could mean cookie dough or chocolate base. The image clearly shows a chocolate bar, but without explicit textual terms like ‘bar,’ ‘tablet,’ or ‘chocolate shell,’ the model defaults to predicting a cookie. The key differentiator is the product format: a chocolate shell with filling (chocolate category) vs. a baked flour product (biscuit category).”

Step 3 — Generate targeted synthetic samples:

These are 4 of the 8 samples the pipeline generated for this item. We picked the ones that best illustrate the idea. The remaining strategies are explained in the detailed breakdown below.
After retraining with these (and similar samples for other confused items), the model’s F1 score on the “Chocolates” category improved by +10%.

How the Pipeline Works

Step 1: Negative Mining

We run the fine-tuned model on a held-out evaluation set — items the model has never seen during training. We compare its predictions against the ground truth labels and collect every disagreement.

Using a held-out set here is important. If we ran the pipeline on training data, we’d mostly find memorization failures and label noise, not real weaknesses. We need errors on items the model has never seen during training, because those reflect the actual generalization failures that will show up in production. Those are the errors worth targeting.

The held-out set contains 100k samples. With the model already at roughly 85% accuracy, that gives us around 15k misclassified items to feed into the pipeline, enough to cover a wide range of confusion patterns across the taxonomy. The set itself is stratified: 80% of samples follow the natural distribution, so common categories carry proportional weight and the validation set reflects real production performance. The remaining 20% are distributed equally across all taxonomy paths to ensure coverage of rare categories. Without this, the model could have blind spots on low-frequency nodes that we’d never detect during evaluation.

Step 2: Analyze the Confusion

This is the core of the pipeline, and where the design gets interesting.

For each error, we send the confused item to a reasoning LLM, a large, capable model (e.g., GPT-5, Claude Opus 4.5) that we use purely for analysis, not for production inference. In essence, we ask it: why did our model probably get this wrong?

The context we provide to the reasoner matters a lot:

The confused item (name, description, image)
The ground truth taxonomy path and its description
The model’s predicted taxonomy path and its description
The full taxonomy tree hierarchy, so the reasoner knows what other nodes exist. Without it, it might generate a synthetic sample for category X that would actually be a better fit for category Y. It also helps the “reasoner” to understand the decision space: It needs to know what all the options are and how they relate to each other to figure out why the model picked the wrong one.
Descriptions of neighboring nodes, to understand the semantic boundaries between similar categories
Domain-specific business rules (e.g., “a skewer served as an appetizer goes under Snacks, not Beef Cuts” or “cheesecake should be classified as Pie, not Cake”), which encode domain knowledge that isn’t obvious from the taxonomy structure alone. Without them, the reasoner might generate samples that look correct but violate domain-specific constraints.

This analysis is then passed to the next step, where it directly guides what kind of synthetic samples to generate. Without this reasoning, we’d just be generating random examples of the correct category — which is far less effective than examples that specifically address the confusion the model has.

Step 3: Generate Targeted Synthetic Data

Based on the analysis, the reasoning LLM generates 6–8 new synthetic training examples for each confused item. Two things in the prompt are important to get right here.

First, we include real marketplace items as reference examples in the prompt. Without these, the LLM generates items that sound artificial e.g. “Açaí to eat with a spoon”, which is technically correct but not how real products are listed. The reference examples are what makes the synthetic items sound natural. We tried getting there with prompting alone, describing the style, adding constraints, but it wasn’t enough. The real examples were the crucial part. Once the LLM could see what actual product listings look like, the synthetic items started looking like they could be real listings.

Second, we define specific generation strategies to ensure diversity and target known failure modes:

Minimal Correction. Take the original confused item and make the smallest possible change to make it unambiguous. For the Oreo chocolate bar: rename it to “Chocolate Black with Laka filling and Oreo pieces 90g” and describe it as “Chocolate tablet with creamy filling…”
Strong Positive (Ground Truth). A textbook, prototypical example of the correct category. No ambiguity, no tricks. “Milk Chocolate Filled with Salted Caramel 90g.” This anchors the model’s understanding of what the correct category actually looks like.
Strong Positive (Predicted). Same thing, but for the wrong category. A clear, obvious example of a cookie: “Vanilla Filled Biscuits 140g”. This teaches the model the boundary from both sides.
Counter-Signal. This is probably the most interesting one. Create an item that belongs to the correct category but deliberately includes misleading keywords from the wrong category. Example: “Cookies & Cream Filled Bar” — it has “cookies” right in the name, but it’s clearly a chocolate bar. Trains the model to look past surface-level keyword matching.
Minimal Text. Very short name and description, but still unambiguous. Makes the model more robust to the case where it has limited information only.
Marketing Language. Ground truth examples wrapped in promotional language (“Premium,” “Artisan,” “Gourmet”). Makes the model more robust against promotional terms (which occur a lot in real listings) that can mislead classification.
Image-Driven. A sample with deliberately minimal or ambiguous text, where the image is the primary signal for the correct category.

These are the core strategies. We use a few more depending on the case, but they follow the same principle.

Our core generation strategies applied to the “Oreo” confusion. Each targets the decision boundary between Filled Chocolate and Biscuits from a different angle. (Image generated with Nano Banana Pro)

Handling Images in Synthetic Data. If you’re working with a VLM, there’s an additional consideration: your synthetic samples also need images.
Real product listings often come with images, and the synthetic samples need to reflect that. If all synthetic items were text-only, we’d risk training the model to rely on text alone and ignore the visual signal. So for each synthetic item, the reasoning LLM decides whether to reuse the original confused item’s image by setting a boolean flag. Sometimes the decision is obvious: a strong positive of the wrong category can never reuse the original image, since it’s an entirely different product. But the image also gets excluded when the original doesn’t have one, when it’s low quality or ambiguous, or simply to add variety. Not every synthetic item should have an image, otherwise the model might overfit on the visual signal.

On top of that, we also have an image-driven strategy where the synthetic item has deliberately minimal or ambiguous text, and the image is the only signal that makes the correct category clear. This teaches the model to rely on visual input when the text isn’t enough. And since we don’t freeze the image tower during training, the visual encoder adapts to our domain. It learns which visual features are discriminative for taxonomy classification, like distinguishing a chocolate bar from a cookie by packaging and shape, rather than relying on generic pre-trained representations.

Step 4: Augment and Retrain

We add the synthetic samples to the original training data and retrain the model. One important detail: the synthetic samples supplement the real data, they don’t replace any of it. The original training set stays intact.

The pipeline is also naturally iterative. After retraining, you can run it again on the new model’s errors. Each iteration should produce smaller, more targeted corrections. We haven’t run multiple iterations yet, but in principle you could keep looping until the returns diminish.

Results

We ran an ablation study. Two models, same architecture (Qwen/Qwen3-VL-2B-Instruct), same hyperparameters:

Control: Trained on the standard training set (120k samples)
Experiment: Same training set + approximately 9,000 ALP-generated synthetic samples

We evaluated both on a held-out benchmark of roughly 16,000 items. We focused on taxonomy nodes that received at least 20 ALP samples; 5 nodes with fewer samples were excluded to avoid noise.

Overall: The ALP samples gave us a +1.87 percentage point improvement in average F1 across all evaluated taxonomy nodes. The improvement is consistent and meaningful.
Targeted impact: 72.4% of the nodes that received ALP samples showed improvement. The remaining 28% saw a slight degradation.
Asymmetric risk/reward: When ALP helped, it helped a lot. The best gain was +13.7% F1 on a single node. When it hurt, the damage was smaller — the worst degradation was -7.8% F1. This is a favorable risk profile.

ALP Helps Where It Matters Most

ALP is most effective where the model struggles most. Categories with a baseline F1 below 0.70 saw roughly 4x more improvement than categories already above 0.85, which is exactly what you’d want from a pipeline designed to target weak spots.

This inverse correlation is statistically significant (Spearman rho = -0.369, p = 0.003) — the weaker the baseline, the stronger the ALP effect.

What Didn’t Work

Not everything improved. The nodes with the most ALP samples actually showed degradation. The sweet spot seemed to be around 30-90 synthetic samples per node. Beyond that, performance starts to degrade, likely because too many synthetic samples shift the training distribution away from the real data.

This is an important lesson: more synthetic data per category is not better. The pipeline works best when it’s a targeted supplement, not a replacement for real data.

Design Decisions That Mattered

A few things we learned that might save you time if you try something similar:

Give the reasoner full context. Early versions of the pipeline gave the reasoning LLM only the confused item and the two taxonomy labels. The analysis was shallow and the generated samples were generic. Once we added the full taxonomy hierarchy and node descriptions, the quality jumped significantly. The reasoner needs to understand the full decision space to produce useful analysis.
Define explicit generation strategies. Without them, the LLM generates samples that all look similar — just variations of the same correct example. The strategies force diversity and ensure the synthetic data covers specific known failure modes like marketing language, sparse metadata, or misleading keywords.
The counter-signal strategy is among the most effective. Items that contain misleading keywords but belong to the correct category are exactly the kind of training signal the model is missing. They directly attack the keyword-based shortcuts the model has learned.
Use the reasoner to catch labeling errors. We explicitly instruct the reasoning LLM to flag cases where it believes the ground truth itself is wrong — and when it does, it skips generation entirely. This gives us an automated quality check on our evaluation labels. We filtered those cases out and reviewed them separately.

How to Get Started with Your Use Case

We built this for taxonomy classification in the food domain, but the pipeline is generalizable and can be reused for any classification task where:

You have a fine-tuned model that performs well overall, but has systematic edge-case failures
The classes have semantic overlap (i.e., the errors aren’t random but concentrated at specific decision boundaries)
You can provide enough context to a reasoning LLM for it to understand the classification space

Conclusion

Supervised fine-tuning gets you surprisingly far. But the last few percentage points of performance — the edge cases, the ambiguous items, the things that sit right on the decision boundary — those are hard to fix with more of the same data.

The Active Learning Pipeline provides a structured way to identify exactly where your model is failing, understand why, and generate the specific training signal needed to fix it. It’s not a silver bullet, it gave us roughly +2% F1 overall, but it’s effective at patching specific holes. A 13% improvement on a single problematic category can matter more in production than a 2% lift across the board.

A collaboration between Prosus AI and iFood. Authors: Ramon Kaspar, Alexandru Dumitru, Kinga Szarkowska, and Zülküf Genç. With contributions from Floris Fok and Madeline Duncan.

Turning Your Model’s Errors into Golden Training Data was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

BESH: An Open Source Batch Endpoint for Unlocking 4x Utilisation of Your Self Hosted GPU Cluster

Floris Fok — Wed, 22 Oct 2025 08:39:27 GMT

TLDR;

BESH is an open-source batch processing API that solved iFood’s challenge of generating millions of personalized push notifications daily. Built as an alternative to commercial batch endpoint for our own LLMs which are self-hosted on GPUs for inference, BESH delivers massive improvements:

🚀 26x cost reduction compared to commercial endpoints (eg OpenAI’s GPT-4o-mini batch) solutions

📈 Processes >120B tokens daily at production scale

⚡ Dramatically higher GPU utilization — transforms idle hardware into efficient processing powerhouses

🔧 Simple setup — seamlessly integrates with existing data pipelines (Databricks, etc.)

🛡️ Production-ready reliability — intelligent queue management, persistent storage, real-time monitoring

📊 Enterprise features — auto-scaling, load balancing, comprehensive analytics dashboard

💰 Self-hosted control — pay per hour instead of per token, with predictable costs

The solution turned non-viable personalization ideas into viable ones, enabling true hyper-personalization at scale at the right unit economics. You can find the code and setup instructions here: https://github.com/ProsusAI/BESH

Perfect for any organization looking to deploy LLMs efficiently without the complexity and costs of external APIs.

The Context: Operating at Large (!) Scale

iFood is Latin America’s leading food delivery platform, processing over 120 million orders per month across all states in Brazil. With millions of active users placing orders daily, generating billions of data points from restaurant interactions, delivery patterns, and user preferences and behavior, we operate one of the largest and most complex food delivery ecosystems in the world.

At this scale, every optimization matters. A 1% improvement in user engagement translates to millions of additional orders. A small increase in operational efficiency can save hundreds of thousands of dollars monthly. And most importantly, every user interaction represents an opportunity to create a more personalized, valuable experience.

This massive scale also means that traditional approaches to user communication — like broadcast messaging or simple segmentation — fall short of what’s possible and what users expect in 2025.

The Vision

At iFood, we started with a simple question: why do push notifications still feel generic? We wanted every message to feel like it was written for one person, at one moment, with the right tone and intent. With 120 million orders generating rich behavioral data every month, we had the foundation to make this vision possible.

Our ambitious goal was to leverage Large Language Models to create millions of personalized push notifications daily, using each user’s behavioral data, preferences, and real-time context to craft messages that felt genuinely relevant.

Early tests with premium LLMs (like GPT-4o) were exciting — content quality was great, engagement went up, and the experience felt genuinely personal. We thought we had found the solution.

The Expectations

We expected to build a system that could:

Process user behavior in real-time: Location data, app interactions, search history, order patterns, and temporal preferences
Generate natural, personalized content: Using LLMs to create conversational messages that didn’t feel robotic or generic
Scale seamlessly: Handle millions of users across different markets and cultural contexts
Operate cost-effectively: Maintain reasonable operational costs while delivering premium personalization

The Reality

What we discovered was that operating at iFood’s scale revealed fundamental limitations in existing LLM infrastructure that simply weren’t apparent at smaller volumes.

Cost Spirals Out of Control

When processing millions of personalized messages daily using GPT-4o, costs quickly became prohibitive. What seemed reasonable during testing became a cost prohibitive when scaled to our actual user base. We were competing with every other team inside the company using the same API, leading to unpredictable pricing and performance.

Reliability Issues

While working with external providers and models, reliability issues hit us hard during peak periods: during the GPT-5 launch, we experienced issues: almost 60% of our batch jobs were lost because they didn’t finish within the 24-hour processing window. This wasn’t just inconvenient — it was breaking our entire personalization pipeline and leaving millions of users with generic messaging experiences.

Infrastructure Bottlenecks

After switching to a self-hosted fine tuned model, our large-scale inference jobs on Databricks weren’t delivering the efficiency we expected. Despite having serious GPU compute power at our disposal, hardware sat mostly idle — throttled by inefficient request handling, timeout issues, and the constant overhead of managing millions of individual API calls. We were barely scratching the surface of what our infrastructure could actually deliver.

The struggle with live API calls made total sense in hindsight. We were keeping expensive Databricks clusters running continuously, just waiting for individual requests to finish. This approach was fundamentally inefficient for our data application needs. What we really needed was a way to process our daily personalization workload in batches using our own fine tuned models — sending all requests efficiently and retrieving results reliably, without having to manually supervise millions of individual API calls.

The Solution: Building BESH

The problems we faced at iFood weren’t unique to food delivery — they’re fundamental challenges that any organization faces when trying to deploy LLMs at scale. We realized we needed to build our own solution.

Solution? Use batch mode! This approach worked amazingly when we were using external APIs like Gemini and GPT models, but when we wanted to use our own fine-tuned models, we hit a wall. Oddly enough, at the time of starting the project, there was no open-source equivalent of OpenAI’s batch endpoint for self-hosted deployments.

That’s when we decided to build BESH (Batch Endpoint for Self-Hosted models) — an open-source equivalent of the OpenAI batch endpoint, specifically designed for self-hosted GPU clusters. It allows for higher utilisation, less complex set up and lower cost. You can find the code and setup instructions here: https://github.com/ProsusAI/BESH. It will simplify your data processing pipelines and make it easier to distribute your computer over multiple teams. BESH now processes 120B tokens daily and soon many more.

LLM inference throughput can be highly volatile depending on the size of the workload and the number of concurrent connections. Depending on whether you are optimizing for latency or throughput there are different approaches to handling concurrent requests. Additionally, having multiple applications sending requests adds another level of complexity.

Cost of self-hosting is simple; pay per hour. So your making it cheaper means we want to get the most out of this hour!

Enter BESH, the first open-source batch processing API designed specifically for self-hosted GPUs. Instead of wrestling with the complexity of real-time inference orchestration, BatchEndpoint lets you focus on what matters: processing your data efficiently and cost-effectively. By leveraging intelligent batching and optimized resource management, it transforms those underutilized GPUs into a powerhouse that can unlock a staggering 26x cost reduction compared to openai’s gpt-4.1-mini batch solutions.

What Makes BatchEndpoint Exceptional

BESH now processes 120B tokens daily. It turned non-viable ideas into viable ones and it is stimulating more new and bold ideas.

🚀 Intelligent Queue Management
⚡ Advanced Parallel Processing
🔄 Production-Ready Reliability
📊 Real-Time Analytics Dashboard
🎯 Enterprise-Scale Architecture
💾 Persistent Storage

So far we did three major imporvements to BESH, improving security, scalebility and stability. Awesome teamwork.

Architecture

In essence we have tried to replicate the batch endpoint of OpenAI as much as possible to make the switch to it seamless. It supports both single-GPU and multi-GPU (8-GPU) deployments, with a load balancer to distribute requests across the available GPUs. The deployment will spin up the following:

FastAPI: OpenAI-compatible batch API with comprehensive file upload support (including compressed formats: gzip, zip, bz2). This also included a simple dashboard and all the batches were loaded on a Redis Queue.
Postgres + Redis: We use Postgres for batch results accumulation and Redis for sending the jobs to the workers. Postgres powers the dashboard where we can track job failure rates, Token per 15 min and more.
Worker Node: A simple Redis Queue consumer that loads the jsonl files and send the requests efficiently using an httpx client. We use sephamore to manage the total connections. For each deployment we have between 4–20 of these workers so we are never idle.
Load Balancer: Including health checks, least connection load distribution, generous timeouts and enough worker connection to support it all.
vLLM GPU deployment: Using vLLM’s offical docker image.
Persistence Storage: Batches and databases are stored on a docker volume to ensure continuation and s3 syncing.

Overview of the services as mentioned in the docker-compose 8 gpu version.

Mermaid diagram of the services

Expansion of the system can be achieved by just making adjustment to the loadbalancer. Since we have a load balancer with health checks, it automatically supports auto scaling.

Dashboard

Here are some screenshot from the BESH dashboard. Token Usage Timeline to monitor the clusters health over time and get some big numbers out of the system like the 64B tokens in 24h! Our record stands at 120B, if you beat this please send us a screenshot ;).

Token Usage: 64B tokens in the last 24h, very consistent, very stable.

Quick overview of the current state of processing. Here we see 19K batches to process and 25K processed in 24h.

A 26x Cost Reduction

To illustrate the cost savings, let’s examine a real-world batch processing scenario with actual token usage data and something that now runs daily. For this example we achieved 81% utilisation, by being busy 20h out of 24h. The table below compares the cost of processing a large-scale batch job using gpt-4-mini versus BatchEndpoint.

Finetuning a model that can achieve the same task as gpt-4.1-mini is only one part of the solution.

For batch as of 2025–08–29 as shown on https://openai.com/api/pricing

Conclusion

BESH represents a significant leap forward in making large-scale LLM inference both accessible and affordable. But here’s the thing — it’s not just about the impressive 26x cost reduction (though your finance team will definitely appreciate that). It’s about building something that actually works the way you’d expect it to.

The combination of optimized batching, intelligent queue management, and multi-GPU scaling creates a solution that’s genuinely built for the real world. No more worrying about whether your data scientists are trained to utilise the cluster most effectively and timing certain workloads to not interfere.

What makes this particularly exciting is how seamlessly it integrates with platforms like Databricks. We’re not asking you to rebuild your entire data pipeline — we’re giving you a tool that slots right into what you’re already doing. Also since it’s made for bare metal, any cloud provider can use it.

As LLM adoption continues to accelerate, having a solution like BESH isn’t just nice to have — it’s becoming essential. Whether you’re processing massive datasets, building AI-powered applications, or just trying to make sense of your LLM costs, BESH gives you the performance and reliability you need without the enterprise-level price tag.

Shout out to the iFood and Prosus AI teams for the teamwork on this project.

co-authors

source: https://github.com/ProsusAI/BESH

BESH: An Open Source Batch Endpoint for Unlocking 4x Utilisation of Your Self Hosted GPU Cluster was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Hidden Cost of UUIDs in AI Prompts: A 95.6% Token Optimization Solution

Floris Fok — Wed, 20 Aug 2025 14:48:13 GMT

TLDR;

Universally Unique Identifiers (UUIDs) are silently consuming massive amounts of tokens in AI prompts, creating an invisible cost burden for organizations worldwide. The analysis reveals that standard UUIDs consume an average of 22.83 tokens each, while simple numeric alternatives require only 1 token — representing a remarkable 95.6% reduction in token usage.

This inefficiency compounds rapidly in enterprise environments where prompts may contain dozens of UUIDs, leading to exponentially increased processing costs. Since token consumption is linearly related to both processing time and monetary costs, this optimization directly translates to proportional savings in both computational resources and operational expenses.

The Problem: UUID Tokenization Inefficiency

UUIDs have become the standard for unique identification in modern distributed systems, but their impact on AI processing costs has been largely overlooked. A typical UUID like a90d0d7d-9c5a-44de-8d3c-5b0da661de7c appears to be a simple identifier, yet it consumes nearly 23 tokens of processing capacity.

The analysis of over 1000 randomly generated UUIDs across different formats revealed consistent inefficiency patterns:

Standard UUIDs: 22.83 tokens average
UUIDs without hyphens: 18.5 tokens average
Uppercase UUIDs: 22.83 tokens average
Numeric alternatives (001, 002, 003): 1 token each

The tokenization inefficiency stems from how language models process hexadecimal strings and hyphen separators, which do not align with natural language patterns that tokenizers are optimized to handle.

Example

In practical applications, this inefficiency creates substantial overhead. Consider a typical enterprise scenario analyzing user behavior with multiple identifiers:

Original prompt (295 tokens):

Analyze user interactions:
1. User f47ac10b-58cc-4372-a567–0e02b2c3d479 logged in at 2024–01–15 09:30:00
2. User f47ac10b-58cc-4372-a567–0e02b2c3d479 accessed document 6ba7b810–9dad-11d1–80b4–00c04fd430c8
3. User f47ac10b-58cc-4372-a567–0e02b2c3d479 shared document 6ba7b810–9dad-11d1–80b4–00c04fd430c8 with user 6ba7b811–9dad-11d1–80b4–00c04fd430c8

Optimized prompt (100 tokens):

Analyze user interactions:
1. User 001 logged in at 2024–01–15 09:30:00
2. User 001 accessed document 002
3. User 001 shared document 002 with user 003

This represents a 66.1% token reduction in a single prompt. Since processing time and costs scale linearly with token count, this optimization delivers proportional improvements in both speed performance and expenses.

The Solution: Numeric Identifier Mapping

The solution leverages a fundamental insight: UUIDs only need to maintain uniqueness within the context of a specific AI analysis, not globally. By temporarily replacing UUIDs with compact numeric representations during processing, we achieve optimal tokenization efficiency while maintaining complete system compatibility.

Implementation Workflow

The optimization process follows a simple, transparent workflow:

Original Prompt with UUIDs → Detect and extract all UUIDs
UUID Detection & Mapping → Create bidirectional mapping table (UUID ↔ Number)
Optimized Prompt → Replace UUIDs with sequential numbers (001, 002, 003…)
LLM Processing → AI processes the token-efficient prompt
Optimized Output → AI responds using numeric identifiers
UUID Restoration → Convert numbers back to original UUIDs
Original Output → Final response with UUIDs restored

This process is completely transparent to both the AI system and end users, requiring no changes to existing infrastructure while delivering immediate optimization benefits.

Key Implementation Benefits

Zero architectural changes required
Complete system compatibility maintained
Immediate deployment capability
Reversible and error-safe process

Token Savings Analysis

The testing across multiple realistic scenarios demonstrates consistent and substantial token reductions:

Table of token reduction per scenario

These results demonstrate that UUID optimization delivers consistent benefits across diverse application scenarios, with token reductions typically ranging from 59% to 74%. The consistency of these improvements indicates broad applicability and reliability across different types of AI analysis tasks.

Financial and Performance Impact

Since token consumption directly correlates with both processing time and monetary costs, the optimization benefits scale linearly across both dimensions. Organizations processing UUID-heavy prompts can expect proportional improvements in:

Processing Speed: Reduced token count leads to faster AI response times
Computational Costs: Lower token usage translates directly to reduced API expenses
System Capacity: More efficient prompts enable higher throughput within existing infrastructure

Here are some back-of-the-envelope calculations:

1,000 prompts daily with 5 UUIDs each: $800 annual savings
100,000 prompts daily with 12 UUIDs each: $200,000 annual savings

These projections assume ONLY input and GPT-4.1 pricing ($0.02 per 1,000 input tokens). We can easily double the saving if we expect many UUIDs in the output as well.

Github Code

Want to try this ASAP? Here is an example implementation of the method:
https://github.com/FlorisFok/PromptCleaner

Conclusion

UUID tokenization represents a significant but easily addressable inefficiency in AI prompt processing. The 95.6% token reduction achieved through numeric replacement delivers immediate benefits that scale linearly with usage volume, processing time, and operational costs.

Organizations can implement UUID optimization quickly and capture these benefits. The transparent, compatible nature of the solution makes it ideal for production deployment without architectural changes or operational disruptions.

As prompt engineering is becoming more important as all other aspects of the pipeline are maturing, techniques like these are essential and can make the difference with your competitors.

The evidence is clear: every UUID in your AI prompts is consuming 22 times more tokens than necessary. The question is not whether to implement optimization, but how quickly you can deploy it to start capturing these substantial benefits ;)

The Hidden Cost of UUIDs in AI Prompts: A 95.6% Token Optimization Solution was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Just Ask: Data Insights for Everyone

Ioannis Zempekakis — Tue, 10 Sep 2024 07:59:16 GMT

There’s No Such Thing As a Dumb Question: Data Insights in Natural Language.

Authors: Donne Stevenson, Floris Fok, Sean Kenny

Every day, people chase analysts for business insights and answers to their questions

“What’s the 12-month trend for active partners with at least eight monthly orders?”

“How do late-night food orders compare between different cities?”

“How many couriers have received a 5-star review in the last 30 days in the city of London, broken down per vehicle type?

Companies in sectors like food delivery, retail, and payments generate vast amounts of data. Being able to give employees accurate, contextualized, and prompt answers to questions about this data can boost their performance, drive value and build competitive advantage.

Ideally, employees have access to analysts who can turn business questions into data-driven insights. However, good analysts are often costly and scarce, and they need to focus on the most critical tasks. This makes it harder for many organizations to access insights and data for decision-making.

To understand how GenAI frameworks and tools can help address this problem, we teamed up with operational analysts at food delivery platforms iFood and Glovo, e-commerce marketplace OLX, and other companies in the Prosus portfolio.

Introducing the Toqan Data Analyst

The result of our ongoing work is the Toqan Data Analyst, an AI-powered chatbot integration for databases, available to Toqan users in the business messaging app, Slack. The tool makes data and insights accessible to anyone who needs them, when they need them, in natural language.

People can ask questions in a simple chat interface. They don’t need any technical skills, nor be deeply familiar with the databases, since the tool generates SQL queries from their text. Our text-to-SQL solution has hundreds of active users and answered thousands of questions during the roll-out.

In this blog, we dive into how we got to this solution. We explain how we had to rethink the text-to-SQL problem space, how that led to us to expanding our approach beyond existing text-to-SQL solutions, and how we overcame significant challenges along the way.

LLMs are only part of the solution

First, we looked at the use of large-language models (LLMs) for text-to-SQL. Existing solutions can translate questions into computer language, reason through uncertainty at scale, and clarify implicit assumptions.

However, benchmarking available LLMs against SQL datasets suggests this solution would be insufficient. Early frameworks for assessing the performance of models on text-to-SQL focused on a model’s ability to generate SQL queries that address user questions. Top results from WikiSQL’s leaderboards approached 93% accuracy on the test set for accurately generating SQL queries based on natural-language questions. But these early results came from various pre-transformer supervised learning models and focused on highly simplified use cases that lacked joins between tables in datasets, or other real-world complexities.

LLMs proved to be broadly capable of achieving similar levels of accuracy on text-to-SQL tasks. Benchmarks for LLM capabilities, such as Toqan’s ProLLM Leaderboard, peg the latest models at a reasonable acceptance level, with top performers around 85–90% acceptance of SQL responses based on Stack Overflow data.

On paper, AI models should be able to formulate SQL queries with 80%+ accuracy

However, most systems are not adequately able to handle real data and thus inevitably fall short of expectations. Challenges like identifying how to join tables, spotting columns despite poor metadata, and finding every way of spelling a name quickly made these benchmarks largely useless. Our out-of-the-box use of LLMs for text-to-SQL for use cases with more than two tables started with less than 25% accuracy.

There are two fundamental problems. First, even 95% accuracy is not enough for many business users querying large datasets of transactional, supply chain, or even financial data. Second, the ability to get that final slice of accuracy does not appear to be related to a model’s performance, so much as the context that the system has about the person asking the question and the business and data structure being examined.

Context is critical

We like to think that most databases are straightforward; they are relational databases with fixed structures. However, even with the latest SQL knowledge, newly hired data analysts will still struggle initially. This struggle points to a hidden truth: context is critical to translate business questions into effective data queries. In other words, additional context is needed to effectively use business data. People usually gather this context while talking to colleagues, attending meetings, or through trial and error.

When it comes to rule-based or early data analysis AI solutions, many real-world applications fall short not only because of missing context but also because of poor data quality, lack of documentation, user behavior, and more. The reality is that expecting all business users to follow the best practices, work with perfect data, with no implicit or missing information, will never result in a viable solution.

Building a text-to-SQL solution for the real world

Instead of only focusing on generating the right query, we needed to expand our requirements. Working with analyst teams with access to Toqan, we started to experiment with system solutions.

Things to account for include:

● Implicit assumptions and transparency

● Guiding users and clarifying requests

● Handling failed answers and knowing when not to answer; minimizing false positives.

If we consider requirements like these, the challenges with the previous approach become clear. Instead of focusing on writing queries and getting answers, we needed to build a system that has the context and even the ability to inquire about implicit context.

Text-to-SQL is not only an engineering problem; it’s also a communication problem.

From a technology perspective, the next advance we needed was flexible, sequential planning and tool use provided by agent frameworks. While building our AI assistant Toqan, we experienced the incredible reasoning capabilities of agents and were sure that it was a matter of time and effort before we could make text-to-SQL something real.

What makes agents perfect for data analysis

An agent can follow instructions in a chat-like manner, reason and use external services to aid its understanding. This makes it the best problem-solver to date. The traditional chatbot, which answers simple questions, has limited reasoning capability. Also, it needs all the information to answer a question in advance since it only works from memory.

An agent doesn’t have these limitations, however. If we want to make an agent that finds the best deals online, we need to provide it with a way to navigate the internet, and it will do a very decent job. Agents can explore external data and have conversations with themselves like: “I will first do A, then I will solve B to bring you answer X.” These problem-solving capabilities make an agent perfect for data exploration and data-related analysis. When given access to a database and enough context about the data in it, an agent will generate code to analyze the data and answer any questions.

(For a more detailed view on agents, read also https://toqan.ai/blog/the-emerging-agent-ops-landscape)

Helpfulness is the enemy of accuracy

Building an agent that can turn a question into an effective SQL query comes with some challenges though.

Let’s look at the first issue: being too helpful. Large Language Models (LLMs) — key components of agents — are smart and can develop logical processes, which means they may find solutions that are not straightforward for the person asking. This is one of the best examples we found:

User: What is the Address of user X.

Bot: Here are the driver’s coordinates during delivery.

The LLM did not have access to addresses. Still, it was smart enough to work out that the driver’s coordinates during delivery must be near the address of user X. Although the system provides an answer, the best answer would be: “Sorry, I do not have access to addresses.”

To help mitigate this problem, we embed the options to use two tools in the agent:

Ask clarifying questions
State assumptions

The first tool tries to remove uncertainty at every step. When we add information to the system, it evaluates whether it’s ambiguous and if so, asks clarification questions. This will make the question more specific and changes users’ behavior in two ways: it ensures expectations are more aligned, and it trains people to ask better questions.

The second tool is designed to ensure that if the agent takes shortcuts or uses creative solutions, the user can spot them more easily. We call this “perceived accuracy,” where given the knowledge it concluded X, answer Y is correct. An example we use frequently for this is:

U: Can you provide the average revenue per day for X in the last month?

B: That was …
Assumptions: “Given a month is defined as from today until the same day last month.”

Since we now know this is how the agent defines a month, we can’t argue that the question is false. So, we either correct it or accept it. Like asking clarifying questions, we teach the agent to ask better questions, aiming to be more explicit in its communication and transparent to users.

Hidden context

The second issue we ran into: people use nuances and context that even the smartest multimodal models struggle with. This is a good example of what can happen when a system moves from concept to real-world testing.

The missing information — or the hidden context — falls into four groups:

Global knowledge: The general structure of the organization, details about the business and its partners, and what kind of system the database is.
Product knowledge: Each table is designed to serve some KPI or product insights; this is valuable information that can pivot the balance between right and wrong at the right moment.
Table and column knowledge: Are tables slow or fast-changing, partitioned, or in a special date format? With this information, the model doesn’t have to reinvent the wheel.
Request language knowledge: Team jargon, acronyms, or other language that’s unique to a specific team.

Gathering this information comes with the engineering challenge of knowing when and what is useful. Some solutions focus on embedding retrieval (ER) but we did not see it as a good fit since ER focuses on matching subjects and topics. We want the agent to see the rules without knowing if they may be needed. We do this by implementing three mechanisms that improve the request:

1. Business Rules

These rules cover the global knowledge and a large part of the product knowledge. This is the simplest but also the most powerful concept. We want the model to always be exposed to this knowledge. To do so, we include this in the system prompt together with other guidelines and behavioral rules.

2. Glossary

In its purest definition a glossary is a list of terms in a special subject, field, or area of usage, with accompanying definitions. “Terms” may also involve acronyms and short phrases. The goal of this is to add these definitions to the request if we find that they are used in the user’s requests.

3. Table selection

Storing a lot of information in the columns and table descriptions is useful. However, if we add all this information to the prompt, we have very little control over changing the behavior if it is not picking up the correct details for a given request. To tackle this, we need a separate system that can judge which information is valuable for the request. This second system will first assess which tables might be useful. Secondly, it collects the columns that are needed from these tables. Thirdly, it assesses if, given the complete overview, it could drop some parts of the result if these are duplicates, or now redundant. Now we can add very specific information and even instructions into descriptions without these influencing any request where they are not needed.

By combining these three systems we can control the information going into the text-to-SQL agent.

Testing our SQL agent

Lastly, some words on testing (more on testing here ). Testing has many advantages, with the two most important being:

● Confidence to make changes,

● Direct feedback from your changes (fast iteration cycle)

We saw a real challenge in creating a test set that represents the production use cases. In this blog, we highlight how we achieved automated testing. For the test set, we need a question and query.

We start by running the “ground truth” query to fetch the most recent data, ensuring our agent’s answers are based on the latest information. Next, we ask the agent the same question multiple times in a slightly different way. For example, “What is the capital of France?” and “Which city is the capital of France?”. This helps us evaluate the agent’s robustness.

We use a prompted model to check if the answers are similar, focusing on the core response and ignoring minor differences. For each question, we measure:

• Accuracy: How close the agent’s response is to the expected answer.
• Consistency: How uniform the responses are across different question formulations.

Based on these metrics, we identify areas for improvement. High accuracy but low consistency indicates correct answers but with variability. Low scores in both suggest the need for further training. These insights help refine the agent’s training data and algorithms.

Finally, with automated testing in place, we are now focusing on improving the framework.

Next steps

We’re working on making our SQL agent even better. Some of the things we’re looking into:

Self-learning:

We want to use successfully generated or upvoted SQL as a source for generating snippets that resemble (parts of) requests.

2. Reformulating the request before execution:

The agent will reformulate your question to one that is non-ambiguous. By putting more effort into the request before execution, we limit the number of queries to the database.

3. Validation agent

By asking the agent “Are you sure?” we can make sure the query is correct. We are looking into automating this by adding a supervisory validation agent.

This is just the beginning, stay tuned for updates!

PS: If you would like to know more, please contact our Toqan.ai team in

Just Ask: Data Insights for Everyone was originally published in Prosus AI Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.