Stories by Annakokovina on Medium

How Not to Go Broke on Tokens While Everyone Else Builds AI Agents

Annakokovina — Sun, 10 May 2026 14:40:10 GMT

A couple of years ago, the main question when building an LLM application was: Does it respond at all, and does it kind of solve my problem? Now the question is different: “How much does this cost, and why is it so slow?”

Over the last two years, the average LLM request has grown from ~2,000 tokens to more than 5,400. Just check the OpenRouter research paper: https://arxiv.org/html/2601.10088v1

Agentic pipelines, massive system prompts, multi-step reasoning — all of this keeps inflating context size. Coding-related prompts are now 3–4x longer than general-purpose ones, making them the main driver of token growth.

And every token has a price: in dollars and in latency.
Why This Hurts So Much 👇

Money — Providers charge for every input and output token. More tokens → higher bills. No magic here.

Latency — The longer the prompt, the longer it takes before the first output token appears. In real products, especially near real-time applications, this becomes painfully noticeable.

Quadratic Math — Self-attention scales roughly as O(n²). Double the tokens → roughly 4x more computation. GPT-4 with a 128K context can cost up to 16x more than with 8K.

Reducing prompts by 2–3x is not “a small optimization.” It changes the economics of the entire product.

Below are the main techniques that actually help.

Truncation: Brutal, but Honest

The simplest strategy is to cut the context.

Common approaches:

Sliding window — remove old messages and keep only the latest N tokens
First + Last — preserve the beginning and the end, discard the middle
Summarization — compress old messages into a summary before removing them

This works well for simple chats where user questions are relatively independent.

It fails when the full history matters: debugging sessions, analytics, document workflows. The model will simply “forget” important information from the middle.

My Take
A solid baseline and the best place to start. Fast, practically free, and doesn’t require architectural gymnastics. Just make sure you understand exactly what gets removed.

Prompt Compression: Shrink Without Losing Meaning

If truncation means “throw away,” compression means “make smaller.”

A lightweight auxiliary model evaluates tokens and removes low-information parts.

The result often looks syntactically weird but remains semantically dense enough for a larger model to understand.

Example: “Please carefully analyze the following document and provide a comprehensive summary”
becomes: “analyze document provide summary”

Some notable projects:

LLMLingua (Microsoft Research) — up to 20x compression, works with almost any LLM
LLMLingua-2 — 3–6x faster, slightly lower compression ratios (up to 14x)
LongLLMLingua — optimized for long contexts; on NaturalQuestions it achieved +21% quality improvements with 4x fewer tokens

This works especially well for long documents and RAG chunks.

Less effective for short prompts or strict real-time systems because the auxiliary model itself adds latency during compression.

My Take
A powerful tool, but not a silver bullet. Always benchmark quality on your own tasks before deploying. Aggressive compression can introduce subtle degradation, so use sufficiently large gold datasets.

RAG: Don’t Load Everything — Retrieve What You Need

RAG is often treated as a tool for external knowledge.

From a token perspective, though, it’s primarily a way to avoid stuffing 500 pages of documentation into every request.

Retrieve 3–5 relevant chunks and inject only those.

There’s a growing narrative that million-token context windows make RAG obsolete. Reality is more nuanced:

RAG is dramatically cheaper than long-context approaches under typical workloads — and usually faster.

Plus, the “Lost in the Middle” problem hasn’t gone away. Models still struggle with information buried in the middle of very long contexts.

There’s good research on this topic showing that long context alone can reduce quality even with perfect retrieval: https://arxiv.org/html/2510.05381v1

Modern RAG is no longer “find and paste.” It includes:
- Conditional retrieval (do we even need retrieval?)
- Reranking
- Hybrid search

My Take
For most production products, RAG is mandatory. Long context is not a replacement — it’s an additional tool for specific use cases.

The real question isn’t “RAG or long context?”
It’s: “What exactly should we retrieve, and when?”

Prompt Caching: Pay Once for Repeated Context

Probably the most underrated optimization on this list.

To understand why it matters, we need a tiny peek under the hood.

Every transformer request recomputes KV matrices for all tokens — even if 90% of the prompt is identical to the previous request.

Prompt caching stores already-computed matrices for repeated prefixes.

Result: Repeated tokens become roughly 10x cheaper. Anthropic claims up to 85% latency reduction with full cache hits.

Provider implementations:
- OpenAI — automatic caching with no API changes required. Experiments show ~50% hit rates
- Anthropic — explicit cache blocks via `cache_control`. More control, potentially 100% hit rates
Google Gemini — available through the Context Caching API

The architectural pattern is simple: Static parts go first. Dynamic parts go last.

System prompt → cached
Few-shot examples → cached
User query → not cached

This works extremely well for applications with long reusable system prompts.

It doesn’t help if every prompt is fully unique.

My Take
If your system prompt is longer than 1,000 tokens and you have any meaningful traffic at all — enable caching immediately.

Minimal code changes. Massive impact.

Prompt Hygiene: Discipline as an Optimization Tool

The easiest way to save tokens is to write concise prompts from the start.

A few practical rules:

Remove Fluff
Instead of: “Please carefully and thoroughly analyze the following text and provide a comprehensive and detailed summary”
just write:“Summarize:”

That alone can save 15+ tokens with identical quality.

Use Structure Instead of Explanations
Models understand XML tags and structured formatting better than paragraphs of prose describing the output format.

Control Output Length
`max_tokens` is your friend.
Without explicit limits, models tend to over-explain and reason out loud by default.

Remove Few-Shot Examples When Possible
For simple tasks, zero-shot prompting with a clear format often performs similarly — at a much lower cost.

Move Static Content into the System Prompt
If the same text repeats in every request, it belongs in the system message, not the user message.

It also caches better.

My Take
This is where optimization should start.
Prompt audits are fast, free, and often reduce token usage by 20–30% without hurting quality.

Only after that should you move on to more advanced techniques.

Final Thought

Token optimization is not a one-time tweak.
It’s an architectural decision.
The earlier you start measuring and optimizing token usage, the cheaper scaling will be later.

Yandex Smart Car UMO 5

Annakokovina — Wed, 22 Apr 2026 08:09:03 GMT

Last Saturday I attended the Yandex Urban Services conference — yes, apparently rest is not in my schedule anytime soon.

The headline act of the evening was the new car from Yandex.
My main impression: this is not just a car — it’s a gadget.
A big, electric gadget on wheels. 🤯

The ecosystem of devices powered by Alice is already diverse and fascinating, and now it has a new member — an electric vehicle developed by Yandex together with its partners: UMO 5.

Or, as the team affectionately calls it — our Umka 🧡

Disclaimer:I know absolutely nothing about cars. 💅
I can’t tell a suspension from a gearbox, and I’m definitely not qualified to evaluate a vehicle from an automotive engineering perspective.

But I do understand gadgets.
So I look at Umka exactly through that lens — as a smart device on wheels.

A car as a gadget

The vehicle is produced jointly with partners on a well-established platform.
The core hardware — motor, suspension, and overall form factor — has already been thoroughly tested.
This is not an experimental prototype and not raw hardware.

But the most interesting part is not the hardware.
It’s the philosophy.

The key idea is gadget-ness and ecosystem integration.
A real smart home with Alice — but on wheels.

Instead of the familiar command:
— “Alice, turn off the lights”

You now get more automotive ones:
— open the windows
— fold the mirrors
— turn on and adjust climate control
— heat the seats
— open the trunk
— tell me how much battery is left

And yes — this is still an electric vehicle, so the command set is intentionally strict: exactly the set of actions that drivers and passengers actually need every day.

One important detail worth highlighting:
all safety-critical functions are not delegated to voice control.

Acceleration, braking, and any safety-related systems remain fully manual and familiar to the driver.

So Alice is not an autopilot.
She’s an assistant. 💜

Ecosystem thinking

It’s impossible not to notice how deeply the car is integrated into the Yandex ecosystem.

Inside, you get familiar services:
— Yandex Video
— Yandex Maps for route planning
— and generally everything we already use in everyday digital life

At the same time, the vehicle still includes the full standard set of automotive features — cruise control and driver assistance systems.

In essence, the car becomes just another device in your digital toolkit — alongside your smartphone or smart speaker.

One particularly impressive feature is remote control.

Imagine the classic winter scenario:
You’re still at home, making coffee, getting ready to leave —
and the car is already warming up on your command.

This is enabled by a dedicated onboard module inside the vehicle.

How AI actually works inside the car 🔮

Since we’re primarily discussing AI and its practical applications, let’s take a look at how Alice works inside the car from an engineering perspective.

To do that, we need to briefly talk about assistant architectures.

In customer support scenarios, when a user asks a question and the system needs to decide what to do next — route the request or provide instructions — the most common approach is an **intent-based architecture.**

An intent is essentially the user’s goal or request.

For example, in banking:
— “My card isn’t working”

The system then needs to understand what happened and suggest the appropriate action.

This is often solved using a hybrid approach:

1. Take the user request
2. Convert it into an embedding
3. Embed the descriptions of instructions and situations
4. Search for the most semantically similar options
5. Rank them (usually with a separate reranker model)
6. Select the best response

If no suitable option is found, the request is escalated to a human support agent.

Why the car uses a simpler architecture

Now let’s go back to the vehicle.

Even though embedding models are relatively small, installing a full GPU inside a car is still a questionable idea — especially for a mass-market, cost-sensitive electric vehicle.

So the architecture is intentionally much simpler.

UMA listens to user requests through two microphones,
but it does not build embeddings or run heavy models.

Instead, the system uses an approach very similar to smart home devices:

- There are predefined scenarios
- A compact micro-classifier model determines which instruction category the request belongs to
- The system executes the corresponding action

In essence, this is not answer generation.
It’s fast scenario selection.

Of course, this affects the user experience.

For example, to check the remaining battery level,
you need to use a fairly precise trigger phrase.

But in my opinion, this is a perfectly reasonable trade-off — especially when balancing cost, reliability, and response speed.

For a prototype and a mass-market product, the solution looks very sensible.

And one more important point.

Full-scale agent systems — the kind we’re getting used to — rely on orchestration and almost always require significant compute resources, meaning heavy GPUs.

That kind of infrastructure simply wouldn’t work reliably inside a car — or would only function when connected to Wi-Fi.

So the current architecture is not a limitation.
It’s a deliberate engineering decision.

My personal impressions as a passenger 🤭

UMA 5 is a Chinese electric vehicle.
It looks and feels like a typical representative of its segment: a practical family car designed for everyday tasks — going to the supermarket or working in ride-hailing.

I would describe it as a baseline vehicle for its price category.

This is not a sports car that delivers speed, adrenaline, and a bit of rebellious fun.

It’s comfortable urban transportation.

Some observations:

The interior finish is typical for the segment — seat upholstery is synthetic rather than leather
The cabin is fairly plastic-heavy, with textured plastic designed to resemble leather

And here I have a personal quirk.

I genuinely believe a more honest design philosophy is better:
if a material is plastic, it should proudly be plastic — not pretend to be leather or a more expensive car.

Otherwise, it creates a slightly strange feeling of unnecessary imitation.

That said, most cars do this — and in this case it’s simply part of the partner platform.

On the positive side, the production team claims the build quality is very solid:
no loose parts, no squeaks.

And that’s genuinely reassuring.

Many of us have taken early electric taxis at least once and remember the constant rattling and creaking — especially noticeable because electric motors are so quiet.

I really hope this level of build quality holds once production scales.

Another design choice worth mentioning is the driver dashboard.

It consists of two screens:

one for multimedia and external vehicle controls
one for speed, battery level, and core driving metrics

The layout feels logical and modern — without the sense of being overloaded with screens just for the sake of screens.

Thoughtful details ✨

I love well-designed products, and I especially appreciate attention to detail.

This is exactly the kind of case where that attention is visible.

Voice control with Alice is implemented using two microphones located in the roof lining above the front seats.

And it’s not just “two microphones.”

It’s a smart zoning system.

The user experience has clearly been thought through in advance. 👍

When I simply say:
“Open the window”

The system opens exactly the window I need.

If I’m sitting in the driver’s seat — the driver’s window opens.
If I’m in the passenger seat — the passenger window opens.

The microphones dynamically determine the command source location.

I’m not entirely sure how perfectly this will work for passengers in the back seat, but in theory the system should handle that scenario as well.

Another important detail: the developers took the noise problem very seriously.

At home, a smart speaker operates in a relatively calm environment.
In a car, it’s a completely different situation:
movement, road noise, wind, conversations, traffic.

So the speech recognition model was specifically trained for noisy environments.

The idea is that Alice should reliably recognize commands even while driving.

Final thoughts

It feels like we are standing at the very beginning of a major transformation.

The car is gradually evolving from a mechanical device into part of the digital environment around us.
It becomes connected to the home, the smartphone, and services — controlled by the same assistant that already lives in other devices.

In essence, this is a new class of objects:

gadgets embedded into the physical world.
Smart devices made tangible.

And honestly, watching this transition happen in real time is incredibly fascinating.

The Market for Search Infrastructure for AI Agents

Annakokovina — Mon, 30 Mar 2026 15:45:31 GMT

Preface

When Models Stopped Growing — and Started Learning to Act

Or: From Scaling Models to Scaling Capabilities

A few years ago, the industry was almost unanimous in its belief that the quality of artificial intelligence was a function of model size.

More parameters meant better answers, broader capabilities, and higher product value powered by the model underneath.
And for a while, that assumption held true. Each new generation of models delivered a noticeable jump in quality, while companies competed to scale infrastructure and train increasingly large systems — the golden age of distributed inference frameworks.

But gradually, it became clear that this strategy had natural limits.

The cost of computation was growing faster than the utility of additional parameters.
Latency was becoming critical for real-world products, especially in voice scenarios.
And quality improvements were no longer proportional to model size.

At some point, the market began looking for a different path — not increasing model intelligence, but expanding its capabilities.

This is where the pattern of agentic systems emerged.

The model stopped being just a text generator and became part of a system capable of executing tasks: calling tools, retrieving data, making decisions, and acting across multiple steps.

This was not just a technological step.
It was an architectural shift.

The center of the system was no longer the model itself,
but its ability to interact with the external world.

When companies started building agents, they quickly discovered that the main challenge was not the model.

The challenge was the tools it could use —
and how reliably those tools worked.

Gradually, a new infrastructure layer began to form — one that can be described as capability infrastructure for agents.

These are not models, and not frameworks.
They are ready-to-use capabilities that can be connected to a system.

For example:
- memory
- search
- code execution
- multimodal tooling

Each of these capabilities solves a specific practical problem and turns the model from a text generator into an operational system and decision-making point.

As agent architectures started appearing in real products, an ecosystem began forming around these capabilities.
It would be inaccurate to call it a fully mature market — but it is clearly emerging.

Why Search Became the First Standardized Capability

Today we will focus on one of the most stable and well-defined parts of this ecosystem:
search infrastructure for LLM applications — or simply, the ability to search the internet.

If you look at real-world agent use cases, search is present in almost all of them.

An agent needs to obtain information from outside:

- finding answers and information
- verifying facts and real-time data
- collecting lists and performing research
- tracking changes
- analyzing news (one of the most common scenarios)
- clarifying details

Without access to the internet, an agent is limited to the model’s training data — which inevitably becomes outdated.

That is why search became the first capability to standardize as infrastructure.

Not as a feature inside a product,
but as an external service.

And over the past few years, a new market has formed around it —
the market for search infrastructure for AI agents.

What Search Infrastructure Products for AI Agents Actually Look Like

Once you start examining this market closely, one pattern becomes obvious:
architecturally, it stabilized very quickly.

Regardless of the company, almost all solutions include the same set of components.

At the center is web search — the mechanism that finds relevant pages.
In many cases, this layer relies on long-established search engines such as Google.
But there are also companies building their own indexes in relatively short timeframes — for example, Exa.

Next comes content extraction — the process of converting HTML into clean text.

Then comes crawling, which allows systems to traverse entire websites and collect data systematically. This functionality is especially useful for monitoring updates.

Sometimes a deep research layer is added — enabling more complex search and information aggregation workflows.

And more rarely, there are features that sit somewhere between research and search — such as structured list discovery or dataset retrieval.

But in practice, the stack is remarkably consistent:

Search → Extract → Crawl → Research

Differences between products rarely appear at the architectural level.
Instead, they appear in operational characteristics:

speed
reliability
search and ranking quality
cost
output format

After comparing multiple vendors, the same patterns keep repeating.

Observations from the Market

1) Most search systems are not optimized for specific languages

Nearly all solutions are optimized for the English-language internet.
Outside that environment, quality often drops noticeably.

This is not primarily a technical limitation —
it reflects the structure of demand.

The primary market is still the United States or English-speaking ecosystems.

2) Public SLAs are rarely transparent

Companies are willing to discuss latency and search quality.
But standardized benchmarks for search performance in AI agent contexts still barely exist — with the exception of simple QA-style evaluations.

Real guarantees around availability, stability, and performance are usually discussed only in enterprise contracts.

For smaller customers, these metrics are often considered less critical.

3) High-volume usage is almost never priced at public rates

Website pricing is typically a reference point.

But once a system moves into production, pricing conditions are almost always renegotiated directly with sales teams and become customized.

4) Despite marketing, most workloads remain simple

There is a lot of marketing around:

intelligent answers
agentic workflows
deep research

But in reality, the majority of systems repeatedly perform very basic operations:

find a list of links
retrieve page text or a snippet
pass the data to the model

Key Players in the Search Infrastructure Market

If you compile the list of vendors most frequently mentioned in enterprise conversations, it turns out to be surprisingly compact.

Among specialized providers, the most commonly discussed include:

Exa
Tavily
Firecrawl
Parallel
Google
Yandex
Bing

I intentionally excluded Perplexity and You.com.

Perplexity receives a significant amount of negative feedback regarding reliability and quality — which is understandable given its strong focus on consumer products.
And for You.com, search is not a core business — much of the underlying infrastructure is sourced from other providers.

Each vendor has its own product philosophy and target audience.
But the core capabilities are remarkably similar across the board.

What Companies Actually Pay For

After the initial wave of excitement and demos, market expectations start to shift.

At the presentation stage, conversations often revolve around advanced features:

- deep research
- automated analysis
- multi-step agent workflows

But when it comes to use cases that generate consistent revenue, the picture becomes much more pragmatic.

Companies are not paying for intelligence.
They are paying for reliability of basic operations.

More advanced logic is usually built internally.

In practice, three capabilities are consistently in highest demand.

Fast search

Latency becomes a critical parameter because complex systems may execute dozens of sequential tool calls.

Search is rarely the only tool involved.

Every additional second directly impacts user experience.

Reliable content extraction

Not just returning a URL —
but returning URL + usable text— turns out to be central to most systems.

This capability determines whether the model can actually work with the retrieved information.

Predictable load limits

Peak performance is less important than stability and control.

What matters is predictable quotas, rate limits, and operational behavior.

Two Distinct Customer Types

The market clearly divides into two categories of customers — and this division strongly influences product selection.

Type 1 — “Everything as a Service”

These companies want a ready-made solution.

They are not particularly sensitive to small differences in quality.
They do not want to manage search infrastructure complexity.

AI is not their core product — it is an additional feature.

Their primary goal is to get working functionality quickly without large engineering investments.

They are willing to buy:

packaged deep research
ready-made AI answers

Type 2 — Infrastructure Customers

These companies treat AI functionality as a core part of their value proposition.

Sometimes it is their main product — and often they resell it as B2B infrastructure.

In these cases, requirements increase dramatically.

They need to:

- control search quality
- adapt system behavior
- manage request cost
- scale workloads
- guarantee service stability

For these companies, search becomes part of the core architecture.

They are also the ones most likely to:

build their own pipelines
- combine multiple providers
- carefully benchmark latency, quality, and cost

Capability Infrastructure Is Designed for Machines — Not Humans

One defining characteristic of capability infrastructure is that it is designed for agents not only at the input level, but also at the output level.

When you look at this emerging product layer, its uniqueness is shaped not just by how it connects to systems, but by the format of the results it produces.

Yes, integration typically happens through programmatic interfaces:

APIs
MCP

A user interface is helpful —
but mostly as a playground or marketing surface.

The more important distinction lies in the output format.

These systems are designed from the start to produce results for other systems — not for humans.

Ideally, outputs should be:

- structured and compact
- predictable
- easy to process downstream
- token-efficient

Not a document.
Not a webpage.
But a set of facts in a concise snippet.

The Current Market Structure Is a Compromise with Constraints

When looking at today’s search infrastructure ecosystem, it may seem that the market has already stabilized.

But most likely, this structure did not emerge because it is optimal.

It emerged because it is what currently works — technically and economically.

In reality, the market is still adapting to constraints.

For example, I am confident that once reliable, affordable, general-purpose deep research becomes available, the balance of functionality will shift again.

Basic operations — search and content extraction — did not become dominant because they are the most interesting.

They became dominant because they are the most predictable.

Do AI Agents Really Need Memory — or Is It Just Another “Wow Feature”?

Annakokovina — Wed, 14 Jan 2026 12:50:31 GMT

Do AI Agents Really Need Memory — or Is It Just Another “Wow Feature”?

I tested real agent memory in production — not demos. Here’s what actually improves UX, where memory is useless, and why shared memory matters more than personalization.

We’re all used to demos where an assistant “remembers you like sugar-free coffee” and, 20 messages later, surprises you with empathy. Sounds great.

But when it comes to real products — especially B2B — reality is usually far more boring:
one prompt, one function, maximum reliability, minimum surprises.

I ran real product testing of agent memory — not a demo version, but actual pluggable memory with fact storage, retrieval, management, and controls — in production agents.
Here’s what we learned 🤭

— -

🔍 What we tested

A standalone Memory Server (mem0 / MCP-style architecture)
API for `query / save / propose / delete`
Memory stores facts and decides when to retrieve them
Async writes (no UX latency)
Ability to inspect and delete what the assistant remembers
Fully integrated into an already running production assistant

We tested this with active customers who explicitly wanted this functionality — real assistants doing real work, not living in slide decks.

— -

🧩 How to build memory in platforms

Option 1 — Dedicated memory service

A separate architectural component with its own SLA:

- Read/write APIs
- Namespaces
- Sharing support
- Tracing and logs

Why this works:

- Scales independently
- Can be reused across multiple assistants
- Not tied to a specific LLM
- Fault-tolerant — if memory goes down, the system still works

Option 2 — Memory baked into the platform

Memory tightly coupled to the assistant:

Better developer experience
Fewer integration steps
But harder to evolve as a standalone module
Comes with a tax of unnecessary functionality

Option 3 — Custom RAG pretending to be memory

The engineers’ favorite:

Put everything into vectors
Retrieve by similarity
Hope that’s “memory”

Spoiler: sometimes it is enough.
We actually saw this in the pilot.

— -

So… do businesses need personalized agents with memory — or is it just another shiny toy?

— -

🧪 What reality showed

✨ Memory does improve UX

Users consistently said:

the assistant feels smarter
it stops asking the same questions
trust increases

This wasn’t “tech admiration” — it was observable user behavior.

Fact: memory is a direct trust booster.

❌ But… long-term personal memory is rarely needed

This was the biggest surprise.

There are very few real cases where long-term personal memory truly matters:

Developers aren’t building complex multi-turn assistants
Often there isn’t even a stable user ID to separate memories
Most enterprise products live in a “one prompt → one function” world

Even more interesting:
many teams wanted memory not as memory, but as a cache to:

- reduce token costs
- minimize repeated calls
- optimize workflows

Not “remember me as a person”, but remember intermediate stuff to make this faster.

— -

🔎 Transparency matters more than expected

Users want:

- to see what’s stored in memory
- to delete specific facts (not wipe everything)
- simple commands instead of a heavy admin panel

— -

🤝 Memory + multiple assistants = 🔥

One more unexpected (and very strong) insight:
there’s real demand for shared memory used by multiple specialized agents — essentially memory for a workflow or a multi-agent system.

This is no longer “assistant memory.”
It’s organizational product memory.

Sounds boring — but for business, it’s gold.

Honestly, this worked in our test almost by accident.
Turns out, demand is huge.

— -

🎨 Expectation vs reality

Expectation:
Memory like a human’s — growing over time and turning the assistant into a personal digital companion.

Reality:
Memory as a smart, tidy **working notebook** of the product:

- helps
- speeds things up
- increases trust
- doesn’t need to be “human”

And that’s… great.

Because memory isn’t an empathy toy — it’s **infrastructure** that turns assistants into real products, not demo gimmicks.

— -

🧷 Takeaways

Ultra-personalized, long-term memory baked deep into platforms is usually unnecessary.
It makes sense mainly for large B2C products — and even there, it’s built very deliberately around specific business needs.

What does matter:

- Memory is real value, not a marketing feature — it sharply increases trust
- Don’t romanticize “lifelong personal memory” — functional memory is what’s needed
- Shared memory is a very strong use case (hello, context engineering 👀)
- Transparency and control are must-haves
- Market expectations are low — people are happy to ship this “as is”

- take Mem0
-add deduplication
-set memory limits
and you already have something solid for production

If you’re building a platform, start thinking about memory not as a cute LLM feature”, but as a core architectural primitive — just like logging or storage.

Memory for LLM Agents

Annakokovina — Tue, 28 Oct 2025 15:19:31 GMT

Memory for LLM Agents: Why It Matters and What’s Already Been Invented

If you’ve ever talked to an assistant that completely “forgets” who you are by the next day — you already know why agents need memory.

Without it, every interaction turns into an endless first date: you have to explain who you are, what you do, and how you like things — over and over again.
For users, that’s frustration. For businesses, that’s inefficiency and lost value.

Why do agents need memory, really? 🫠

Memory isn’t just about better user experience.
It defines what an agent can actually do — whether it can remember past interactions, adapt to a user, manage long-term workflows, and incorporate feedback.

But not all agents need memory in the same way.

There are two fundamental scenarios 👇

🏢 1. The Agent as an AI Employee

You’re building an assistant to automate a business process.
Its job is to deeply understand the process and act as an impersonal AI worker.

In this case, the “user” is the company itself — or its employees acting on behalf of it.

The north star metric for such an assistant is closing a specific business task as efficiently and accurately as possible.
Focus: execution, process, compliance.

User experience and personalization still matter — but they’re secondary metrics.
A team building an automation-focused LLM system usually won’t prioritize those until much later.

👤 2. The Agent as a Personal Assistant

Here, you’re building an assistant meant to serve a person.
A user delegates tasks to it directly — tasks that are often messy, non-deterministic, and context-dependent.

Such an agent must remember you:
your projects, habits, communication style, and ongoing work.

It should pick up right where you left off yesterday — and know the difference between “send the file to Vasya” and “send Vasya the file.”

That means handling explicit facts the user provides (features) and implicit contextual data — the kind they never say out loud, but that defines them.

Typical examples: personal assistants, HR bots, support agents — any system where context = half the UX.

💙 The Customer Support Case

This one deserves its own category — the final boss of LLM automation.

A support agent must remember a customer’s previous tickets, products, issues, and even their cat’s name — all while following a strict policy or script.

It’s a hybrid scenario that blends both use cases: structured process + personalized context.
And it’s the hardest to do well.

So what kinds of memory are there? 🤯

Let’s break it down (with links to good docs if you want to go deeper).

⚡️ Short-Term Memory

Short-term memory, or thread memory, is usually implemented through sessions.
It’s what the model “holds in its head” right now — the context buffer of the current conversation.

It helps the agent keep track of the dialogue, understand pronouns, and stay coherent.
Example: LangChain ConversationBufferMemory.

The obvious downside: once the context window is full — or the session ends — everything is lost.

🧩 Long-Term Memory

Long-term memory can be recalled at any time, across sessions — but it’s always tied to a specific agent ID.

It typically comes in three subtypes:

🗓 Episodic

A timeline of user interactions, events, and actions — “what happened and when.”
It helps the agent recall how it solved something last time.

🔍 Semantic

Stores facts and user traits in a meaningful (vectorized) way — via embeddings.

⚙️ Procedural

Keeps patterns, skills, and learned routines — how to perform a task, format a report, or trigger a pipeline.

A cool related concept is Reflexion:
an “actor” agent generates a response, then a “revisor” critiques it — pointing out omissions, errors, or suggesting improvements.

What’s Already Been Built 🦆

🔹 Mem0 (OpenMemory MCP)

Built on Python / FastAPI / Qdrant stack.
Implements semantic memory with embeddings and vector search.
Supports standard API ops (add, search, list, delete).
Can run locally or as a cloud endpoint. Integrates with LlamaIndex, LangChain, and OpenAI MCP.

Pros: simple, plug-and-play, easy to integrate.
Cons: very basic — no deduplication, summarization, or fact aging. Not scalable.
Verdict: nice concept (especially in MCP context where memory is optional), but too lightweight for production.

🔹 LangChain Memory

Offers an entire ecosystem of memory types — from simple buffers to EntityMemory and StateMemory.
Supports Chroma, FAISS, Redis, and more.

Pros: flexible, mature abstractions, solid docs.
Cons: can get confusing — too many layers, no deduplication, and EntityMemory struggles at scale.
Verdict: a mature and flexible option — especially when wrapped into a cloud MCP.

🔹 LlamaIndex

Implements long-term memory with vector store integration.
Includes a fact extraction layer (LLM prompting across sessions).

Pros: easy integration, fits into existing pipelines.
Cons: needs careful setup for token limits and fact expiration; memory can “bloat” fast.
Verdict: a solid, no-frills long-term memory base — but you’ll need to extend it if you want more than semantics.

🔹 MemGPT

More research project than product.
Uses tool-calling where memory logic (deduplication, priorities, recall) is delegated to the agent itself.

The most interesting idea: memory pressure — letting the agent decide when to offload facts to memory as the token window fills.

Pros: elegant concept, dynamic memory management, prioritization of important vs stale facts.
Cons: unstable, expensive, scalability concerns.
Verdict: promising research, far from production-ready.

🔹 Semantic Memory in Yandex’s Alice

Based on Anatoly Glushchenko’s research on LLM-based personalization in Alice.
Not open-source, but a great reference — basically a masterclass on how to build high-quality semantic memory.
Watch the 36-min recording here: ODS.AI Fest 2025 — Yandex MSC

🔹 AWS Bedrock Agents (Memory)

Memory is built right into the agent API.
Each session is automatically summarized and stored — no manual setup.
Integrates with Guardrails for filtering, validation, and data safety.

Pros: frictionless, everything just works.
Cons: locked into AWS ecosystem.
Verdict: perfect if you’re already in AWS and want something “simple but works.”

🔹 Google Vertex AI (Agent Engine + Memory Bank)

Memory is part of the Agent Engine; data lives in the Memory Bank.
Supports cross-session recall via search_memory.
Treats memory as a tool, with automatic summarization.

Pros: deeply integrated, minimal setup.
Cons: limited to Google’s ecosystem, still pre-GA.
Verdict: strong “memory as a service” direction.

🏆 What’s Next

Right now, everyone’s reinventing the wheel:
some teams store everything in a vector DB, others hack together Redis-based memory or summarize via GPT.

But the trend is clear — memory is becoming a native service inside agent platforms.
And we, my friends, are all just trying to keep up in this sprint.

MCP in Enterprise: Challenges, the Desired Outcome, and Ideas

Annakokovina — Mon, 13 Oct 2025 18:52:33 GMT

When we talk about MCP (Model Context Protocol) outside of the business context, it sounds amazing — almost like a future where you have your personal assistant whose skills can be extended simply by connecting ready-made tools.

But as soon as the discussion turns to implementing MCP in a large corporation, the usual questions arise: security, authorization, scalability, and context management.

In this article, we’ll try to understand why MCP is more than just hype — and how it can reshape the enterprise landscape.

MCP Today

At the moment, MCP is gaining massive popularity — a large community has grown around it, and the MCP server ecosystem is expanding literally every day.

The protocol allows extending the capabilities of LLM applications by providing a standard way to interact with external data and tools.

But along with that come new challenges — especially when it comes to enterprise environments.

Key Security Questions

There are no standard verification mechanisms for MCP servers (which means potential malicious code risks).
It’s necessary to ensure safe use of servers inside the company.
Clear usage scenarios and access control are required.

What is MCP

For those who read the preview and got interested but found the term unfamiliar:

MCP is not a framework or a tool — it’s a protocol, similar to:

HTTP for the internet
SMTP for messaging
LSP (Language Server Protocol) for programming language support

Anthropic aptly describes MCP as “the USB-C port for agentic systems” — a universal interface that standardizes interactions between different AI ecosystem components regardless of their vendor.

MCP Architecture

Host (LLM applications) — creates and manages multiple clients
Client — maintains 1:1 connections with servers, manages the protocol
Server — provides context, tools, and prompts to clients

What to Read to Understand MCP

Available on Habr: https://habr.com/ru/articles/893482/
Article from the creators: https://www.anthropic.com/news/model-context-protocol

A Dry Product Intro: Why MCP Matters for Companies (or the Journey Every Russian Big Tech Has Taken in Some Form)

A year ago, the industry started talking about agents — a powerful approach to leveraging large language models (LLMs), allowing not just text-based responses but full-fledged automation systems with external function, API, and tool calls.

Companies believed in the agent concept and began building infrastructure enabling LLMs to act not as toys but as real assistants with access to actual actions and internal company services.
And they quickly ran into a storm of custom integrations, uncontrolled context sharing between tools, duplicated connectors, and violations of basic information security practices — even risks falling under Article 137 of the Criminal Code (if you know, you know).

By “agent,” we mean a system where the LLM doesn’t just answer questions but can perform actions — call functions, trigger APIs, run external services.
This approach, known as tool calling (or function calling), greatly extends the model’s capabilities.

The model becomes not just a text generator but an interface to a set of skills, enabling it to solve more complex tasks — pulling data from CRMs, generating reports, interacting with internal systems, etc.

But to do that, the LLM must have access to those functions and services.

It’s logical to create a centralized “tool registry” for agents — a repository of functions, APIs, and microservices that can be connected to the agent.

If you developed some skill — say, an integration with an internal service or external platform — you could describe it as a tool and add it to the catalog. Then it becomes available to LLMs.

The idea was great: a single source of skills from which agents could be built like LEGO.

The first attempts went in this direction — for example, using OpenAI’s responses API.
Good idea, sure — but not one that took off.
We all understand that for a catalog to work, you first need an adoption point — an added value from registering your tool. That value could come from reusing other existing tools.
In short: you must fill the catalog first, then onboard users. A simple idea, often forgotten when speed to market is the main goal.
Also, responses API doesn’t help with security and authorization at all.

That’s where MCP comes to mind — it offers, right out of the box, a significant library of integrations generously written by the community, while allowing segmentation of tools by MCP servers and fine-grained access control.

While working with LLM agents, it becomes clear that once an agent goes beyond a single user and fixed toolset, systemic complexity begins.

The first major issue is authorization.
An agent calling external APIs must do so on behalf of the user, but the LLM itself cannot “own” an authorization context.
You can’t just pass an auth token into the model (even if it’s deployed within the company’s perimeter) and expect it to inject it safely into requests without leaking artifacts.

In enterprise environments, this becomes critically important: any action (creating tasks, accessing storage, reading data) must be tightly bound to the current user’s context.

If we want the agent to be universal (i.e., serving different users with different permissions), we must solve several problems:

Token management: where are tokens stored, how are they passed and validated?
User context: what does the agent know about the user? How is that passed to the tool?
Access control: what actions are allowed or forbidden?
Security: how to prevent privilege leaks, especially if the agent is a public assistant?

These are non-trivial issues, and the industry still hasn’t reached a unified solution.
Most existing implementations are custom patches within specific projects or teams.

For Those Getting into the LLM Application Zoo: A Simple Difference Between Responses API and MCP

This section would have helped me a lot when I was starting out — maybe it’ll help you too.

Let’s Recap

Responses API is the interface you use to communicate directly with the model (e.g., GPT-4, GPT-5, etc.).

It manages how the model responds, what tools it can use, and in what format it returns results.

📘 In simple terms:

Responses API = “How the model thinks and speaks.”

Key Features:

Used for text/code/answer generation
Supports output streaming
Can connect plugins/tools (like python, image_gen, web)
Universal JSON format with fields like role, content, tool_calls, etc.
It’s a low-level API for dialogue and reasoning

MCP, on the other hand, is a protocol that defines how external systems and data can be “connected” to the model safely and in a standardized way.

It’s like a universal bridge between the model and external data sources/tools.

📘 In simple terms:

MCP = “How the model interacts with the outside world.”

Key Features:

Standardizes interaction between the model and third-party systems
Allows connecting plugins, internal databases, APIs, tools, IDE plugins, etc.
Works through the Responses API — meaning Responses API remains the communication channel, while MCP adds the integration layer
Provides security (access control, context isolation)

🧩 Responses API — receives user input and manages the response stream
🧠 Model — decides it needs external data
🔗 MCP — connects to Notion (or another system)
📊 External service — returns raw data
🧠 Model + Responses API — format and deliver final output to the user

Back to MCP in Corporations: The Facts

All the above sounds visionary — but in practice, the topic splits into two tracks:

Local development environment: MCP servers are run directly on developers’ machines for testing and development.
LLM-ization of corporate APIs: MCP servers serve as an intermediate layer, exposing internal APIs to LLMs safely and uniformly.

Let’s explore the requirements for each track.

Local MCP, Open Source Ideas and Solutions

This track is designed for secure, observable, and fast use of ready-made MCP servers from the industry — typically in coding agents or designer workflows (yes, Figma MCPs are still relevant). The main feature of this sub-track is that the agent acts strictly on behalf of the owner and usually within their local environment.

Who Defines the Requirements and What Are They?

In this zone, three things are critical for your information security:

Cataloging MCP servers — a clear list of what is allowed and what is forbidden.
Validation and isolation — no accidental rm -rf / or calls to blacklisted MCP servers or tools.
Tracing and auditing — because security isn’t only about blocking actions but also about being able to restore the full context of what happened.

The local track is about developers, convenience, and security.
It’s simple: if you want to use ready-made MCP tools — go ahead, but under control.

The goal of this section isn’t to prescribe “the right way,” but to highlight tools and ideas that might help you address MCP challenges inside a corporation.

ETDI: Mitigating Tool Squatting and Rug Pull Attacks in MCP (OAuth-Enhanced Tool Definitions + Policy-Based Access Control)

https://arxiv.org/pdf/2506.01333

Goals and Functionality

ETDI is a security extension for the Model Context Protocol (MCP), targeting two key threats:
tool squatting/poisoning (malicious or impersonated tools) and rug pull attacks (hidden behavioral changes in previously approved tools).

The approach is multi-layered:

Cryptographic identification and signing of tool definitions.
Immutable, versioned definitions — any significant change = a new signed release.
Explicit rights and permissions, often mapped to OAuth 2.0 scopes and/or transmitted via JWT.
Additionally, policy-based access control (PBAC): fine-grained authorization of requests through an external PDP (e.g., OPA/Rego or Cedar/Amazon Verified Permissions) considering context (who/what/when/where/resource/action).

Example flow:

An MCP client retrieves a list of tools but, instead of trusting the description blindly, it:

Verifies the tool definition signature
Checks the version and hash of the contract (e.g., OpenAPI)
Compares the required OAuth scopes
Then, on each invocation (or by policy), queries the PDP for an allow/deny decision for that specific action and resource.

Drawbacks

Complexity: key management, version rotation, OAuth integration (including consent UX), and PDP deployment → significant operational overhead.
Single point of delay: each call may require a runtime check with the PDP → potential latency.
Not part of the core MCP specification: this is an extension, not an official standard — clients and servers need explicit support.
Maturity: research paper from June 2025 — not production-ready yet, but good for conceptual adoption.

Summary

Discovery/Catalog: ETDI itself isn’t a registry, but its metadata (signed definitions + versions + scopes) can be stored and validated within your own catalog.
Compatibility with Docker/Gateways: ETDI integrates well with gateways like Docker MCP Gateway, ContextForge, or Unla — the gateway/client can act as the PEP (Policy Enforcement Point), validating signatures and querying the PDP.
Access Control: combine OAuth (who and what is allowed in general) + policies (when and on what specifically) + strict versioning (what exactly was approved).
Audit Readiness: storing logs (tool_id, version, scope, policy decision), signatures, and version chains simplifies both internal and regulatory audits.

MCP Guardian (eqtylab/mcp-guardian)

https://github.com/eqtylab/mcp-guardian

A homemade solution — created mostly for experimentation and learning.

What it does and general impression:

Very minimal in functionality — more of an MVP than a real product. Commits and releases are made by one developer over a few months, last updated in April.
Functionality includes:
Saving and launching servers listed in a configuration (likely via subprocess)
Adding a guard profile to intercept calls to and responses from MCP servers, prompting the user for confirmation through a simple UI.

Interesting ideas:

Query logging
A human-in-the-middle mode for manual approval of MCP requests and responses

Unla

https://github.com/AmoyLab/Unla

Pros

Its main (and currently the only mature) feature: automatic wrapping of APIs into the MCP protocol.
Written in Go — strong performance and scalability.
Logs all calls.
Has an OpenAPI-based configuration generator.

Cons and Limitations

Very basic and early-stage support for authentication in third-party APIs.
No plugin system; the project is young and in active development — potential maintenance overhead for forks.
Retries, rate limiting, and error handling must be implemented on your side.

Why Use It

It allows you to wrap APIs into MCP quickly and with minimal effort.

MCPX (MCP Gateway by Lunar.dev)

https://docs.lunar.dev/mcpx
https://github.com/TheLunarCompany/lunar/tree/main/mcpx#readme

Developed as a component of Lunar’s own ecosystem. Initially, the company built an API gateway for call monitoring; now the focus has shifted to access control for external LLM providers.
Features include rate limiting, quotas, monitoring, etc. MCPX is open source and designed for integration.

License: MIT (for MCPX)

Open Source Features

Very simple user authorization mechanism: a single constant string in ENV (shared by all users).
Supports connecting MCP servers, including subprocess stdio connections. Configuration is defined in JSON:
For stdio/local: uvx/npx + command and envs; critical envs set in the Docker container running MCPX.
For remote MCP: streamable-http/sse. Via the MCP SDK, standard OAuth 2.0 authorization is used if required. MCPX UI supports triggering the OAuth flow in an external system; all users then share the obtained credentials.
Exports metrics via Prometheus.
Tool customization support.
ACL (Access Control List) — but weak security, as group assignment is handled through a simple text header.

Paid / Proprietary Features

IAM integration with internal providers.
Manual review and approval workflows.
Integration with Lunar’s main AI Gateway product.
SLA support, etc.

Interesting Ideas

Tool customization

Creating new tools with specific parameters
Overriding selected parameters
Modifying tool descriptions in MCP endpoints

2. ACLs

Grouping tools into subsets of services and endpoints
Rules defining allowed or denied groups
Assigning rules to specific MCP clients

Kong AI Gateway for Protecting, Observing, and Managing MCP Servers

NOT OPEN SOURCE
https://github.com/Kong/kong

What it is: An API gateway — a platform for centralized configuration, publishing, management, and monitoring of APIs.
It consists of the API Gateway itself, a system of Plugins handling various operational aspects, and the Konnect management console.
The console and some plugins are commercial. It supports multi-service gateways, Kafka integration, and built-in API development tools.

Kong now also offers AI Gateway capabilities based on its platform, implemented through a system of specialized plugins for AI and LLM workloads.

Notable Features in AI Gateway

There are two mentions of MCP in the documentation:

The AI Gateway can expose external MCP servers (e.g., GitHub) via the gateway.
There is also an internal MCP server for managing the gateway itself.
→ https://developer.konghq.com/mcp/kong-mcp/get-started/

Example plugins:

ai-prompt-guard: a prompt guard plugin that stores regex-based rules to allow/deny LLM requests.
Semantic prompt guard: license required.
ai-proxy: proxy access to different model providers.
openid-connect: general-purpose, suitable for any API.

Interesting ideas:

The concept of a modular platform.
The flexible plugin system.

ARCH

https://github.com/katanemo/archgw

ArchGW (or simply Arch) is a proxy server for handling LLM requests and agent workflows.
It takes care of infrastructure aspects: prompt routing, security (guardrails), LLM connection management, and observability.

Key Capabilities

Prompt routing to target agents or LLMs based on intent.

How it works:

ArchGW receives an incoming prompt (from a chatbot, agent, or API).
The Intent Resolver layer analyzes the prompt:
Either through semantic matching (LLM or vector embeddings)
Or via rule-based configuration (regex, keywords).
The request is then routed to the appropriate prompt target:
An agent (e.g., “customer-support-agent”), or
A specific LLM backend (e.g., OpenAI GPT-4 or local LLaMA).
Envoy handles the network routing (HTTP/gRPC), while Arch adds logic for intelligent destination selection.

Guardrails — centralized definition and enforcement of safe behavior (e.g., preventing jailbreak attacks).

How it works:

All prompts go through a policy engine within ArchGW.
Policies are defined via configuration (e.g., “block DB deletion commands,” “filter PII”).
Mechanisms used:
Regex filters for quick heuristics.
LLM-based classifiers for semantic risk detection (e.g., jailbreaks, toxicity).
If a policy is violated:
The prompt is rejected, or
It is modified (e.g., masking PII).

Unified LLM Access — connects to multiple LLMs via a uniform interface; supports models and functions (e.g., function calls) via configuration.

Observability and Metrics — supports OpenTelemetry, W3C Trace Context, integrations with Jaeger, Signoz, Honeycomb, and more.

Built on Envoy Proxy — Arch operates as a containerized gateway alongside applications, extending Envoy for LLM-specific workloads.

Drawbacks

No built-in API control: guardrails exist, but OAuth, PBAC, or other enterprise-grade access policies are not provided out-of-the-box — external integration is required.

Remote MCP: Product Rationale and Open Source

This section is about modernizing APIs and connecting internal master systems of a company through MCP servers.
In other words, it’s not just “I, as a user, invoked a tool locally,” but rather “we have a corporate assistant that aggregates access to multiple systems and acts on behalf of the company.”

That’s an entirely different set of requirements:

Cataloging is no longer about restriction but about discovery — understanding which APIs and services are available.
The key topic is authorization, and it’s two-level: what a user can do within the assistant and what the assistant can do on behalf of the user in the master system.
There may also be a runtime — an environment where you can not only assemble tools but also run them, share access with colleagues, and test solutions.

If we break down potential stakeholders, we’ll see that MCP can serve different needs:

Large B2B automation teams see it as a way to reuse existing tools without endless custom integrations. MCP provides a standard that makes tool creation transparent and reproducible, and it also allows delegating integration logic to the systems that own the APIs.
Pilot teams and assistant creators want to focus on business logic, not infrastructure. MCP lets them grab ready-made tools, assemble an MVP agent easily, and test hypotheses without heavy setup.
Master systems and business domains gain the ability to offer “out-of-the-box” automation: if your API is already described in an MCP catalog, hundreds of assistants can use it without dedicated integration work.

A note on runtimes

Large teams usually don’t need a runtime — they already have their own environments where everything runs, and MCP just integrates into that.

However, pilots and experimenters really do need it: they want a space where they can quickly build an agent, test it, and show it to users without deploying a full infrastructure.

It’s logical to look toward lightweight, minimal runtimes — not over-engineered for high load, but capable of fast iteration and validation.
Such a runtime can be deployed using open-source tools like n8n or OpenWebUI, ensuring a very low entry barrier.

Let’s take a look at what’s interesting in the open-source ecosystem — or at least already available today.

MCP Registry by Anthropic

Purpose and Functionality

The MCP Registry is a centralized catalog of MCP servers.
It provides a RESTful API that allows you to discover, register, and manage MCP server descriptions.
Each server in the registry contains metadata (name, description, repository link) and information about images/packages (for example, a Docker image with options).
This makes it easier for developers and organizations to find existing integrations (for example, MCP servers for Google Drive, GitHub, Slack, etc.) without manually gathering them from scattered sources.

Pros

Simplifies discovery of MCP servers
Provides a unified API for search
Supports both MongoDB and in-memory storage
Community-managed (Anthropic and other contributors), allowing for verified, expanding server lists
Easy to self-host (Dockerized deployment)

Cons and Limitations

The registry itself is just a catalog; it does not run servers or provide execution infrastructure
No built-in security guarantees for all entries
Reliability depends on the stability of the web server and MongoDB; scaling relies on MongoDB replication

Why Use It

The registry is useful when you need to quickly find an existing MCP server for a specific task, check its parameters, and get its image or source code.

MCP Catalog by Docker

Purpose and Functionality

Docker MCP Catalog is a repository of containerized MCP servers on Docker Hub.
Each MCP server is packaged as a Docker image and centrally hosted.
The catalog groups MCP servers by category (DevTools, Data, Finance, etc.), making it easy to find and launch them directly from Docker Desktop.

According to Docker, the catalog includes over 100 verified MCP servers (for example, Stripe, Grafana, and others), divided into Docker-built and community-built (unverified) ones.

Each tool runs in an isolated container, which improves reproducibility and security.
The catalog is accessible through Docker Desktop and the Docker Hub web interface, complete with search, tags, and filters.

Pros

Ease of launch: just have Docker installed — docker run mcp/ starts the desired server with pre-configured parameters.
Security: Docker-built servers are verified, signed, and scanned for vulnerabilities. Container isolation minimizes risk (each runs with limited privileges).
Standardization: unified management (Docker), consistent interface (MCP), and built-in support in MCP Toolkit.
Solves the “where do I find an MCP server?” problem — everything is in one place.

Cons

Dependency on Docker: requires Docker Desktop or Engine — you can’t run images without it.
Limited ecosystem: the catalog contains many, but not all, MCP servers; missing ones must be containerized manually.
Potential costs: Docker Desktop with MCP Toolkit is part of a commercial product; private repositories or licensing may incur additional costs.

Why Use It

The catalog allows for fast experimentation and integration of services.

MCP Gateway by IBM (ContextForge)

Purpose and Functionality

IBM ContextForge MCP Gateway is a comprehensive gateway/registry/proxy.
Its goal is to serve as a central control point for tools, resources, and prompts for MCP clients.
ContextForge combines a gateway (merging multiple backends into one endpoint), a registry (federating data about servers/tools), and adapters.

Features

Protocol-flexible gateway: sits in front of any MCP server or REST API, providing a unified interface.
Supports multiple MCP protocol versions and allows clients to interact through a single HTTP/SSE endpoint with multiple tools.
Federation and registry: can automatically discover and merge multiple MCP gateways/registries (peer-to-peer) into one network view. Supports mDNS and Redis caching for synchronization and failover.
API virtualization: wraps ordinary REST/gRPC services into virtual MCP servers. Tools are declared via configuration, and the gateway can extract JSON schemas, handle HTTP headers/tokens, and manage retries and rate limiting.
Unified registries: stores not only tools but also prompts and resources (e.g., Jinja2 templates, static files, links) with versioning and rollback.
Security and observability: built-in authentication, authorization, auditing, rate limiting, automatic retries, and a live management UI.
Scalability: runs as a Python (FastAPI) app, deployable via Docker or cloud (Azure/AWS/Red Hat) with PostgreSQL/Redis for multi-cluster setups.

According to the README, ContextForge targets serious enterprise deployments, though the current version is alpha/beta and not production-ready without thorough testing.

Cons

Complexity: rich functionality makes configuration non-trivial
Maturity: still in early-beta stage without official IBM support
Infrastructure requirements: full functionality requires Docker/containers and databases (PostgreSQL, Redis)

Why Use It

ContextForge is valuable if you need centralized management of numerous MCP integrations and/or seamless integration with legacy systems.
It covers “LLM-ization of APIs” (creating virtual tools from existing APIs) and enterprise-grade security requirements (auth, auditing, etc.) through policy and isolation features.
In essence, ContextForge acts as a “neural bus” for communication between LLMs and services — ideal for building an internal assistant platform, similar to GPT plugins but within your own infrastructure.

Final Thoughts

I hope this helped you structure the underlying need that’s in the air and provided a few practical ideas for implementing MCP in corporate environments.

Rerankers in the RAG Pipeline

Annakokovina — Mon, 29 Sep 2025 10:44:30 GMT

A purely overview-style article

Even the best retriever can’t guarantee good answers if your pipeline is missing two critical components: reranker and rejector.
These are the parts that:

prioritize the most relevant documents,
filter out noise and duplicates,
and, when needed, confidently say: “I don’t have enough information to answer.”

In my new overview article, I break down:
🔍 why rerankers are essential in a RAG pipeline and the main approaches used,
🛑 what rejectors are and how they increase reliability,
📊 results of experiments on real Russian-language datasets,
⚙️ and which models performed best in practice.

Part 1 — The Absolute Basics

Retrieval-Augmented Generation (RAG) has become the de facto standard for building question-answering systems, chatbots, and enterprise assistants. However, even the best search systems and vector databases can’t guarantee that the top-k retrieved documents will all be equally useful for generating a final answer.

That’s where re-ranking comes into play.

In this article, we’ll break down:

why you need a reranker in a RAG pipeline
the main approaches to reranking
what “rejectors” are and why they matter
pros and cons of different techniques
results from tests on Russian-language datasets from major tech companies

⚠️ Disclaimer: This article is not meant to be a comprehensive survey or benchmark of reranking methods. It’s more of a research summary aimed at finding a near-universal solution for a platform — and exploring how to design a contract that allows integrating a domain-adapted reranker.

⚠️ Another important note: A “universal” reranker will never outperform a domain-specific solution built for a particular use case. Reranking is highly dependent on your data and context.

Why Do We Even Need a Reranker? 😳

A classic RAG pipeline looks like this:

The user query is converted into a vector.
The top-k closest documents are retrieved from a vector index.
A generator model (LLM) uses those documents to produce an answer.

The problem is:

Some documents might be only indirectly relevant.
Others could be too general.
And some may contain noise or duplicates.

A reranker helps prioritize the most useful context and feed the LLM only what’s truly helpful.

Part 2 — Still Basic, but Getting More Interesting

Approaches to Reranking

1. Semantic Re-ranking (Cross-Encoder)

💬 Comment: The most common go-to solution once you read a few papers on reranking.

A separate model takes a (query, document) pair and outputs a relevance score.
Popular models: MS MARCO cross-encoders, MonoT5, etc.
Accuracy is higher than with a bi-encoder, but it’s computationally expensive — each candidate must be processed individually.

When to use: When you care about high accuracy and the candidate set is relatively small (k ≤ 100).

2. Hybrid Scoring (BM25 + Embeddings)

💬 Comment: Kinda old-school and not exactly “21st century” vibes.

Combines classic retrieval (BM25, TF-IDF) with vector search results.
The reranker simply aggregates the two scores, often via a linear combination.

Pros: Fast, no additional model required.
Cons: Doesn’t always capture subtle semantic differences. A budget-friendly but low-quality approach.

3. Using an LLM as a Reranker

💬 Comment: The most obvious next step after “just sort by score” — usually appears when people still treat the LLM as a magic black box and don’t calculate inference costs. It can work with a fine-tuned smaller model, though.

Instead of a dedicated reranker, the LLM itself estimates how much a document helps answer the question.
Can be done with zero-shot or few-shot prompting, e.g.:
“Rank the following passages by their relevance to the query.”

Pros: Flexible, understands semantics and context.
Cons: Expensive and slow on large candidate sets.

4. Learning-to-Rank Models

💬 Comment: A solid approach if you’re building a local, domain-specific solution.

Trained on custom datasets (e.g., enterprise Q&A).
Methods: LambdaMART, RankNet, LightGBM-ranker.
Typical input features: embedding similarity, BM25 score, document length, keyword positions, etc.

Pros: Can be tailored to domain-specific needs.
Cons: Requires labeled data with “document relevance” annotations.

5. Post-Filtering with Heuristics

💬 Comment: In my opinion, an essential step that complements reranking rather than replacing it.

Sometimes, simple rules go a long way:

Remove duplicates.
Filter out documents that are too short or too long.
Exclude low-trust sources.

Pros: Fast and free.
Cons: Doesn’t solve semantic retrieval errors.

Отлично 👍 — вот продолжение твоей статьи (Part 3 & 4) в хорошем английском переводе, готовом к публикации на Medium. Я сохранил стиль, структуру и даже «токс-комменты», адаптировав их под англоязычную аудиторию, чтобы текст выглядел естественно и не потерял оригинальный тон:

Part 3 — Still Basic, But Getting More Interestingeeee

What Are Rejectors?

Sometimes, you shouldn’t try to generate an answer at all costs. If the user’s query is too far from the knowledge base — or no relevant documents were found — a rejector can save the day.

What Does a Rejector Do?

Analyzes the relevance of retrieved documents.
Determines whether there’s enough information to generate a meaningful answer.
If confidence is too low, it “rejects” the query or returns a fallback response — e.g., “Sorry, I don’t have enough information to answer that.”

Approaches to Building Rejectors

1. Threshold-Based Filtering

💬 Comment: Works only for homogeneous query types. If you have different “buckets,” you’ll need separate thresholds — or rephrasing.

If the maximum or average similarity score is below a certain threshold → reject the query.
Pros: Simple and fast.
Cons: The threshold must be tuned empirically and doesn’t account for semantic nuances.

2. LLM Classifier

💬 Comment: Same trade-offs as with using an LLM as a reranker.

The model receives the query and retrieved documents, then decides: “Can we answer this?”
Great for complex scenarios where rejection quality matters.
Cons: Expensive inference.

3. Binary Classifier (ML/DL)

💬 Comment: A very niche approach — I’ve rarely seen it used in production. Intuitively, the quality tends to be low.

A dedicated model (e.g., logistic regression, XGBoost, neural net) predicts reject / accept based on features like similarity score, text length, number of keyword matches, etc.
Cons: Requires a labeled dataset.

Why Use Rejectors?

✅ Improves trust: It’s better to honestly say “I don’t know” than to hallucinate an answer.
✅ Reduces LLM load: Don’t waste compute if the knowledge base has no relevant content.
✅ Enables flexible fallback logic: For example, route the query to a human operator or trigger a different pipeline.

In the end, a well-designed RAG pipeline typically has three quality control points:

Retriever — retrieves candidate documents.
Reranker — sorts them by relevance.
Rejector — decides whether there’s enough information to generate a valid answer.

Part 4 — Real-World Numbers

In the attached section, you’ll find the models we tested and the results of our benchmarks.

Let’s quickly revisit the datasets we used:

Dataset 1: Questions about internal documentation from support specialists who don’t know the answers to complex queries.
Dataset 2: General customer support questions — used to improve automation.

We evaluated the models using standard RAGAS metrics (they have excellent documentation — you can read more here: RAGAS Metrics).

Results & Takeaways

Unfortunately, there’s no single “best” universal reranker model — honestly. 😅
However, based on our research, we recommend deploying two models:

BAAI/bge-reranker-v2-minicpm-layerwise
jinaai/jina-reranker-v2-base-multilingual

Results summary:

jina performed exceptionally well in terms of speed and correctness.
minicpm-layerwise achieved higher retrieval_relevance and context_recall.
Both were among the fastest models we tested in terms of runtime.

✅ Final Thoughts:
Reranking and rejection are often overlooked stages in a RAG pipeline — but they’re crucial for building reliable, production-grade retrieval-augmented systems. A strong retriever is only the first step. To truly improve answer quality, you need a reranker to sort the context and a rejector to decide whether you should answer at all.

Paraphrasing in the RAG Pipeline ⚙️

Annakokovina — Fri, 12 Sep 2025 12:51:55 GMT

Everyone loves the “naive RAG.”
It works great in theory — but in practice, it often stumbles over the way people actually phrase their questions.

In this series, I’ll show how to fix that. Let’s start with the most common custom piece of a production RAG pipeline: paraphrasing.

Why retrieval alone isn’t enough 🤔

RAG lives or dies by retrieval quality.

If a user’s query doesn’t align with the wording in your knowledge base — because of synonyms, phrasing differences, or grammar quirks — you’ll often get poor or even empty results.

Example:

Query: "Vacation?"
DB:    "How to take additional paid leave"

Without paraphrasing, the system may completely miss that these are the same intent.

What is paraphrasing (and why bother)?

Paraphrasing = generating multiple alternative versions of a query while keeping the meaning intact.

🎯 Its role in RAG:

Boost recall → more ways to match relevant docs.
Improve robustness → less sensitive to how users phrase things.
Reduce zero hits → even if one query fails, another may succeed.

Where it fits in the pipeline ⚙️

Generate paraphrases

LLMs (multiple variations per query).
Specialized models like T5 or mBART.

2. Run parallel retrievals

One search per paraphrase.
Merge + re-rank results.

3. Filter & normalize

Drop duplicates and weak matches.
(Optional) use a reranker, e.g. cross-encoder.

4. Feed the generator

Pass the top-ranked docs across all paraphrased queries.

Common paraphrasing strategies 🛠️

Structured outputs → control model outputs via prompting or DSLs.
HyDE (Hypothetical Document Embeddings) → generate a “fake answer,” embed it, and use it as the query for semantic search.

Dont do quick experiment 🐱🦆

At RAG Platform, we compared two approaches — starting small, with kittens and ducklings.

Dataset:

TREC Conversational Assistance Track (CAsT) 2022 → multi-turn QA dialogues.
Stored in Qdrant with:
provenance (for answer validation).
text (QA pairs with embeddings).

Setup:

Embedding model: multilingual-e5-large (solid open-source baseline).
Paraphrasing model: gpt-4o-mini (cheap + simple).

Results 📉

Surprisingly, in our “perfect vacuum” test, HyDE didn’t beat the baseline.
In fact, it slightly decreased performance compared to plain RAG (no paraphrasing).

Next time, we’ll dig deeper into strategies — and share more practical setups. Stay tuned 😉

Advanced Paraphrasing Strategies in the RAG Pipeline 🎯

Simple paraphrasing is great.
But let’s go deeper — not just testing on some random dataset from outer space 🚀, but on the real queries of your stakeholders.

That’s where the real value is. After all, we want to improve our system for end users, not just add a pipeline block that looks good on synthetic benchmarks.

Before diving into practice, let’s review the main prompting strategies for paraphrasing.

Paraphrasing strategies with prompting 🧩

1. Original (baseline)

The user’s raw query is sent directly into the pipeline.

2. Paraphrase

Generate 3 paraphrases from the original query.

Run retrieval for each paraphrase in the vector DB.
Collect unique chunks across all results.
Sort them by score → take the top_k.
Pass these, along with the original query, into the LLM.

You are an expert in query expansion and natural language processing.
Your task is to generate an optimized search query based on the user’s input query.
Follow these guidelines:

1. Analyze the input query for key concepts and intent.
2. Identify any ambiguous terms or phrases that could be clarified.
3. Consider common synonyms, related terms, and alternative phrasings to improve the search.
4. If applicable, expand acronyms or abbreviations.
5. Incorporate any relevant context or domain-specific knowledge.
6. Ensure the expanded query maintains the original intent of the user’s question.
7. Prioritize clarity and specificity in the rewritten query.
8. If the original query is already optimal, you may return it unchanged.

[Structured Output Format]
Return 3 options.
Return the output in the following JSON format:
{
“expanded_queries”: [
“First optimized query…”,
“Second optimized query…”,
“Third optimized query…”
]
}
[Constraints]
Do not return keyword lists — only full natural language sentences or questions.
Each query variant should be unique but still faithful to the original intent.
Do not include any commentary or explanation — only return the JSON object as output.

3. Paraphrase-v2

Generate 2 paraphrases from the original query.
Retrieve chunks for the original + both paraphrases.
Merge paraphrase chunks → select the top 5.
Combine them with the top_k original chunks.
Send the merged set to the LLM.

4. Stepback

Reformulate the query into a more general or intermediate question.
Retrieve results for the stepback query.
Mix those results with chunks from the original query.
Pass everything together to the LLM.

You are an expert at generating step-back questions that help retrieve relevant information.

Your task is to take a specific user query and generate a broader, more general “step-back” question that will help find background information and context that might be useful for answering the original question.

Follow these guidelines:
1. Identify the specific topic or domain of the original query
2. Think about what broader concepts or background knowledge would be helpful
3. Generate a more general question that captures the fundamental concepts
4. The step-back question should be broader but still related to the original intent
5. Focus on foundational knowledge that would help understand the specific query

Examples:
- Original: “How do I configure OAuth2 with JWT tokens in Spring Boot?”
Step-back: “What are the key concepts and components of OAuth2 authentication?”

- Original: “Why is my Docker container running out of memory?”
Step-back: “How does memory management work in containerized applications?”

[Structured Output Format]
Return the output in the following JSON format:
{
“expanded_queries”: [
“Your step-back question here…”
]
}

[Constraints]
- Generate only ONE step-back question in the array
- Make it broader and more general than the original
- Keep it as a natural language question
- Do not include any commentary or explanation — only return the JSON object

5. Query Decomposition

Break down a complex query into simpler sub-queries.
Retrieve relevant chunks for each.
Merge and consolidate results before passing to the LLM.

👉 In the next part, we’ll share experiments with these strategies — not on synthetic datasets, but on stakeholder gold sets (real-world benchmarks from two of our core clients).

Let’s move to practice

As mentioned earlier, we tested not on synthetic datasets, but on gold sets from two of our key stakeholders.
The goal: deliver real value in our system — not just add a fancy pipeline block that only looks good on benchmarks.

Metrics we used 📐

We relied on standard RAGAS metrics (their docs are excellent — highly recommend reading: RAGAS metrics guide).

Stakeholder datasets 🗂️

We had two very different support scenarios:

1. Internal documentation support
Questions from support specialists who need answers from internal docs but don’t know them upfront.

Results:

Paraphrase-v2 → best at context recall/precision.
HyDE → lowest hallucination rate, but poor retrieval relevance.
Stepback → best balance between context accuracy and relevance.

2. External client support

Typical customer support queries, where automation is a priority.

Results:

Stepback → top performer on overall relevance + stable faithfulness.
Paraphrase-v2 → leads in faithfulness and retrieval relevance.
Query decomposition → weakest, struggling across all context-precision metrics.

Key takeaway 🔑

No single paraphrasing strategy works well across all types of data.

If you’re building a local solution, you can tailor a fixed strategy.
But in a general-purpose RAG platform, you need:
tools to measure performance per strategy,
the ability to switch/manipulate strategies via prompting,
or even dynamically select strategies at runtime.

👉 Just picking “one strategy that’s OK everywhere” doesn’t work well — same story as with rerankers.

A practical idea (borrowed from the X5 team):
Use a classifier at the pipeline entry → decide whether the query needs paraphrasing, and only run paraphrase on complex cases.

Models for Paraphrasing in the RAG Pipeline 🧠

Alright, now the big question: which models should we use to generate paraphrases?

Which models did we test? 🔍

We focused mostly on open-source, but one closed-release model slipped in: t-pro-it-1-2-fp8.
(It’s not available publicly.)

Fair note: our list is missing Llama-2–7b, which I suspect would perform well — fast, strong, and small enough to be practical. The only caveat: structured output isn’t officially supported, so it may or may not work.

Here’s the lineup we tested:

t-pro-it-1-2-fp8
qwen25-coder-32b-instruct
t-pro-it-2-0-fp8
gemma-3-27b-it
t-pro-it-1-0
t-lite-it-1-0

💡 If you have budget to burn, you can always go with closed models (OpenAI, Anthropic, etc.).

Evaluation factors ⚖️

We looked at three main aspects:

Size / scalability → ideally ~7B with alignment. 32B can also work, but bigger size means higher serving costs (inference isn’t free, even in-house).
Speed → critical for RAG pipelines, since most latency comes from LLM calls. Faster models = smoother pipeline.
Performance in our use case → relevance and robustness of generated paraphrases.

The winner 🏆

Our best performer was t-pro-it-1-2-fp8 — likely because it’s well-aligned for Russian and domain-specific phrasing.

My personal take 👀

The ideal paraphrasing model (for me) would be:

an open-source 7B model,
aligned specifically for paraphrasing + structured output,
small enough for fast inference and cheap scaling on A20 GPUs.

⚠️ And don’t be alarmed by the 0 results for t-lite-it-1-0 — it simply doesn’t support structured outputs.

Вот итоговый блок статьи на английском — в стиле Medium, чтобы выглядело как завершение серии:

Wrapping Up: Paraphrasing in the RAG Pipeline 💅

So, what did we learn?

✅ Paraphrasing is a powerful way to customize and clean up user input before retrieval.
✅ In production, at scale, you can optimize with a classifier that decides whether paraphrasing is needed — since real users write queries in very different ways.
✅ It’s easy to experiment: strategies can be swapped with prompting, which works much better in practice than hypothetical embeddings.
✅ The best strategy depends on the specific use case — there’s no universal winner.
✅ For the paraphrasing model itself, even a relatively small model is enough. In MVP setups, a 32B model with solid structured output performs very well.

🙏 Huge thanks to the LLMP team ❤️ — this research wouldn’t have been possible without them.

From PoC to Production: The AI Platform Playbook for 2025

Annakokovina — Wed, 13 Aug 2025 17:41:32 GMT

About the B2B AI Platform Market

By 2025, generative AI has moved beyond being a mere experiment or a deep-venture investment — it has become a mandatory element in digital transformation strategies.
According to McKinsey, over 65% of companies are already deploying LLM-based solutions, and Deloitte reports double-digit ROI growth among those that have brought projects into production.
However, behind these headline success stories lies another statistic: up to 70% of PoCs never scale due to lack of expertise, weak tooling, and the absence of a “bridge” between hypothesis testing and industrial deployment.

The market for agent platforms and RAG (retrieval-augmented generation) systems is growing explosively: Global Market Insights and MarketsandMarkets both estimate a CAGR of 35–40%, with the segment projected to reach tens of billions of dollars by the 2030s. The key question is no longer “Should we integrate an LLM?” but rather “How do we create a safe, observable, and scalable environment where the team can go from idea to production without losses at every stage?”

Demand and Market Needs

Product teams everywhere are tasked with “embedding AI features into services” — as part of strategies for increasing efficiency, reducing costs, and creating new products. Enterprises and SMBs are actively experimenting with LLM features (assistants, text auto-generation, case analysis, routine process automation), but scaling remains hard due to the lack of processes, tools, and expertise.

Industry reports show that generative AI adoption surged in 2024–2025, with companies reporting increased benefits — but also significant operational costs and security risks.
Key technology patterns today are RAG and agent orchestration (multi-tool calling / agents). RAG quickly became the standard for production applications, and the agent solutions market is showing high growth with multi-billion-dollar forecasts. This changes the platform requirements: it’s not enough to “hook up an LLM” — integration, observability, and security must be guaranteed.

Main Concerns

Misunderstanding LLM limitations and application types. Many users are misled by the idea that “LLM = solves everything.” They don’t distinguish between scenarios where one prompt is enough, and where RAG, integrations, or human validation are required.

Lack of technical resources for quick hypothesis testing or missing expertise (data/ML/infra) in product teams; experiments often stall until a formal PoC process is available. Research and case studies point to insufficient upskilling and the need for corporate training programs.

Misguided stakeholder requests. Examples: “Give me RAG — I’ll summarize requests there,” or “We gave the LLM our business description and asked it to detect fraud” — signs of pseudo-expertise and poorly defined tasks.

Security risks. Context leaks, uncontrolled external calls, lack of auditing and explainability. Surveys show that data security remains a top concern for executives.

No “path-to-product” funnel. Platforms are often built for advanced developers, leaving beginners without an easy path from hypothesis testing to value validation.

Stakeholder Segmentation & Typical Requests

“No expertise — just testing”
Profile: Employees, assistants, small teams looking to automate daily tasks.
Request: “AI in plain English,” fast onboarding, ready integrations and templates, low entry barrier, minimal settings, clear scenarios. Customization not a priority.

“Partial expertise — testing a hypothesis”
Profile: Product managers, analysts, early-stage R&D teams.
Request: Customization options, case-specific cookbooks/guides, pipeline visualization (low-code / n8n-like UX), basic quality control (Eval libraries, metrics).

“Full expertise — going into production”
Profile: ML teams, engineering departments with proven business value.
Request: Advanced tools — observability, tracing, fine-grained customization, independent model serving, A/B testing, CI/CD integration. Minimal “magic,” maximum control.

Key takeaway: Platforms are often targeted at segment 3, but to scale AI adoption inside a company, there must be a funnel from simple use cases to mature production solutions — enabling a smooth transition between segments (on-ramp → grow → scale).

Adoption Funnel: On-ramp → Grow → Scale

Mapping Needs to Tools

One-prompt, markup, simple automation → Prompt playground, prompt registry, prompt testing, ready metric library.
AI assistant (interactive, tool-calling, memory) → Toolset, MCP (company and external services), Assistants API, session manager, RAG, dialogue evaluation. API integration, session control, and security are crucial.
RAG / knowledge-driven scenarios → RAG / vector DB, document ingestion pipelines, retriever tuning, context management. Often comes as a standalone request (“build RAG on our data”).

Market Trends & Key Figures

RAG is now a mainstream pattern in production LLM scenarios.
Agentic AI market valued at ~$6–7B in 2024, projected to reach tens of billions by 2030+.
RAG-specific market expected to grow at ~40–50% CAGR in coming years.
Low-code/no-code remains a vital onboarding channel for non-technical users — valued in the tens of billions with double-digit annual growth.

So You’ve Decided to Build Your Own AI Platform — Key Focus Areas

Provide an on-ramp for segment #1 — ready-made automation templates, UI playground, “AI in plain English.”
Support segment 2 with cookbooks and low-code pipelines (visual design, RAG/assistant examples).
Maintain focus on segment #3 — observability, verifiability, custom deploys, CI/CD integrations.
Integrate Eval and prompt management as baseline features.
Build in compliance & security (RBAC, audit logs, data redaction, sandbox for external calls).
Measure progress as a funnel — track PoC→pilot→production conversion, average time and cost of hypothesis validation.

Principles for Building Such a Platform

Unified identity and access management
- Single sign-on across tools.
- Policy-based access control to data and tools with auditing.

Fast, affordable access to LLM providers
- Ability to test new models without lengthy approvals.
- Secure on-prem defaults.
- Hypothesis testing within fixed token limits (e.g., 10M tokens/month). Smooth scaling to dedicated serving.

Prompt management as a first-class entity
- Central registry with versioning and storage.
- Automatic baseline generation and prompt optimization per model.
- Prompt/model version comparison, test dataset generation.
- Auto-selection of optimal model for a request.

AI assistant support
- Prompt libraries and base tools.
- Custom tool connection via UI.
- Long-term user memory setup.
- End-to-end and online evaluation via traces.

Built-in RAG platform layer
- Indexed corporate sources with versioning.
- Support for documents, tables, and code.
- Transparent answers with cited sources.
- Dataset generation and metric collection from traces.

Unified Assistants layer as the integration hub
- Each tool, agent, or RAG pipeline is an agent domain.
- Standardized invocation, session, and memory management.
- Single “agent layer” for all LLM tools.

Quality & observability
- Metrics at model, prompt, agent, and pipeline levels.
- Benchmarks and critical function datasets built from real traces/domains.
- Alerts and model degradation detection.
- Online evaluation using real user traces.

What about ready-made solutions?

Everyone knows the giants: Azure, Google AI Studio, Yandex Cloud AI Studio. But what about smaller projects — what do their evaluations and capabilities look like? I’ve put together a selection of different projects that, in my opinion, provides a decent reflection of the market situation beyond the giants.

It’s important to note here that by “agent” I mean an LLM agent with tool calling, and by “assistant” — a one-prompt setup with memory/history.

From what I can sense about what’s happening in Russia right now

More or less, large businesses — and especially big tech — are building their own solutions. The events of recent years have hit hard for companies that relied on foreign vendors. For large businesses, it’s also simply easier to recoup investments at scale, since the Russian market doesn’t really have a proven commodity-level agent platform solution. Integrating open-source often turns out to be more expensive in the end. By the way, here’s my article on when it’s worth choosing open-source and when to develop your own solution: link. That said, big tech doesn’t shy away from buying small startups that create solid local solutions — especially if they can be deployed within a company’s secure perimeter.

Small and medium-sized businesses are split into two camps:

Those who build their business from the ground up on AI — for example, a one-prompt service for writing resumes.
Those operating in the physical world — beauty salons, gas stations, etc.

The first group usually develops their own solutions and has small but strong technical teams.
The second group doesn’t yet have a direct need for agents. They are beginner users, impressed by one-prompt and RAG solutions. Most likely, in a few years they will grow into a demand for agents.

There’s also a special segment in Russia: resource-extraction and government enterprises. These players are desperately trying to implement AI, but from my conversations with their representatives, it’s clear that AI rarely gets beyond pitch decks and strategies. With such companies, you need to start small and be especially careful during implementation.

Thanks for reading the article to the end)

Review of Open Source Solutions for Visual Agent Pipelines and LLM Flow Editors in the Enterprise…

Annakokovina — Sun, 29 Jun 2025 17:39:58 GMT

Review of Open Source Solutions for Visual Agent Pipelines and LLM Flow Editors in the Enterprise 🤖

Today I’d like to share an overview of open source solutions for building visual editors for agent pipelines and LLM flows — from the perspective of enterprise integration.

This post is a natural follow-up to my earlier article on “Build vs Open Source: When to Write Your Own and When to Reuse”. If you haven’t read it yet — now’s the perfect time:
🔗 https://medium.com/@annakokovina21/when-to-use-open-source-and-when-to-build-your-own-5cc0d53c2327

Part 1 — The Basics

Let’s begin by clarifying a few key concepts:

🔩 LLM Flow refers to an architecture where an LLM interacts with external data sources, tools, and memory in a step-by-step, controlled workflow.
These flows typically consist of sequential blocks such as:

Prompt → Response interactions
External API calls (to a database, CRM, search engine, etc.)
Control logic (conditions, loops, filters)

⚙️ Agent Pipeline refers to an architecture where the LLM acts as an intelligent agent that autonomously makes decisions and takes actions based on:

Instructions (goal definition)
Access to tools and APIs
Memory or history
Planning and reasoning mechanisms

🧠 How does an agent actually work?

Receives a goal (e.g., “Find a hotel in Paris and book it”)
Plans the steps (analyze → search → filter → book)
Uses external tools/APIs to complete each step
Adapts its behavior during execution (error handling, clarifications)

🔧 What does a typical agent system include?

LLM: the “brain” making decisions
Tools: APIs, DBs, web services
Memory: historical context or conversation logs
Prompt/Goal: the objective to achieve
Planner/Executor: logic for choosing the next action (manual or automatic)

Goal: Find an open source solution that’s easy to integrate with enterprise authentication systems, scalable to a large user base, and capable of supporting at least basic LLM flows — ideally multi-agent systems.

Part 2 — Solutions That Didn’t Quite Fit 📚

I reviewed 9 open source tools and started by analyzing those that didn’t fully meet our needs. Here’s a summary of each, along with enterprise integration “green” and “red” flags.

Node-RED

✅ Strong community (17k+ stars)
✅ Lots of nodes (especially IoT)
✅ Simple architecture, on-prem
🚩 Limited scalability
🚩 Weak auth support
🚩 Not designed for enterprise embedding
🚩 Focused on IoT

Huginn

✅ MIT License
✅ Self-hosted agent workflows
🚩 No external auth support
🚩 Not embeddable
🚩 Poor UX/UI

Budibase

✅ Low-code UI
✅ Custom auth (SSO, OAuth2)
✅ On-prem/Cloud
🚩 OSS version limited to 20 users
🚩 Enterprise features are paid only

Automatisch

✅ Simple, Zapier-like
✅ Open source, on-prem
🚩 Weak enterprise auth
🚩 Not scalable
🚩 Not built for AI agents

Temporal

✅ Highly scalable
✅ Multi-language SDK
✅ Ideal for microservices
✅ On-prem/Cloud
🚩 No visual editor
🚩 Requires DevOps skills

Part 3 — The Strong Contenders 📚

Let’s look closer at the most promising finalists.

Langflow 🎪

Focused on building LLM and agent pipelines
Integrates with LangChain and other AI tools
Great for rapid prototyping
Easy to extend but still maturing

License: MIT
GitHub Stars: 5k
Auth: Custom (FastAPI/LangChain), no built-in SSO
Hosting: On-prem
UI/UX: Visual editor, requires LangChain knowledge
Scalability: Docker-based, but poor docs for horizontal scaling, lacks built-in monitoring

n8n 🦄

Flexible visual editor for complex AI workflows
Active community and plugin ecosystem

License: Now under BSL (not OSS-compliant for embedded/SaaS use)
GitHub Stars: 100k+
Auth: Basic in OSS, full SSO/RBAC in paid Enterprise edition
Hosting: On-prem/Cloud
UI/UX: Drag-and-drop; no simple embedding (requires forking for UI customization)
Scalability: Docker, Redis-based scaling, manual CI/CD and monitoring setup

Appsmith 🐙

Ideal for building AI dashboards and internal tools
Not focused on LLM/agent pipelines

License: Apache 2.0
GitHub Stars: 30k+
Auth: Supports custom auth, OAuth2, Google, OIDC
UI/UX: UI editor, but lacks true flow-based editing
Scalability: Enterprise-ready, horizontal scaling, CI/CD support

Flowise AI 🦞

Designed for multi-agent orchestration with drag-and-drop LLM flows

License: Apache 2.0
GitHub Stars: 39k+
Auth: Role-based access, workspace isolation, API tokens, custom middleware
Hosting: Docker, Kubernetes-ready
UI/UX: Drag-and-drop with support for agents, tools, vector stores, etc.
Scalability: Microservice-based, supports LangChain, OpenAI, Chroma, HuggingFace

Windmill 🍄

Great as a backend automation engine
Higher technical barrier

License: MIT
GitHub Stars: 30k+
Auth: OpenID, OAuth2, SSO, RBAC
Hosting: Docker, Kubernetes
UI/UX: Hybrid of code and visual; mostly script-driven
Scalability: Well-documented Docker/Helm deployments, CI/CD ready

Part 4 — Conclusions 🥰

Goal Recap: We’re looking for an open source solution with enterprise auth support, scalable architecture, and a visual editor for at least LLM flows — ideally multi-agent orchestration.

✅ Most Promising: n8n and Flowise AI
Both are architecturally similar (Redis-based), support visual flows, and allow agent-like behavior.

n8n is mature and widely known, but has serious licensing issues for production use (especially for embedding or SaaS).
Flowise AI is newer and less battle-tested, but very promising. If it can be extended with custom auth easily — it may become a strong alternative.

Finalists at a Glance

n8n

Green Flags:
✅ Flexible visual editor
✅ AI agent support
✅ Extensible via plugins/API
✅ On-prem/Cloud

Red Flags:
🚩 BSL license as of 2024 — no embedding/SaaS without commercial license
🚩 No white-label/embed UI out of the box
🚩 Enterprise features (SSO, RBAC) are paid
🚩 Requires manual CI/CD, monitoring

Flowise AI

Green Flags:
✅ Apache 2.0 license
✅ LangChain/Hugging Face support
✅ Drag-and-drop LLM pipelines
✅ Role-based access and team collaboration
✅ Kubernetes-ready

Red Flags:
🚩 Young platform, still evolving
🚩 No built-in enterprise auth
🚩 Requires UI/UX customization

Appsmith

Green Flags:
✅ Apache 2.0
✅ Enterprise-friendly
✅ OAuth2/OIDC supported
✅ AI integrations
✅ On-prem/Cloud

Red Flags:
🚩 No visual flow editor
🚩 More focused on dashboards than LLM flows

Windmill

Green Flags:
✅ MIT license
✅ SSO, RBAC
✅ Strong backend engine
✅ CI/CD, on-prem/cloud

Red Flags:
🚩 Requires technical knowledge (code + visual)
🚩 Not fully no-code

Langflow

Green Flags:
✅ MIT license
✅ Visual editor for LLM pipelines
✅ Good for quick prototyping

Red Flags:
🚩 Weak documentation
🚩 No built-in auth
🚩 Limited scalability — requires custom setup

Thanks for reading till the end! ❤️
If you’re evaluating tools in this space or have experience with any of the above — I’d love to hear your thoughts.

#opensource #LLM #AIagents #enterpriseAI #n8n #flowise #langflow #nocode #automation #AIengineering