
NoEncode RAG with MCP—A new FedRAG core abstraction

FedRAG now supports building knowledge stores that connect to MCP servers (or sources), so you can fine-tune RAG systems to better adapt to third-party, MCP-provided context.

Hello, again!

It’s been a couple of weeks since my introductory blog post about FedRAG—a new framework for fine-tuning RAG systems across both centralized and federated architectures. More than a few things have happened during that span of time:

  • added an Unsloth.ai integration for fast fine-tuning of your generator models
  • added the start of a new fed_rag.evals module, which supports benchmarking RAG systems against popular benchmarks like MMLU, HotpotQA, and SQuADv2
  • cleaned up import patterns and added a new cookbooks section to the docs!

Today, I’m excited to share the latest addition to FedRAG: a new core abstraction for building a different kind of RAG system, namely NoEncode RAG systems.

In this blog post, I’ll first introduce the concept of NoEncode RAG systems and compare them to traditional RAG. Then, I’ll walk through a specific implementation that has been added to FedRAG, which utilizes Anthropic’s Model Context Protocol (MCP).

NoEncode RAG vs Traditional RAG

Traditional RAG systems comprise three main components: a retriever (or embedding) model, a knowledge store containing knowledge context or chunks, and finally a generator LLM. Knowledge contained in the knowledge store has previously been encoded by the retriever model and stored for future retrieval. Specifically, when a user queries a traditional RAG system, the query gets encoded and the most relevant knowledge chunks are retrieved from the knowledge store according to a specified distance measure between the encoded query and contexts. These knowledge chunks are then passed, along with the query, to the LLM generator for response generation.
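
For intuition, here's a minimal sketch of that traditional retrieval step, written with sentence-transformers purely as an illustration (the model name and toy chunks below are arbitrary examples, not part of FedRAG):

from sentence_transformers import SentenceTransformer, util

# illustrative retriever model and pre-encoded knowledge chunks
retriever = SentenceTransformer("all-MiniLM-L6-v2")
chunks = [
    "MCP is an open protocol for connecting LLM applications to external tools and data.",
    "RAG augments a generator LLM with retrieved context.",
]
chunk_embeddings = retriever.encode(chunks, convert_to_tensor=True)

# at query time: encode the query, then rank chunks by similarity
query_embedding = retriever.encode("What is MCP?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, chunk_embeddings)[0]
top_chunk = chunks[int(scores.argmax())]
# the top chunks, along with the query, are then passed to the generator LLM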

In contrast, with NoEncode RAG, knowledge chunks or context are retrieved from the knowledge store using the original natural-language query rather than an encoded representation of it. In other words, assembling a NoEncode RAG system involves only defining a NoEncode knowledge store and a generator model—no retriever model necessary!

Note: NoEncode knowledge stores may connect to knowledge sources that are themselves traditional RAG systems (which do involve encoding internally). However, the key distinction is that the main RAG system developer doesn’t need to manage any encoding — that complexity is abstracted away.
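
In code, the NoEncode retrieval step boils down to something like the following sketch (the function and source interface here are illustrative, not FedRAG's actual classes):

async def no_encode_retrieve(query: str, sources) -> list[str]:
    # forward the raw natural-language query to every attached source;
    # any encoding a source performs internally is invisible to us
    chunks: list[str] = []
    for source in sources:
        chunks.extend(await source.retrieve(query))
    return chunks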

The MCP Knowledge Store and MCP Knowledge Sources

In FedRAG, one special instance of the NoEncode Knowledge Store is the new MCPKnowledgeStore class. In technical terms, this new knowledge store is an MCP Client Host that creates clients—supporting both stdio and streamable HTTP transports—and interacts with its attached MCP Knowledge Sources. As such, MCP Knowledge Sources are technically MCP Servers! For more details on the overall MCP architecture and makeup, please see the official docs.

from fed_rag.knowledge_stores.no_encode import (
    MCPKnowledgeStore,
    MCPStdioKnowledgeSource,
)
from mcp import StdioServerParameters

server_params = StdioServerParameters(
    command="uv",
    args=[
        "run",
        "my_awesome_mcp_server.py",
    ],
)

mcp_source = MCPStdioKnowledgeSource(
    name="my-awesome-mcp-server",
    server_params=server_params,
    tool_name="KnowledgeTool",
    query_param_name="query",
)

knowledge_store = MCPKnowledgeStore().add_source(mcp_source)
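
The snippet above wires up a stdio source. Since the store also supports streamable HTTP transports, a remote MCP server can be attached in much the same way. The sketch below assumes an HTTP counterpart class and a url parameter; the exact names may differ, so check the FedRAG docs for the current API.

from fed_rag.knowledge_stores.no_encode import (
    MCPKnowledgeStore,
    MCPStreamableHttpKnowledgeSource,  # assumed name for the HTTP source class
)

http_source = MCPStreamableHttpKnowledgeSource(
    name="my-remote-mcp-server",
    url="https://example.com/mcp",  # placeholder endpoint
    tool_name="KnowledgeTool",
    query_param_name="query",
)

# multiple sources can be attached to the same store
knowledge_store = (
    MCPKnowledgeStore()
    .add_source(mcp_source)
    .add_source(http_source)
)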

Retrieving from the MCP Knowledge Store

You can query the MCP knowledge store using natural language, which returns a list of KnowledgeNode objects.

# keeping the code from the last code block...
from fed_rag.data_structures import KnowledgeNode

# MCP knowledge stores and sources are async-first classes
result: list[KnowledgeNode] = await knowledge_store.retrieve("What is MCP?")

Under the hood, the MCP Host (i.e., the knowledge store) creates a client (in this case, a stdio client) and connects with the MCP server called my-awesome-mcp-server. It performs a tool call for KnowledgeTool, which exposes an interface taking a parameter called query and returns text content. This text content gets bundled in a CallToolResult, which then gets converted to a list of KnowledgeNode objects.
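
For a rough sense of what that looks like at the MCP layer, the sketch below makes the same tool call directly with the MCP Python SDK. It is a simplified illustration rather than FedRAG's actual internal code, and it reuses server_params from the first code block.

# simplified sketch of the underlying MCP interaction (not FedRAG internals)
from mcp import ClientSession
from mcp.client.stdio import stdio_client

async def call_knowledge_tool(query: str):
    # open a stdio connection to the server and start a client session
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # call the tool with a single "query" argument; returns a CallToolResult
            return await session.call_tool("KnowledgeTool", {"query": query})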

Retrieving Directly from the MCP Knowledge Source

On the other hand, retrieving from the MCP knowledge source itself returns the raw CallToolResult. This is helpful when you know the default conversion to KnowledgeNode won't be suitable for the MCP server you're working with. By examining the exact format of the text contents contained in the returned CallToolResult, you can define a custom converter function and attach it to the MCP knowledge source.

# keeping the code from the previous code blocks
import json
from typing import Any

from mcp.types import CallToolResult

call_tool_result: CallToolResult = await mcp_source.retrieve("What is MCP?")
print(call_tool_result.content)  # a list of Content types

# or, see the default converter in action
knowledge_nodes = mcp_source.call_tool_result_to_knowledge_nodes_list(
    call_tool_result
)

# define a custom converter
def my_custom_converter(
    result: CallToolResult,
    metadata: dict[str, Any] | None = None,
) -> list[KnowledgeNode]:
    return [...]  # some nodes extracted from result

# attach the new converter to the mcp_source
mcp_source = mcp_source.with_converter(my_custom_converter)

Adding a Reranker to the Knowledge Store

Currently, the MCP Knowledge Store sends queries to all of its attached MCP knowledge sources. To better prioritize the various nodes returned from different sources, we support defining and using a custom reranker callback. This reranker takes all knowledge nodes received from the various MCP sources and performs re-ranking against the original query.

# keeping code from previous code blocks
from sentence_transformers import CrossEncoder

# define the reranker callback
def reranker_callback(
    nodes: list[KnowledgeNode], query: str
) -> list[tuple[float, KnowledgeNode]]:
    model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L2")
    # score each [query, passage] pair with the cross-encoder
    model_inputs = [[query, n.text_content] for n in nodes]
    scores = model.predict(model_inputs)

    # sort the (score, node) pairs in decreasing order of score
    results = [(score, node) for score, node in zip(scores, nodes)]
    return sorted(results, key=lambda x: x[0], reverse=True)

# use it in the knowledge store
knowledge_store = knowledge_store.with_reranker(reranker_callback)

Assembling a NoEncode RAG System

With a NoEncode knowledge store in hand, you can combine it with a generator LLM of your choosing to build a NoEncode RAG system!

# keeping code from all previous code blocks
from fed_rag import AsyncNoEncodeRAGSystem, RAGConfig
from fed_rag.generators import HFPretrainedModelGenerator

# define your generator
generator = HFPretrainedModelGenerator(
    model_name="Qwen/Qwen2.5-3B",
    ...
)

# assemble your RAG system
rag_system = AsyncNoEncodeRAGSystem(
    knowledge_store=knowledge_store,
    generator=generator,
    rag_config=RAGConfig(top_k=2),
)

The interface for NoEncode RAG systems remains the same as that for traditional RAG systems in FedRAG.

# keeping code from all previous code blocks
res = await rag_system.query("What is MCP?")

# final RAG response
print(res)

# a peek at the retrieved source nodes from the MCP knowledge store
for ix, sn in enumerate(res.source_nodes):
    print(
        f"SOURCE NODE {ix}:\nSCORE: {sn.score}\n"
        f"SOURCE: {sn.metadata['name']}\n"
        f"TEXT: {sn.text_content[:500]}\n\n"
    )

And, what about Fine-tuning?

You can use any of the already supported GeneratorTrainer classes in FedRAG to fine-tune your NoEncode RAG system to better adapt to the context emanating from the MCP knowledge stores!

# keeping code from previous blocks
from datasets import Dataset
from fed_rag.trainers.huggingface import HuggingFaceTrainerForRALT

# define a train dataset
train_dataset = Dataset.from_dict(
    # examples from Commonsense QA
    {
        "query": [
            "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?",
            "Sammy wanted to go to where the people were. Where might he go?",
            "To locate a choker not located in a jewelry box or boutique where would you go?",
            "Google Maps and other highway and street GPS services have replaced what?",
            "The fox walked from the city into the forest, what was it looking for?",
        ],
        "response": [
            "ignore",
            "populated areas",
            "jewelry store",
            "atlas",
            "natural habitat",
        ],
    }
)

# the trainer object
generator_trainer = HuggingFaceTrainerForRALT(
    rag_system=rag_system.to_sync(),  # trainers only work with sync objects
    train_dataset=train_dataset,
    # training_arguments=...  # optional transformers.TrainingArguments
)

You can also transform the fine-tuning task to a federated one as before:

# keeping code from previous blocks
from fed_rag.trainer_managers.huggingface import HuggingFaceRAGTrainerManager

manager = HuggingFaceRAGTrainerManager(
    mode="generator",
    generator_trainer=generator_trainer,
)
train_result = manager.train()
print(f"loss: {train_result.loss}")

# get your federated learning task (optional)
fl_task = manager.get_federated_task()

In Conclusion

NoEncode RAG paves the way for tapping into many knowledge sources, such as those provided by MCP servers. FedRAG's new MCP knowledge store and knowledge source classes enable direct access to these servers. With just a few lines of code, you can connect to multiple knowledge sources, apply custom rerankers, and fine-tune your models — all while working exclusively with natural language throughout your pipeline.

Links to check out!

To learn more:

Written by Andrei

Founding software/machine-learning engineer at LlamaIndex. (https://ca.linkedin.com/in/nerdai)