Episode-XXVII: Lessons Learned from Telco MCP Backend Experiments
Authors: Ian Hood, Robert Shaw, Fatih E. Nar
Deploying generative AI models in live environments can be challenging, especially when dealing with models that push the boundaries of available hardware. This article documents our experimental journey of deploying a tested and short-listed set of GenAI models on the Red Hat OpenShift AI platform using the vLLM runtime with NVIDIA GPUs, navigating various hardware constraints, and ultimately finding an optimal solution for a Telco Model Context Protocol (MCP) backend implementation used to build experimental agent-driven autonomous networks.
Architecture
We're building an experimental MCP Proxy Server implementation for our diagnostic, planning, and validation agents as part of the Telco-AIX AutoNet experiments. The MCP Proxy enables seamless integration between different GenAI models and external tools offered via the MCP protocol, making it well suited to complex telecom diagnostics.
The MCP Proxy gives agents a plug-and-play way to interact with different AI models. In our telecom context, this means the autonomous network agents can access network telemetry, configuration databases, and monitoring systems through a unified backend. The GenAI model serves as the reasoning engine behind our MCP Server, processing diagnostic queries and orchestrating tool interactions.
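To make this concrete, here is a minimal sketch of how one of our agents could call the GenAI backend sitting behind the MCP Proxy, using the OpenAI-compatible API that vLLM exposes. The endpoint and model name are the ones from our lab deployment shown later in this article; the diagnose() helper and its prompt are purely illustrative, and the MCP tool wiring itself is omitted.

# Illustrative only: a diagnostic agent asking the reasoning model behind the
# MCP Proxy for a first-pass analysis, via vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://qwen3-32b-vllm-latest-tme-aix.apps.sandbox01.narlabz.io/v1",
    api_key="not-used",  # vLLM only checks this if the server was started with an API key
)

def diagnose(symptom: str) -> str:
    """Ask the reasoning model for first-pass root-cause hypotheses."""
    response = client.completions.create(
        model="qwen3-32b-vllm-latest",
        prompt=f"Telco diagnostics.\nObserved symptom: {symptom}\nLikely root causes:",
        max_tokens=200,
        temperature=0,
    )
    return response.choices[0].text

if __name__ == "__main__":
    print(diagnose("gNodeB reports intermittent N2 interface resets"))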
Requirements -> Choice
Telecom network diagnostics require a model that can handle complex technical contexts. Network troubleshooting often involves analyzing extensive log sequences, correlating events across multiple systems, and understanding domain-specific protocols and configurations. Hence parameter size can matter for usefulness. For which model sizes make sense where, please see our previous article, The AI Rings. We evaluated multiple GenAI models (DeepSeek-R1, Gemma, Ollama, etc.) in various optimized versions and landed on Qwen3-32B.
The 32-billion-parameter Qwen3 model provided the reasoning depth necessary for the tasks we care about in Telco Autonomous Networks. The model's strong performance on technical content was also particularly important: Qwen3-32B demonstrated exceptional capability in understanding these varied formats and contexts during our evaluations.
-> See some of our Telco SME GenAI benchmark/test work details here: Link.
Integration
The Telco-AIX autonet experiment aims to create autonomous network infrastructure capable of self-diagnosis <-> self-healing. The MCP Proxy Server acts as the bridge between the AI reasoning layer and the network infrastructure.
When a network anomaly is detected, the system can automatically initiate a diagnostic session through the MCP Proxy Server, with Qwen3–32B analyzing the situation and determining appropriate remediation actions.
The Challenge
Goal: Deploy the original Qwen3-32B (a 32-billion-parameter model with reasoning) on OpenShift for production inference, using vLLM as the backend for our MCP server-based diagnostic agent. Also keep in mind that the vision of an agentic autonomous network architecture is based on distributed systems, where some (likely most) agents will run at different tiers/layers, with more constrained resources as we move from center to region to edge. See our AI Rings article for more details on which model can serve where and at what capacity.
Initial Hardware:
- 2x NVIDIA RTX 4090 GPUs (24GB VRAM each)
- Total: 48GB VRAM
- Model requirement: ~64GB in bfloat16 precision -> We thought we could squeeze 64GB into 48GB on the fly 😜 (see the quick math below)
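The quick math below is the sanity check we should have run first: plain arithmetic for the bf16 weight footprint plus a PyTorch query for the VRAM actually present (the 1.1x runtime overhead discussed later is not even included yet).

# Compare the rough bf16 weight footprint of Qwen3-32B against available VRAM.
import torch

params_billion = 32        # Qwen3-32B
bytes_per_param = 2        # bfloat16
weights_gb = params_billion * bytes_per_param   # ~64 GB of weights alone

total_vram_gb = sum(
    torch.cuda.get_device_properties(i).total_memory / 1024**3
    for i in range(torch.cuda.device_count())
)
print(f"Weights: ~{weights_gb} GB  |  Total VRAM: ~{total_vram_gb:.1f} GB")
# On 2x RTX 4090 this reports roughly 64 GB of weights against ~47 GB of VRAM.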
The Journey: Trial and Error
Attempt 1: Default Configuration
Our first attempt was optimistic, hoping the model would somehow fit with basic optimizations:
Configuration:
--gpu-memory-utilization=0.95
--kv-cache-dtype=fp8
--enforce-eager
--tensor-parallel-size=2
--max-model-len=16384

Result: CUDA Out of Memory (OOM)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 126.00 MiB.
GPU 0 has a total capacity of 23.53 GiB of which 62.56 MiB is free.

The reality check was immediate (no kidding, math is inevitable lol). The model weights alone require approximately 64GB in bfloat16 precision, far exceeding our available 48GB of VRAM.
Attempt 2: CPU Offloading to the Rescue
After researching vLLM's capabilities, we discovered that CPU offloading could enable deployment by splitting the model between GPU and system memory:
Configuration:
--gpu-memory-utilization=0.90
--max-model-len=8192          # Reduced from 16384
--tensor-parallel-size=2
--cpu-offload-gb=20           # Added CPU offloading

Result: Success! The model loaded:
Model loading took 10.5123 GiB and 39.387292 seconds
GPU KV cache size: 75,744 tokens
Maximum concurrency for 8,192 tokens per request: 9.25x

However, we noticed something concerning: each GPU was only using 10.5GB out of 24GB available. This meant we were wasting over half of our GPU memory!
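For readers who want to reproduce this behavior outside of the RHOAI ServingRuntime, roughly the same engine settings can be exercised with vLLM's offline Python API. This is a sketch, not our production deployment; it assumes a recent vLLM release that supports the cpu_offload_gb engine argument and uses the upstream Qwen/Qwen3-32B checkpoint.

# Rough offline equivalent of the Attempt 2 server flags: split Qwen3-32B
# across two 24GB GPUs plus 20GB of system RAM via CPU offloading.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",        # upstream checkpoint; we served it via RHOAI instead
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    cpu_offload_gb=20,             # the setting that finally let the model load
)

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["Ericsson 5G RAN signal loss can be caused by a"], params)
print(outputs[0].outputs[0].text)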
Attempt 3: Maximizing GPU Utilization
Determined to use our hardware more efficiently, we tried eliminating CPU offloading:
Configuration:
--gpu-memory-utilization=0.95
--cpu-offload-gb=0            # No CPU offload
--max-model-len=4096
--tensor-parallel-size=2

Result: OOM during sampler warmup
RuntimeError: CUDA out of memory occurred when warming up sampler
with 256 dummy requests. Model loaded: 22.5168 GiB per GPU

The model loaded successfully, but vLLM needs additional memory for the KV cache and inference operations. The default configuration tries to allocate space for 256 concurrent requests, which pushes us over the limit.
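The arithmetic behind that failure is worth spelling out. The sketch below estimates the per-token KV cache cost; the layer and head counts are assumptions based on Qwen3-32B's published configuration (verify them against the model's config.json before relying on the numbers).

# Back-of-envelope: how much memory does each token of KV cache cost, and how
# many tokens fit in what is left after the weights? Architecture numbers are
# assumptions to be checked against the model's config.json.
num_layers   = 64     # transformer layers (assumed)
num_kv_heads = 8      # grouped-query attention KV heads (assumed)
head_dim     = 128    # per-head dimension (assumed)
dtype_bytes  = 2      # bf16 KV cache

kv_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
print(f"KV cache per token: {kv_per_token / 1024:.0f} KiB")            # ~256 KiB

# With ~22.5 GiB of weights on a 24 GB card capped at 95% utilization, well
# under 1 GiB per GPU remains for KV cache and sampler buffers -- nowhere near
# enough for 256 warmup requests at up to 4,096 tokens each.
leftover_gib_total = 2 * 0.5   # rough leftover across both GPUs (assumed)
print(f"Tokens that fit: ~{int(leftover_gib_total * 1024**3 / kv_per_token):,}")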
Attempt 4: Finding the Sweet Spot
After several iterations, we found a configuration that balanced all constraints:
Configuration:
--gpu-memory-utilization=0.95
--cpu-offload-gb=12
--max-model-len=4096
--max-num-seqs=8              # Reduced concurrent requests
--tensor-parallel-size=2

Result: Success! The deployment was running:
Model loading took 18.4209 GiB and 23.447887 seconds
GPU KV cache size: 22,912 tokens
INFO: Started server process [4]
INFO: Application startup complete.

Performance Analysis
The Reality Check
With the model finally running, we tested inference performance:
curl -X 'POST' \
  'http://qwen3-32b-vllm-latest-tme-aix.apps.sandbox01.narlabz.io/v1/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-32b-vllm-latest",
    "prompt": "Ericsson 5G RAN signal loss can be caused by a",
    "max_tokens": 100,
    "temperature": 0
  }'

The response took over 3 minutes for 100 tokens! The logs revealed the painful truth:
Engine: Avg generation throughput: 0.4 tokens/s

At 0.4-0.5 tokens per second, this was barely usable. The CPU offloading created a severe bottleneck, with every token generation requiring memory transfers between GPU and CPU.
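A quick way to confirm what the engine logs report is to measure throughput from the client side. This sketch hits the same completions endpoint as the curl above and uses the usage.completion_tokens field of the OpenAI-compatible response; endpoint and model name are the ones from our deployment.

# Measure end-to-end generation throughput as seen by the client.
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://qwen3-32b-vllm-latest-tme-aix.apps.sandbox01.narlabz.io/v1",
    api_key="not-used",
)

start = time.time()
resp = client.completions.create(
    model="qwen3-32b-vllm-latest",
    prompt="Ericsson 5G RAN signal loss can be caused by a",
    max_tokens=100,
    temperature=0,
)
elapsed = time.time() - start
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.2f} tokens/s")
# With CPU offloading, this hovered around 0.4-0.5 tokens/s for us.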
Understanding the Bottleneck
CPU offloading works by keeping less frequently used model layers in system memory. During inference, these layers must be transferred to the GPU for computation, then potentially moved back. This creates a pipeline:
- CPU → GPU memory transfer of the offloaded layer
- GPU computation
- GPU → CPU memory transfer (to make room for the next offloaded layer)
- Repeat for each offloaded layer
With 12GB of the model offloaded to CPU, roughly 30% of computations involved these expensive transfers.
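The observed throughput lines up with a simple bandwidth estimate: if the offloaded slice of the weights has to cross the PCIe bus for every decoded token, the bus rather than the GPU sets the ceiling. A rough sketch follows, with the effective PCIe 4.0 x16 bandwidth taken as an assumption (vLLM's actual caching and transfer/compute overlap will shift the numbers).

# Rough ceiling on decode speed when offloaded weights must cross PCIe for
# every token. The bandwidth figure is an assumption for PCIe 4.0 x16.
offloaded_gb = 12.0    # weights kept in system RAM (cpu-offload-gb=12)
pcie_gbps = 25.0       # effective host<->GPU bandwidth in GB/s (assumed)

seconds_per_token = offloaded_gb / pcie_gbps
print(f"Transfer alone: ~{seconds_per_token:.2f} s/token "
      f"(~{1 / seconds_per_token:.1f} tokens/s ceiling)")
# ~0.48 s/token; add kernel launches, scheduling, and limited transfer/compute
# overlap, and the observed 0.4-0.5 tokens/s is no surprise.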
The Game Changer: NVIDIA Blackwell
Just when we were considering alternative models, we discovered our node also had a new NVIDIA Blackwell GPU with 96GB VRAM! This "may" change everything (see below for why we were skeptical, Nvidia being Nvidia when it comes to HW <-> SW backward compatibility).
First, we identified the GPU indices:
$ watch oc exec -it nvidia-driver-daemonset-418.94.202505191717-0-m4nt2 -- nvidia-smi

The RTX PRO 6000 Blackwell Workstation Edition features:
- 24,064 CUDA cores
- 5th Gen Tensor Cores with FP4 support
- 4000 AI TOPS
- 96GB GDDR7 memory with 1.8 TB/s bandwidth
- 600W TDP with double-flow-through cooling
Then we updated the deployment configuration:
Environment Variable:
CUDA_VISIBLE_DEVICES=2        # Select the RTX PRO 6000, which is marked with index 2

vLLM Arguments:
--gpu-memory-utilization=0.97
--cpu-offload-gb=0            # No CPU offload needed!
--max-model-len=8192          # Full context window
--max-num-seqs=64             # Many concurrent requests
--tensor-parallel-size=1      # Single GPU
--enforce-eager

With the entire model on a single GPU, performance "would/could" have improved dramatically: from 0.5 tokens/second to an expected 20-100 tokens/second, a 50-200x improvement!
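That expectation is more than hand-waving: for single-stream decoding, a memory-bandwidth roofline gives a useful upper bound, under the usual simplification that every generated token reads all of the weights once. A sketch using the spec-sheet bandwidth above:

# Memory-bandwidth roofline for single-request decoding on the RTX PRO 6000.
weights_gb = 64.0             # Qwen3-32B in bf16
mem_bandwidth_gbps = 1800.0   # 1.8 TB/s GDDR7, from the spec list above

tokens_per_s = mem_bandwidth_gbps / weights_gb
print(f"~{tokens_per_s:.0f} tokens/s single-stream upper bound at bf16")
# ~28 tokens/s; batching, FP8/FP4 weights, or speculative decoding push this
# higher, hence the 20-100 tokens/s range quoted above.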
The Plot Twist: Bleeding Edge Has Its Price!
However, we encountered a fundamental compatibility issue at deployment time:
NVIDIA RTX PRO 6000 Blackwell Workstation Edition with CUDA capability sm_120
is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities
sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.

The Blackwell architecture (SM 12.0) is too new for the current PyTorch build in the vLLM container. This cutting-edge hardware requires PyTorch 2.5+ with CUDA 12.6+, which isn't yet available in the standard vLLM images. 😭
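A quick way to catch this mismatch before rolling out a deployment is to compare the GPU's compute capability with the architectures the installed PyTorch wheel was built for. A minimal sketch using standard PyTorch APIs:

# Check whether the installed PyTorch build ships kernels for the local GPUs.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name} (sm_{major}{minor})")

# Architectures this PyTorch wheel was compiled for, e.g. ['sm_80', 'sm_86', 'sm_90']
print("Compiled for:", torch.cuda.get_arch_list())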
Note: Our beloved friend Doug Smith has walked a similar path with new GPU(s); see his work here for resolutions: Link.
Alternate Universe: Use Quantized Versions
We had looked around for quantized versions earlier, but they were not yet available; if they had been, the entire journey would have been different. (Now they are!) Instead of wrestling with CPU offloading and bleeding-edge hardware compatibility, we could have leveraged quantized versions that fit comfortably on our RTX 4090s (and eventually did).
Available Quantized Options
Red Hat AI provides pre-quantized versions of Qwen3-32B specifically optimized for RHOAI deployments:
- FP8: RedHatAI/Qwen3-32B-FP8-dynamic
  - Model size: ~34GB (50% reduction)
  - Would fit entirely on dual RTX 4090s without CPU offloading
  - Minimal quality loss (~1-2%)
- INT4: RedHatAI/Qwen3-32B-quantized.w4a16
  - Model size: ~16GB (75% reduction)
  - Could run on a single RTX 4090
  - Moderate quality loss (~3-5%)
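Because the quantization is baked into these checkpoints, serving them is mostly a matter of pointing vLLM at the right model ID; vLLM should pick up the compressed-tensors quantization scheme from the checkpoint config (treat that as an assumption to verify for your vLLM version). Outside of RHOAI, where the ModelCar packaging described next applies, the INT4 checkpoint can be exercised directly with the offline API, mirroring the single-RTX-4090 settings we ended up using below:

# Sketch: run the pre-quantized INT4 checkpoint on a single 24GB GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Qwen3-32B-quantized.w4a16",  # ~16GB of weights
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    enforce_eager=True,
)

out = llm.generate(
    ["Nokia 5G RAN connection issues can be observed by "],
    SamplingParams(temperature=1, max_tokens=100),
)
print(out[0].outputs[0].text)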
To use these models, you need to build your own ModelCar images to load them as catalog items on RHOAI. See details: Build and Deploy ModelCar Container on OpenShift AI.
-> We did the work for you so you can just consume them from here: Quantized Qwen3 with ModelCars.
With the FP8 version, we could achieve 20–50 tokens/second on our dual RTX 4090s without any CPU offloading bottlenecks.
With the Qwen3-32B INT4 quantized version on a single RTX 4090 (24GB):
--gpu-memory-utilization=0.95
--kv-cache-dtype=auto
--enforce-eager
--tensor-parallel-size=1
--max-model-len=8192

Simple Telco Test:
% curl -X 'POST' 'http://qwen3-32b-vllm-latest-tme-aix.apps.sandbox01.narlabz.io/v1/completions' \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3-32b-int4-vllm-latest",
       "prompt": "Nokia 5G RAN connection issues can be observed by ",
       "max_tokens": 400,
       "temperature": 1}'
{"id":"cmpl-3393fb8d812d404a8fa7e5ea7aae84f3","object":"text_completion","created":1750116679,"model":"qwen3–32b-int4-vllm-latest","choices":
[{"index":0,"text":"5G RAN network users, and are related to problems with
connecting to 5G networks, or problems with 5G coverage or availability of 5G
services in a particular area.
\nCommon causes for the Nokia 5G RAN connection issues include software bugs,
outdated firmware or software, network congestion, hardware problems,
or configuration errors.
\nTroubleshooting steps for the Nokia 5G RAN connection issues may include
checking for network coverage, updating software, checking hardware,
restarting the device or router, contacting your service provider, or
resetting network settings.
\n1. What are some common causes for Nokia 5G RAN connection issues?
\n2. How can I check if I have a network coverage problem with my Nokia 5G
RAN connection?
\n3. What should I do if my Nokia 5G RAN connection is experiencing software
issues?
\n4. How can I fix a Nokia 5G RAN connection that is experiencing hardware
problems?
\n5. What can I do if my Nokia 5G RAN connection is experiencing configuration
errors?
\n6. How can I improve the speed and reliability of my Nokia 5G RAN connection?
\n7. What should I do if my Nokia 5G RAN connection is experiencing high
latency?
\n8. How can I optimize my Nokia 5g RAN connection for gaming or streaming?
\n1. What are some common causes for Nokia 5G RAN connection issues?
\nThere are many possible causes for Nokia 5G RAN connection issues, including:
\n• Network Congestion: Too many users in a given area can cause network
congestion and affect the performance of 5G services.
\n• Outdated Firmware or Software: If your device's firmware or software is
outdated, it may not be compatible with the latest 5G technology.
\n• Configuration Errors: Incorrect settings in the device's configuration can
cause problems with the 5G RAN connection.
\n• Poor Signal Strength: If your 5G signal is too weak,",
"logprobs":null,"finish_reason":"length","stop_reason":null,
"prompt_logprobs":null}],"usage":{"prompt_tokens":14,"total_tokens":414,
"completion_tokens":400,"prompt_tokens_details":null}}%
Inference Result:
INFO 06–16 23:17:08 [async_llm.py:228] Added request cmpl-00fc0a80c2ca4d8597baec749b95d706–0.
INFO 06–16 23:17:13 [loggers.py:87] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 10.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
INFO 06–16 23:17:23 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 06–16 23:17:33 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.5%, Prefix cache hit rate: 0.0%
INFO 06–16 23:17:43 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.4%, Prefix cache hit rate: 0.0%
INFO 06–16 23:17:53 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 8.3%, Prefix cache hit rate: 0.0%
INFO: 10.128.0.2:53254 - "POST /v1/completions HTTP/1.1" 200 OK

As is, this is not too bad for an MCP backend serving a simple Telco diagnostic/planning agent (to be honest, the model outputs are in a gray zone, since we ended up using a quantized version of the selected model).
Key Findings and Best Practices
Memory Calculations Matter
Understanding memory requirements upfront saves debugging time:
Model Memory = Parameters × Precision × Overhead
Qwen3-32B = 32B × 2 bytes (bf16) × 1.1 ≈ 70GB

The 1.1x overhead accounts for activation memory and other runtime requirements.
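The same formula is handy as a small helper when comparing precisions; this sketch reuses the article's 1.1x overhead factor and nothing else:

# Estimate serving memory for a model at different precisions (weights times
# the 1.1x runtime overhead used above; KV cache comes on top of this).
def model_memory_gb(params_billion: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    """Weights-only estimate scaled by the runtime overhead factor."""
    return params_billion * bytes_per_param * overhead

for name, width in [("bf16", 2.0), ("fp8", 1.0), ("int4 (w4a16)", 0.5)]:
    print(f"Qwen3-32B @ {name}: ~{model_memory_gb(32, width):.0f} GB")
# bf16 ~70 GB, fp8 ~35 GB, int4 ~18 GB -- consistent with the ~34GB and ~16GB
# checkpoint sizes cited earlier once the overhead factor is stripped out.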
GPU P2P Warning — Not a Showstopper
During deployment, you might see:
WARNING: Custom allreduce is disabled because your platform lacks GPU P2P capability
This warning about peer-to-peer communication between GPUs only impacts performance by 5-10%. The GPUs still work together through system memory. In our case, with dual RTX 4090s, this was likely due to the lack of NVLink bridges (which would be nice to have for faster context processing when tackling larger problems).
OpenShift-Specific Considerations
Deploying on OpenShift required special attention to:
1. Environment variables vs arguments: CUDA_VISIBLE_DEVICES must be set as an environment variable, not a vLLM argument
2. Device selection syntax: The --device flag only accepts device types (cuda, cpu), not specific indices
3. Route timeouts: Default timeouts are too short for slow inference.
$ oc annotate route qwen3-32b-vllm-latest --overwrite haproxy.router.openshift.io/timeout=600s
4. Resource allocation: Ensure your deployment requests appropriate GPU resources
Optimization Decision Tree
When facing similar challenges, follow this hierarchy:
- Consider quantized versions first: they often offer the best balance of quality and efficiency, though quantization may lower the relevance/accuracy of your outcomes.
- Avoid CPU offloading if possible: the penalty in tokens/s is severe.
- Reduce context length before reducing GPU utilization, keeping in mind that a shorter context may break RCA/troubleshooting work needed for remediation paths.
- Limit concurrent requests to fit the KV cache in the remaining memory, noting that this may prevent multiple agents from reaching the same backend at once.
- Verify hardware compatibility before assuming newer is better!
Implications for Telecom AIOps
The journey taught us valuable lessons about deploying large models for production telecom operations. While CPU offloading enabled deployment on constrained hardware, the performance wasn’t suitable for real-time diagnostics. In telecom operations, where network issues require immediate attention, inference speed directly impacts mean time to repair (MTTR).
The MCP Server architecture proved flexible enough to work with various deployment configurations. However, for production use, proper hardware sizing is crucial. While the RTX PRO 6000’s 96GB VRAM would provide the ideal headroom for real-time diagnostics, the lack of software support makes it currently unusable. This highlights the importance of balancing cutting-edge hardware with software ecosystem maturity.
For distributed autonomous network architectures, where agents run at different tiers with varying resource constraints, quantized models offer the best path forward:
- Core/Central: Full precision or FP8 models on high-end GPUs
- Regional: FP8 or INT4 models on mid-range GPUs
- Edge: INT4 or smaller models on constrained hardware.
Conclusion
Deploying a "useful" GenAI model with the vLLM runtime was a journey of discovery, revealing both the possibilities and limitations of running large models on constrained hardware. While it's technically possible to run a 32B-parameter model on 48GB of VRAM using CPU offloading, the performance penalty makes it impractical for production use, especially for latency-sensitive applications like network diagnostics.
The experience reinforced several key principles:
- Understand your model’s memory requirements before deployment.
- CPU offloading is a last resort, not a performance optimization.
- Sometimes the best optimization is better hardware… if that’s actually supported.
- Quantized models often provide the optimal balance for production deployments.
- MCP Server architectures benefit from models with strong reasoning capabilities.
For teams building similar systems, consider your performance requirements carefully. If real-time inference is critical, ensure your hardware can accommodate the entire model in GPU memory, or better yet, use quantized versions that fit comfortably within your constraints. For batch processing or scenarios where latency is less critical, CPU offloading might be acceptable.
The autonomous network vision requires not just intelligent models but also infrastructure capable of running them efficiently. As we continue developing the Telco-AIX experiments, proper hardware sizing will remain a critical consideration for achieving truly AI-driven network operations across a distributed network fabric, where resources at some tiers/layers are more constrained and expensive than at others.

