Data Science Collective - Medium

Survivorship Bias For Humans: How to Spot the Data That Isn’t There

Kalle Georgiev — Tue, 26 May 2026 12:17:05 GMT

How the data you can’t see affects everything you can

Evaluation Sets Have a Half-Life. Most Teams Pretend They Don’t.

Zenefa Rahaman, PhD — Mon, 25 May 2026 11:22:46 GMT

Why the benchmark you trusted six months ago measures yesterday’s problem — and what eval maintenance actually looks like.

A familiar pattern shows up across production AI systems. A team builds an evaluation set during launch, validates the system against it, ships, and watches the dashboard hold steady for months. Then a production incident appears that the benchmark never caught.

The benchmark didn’t fail. It was measuring an earlier version of reality.

Evaluation sets are not abstract measurements. They are operational artifacts, and artifacts age. They capture what failure looked like the day they were authored. Production failure modes don’t stand still.

What makes evaluation decay dangerous is that it’s silent. Dashboards remain green while production behavior drifts underneath them. The benchmark remains valid under the conditions it was authored under — and increasingly irrelevant to current ones.

This piece is about why evaluation sets decay, what they decay into, and what evaluation maintenance actually requires in production.

Figure 1 — What the benchmark hides: pass rates stay flat while production failure modes grow underneath them. (Image created by author)

KEY TAKEAWAYS

Evaluation sets decay structurally, not because they were authored poorly, but because the environment they were authored against changes continuously.
Four mechanisms drive most evaluation decay: data drift, user adaptation, distribution shift in the world, and emergent failure modes.
The most dangerous evaluation set is not the inaccurate one. It is the one that keeps passing while quietly losing coverage of current production failures
A useful diagnostic: when did your team last add a test case for a failure mode that did not exist when the eval was originally authored?
A passing benchmark is not evidence that the system is working. It is evidence that the system is working against yesterday’s questions.

WHY EVALUATION SETS FEEL PERMANENT WHEN THEY AREN’T

Teams treat evaluation sets as fixed assets because evaluation creation usually happens as a discrete engineering project. During a launch sprint, someone authors a benchmark, samples representative examples, defines labels, validates scoring logic, and operationalizes reporting. The eval has a clear creation moment, a clear owner, and a visible engineering cost.

Once deployed, it quietly transitions into an assumed-correct state.

Over time, the benchmark stops being treated as one sample of reality and starts being treated as reality itself. Passing tests raises organizational confidence. Stable metrics become proxies for system reliability. Green dashboards acquire institutional authority.

But the deeper issue is structural: evaluation decay produces none of the signals teams are wired to act on. Latency spikes trigger pages. Services emit error codes. Pipelines fail loud assertions. Evaluation decay does none of these things. There is no automatic alert telling you that the benchmark itself is becoming outdated.

It simply keeps passing while production moves away from it.

The asymmetry is the real problem. Teams receive continuous reinforcement that the system is healthy precisely because the benchmark has stopped exercising the system against current production conditions.

The most dangerous evaluation set is the one that has been passing for six months.

THE FOUR MECHANISMS OF EVALUATION DECAY

1. Data drift

The most familiar decay mechanism is distribution drift in production inputs. The queries users send today are usually not the queries they sent six months ago. New customer segments arrive. Product launches create new interaction patterns. Business priorities shift. Workflows emerge organically through usage.

The evaluation set, meanwhile, samples a historical moment.

The gap between what the benchmark measures and what production actually experiences widens gradually enough that aggregate metrics conceal it.

Offline accuracy can hold steady while failure concentration grows in regions of the input space that the benchmark no longer represents.

This becomes acute in any system exposed to large-scale user interaction. Recommendation systems, retrieval systems, conversational interfaces, and agentic workflows all run on continuously evolving distributions. Static evaluation sets assume stability where none exists.

The subtle version is just as damaging: old inputs become overrepresented relative to current usage. Benchmarks accumulate historical assumptions about what “normal” usage looks like long after production has moved on. The longer an eval remains static, the more heavily it is weighted toward historical traffic relative to current production behavior.

If you don’t refresh the eval against current production data, you are measuring a system against a customer base that no longer exists.
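One rough way to put a number on this gap is to compare the category mix of the eval set with the category mix of recent production traffic. The sketch below uses Jensen-Shannon divergence over intent labels. The intent names and counts are invented for illustration, and it assumes both the eval cases and production queries already carry a comparable category label.

```python
import math
from collections import Counter


def category_shares(labels):
    """Return each category's share of the total (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence: 0 for identical mixes, 1 for disjoint ones."""
    cats = set(p) | set(q)
    mix = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in cats}

    def kl(a):
        # Kullback-Leibler divergence from the 50/50 mixture; zero-share terms contribute 0.
        return sum(a[c] * math.log2(a[c] / mix[c]) for c in a if a[c] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical labels: the eval set was sampled at launch, production is current traffic.
eval_intents = ["billing"] * 60 + ["returns"] * 30 + ["shipping"] * 10
prod_intents = (["billing"] * 30 + ["returns"] * 20 + ["shipping"] * 10
                + ["subscriptions"] * 40)

drift = jensen_shannon_divergence(category_shares(eval_intents),
                                  category_shares(prod_intents))
```

A single score like this hides which categories moved, so in practice it would sit next to a per-category breakdown. Its value is mainly as a trend line: if it rises quarter over quarter, the eval is drifting away from production.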

2. User adaptation

One of the most underappreciated decay mechanisms is behavioral adaptation by users themselves.

Users learn systems. They discover which prompts work, which workflows fail, which phrasing produces better outputs, and which interaction patterns get ignored. Over time, their behavior changes in response to what the system rewards and punishes. That adaptation fundamentally alters the population generating production inputs.

A conversational AI assistant may initially receive broad, exploratory prompts from inexperienced users. Six months later, experienced users may employ compressed, domain-specific shorthand that simply didn’t exist during launch evaluation. Recommendation system users may become increasingly strategic in how they interact with ranking systems, shifting click behavior, search behavior, and engagement patterns.

Internally, dashboards may appear unchanged. Aggregate metrics can remain stable while the underlying generating process evolves substantially. From inside the system, operational behavior looks continuous. From the outside, the population producing that behavior may have changed dramatically.

Most evaluation strategies implicitly assume stationary users. In practice, production users are dynamic participants who co-evolve with the system.

If you don’t account for user adaptation, your evaluation set is measuring how a system behaves against users who no longer exist.

3. Distribution shift in the world

Sometimes the production environment changes even when neither the users nor the model do. The world itself shifts.

Product catalogs evolve. Policies change. Terminology updates. Regulatory frameworks move. Knowledge bases are revised. Organizational structures change. Entire categories of information become obsolete or get redefined.

Evaluation cases authored against an earlier state of the world become factually incorrect over time. A retrieval pipeline evaluated against last year’s documentation fails because the underlying source documents changed. A customer support classifier routes correctly against historical policy definitions while misrouting against updated procedures. A recommendation system optimizes against outdated inventory assumptions.

Teams often discover the problem only after a customer reports an “obviously incorrect” answer that turns out to be correct relative to the older knowledge base the eval was authored against. The benchmark didn’t fail. It became stale.

This creates a versioning problem most teams handle poorly. Evaluation sets are usually versioned as datasets, but not versioned against the state of the world they assume. As a result, teams lose the ability to determine whether benchmark degradation reflects model failure or environmental change. Operationally, that distinction matters: one requires retraining or architectural change. The other requires evaluation maintenance.

If you don’t version the eval against the state of the world it was authored under, you are measuring a system against a knowledge base that no longer reflects reality.
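A minimal sketch of what versioning against the state of the world could look like: each case records which knowledge-base and policy snapshot it was authored under, so stale cases can be separated from current ones before a pass rate is interpreted. The field names (`kb_version`, `policy_version`) and version strings are hypothetical and not drawn from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    query: str
    expected: str
    kb_version: str       # hypothetical: knowledge-base snapshot the label assumes
    policy_version: str   # hypothetical: policy definitions the label assumes


def split_by_world_state(cases, current_kb, current_policy):
    """Separate cases authored under the current world state from stale ones."""
    current, stale = [], []
    for case in cases:
        matches = (case.kb_version == current_kb
                   and case.policy_version == current_policy)
        (current if matches else stale).append(case)
    return current, stale


cases = [
    EvalCase("c1", "refund window?", "30 days", "kb-2025-11", "policy-v3"),
    EvalCase("c2", "refund window?", "14 days", "kb-2025-05", "policy-v2"),
]
current, stale = split_by_world_state(cases, "kb-2025-11", "policy-v3")
```

Reporting pass rates separately for the two groups is one way to tell whether a drop reflects the model or the environment.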

4. Emergent failure modes

The final decay mechanism is the most operationally dangerous: systems begin failing in entirely new ways.

The benchmark was authored when failure mode X existed. Since then, the model was upgraded, prompts were modified, routing logic changed, tools were added, orchestration policies evolved, or the architecture became more agentic and multi-step. Failure mode Y now exists instead. But Y is absent from the evaluation set.

This is particularly important in agentic systems, where architectural complexity continuously introduces new coordination failures, reasoning inconsistencies, state-management issues, and tool interaction errors.

Every modification to orchestration logic changes the space of possible failures — and benchmarks continue testing for historical failure patterns while production systems generate entirely new categories of behavior.

The result is a dangerous illusion of robustness. Regression tests pass because the benchmark still covers yesterday’s bugs extremely well. Meanwhile, new categories of failure emerge completely outside its observational boundary.

This compounds. Organizations tend to preserve old test cases indefinitely while adding a few new ones.

Over time, the benchmark transforms into an archive of historical incidents rather than an active defense against current production risk. In agentic architectures, evaluation decay accelerates because every new tool, routing layer, memory policy, or orchestration strategy expands the failure surface.

A benchmark that never gets new cases for new failure modes is not a defense against tomorrow’s bugs. It is an archive of yesterday’s.

Figure 2 — Four mechanisms quietly erode the alignment between your eval set and production reality (Image created by author)

A CONCRETE PRODUCTION SCENARIO

Consider a customer-facing retrieval and classification system launched with a carefully constructed 500-case evaluation set. At launch, the benchmark looked comprehensive: known failure categories, representative customer queries, edge cases, escalation scenarios.

Six months later, the benchmark still reports a stable 94% pass rate. Nothing on the dashboard indicates that the benchmark itself has lost coverage, so by every operational signal the system looks healthy.

Meanwhile, customer support tickets have climbed steadily. Internal reviewers report degraded retrieval quality. Escalations involving incorrect routing are rising.

The numbers below are illustrative — designed to make the decay mechanisms concrete, not to benchmark any specific system. The shape of the audit, not the specific figures, is the point.

An audit against current production reveals three issues.

First, roughly 18% of the original evaluation cases are now stale. Some reference products that no longer exist. Others test workflows that were deprecated during later platform releases. Several assume policy definitions that were superseded months earlier.

Second, approximately 12% of recent production failures involve user query patterns absent from the benchmark entirely. Users learned how to interact with the system in more compressed and domain-specific ways than the launch eval anticipated.

Third, the organization upgraded the underlying model three months earlier and added no evaluation cases targeting the new architecture’s failure characteristics. The benchmark is therefore measuring regression relative to historical behaviors rather than coverage of newly introduced risks.

The evaluation set is not broken. It is faithfully measuring the system that existed six months ago.

That distinction matters because many organizations interpret stable evaluation metrics as evidence that the production environment itself is stable. In reality, the benchmark may simply have lost enough overlap with current conditions that it can no longer detect modern failures.

The green dashboard isn’t evidence of reliability. It’s evidence of historical alignment between the benchmark and an earlier production state.

QUICK DIAGNOSTIC

If your team:

hasn’t added new test cases to the evaluation set in the last month
can’t estimate how much of the eval still reflects current production
tracks the evaluation pass rate but not the evaluation coverage of recent production failures

then the evaluation set is almost certainly decaying, and the green dashboard is probably hiding it.

WHAT EVALUATION MAINTENANCE ACTUALLY LOOKS LIKE

The operational implication is straightforward: evaluation sets are production assets and must be maintained like production assets — with schedules, owners, versioning, and explicit workflows.

Continuous sampling refresh

At regular intervals — quarterly at minimum in most production systems — sample new cases directly from current production traffic. Incorporate them into the benchmark after review and labeling. Without a systematic refresh, the benchmark progressively diverges from actual usage.
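As a rough illustration of the sampling step, the sketch below draws a fixed number of candidate cases per traffic segment from recent production records, so that small but new segments are not drowned out by the dominant ones. The `segment` field and the record shape are assumptions. Review and labeling still happen afterwards, by people.

```python
import random
from collections import defaultdict


def sample_refresh_candidates(production_records, per_segment, seed=0):
    """Pick up to `per_segment` records from each segment for human review."""
    rng = random.Random(seed)  # seeded so a refresh run is reproducible
    by_segment = defaultdict(list)
    for record in production_records:
        by_segment[record["segment"]].append(record)

    candidates = []
    for segment in sorted(by_segment):
        records = by_segment[segment]
        candidates.extend(rng.sample(records, min(per_segment, len(records))))
    return candidates


# Hypothetical traffic: one large established segment and one small new one.
records = ([{"segment": "enterprise", "query_id": i} for i in range(200)]
           + [{"segment": "self_serve_trial", "query_id": 1000 + i} for i in range(5)])
candidates = sample_refresh_candidates(records, per_segment=20)
```

Sampling per segment rather than uniformly is a design choice: uniform sampling mirrors current volume, while per-segment sampling deliberately over-represents emerging usage.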

Coverage auditing

Most teams know whether their eval is passing. Very few know whether their eval would have caught last month’s production failures. That is the only question that actually measures whether the eval is doing its job.

Tracking accuracy without tracking coverage is the central failure mode here. A passing eval that doesn’t cover current failures is producing organizational confidence that the system has not earned. Coverage, not accuracy, determines whether the evaluation infrastructure remains useful.
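A coverage audit can start very simply. The sketch below assumes each eval case and each recent production incident has been tagged with a failure-mode label (the tags here are invented), and asks what share of incidents fall under a failure mode the eval exercises at all.

```python
def eval_coverage(eval_case_modes, incident_modes):
    """Share of recent incidents whose failure mode appears in the eval set,
    plus the list of failure modes the eval does not cover."""
    covered_modes = set(eval_case_modes)
    if not incident_modes:
        return 1.0, []
    uncovered = [mode for mode in incident_modes if mode not in covered_modes]
    coverage = 1 - len(uncovered) / len(incident_modes)
    return coverage, sorted(set(uncovered))


# Hypothetical tags for illustration only.
eval_modes = ["wrong_product", "policy_mismatch", "wrong_product"]
recent_incidents = ["policy_mismatch", "shorthand_query",
                    "tool_timeout", "shorthand_query"]

coverage, gaps = eval_coverage(eval_modes, recent_incidents)
```

Matching on a tag is a coarse proxy, since a covered failure mode can still be covered badly. Even so, the list of uncovered modes gives a direct backlog of cases to author.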

Sunsetting obsolete cases

Teams continuously add tests but rarely remove outdated ones. Over time, stale cases distort benchmark metrics toward historical conditions that no longer matter. Evaluation sets need periodic pruning, just like any other production dataset.

Figure 3 — A quarterly cycle that keeps the benchmark aligned with the system it claims to evaluate (Image created by author)

These processes imply a broader shift in mindset. Evaluation should not be treated as a static certification artifact completed during launch. It should be treated as a continuously evolving observability layer for production behavior — especially in systems involving user interaction, evolving knowledge environments, or rapidly changing architectures.

This is the same conversation as the cost piece, asked at the evaluation level. Evaluation overhead was the fifth category in that cost taxonomy. The work described here is what that category actually buys you.

An evaluation set that doesn’t change is an evaluation set that has stopped working.

WHEN STATIC EVALUATIONS ARE ACTUALLY FINE

Static evaluations measure models. Living evaluations measure systems operating in production. Most organizations need both — and currently maintain only one.

Not every benchmark requires continuous refresh. Some evaluation tasks are genuinely stable and benefit from static datasets. Mathematical reasoning benchmarks, code correctness suites, and fixed factual datasets with durable answers can function effectively as long-lived regression tests. In those contexts, the goal is to measure model capabilities rather than production adaptation.

Some evaluation suites are explicitly designed to detect regressions under controlled conditions. Their value derives precisely from being stable over time.

The distinction isn’t whether static evaluation is inherently flawed. The distinction is what the evaluation is meant to measure. Static evaluations work well when the task is fixed, the underlying concepts are stable, and the objective is controlled capability comparison.

Living evaluations are necessary when users interact continuously with the system, the environment changes, or the architecture itself evolves.

Figure 4 — Two kinds of evaluation, both necessary. Most organizations need both. Most currently maintain only one. (Image created by author)

FINAL THOUGHT

Evaluation sets are usually discussed as though they are objective measurements. In practice, they are historical artifacts created under specific assumptions about users, environments, architectures, and failure modes. Those assumptions age.

The organizations that recognize this early treat evaluation maintenance as continuous engineering work rather than a one-time setup. They refresh cases, audit coverage, version assumptions, and retire obsolete tests. Most importantly, they understand that stable benchmark metrics do not imply stable production systems.

The organizations that fail to recognize this usually find out through production incidents.

A passing benchmark tells you the system has not regressed against yesterday’s questions. It tells you almost nothing about whether the system is working today.

The benchmark stayed green because the benchmark stopped looking at where the failures moved.

Evaluation Sets Have a Half-Life. Most Teams Pretend They Don’t. was originally published in Data Science Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Memory Wall Is Strangling Your LLM: Why GPUs Are Faster Than You Think and Slower Than You Need

Feroz Khan — Mon, 25 May 2026 11:22:27 GMT

There is a number that should bother anyone who has spent time thinking seriously about LLM inference: 62,000 tokens per second.

That’s the theoretical throughput ceiling for an 8B-parameter model running on a single NVIDIA H100 GPU. You can derive it purely from the chip’s peak compute capacity of one quadrillion floating-point operations per second (1 petaFLOP/s). It’s the number you would include in a slide deck if you wanted to sound optimistic about AI infrastructure.
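The derivation can be reproduced in a few lines. The sketch assumes roughly 2 floating-point operations per parameter per generated token (one multiply and one add per weight), a common back-of-the-envelope approximation that ignores attention and other overheads. That constant is an assumption, not something the article states.

```python
PEAK_FLOPS = 1e15               # 1 petaFLOP/s peak compute, as quoted above
PARAMS = 8e9                    # 8B-parameter model
FLOPS_PER_PARAM_PER_TOKEN = 2   # assumption: one multiply + one add per weight

flops_per_token = PARAMS * FLOPS_PER_PARAM_PER_TOKEN
compute_bound_tokens_per_s = PEAK_FLOPS / flops_per_token  # on the order of 62,000
```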

The actual number, across virtually every production inference engine in use today, sits somewhere between 100 and 300 tokens per second.

That’s a gap of roughly 200x to 600x between theory and reality. And it’s not a software bug, a framework inefficiency, or a failure of engineering ambition. It’s a structural property of how modern hardware is built, and closing that gap is one of the more interesting systems problems in AI right now.

The Compute Illusion

Modern GPUs are genuinely extraordinary compute engines. The H100’s tensor cores can perform matrix multiplications at a rate that would have seemed impossible a decade ago. If inference throughput were purely a function of arithmetic throughput, we would be living in a very different world, one where latency was effectively free, and model serving was a solved problem.

But inference throughput is not a function of arithmetic throughput. It’s a function of memory bandwidth. Specifically, the rate at which a GPU can transfer data between its high-bandwidth memory (HBM) and its on-chip compute units. And that number, while impressive in absolute terms (3.35 TB/s on the H100), is nowhere near sufficient to keep those tensor cores fed.

To understand why, you need to think carefully about what actually happens during autoregressive decoding.

What Decoding Actually Costs

An LLM with 8 billion parameters, stored in 16-bit precision, occupies roughly 16 GB of memory. During inference, generating each new token requires a full forward pass through the model. That means every single set of weights (all 16 GB of them) must travel from HBM to the on-chip SRAM and into the processor registers, get used for a matrix multiplication, and then be discarded to make room for the next layer’s weights.

This isn’t a one-time cost. It happens for every token generated.

If you want to generate a 1,000-token response, you need 1,000 complete weight transfers. With 3.35 TB/s of bandwidth and 16 GB of weights per transfer, the math is almost embarrassingly direct: you can afford roughly 200 transfers per second, giving you approximately 200 tokens per second (3350 GB/s divided by 16 GB, assuming compute time is negligible). The compute units, capable of 1,000 TFLOP/s, are sitting idle for most of this time, waiting for data to arrive.
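The same arithmetic as a short script, using only the figures quoted above. The utilization estimate at the end additionally assumes about 2 FLOPs per parameter per token, which is a standard approximation rather than a number from the article.

```python
HBM_BANDWIDTH = 3.35e12      # bytes per second (3.35 TB/s)
PARAMS = 8e9                 # 8B parameters
BYTES_PER_PARAM = 2          # 16-bit weights
PEAK_FLOPS = 1e15            # roughly 1,000 TFLOP/s

weight_bytes = PARAMS * BYTES_PER_PARAM              # 16 GB streamed per token
memory_bound_tokens_per_s = HBM_BANDWIDTH / weight_bytes
seconds_for_1000_tokens = 1000 / memory_bound_tokens_per_s

# Assumption: ~2 FLOPs per parameter per token (multiply + add).
useful_flops_per_s = memory_bound_tokens_per_s * 2 * PARAMS
compute_utilization = useful_flops_per_s / PEAK_FLOPS  # well under 1% of peak
```

Under these assumptions a single decode stream uses a fraction of a percent of peak compute, which is the idle time described above.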

This is the memory wall. It is not a new concept (Williams et al. described the roofline model formally in 2009 [1]), but it is newly relevant in a way that’s defining the economics of AI infrastructure.

Memory Wall showing gap between GPU computation speeds and IO speeds. Made by author using Excalidraw

The Memory Hierarchy Problem

To appreciate why this is hard to fix, it helps to understand the structure of GPU memory.

At the innermost level are registers, which are tiny, blindingly fast memory directly accessible to execution units. These hold the immediate operands of whatever operation is running. Above that is SRAM (static RAM), the on-chip cache. The H100 has about 50 MB of L2 cache. It’s fast, low-latency, and manufactured directly on the chip, which is why it is expensive and scarce.

Then there’s HBM (high-bandwidth memory). This is what people mean when they say “GPU memory.” The H100 ships with 80 GB of HBM3, using a stacked die architecture that trades some latency for dramatically higher capacity. HBM is where model weights live.

The problem is that 50 MB of SRAM cannot hold a model with billions of parameters. So weights must be continuously streamed from HBM into the cache and registers on demand, layer by layer, as computation proceeds. The bandwidth of this HBM-to-SRAM path is the bottleneck.

Compute hardware has improved at roughly Moore’s Law pace over the past few decades. Memory bandwidth has improved much more slowly. This divergence, which is sometimes called the memory bandwidth gap, is what makes LLM inference structurally difficult.

Data flow between a GPU, On-chip SRAM, and High Bandwidth Memory. Made by author using Excalidraw

The Roofline Model: A Framework for Thinking About Bottlenecks

The roofline model [1] gives us a clean way to reason about where inference algorithms fall in the compute-vs-memory spectrum.

The key metric is arithmetic intensity. It’s the ratio of floating-point operations to bytes of memory transferred, measured in FLOPs per byte. If you perform a lot of computation per byte read from memory, your arithmetic intensity is high; if you read a lot of bytes but do little with each one, it’s low.

On a roofline plot, the x-axis represents arithmetic intensity and the y-axis represents achieved performance (FLOPs/s). The relationship is linear up to a ridge point: performance scales with arithmetic intensity times memory bandwidth. Beyond the ridge point, the limiting factor is no longer memory bandwidth but raw compute throughput. You have crossed into the compute-bound region.
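The roofline relationship is compact enough to write down directly: attainable performance is the smaller of the compute ceiling and arithmetic intensity times memory bandwidth. The sketch below plugs in the H100 figures quoted earlier in this article. The function name is just illustrative.

```python
PEAK_FLOPS = 1e15          # compute ceiling, FLOP/s
HBM_BANDWIDTH = 3.35e12    # memory bandwidth, bytes/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Roofline: performance is capped by bandwidth on the left of the ridge
    point and by peak compute on the right of it."""
    return min(peak_flops, intensity * bandwidth)


# Arithmetic intensity (FLOPs per byte) at which the two limits meet.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH   # roughly 300 FLOPs per byte

low_intensity = attainable_flops(1.0)      # memory-bound: far below peak
high_intensity = attainable_flops(1000.0)  # compute-bound: pinned at peak
```

With these numbers, a kernel needs on the order of 300 FLOPs per byte moved before the compute units become the limiting factor.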

Example roofline plot showing encoder and decoder stages. Made by author on their iPad

Autoregressive decoding sits far to the left of the ridge point. Each weight byte is read from HBM and used for a small number of multiplications before being evicted. The arithmetic intensity is low, which means the memory bandwidth slope dominates, and the processor is underutilized.

What’s interesting is that not all LLM inference stages are equal here. The prefill (encoder) stage, where the model processes the input prompt to generate attention-based dense embeddings, is actually compute-bound. Because all prompt tokens are available simultaneously, the Transformer can process them in parallel through a single forward pass. For a prompt of N tokens, you get N token-equivalents of computation from a single model stream. Arithmetic intensity scales linearly with sequence length. For prompts of hundreds or thousands of tokens, prefill easily crosses the ridge point.

Decode has no such luxury. Each new token requires its own forward pass, so the ratio of output tokens to model streams is always 1:1. Arithmetic intensity stays constant regardless of response length, and it stays low. Decode is almost always memory-bound.
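A deliberately simplified estimate makes the asymmetry visible. If one pass over the weights serves `tokens_per_pass` tokens, and each 16-bit weight costs 2 bytes to read and about 2 FLOPs per token to use, the intensity of the weight stream is just the number of tokens served per pass. Attention, activations, and KV-cache traffic are ignored here, and the 2-FLOPs-per-parameter figure is an assumption.

```python
def weight_stream_intensity(tokens_per_pass, bytes_per_param=2, flops_per_param=2):
    """FLOPs performed per byte of weights read, for one pass over the model."""
    return tokens_per_pass * flops_per_param / bytes_per_param


RIDGE_POINT = 1e15 / 3.35e12   # H100 figures quoted earlier, ~300 FLOPs/byte

decode_intensity = weight_stream_intensity(1)      # one new token per pass
prefill_intensity = weight_stream_intensity(512)   # a 512-token prompt

decode_is_memory_bound = decode_intensity < RIDGE_POINT
prefill_is_compute_bound = prefill_intensity > RIDGE_POINT
```

Under this model, single-stream decode sits at about 1 FLOP per byte regardless of how long the response gets, while a prompt of a few hundred tokens already clears the ridge point.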

This asymmetry between prefill and decode is a structural property of autoregressive generation, not an engineering oversight.

KV Caching: Necessary but Costly

One optimization that every serious inference framework implements is the KV cache. In Transformer attention, computing the attention scores for a new token requires access to the keys and values produced by every previous token. Rather than recomputing these from scratch each step (which would require re-running all prior tokens through the model), inference engines cache them in HBM.

The KV cache is essential for making autoregressive inference tractable. Without it, every decoding step would re-run all prior tokens through the model, so the total cost of generating a sequence would grow at least quadratically, O(N²), with sequence length N. With it, each decoding step is relatively cheap in FLOPs, but the cache itself must be stored and streamed alongside the model weights.

The problem is that KV cache memory scales with batch size (the number of queries processed together) times sequence length times model depth (the number of layers) times the width of the key and value vectors in each layer. As batches grow larger and sequences get longer, the KV cache consumes an increasing share of HBM. This limits how much you can amortize the weight-streaming cost across multiple queries, which brings us to the tension at the heart of batched inference.
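To get a feel for the magnitudes, here is a sketch of the usual KV cache size estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, KV head count, and head dimension below are illustrative assumptions for an 8B-class model using grouped-query attention. They are not taken from this article.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size: keys + values for every layer, head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token


per_token_bytes = kv_cache_bytes(batch_size=1, seq_len=1)        # 128 KiB per token
one_long_sequence = kv_cache_bytes(batch_size=1, seq_len=4096)   # 0.5 GiB
large_batch = kv_cache_bytes(batch_size=64, seq_len=4096)        # 32 GiB

GIB = 2 ** 30
large_batch_gib = large_batch / GIB
```

With these assumed dimensions, 64 concurrent 4,096-token sequences need twice as much memory for cache as the 16 GB of weights themselves.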

LLM batch inference pipeline and memory architecture. Made on Excalidraw by author.

Batching is the most obvious lever for improving arithmetic intensity during decode. If you process 64 queries simultaneously, a single model stream extends 64 responses. This means the numerator of your arithmetic intensity ratio grows while the denominator stays fixed. But as you increase batch size, the KV cache grows proportionally, eating into the HBM capacity that you would otherwise use for larger batches. Eventually, KV cache pressure becomes the binding constraint.
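The trade-off can be sketched with a simple memory-bound model: each decode step streams the weights once plus every sequence's KV cache, and produces one token per sequence. The per-token KV size (128 KiB) is an assumed figure for an 8B-class model with grouped-query attention, and every sequence is pessimistically treated as already being at full length.

```python
HBM_BANDWIDTH = 3.35e12         # bytes/s
HBM_CAPACITY = 80e9             # bytes
WEIGHT_BYTES = 16e9             # 8B parameters at 16-bit precision
KV_BYTES_PER_TOKEN = 131072     # assumption: 128 KiB of KV cache per token


def decode_tokens_per_s(batch_size, seq_len):
    """Aggregate tokens/s if each step is limited purely by HBM traffic."""
    bytes_per_step = WEIGHT_BYTES + batch_size * seq_len * KV_BYTES_PER_TOKEN
    return batch_size / (bytes_per_step / HBM_BANDWIDTH)


def max_batch_size(seq_len):
    """Largest batch whose KV cache fits next to the weights in HBM."""
    return int((HBM_CAPACITY - WEIGHT_BYTES) // (seq_len * KV_BYTES_PER_TOKEN))


single = decode_tokens_per_s(batch_size=1, seq_len=4096)
batched = decode_tokens_per_s(batch_size=64, seq_len=4096)
capacity_limit = max_batch_size(seq_len=4096)
```

In this model aggregate throughput rises with batch size, but sub-linearly, because KV traffic grows with the batch while the weight traffic does not. The capacity limit then caps how far the batch can grow at all.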

vLLM addressed part of this with PagedAttention [2], where it borrowed ideas from operating system virtual memory to manage KV cache blocks non-contiguously. This reduces fragmentation and uses HBM more efficiently, which allows larger effective batch sizes. TensorRT-LLM provides fused kernels and multi-head attention optimizations that reduce per-token overhead.

These are meaningful wins. But they are all operating within the same fundamental constraint: a memory-bound regime where the limiting factor is how fast you can push bytes from HBM to compute.

Speculative Decoding: Buying Compute-Bound Behavior Through Architecture

Speculative decoding [4] is a cleverer approach. The core insight is that if you can predict the next several tokens cheaply, you can verify them all at once with the large model, amortizing one expensive forward pass across multiple tokens.

In practice, this means running a small draft model (sometimes as small as a few hundred million parameters) for several steps to generate candidate token sequences. The large verifier model then processes the entire drafted sequence in a single forward pass (similar to an encoder where multiple tokens are handled simultaneously), and either accepts or rejects each draft token using a carefully designed acceptance criterion that preserves the target distribution [4].

When the draft model is well-chosen and achieves high acceptance rates, speculative decoding effectively reduces the number of large model forward passes per output token, pushing the system toward higher arithmetic intensity. The verifier starts behaving more like a compute-bound workload.

Smaller ‘Draft LLM’ generates multiple potential tokens that are then validated by a larger ‘Verifier LLM’ in a single forward pass.

The challenge is calibrating the draft model. Too large, and it loses its speed advantage over the verifier. Too small, and acceptance rates collapse and you end up doing more total work than naive decoding. The acceptance rate is also sensitive to temperature, prompt distribution, and draft length. In practice, speculative decoding requires tuning and does not deliver consistent speedups across all task types.
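The calibration trade-off can be explored with a simplified model that is common in the speculative decoding literature: assume each draft token is accepted independently with probability `alpha`, the draft proposes `gamma` tokens per round, and one draft step costs a fraction `draft_cost` of a verifier pass. Real acceptance is not independent across positions, so treat the outputs as directional. All parameter names here are illustrative.

```python
def expected_tokens_per_round(alpha, gamma):
    """Expected tokens emitted per verifier pass (accepted drafts plus the one
    token the verifier always contributes), under independent acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, draft_cost):
    """Tokens per round divided by the cost of a round, in verifier-pass units."""
    round_cost = gamma * draft_cost + 1
    return expected_tokens_per_round(alpha, gamma) / round_cost


good_draft = estimated_speedup(alpha=0.8, gamma=4, draft_cost=0.05)   # > 1
weak_draft = estimated_speedup(alpha=0.2, gamma=4, draft_cost=0.05)   # barely > 1
heavy_draft = estimated_speedup(alpha=0.2, gamma=4, draft_cost=0.30)  # < 1: a slowdown
```

The last case shows the failure mode described above: with a low acceptance rate and a draft that is not cheap enough, the system does more total work than plain decoding.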

There are variations worth noting: self-speculative decoding [5] uses early exit layers of the same model as the draft mechanism, avoiding the need for a separate model entirely. Lookahead decoding [6] uses n-gram speculation derived from the input. These methods trade generality for deployment simplicity.

Diffusion LLMs: A Structural Escape from Memory Bounds

Speculative decoding is fundamentally a patch on an autoregressive paradigm. The real question is whether the paradigm itself can be changed.

Diffusion language models [7][8] represent the most structurally distinct alternative currently in serious development. Rather than generating tokens left-to-right, one at a time, diffusion models operate on the entire output sequence simultaneously. They start with a fully masked or noisy sequence and iteratively refine it across multiple denoising steps until a coherent response emerges.

From an arithmetic intensity perspective, this is a meaningful shift. During each denoising iteration, the model performs a forward pass that updates every token in the context window, not just one. With a context window of length L, a single model stream contributes L token-update operations instead of one. Arithmetic intensity scales with context length, which is the opposite of autoregressive decoding.

Contrasting generation processes of Auto-Regressive LLMs and Diffusion LLMs. Made by author on Excalidraw.

For typical response lengths of several hundred to a few thousand tokens, this pushes diffusion models well into compute-bound territory on the roofline. The compute units are no longer waiting for weights to arrive; they stay busy, and HBM bandwidth is often not the bottleneck at all.

This is the theoretical appeal of diffusion for inference. But theory and practice diverge here in important ways.

The Problems with Naive Diffusion

Early diffusion LLMs like LLaDA [8] were often slower than their autoregressive counterparts in wall-clock time. Being compute-bound sounds good until you remember that being compute-bound with wasted computations is not the same as being efficient.

The central waste in vanilla diffusion is that most refinement steps don’t actually update most tokens meaningfully. Empirically, at any given step, the model is highly confident about perhaps 10% of positions. The rest are ambiguous because multiple valid tokens could occupy those slots. Yet the model dutifully computes updates for all positions, burning FLOPs on outputs it’s uncertain about and will likely revise in subsequent steps.

A separate problem is context window sizing. Standard diffusion allocates a fixed context window, often several thousand tokens, for every query, regardless of whether the response will be three tokens or three thousand. Generating a yes/no answer through a 2048-token diffusion window is wildly inefficient.

Adaptive context window techniques address this by starting small (64 tokens) and extending the window dynamically based on the probability mass assigned to the end-of-sequence token at each position. When the model starts placing high probability on EOS across the current window, it locks in the response length and continues refining. This prevents the system from over-allocating computation for short responses.
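A minimal sketch of the control logic, under several assumptions: `eos_mass_fn` is a hypothetical callable that returns how much probability the model currently places on EOS within a window of the given size, the window doubles on each extension, and a fixed threshold decides when to lock the length. The growth policy and the threshold are illustrative choices, not details from the cited work.

```python
def choose_window(eos_mass_fn, start=64, max_window=2048, eos_threshold=0.5):
    """Grow the diffusion window until the model signals the response fits."""
    window = start
    while window < max_window and eos_mass_fn(window) < eos_threshold:
        window = min(window * 2, max_window)   # assumption: double each time
    return window


# Toy stand-ins for a real model's EOS signal.
def short_answer(window):
    return 0.9                                  # EOS is likely right away


def medium_answer(window):
    return 0.9 if window >= 200 else 0.05       # needs roughly 200 tokens


def long_answer(window):
    return 0.0                                  # never confident it is done


short_window = choose_window(short_answer)      # stays at the starting size
medium_window = choose_window(medium_answer)    # stops once the answer fits
long_window = choose_window(long_answer)        # capped at max_window
```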

Block Diffusion: The Hybrid Architecture

The most promising current direction is block diffusion [9]. It’s a hybrid that combines the throughput advantages of diffusion with the early stopping and KV caching of autoregressive models.

The idea is straightforward: partition the output into fixed-size blocks. Within each block, tokens are decoded using diffusion (all positions refined simultaneously). Across blocks, generation proceeds autoregressively, where each block conditions on all previous blocks.

Block Diffusion illustration. Made on Excalidraw by author.

This structure recovers two important properties that vanilla diffusion sacrifices. First, early stopping becomes possible: once any block generates an end-of-sequence token, generation terminates without diffusing subsequent blocks. For short responses, this prevents the fixed-iteration overhead that makes vanilla diffusion slow. Second, autoregressive block ordering means you can apply KV caching across blocks. The KV states from completed blocks are cached and reused exactly as in standard transformer inference.
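
The control flow described above can be sketched as follows. This is a toy under stated assumptions, not the method from [9]: `denoise_block` is a hypothetical callable standing in for the full within-block diffusion loop, the `context` list stands in for the cached KV states of finished blocks, and `make_toy_denoiser` simply replays a fixed target so the example runs without a model.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker


def block_diffusion_generate(
    denoise_block: Callable[[List[str], int], List[str]],
    prompt: List[str],
    block_size: int = 4,
    max_blocks: int = 8,
) -> List[str]:
    """Autoregressive over blocks; each block is produced in one parallel call."""
    context = list(prompt)  # stands in for KV states cached from earlier blocks
    output: List[str] = []
    for _ in range(max_blocks):
        # All positions of the block are refined together, conditioned on context.
        block = denoise_block(context, block_size)
        if EOS in block:
            output.extend(block[: block.index(EOS)])
            break  # early stop: later blocks are never diffused
        output.extend(block)
        context.extend(block)  # finished block becomes reusable context
    return output


def make_toy_denoiser(target: List[str], prompt_len: int, calls: List[int]):
    """Hypothetical denoiser that replays `target` and pads with EOS."""
    def denoise_block(context: List[str], block_size: int) -> List[str]:
        calls.append(len(context))  # record each call for inspection
        done = len(context) - prompt_len
        chunk = target[done : done + block_size]
        return chunk + [EOS] * (block_size - len(chunk))
    return denoise_block


if __name__ == "__main__":
    calls: List[int] = []
    target = ["the", "cat", "sat", "on", "the", "mat"]
    denoiser = make_toy_denoiser(target, prompt_len=1, calls=calls)
    print(block_diffusion_generate(denoiser, ["Q:"]), len(calls))
```

In this run the six-token answer needs two block calls out of a budget of eight, which is the early-stopping behavior the paragraph describes.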

Block diffusion also keeps the system in the compute-bound region within each block (because diffusion over block-length sequences still has high arithmetic intensity), while recovering the operational efficiency of standard inference frameworks. Most optimizations developed for autoregressive inference, such as speculative decoding, paged attention, and continuous batching, can be grafted onto block diffusion without significant rearchitecting.

From a systems perspective, block diffusion feels like where the field is converging. It’s the first architecture that seriously addresses both the memory bandwidth problem (via high arithmetic intensity within blocks) and the wasted computation problem (via adaptive stopping and KV caching across blocks).

What This Means for Inference Economics

Today, inference costs are dominated by GPU time, and GPU time is dominated by the memory-bandwidth ceiling during decode. GPU utilization for inference is structurally low compared to training. This means you are paying for peak FLOP/s hardware but using a small fraction of it productively. The cost per token is higher than it needs to be, and latency is harder to reduce than the raw hardware specs suggest.

Architectures that shift workloads toward the compute-bound regime, whether through batching, speculative decoding, diffusion, or block diffusion, directly reduce cost per token by making better use of silicon that’s already paid for.
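
The link between throughput and cost is simple arithmetic. The sketch below assumes a fixed hourly GPU price and ignores everything else (idle time, networking, staffing); the dollar figure and throughput numbers are hypothetical placeholders, not measurements.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    price = 2.0  # hypothetical USD per GPU-hour
    for tps in (1_000, 4_000):  # hypothetical sustained tokens per second
        print(tps, round(cost_per_million_tokens(price, tps), 4))
```

Because the hourly price is fixed, quadrupling sustained throughput on the same hardware cuts cost per token to a quarter.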

There’s also a latency angle. For interactive applications, time-to-first-token matters enormously. Prefill is fast because it’s compute-bound. Decode is slow because it’s memory-bound. Any architectural change that collapses the distinction between prefill and decode, either by batching tokens into chunks (speculative decoding) or processing all output tokens simultaneously (diffusion), has the potential to flatten the latency curve.

Looking further out, specialized inference hardware is already emerging in response to these dynamics. Chips designed with higher memory-bandwidth-to-compute ratios, or with processing-in-memory architectures that reduce the HBM-to-chip transfer distance, could lower the ridge point substantially.
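
In the roofline model [1], the ridge point is peak compute divided by memory bandwidth, so raising bandwidth at fixed compute moves it down. The sketch below illustrates that relationship; the hardware numbers are hypothetical round figures chosen for easy arithmetic, not the specs of any real chip.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and memory limits meet."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


if __name__ == "__main__":
    peak = 1e15  # hypothetical 1,000 TFLOP/s
    for bandwidth in (2e12, 8e12):  # hypothetical 2 TB/s and 8 TB/s
        print(
            bandwidth,
            ridge_point(peak, bandwidth),
            attainable_flops(2.0, peak, bandwidth),  # decode-like low intensity
        )
```

With these placeholder numbers, quadrupling bandwidth drops the ridge point from 500 to 125 FLOPs per byte and quadruples attainable throughput for a low-intensity, decode-like workload.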

Closing Thoughts

The 200x gap between theoretical and realized token throughput is not going away through incremental improvements to existing infrastructure. It reflects a structural mismatch between the compute capabilities of modern accelerators and the memory access patterns demanded by autoregressive generation.

What’s interesting about the current moment is that solutions are emerging simultaneously at every level of the stack. At the systems level: paged attention, continuous batching, prefix caching, quantization [10]. At the algorithm level: speculative decoding, lookahead methods. At the architecture level: diffusion LLMs, block diffusion, mixture-of-experts with sparse activation.

None of these is sufficient on its own. The future of inference efficiency is probably a layered combination: block diffusion architectures served by inference engines that apply KV caching and speculative block proposals, running on hardware designed with memory bandwidth as the primary optimization target.

The memory wall has been a known problem in high-performance computing since long before LLMs existed. What’s new is that language model inference has made it a first-order economic problem, one where solving it translates directly into cheaper, faster AI for everyone building on top of it.

That’s a different kind of pressure than academic interest. And it tends to produce results.

References

[1] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), 65–76.

[2] Kwon, W., et al. (2023). Efficient memory management for large language model serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

[3] Zheng, L., et al. (2024). SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37 (NeurIPS 2024).

[4] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. International Conference on Machine Learning (ICML).

[5] Zhang, J., et al. (2024). Draft & verify: Lossless large language model acceleration via self-speculative decoding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024).

[6] Fu, Y., Bailis, P., Stoica, I., & Zhang, H. (2024). Break the sequential dependency of LLM inference using Lookahead decoding. arXiv preprint arXiv:2402.02057.

[7] Austin, J., Johnson, D. D., Ho, J., Tarlow, D., & van den Berg, R. (2021). Structured denoising diffusion models in discrete state-spaces. NeurIPS 2021.

[8] Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., & Li, C. (2025). Large language diffusion models (LLaDA). arXiv:2502.09992.

[9] Arriola, M., et al. (2025). Block diffusion: Interpolating between autoregressive and diffusion language models. ICLR 2025 (Oral). arXiv:2503.09573.

[10] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS 2022.

The Memory Wall Is Strangling Your LLM: Why GPUs Are Faster Than You Think and Slower Than You Need was originally published in Data Science Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.

MCP vs Function Calling: Which Is Better for AI Agents

Alan Jones — Mon, 25 May 2026 11:22:09 GMT

Tools are an AI agent's window to the real world. Without them, the agent would be restricted to what it already knows.

What Is Data Integration? A Complete Guide for 2026

Saurav Singh — Mon, 25 May 2026 11:21:53 GMT

Your data is everywhere. That’s the problem. Here’s how smart teams are finally solving it.

OpenAI, Grafana, and Half of France Were Breached the Same Way This Month and Here Is Why

Han HELOIR YAN, Ph.D. ☕️ — Mon, 25 May 2026 11:21:18 GMT

The npm supply chain worm and your AI agent turned out to be the same attack

A Qwen 3.5 122B LLM on a 16 GB Mac mini: MoE Expert Streaming with TurboQuant-MLX

Manjunath Janardhan — Mon, 25 May 2026 11:20:49 GMT

Per-expert disk streaming runs a 122-billion-parameter Mixture-of-Experts model — 3× bigger than RAM — on the cheapest Mac Apple sells…

Securing Fabric Data Agents Where It Matters: Row-Level Security Behind Natural Language Answers

Luca Zavarella — Mon, 25 May 2026 11:18:49 GMT

Why Fabric Data Agent security shouldn’t live in the prompt, and how SQL Row-Level Security works across users

Let them compete! kWTA Ensemble Neural Network

Abien Fred Agarap — Mon, 25 May 2026 11:18:37 GMT

An accompanying article for the paper “k-Winners-Take-All Ensemble Neural Network” by A.F. Agarap and A.P. Azcarraga presented at the 2021…

Autonomous AI Agents Are Redefining Data Workflows: The Rise of Self-Optimizing Analytic Pipelines

Aasir Waseer — Mon, 25 May 2026 11:18:30 GMT

In May 2026, the biggest shift in data work isn’t about choosing between tools. It’s about replacing the question “what should I do with…
