Persimmons AI - Medium

Why memory is the future of AI infrastructure

Joel Eriksson Enquist — Mon, 28 Apr 2025 17:37:49 GMT

At Persimmons, we’re very excited about what our friends at Meta have been cooking. Meta’s LLaMA 4 family represents a leap forward in the evolution of open-source foundation models in several ways: larger architectures, multimodal capabilities, and ultra-long context windows. This seems to be the direction we are heading, even if this particular family of models did not fully meet expectations. The lineup includes:

LLaMA 4 Scout: 10 million token context, 109B total parameters
LLaMA 4 Maverick: 1 million token context, 400B total parameters
LLaMA 4 Behemoth: 2 trillion total parameters, the largest open-source model announced to date (previewed by Meta while still in training)

These models demand not just more compute, but a new class of infrastructure, where memory and sustained bandwidth become the critical enablers of performance. This shift is something the founders of Persimmons anticipated years ago, drawing on their deep experience working with AI systems. Now, that future is here.

And while we’ll come back to what this means for infrastructure, let’s first unpack why these models, and the hardware they require, matter.

Why large models like Behemoth matter

“When you take a model and you dumb it down, that is, you make it smaller — believe it or not, it loses resilience. It becomes more fragile. So, strangely, if you are focused on quality, you should try to get the largest model possible. But the largest models are too expensive to serve [computationally].”

— Eric Schmidt, Mar 7, 2025 https://www.youtube.com/watch?v=KX9sVRB_hG0

As we enter the agentic era of AI, where systems act autonomously, collaborate across tasks, and operate over long horizons, we need high-quality models that can reason, retain, and react in real time. Models like LLaMA 4 Behemoth are not just bigger, they’re qualitatively more capable.

Trillion-scale models like Behemoth offer deeper reasoning, greater resilience, and the ability to process rich multimodal inputs. They’re also the foundation for distilled models that power many of today’s consumer-facing tools. But why aren’t we using the highest quality models across the board?

Here’s the issue: these models are incredibly expensive to serve using current infrastructure. That’s why we distill them, i.e. compress a large “teacher” model into a leaner “student” that runs faster and cheaper. The problem is, distillation trades off quality for efficiency, and for many use cases, that trade-off is too steep.
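To make the teacher/student idea concrete, here is a minimal sketch of one common way to train a student: matching the teacher's temperature-softened output distribution. This is an illustration only. The post does not say which distillation method any particular lab uses, and the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions (soft-target distillation)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # conventional scaling of the soft-target term

# Toy logits over three tokens (made-up numbers).
teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
weaker_student = [2.0, 2.0, 2.0]

print(distillation_loss(teacher, matching_student))
print(distillation_loss(teacher, weaker_student))
```

The loss is zero only when the student reproduces the teacher's distribution exactly. A smaller student usually cannot, and that gap is where the quality trade-off shows up.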

Enterprises and developers want access to the full performance of foundation models. But unless we solve the infrastructure bottleneck, they’ll remain out of reach.

This vision is precisely what Persimmons’ founders anticipated and strategically pursued. While most companies concentrated on training and smaller-scale models, the Persimmons team proactively built towards ever-larger models and optimized specifically for inference. Persimmons was founded on, and remains driven by, a clear mission: making trillion-parameter models like Behemoth both affordable and practical for everyday workloads. In doing so, Persimmons accelerates AI’s evolution from R&D into widespread commercial adoption, ensuring near-flawless user experiences and enabling organizations to achieve positive unit economics.

The cost of 90% accuracy

Across industries, we hear the same thing:

Builders have working models and AI products, but can’t launch them cost-effectively
Users are impressed by AI tools, but say they’re only “90% there” — which makes them more frustrating than helpful

And that last 10% matters. If you can’t trust the model, you end up doing the work yourself anyway. For AI to go from prototype to product, we need models that don’t just perform well in benchmarks but also earn user trust in production.

That means serving large [high-quality] models, with low latency, in a cost-efficient way.

The infrastructure we need now

The challenge is infrastructure. Today’s systems, primarily general-purpose GPUs, weren’t built to support trillion-parameter models or 10M-token context windows. Even a datacenter built on 180GB B200 GPUs will need half a rack just to serve one LLaMA 4 Behemoth.
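A quick back-of-envelope sketch shows where a figure like this comes from. The 2 trillion parameter count and the 180GB per GPU come from the text above. The 2 bytes per parameter (16-bit weights) is an assumption, and the function name is made up for illustration. The sketch counts weights only, so KV-cache and activations come on top.

```python
import math

def gpus_for_weights(total_params: float, bytes_per_param: float, gpu_mem_gb: float) -> int:
    """Minimum number of GPUs needed just to hold the model weights."""
    weight_bytes = total_params * bytes_per_param
    gpu_bytes = gpu_mem_gb * 1e9  # decimal gigabytes
    return math.ceil(weight_bytes / gpu_bytes)

# 2e12 parameters and 180GB per GPU are from the text; 16-bit weights are an assumption.
behemoth_gpus_16bit = gpus_for_weights(2e12, 2, 180)
# Same model with 8-bit weights (also an assumption, for comparison).
behemoth_gpus_8bit = gpus_for_weights(2e12, 1, 180)

print(behemoth_gpus_16bit, behemoth_gpus_8bit)
```

Under these assumptions the weights alone fill 23 GPUs at 16 bits, or 12 at 8 bits, before a single token of context is stored.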

The reason why this is challenging is that inference has two very different phases that stress the hardware in opposite ways:

Prefill: the model ingests the entire prompt/context in parallel; the primary bottleneck is FLOPs (raw compute)
Decode: the model produces one token at a time, re-reading the KV-cache for every layer; the primary bottlenecks are memory bandwidth and latency
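The difference between the two phases can be sketched with a simple roofline-style estimate. It relies on the common rule of thumb of roughly 2 FLOPs per parameter per token, ignores attention and KV-cache traffic, and uses a hypothetical accelerator (the peak FLOP/s and bandwidth numbers below are placeholders, not the specs of any real chip).

```python
def arithmetic_intensity(tokens_per_weight_read: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read from memory.

    Rule of thumb: about 2 FLOPs per parameter per token.
    """
    return 2.0 * tokens_per_weight_read / bytes_per_param

def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare the workload's intensity with the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can load
    return "compute" if intensity >= ridge else "memory bandwidth"

# Hypothetical accelerator: 1e15 FLOP/s peak, 4e12 bytes/s memory bandwidth.
PEAK, BW = 1e15, 4e12

prefill = arithmetic_intensity(tokens_per_weight_read=8192)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_weight_read=1)      # one new token per pass

print(bound_by(prefill, PEAK, BW))  # compute
print(bound_by(decode, PEAK, BW))   # memory bandwidth
```

Batching several requests together raises the decode intensity, which is one reason performance at both low and high batch sizes is a separate requirement.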

So what do we need?

Much more memory: to support both large models and long contexts
Sustained memory bandwidth and low latency: critical for high-quality user experience
Simplicity: avoid complex multi-GPU orchestration, which adds cost and fragility
Energy and cost efficiency: to ensure a sustainable TCO and scalable unit economics
High performance at both low and high batch sizes: to serve real-world traffic patterns
Smart, scalable infrastructure: easy to deploy and scale as needed
Native support for multimodal models and agentic workloads: to increase efficiency further

It’s no longer enough to optimize for FLOPs. The real constraints are memory, bandwidth, and deployment efficiency.

The gap between innovation and adoption

We’re also seeing a shift in how the market perceives these models.

Two years ago, the idea of trillion-parameter models was met with skepticism. Just six months ago, many assumed that only players like OpenAI would deploy them. Now, open-source models are pushing past the trillion mark.

What’s holding things back now isn’t the model quality. It’s whether we can serve them in production: affordably, at scale, with a responsive user experience.

Real-world deployment needs real-world, and future-proof, hardware

Some new hardware players are innovating, but often at the cost of practicality. Wafer-scale systems, proprietary fabrics, and exotic cooling setups may work in labs, but they’re hard to deploy and expensive to scale.

What the industry really needs is:

Standard rack compatibility
Air-cooled by default (liquid optional)
Future-proof modular design that scales incrementally as needs grow or change
Deployable in traditional cloud, on-prem, or industrial environments

We need hardware that works in the real world — not just in academic papers or demos. That’s the only way to unlock the full potential of models like LLaMA 4 Behemoth.

The future of AI = model + infrastructure + parallelization algorithms co-design

What’s becoming increasingly clear is that the future of AI lies in co-designing models, infrastructure architecture, and parallelization algorithms.

The LLaMA 4 family isn’t just compute-hungry, it’s memory-intensive, optimized for:

Sparse compute patterns (e.g., MoE)
Long context windows (1M–10M tokens)
Multimodal inputs (text, image, video)

These stress memory systems far more than compute.
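To see why long contexts hit memory so hard, consider the size of the KV-cache for a single sequence. The sketch below uses the standard formula (keys plus values, for every layer and every token). The model shape is hypothetical and is not the published configuration of any LLaMA 4 model. It also assumes full attention in every layer, so architectures that limit attention in some layers would need less.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Memory needed to cache keys and values for one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2x: keys and values
    return per_token * tokens

# Hypothetical model shape, for illustration only.
SHAPE = dict(layers=48, kv_heads=8, head_dim=128)

for context in (1_000_000, 10_000_000):
    gigabytes = kv_cache_bytes(context, **SHAPE) / 1e9
    print(f"{context:>10,} tokens: {gigabytes:.1f} GB")
```

With this made-up shape, a single 1M-token sequence needs roughly 197GB of cache and a 10M-token sequence close to 2TB, on top of the model weights.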

We believe the next generation of intelligent systems, especially agentic workflows, will depend on infrastructures that are not just compute-dense, but memory-rich.

Models will continue to grow. Context windows will continue to expand. Closed models from OpenAI, Google, Anthropic and xAI keep getting larger, and the trend is clearly toward ever longer context windows. For instance, Google’s Gemini Ultra is reported to have 1.56 trillion parameters, with plans to support a 2 million token context window. Meta is already talking about “infinite” context window support. The workloads of the future are already here, and we need to build infrastructure that can meet them head-on.

Closing thoughts

The most powerful models in the world now require more than just FLOPs. They demand:

High memory density
Sustained bandwidth
Low latency at scale
… and cost-efficiency, modularity, and flexibility, to enable optimization for specific use cases and to stay future-proof in a rapidly changing environment

This is what it takes to power the next generation of AI.

When we deliver this, we’ll unlock agentic, trustworthy, and high-quality open-source AI at scale, and bring the promise of AI into every product, workflow, and enterprise system in the world. This is what drives us at Persimmons, and why we are so excited about the work our friends at Meta are doing with the LLaMA family.

Why memory is the future of AI infrastructure was originally published in Persimmons AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Role of AI Compute in the next phase of AI

Joel Eriksson Enquist — Mon, 07 Apr 2025 22:52:14 GMT

As AI continues to revolutionize industries, it’s crucial to understand AI compute — the computational resources required for both training and inference in artificial intelligence systems. AI compute enables machines to process data, learn patterns, and make decisions in real-time. Both training and inference rely on these resources, and as AI becomes more complex, the need for efficient AI compute grows. While this is obvious for some of you today, I’ve noticed that these terms are far from clear for everyone. Here is my [non-AI generated] attempt to explain it in simple terms.

“10x a year, every year, for almost a decade” — Microsoft AI CEO Mustafa Suleyman (when talking about growth in AI compute for the past decade)

What is AI Compute?

AI compute refers to the processing power, memory, and storage required to train, fine-tune, and run AI models. This includes everything from adjusting the internal parameters during training to making real-time predictions during inference. AI compute is powered by hardware platforms like GPUs and TPUs, and can be delivered via the cloud or on the edge (nowadays “edge” refers to everything from true edge use cases to decent-size on-premise data centers).

Pre-training: The process where AI models learn from large datasets, adjusting their internal parameters to recognize patterns and make decisions. This creates a foundation of knowledge.
Fine-tuning: Adjusting a pre-trained model on a specific task or dataset to improve its performance, using fewer resources than training from scratch.
Inference: The phase where a trained AI model makes predictions or decisions. Inference happens when the model is deployed in real-world applications, such as answering queries or interpreting sensor data, or when you are asking ChatGPT to generate an image for you.
Reasoning-time compute / inference-time compute / test-time compute: these terms are frequently used interchangeably today. However, inference-time compute (or test-time compute) refers more generally to the compute used during inference (see above), while reasoning-time compute emphasizes the compute spent on explicit thought processes (Chain of Thought, etc.) and logical deduction during the inference phase.

Why Does AI Compute Matter?

While we are seeing indications that pouring compute resources into pre-training is hitting the law of diminishing returns, the industry consensus is still that model size is the number one driver of model performance (echoed by Mark Zuckerberg, Jensen Huang, etc.) and that the next generation of models is aiming for a 10x increase in model scale, requiring 300k GPUs. Personally, I think both are true. As long as compute resources and access to quality training data are in balance, i.e. as long as we can access and/or generate more quality training data, we will keep seeing returns from adding more compute to pre-training. But in the short term we will see even better returns from less explored avenues, such as inference-time compute.

Inference-Time Compute: The Next Frontier in AI Compute

Training large AI models is expensive and time-consuming, often requiring millions of dollars and months of compute time. With data becoming scarce and energy costs rising, researchers are increasingly focused on inference-time compute (or test-time compute): spending additional compute during the inference phase, when models are actively used to make predictions. This allows a model to generate and evaluate multiple possibilities in real time (essentially asking itself “is this the best possible answer I can come up with?” before sharing the answer), improving decision-making without the need for additional costly training.
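One simple way to picture “generate and evaluate multiple possibilities” is best-of-N sampling: draw several candidate answers, score each one, and return the best. The sketch below is a toy illustration of that pattern only. The generator and scorer are hypothetical stand-ins, and this is not a description of how o1 or any other production model works.

```python
import random

def best_of_n(generate, score, prompt, n=8, seed=0):
    """Sample n candidate answers and keep the one the scorer likes best."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a real system would call a model and a verifier here.
def toy_generate(prompt, rng):
    return rng.randint(0, 100)  # pretend each sample is a numeric answer

def toy_score(answer):
    return -abs(answer - 42)    # pretend the verifier prefers answers near 42

single_try = best_of_n(toy_generate, toy_score, "toy question", n=1)
sixteen_tries = best_of_n(toy_generate, toy_score, "toy question", n=16)

print(single_try, sixteen_tries)
```

More samples cost more compute per query, but the best of sixteen tries is never worse than the first try alone. That is the basic trade the inference-time approach makes.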

OpenAI’s new o1 model exemplifies this approach, using multi-step reasoning to tackle complex tasks more efficiently. For instance, giving the model 20 seconds to think through a problem can boost its performance more than scaling up the model size by 100,000 times. This method, which builds on base models like GPT-4, is being adopted by other AI labs as well. For clarity, the “20 seconds” example is just to paint a picture of how it works; it is more about “thinking cycles”, rather than the actual time it takes to spit out an answer.

The shift in focus from massive pre-training clusters to inference clouds (distributed systems optimized for real-time processing) could significantly lower deployment costs and reshape the AI hardware landscape, moving away from reliance on Nvidia’s chips and toward more efficient, sustainable, and scalable solutions. This is one of the main reasons why I decided to join Persimmons.ai last year.

Sustainability, and the future of this blog series

AI compute, which powers both training and inference, is essential for modern AI systems and future advancements. However, this growth also brings sustainability challenges, as AI compute requires a substantial amount of energy. What I love about Persimmons.ai is that we’re enabling AI to advance in a sustainable manner.

In future posts I will focus on interviewing thought leaders to explore these topics in greater depth and discuss the compute innovations shaping the future of AI, and inference in particular.

The Role of AI Compute in the next phase of AI was originally published in Persimmons AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Underrated Backbone of Generative AI: Why Hardware Innovation Is More Critical Than Ever

Valerie C — Fri, 07 Mar 2025 00:43:39 GMT

Generative AI has captured the world’s imagination. Every day, startups and tech giants race to push the boundaries of large language models, multimodal systems, and creative AI applications. Buzzwords like “Enterprise-grade AI,” “Inference-time Compute,” “CoR/CoT,” or “Multimodal” fill pitch decks and headlines, but in many cases, they boil down to calling an API or tweaking an existing approach and repackaging it as innovation. While it’s true that game-changing AI products can define entire markets, a persistent refrain has taken hold: “The real winners are in the application layer.” WRONG.

Skeptics claim that AI infrastructure investments are overhyped, overpriced, or simply out of runway. But this couldn’t be further from the truth.

That view, often echoed by leading large VC firms, ignores a foundational truth: the hardware infrastructure, especially for inference, underpins all the magic at the software layer.

Think about the sheer volume of data being generated every second: emails, messages, videos, enterprise workflows, research papers, economic transactions, healthcare diagnostics. A growing fraction of this data will be processed through AI models, extracting insights, automating tasks, and driving efficiencies.

And with the emergence of next-gen models like DeepSeek R1, featuring more parameters, longer context windows, and advanced domain specialization, the need for robust, scalable hardware will only intensify.

Inference as the Workhorse of Generative AI

Training tends to hog the spotlight in AI. But it’s inference, where models actually respond to real-world queries, that drives daily value. Every chatbot prompt, every image and video generation request, every enterprise AI workflow: they all depend on inference.

As new models from OpenAI push the envelope with hundreds of billions or even trillions of parameters, these already hefty computational demands balloon. Throwing more off-the-shelf general-purpose GPUs at the problem simply isn’t sustainable. Energy usage goes up, data center footprints expand, and costs can become exorbitant. Specialized, next-generation hardware, purpose-built for massive-scale inference, becomes critical to keep pace with user expectations and business needs.

Generative AI models are so compute intensive that they often become victims of their own success. The second an AI-powered feature gains traction, latency demands and costs skyrocket.

For consumer applications, a fraction of a second of delay kills engagement and the user experience.

For enterprise AI, compute inefficiency can make entire business models unviable.

Inference infrastructure is the ONLY way to fix this. Faster, cost-effective, and energy-efficient hardware is the key to:

✔️ Reducing cloud costs for AI across the board
✔️ Eliminating bottlenecks in enterprise AI workflows
✔️ Powering real-time AI experiences that don’t feel sluggish

Cost & Latency: The Twin Challenges

When it comes to real-time AI experiences, cost and latency go hand in hand:

Latency: A fraction of a second can be the difference between delightful interactions and “spinning wheel” frustration. Real-time (or near-real-time) responses are crucial for user adoption and satisfaction.

We’ve all seen it:

🔄 “Processing… please wait.”
🌀 The spinning wheel of doom.
⏳ “You’re in the queue.” Woohoo!

Cost: Large language models and multimodal systems are notoriously resource-intensive. Across hundreds of thousands or millions of daily inferences, even small inefficiencies add up to staggering operating costs.
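A little arithmetic shows how quickly small inefficiencies compound. Every number in the sketch below (traffic, GPU-seconds per inference, hourly GPU price) is a made-up placeholder chosen for illustration, not a measurement.

```python
def daily_cost(inferences_per_day: float, gpu_seconds_per_inference: float,
               gpu_cost_per_hour: float) -> float:
    """Daily serving cost if every inference occupies a GPU for the given time."""
    gpu_hours = inferences_per_day * gpu_seconds_per_inference / 3600
    return gpu_hours * gpu_cost_per_hour

# Hypothetical workload: 10 million inferences a day at $3 per GPU-hour.
baseline = daily_cost(10_000_000, 2.0, 3.0)
ten_percent_slower = daily_cost(10_000_000, 2.2, 3.0)

extra_per_year = (ten_percent_slower - baseline) * 365
print(f"baseline per day: ${baseline:,.0f}")
print(f"extra per year from a 10% inefficiency: ${extra_per_year:,.0f}")
```

In this toy scenario a 10% inefficiency per inference adds roughly $600,000 a year, and the figure scales linearly with traffic.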

Specialized AI accelerators, memory architectures, and chip designs directly address these challenges. They slash the cost per inference while trimming critical milliseconds off response times. As advanced models like DeepSeek R1 pack in more parameters, the payoff from optimized hardware only grows bigger.

The Energy footprint of AI is becoming unsustainable.

AI success isn’t just about performance and user satisfaction; it’s also about sustainability. While training a huge model already consumes a vast amount of energy, at scale, the sum total of inference can be even greater over a model’s lifecycle.
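The point about lifecycle energy can be sketched as a break-even calculation: how many inferences does it take before serving has used as much energy as training did? The energy figures below are hypothetical placeholders, not measurements of any real model.

```python
def breakeven_inferences(training_energy_wh: float, energy_per_inference_wh: float) -> float:
    """Inferences after which cumulative serving energy equals training energy."""
    return training_energy_wh / energy_per_inference_wh

# Hypothetical numbers: 1 GWh to train, 1 Wh per inference, 10 million inferences a day.
TRAINING_WH = 1e9
PER_INFERENCE_WH = 1.0
DAILY_INFERENCES = 1e7

inferences_needed = breakeven_inferences(TRAINING_WH, PER_INFERENCE_WH)
days_to_breakeven = inferences_needed / DAILY_INFERENCES

print(f"{inferences_needed:,.0f} inferences, about {days_to_breakeven:.0f} days")
```

With these placeholder values, serving overtakes training after about 100 days, and every day after that adds to the inference side of the ledger.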

That’s why hardware innovation is pivotal. From more efficient transistors and specialized AI processors to photonic chips, greener hardware is the next frontier. Organizations that take the lead here don’t just lower operating costs, they also burnish their reputations with customers, investors, and regulators, who increasingly demand that companies meet sustainability targets.

Hardware as a Competitive Differentiator

Throwing more general-purpose GPUs at the problem isn’t the answer.

Amid the frenetic rush to capture AI market share, companies that invest in hardware R&D are building an unrivaled competitive moat. They can deliver faster, cheaper, and more energy-efficient AI, enabling use cases previously considered impossible.

Big tech players like Meta, Microsoft, Google, and AWS already have substantial in-house hardware teams, precisely because they recognize that software breakthroughs alone won’t cut it. Owning or co-developing the hardware layer can create a critical performance edge, setting the stage for the next wave of AI applications, especially as models like DeepSeek R1 raise the bar for compute demands.

The Application Layer Can’t Succeed Alone

The argument that the “real winners” sit exclusively at the application layer overlooks a critical reality: even the slickest AI feature is only as good as the infrastructure beneath it.

If the data center struggles under the weight of increased inferences, performance tanks.
If you can’t keep up with user demand cost-effectively, your business model crumbles. Good luck with your ROI.
If energy requirements balloon out of control, sustainability pledges become empty words.

Even the most compelling software application remains bound by the limits of the hardware. As advanced models become more commonplace, these limits will be tested, and those lacking robust infrastructure will feel the strain first.

A Holistic AI Stack, Now More Than Ever

Generative AI’s full potential will only be realized when hardware and software innovators work in lockstep.

We need:

Customized Accelerators: Purpose-built for large-scale AI workloads rather than just reusing general-purpose GPUs.
Optimized Inference Systems: To deliver speed with lower power consumption.
Advanced Memory Architectures: To mitigate data-transfer bottlenecks, which are often the silent latency killers.
Efficient Cooling & Power Management: Essential to keep data centers sustainable and operational costs in check.
Close Hardware-Software Integration: Deep compiler and framework optimizations that align model execution with hardware features.

As models such as DeepSeek R1 become the new baseline, the call for hardware-software co-design grows louder. A piecemeal approach is no longer enough.

Why DeepSeek R1 Raises the Stakes

DeepSeek R1 represents the possibilities of the next generation of generative models: more parameters, more intricate reasoning, and deeper multimodal capabilities. With that comes greater demand for memory, bandwidth, and raw computational power at inference time.

Bigger Parameter Counts: Mean significantly higher memory requirements and more data to move per query.
Longer Context Windows: Users expect more context, which translates to more tokens per request and heavier compute loads.
Advanced Domain Specialization: This often involves additional sub-modules or fine-tuned knowledge bases that need to be accessed in real time, increasing the complexity of inference pipelines.
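The first two points can be turned into a rough ceiling on generation speed. During decoding, each new token requires reading the model weights plus the cached context from memory, so memory bandwidth caps tokens per second. The sketch below assumes a single request on a single device, and all sizes and the bandwidth figure are hypothetical.

```python
def max_decode_tokens_per_s(weight_bytes: float, kv_bytes_per_token: float,
                            context_tokens: int, mem_bandwidth: float) -> float:
    """Upper bound on decode speed when every step re-reads weights and KV-cache."""
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_tokens
    return mem_bandwidth / bytes_per_step

BANDWIDTH = 4e12  # hypothetical: 4 TB/s of memory bandwidth

# Hypothetical smaller model with a short context ...
small = max_decode_tokens_per_s(100e9, 200_000, 8_000, BANDWIDTH)
# ... versus a 10x larger model with a 16x longer context.
large = max_decode_tokens_per_s(1000e9, 200_000, 128_000, BANDWIDTH)

print(f"small: {small:.1f} tokens/s, large: {large:.1f} tokens/s")
```

In this toy comparison the ceiling drops from about 39 tokens per second to about 4, with no change in raw compute. This is the sense in which more parameters and longer contexts mean more data to move per query.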

In other words, DeepSeek R1 (and similar emerging models) is the poster child for why next-generation hardware, especially purpose-built inference infrastructure, must evolve. It’s no longer enough to rely on GPU brute force. We need specialized solutions that can handle the spike in complexity and deliver results instantaneously, all while managing power consumption responsibly.

AI at Scale Will Fail Without Sustainable Infrastructure

Generative AI has already dazzled the world with art, conversation, videos, and other creative feats, but we’re just scratching the surface. Behind each jaw-dropping AI application stands a foundation of hardware innovation that too often goes unrecognized.

As cutting-edge models push computational boundaries, they underscore an urgent truth: the future of AI won’t be won by software alone. Infrastructure, particularly for large-scale inference, is becoming the decisive factor. Those who invest now in robust, efficient hardware tailored for AI’s intense demands will ultimately define the next era of generative AI, rewriting the playbook for what’s possible in performance, sustainability, and transformative potential.

Right now, training a large-scale AI model burns enough energy to power a small city. But the real issue? Inference can be even worse.

The real moat is hardware and compute capacity.

The Underrated Backbone of Generative AI: Why Hardware Innovation Is More Critical Than Ever was originally published in Persimmons AI on Medium, where people are continuing the conversation by highlighting and responding to this story.