Stories by Marcelo Emmerich on Medium

Why Decoupling Prompts From Your Codebase Is the Next Big Step for AI Applications

Marcelo Emmerich — Fri, 09 Jan 2026 14:29:54 GMT

Whether you’re building a simple chatbot or a multi-agent system that coordinates tools, prompts are no longer “magic text” you paste into your source files. They’ve become infrastructure. And like infrastructure (databases, APIs, config), they deserve their own lifecycle and tooling.

In this post I’ll explain why separating prompts from your codebase improves agility, collaboration, governance, and agentic behaviour, and how a tool like Promptman helps teams manage prompts as first-class artifacts.

Prompts Deserve a Lifecycle of Their Own

In traditional software engineering, we treat things like source code, configuration, and database schemas differently because they change at different rhythms and have different stakeholders. Prompts are more like content and config than compiled code: they evolve fast, often daily, and the person best suited to refining them isn’t always a developer.

Yet in most early AI stacks, prompts live buried inside source files — large blocks of text inside strings, hard to diff, impossible to work on without touching code. This creates several problems:

Slow iteration cycles: every prompt change requires a deploy.
Developer bottlenecks: non-technical contributors can’t easily improve prompt wording.
Version confusion: it’s hard to track which prompt version is running where.
Merge pain: teams editing prompts in parallel collide in Git.

Prompt management experts now advocate decoupling prompts from application code and treating them as versioned assets managed independently. This unlocks faster iteration without redeploying core application logic.

Faster Iteration and Safer Deployments

When prompts are stored inside your codebase, you inherit all the constraints of your application’s deployment process. Fix a typo? Change behaviour? You need to cut a release.

By storing prompts outside of your core code, i.e. in a managed registry or API, you can update them independently of your app. This reduces cycle time and isolates risk: prompt tuning becomes a lighter-weight practice rather than a full software release task.

Promptman, for example, provides a REST API where you can fetch prompt versions at runtime, separate from your code deployments. Your app just calls the Promptman API and receives the right prompt content based on environment (development, staging, production). That means your app logic stays stable while prompts evolve.

Collaboration Across Teams

Prompt engineering isn’t purely a developer discipline anymore. Product managers, UX writers, domain experts, even legal teams often need to tune how an AI behaves; tone, safety, instructions to follow or not follow.

When prompts live in the code, only developers with repository access can change them — and each change goes through the standard PR/deploy pipeline.

Decoupled prompts open the door to cross-functional collaboration. With a tool like Promptman:

Prompt editors can refine text without touching code.
Teams can review diffs and version histories.
Rollbacks are just another API version change.

This improves alignment and dramatically speeds up experimentation.

Continuous Experimentation & Governance

One of the least appreciated benefits of decoupling is observability and governance. When prompts are managed outside of code:

You can track which prompt version drove which load or outcome.
You can audit changes over time.
You can A/B test different prompt versions and measure performance.
You can enforce review and access policies.

This mirrors best practices in config management and infrastructure as code — except now applied to behavior instructions for large language models.

Promptman makes this tangible by managing prompts across stages (i.e. dev/staging/prod) and letting you plug different versions into your pipeline without changing your app code.

Essential for Agentic & Multi-Agent Workflows

The need to decouple becomes even more obvious in agentic applications, where multiple agents orchestrate tools or act autonomously over time.

In agentic systems you might have:

Complex state and context passed between agents
Multiple models using different prompt templates
Dynamic behaviours emerging from prompt changes

Here the prompt isn’t just a static instruction, it’s the logic that tells an agent how to behave. Hardcoding it means every behavioural tweak results in a full release cycle.

Separating prompt definitions lets you evolve agent behaviour without changing application code. Agents read their instructions from a managed prompt catalog, so your orchestration layer stays unchanged while behaviour evolves quickly.

This dramatically reduces friction when creating sophisticated multi-agent flows.

Treat Prompts Like Infrastructure

At the end of the day, decoupling prompts from your codebase is about recognizing the role prompts play in AI applications. They are not incidental strings — they are the rules and context for your AI’s decision-making.

Putting them in your source code is like hardcoding your database schema into your app logic. It might work for prototypes, but it won’t scale.

Tools like Promptman help you take the next step:

Version control for prompts
Runtime fetching via API
Team collaboration and lifecycle management
Stage-aware deployments

All of this makes your AI application more maintainable, team-friendly, and ready for the real world of fast-moving agents.

If you’re building anything beyond a simple one-off prompt today — especially agentic applications — it’s time to liberate your prompts from your codebase. Your dev team (and your future self) will thank you.

The package manager of the future…

Marcelo Emmerich — Fri, 19 Sep 2025 21:52:56 GMT

… will manage AI prompts!

I know this will sound crazy at first, and I know I will get roasted again, but hear me out on this.

The world of developer tooling is on the cusp of a radical transformation. For decades, we’ve relied on package managers like npm, pip, and NuGet. These tools revolutionized development by making it simple to share and reuse code. But they also created a new kind of software supply chain, one that’s now under fire.

A new Era of Developer Tools

What if we could bypass this entire system? The rise of powerful AI coding assistants isn’t just about writing code faster; it’s about shifting the very nature of how and when we build things.

Imagine a world where the standard model of “Download, Install, and Hope for the Best” is replaced by “Prompt, Generate, Test and Review.”

Instead of publishing a package to a public registry, package providers would publish a canonical prompt. This isn’t a line of code; it’s a meticulously crafted set of instructions for an AI. It would describe everything the AI needs to know to build a functional, secure, and idiomatic library.

The developer would then take this prompt, paste it into their AI coding tool, and watch as the AI instantly generates a custom-built solution, tailored to their project’s language, style, and needs.

Introducing the Prompt Registry

If every provider is giving out prompts instead of packages, the next logical step is a centralized place to find them. This is where a prompt registry comes in.

This new kind of registry wouldn’t store code. It would store verified, versioned, and community-rated prompts. It would become the new source of truth for developer components, with a number of key advantages:

Security: You’re not downloading executable code from an unknown source. The code is generated on your machine, in your environment, and you can review it before you ever run it. This eliminates the core risk of a supply chain attack. The prompt is just a markdown file in english, so it’s easier to scan for malicious intent than a whole package with an anormous dependency graph.
Customization: A prompt can be modified on the fly. Need the wrapper in Python instead of TypeScript? Just tell the AI. Want to add a specific caching mechanism? Edit the prompt. You get exactly what you need, with no extra bloat.
Zero Transitive Dependencies: Since the generated code is self-contained and only uses core language libraries, the problem of inheriting a graph of dependencies is eliminated.
Cross-Language Compatibility: A single, well-written prompt can generate code for a dozen different languages and frameworks, a task that would require maintaining a dozen separate packages today.
Safe CI/CD: Devs could stop installing a plethora of packages during each CI/CD run, maintaining private code registries, and running sophisticated automated scans for Licensing issues.

Yes, AI prompts could completely replace packages for many types of software components, fundamentally changing how we distribute and use code. This shift would be a move from distributing artifacts (pre-built code) to distributing instructions for code generation.

The current system of package managers, like npm or pip, is a product of a pre-AI world. It’s built on the assumption that code is a static entity that must be bundled, versioned, and shared as-is. AI challenges that assumption by making code dynamic and context-aware.

This new paradigm offers a solution to several major problems we face today.

The End of the Transitive Dependency Tree

The most significant benefit of a prompt-based registry though is the elimination of the transitive dependency problem. In the traditional model, a single dependency can pull in a vast, hidden network of other packages. This creates a supply chain security nightmare and leads to bloat and version conflicts.

With a prompt-based system, a developer uses an AI to generate code. That code is created from the ground up, not by pulling in a pre-built binary. The AI, with a well-crafted prompt, can write a component that is self-contained or relies only on a handful of common, core libraries. The “dependency” is no longer a complex, nested package but the AI’s ability to interpret and execute the instructions. The developer has full visibility into the code from the moment it is generated, with no hidden surprises.

The New Software Supply Chain

Instead of a supply chain based on artifacts and distribution, the new model is based on knowledge and intent.

Prompt Engineering is the New Publishing: Authors of prompts become the new publishers. Their job is not to maintain code but to craft prompts that produce secure, efficient, and up-to-date code.
The AI is the Compiler and Generator: The AI’s role shifts from a code assistant to a core part of the development process, turning instructions into tangible, runnable code.
Developer is the Auditor: The developer’s role is enhanced, moving beyond simply running npm install to reviewing and auditing the generated code, ensuring it meets their security and quality standards.

Back to Reality

This vision of a prompt-based registry is, for the time being, a theoretical one. The technology isn’t quite there yet. Today’s AI coding tools, while incredibly powerful for generating boilerplate and simple functions, still have significant limitations. They can produce code that contains bugs, security vulnerabilities, or simply doesn’t adhere to a project’s specific architectural patterns. They lack a deep understanding of business context and complex system design, which is what separates a working script from a maintainable, enterprise-grade solution.

Debugging AI-generated code is another challenge. When the AI makes a mistake, the “why” isn’t always clear, forcing developers to spend time reverse-engineering a solution they didn’t write. The crucial step of human review and refinement is still non-negotiable, and it adds friction to the process.

However, the rapid pace of development in the world of AI coding suggests this is not a permanent state. The AI models we have today are just the beginning. As large language models (LLMs) become more sophisticated, they’ll learn to understand context and identify subtle security flaws.

The future of developer tooling is not about AI replacing developers, but about AI changing the very foundation of how we build things. The transition from a package-based supply chain to a prompt-based one is an ambitious vision, but as AI continues to evolve at a blistering pace, what seems like science fiction today might just be a standard development workflow tomorrow. We’re on the cusp of a shift that will redefine what it means to be a programmer. The package manager of the future isn’t a tool that downloads code, but one that helps us create it.

WebGPU bugs are holding back the browser AI revolution

Marcelo Emmerich — Thu, 10 Jul 2025 05:42:54 GMT

Browser-based LLM inference through WebGPU promises to democratize AI by enabling privacy-preserving, zero-cost inference directly in users’ browsers, but critical implementation bugs and compatibility issues across Chrome, Firefox, and Safari are preventing this technology from reaching its massive potential. Despite WebLLM achieving 80% of native performance in ideal conditions, real-world deployment faces showstopping bugs including memory allocation failures, GPU driver crashes, and a fragmented ecosystem where only 65% of users have WebGPU-capable browsers. The gap between WebGPU’s theoretical capabilities and its buggy reality represents one of the most significant missed opportunities in modern web development. In this analysis I discuss how browser vendors’ implementation issues are creating a bottleneck that prevents millions of developers from deploying local AI applications that could transform privacy, reduce costs, and enable offline AI capabilities.

Current WebGPU implementation reveals critical gaps across all major browsers

WebGPU implementation status across browsers in July 2025 shows a deeply fragmented landscape that significantly hampers LLM inference deployment.

Chrome and Edge have supported WebGPU since version 113 (April 2023), but suffer from multi-GPU limitations preventing simultaneous adapter usage and power management bugs that force laptop users onto integrated graphics regardless of settings.
Firefox’s implementation remains experimental in Nightly builds only, with the team constrained by having just 3 full-time developers compared to Chrome’s “order of magnitude more” resources, resulting in only 90% spec compliance and frequent crashes with invalid shader modules.
Safari only recently enabled WebGPU in Safari 26 beta (June 2025), limiting availability to users on the latest beta operating systems.

The most critical issue affecting LLM inference is buffer size limitations that make large model deployment impossible on many devices. Safari’s Metal backend imposes a 256MB default buffer size limit on iPhone 6 devices, scaling up to only 993MB on iPad Pro, while Chrome’s maxStorageBufferBindingSize is often limited to 128MB despite reporting higher capabilities. These limitations create a "Binding size is larger than the maximum binding size" error that prevents loading model weights exceeding these thresholds. GPU driver compatibility varies dramatically across vendors, with NVIDIA GPUs on Windows generally performing well through DirectX 12, but Linux users face Vulkan driver issues requiring manual flag configuration. AMD GPUs experience more frequent driver-related crashes and rendering artifacts specific to Chrome, while Intel integrated graphics often fall back to software rendering or produce corrupted visual output.

Leading libraries struggle with WebGPU’s implementation inconsistencies

The browser-based LLM inference ecosystem has matured significantly in 2024, with WebLLM (v0.2.79) and Transformers.js (v3.0.0) emerging as production-ready solutions despite underlying platform challenges. WebLLM achieves up to 80% of native MLC-LLM performance, with benchmarks showing Llama-3.1–8B running at 41.1 tokens/second (71.2% native speed) and Phi-3.5-mini achieving 71.1 tokens/second (79.6% native speed) when WebGPU functions correctly. However, developers report frequent failures with fine-tuned models, encountering “ThreadPoolBuildError” and registry panic errors that have no clear resolution.

Transformers.js v3.0.0’s October 2024 release brought WebGPU support claiming “up to 100x faster than WASM,” but this performance gain is only achievable on the ~70% of browsers with proper WebGPU support. The framework provides over 1,200 pre-converted models, yet developers struggle with operator coverage gaps where certain ONNX operations fail on WebGPU, forcing fallback to slower WASM execution. Memory leak issues plague production deployments, with severe problems reported in Whisper transcription pipelines where GPU memory isn’t properly released, eventually causing browser crashes. Integration challenges extend to popular frameworks like LangChain.js, where API changes in WebLLM cause breakage requiring developers to pin specific versions, highlighting the ecosystem’s instability.

Technical architecture reveals fundamental memory and performance bottlenecks

WebGPU’s technical implementation exposes critical gaps between specification promises and browser reality that directly impact LLM inference viability. The W3C Candidate Recommendation specification defines robust memory management capabilities, but browser implementations suffer from garbage collection failures where GPU memory allocations aren’t properly released, requiring manual .destroy() calls that many developers miss. Browser memory limits create hard ceilings for model deployment, with practical limits around 4-6GB for quantized models before performance degradation or crashes occur, compared to native applications that can utilize full system memory.

WASM integration for CPU-side computations achieves near-native performance for specific operations, with recent additions like Relaxed SIMD providing 1.5–3x speedups for vector operations critical to LLM inference. However, the JavaScript-to-GPU memory transfer bottleneck remains severe, with traditional WebGL-style memory copying eliminating much of WebGPU’s theoretical advantage. Shader compilation represents another major pain point, with complex LLM kernels taking several seconds to compile on first use, creating unacceptable user experience delays. The situation is particularly problematic for mobile devices where shader compilation can timeout entirely, preventing model loading.

Developer community reports widespread frustration with production deployment

Community feedback from GitHub, Stack Overflow, and developer forums reveals deep frustration with WebGPU’s production readiness for LLM applications. Performance benchmarks show promise, with WebLLM demonstrating 62 tokens/second for Llama 3.2 models in optimal conditions, but developers report these numbers are rarely achievable in real-world deployments. Common failure modes include GPU process crashes with “GpuProcessHost: The GPU process died due to out of memory” errors that provide no recovery mechanism, and silent failures where processes terminate without error messages or crash dumps.

The most successful deployments remain limited to controlled environments: Chrome extensions with persistent background processing, demonstration websites with prominent compatibility warnings, and internal tools where IT departments can mandate specific browser versions. Enterprise adoption remains near zero due to WebGPU compatibility requirements conflicting with corporate security policies that disable experimental features. Mobile deployment is particularly problematic, with performance degrading by 60–80% on mobile GPUs compared to desktop, making real-time inference impractical for all but the smallest models.

Specific bugs create insurmountable barriers for reliable deployment

Documented bugs across browser implementations create a minefield for developers attempting production deployment. Firefox’s Bug #1864698 blocks WebGPU by default on macOS with the cryptic error “FEATURE_FAILURE_MAC_WGPU_NO_METAL_BOUNDS_CHECKS,” requiring manual configuration that 99% of users will never perform. Chrome’s Issue #329211593 prevents using multiple GPU adapters simultaneously, breaking laptop deployments where developers need to leverage discrete GPUs for inference while maintaining integrated graphics for display.

Memory-related crashes represent the most severe category of bugs, with GPUOutOfMemoryError conditions causing immediate browser tab crashes with no graceful degradation. Canvas resizing operations trigger memory leaks (GitHub issue #2520) where GPU memory isn’t released, gradually degrading performance until the browser becomes unresponsive. GPU vendor-specific issues compound these problems: NVIDIA’s 572.xx drivers cause crashes with RTX 30/40 series cards, AMD Radeon HD 7700 series produces triangulation artifacts in Chrome but not Edge, Intel integrated graphics experience driver hangs with indirect draw calls, and Apple Silicon devices show Metal backend issues preventing WebGPU initialization entirely.

Future roadmap shows promise but timeline remains uncertain

The path forward for WebGPU implementation shows both promise and continued uncertainty that will impact LLM inference capabilities through 2026. Firefox has tentatively planned WebGPU release for version 141, though Mozilla developers acknowledge their implementation remains at 90% functionality with significant spec compliance work remaining. Safari’s WebGPU support in Safari 26 beta represents progress, but full availability requires users to run beta operating systems, limiting real-world adoption. The W3C’s push toward Candidate Recommendation status indicates specification stability, but doesn’t address the implementation quality issues plaguing current browsers.

Browser vendors are pursuing critical improvements including Bindless resource support to remove texture and buffer count limitations essential for large model deployment, Multi-draw indirect capabilities enabling GPU-driven rendering optimizations that could improve inference performance, and 64-bit atomic operations necessary for advanced memory management patterns. However, no browser vendor has committed to specific timelines for fixing critical bugs like buffer size limitations or multi-GPU support. The WebNN (Web Neural Network API) represents a potential alternative approach, offering hardware-agnostic neural network acceleration, but remains in early experimental stages with deployment limited to Windows through DirectML.

The gap between potential and reality demands urgent action

WebGPU’s promise of democratizing AI through browser-based LLM inference remains tantalizingly close yet frustratingly unattainable due to implementation bugs and ecosystem fragmentation. While WebLLM demonstrates that browser-based inference can achieve 80% of native performance, the 20% performance gap combined with compatibility issues affecting 35% of users, memory limitations preventing large model deployment, and platform-specific bugs requiring extensive workarounds creates an environment where production deployment remains impractical for most use cases. The technology stands at a critical juncture where browser vendors must prioritize fixing fundamental issues over adding new features, or risk WebGPU becoming another abandoned web standard that never achieved its transformative potential. For organizations evaluating browser-based LLM deployment, the recommendation remains clear: prototype with WebGPU to understand its capabilities, but maintain native or cloud-based solutions for production workloads until at least 2026 when the ecosystem may mature sufficiently for reliable deployment.