Are foundation model applications complex systems?

Davis Treybig
Innovation Endeavors
Dec 13, 2023

Over the last year, we have seen the first wave of foundation model native companies start to be built.

Over this timeframe, the predominant architecture for building applications or features on top of such models has evolved dramatically: from zero-shot inference, to in-context learning, to retrieval augmented generation, and now increasingly to more sophisticated agentic systems that integrate memory & planning sub-systems with some sort of recursive control loop.

Looking at this progression and playing it forward, it is clear to me that most foundation model applications in the future will look like complex systems. For those unfamiliar, a “complex system” in computer science parlance is a system primarily defined by emergent, difficult-to-predict behavior that results from intricate interactions between its components. Importantly, complex systems typically need to be designed, built, and tested in very specific ways.

This essay articulates why I think this is the case for applications built on foundation models, and what it implies for the way we build, test, and run such applications.

“Foundation System” Architecture

Many of the most interesting applied LLM startups today have a core system architecture that includes the following components:

  1. Data pre-processing and chunking pipeline
  2. Embedding models
  3. Retrieval stack (Offline) — index creation & data storage
  4. Retrieval stack (Online) — query expansion, filtering, semantic search, lexical search, multi-stage reranking, and other forms of context expansion
  5. Large language model(s)
  6. A “Control Loop” or planning layer which guides a series of recursive calls across your large pre-trained models, simpler classifier models, a retrieval stack, and other business logic.
  7. A memory layer that keeps track of what the system has attempted or learned
  8. Secondary modules, such as a Trust & Safety layer, or a compiler/linter layer if you are building LLMs for software engineering

You might refer to this stack as a “Foundation System”, where the end user task is only solved by all of these components working together in unison, and where the model is actually only a very small sub-component. This is analogous to the idea that recommendation engines are better thought of as systems, not models.
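To make this concrete, here is a minimal Python sketch of how such a foundation system's control loop might tie these components together. All of the component functions and names (retrieve, call_llm, passes_safety, Memory) are illustrative placeholders assumed for this sketch, not references to any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Tracks what the system has attempted or learned across steps."""
    steps: list = field(default_factory=list)

    def add(self, action: str, result: str) -> None:
        self.steps.append((action, result))

def retrieve(query: str) -> list[str]:
    # Placeholder for the online retrieval stack: query expansion,
    # semantic + lexical search, reranking, and other context expansion.
    return [f"doc relevant to: {query}"]

def call_llm(prompt: str) -> str:
    # Placeholder for the large language model call.
    return f"answer based on ({prompt[:40]}...)"

def passes_safety(text: str) -> bool:
    # Placeholder for a secondary Trust & Safety module.
    return "unsafe" not in text

def run_foundation_system(task: str, max_steps: int = 5) -> str:
    """Recursive control loop tying the sub-components together."""
    memory = Memory()
    answer = ""
    for _ in range(max_steps):
        # Planning layer decides the next sub-goal given memory so far.
        plan = call_llm(f"Plan next step for task: {task}\nHistory: {memory.steps}")
        context = retrieve(plan)
        answer = call_llm(f"Task: {task}\nContext: {context}\nPlan: {plan}")
        memory.add(plan, answer)
        if passes_safety(answer) and "answer" in answer:
            break  # crude stopping criterion, purely for the sketch
    return answer

if __name__ == "__main__":
    print(run_foundation_system("Summarize our Q3 sales numbers"))
```

Even in this toy version, the quality of the final answer depends on the interplay of the planner, retriever, memory, and safety check rather than on any single model call.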

I would argue that foundation model applications built from these constituent parts demonstrate many of the defining characteristics of a complex system:

  1. The task-specific performance of the end-to-end system is a complex, emergent property of how all these elements interact
  2. The system has complex interdependencies. Changes X and Y may each improve system performance in isolation, but degrade it in conjunction
  3. Attribution is very complex. It is hard to understand what root cause drove quality improvement or degradation
  4. The permutations of ways to optimize the system are almost infinite

You can easily see these dynamics reflected when you speak with founders building complex LLM-based applications. Many feel they are playing whack-a-mole as they make changes, having little to no ability to reason about whether individual component changes will improve system performance, and struggling to attribute quality improvements or degradations to specific code changes.

Practically though, this means that it is harder to fix small bugs. You need to thoroughly test every change, and debugging can become infuriating. We definitely feel like we are playing “whack-a-mole” sometimes. — Einblick


Building & Testing Complex Systems

It is then interesting to consider the contrast between how foundation model based applications are built & tested today and what is known about how to build and test complex systems.

The way we program and evaluate FM applications today looks very much like how we build deterministic, classical software systems. Prompt engineering is a very pure expression of this, involving humans manually tuning an array of prompts for each step in their LLM chain, though the problem is not isolated to prompt engineering alone.

The status quo, more broadly, is engineers manually tuning each and every piece of the system architecture above in a modular fashion, hoping aggregate quality improves over time. For example, manually testing a few embedding models before determining which should be used as part of a complex RAG pipeline.

Unfortunately, it is quite common for this sort of change to test well locally (e.g. the new embedding model improves retrieval tests) but degrade global quality (end-task performance suffers), or to reduce quality when combined with other seemingly independent changes, such as a new chunking strategy or retrieval pipeline logic.
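As a toy illustration of this local-versus-global gap, consider the following sketch. The scores are made-up placeholder numbers purely to show how the component that wins a local benchmark can lose once interactions with other components (here, chunking) are taken into account.

```python
def retrieval_recall(embed_model: str) -> float:
    # Local test: how well does this embedding model retrieve relevant chunks?
    scores = {"model_a": 0.71, "model_b": 0.78}  # hypothetical offline numbers
    return scores[embed_model]

def end_to_end_quality(embed_model: str, chunking: str) -> float:
    # Global test: task performance of the full pipeline. Interactions between
    # components mean the "better" embedding model can still lose here.
    scores = {
        ("model_a", "by_paragraph"): 0.68,
        ("model_b", "by_paragraph"): 0.59,  # locally better model, globally worse
        ("model_a", "by_sentence"): 0.61,
        ("model_b", "by_sentence"): 0.63,
    }
    return scores[(embed_model, chunking)]

# Picking the component with the best local score...
best_local = max(["model_a", "model_b"], key=retrieval_recall)
# ...does not guarantee the best system, which depends on the combination.
best_global = max(
    [(m, c) for m in ["model_a", "model_b"] for c in ["by_paragraph", "by_sentence"]],
    key=lambda pair: end_to_end_quality(*pair),
)
print(best_local)   # model_b
print(best_global)  # ('model_a', 'by_paragraph')
```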

If you look at other classical areas of complex systems, such as autonomous vehicles, mechanical and electrical engineering, and spacecraft, the clear takeaway is that complex systems can only be optimized and evaluated via large-scale simulation and, potentially, end-to-end learning.

The rationale behind this is simple: because the system is so complex, so difficult to reason through, and subject to some degree of randomness or stochasticity, it is untenable for a human to manually tweak individual components. Rather, the parameter space must be explored broadly and the system optimized end-to-end.

Concrete examples of this include how the autonomous vehicle industry is moving to end-to-end learning, how simulation tools like Simulink are the de facto way almost all mechanical and electrical systems are built and optimized, and how NASA does extensive simulation testing for their spacecraft.

You also see hints of this approach in “normal” software systems when complex or non-deterministic behavior is possible. For example, chaos engineering is used to test complex chains of microservices operating over a network, and fuzzing is often used to test complex, mission critical systems like web browsers and operating systems by sweeping over the entire input space of the system. Simulation is widely used in complex distributed systems, such as databases, as well — see here and here.

I therefore believe that the developer toolchain for building with foundation models is going to have to move in this direction, especially as we try to build increasingly sophisticated applications such as long running agent-based systems. What is particularly interesting about foundation models is that this complexity creeps in on essentially day 1 of building software, whereas in “classical” software systems this sort of complex non-determinism typically only emerged at extreme scale in highly networked systems or in highly asynchronous systems.

If you look closely, you can already see some early examples of this taking shape.

Early examples of this in the foundation model market

DSPy

DSPy is a declarative framework for writing language model chains. You provide only the function signatures you want your chain to adhere to and a few labeled input/output examples, and the framework compiles the intermediate prompts and in-context learning examples for every sub-step of the chain.

This stands in stark contrast to many mainstream LLM frameworks, such as Langchain, which are imperative by nature and involve the developer manually writing and optimizing each intermediate step.
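For a flavor of what this looks like in practice, here is a minimal RAG program in the DSPy style, adapted from the project's public examples at the time of writing (exact API details may have changed since):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes a language model and retrieval model have already been set up via
# dspy.settings.configure(lm=..., rm=...).

# Declare *what* each step should do via a signature; no hand-written prompts.
class GenerateAnswer(dspy.Signature):
    """Answer a question using retrieved context."""
    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

class RAG(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)

# A handful of labeled examples plus a metric drive the "compilation"...
teleprompter = BootstrapFewShot(
    metric=lambda example, pred, trace=None: example.answer in pred.answer
)
# ...which generates the prompts and demonstrations for each sub-step.
# trainset would be a list of dspy.Example(question=..., answer=...) items:
# compiled_rag = teleprompter.compile(RAG(), trainset=trainset)
```

The key point is that no prompt text appears anywhere in the program; the compiler derives it from the signatures, the labeled examples, and the metric.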

Prompt Optimization

Declarative prompt engineering is a paper out of Berkeley that highlights how you might produce and optimize prompts for large models from a declarative system specification.

We similarly envision users of LLMs specifying their data processing objectives at a high level with the system decomposing it into unit calls to one or more types of LLMs with appropriate prompts, acting on one or more units of data, orchestrating the calls, and issuing more as needed, until completion to the desired degree of accuracy, within the specified monetary budget.

You’ll note that this is very similar in spirit to the idea of DSPy — the user only provides a high level task input, and all intermediate steps & actions are procedurally optimized by the system.

This paper on large language models as optimizers and PromptBreeder explore similar ideas along these lines.

LLM Evaluation

I am now seeing companies working on large language model evaluation that are considering how simulation might be applied to RAG & agent systems.

The core idea is that you can train an eval or reward model that mimics user preferences in a given domain, and then run the end-to-end RAG system in simulation.

You can then essentially do a hyperparameter sweep over subcomponents of the system (chunking, embedding, retrieval, etc.), optimizing against the objective function of your domain- or task-specific reward model.
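A rough sketch of what such a sweep could look like is below. The configuration space, run_pipeline, and reward_model are hypothetical placeholders standing in for the real system and a learned preference model.

```python
from itertools import product

# Hypothetical configuration space over sub-components of the RAG system.
CHUNK_SIZES = [256, 512, 1024]
EMBEDDING_MODELS = ["embed-small", "embed-large"]
RERANKERS = [None, "cross-encoder"]

SIMULATED_QUERIES = ["how do I reset my password?", "what is our refund policy?"]

def run_pipeline(query: str, chunk_size: int, embedding_model: str, reranker) -> str:
    # Placeholder for running the full, end-to-end RAG system with this config.
    return f"answer({query}, {chunk_size}, {embedding_model}, {reranker})"

def reward_model(query: str, answer: str) -> float:
    # Placeholder for a learned eval/reward model mimicking user preferences.
    return float(len(answer) % 7) / 7.0  # dummy score purely for illustration

def score_config(config) -> float:
    chunk_size, embedding_model, reranker = config
    scores = [
        reward_model(q, run_pipeline(q, chunk_size, embedding_model, reranker))
        for q in SIMULATED_QUERIES
    ]
    return sum(scores) / len(scores)

# Sweep the whole configuration space and keep the globally best combination.
best = max(product(CHUNK_SIZES, EMBEDDING_MODELS, RERANKERS), key=score_config)
print("best configuration:", best)
```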

Outside of domains where massive user scale makes it relatively easy to A/B test RAG-style systems (e.g. Copilot), it seems to me that simulation may be the only way to gather enough signal to fully optimize a foundation model system.

Synthetic Data Generation and RLAIF

Anthropic’s Constitutional AI concept is another variation of this idea. A set of baseline principles is given to the model, and the model then dynamically generates training data based on those principles, evaluates & refines that data, and learns from it.

You can almost think of this as a declarative form of fine-tuning, where the only input is a set of constitutional principles, and everything else is left to be optimized by the system.
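Schematically, the critique-and-revise data generation loop looks something like the following simplified sketch (the llm function is a placeholder, and the principles are abbreviated examples rather than Anthropic's actual constitution):

```python
PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could encourage illegal or dangerous activity.",
]

def llm(prompt: str) -> str:
    # Placeholder for a call to the base model.
    return f"response to: {prompt[:60]}..."

def generate_training_pair(prompt: str) -> tuple[str, str]:
    """Generate a (prompt, revised_response) pair from principles alone."""
    draft = llm(prompt)
    for principle in PRINCIPLES:
        critique = llm(f"Critique this response against '{principle}':\n{draft}")
        draft = llm(f"Revise the response to address the critique:\n{critique}\n{draft}")
    return prompt, draft

# These self-generated pairs become supervised fine-tuning data, and in the
# RLAIF stage, preference labels for training a reward model.
dataset = [generate_training_pair(p) for p in ["How do I pick a strong password?"]]
```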

Mixture of Expert Models

Most of the major large language model providers are moving towards “Mixture of Experts” architectures, where a gating network and a set of expert models are trained jointly, with the gating network routing each input to the appropriate experts. In other words, the entire inference engine is trained end-to-end and is closer to a system than a single model.

This tweet provides a good, more in-depth explanation.
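As a rough illustration of the joint gating-plus-experts structure, here is a toy top-k Mixture of Experts layer in PyTorch. It is a sketch of the general idea, not any provider's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy Mixture of Experts: a gating network routes each token to top-k experts."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Gating scores decide which experts handle each token.
        scores = self.gate(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # route each token to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Because the gate and the experts are trained jointly, the "model" is really a routed system.
tokens = torch.randn(10, 64)
print(MoELayer()(tokens).shape)  # torch.Size([10, 64])
```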

Looking to the future

In general, it seems to me that we are in the “assembly language” stage of writing and testing foundation model systems. Higher-level abstractions that move away from prompting and hand-tuning each sub-component, towards declarative inputs which are then “compiled” into a dynamically optimized system, feel like the only reasonable way to push applications like this to the next level of scale and complexity.

This has significant implications for the way tooling for building, testing, evaluating, and optimizing foundation model applications will evolve. I think the default developer abstraction for systems like this will increasingly trend towards the following (a rough sketch follows the list):

  1. High level task specification & principles
  2. A small set of labeled input/output data pairs
  3. Perhaps a basic overview of system components (retrieval stack, embedding model, etc)
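For illustration, such a declarative input might look roughly like this hypothetical spec (not any existing framework's API):

```python
# Hypothetical declarative spec for a foundation model system.
# Everything below the task + examples would be searched and optimized automatically.
system_spec = {
    "task": "Answer customer support questions using our internal knowledge base.",
    "principles": [
        "Never speculate about pricing.",
        "Cite the source document for every claim.",
    ],
    "labeled_examples": [
        {"input": "How do I reset my password?",
         "output": "Go to Settings > Security > Reset password. [source: kb/security.md]"},
    ],
    "components": {
        "retrieval": {"index": "support_kb", "strategy": "auto"},
        "embedding_model": "auto",
        "chunking": "auto",
        "llm": "auto",
    },
    "constraints": {"max_cost_per_query_usd": 0.02, "target_accuracy": 0.9},
}

# A compiler/optimizer would then generate prompts, select components, and tune
# the system end-to-end against these examples and constraints.
```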

This will kick off a cascade of synthetic data generation, system simulation, and optimization. While I certainly don’t think hand engineering sub-modules will go away entirely, I think the future will look very different from how systems like this are built today. Exactly how far this sort of idea can be taken remains to be seen, but it will be interesting to observe.


If these observations are correct, I think this will have a large impact on LLM developer frameworks. It may also allow for an Applied Intuition-esque opportunity in this market.

If you have thoughts on simulation, declarative frameworks, or optimization of LLM systems, or if you disagree with all of this, I’d love to chat — davis (at) innovationendeavors.com

Thanks to Omar Khattab and Jacopo Tagliabue for reviewing this.
