Training Compound AI Systems

RAG has ushered in innovative ways to build around LLMs and data. Let's now see how to train compound AI systems that harness the power of LLMs across multiple dimensions.

Skanda Vivek
6 min read · Apr 21, 2024


Encapsulating Compound AI Systems | Skanda Vivek

Large Language Models (LLMs) have the potential to be a game changer in many industries. However, in conversations with practitioners, the observation is often that out-of-the-box LLMs are great for general tasks like information gathering or coding, but not so much in enterprise settings. Oftentimes, responses are too generic. One complaint is that LLM responses seem too mechanistic and less authentic (too many exclamation points, for example).

So far, incorporating private data within LLMs using methods like Retrieval Augmented Generation (RAG) has proved to be a way to mitigate some of these concerns. Through RAG, LLMs can be tailored to specific scenarios and produce valuable, personalized results. An example is a website chatbot that answers user queries using an LLM in combination with retrieval of the right documents. This can save valuable resources for the company and help customers, who no longer have to wait to reach someone for customer support.

However, training these systems can be hard. One issue is that these chatbots do not always return relevant documents. Another issue is hallucinations. For example, an Air Canada chatbot gave incorrect refund information to a customer, and the airline was subsequently held liable. Rather than focus on training the LLM itself, the idea behind compound AI systems is to improve task performance through system design.

Optimizing LLMs for tasks

Recently, a few frameworks like DSPy have emerged that optimize LLM prompts to maximize task performance. The premise of DSPy is fascinating: what if we could train prompts the same way we train model parameters? This idea has shown promise in academic settings, led by Stanford research. In a recent blog, I've also shown that it does well on representative tasks like Q&A over documents.

DSPy article | Skanda Vivek
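The core idea — treating prompts as trainable parameters — can be sketched in a few lines of plain Python. This is a toy illustration, not DSPy's actual API: `stub_llm`, the candidate prompts, and the dev set are all made up for the example, and the "LLM" is a hard-coded stand-in.

```python
# Toy sketch of prompt optimization: score candidate prompts against a
# small labeled dev set and keep the best one, just as a trainer keeps
# the best model parameters. stub_llm is a hypothetical stand-in for a
# real LLM call.

def stub_llm(prompt: str, question: str) -> str:
    # In this toy setup, only the more detailed prompt "answers" correctly.
    return "Paris" if "step by step" in prompt else "unknown"

def optimize_prompt(candidates, dev_set, llm=stub_llm):
    """Return the candidate prompt with the highest dev-set accuracy."""
    def accuracy(prompt):
        return sum(llm(prompt, q) == a for q, a in dev_set) / len(dev_set)
    return max(candidates, key=accuracy)

candidates = [
    "Answer the question.",
    "Answer the question. Think step by step before answering.",
]
dev_set = [("What is the capital of France?", "Paris")]
best = optimize_prompt(candidates, dev_set)
```

A real optimizer like DSPy searches a much richer space (instructions, few-shot demonstrations, pipeline structure), but the train/evaluate/select loop is the same.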

Optimizing RAG Systems

Optimizing RAG systems requires a multi-pronged approach. The embedding model, the chunking strategy, the LLM used for generating responses, and the context retrieval strategy represent four major components to optimize.

Training RAG Systems | Goku Mohandas
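To make one of these components concrete, here is a minimal sketch of a chunking strategy: fixed-size character chunks with overlap, the simplest of the options you would tune. Chunk size and overlap are exactly the kind of levers discussed below.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks.

    Overlap keeps sentences that straddle a boundary retrievable from
    both neighboring chunks. Real systems often chunk on tokens or
    sentence boundaries instead; this is a minimal illustration.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("".join(str(i % 10) for i in range(500)))
```

For a 500-character input this yields three chunks of up to 200 characters, each sharing 50 characters with its neighbor.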

Recently, there have been a lot of innovations in optimizing RAG retrieval. Here’s just a few examples:


In Self-RAG, the authors develop a clever way for a fine-tuned LM (Llama 2 7B and 13B) to output special tokens ([Retrieval], [No Retrieval], [Relevant], [Irrelevant], [No support / Contradictory], [Partially supported], [Utility], etc.) appended to LM generations. These tokens let the system decide whether a retrieved context is relevant, whether the generated text is supported by that context, and how useful the generation is.
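On the consumer side, a system using such a model has to separate the reflection tokens from the answer text. A minimal parsing sketch (my own illustration, not the Self-RAG authors' code):

```python
import re

# Reflection tokens of the kind Self-RAG emits; this helper simply
# splits them from the answer text in a generation. Illustrative only.
REFLECTION_TOKENS = {
    "[Retrieval]", "[No Retrieval]", "[Relevant]", "[Irrelevant]",
    "[No support / Contradictory]", "[Partially supported]", "[Fully supported]",
}

def parse_generation(text: str):
    """Return (reflection tokens, answer text) from an LM generation."""
    found = [t for t in re.findall(r"\[[^\]]+\]", text) if t in REFLECTION_TOKENS]
    answer = re.sub(r"\[[^\]]+\]", "", text).strip()
    return found, answer

tokens, answer = parse_generation(
    "[Retrieval][Relevant] Paris is the capital. [Fully supported]"
)
```

Downstream logic can then branch on the tokens, e.g. re-retrieve when `[Irrelevant]` appears, or flag the answer when support is contradictory.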


HyDE uses an LLM to create a hypothetical document in response to a query. This helps during retrieval, where the hypothetical document is used to retrieve the actual documents from the database. The advantage is that some user queries can be quite brief, providing too little context for embedding models, which thrive on rich text. The idea is that adding a hypothetical "ideal" document helps retrieve more relevant documents.
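The flow can be sketched end to end with stubs: generate a hypothetical document, embed it instead of the raw query, and rank the corpus by similarity. Every function here is a placeholder (word-overlap similarity stands in for dense embeddings), not a real library API.

```python
import re

def generate_hypothetical_doc(query: str) -> str:
    # Stub: a real system would ask an LLM to write an "ideal" answer passage.
    return f"A detailed passage answering: {query}"

def embed(text: str) -> set[str]:
    # Stub "embedding": the set of words. Real systems use dense vectors.
    return set(re.findall(r"[a-z]+", text.lower()))

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap stands in for cosine similarity between embeddings.
    return len(a & b) / len(a | b) if a | b else 0.0

def hyde_retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Retrieve using the embedding of a hypothetical document, not the raw query."""
    hypo = embed(generate_hypothetical_doc(query))
    return sorted(corpus, key=lambda doc: similarity(embed(doc), hypo), reverse=True)[:k]
```

With a richer LLM stub, the hypothetical document would contain vocabulary the short query lacks, which is exactly what makes HyDE effective in practice.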


Re-ranking is a simple (yet powerful) idea: first retrieve a large number of documents (say n = 25), then train a smaller re-ranker model to select the top k (say k = 3) documents out of the 25, and feed those to the LLM as context. This is a pretty cool technique, and it makes a lot of sense to train a smaller re-ranker model for specific RAG scenarios.
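The two-stage shape of retrieve-then-rerank can be sketched as follows. Both scorers are hypothetical stand-ins: word overlap for the fast first-stage retriever, and a made-up tie-breaking heuristic for the trained re-ranker.

```python
# Sketch of retrieve-then-rerank. In a real system the first stage is a
# cheap bi-encoder over the whole corpus and the second a slower, more
# accurate cross-encoder over the shortlist; both are stubbed here.

def retrieve(query: str, corpus: list[str], n: int = 25) -> list[str]:
    # Cheap first-stage score: number of words shared with the query.
    score = lambda d: len(set(query.lower().split()) & set(d.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:n]

def rerank(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Stand-in for a trained re-ranker: among docs sharing query words,
    # prefer shorter ones (a purely illustrative tie-breaker).
    score = lambda d: (len(set(query.lower().split()) & set(d.lower().split())), -len(d))
    return sorted(docs, key=score, reverse=True)[:k]

corpus = ["refund policy details", "shipping times", "refund form",
          "about us", "careers page"]
shortlist = retrieve("refund", corpus, n=3)
top = rerank("refund", shortlist, k=1)
```

The point of the split is cost: the expensive model only ever sees n documents, not the whole corpus.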

Forward-Looking Active Retrieval Augmented Generation (FLARE):

FLARE handles cases where you want answers to be correct and up to date, for which it makes sense to augment LLM knowledge with a real-time knowledge hub (the Internet). As you can see below, one solution is to iteratively combine Internet searches and LLM knowledge.

In this workflow, the user first asks a question, and the LLM generates an initial partial sentence. This partial generation acts as a seed for an Internet search query (e.g. "Joe Biden attended [X]"). The result of this query is then integrated into the LLM response, and this process of searching and updating continues until the generation is complete.
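The loop just described can be sketched with stubbed components. Here `[X]` marks a low-confidence span that triggers a search; the "LLM" and "search engine" are hard-coded toys, not real APIs.

```python
# Toy sketch of the FLARE loop: generate a partial sentence, search when
# it contains a low-confidence placeholder, splice the result in, and
# continue until generation ends. All components are stubs.

def stub_llm_step(question: str, step: int) -> str:
    drafts = ["Joe Biden attended [X]", "and later served as [X]"]
    return drafts[step] if step < len(drafts) else ""

def stub_search(query: str) -> str:
    facts = {"Joe Biden attended [X]": "the University of Delaware",
             "and later served as [X]": "Vice President"}
    return facts.get(query, "")

def flare_generate(question: str, max_steps: int = 5) -> str:
    answer = []
    for step in range(max_steps):
        draft = stub_llm_step(question, step)
        if not draft:          # generation finished
            break
        if "[X]" in draft:     # low-confidence span: ground it with search
            draft = draft.replace("[X]", stub_search(draft))
        answer.append(draft)
    return " ".join(answer)

result = flare_generate("Where did Joe Biden study?")
```

The real method detects low-confidence tokens from the model's probabilities rather than an explicit `[X]` marker, but the search-and-splice loop is the same.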

Optimizing Agents And Flows

LLM agents consist of multiple LLMs orchestrated to plan and execute complex tasks. These can be very useful for answering complex questions like "How much did sales grow for company X between Q1 of 2024 and Q2 of 2024?" This type of request potentially involves making multiple LLM calls, gathering multiple documents, and planning and executing these steps, as below:

LLM Agent Prototype | NVIDIA
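For the sales-growth question above, the plan-and-execute pattern can be sketched as follows. The plan, the tool, and the toy data are all hypothetical; in a real agent, an LLM would produce the plan and the tool calls would hit real data sources.

```python
# Hypothetical plan-and-execute sketch for the sales-growth question.
# An actual agent would have an LLM decompose the question into these
# tool calls; here the plan is hard-coded to show the shape of the flow.

FINANCIALS = {("X", "Q1-2024"): 100.0, ("X", "Q2-2024"): 125.0}  # toy data

def lookup_sales(company: str, quarter: str) -> float:
    # Tool stub: would query a document store or financial API.
    return FINANCIALS[(company, quarter)]

def answer_growth_question(company: str, q1: str, q2: str) -> str:
    # Plan: fetch each quarter's figure, then compute the growth rate.
    start, end = lookup_sales(company, q1), lookup_sales(company, q2)
    growth = (end - start) / start * 100
    return f"Sales for company {company} grew {growth:.1f}% between {q1} and {q2}."

reply = answer_growth_question("X", "Q1-2024", "Q2-2024")
```

The value of the agent framing is that the decomposition (which tools, in what order) is itself produced by an LLM rather than hand-written.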

In addition to agents, many new results show that chaining multiple components in unique ways can produce breakthrough performance. One study shows that three key steps:

  1. Asking key questions about a certain topic to identify information nuggets
  2. Simulating conversations to synthesize this information
  3. Drawing these pieces of information into an outline

lead to high-quality, Wikipedia-like articles. I think this is pretty amazing, as it could lead to more tailored, human-like content through the right mix of components.

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (STORM) Paper
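The three steps above can be sketched as a pipeline of chained components. Every function body is a stub loosely inspired by STORM, not the paper's implementation; in practice each step is one or more LLM calls.

```python
# The three steps (key questions -> simulated conversations -> outline)
# chained as a pipeline. All bodies are placeholder stubs.

def ask_questions(topic: str) -> list[str]:
    # Step 1: an LLM would generate probing questions to find info nuggets.
    return [f"What is {topic}?", f"Why does {topic} matter?"]

def simulate_conversation(question: str) -> str:
    # Step 2: an LLM would role-play a Q&A to synthesize information.
    return f"Discussion notes for: {question}"

def draft_outline(topic: str, notes: list[str]) -> str:
    # Step 3: an LLM would organize the notes into an article outline.
    bullets = "\n".join(f"- {n}" for n in notes)
    return f"# {topic}\n{bullets}"

def storm_like_pipeline(topic: str) -> str:
    questions = ask_questions(topic)
    notes = [simulate_conversation(q) for q in questions]
    return draft_outline(topic, notes)

outline = storm_like_pipeline("compound AI")
```

What matters here is the chaining itself: each stage's output is the next stage's input, and the design of that chain — not the weights of any single model — is what's being optimized.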

Optimizing Compound AI Systems?

As you can see, there are a lot of aspects to optimize in compound AI systems. If you thought optimizing RAG embeddings, chunking, retrieval, and the LLM was hard enough, add multiple more dimensions, each with its own set of 1–10 levers. So is it feasible to keep track of so many parameters?

Here's an idea: what if we treat these parameters like standard ML hyperparameters, tuned with something like scikit-learn's GridSearchCV? Let's even give this a name: AIsearchCV.
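A minimal sketch of what this could look like: enumerate configurations from a parameter grid and keep the best-scoring one, just as GridSearchCV does. Note that "AIsearchCV" is a name coined in this post, not an existing library, and `evaluate` is a stub standing in for running the pipeline on a dev set.

```python
from itertools import product

# Hypothetical "AIsearchCV": exhaustive search over compound-system
# configurations, mirroring scikit-learn's GridSearchCV for ML models.

param_grid = {
    "chunk_size": [200, 500],
    "embedding_model": ["small", "large"],
    "top_k": [3, 5],
}

def evaluate(config: dict) -> float:
    # Stub scorer: a real version would run the full RAG/agent pipeline
    # on a labeled dev set and measure answer quality.
    return (config["chunk_size"] == 500) + (config["top_k"] == 3) * 0.5

def ai_search_cv(grid: dict, score_fn=evaluate) -> dict:
    keys = list(grid)
    configs = [dict(zip(keys, values)) for values in product(*grid.values())]
    return max(configs, key=score_fn)

best = ai_search_cv(param_grid)
```

The grid grows multiplicatively with each new dimension, which is exactly the feasibility concern raised above; smarter search (random, Bayesian) would be the natural next step.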

Another way to visualize this: you want to compress any desired output into the minimal inputs necessary to reconstruct it, using the ingredients you have (here, all the aspects that make up compound AI systems). This way, you can reconstruct the text from minimal topic inputs, and the rest of the work is done automatically through LLM calls, back-and-forth conversations, etc.

Decomposing Ideal Outputs Into Compound AI Action Space | Skanda Vivek


Getting AI systems to do useful tasks in specific domains is where the real value of LLMs lies. However, this involves careful design of compound AI systems, which can be overwhelming given the amount of new work in this space and the steadily growing number of dimensions to optimize. It is becoming clear, though, that this effort is necessary, given the multiple advantages of designing around LLMs as opposed to training LLMs themselves.

I've offered a few seeds of ideas for future methods that tune compound AI system hyperparameters, similar to how we tune standard ML models through libraries like GridSearchCV. I'm excited to see how enterprises adopt LLMs in their specific domains, and for innovations in this area!

If you like this post, follow EMAlpha — where we dive into the intersections of AI, finance, and data.