The researcher developing a scaling law for compound AI systems

Matt Weinberger
Published in Vertex Ventures US · 3 min read · May 17, 2024

It seems intuitive, maybe even obvious: The more resources you allocate to a compound AI system, the better the performance. Right?

Well, maybe not. In a recent paper, researchers show that in a compound system (one where a query is posed to several different models, with the system aggregating their answers and returning the best one) performance rises and then falls as resources increase, with the turning point depending on the difficulty of the original query. In other words, there's a point past which making more API calls actually lowers performance instead of raising it.

On a new episode of Neural Notes, Sandeep Bhadra and Chase Roberts of Vertex Ventures US spoke with Lingjiao Chen, the lead author of the paper in question, to discuss the implications of this research. In short, Lingjiao says the team has laid the groundwork for a scaling law for compound AI systems, one that can predict their performance as they scale up. That, in turn, paves the way for better and more responsive AI agents in general.

You can watch the full episode of Neural Notes below, and don’t miss part one of our conversation with Lingjiao on his research into how ChatGPT’s answers have changed over time.

The genesis of this paper was the observation that when Google first released its Gemini chatbot, it ran the same prompt against the model 32 times, then aggregated the answers to deliver a result. That number struck the researchers as arbitrary, leading them to wonder whether there was a way for compound AI systems to calculate the ideal number of runs for any query, so as to maximize both accuracy and performance.
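As a minimal sketch of that pattern, repeated sampling looks something like the following. The fusion step here is a majority vote, which is an assumption on my part (the article only says the answers were aggregated), and `call_model` is a placeholder for a real API call:

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Placeholder for a single, stochastic LLM API call."""
    raise NotImplementedError("wire this up to your model provider")

def run_and_aggregate(prompt: str, n_calls: int = 32) -> str:
    """Pose the same prompt n_calls times and return the most common answer."""
    answers = [call_model(prompt) for _ in range(n_calls)]
    return Counter(answers).most_common(1)[0][0]
```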

Lingjiao describes a compound system as a "mixture of experts": a query to an LLM chatbot is passed to several different models, and the system assesses the answers and fuses them as needed to produce the best possible response. The approach hinges on the notion that different models are better or worse at answering different sorts of questions, so users get the best of all possible worlds.
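A compound system generalizes the earlier sketch from one model sampled many times to several distinct models. The version below is an illustration of the idea rather than the paper's implementation; the plurality vote stands in for whatever fusion step a production system would actually use:

```python
from collections import Counter
from typing import Callable, List

Model = Callable[[str], str]  # each "expert" maps a prompt to an answer

def compound_answer(prompt: str, models: List[Model]) -> str:
    """Fan the query out to every model, then fuse the candidate answers."""
    answers = [model(prompt) for model in models]
    # Plurality vote is one simple fusion rule; a real system might
    # instead score or rank the candidates with a judge model.
    return Counter(answers).most_common(1)[0][0]
```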

The drawback of this approach is that a complicated query puts more strain on a compound AI system: not only does it have to pose the question to all of the models involved; it also has to sift through all the answers and separate the right ones from the wrong ones. With all of those models returning answers, inaccurate data is also more likely to worm its way in, and those errors get compounded when the system synthesizes everything into a single response.

Ultimately, the more work the system has to do to assess the quality of the answers and synthesize a response optimized for accuracy, the bigger the hit to performance.
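A toy calculation makes the failure mode concrete. Assume, purely for illustration, that a single call answers an easy query correctly 70% of the time and a hard query only 40% of the time, and that the system takes a majority vote over n calls. Voting pushes accuracy up on easy queries and down on hard ones:

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(the majority of n independent calls is correct), each correct w.p. p.

    Uses odd n so there are no ties.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 15, 31):
    easy = majority_vote_accuracy(0.7, n)   # per-call accuracy above 50%: voting helps
    hard = majority_vote_accuracy(0.4, n)   # per-call accuracy below 50%: voting hurts
    mixed = 0.5 * easy + 0.5 * hard         # 50/50 mix of easy and hard queries
    print(f"n={n:2d}  easy={easy:.3f}  hard={hard:.3f}  mixed={mixed:.3f}")
```

On this hypothetical 50/50 workload, aggregate accuracy peaks at a moderate number of calls and then declines, which is exactly the rise-then-fall curve described above.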

What Lingjiao’s team is working toward now is an algorithm that can assess and approximate the difficulty of any query, so the system can decide how many runs would return the optimal result.

Like Google Gemini, most systems today use a predetermined, one-size-fits-all number of runs. Difficulty-aware routing would let a compound AI system do many runs for simple queries, where extra votes keep helping, while taking a more targeted approach for complex prompts, perhaps doing fewer runs against the models judged most appropriate for the task, as in the sketch below.
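In pseudocode terms, that routing logic might look like this; the difficulty estimator and the thresholds are hypothetical stand-ins, since the team's algorithm is still a work in progress:

```python
def estimate_difficulty(prompt: str) -> float:
    """Hypothetical scorer returning 0.0 (easy) to 1.0 (hard).

    Could be a lightweight classifier, or an agreement check
    over a handful of cheap pilot calls.
    """
    raise NotImplementedError

def choose_call_budget(prompt: str) -> int:
    """Pick how many model calls to spend on this query."""
    difficulty = estimate_difficulty(prompt)
    if difficulty < 0.3:
        return 31  # easy: extra votes keep helping
    if difficulty < 0.7:
        return 7   # medium: a modest budget
    return 3       # hard: more votes amplify errors; prefer stronger models instead
```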

This would make performance both more consistent and higher overall. As a bonus, Lingjiao suggests the approach could also make AI cheaper to operate: lowering the number of API calls lowers your bills with whichever service providers are hosting the models.
