Coffee Time Papers: Mixture of Agents Enhances Large Language Model Capabilities

Dagang Wei
6 min read · Jun 25, 2024

This blog post is part of the series Coffee Time Papers.

Paper

https://arxiv.org/abs/2406.04692

Overview

This paper introduces a novel Mixture-of-Agents (MoA) framework designed to harness the collective capabilities of multiple Large Language Models (LLMs). The authors observed a phenomenon they term the “collaborativeness of LLMs,” where LLMs tend to produce better responses when given access to outputs from other models. Building on this observation, MoA is designed as a layered architecture where each layer consists of multiple LLM agents. These agents iteratively refine the generated responses, with each agent using the outputs from the previous layer as additional information.

The MoA framework was evaluated on three benchmarks: AlpacaEval 2.0, MT-Bench, and FLASK. The results demonstrate that MoA achieves state-of-the-art performance, even surpassing GPT-4 Omni on AlpacaEval 2.0 using only open-source LLMs. The authors conducted further experiments to understand the internal mechanisms of MoA, including the impact of model diversity and the number of proposers. They also performed a budget and token analysis, showing that MoA can be cost-effective while maintaining high performance.

Overall, this paper presents a promising approach to improving LLM capabilities by leveraging the strengths of multiple models through a collaborative framework. The findings suggest that MoA can lead to significant improvements in response quality and cost-effectiveness compared to using a single LLM.

Architecture

Mixture-of-Agents (MoA) is a layered architecture where each layer has multiple Large Language Models (LLMs) acting as agents. These agents work together to refine the answer to a query over multiple iterations.

  1. Layer 1: In the first layer, each LLM agent independently generates a response to the initial query.
  2. Subsequent Layers: The responses from the first layer are then passed on to the next layer. In this layer, each agent uses the outputs from the previous layer as additional information to generate a more refined response.
  3. Iterative Refinement: This process continues for several layers, with each layer’s agents refining the responses from the previous layer.
  4. Final Output: The final output is the response generated by an LLM in the last layer, which is considered the most refined and comprehensive answer to the query.

Example

Let’s illustrate the MoA architecture with an example query and the corresponding prompts used at each layer.

User Prompt: “What are the benefits of exercise?”

Layer 1:

Each agent in the first layer receives the following prompt:

What are the benefits of exercise?

They independently generate responses based on their training and knowledge.

Layer 2:

The agents in the second layer receive a prompt that includes the responses from Layer 1 and an instruction to aggregate and synthesize the information:

You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:
1. [Model Response from Layer 1 Agent 1]
2. [Model Response from Layer 1 Agent 2]
3. [Model Response from Layer 1 Agent 3]
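Assembling this aggregation prompt is a simple string-templating step. The sketch below uses an abridged version of the instruction quoted above; the function name is illustrative, not from the paper's code.

```python
AGGREGATE_INSTRUCTION = (
    "You have been provided with a set of responses from various open-source "
    "models to the latest user query. Your task is to synthesize these "
    "responses into a single, high-quality response."
)


def aggregation_prompt(responses: list[str]) -> str:
    """Number the previous layer's responses under the synthesis instruction."""
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(responses, 1))
    return f"{AGGREGATE_INSTRUCTION}\n\nResponses from models:\n{numbered}"
```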

Layer 3 (Final Aggregator):

The final aggregator in Layer 3 receives a prompt similar to the one given to the Layer 2 agents, but with the refined responses from Layer 2 as its input. The final aggregator then produces the most refined and comprehensive answer to the query.

You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:
1. [Model Response from Layer 2 Agent 1]
2. [Model Response from Layer 2 Agent 2]
3. [Model Response from Layer 2 Agent 3]

This iterative process allows MoA to leverage the strengths of different LLMs and produce a final answer that is more comprehensive and accurate than any individual LLM could generate alone.

Q & A

Q: What is the Mixture-of-Agents (MoA) framework?

A: The Mixture-of-Agents (MoA) framework is a novel approach that leverages the collective strengths of multiple Large Language Models (LLMs) to enhance their natural language understanding and generation capabilities. It is designed as a layered architecture where each layer comprises multiple LLM agents that iteratively refine the generated responses. Each agent in a layer uses the outputs from the previous layer as additional information, leading to a more refined and comprehensive final response.

Q: What is the “collaborativeness of LLMs” phenomenon?

A: The “collaborativeness of LLMs” is a phenomenon observed by the authors where LLMs tend to generate better responses when provided with outputs from other models, even if those outputs are of lower quality. This suggests that LLMs can learn and improve from each other’s responses, leading to a collaborative improvement in overall performance.

Q: How was the MoA framework evaluated, and what were the results?

A: The MoA framework was evaluated on three benchmarks: AlpacaEval 2.0, MT-Bench, and FLASK. The results demonstrated that MoA achieved state-of-the-art performance on these benchmarks, even surpassing GPT-4 Omni on AlpacaEval 2.0 using only open-source LLMs. This highlights the effectiveness of MoA in leveraging the collective strengths of multiple LLMs to achieve superior performance.

Q: What are the key insights gained from the experiments conducted to understand the internal mechanisms of MoA?

A: The experiments revealed several key insights:

  • MoA significantly outperforms LLM rankers, indicating that the aggregator in MoA does more than simply select the best response from the proposers.
  • MoA tends to incorporate the best-proposed answers, as evidenced by a positive correlation between the win rate and similarity scores like BLEU.
  • Model diversity and the number of proposers positively impact the final output quality, suggesting that having more diverse LLM agents in each layer can improve performance.
  • Certain models specialize in specific roles within the MoA ecosystem, with some excelling as proposers and others as aggregators.

Q: What are the implications of the budget and token analysis?

A: The budget and token analysis revealed that MoA can be cost-effective while maintaining high performance. Specifically, the MoA-Lite variant, which uses fewer layers and a less computationally expensive aggregator, can match the performance of GPT-4o while being more cost-effective. This demonstrates the potential of MoA to provide a more affordable solution for achieving high-quality language generation.

Q: What are the limitations of the MoA framework?

A: One limitation of MoA is the potential for high Time to First Token (TTFT) due to the iterative aggregation process. This can negatively impact user experience, especially in real-time applications. Another limitation is the increased computational cost associated with using multiple LLMs compared to a single model.

Q: What are the broader impacts of this research?

A: This research has the potential to enhance the effectiveness of LLM-driven chat assistants, making AI more accessible and user-friendly. The improved interpretability of MoA, due to the use of natural language in intermediate outputs, can also facilitate better alignment with human reasoning. This could lead to more reliable and trustworthy AI systems in the future.
