Mixture Of Agents Using Groq

Plaban Nayak

Published in

The AI Forum

13 min read · Aug 10, 2024

What is an Agent?

An agent is an autonomous unit programmed to do the following:

Perform a task
Make Decisions
Collaborate on tasks

What is Mixture of Agents?

Mixture of Agents (MoA) is a novel approach that leverages multiple large language models (LLMs) to enhance their collective capabilities.

MoA uses a layered architecture where each layer consists of multiple LLM agents. Each agent takes the outputs from all agents in the previous layer as auxiliary information to generate its own response.

The layered structure allows for iterative refinement of the output. Intermediate results are passed between layers, enabling the agents to collaboratively enhance the final answer.

MoA models have been shown to outperform individual LLMs like GPT-4 on various benchmarks. For example, an open-source MoA achieved a score of 65.1% on AlpacaEval 2.0, compared to 57.5% for GPT-4 Omni.

The aggregator agent in MoA does not simply select one of the proposer agents’ outputs. It performs a more sophisticated aggregation over all proposed generations, which is key to its strong performance.

Current LLMs are constrained by their model size and training data, and scaling them further is costly and compute intensive.

The idea behind MoA is to harness the collective expertise of multiple LLMs to build a more capable and robust system. Diverse models, each with particular strengths and trained on different data for specific use cases, benefit from each other's responses, which helps on more complex and demanding tasks.

How is Mixture of Agents different from other multi-agent systems?

1. Architecture and Functionality

Mixture of Agents: MoA operates on a layered architecture where agents work in parallel and sequentially to refine outputs iteratively. It utilizes both proposers, which generate diverse responses, and aggregators, which synthesize these responses into a high-quality final output. This structure allows for sophisticated aggregation and collaboration among the agents, enhancing overall performance.
Traditional Multi-Agent Systems: These systems typically consist of agents that may operate independently or collaboratively but often lack the structured layering and iterative refinement seen in MoA. They may focus on task distribution or competition among agents rather than on collaborative output generation.

2. Collaboration and Output Quality

MoA: The framework emphasizes the collaborativeness of LLMs, where agents improve their responses based on outputs from other agents. This leads to higher quality responses, even when some inputs are of lower quality. The aggregation process in MoA is designed to incorporate the best proposed answers, resulting in superior outcomes.
Other Multi-Agent Systems: While they may facilitate collaboration, traditional systems often do not optimize for output quality in the same way. They may rely on simpler voting or selection mechanisms to determine the best response, which can limit the potential for enhanced output quality.
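To make the contrast concrete, here is a minimal sketch of the two strategies. The function names (`majority_vote`, `build_aggregation_prompt`) and the prompt wording are illustrative assumptions, not part of the Together MoA code. The point is only that voting returns one existing answer unchanged, while aggregation hands every answer to a model so it can synthesize a new one.

```python
from collections import Counter
from typing import List


def majority_vote(responses: List[str]) -> str:
    """Selection: return the most frequent response unchanged."""
    return Counter(responses).most_common(1)[0][0]


def build_aggregation_prompt(question: str, responses: List[str]) -> str:
    """Aggregation: put every response in front of an aggregator model.

    The wording below is a hypothetical example prompt.
    """
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(responses, start=1))
    return (
        "Synthesize the responses below into one improved answer.\n"
        f"Question: {question}\n"
        f"Responses:\n{numbered}"
    )


responses = ["Paris", "Paris, the capital of France", "Paris"]
print(majority_vote(responses))
print(build_aggregation_prompt("What is the capital of France?", responses))
```

With voting, the extra detail in the second response is lost; with aggregation, the aggregator model sees it and can use it.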

https://github.com/togethercomputer/MoA?tab=readme-ov-file#multi-layer-moa-example

In the mixture of agents architecture, multiple agents receive the same initial prompt and generate responses independently. Each response is then passed forward to the agents in the next layer for further refinement.

The second layer may or may not use the same model when refining the responses. This iterative refinement process continues for several cycles until obtaining a more robust and accurate response.

The most important part of the layered approach is the automatic selection of models for the next layer based on their performance. Model selection takes into account both diversity and performance, which compensates for the deficiencies of any single model and improves the overall quality of the responses.

For each MoA system there exist three main components:

Proposer (or Reference) models — these models propose responses to a user’s prompt.
Aggregators — these models take the proposed outputs, along with a system prompt and the user’s prompt to synthesize a response.
Layers — each MoA system has a number of layers that corresponds to the number of proposal stages and aggregation stages.

Collaborativeness in MoA is achieved through the components above:

Proposers are the agents that excel at generating useful responses to be used by other models. While a good proposer may not necessarily produce responses with high scores by itself, it should offer more context and diverse perspectives ultimately contributing to better final responses when used by an aggregator.

Aggregators are agents with specialized proficiencies that receive the context and perspective from the proposer agents and synthesize the information into high quality final responses.

How MoA Works

Several proposer agents independently generate initial responses to a given prompt.
Aggregator agents in the next layer synthesize the different proposer responses into higher-quality responses.
This iterative process continues through several layers until a more robust and comprehensive response is achieved.
Agents are categorized as proposers or aggregators based on their strengths in different aspects of collaboration.
Consistent and monotonic performance gains are achieved after each layer, with the same proposer agents and varying aggregators.

The basic idea is that each layer accepts the intermediate responses from the previous layer and passes its own set of responses on to the next layer, much like the layers of a deep neural network, but at a larger scale.
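The layered flow described above can be sketched in a few lines. This is a simplified illustration rather than the Together MoA implementation: the agents are stub functions that make no network calls, and the names `make_stub_agent` and `run_moa` are hypothetical.

```python
from typing import Callable, List

# An "agent" here is any callable: (prompt, references) -> response text.
Agent = Callable[[str, List[str]], str]


def make_stub_agent(name: str) -> Agent:
    """Build a hypothetical stand-in for an LLM call (no network access)."""

    def agent(prompt: str, references: List[str]) -> str:
        return f"{name} answered '{prompt}' using {len(references)} references"

    return agent


def run_moa(prompt: str, layers: List[List[Agent]], aggregator: Agent) -> str:
    """Pass the prompt through each proposer layer, then aggregate."""
    references: List[str] = []  # the first layer sees no prior responses
    for layer in layers:
        # Every agent in a layer sees all responses from the previous layer.
        references = [agent(prompt, references) for agent in layer]
    return aggregator(prompt, references)


proposers = [make_stub_agent(f"proposer-{i}") for i in range(3)]
final = run_moa("What is MoA?", [proposers, proposers], make_stub_agent("aggregator"))
print(final)
```

Replacing the stub with a function that calls a real model turns this into a working pipeline; the control flow stays the same.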

GROQ:

Groq’s LPU Inference Engine provides a cutting-edge solution for running large language models at scale, offering exceptional speed, efficiency, and cost-effectiveness. By leveraging Groq’s innovative architecture and compiler technology, developers can build high-performance AI applications that push the boundaries of what’s possible with LLMs.

Implementation Steps

Here we build a simple MoA using the same approach as Together MoA, but with models served by Groq for faster inference.

1. Clone the togethercomputer MoA repository

git clone https://github.com/togethercomputer/MoA.git

2. Set up Groq API Key

Go to the MoA folder and create a .env file:

GROQ_API_KEY="your groq api Key"

3. Install the required dependencies

pip install -r requirements.txt
pip install -U groq
pip install python-dotenv
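Before running the bot, it can help to confirm that the key is visible to Python. The sketch below is a small sanity check, assuming the key is stored under `GROQ_API_KEY` in a `.env` file in the current folder, as in step 2. `has_groq_key` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def has_groq_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if GROQ_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("GROQ_API_KEY", "").strip())


if __name__ == "__main__":
    try:
        # python-dotenv copies the entries of .env into os.environ.
        from dotenv import load_dotenv

        load_dotenv(dotenv_path=".env")
    except ImportError:
        pass  # fall back to variables already exported in the shell
    print("GROQ_API_KEY found" if has_groq_key() else "GROQ_API_KEY missing")
```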

4. Set up the proposer models and the aggregator model

We can select any number of models here (though cost of each run is directly related to how many models are in each of our layers).

proposal_models = ["llama3-8b-8192",
"mixtral-8x7b-32768",
"llama3-70b-8192",
"gemma2-9b-it",
]

aggregator_model = "llama3-70b-8192"
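As a rough guide to how cost grows with the number of models, the sketch below counts the requests made for one user prompt. It assumes the flow used in the bot.py script in this article, where every proposer model is queried once per round and the aggregator is queried once at the end. `calls_per_prompt` is a hypothetical helper, and the count ignores retries.

```python
def calls_per_prompt(num_proposers: int, rounds: int = 1) -> int:
    """Count LLM requests for one prompt: proposer calls plus one aggregator call."""
    if num_proposers < 1 or rounds < 1:
        raise ValueError("num_proposers and rounds must both be at least 1")
    return num_proposers * rounds + 1


proposal_models = [
    "llama3-8b-8192",
    "mixtral-8x7b-32768",
    "llama3-70b-8192",
    "gemma2-9b-it",
]
print(calls_per_prompt(len(proposal_models)))            # one round
print(calls_per_prompt(len(proposal_models), rounds=3))  # three rounds
```

Each call in later rounds also carries the previous round's responses in its prompt, so token usage grows faster than the call count alone suggests.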

To run inference with Groq models, we need to make changes in two scripts:

bot.py
utils.py

#########################bot.py####################
import datasets
from functools import partial
from loguru import logger
from utils import (
    generate_together_stream,
    generate_with_references,
    DEBUG,
)
import typer
from rich import print
from rich.console import Console
from rich.markdown import Markdown
from rich.prompt import Prompt
from datasets.utils.logging import disable_progress_bar
from time import sleep

disable_progress_bar()

console = Console()

welcome_message = """
# Welcome to the Together AI MoA (Mixture-of-Agents) interactive demo!

Mixture of Agents (MoA) is a novel approach that leverages the collective strengths of multiple LLMs to enhance performance, achieving state-of-the-art results. By employing a layered architecture where each layer comprises several LLM agents, MoA significantly outperforms GPT-4 Omni’s 57.5% on AlpacaEval 2.0 with a score of 65.1%, using only open-source models!

This demo uses the following LLMs as reference models, then passes the results to the aggregate model for the final response:
- llama3-8b-8192
- mixtral-8x7b-32768
- llama3-70b-8192
- gemma2-9b-it

"""
#### Place your Groq proposer models here
default_reference_models = [
    "llama3-8b-8192",
    "mixtral-8x7b-32768",
    "llama3-70b-8192",
    "gemma2-9b-it",
]


def process_fn(
    item,
    temperature=0.7,
    max_tokens=2048,
):
    """
    Processes a single item (e.g., a conversational turn) using specified model parameters to generate a response.

    Args:
        item (dict): A dictionary containing details about the conversational turn. It should include:
                     - 'references': a list of reference responses that the model may use for context.
                     - 'model': the identifier of the model to use for generating the response.
                     - 'instruction': the user's input or prompt for which the response is to be generated.
        temperature (float): Controls the randomness and creativity of the generated response. A higher temperature
                             results in more varied outputs. Default is 0.7.
        max_tokens (int): The maximum number of tokens to generate. This restricts the length of the model's response.
                          Default is 2048.

    Returns:
        dict: A dictionary containing the 'output' key with the generated response as its value.
    """

    references = item.get("references", [])
    model = item["model"]
    messages = item["instruction"]

    output = generate_with_references(
        model=model,
        messages=messages,
        references=references,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    if DEBUG:
        logger.info(
            f"model: {model}, instruction: {item['instruction']}, output: {output[:20]}"
        )

    print(f"\nFinished querying [bold]{model}.[/bold]")

    return {"output": output}

### Specify your Groq aggregator model here
def main(
    model: str = "llama3-70b-8192",
    reference_models: list[str] = default_reference_models,
    temperature: float = 0.7,
    max_tokens: int = 2048,
    rounds: int = 1,
    multi_turn=True,
):
    """
    Runs a continuous conversation between user and MoA.

    Args:
    - model (str): The primary model identifier used for generating the final response. This model aggregates the outputs from the reference models to produce the final response.
    - reference_models (List[str]): A list of model identifiers that are used as references in the initial rounds of generation. These models provide diverse perspectives and are aggregated by the primary model.
    - temperature (float): A parameter controlling the randomness of the response generation. Higher values result in more varied outputs. The default value is 0.7.
    - max_tokens (int): The maximum number of tokens that can be generated in the response. This limits the length of the output from each model per turn. Default is 2048.
    - rounds (int): The number of processing rounds to refine the responses. In each round, the input is processed through the reference models, and their outputs are aggregated. Default is 1.
    - multi_turn (bool): Enables multi-turn interaction, allowing the conversation to build context over multiple exchanges. When True, the system maintains context and builds upon previous interactions. Default is True. When False, the system generates responses independently for each input.
    """
    md = Markdown(welcome_message)
    console.print(md)
    sleep(0.75)
    console.print(
        "\n[bold]To use this demo, answer the questions below to get started [cyan](press enter to use the defaults)[/cyan][/bold]:"
    )

    data = {
        "instruction": [[] for _ in range(len(reference_models))],
        "references": [""] * len(reference_models),
        "model": [m for m in reference_models],
    }

    num_proc = len(reference_models)

    model = Prompt.ask(
        "\n1. What main model do you want to use?",
        default="llama3-70b-8192",
    )
    console.print(f"Selected {model}.", style="yellow italic")
    temperature = float(
        Prompt.ask(
            "2. What temperature do you want to use? [cyan bold](0.7) [/cyan bold]",
            default=0.7,
            show_default=True,
        )
    )
    console.print(f"Selected {temperature}.", style="yellow italic")
    max_tokens = int(
        Prompt.ask(
            "3. What max tokens do you want to use? [cyan bold](512) [/cyan bold]",
            default=512,
            show_default=True,
        )
    )
    console.print(f"Selected {max_tokens}.", style="yellow italic")

    while True:

        try:
            instruction = Prompt.ask(
                "\n[cyan bold]Prompt >>[/cyan bold] ",
                default="Top things to do in NYC",
                show_default=True,
            )
        except EOFError:
            break

        if instruction == "exit" or instruction == "quit":
            print("Goodbye!")
            break
        if multi_turn:
            for i in range(len(reference_models)):
                data["instruction"][i].append({"role": "user", "content": instruction})
                data["references"] = [""] * len(reference_models)
        else:
            data = {
                "instruction": [[{"role": "user", "content": instruction}]]
                * len(reference_models),
                "references": [""] * len(reference_models),
                "model": [m for m in reference_models],
            }

        eval_set = datasets.Dataset.from_dict(data)

        with console.status("[bold green]Querying all the models...") as status:
            for i_round in range(rounds):
                eval_set = eval_set.map(
                    partial(
                        process_fn,
                        temperature=temperature,
                        max_tokens=max_tokens,
                    ),
                    batched=False,
                    num_proc=num_proc,
                )
                references = [item["output"] for item in eval_set]
                data["references"] = references
                eval_set = datasets.Dataset.from_dict(data)

        console.print(
            "[cyan bold]Aggregating results & querying the aggregate model...[/cyan bold]"
        )
        output = generate_with_references(
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            messages=data["instruction"][0],
            references=references,
            generate_fn=generate_together_stream,
        )

        all_output = ""
        print("\n")
        console.log(Markdown(f"## Final answer from {model}"))

        for chunk in output:
            # Read the streamed delta first; it can be None (e.g. on the final chunk)
            out = chunk.choices[0].delta.content
            if out is not None:
                console.print(out, end="")
                all_output += out
        print()

        if DEBUG:
            logger.info(
                f"model: {model}, instruction: {data['instruction'][0]}, output: {all_output[:20]}"
            )
        if multi_turn:
            for i in range(len(reference_models)):
                data["instruction"][i].append(
                    {"role": "assistant", "content": all_output}
                )


if __name__ == "__main__":
    typer.run(main)

utils.py

Change all occurrences of together.xyz endpoints to Groq API endpoints.
Change all occurrences of TOGETHER_API_KEY to GROQ_API_KEY.

import os
import json
import time
import requests
import openai
import copy

from loguru import logger
### Load the GROQ API KEY
from dotenv import load_dotenv
load_dotenv()


DEBUG = int(os.environ.get("DEBUG", "0"))


def generate_together(
    model,
    messages,
    max_tokens=2048,
    temperature=0.7,
    streaming=False,
):

    output = None

    for sleep_time in [1, 2, 4, 8, 16, 32]:

        try:
            # change to the Groq endpoint
            endpoint = "https://api.groq.com/openai/v1/chat/completions"

            if DEBUG:
                logger.debug(
                    f"Sending messages ({len(messages)}) (last message: `{messages[-1]['content'][:20]}...`) to `{model}`."
                )

            res = requests.post(
                endpoint,
                json={
                    "model": model,
                    "max_tokens": max_tokens,
                    "temperature": (temperature if temperature > 1e-4 else 0),
                    "messages": messages,
                },
                headers={
                    "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY')}",
                },
            )
            if "error" in res.json():
                logger.error(res.json())
                if res.json()["error"]["type"] == "invalid_request_error":
                    logger.info("Input + output is longer than max_position_id.")
                    return None

            output = res.json()["choices"][0]["message"]["content"]

            break

        except Exception as e:
            logger.error(e)
            if DEBUG:
                logger.debug(f"Msgs: `{messages}`")

            logger.info(f"Retry in {sleep_time}s..")
            time.sleep(sleep_time)

    if output is None:

        return output

    output = output.strip()

    if DEBUG:
        logger.debug(f"Output: `{output[:20]}...`.")

    return output


def generate_together_stream(
    model,
    messages,
    max_tokens=2048,
    temperature=0.7,
):
    # Groq exposes an OpenAI-compatible API, so the OpenAI client only needs a different base URL
    endpoint = "https://api.groq.com/openai/v1"
    client = openai.OpenAI(
        api_key=os.environ.get("GROQ_API_KEY"), base_url=endpoint
    )
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature if temperature > 1e-4 else 0,
        max_tokens=max_tokens,
        stream=True,  # this time, we set stream=True
    )

    return response


def generate_openai(
    model,
    messages,
    max_tokens=2048,
    temperature=0.7,
):

    client = openai.OpenAI(
        api_key=os.environ.get("OPENAI_API_KEY"),
    )

    for sleep_time in [1, 2, 4, 8, 16, 32]:
        try:

            if DEBUG:
                logger.debug(
                    f"Sending messages ({len(messages)}) (last message: `{messages[-1]['content'][:20]}`) to `{model}`."
                )

            completion = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens,
            )
            output = completion.choices[0].message.content
            break

        except Exception as e:
            logger.error(e)
            logger.info(f"Retry in {sleep_time}s..")
            time.sleep(sleep_time)

    output = output.strip()

    return output


def inject_references_to_messages(
    messages,
    references,
):

    messages = copy.deepcopy(messages)

    system = f"""You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:"""

    for i, reference in enumerate(references):

        system += f"\n{i+1}. {reference}"

    if messages[0]["role"] == "system":

        messages[0]["content"] += "\n\n" + system

    else:

        messages = [{"role": "system", "content": system}] + messages

    return messages


def generate_with_references(
    model,
    messages,
    references=[],
    max_tokens=2048,
    temperature=0.7,
    generate_fn=generate_together,
):

    if len(references) > 0:

        messages = inject_references_to_messages(messages, references)

    return generate_fn(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
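The retry loop that appears in both `generate_together` and `generate_openai` could be factored into one helper. The following is a sketch of that idea, not code from the Together repository: `call_with_backoff` is a hypothetical name, and the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    delays: Sequence[float] = (1, 2, 4, 8, 16, 32),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn until it succeeds, waiting delays[i] seconds after the i-th failure.

    Returns None if every attempt raises.
    """
    for delay in delays:
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt failed ({exc}); waiting {delay}s")
            sleep(delay)
    return None


# Usage: a function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Returning `None` when all attempts fail gives the caller one case to check before using the result, for example before calling `.strip()` on it.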

5. Run the following command in the command prompt

python bot.py

Final response generated by the aggregator model

Final answer from llama3-70b-8192
A Mixture of Agents (MoA) is a concept that refers to a collection of autonomous entities or agents that work together to achieve a common goal or solve a complex problem. These agents can be heterogeneous, meaning they have different capabilities, behaviors, and roles, which allows them to bring unique strengths and perspectives to the table.

In various domains, a Mixture of Agents can take different forms:

1. **Multi-Agent Systems**: In artificial intelligence and computer science, a Mixture of Agents refers to a system composed of multiple autonomous agents that interact and collaborate to achieve a common objective. These agents can be software-based, hardware-based, or a combination of both.
2. **Tourism and Travel**: In the context of tourism, a Mixture of Agents refers to a business model where a travel agency or tour operator combines the services of different types of travel agents or intermediaries to offer a comprehensive travel solution to customers. This can include traditional travel agencies, online travel agencies, tour operators, and concierge services.
3. **Tourism Industry**: In the broader tourism industry, a Mixture of Agents refers to the diverse range of stakeholders and entities that come together to create and deliver tourism experiences to visitors. These agents can include tour operators, travel agencies, accommodation providers, transportation providers, activity providers, destination management organizations, local communities, government agencies, and tour guides.

The benefits of a Mixture of Agents include:

* **Increased efficiency**: By leveraging the strengths of different agents, a Mixture of Agents can streamline processes and reduce errors.
* **Improved customer service**: A Mixture of Agents can offer personalized service and attention to detail, thanks to the combination of human expertise and technology.
* **Competitive pricing**: By comparing prices across multiple sources, a Mixture of Agents can offer competitive pricing and value to customers.
* **Broader range of options**: A Mixture of Agents can offer customers a wider range of options, including bespoke itineraries, off-the-beaten-path destinations, and unique experiences.

However, a Mixture of Agents also presents some challenges, such as:

* **Integration**: Coordinating the services of different agents and ensuring seamless integration can be complex and time-consuming.
* **Communication**: Effective communication between agents and customers is crucial to ensure that plans are executed correctly and customer expectations are met.
* **Cost**: A Mixture of Agents may require significant investment in technology, training, and staff to manage the different components of the business.

Overall, a Mixture of Agents offers a powerful approach to solving complex problems and delivering comprehensive solutions by leveraging the strengths of diverse autonomous entities.

Conclusion

In summary, Mixture of Agents (MoA) is a novel approach that leverages the collective intelligence of multiple LLMs to push the boundaries of language model capabilities, achieving state-of-the-art results on benchmarks while using only open-source models.

References

GroqCloud: https://console.groq.com

GitHub, togethercomputer/MoA: Together Mixture-Of-Agents (MoA), 65.1% on AlpacaEval with OSS models: https://github.com/togethercomputer/MoA

Mixture-of-Agents Enhances Large Language Model Capabilities: https://www.researchgate.net/publication/381294672_Mixture-of-Agents_Enhances_Large_Language_Model_Capabilities


Note: In no way do I claim that the concepts explained here are proprietary. This article was written by referring to content published online.