Can 7B Models Now Master AI Agents? A Look at Kwai’s Recent LLM Open-Source Release

Larry Pan
5 min read · Dec 28, 2023


Kwai has recently open-sourced KwaiAgents, an impressive system that, when asked about planning a ski trip for the weekend, not only finds a venue but also takes the day’s weather into account — talk about thoroughness!

It’s common knowledge that Large Language Models (LLMs) acquire vast amounts of knowledge through language modeling and possess a degree of cognitive and reasoning ability. Yet even the latest and greatest, like GPT-4, can produce confident-sounding gibberish when used in isolation, because the model cannot interact with the world in real time. AI Agents are one path to solving this: they leverage a large model’s capabilities for task planning, reflection, and tool use, improving the accuracy of generated content with the aid of real-world tools and even tackling complex problems. Kwai, in collaboration with Harbin Institute of Technology, has developed KwaiAgents, elevating the agent capabilities of “smaller” big models like 7B/13B to surpass the performance of GPT-3.5. And they’ve made everything open source: systems, models, data, and benchmarks!

From the KwaiAgents GitHub page, the open-source content includes:

  1. System (KAgentSys-Lite): A lightweight AI Agents system equipped with factual and time-aware toolsets.
  2. Models (KAgentLMs): A series of large models with generalized agent abilities after Meta-Agent Tuning (MAT), along with their training data (partially human-edited).
  3. Evaluation (KAgentBench): A plug-and-play benchmark for automated evaluation of agent capabilities and manual evaluation results.

System

KAgentSys is an automated system based on a large model cognitive core, supplemented with memory mechanisms and a tool library. It includes:

  1. Memory Mechanism: Maintains memories of knowledge, conversations, and task history; a hybrid retrieval framework combining vector and keyword search surfaces the information the system needs at each planning step.
  2. Tool Library: Includes a factuality-enhancing toolset with heterogeneous search and browsing mechanisms that aggregate knowledge from web pages, online encyclopedias, and video encyclopedias, as well as a time-aware toolset covering calendars, holidays, time-zone differences, and weather.
  3. Agent Loop: In a round of conversation, the user poses a question, optionally supplying files as external knowledge and a persona setting. The system updates and retrieves memory, calls the large model for task planning, and either invokes tools as needed or moves to the conclusion stage, where the model synthesizes the accumulated history into the expected response (see the sketch after this list).
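
To make the loop concrete, here is a minimal Python sketch of a KAgentSys-style round: memory update and retrieval, a planning step, tool calls, and a concluding answer. The names (`HybridMemory`, `plan_next_step`, the stub tools) are illustrative assumptions, not the actual KwaiAgents API.

```python
from dataclasses import dataclass, field

@dataclass
class HybridMemory:
    """Holds knowledge, conversation, and task-history records. The real
    system retrieves with a mix of vector and keyword search; simple
    keyword overlap stands in for that here."""
    records: list = field(default_factory=list)

    def add(self, item: str) -> None:
        self.records.append(item)

    def retrieve(self, query: str, k: int = 3) -> list:
        q = set(query.lower().split())
        return sorted(self.records,
                      key=lambda r: len(q & set(r.lower().split())),
                      reverse=True)[:k]

# Stub tools standing in for the factuality- and time-aware toolsets.
TOOLS = {
    "weather": lambda arg: f"(stub) weekend forecast for {arg}: light snow",
    "web_search": lambda arg: f"(stub) top results for '{arg}'",
}

def plan_next_step(question: str, context: list) -> dict:
    # Stub planner: the real system prompts the LLM with the question,
    # retrieved memory, and tool descriptions, then parses its decision.
    if "weather" in question.lower() and not any("forecast" in c for c in context):
        return {"action": "weather", "input": "Harbin"}
    return {"action": "conclude",
            "answer": f"Synthesized from: {context}"}

def agent_loop(question: str, memory: HybridMemory, max_steps: int = 5) -> str:
    memory.add(question)
    for _ in range(max_steps):
        context = memory.retrieve(question)
        plan = plan_next_step(question, context)
        if plan["action"] == "conclude":
            return plan["answer"]
        memory.add(TOOLS[plan["action"]](plan["input"]))
    return "Step limit reached without a conclusion."

print(agent_loop("What's the weather for a ski trip this weekend?", HybridMemory()))
```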

Model

To avoid overfitting to a single prompt template during training, the team proposes a Meta-Agent Tuning (MAT) method, which improves the generality and effectiveness of large models’ agent capabilities by introducing a variety of agent prompt templates into the training data.

MAT consists of two phases:

(1) Template Generation Phase: Meta-Agents are designed to generate instantiated agent prompt templates for specific problem sets; high-quality templates are then selected by comparing their results against open-source templates using LLM scoring.

(2) Instruction Fine-Tuning Phase: From tens of thousands of templates, over 200,000 agent instruction-tuning examples were constructed and partially human-edited. The team has tuned popular open-source models such as Qwen-7B and Baichuan2-13B for public use and will continue to release others.
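
As an illustration of why template diversity matters, the sketch below renders one query under two invented agent prompt formats, the way MAT-style data construction exposes the model to many templates rather than one. The template wording here is made up for the example and is not taken from the KwaiAgents training data.

```python
# Two invented agent prompt formats; MAT's actual templates are generated
# by Meta-Agents and quality-filtered, which this sketch does not do.
REACT_STYLE = (
    "Answer the question using the available tools.\n"
    "Tools: {tools}\n"
    "Question: {query}\n"
    "Respond in Thought / Action / Observation format."
)
JSON_STYLE = (
    "You are an agent with tools: {tools}.\n"
    'Reply in JSON: {{"thought": "...", "tool": "...", "args": "..."}}\n'
    "User query: {query}"
)

def instantiate(templates, query, tools):
    """Render one query under many prompt formats so a tuned model does
    not overfit to a single template."""
    return [t.format(tools=", ".join(tools), query=query) for t in templates]

for prompt in instantiate([REACT_STYLE, JSON_STYLE],
                          "How old is Andy Lau this year?",
                          ["web_search", "calendar", "time_delta"]):
    print(prompt, end="\n---\n")
```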

Evaluation

KAgentBench lets us evaluate a large model’s agent capabilities across different templates with a single command, thanks to thousands of manually annotated data points.
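
The actual benchmark ships its own data format and scoring script (see the KwaiAgents repository). Purely as a sketch of the idea, an automated pass over annotated examples might look like the following, with exact-match tool selection standing in for KAgentBench’s richer metrics on planning, tool use, reflection, and concluding.

```python
import json

def evaluate(pred_path: str, gold_path: str) -> float:
    """Compare predicted tool choices against gold annotations, one JSON
    object per line. A stand-in metric, not KAgentBench's real scoring."""
    with open(pred_path) as f:
        preds = {row["id"]: row for row in map(json.loads, f)}
    with open(gold_path) as f:
        golds = {row["id"]: row for row in map(json.loads, f)}
    hits = sum(1 for i, g in golds.items()
               if i in preds and preds[i]["tool"] == g["tool"])
    return hits / len(golds) if golds else 0.0
```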

The following table shows the improvement in various capabilities of 7B-13B models after MAT, surpassing the performance of GPT-3.5.

Additionally, human annotators assessed 200 factual and time-sensitive questions (e.g., “How old is Andy Lau this year?”) across different large models and agent systems, showing significant improvements for the KAgentSys system and the post-MAT models (the percentage indicates the accuracy rate; the number in parentheses is the average score on a 5-point scale).

Traditional search engines exhibit limitations when faced with long-tail or trending questions. For instance, as depicted in the following figure, when querying the age difference between Antonela and Messi, two issues arise: (1) the trendiness of “Messi and his wife” skews search results towards news articles that tempt user engagement with irrelevant content, such as relationship timelines; and (2) the question hinges on their respective birthdates, a detail not widely sought, making it a “long-tail” query. Large language models alone struggle here: they might recall Messi’s birthdate but not Antonela’s. Combining an LLM with a search engine also falls short, as the surfaced information, while related, fails to precisely address the inquiry. KwaiAgents overcomes these hurdles by incorporating entity linking and extracting relevant details from resources like Wikipedia: it first retrieves Messi’s and his wife’s birthdates, then accurately computes the difference using the time_delta tool, delivering the correct answer to the posed question.
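
The last step of that example is ordinary date arithmetic. Below is a sketch of what the computation looks like once entity linking has surfaced both birthdates; `time_delta` here is a stand-in for the KwaiAgents tool of the same name, and the dates are the publicly reported birthdates.

```python
from datetime import date

def time_delta(d1: date, d2: date) -> int:
    """Absolute difference between two dates, in whole days."""
    return abs((d2 - d1).days)

messi = date(1987, 6, 24)      # Lionel Messi's birthdate
antonela = date(1988, 2, 26)   # Antonela Roccuzzo's birthdate

days = time_delta(messi, antonela)
print(f"Antonela is {days} days (about {days / 365.25:.1f} years) younger than Messi.")
```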

The field of AI Agents is a promising path forward, and we hope more projects like KwaiAgents will keep bringing fresh vitality to the community.
