Learning to Teach for Collective Intelligence

MIT-IBM Watson AI Lab
Feb 12, 2019


Many human activities require collaboration, including soccer, medical treatment, and the performing arts. The high-level goal of our research is to develop intelligent automated agents that can perform these tasks as well as humans. In general, such tasks can be formulated as multiagent reinforcement learning (MARL) problems involving sequential decision making. In cooperative settings, where agents work together as a team in a shared environment, each agent receives a local observation and a shared team reward at every time step. Without knowing the underlying dynamics of the environment, the goal of MARL is to learn a behavior policy for each agent through trial and error, such that the cumulative team reward is maximized. Motivated by human social learning in various activities, our team of researchers from IBM, MIT, and Northeastern University studied the problem of how artificial agents (robots) can effectively pass on their knowledge and learn from their peers to solve tasks requiring cooperation.
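
To make this setting concrete, here is a minimal sketch of the cooperative MARL interaction loop described above. The environment interface and the agents' act/update methods are hypothetical placeholders, not code from our system.

```python
def run_episode(env, agents, max_steps=100):
    """Cooperative MARL loop: each agent acts on its own local observation,
    and all agents receive the same shared team reward at every time step."""
    observations = env.reset()  # one local observation per agent
    total_team_reward = 0.0
    for _ in range(max_steps):
        # Decentralized execution: each agent sees only its own observation.
        actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
        next_observations, team_reward, done = env.step(actions)
        total_team_reward += team_reward
        # Trial-and-error learning: each agent updates its policy from its own
        # transition and the shared team reward, without seeing teammates' observations.
        for agent, obs, act, nxt in zip(agents, observations, actions, next_observations):
            agent.update(obs, act, team_reward, nxt)
        observations = next_observations
        if done:
            break
    return total_team_reward
```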

MARL is a long-standing research area in AI with many open issues, including learning from partially observable data (due to communication constraints, each agent must learn from its own local stream of observations), delayed credit assignment (an agent cannot learn until feedback is provided), and the changing behavior of agents (because all agents learn while interacting with the environment, each agent's surroundings appear non-stationary). In addition to inheriting these challenges from MARL, the problem of “learning to advise” has its own unique challenges. First, agents need to learn when and what to teach. Second, each agent has to learn on its own while respecting constraints, such as privacy, that prohibit sharing everything. Third, agents must accurately estimate the impact of each piece of advice on teammates’ learning progress (the teacher reward). Because of these difficulties and the high computational complexity, the problem of “learning to advise” for multiagent systems is largely unexplored in the literature.

The problem of advising, or “teaching,” agents to improve learning has been investigated previously, but these methods are confined to single-agent settings, where a student executes actions suggested by a teacher who is typically an expert always advising the optimal action. In many real-world problems, it is unlikely that agents can teach each other with perfect knowledge. Yet, due to the distributed nature of multiagent systems, each agent may learn different skills and knowledge that could accelerate team-wide learning. Meanwhile, existing work on peer-to-peer action advising in MARL uses simple heuristic rules to set the parameters that determine both the student policy and the teacher policy. Such an approach cannot guarantee optimal policies.

To improve the effectiveness of the entire learning process, we formulated the “learning to teach” problem as a higher-level MARL problem. Specifically, we developed an algorithm, termed LeCTR (Learning to Coordinate and Teach Reinforcement), by which an agent learns to assume the role of a student, a teacher, or even both simultaneously. The resulting advising policies are optimized along with the task-level policies by alternating between the following two steps:

In step one, each agent plays both the student and teacher roles, exchanging advice over multiple time steps. As an example, consider two agents i and j playing a cooperative room-navigation game together (Figure 1). At every time step, agent i (the student) first checks its knowledge given its local observation of the game state. Assuming agent i has a finite number of action choices (e.g., left/right/up/down in room navigation), its knowledge is encoded by a value function measuring how good each action is given that observation. Based on this knowledge and its observation, agent i’s student policy decides whether or not to ask agent j for advice. If the decision is no, agent i follows its own task policy (taking the action with the highest value, or occasionally a random action for exploration). Otherwise, agent i asks agent j for advice.

After receiving the advice request, agent j (the teacher) first checks its own knowledge about agent i’s situation, and agent j’s advising policy decides what advice to give: either an action from agent i’s action set, or a special “no advice” action. After agent i executes the action (from its own policy or as suggested by agent j), its task-level policy is updated. At the same time, agent j also decides whether to ask agent i for advice by following the same protocol. The process continues until either the agents reach their goals or the maximum number of time steps is reached. We repeat this process multiple times and record the teaching data: the students’ decisions and the teachers’ advice, along with learning progress measured by the rate of score improvement.

Figure 1. In this example, agent i uses its student policy to request help, and agent j advises action ãj, which the student executes instead of its originally intended action ai. By learning to transform the local knowledge captured in task-level policies into action advice, the agents can help one another learn.
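
The exchange in step one can be sketched roughly as follows. The method names (q_values, student_policy.ask_for_advice, advising_policy.advise, own_action) are illustrative stand-ins for the components described above, not our actual implementation.

```python
NO_ADVICE = None  # stands in for the special "no advice" action

def advising_exchange(agent_i, agent_j, obs_i, teaching_log):
    """One student-teacher exchange, with agent i as student and agent j as teacher."""
    # Student side: agent i summarizes its knowledge as action values
    # computed from its current local observation.
    q_i = agent_i.q_values(obs_i)

    # The student policy decides whether to ask agent j for advice.
    if agent_i.student_policy.ask_for_advice(obs_i, q_i):
        # Teacher side: agent j checks its own knowledge of agent i's situation,
        # and its advising policy picks an action from agent i's action set
        # or the special no-advice action.
        q_j = agent_j.q_values(obs_i)
        advice = agent_j.advising_policy.advise(obs_i, q_i, q_j)
    else:
        advice = NO_ADVICE

    # Execute the advised action if one was given; otherwise follow agent i's
    # own policy (highest-value action, with occasional random exploration).
    action = advice if advice is not NO_ADVICE else agent_i.own_action(q_i)

    # Record the exchange so step two can improve the advising policies.
    teaching_log.append((obs_i, q_i, advice, action))
    return action
```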

In step two, a deep reinforcement learning technique is applied to update the advising policies (which decide when to ask for advice and what advice to give), using the teaching data recorded in step one.
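
As a rough illustration, the sketch below assumes the advising policies are updated with a simple policy-gradient-style rule in which the reward for an advising decision is the student's measured learning progress. The helper names (learning_progress, log_prob_ask, log_prob_advice) and the exact objective are assumptions made for illustration, not the precise update used in LeCTR.

```python
import torch

def update_advising_policies(student_policy, advising_policy, teaching_log,
                             learning_progress, optimizer):
    """One policy-gradient-style update of the advising-level policies."""
    loss = torch.tensor(0.0)
    for obs_i, q_i, advice, _action in teaching_log:
        # Teacher reward: how much the student's task-level performance improved
        # as a consequence of this advising exchange.
        reward = learning_progress(obs_i, advice)

        # Reinforce the student's decision to ask (or not ask) for advice.
        log_prob = student_policy.log_prob_ask(obs_i, q_i, asked=advice is not None)
        # If advice was given, also reinforce the teacher's choice of advice.
        if advice is not None:
            log_prob = log_prob + advising_policy.log_prob_advice(obs_i, q_i, advice)

        loss = loss - reward * log_prob

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```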

LeCTR is the first algorithm for learning to teach in multiagent environments. We demonstrated its effectiveness on several simple benchmark problems, where LeCTR improves learning speed by more than half compared with other methods. We also showed that LeCTR allows agents to teach each other even when they have different action sets, which is potentially beneficial for coordination among different types of agents, such as air vehicles and ground vehicles in search-and-rescue situations. Our future work involves applying LeCTR to more challenging domains with more agents and high-dimensional state spaces, such as RoboCup soccer and video games.

This work was presented in a paper titled “Learning to Teach in Cooperative Multiagent Reinforcement Learning” (authors: Shayegan Omidshafiei, Dong-Ki Kim, Miao Liu, Gerald Tesauro, Matthew Riemer, Christopher Amato, Murray Campbell, and Jonathan P. How), which won an honorable mention for best student paper at the 2019 AAAI Conference on Artificial Intelligence in Hawaii, Jan. 27 to Feb. 1.

Authored by Miao Liu and Gerald Tesauro (IBM Research)
