A tutorial on MADDPG

Original article https://arxiv.org/abs/1911.10635

--

In recent years, there have been several new efforts to address Reinforcement Learning (RL) problems. If we consider real-life scenarios, it is easy to see why many of these approaches have focused on frameworks that include more than one agent: a robot, for example, needs to consider not only its environment but also the other robots or humans present in that same environment. Algorithms for this setting are referred to as Multi-Agent Reinforcement Learning (MARL); for a survey of MARL, we refer to [1]. In this article, we review one of these algorithms, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), developed by Lowe et al. [2].

The problem with MARL

The majority of recent successes in Reinforcement Learning have been limited to the single-agent scenario, where one agent takes a series of sequential actions that affect the environment. The environment returns a new state (determined by a transition probability) and an associated reward to the agent. The environment is generally modeled as a Markov Decision Process (MDP), and solving it amounts to finding a policy that maps the state space to a distribution over the action space so that the cumulative reward is maximized. In this context, one major assumption is that the environment is stationary. This assumption cannot be extended to multi-agent frameworks because the state transitions and the reward received by each agent are affected by the actions of the other agents. Furthermore, each agent needs to optimize its own cumulative reward, which is a function of the other agents' policies [1]. Accordingly, a MARL problem is generally addressed as a generalization of an MDP called a Markov game (MG).
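Concretely, a Markov game for N agents can be described by a set of states S, a set of actions A_1, …, A_N and a set of observations O_1, …, O_N (one per agent), a state transition function, and one reward function per agent:

$$\mathcal{T}: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \Delta(\mathcal{S}), \qquad r_i: \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \mathbb{R}, \qquad J_i = \mathbb{E}\Big[\sum_{t} \gamma^{t} r_i^{t}\Big],$$

where each agent i selects its actions through a policy conditioned on its own observation o_i and seeks to maximize its own expected discounted return J_i.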

Models for single- and multi-agent reinforcement learning [1]

Traditional RL approaches such as Q-learning or policy gradient cannot be easily extended to multi-agent scenarios. Regarding Q-learning, a non-stationary environment undermines the use of a replay buffer and harms learning stability, since the policy previously optimized for one instance of the environment is no longer useful once the environment has changed. Regarding policy gradient, multi-agent scenarios produce gradient estimates with very high variance, especially when coordination between agents is required.

Deep Deterministic Policy Gradient (DDPG)

For policy gradient approaches, we update the policy directly; this policy maps the state space to a probability distribution over the action space. This is generally done by applying some variant of gradient ascent to a performance function, typically the expected return (i.e., the state value function under the current policy):
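$$J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[R\big], \qquad \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\big],$$

where \rho^\pi is the state distribution induced by the policy \pi_\theta, R is the discounted return, and Q^\pi is the corresponding action-value function (the critic in actor-critic methods); this is the classical stochastic policy gradient theorem, written here following the notation of [2].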

In Deterministic Policy Gradient (DPG), we extend this approach to deterministic policies (the distribution collapses to a single element of the action space). Deep Deterministic Policy Gradient (DDPG) is a variant of DPG in which both the deterministic policy and the critic are approximated with deep neural networks. It is an off-policy algorithm that employs a replay buffer and requires the action space to be continuous. MADDPG can be seen as an extension of DDPG. For DDPG, the gradient is given by:
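$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\big],$$

where \mu_\theta is the deterministic policy, Q^\mu its critic, and \mathcal{D} the replay buffer from which states are sampled.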

To deal with the previously described limitations, the authors propose an algorithm that (i) uses only local information at execution time, (ii) makes no assumption of a differentiable model of the environment dynamics or of any particular communication structure between agents, and (iii) is applicable to both cooperative and competitive frameworks.

Centralized training and decentralized execution

The basic idea of MADDPG is to expand the information used in actor-critic policy gradient methods. During training, a centralized critic for each agent has access to its own policy and to the policies of all the other agents. During execution, however, each agent employs only its own (decentralized) policy. Although the critics are centralized, having one critic per agent instead of a single critic shared by all agents has the advantage that each agent can have its own reward function, which is necessary for competitive frameworks.

Figure 1: Overview of multi-agent decentralized actor, centralized critic approach [2]

Updating the Critic

We consider N agents, each one with its own continuous deterministic policy and its own centralized action-value function. This function takes the actions of all the agents and information about the state (in its simplest form, the observations of all the agents) as input and returns the agent's Q-value as output. To update each agent's critic, we employ a replay buffer, just like in DQN [3], from which the state transitions, actions, and corresponding rewards of all the agents are sampled. The loss function used to update the critic is then defined in the same way as in DQN:
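$$\mathcal{L}(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\Big[\big(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \dots, a_N) - y\big)^2\Big], \qquad y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}(\mathbf{x}', a_1', \dots, a_N')\big|_{a_j' = \mu_j'(o_j')},$$

where \mathbf{x} collects the state information (in the simplest case, the observations of all agents), \boldsymbol{\mu}' = \{\mu_{\theta_1'}, \dots, \mu_{\theta_N'}\} is the set of target policies with delayed parameters, and the expectation is taken over transitions sampled from the replay buffer. The notation follows [2].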

This equation works only when each agent has access to the other agents' policies, so that their actions can be computed and used as input to the critic. When this assumption is not realistic, each agent can instead learn approximations of the other agents' policies. The approximation kept by agent i of the policy of agent j is updated in an online fashion using an entropy regularizer, which changes the target employed to update the critic:
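$$\hat{y} = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}\big(\mathbf{x}', \hat{\mu}_i'^{\,1}(o_1'), \dots, \mu_i'(o_i'), \dots, \hat{\mu}_i'^{\,N}(o_N')\big),$$

where \hat{\mu}_i^{\,j} denotes agent i's approximation of agent j's policy and \hat{\mu}_i'^{\,j} the corresponding target network. Following [2], the approximation itself is learned by maximizing the log probability of agent j's observed actions together with an entropy bonus:

$$\mathcal{L}(\phi_i^{j}) = -\mathbb{E}_{o_j, a_j}\big[\log \hat{\mu}_i^{\,j}(a_j \mid o_j) + \lambda H(\hat{\mu}_i^{\,j})\big],$$

where \phi_i^{j} are the parameters of the approximation, H is its entropy, and \lambda the regularization weight.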

The effect of learning the policies of other agents is analyzed in the cooperative communication scenario (described later), where it achieves the same success rate as using the true policies, without a considerable decrease in convergence speed.

Updating the Actor

The real purpose of learning the critic is to use it to update the actor, which is the component actually used at execution time. For that purpose, we iteratively move in the direction of the gradient of the expected return, which for deterministic policies taking the agents' observations as input is equal to:
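$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\Big[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \dots, a_N)\big|_{a_i = \mu_i(o_i)}\Big],$$

where \mathcal{D} is the replay buffer that stores the joint transitions (\mathbf{x}, a_1, \dots, a_N, r_1, \dots, r_N, \mathbf{x}').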

In a competitive setting, non-stationarity represents a particularly challenging problem; agents may overfit to the other agents' behaviors. To deal with this problem, the authors propose training a collection of K sub-policies for each agent. At every episode, each agent randomly chooses one of its K sub-policies and samples from a replay buffer specific to that sub-policy. In this case, the gradient update for the ensemble of sub-policies becomes:
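$$\nabla_{\theta_i^{(k)}} J_e(\mu_i) = \frac{1}{K}\, \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}_i^{(k)}}\Big[\nabla_{\theta_i^{(k)}} \mu_i^{(k)}(o_i)\, \nabla_{a_i} Q^{\mu_i}(\mathbf{x}, a_1, \dots, a_N)\big|_{a_i = \mu_i^{(k)}(o_i)}\Big],$$

where \mu_i^{(k)} is the k-th sub-policy of agent i and \mathcal{D}_i^{(k)} the replay buffer associated with it.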

The authors analyze the influence of using these ensembles in three scenarios named Keep-away, Physical deception, and Predator-prey, and conclude that the ensembles perform better, especially when they compete against agents trained with a single policy.

Under all the previously described considerations, the full MADDPG algorithm for N agents alternates between collecting joint transitions into a shared replay buffer and performing, for every agent, a centralized critic update, a decentralized actor update, and a soft update of the target networks.
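The following sketch shows what one such training step can look like in PyTorch. It is a minimal illustration rather than the authors' implementation: the toy dimensions, the two-layer 64-unit networks, the hyperparameters, and names such as maddpg_update are assumptions, and environment interaction, exploration noise, and the replay buffer itself are omitted.

```python
# Minimal MADDPG update sketch in PyTorch. This is an illustration, not the
# authors' implementation: toy dimensions, network sizes, hyperparameters,
# and names such as `maddpg_update` are assumptions made for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_AGENTS, OBS_DIM, ACT_DIM = 2, 8, 2      # toy sizes (assumed)
GAMMA, TAU, LR = 0.95, 0.01, 1e-2         # assumed hyperparameters

class Actor(nn.Module):
    """Decentralized actor: sees only its own observation o_i."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic: sees the observations and actions of all agents."""
    def __init__(self):
        super().__init__()
        in_dim = N_AGENTS * (OBS_DIM + ACT_DIM)
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, all_obs, all_acts):
        return self.net(torch.cat([all_obs, all_acts], dim=-1))

actors, critics = [Actor() for _ in range(N_AGENTS)], [Critic() for _ in range(N_AGENTS)]
t_actors, t_critics = [Actor() for _ in range(N_AGENTS)], [Critic() for _ in range(N_AGENTS)]
for net, target in zip(actors + critics, t_actors + t_critics):
    target.load_state_dict(net.state_dict())
actor_opts = [torch.optim.Adam(a.parameters(), lr=LR) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=LR) for c in critics]

def maddpg_update(obs, acts, rews, next_obs):
    """One training step from a replay-buffer batch of joint transitions.
    Shapes: obs/next_obs (B, N, OBS_DIM), acts (B, N, ACT_DIM), rews (B, N)."""
    B = obs.shape[0]
    flat = lambda x: x.reshape(B, -1)
    # Joint next action a' = (mu'_1(o'_1), ..., mu'_N(o'_N)) from the target actors.
    next_acts = torch.stack([ta(next_obs[:, j]) for j, ta in enumerate(t_actors)], dim=1)
    for i in range(N_AGENTS):
        # Centralized critic update: regress Q_i towards the TD target y.
        y = rews[:, i:i + 1] + GAMMA * t_critics[i](flat(next_obs), flat(next_acts)).detach()
        critic_loss = F.mse_loss(critics[i](flat(obs), flat(acts)), y)
        critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()
        # Decentralized actor update: ascend Q_i with respect to agent i's own action only.
        joint = [acts[:, j] for j in range(N_AGENTS)]
        joint[i] = actors[i](obs[:, i])                    # replace a_i by mu_i(o_i)
        actor_loss = -critics[i](flat(obs), torch.cat(joint, dim=-1)).mean()
        actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()
    # Soft update of every target network: theta' <- tau*theta + (1 - tau)*theta'.
    for net, target in zip(actors + critics, t_actors + t_critics):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)

# Usage with a random batch of 32 joint transitions (stand-in for a replay buffer).
maddpg_update(torch.randn(32, N_AGENTS, OBS_DIM), torch.rand(32, N_AGENTS, ACT_DIM) * 2 - 1,
              torch.randn(32, N_AGENTS), torch.randn(32, N_AGENTS, OBS_DIM))
```

In a full training loop, each environment step would add exploration noise to the actors' outputs, store the resulting joint transition in the replay buffer, and periodically call an update like the one above.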

Comparative Results

For all the considered environments, the authors use feed-forward continuous policies, parameterized as described in the caption of Figure 2.

The authors consider competitive and cooperative games within an environment composed of N agents and L landmarks. The agents can take two types of actions: physical and communication. For the comparison with other RL algorithms, the authors only consider one scenario, named Cooperative communication, in which there are a speaker, a listener, and three landmarks. In each episode, the listener moves to one landmark and gets a reward depending on how far it is from the correct landmark; however, only the speaker, not the listener, knows which landmark is the correct one, and the speaker can send a message to the listener. The performance of MADDPG and other algorithms, measured as the mean episode reward, is presented in Figure 2.

Figure 2: Policies parameterized by a two-layer ReLU MLP with 64 units; messages were sent as approximations of a discrete message using the Gumbel-Softmax estimator. Left: mean episode reward vs. episode. Right: percentage of targets reached after 25,000 episodes.

In Cooperative communication, the listener goes to the correct landmark 84% of the time with MADDPG, while all the other algorithms fail. The authors also consider other scenarios to compare MADDPG to DDPG, which lacks the centralized critic. First, they consider a scenario named Cooperative navigation, in which agents must cooperate through physical actions to reach a group of landmarks while avoiding collisions among themselves. MADDPG agents beat DDPG agents in Cooperative navigation (N=2): they reach a slightly smaller average distance to the landmarks and half the average number of collisions per episode.

Second, they consider a scenario named Keep-away, composed of a target landmark and two teams of cooperating agents: one team knows which landmark is the target and is rewarded based on its distance to it, while the other team does not know which landmark is the target and tries to prevent the first team from reaching it by physically pushing the agents away and temporarily occupying the landmark.

Third, they consider a scenario named Physical deception, where a group of agents cooperate to reach a landmark and are rewarded based on the minimum distance between any member of the team and the target. There is also a single adversary with the same objective, who does not know which landmark is the target. The cooperating team therefore needs not only to have one member reach the target but also to cover the other landmarks so that the adversary cannot tell which one is correct. In a Physical deception task with two landmarks, the adversary is deceived (all landmarks are covered) 94% of the time, and the success rate of the adversary is relatively low when the cooperative team employs MADDPG. In contrast, when the cooperative team employs DDPG, it fails to deceive the adversary, even when the adversary itself employs DDPG rather than MADDPG.

The fourth considered scenario is a Predator-prey game in which a set of slower agents tries to collide with a faster prey in a randomly generated world. In a MADDPG predator vs. DDPG prey setting, the number of collisions per episode is 16.1, compared to 10.3 with a DDPG predator vs. MADDPG prey. The fifth scenario is named Covert communication: a speaker named Alice sends an encoded message to a listener named Bob, and both of them share the key. A second listener, Eve, tries to reconstruct the message, and Alice and Bob are penalized if she succeeds. Results show that Bob achieves a reconstruction success rate of 52.4% with MADDPG and 25.1% with DDPG, and only when Alice uses MADDPG does Eve's accuracy drop to nearly random.

Concluding remarks

MADDPG is a suitable and relatively simple algorithm in cases where the existence of a centralized critic is a reasonable assumption. This implies that each agent must have access to the information of all the other agents during training. Furthermore, relying on the information from the other agents implies that the input space of the critic grows linearly with the number of agents.

In the context of multi-agent approaches to Reinforcement Learning, MADDPG represents a middle point in terms of the amount of information considered. On one hand, we have approaches like the one presented in [4], where each agent treats the other agents as part of the environment; these approaches have been shown to perform poorly in several scenarios. On the other hand, we have approaches in which every agent has access to the information of all the other agents but that work only in cooperative settings. MADDPG uses the information of all the agents during training but not during execution, and it works in both cooperative and competitive settings.

References

[1] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” 2019.

[2] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” 2017.

[3] Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236

[4] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, Morgan Kaufmann, 1993.
