Improving OpenAI's multi-agent actor-critic RL algorithm

abhishek kushwaha
Brillio Data Science
5 min read · Jan 1, 2019

Multi-agent RL algorithms are notoriously unstable to train. This article describes a way to stabilise training, along with experimental results on the Unity Tennis environment.

OpenAI published a paper in January 2018 on multi-agent RL that uses a decentralised actor, centralised critic approach. Though it improved considerably on existing multi-agent RL algorithms and showed very good results, it is still unstable to train. The single-agent DDPG algorithm takes longer to start learning but is more stable once it does. I will describe a way to greatly improve on this. Let's first understand the OpenAI paper in brief.

I assume that readers are familiar with reinforcement learning (actor-critic methods in particular), so I will not cover the basics and will jump directly to the single-agent and multi-agent cases.

Single agent vs Multi-agent environment

In the single-agent case, learning is not affected by the environment dynamics (e.g. gravitational force) because the environment is stationary from the agent's perspective (the agent can learn a model of it). With multiple agents, however, the environment becomes non-stationary from the perspective of any individual agent, in a way that is not explainable by changes in the agent's own policy: the other agents keep changing their policies, so there is no fixed environment model for the agent to learn.

DDPG algorithm (single agent)

In the DDPG algorithm, the agent learns a policy (a function) to perform a task in the environment. Here the function is a neural network (NN) trained through back-propagation. Policy learning is guided by a Q-value, which is itself learned by another NN via Q-learning. (DDPG has several other components that make training work well, such as a replay buffer, target networks, etc., but I am not touching them as they are not required for the current purpose.) The agent observes the environment (through sensors), takes an action based on that observation, and gets a reward from the environment. Using this reward signal it tries to learn the policy.
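To make the actor-critic interplay concrete, here is a minimal sketch of one DDPG update step, assuming PyTorch and hypothetical actor/critic networks, their target copies, and optimisers (the replay buffer, target-network soft updates and exploration noise are assumed to live elsewhere; this is not the exact code from the experiment):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (assumed value)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt):
    # batch tensors sampled from a replay buffer
    states, actions, rewards, next_states, dones = batch

    # ---- Critic step (Q-learning): regress Q towards the bootstrapped target ----
    with torch.no_grad():
        next_actions = target_actor(next_states)
        q_next = target_critic(next_states, next_actions)
        q_target = rewards + GAMMA * (1 - dones) * q_next
    critic_loss = F.mse_loss(critic(states, actions), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # ---- Actor step: adjust the policy to maximise the critic's Q-value ----
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```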

MaDDPG algorithm (Multi-agent)

OpenAI adapted the DDPG algorithm above to the multi-agent setting. It uses a 'decentralised actor, centralised critic' training approach: during critic training, every agent has access to all other agents' state observations and actions, but during execution each actor predicts its action from its own state observation alone. This eases training because the environment becomes stationary for each agent. The diagram below, taken from the published article, illustrates the process.

Overview of our multi-agent decentralized actor, centralized critic approach (from the OpenAI paper)

All the agents' states and actions are concatenated and fed as input to the critic neural network, whose output is the Q-value for that joint state and action. This Q-value is used to train the actor, which receives only that particular agent's state as input and outputs its action values.
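A minimal sketch of such a centralised critic, assuming PyTorch and made-up layer sizes (not the exact network from the paper):

```python
import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Original MaDDPG-style critic: all agents' states and actions are
    concatenated and fed into the first layer together (illustrative sizes)."""
    def __init__(self, full_state_size, full_action_size, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(full_state_size + full_action_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, all_states, all_actions):
        # all_states:  (batch, sum of every agent's observation size)
        # all_actions: (batch, sum of every agent's action size)
        x = torch.cat([all_states, all_actions], dim=-1)
        return self.net(x)
```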

Training: DDPG vs MaDDPG

Environment: I chose the Unity multi-agent Tennis environment. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits it out of bounds, it receives a reward of -0.01. The goal of each agent is therefore to keep the ball in play. The game is episodic: the score for an episode is the maximum of the two agents' rewards, and the reported score is the average episode score over the last 100 episodes (a sketch of this bookkeeping follows the figure below). The maximum reward that can be collected is 5, and the environment is considered solved when the average score reaches at least +0.5. Training is done with self-play, i.e. the AI agent learns by playing against itself.

Unity Tennis environment (multi-agent).
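For reference, a small sketch of this scoring rule (a hypothetical helper for illustration, not the notebook's exact code):

```python
from collections import deque
import numpy as np

def is_solved(episode_agent_rewards, window=100, threshold=0.5):
    """episode_agent_rewards: list of (reward_agent_0, reward_agent_1), one per episode.
    The episode score is the max over the two agents; the environment counts as
    solved when the 100-episode average of that score reaches +0.5."""
    scores = deque(maxlen=window)
    for r0, r1 in episode_agent_rewards:
        scores.append(max(r0, r1))
        if len(scores) == window and np.mean(scores) >= threshold:
            return True
    return False
```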

DDPG results: With the DDPG algorithm it takes around 3000+ episodes to solve the environment; extending the training improves the score beyond 3. Training is slow to get going: for nearly the first 2300 episodes the agent performs poorly, but once it starts learning it is quite stable. The graph below shows the training curve.

Scores plot for DDPG.

MaDDPG results (with the same NN as in the paper): With the MaDDPG algorithm training is easier and the agent starts learning earlier than the DDPG agent; the environment is solved in around 1500 episodes. But as the graph below shows, learning is unstable: the agent's performance quickly collapses to zero, recovers as it starts learning again, and then collapses once more.

Scores for MaDDPG (original)

Improving MaDDPG

In the original MaDDPG, the input to the critic network is all agents' states and all agents' actions concatenated into one vector. Instead, we split this input: the 'state' part (which keeps the environment stationary for the acting agent) consists of all agents' states plus the actions of all OTHER agents except the acting agent, and this goes into the first layer of the NN. The acting agent's own action is fed into the network only after the first layer, to produce the Q-value for this (state, action) pair. With all other parameters kept the same, this greatly stabilises training.
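A minimal sketch of this modified critic, again assuming PyTorch and made-up layer sizes; note how the acting agent's own action enters only at the second layer:

```python
import torch
import torch.nn as nn

class ModifiedCritic(nn.Module):
    """Modified critic sketch: the first layer sees all agents' states plus
    the OTHER agents' actions; the acting agent's own action is injected
    only into the second layer. Layer sizes are illustrative assumptions."""
    def __init__(self, full_state_size, other_action_size, own_action_size,
                 hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(full_state_size + other_action_size, hidden)
        self.fc2 = nn.Linear(hidden + own_action_size, hidden)
        self.fc3 = nn.Linear(hidden, 1)

    def forward(self, all_states, other_actions, own_action):
        # First layer: the "state" part only (keeps things stationary for this agent)
        x = torch.relu(self.fc1(torch.cat([all_states, other_actions], dim=-1)))
        # Second layer: inject the acting agent's own action
        x = torch.relu(self.fc2(torch.cat([x, own_action], dim=-1)))
        return self.fc3(x)  # Q-value for this (state, action) pair
```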

New MaDDPG results: As the training graph below shows, the agent's performance improves steadily. With the new MaDDPG, the agent reaches an average score (over 100 episodes) of +4.4 within 4000 episodes, whereas the original MaDDPG only reached an average score of +1.03 in the same 4000 episodes.

Scores for New MaDDPG

Though Tennis is a relatively easy problem, improving the algorithm will help us solve harder real-life problems, which is the ultimate goal.

Comment on the article for any suggestions or clarifications. Clap if you liked it, and follow me for regular articles in the machine learning domain.

This GitHub repo has the Jupyter notebook and instructions to run the experiment.

Speeding up deep learning inference by up to 20X: Check out my post on speeding up deep learning inference using TensorRT. If you are not using TensorRT, your deployment is likely leaving a lot of performance on the table.
