How to Craft and Solve Multi-Agent Problems: A Casual Stroll with RLlib and Tensorforce

A brief tutorial on defining a Multi-Agent problem and solving it using powerful Reinforcement Learning Libraries

Shresth Verma
8 min read · Jun 23, 2020
Source: Cover page of the book Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations by Yoav Shoham and Kevin Leyton-Brown

Multi-agent systems are everywhere: from a flock of birds in flight and a wolf pack hunting deer, to people driving cars and trading stocks. These real-world cognition problems involve multiple intelligent agents interacting with each other. But what's driving us to study them? Curiosity, perhaps?

If only we could mimic complex group behaviours using Artificial Intelligence.

And that's where Multi-Agent Reinforcement Learning (MARL) comes in. While in a single-agent Reinforcement Learning (RL) setup the state of the environment changes due to the action of a single agent, in MARL the state transition occurs based on the joint action of multiple agents. Formally, it is an extension of the classic Markov Decision Process (MDP) framework to multiple agents and is represented as a Stochastic Game (SG). Typically, a MARL-based multi-agent system is designed with the following characteristics in mind:

  1. Autonomy: Agents must be at least partially independent and self-aware
  2. Local View: Decision making is done subject to only local observability of the system
  3. Decentralization: No agent can control the whole system

This design is in contrast to a more traditional monolithic problem-solving approach which can be severely limited in terms of scalability and complex decision making.
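For the formally inclined, a stochastic game for N agents is usually written as a tuple (standard textbook notation, nothing library-specific):

SG = (S, A1, …, AN, P, R1, …, RN)

where S is the set of environment states, Ai is the action space of agent i, P is the state transition function conditioned on the current state and the joint action of all N agents, and Ri is the reward function of agent i.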

Okay, enough theory. Let’s get our hands dirty with some code.

A Toy Multi-Agent Problem

Consider five farmers. Each of them wants to withdraw water from a stream to irrigate their fields (grow more corn!). The problem is that if an upstream farmer withdraws too much water, there will be water scarcity for the farmers situated downstream. Let's now try to frame a MARL problem that nudges the farmers not to be greedy and to cooperate.

Observation Space

Each agent observes the amount of water flowing in the stream in a particular month. For simplicity, we randomly choose the water flowing in the stream in a given month to be between 200 and 800 volume units. (Note that here we allow agents to have global observability because the observation is quite simple.)

Action Space

Each agent chooses a proportion of water to withdraw from the stream. This proportion is between 0 and 1. Note that action selection happens at the start of every episode. You can think of it as an announcement made by every farmer at the start of the episode. (A bit unrealistic, but hey, this is a toy example!)

This month, 380 metric litres of water will flow in the stream. Hmm, I think 110 metric litres is enough for my corn field. I declare that I will withdraw 30% of the water from the stream this month.

What if the actual water in the stream is less than the water demanded (because upstream agents already withdrew it)? In that case, the farmer withdraws whatever is left in the stream and a penalty is imposed on the system (described in the next section).

Reward

More water means better irrigation, but water beyond a certain limit can damage crops. Moreover, there is a minimum water requirement for every farmer. Let's first put a simple bound on the water withdrawn w by every agent: 0 < w < 200. We then define the reward from crop yield as a quadratic function of the water withdrawn, which is positive between 0 and 200.

R(w) = -w² + 200w

Further, we define a penalty proportional to the amount of water deficit for every agent.

water deficit = water demanded - water withdrawn

Penalty = 100 * (water deficit)

To promote cooperation, we give a global reward to all the agents, equal to the sum of the individual rewards and penalties over every agent.
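To make the reward concrete, here is a tiny illustration in plain Python (the helper names are mine, purely for exposition):

def crop_reward(w):
    # quadratic yield reward, positive only for 0 < w < 200
    return -w**2 + 200 * w

def deficit_penalty(demanded, withdrawn):
    # penalty proportional to the unmet water demand
    return 100 * (demanded - withdrawn)

# e.g. an agent demands 150 units but only 100 are left in the stream
print(crop_reward(100) - deficit_penalty(150, 100))  # 10000 - 5000 = 5000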

Implementation using RLlib

First, we need to implement a custom environment. In RLlib, this is done by inheriting from the MultiAgentEnv class. A brief outline of the custom environment design is as follows:

  1. The custom environment must define a reset and a step function. Add other helper functions if you want to.
  2. The reset function returns a dictionary of observations with agent ids as keys. Agent ids can be anything unique to the agents (duh!) but must be consistent across all functions in the environment definition.
  3. The step function takes as input a dictionary of actions with agent ids as keys (same as above). It must return dictionaries of observations, rewards, dones (booleans indicating whether the episode has terminated) and any additional info. Again, the keys for all these dictionaries are agent ids.
  4. The dones dictionary has an additional key __all__ which must be True only when all agents have completed the episode.
  5. Lastly, a powerful but rather confusing design choice: not all agents need to be present in the game at every time step. This necessitates that
  • The action_dict passed to the step function always contains actions for the observations returned in the previous timestep.
  • The observations returned at any timestep need not be for the same agents for which actions were received.
  • The keys for observations, rewards, dones and info must be the same.

This design choice allows decoupling agent action and reward, which is useful in many multi-agent scenarios.

And now, coming back to our game of happy farmers, the custom environment can be defined something like this.
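Here is a minimal sketch of what that environment could look like, following the reward scheme above and assuming single-step episodes with integer agent ids (the details are my simplification, not necessarily the exact code from the repository):

import numpy as np
from gym.spaces import Box
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class IrrigationEnv(MultiAgentEnv):
    def __init__(self, num_agents=5):
        self.num_agents = num_agents
        # Each agent observes the stream flow for the month (200 to 800 volume units)
        self.observation_space = Box(low=200., high=800., shape=(1,))
        # Each agent declares the proportion of the stream it wants to withdraw
        self.action_space = Box(low=0., high=1., shape=(1,))

    def reset(self):
        self.water = np.random.uniform(200, 800)
        return {i: np.array([self.water]) for i in range(self.num_agents)}

    def step(self, action_dict):
        global_reward = 0.
        remaining = self.water
        for i in range(self.num_agents):  # agents ordered upstream to downstream
            demanded = float(action_dict[i]) * self.water
            withdrawn = min(demanded, remaining)
            remaining -= withdrawn
            # quadratic crop-yield reward, positive for 0 < w < 200
            global_reward += -withdrawn**2 + 200 * withdrawn
            # penalty proportional to the water deficit
            global_reward -= 100 * (demanded - withdrawn)

        obs = {i: np.array([self.water]) for i in range(self.num_agents)}
        rewards = {i: global_reward for i in range(self.num_agents)}  # shared global reward
        dones = {i: True for i in range(self.num_agents)}
        dones["__all__"] = True  # single-step episode
        infos = {i: {} for i in range(self.num_agents)}
        return obs, rewards, dones, infos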

The Train Driver

RLlib needs some information before starting heavy-duty training. This includes:

  1. Registering the custom environment
from ray.tune.registry import register_env

def env_creator(_):
    return IrrigationEnv()

single_env = IrrigationEnv()
env_name = "IrrigationEnv"
register_env(env_name, env_creator)

2. Defining a mapping between agents and policies

obs_space = single_env.observation_space
act_space = single_env.action_space
num_agents = single_env.num_agents

def gen_policy():
    return (None, obs_space, act_space, {})

policy_graphs = {}
for i in range(num_agents):
    policy_graphs['agent-' + str(i)] = gen_policy()

def policy_mapping_fn(agent_id):
    return 'agent-' + str(agent_id)

3. Hyperparameters and training configuration details (for my humble training setup)

config = {
    "log_level": "WARN",
    "num_workers": 3,
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 1,
    "lr": 5e-3,
    "model": {"fcnet_hiddens": [8, 8]},
    "multiagent": {
        "policies": policy_graphs,
        "policy_mapping_fn": policy_mapping_fn,
    },
    "env": "IrrigationEnv"
}

4. Lastly, the training driver code

import ray
from ray import tune

exp_name = 'more_corns_yey'
exp_dict = {
    'name': exp_name,
    'run_or_experiment': 'PG',
    'stop': {
        'training_iteration': 100
    },
    'checkpoint_freq': 20,
    'config': config,
}

ray.init()
tune.run(**exp_dict)

And that's it: the mighty Policy Gradient (PG) algorithm will optimize your system towards socially optimal and cooperative behaviour. You can use Proximal Policy Optimization (PPO) instead of PG to get some improvement in the results. More algorithm choices and hyperparameter details are available in the RLlib docs.
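For example, switching the trainer is a one-line change in the experiment dictionary above (PPO is one of the algorithms registered with tune out of the box):

exp_dict['run_or_experiment'] = 'PPO'  # swap the vanilla policy gradient for PPO
tune.run(**exp_dict)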

Implementation using Tensorforce

Unlike RLlib, Tensorforce doesn't natively support multi-agent RL. Why do we want to try it then? Well, from my personal experience, if you wish to implement complex network architectures for the policy function, or need a very efficient training pipeline over multiple clusters, RLlib truly shines. But it seems like overkill when you just want a simple multi-agent system with somewhat tricky inter-agent interactions. And that's where Tensorforce can be very handy.

Tensorforce has an API similar to RLlib's tune.run, but we aren't going to discuss that. Instead, we will focus on the act-and-observe workflow. It is super flexible and gives you the freedom to decouple agent actions, environment step execution, and internal model updates. Achieving similar flexibility in RLlib is relatively hard. Okay, now back to code snippets.

Firstly, we need an environment. Here, we don't need any particular format; the only requirement is that we can get initial observations when reset is called, and get new observations, rewards, terminal states and any additional info on environment step execution. So, for simplicity's sake, let's just reuse the previously defined environment.

Secondly, we need agents. We can create them with the required specifications using the Agent.from_spec method. Let us create 5 of them. (Note that the state and action specifications mirror the environment definition.)

from tensorforce.agents import Agent

env = IrrigationEnv()
num_agents = env.num_agents

# State and action specifications mirror the environment definition
state_space = dict(type='float', shape=(1,))
action_space = dict(type='float', shape=(1,), min_value=0., max_value=1.)

config = dict(
    states=state_space,
    actions=action_space,
    network=[
        dict(type='dense', size=8),
        dict(type='dense', size=8),
    ]
)

agent_list = []
for i in range(num_agents):
    agent_list.append(Agent.from_spec(spec='ppo_agent', kwargs=config))

The agent configuration can be provided in multiple ways. This is a minimal example but the reader can refer to Tensorforce docs for more details.

And lastly, we need the training code. I must confess that I have written it in a rather messy way, so here is the basic outline of the workflow first.

  1. Create a batch of environments
env_batch = []
for i in range(batch_size):
    env_batch.append(IrrigationEnv())

2. Loop over training iterations and in every iteration, reset the batch of environments to obtain initial observations.

for _ in range(training_iterations):
    for b in range(batch_size):
        obs = env_batch[b].reset()

3. Loop over agent_ids for which observations are returned and call Agent.act on batch of observations.

for agent_id in obs:
    actions = agent_list[agent_id].act(states=obs_batch[agent_id])

4. Loop over all the environments in the batch and apply the actions to each environment. We get new observations, rewards, terminal states and additional info.

for b in range(batch_size):
    new_obs, rew, dones, info = env_batch[b].step(action_batch[b])

5. Lastly, for every agent for which we called Agent.act, call Agent.model.observe with rewards and terminal states to internalize experience trajectories.

for agent_id in new_obs:
    agent_list[agent_id].model.observe(reward=rew_batch[agent_id], terminal=done_batch[agent_id])

The rest is just some code gymnastics to conveniently access values for every agent and every element in the batch. The complete training code is:

import numpy as np

# batch_size and training_iterations are hyperparameters; pick values that suit your setup
env_batch = []
for i in range(batch_size):
    env_batch.append(IrrigationEnv())

for _ in range(training_iterations):
    # per-agent buffers for this iteration
    obs_batch = {i: [] for i in range(num_agents)}
    rew_batch = {i: [] for i in range(num_agents)}
    done_batch = {i: [] for i in range(num_agents)}
    action_batch = {b: {} for b in range(batch_size)}

    # reset every environment and collect initial observations per agent
    for b in range(batch_size):
        obs = env_batch[b].reset()
        for agent_id in range(num_agents):
            obs_batch[agent_id].append(obs[agent_id])

    # each agent acts on its batch of observations
    # (obs holds the agent ids, which are the same for every environment in the batch)
    for agent_id in obs:
        actions = agent_list[agent_id].act(states=obs_batch[agent_id])
        for b in range(batch_size):
            action_batch[b][agent_id] = actions[b]

    # step every environment with the collected joint actions
    for b in range(batch_size):
        new_obs, rew, dones, info = env_batch[b].step(action_batch[b])
        for agent_id in obs:
            rew_batch[agent_id].append(rew[agent_id])
            done_batch[agent_id].append(dones[agent_id])

    # let each agent internalize its rewards and terminal flags
    for agent_id in new_obs:
        agent_list[agent_id].model.observe(reward=rew_batch[agent_id], terminal=done_batch[agent_id])

    # mean global reward across the batch (all agents share the same reward)
    print(np.mean(rew_batch[0]))

Now, I must say that this isn't very scalable code. Certainly, some efficiency improvements can be brought in by parallelising the for loops and using threading. However, what I wanted to showcase here is the ease with which one can quickly prototype agent interactions in multi-agent systems. More complex mechanisms such as inter-agent teaching, advising, or multi-agent communication are also straightforward to implement using this workflow.
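On the parallelisation point, the environment-stepping loop could, for instance, be handed to a thread pool. A minimal sketch using plain Python threading (not a Tensorforce feature), reusing the variables from the training code above:

from concurrent.futures import ThreadPoolExecutor

def step_env(b):
    # step one environment in the batch with the actions collected for it
    return env_batch[b].step(action_batch[b])

with ThreadPoolExecutor(max_workers=4) as pool:
    step_results = list(pool.map(step_env, range(batch_size)))

for b, (new_obs, rew, dones, info) in enumerate(step_results):
    for agent_id in new_obs:
        rew_batch[agent_id].append(rew[agent_id])
        done_batch[agent_id].append(dones[agent_id])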

Conclusion and the mysteries of the future

Tensorforce and RLlib are both remarkable libraries for training RL agents at the moment. However, they suffer from hard-to-read and oftentimes chaotic documentation. There is also a severe lack of examples of advanced use-cases. Moreover, help on the internet is limited, as the community is still not very large. I have thus decided to write a series of blogs highlighting how interesting multi-agent systems can be crafted using these libraries. I hope this will be especially useful to those who are stuck for days trying to put into code what is in their minds. Let me know if you have any suggestions, comments or fun ideas! I am also open to collaboration.

Oh and all the code is available in the repository here. Cheerio!
