Multi-Agent Reinforcement Learning (MARL) algorithms

Independent, Neighborhood and Mean-field Q Learning explained

Mehul Gupta
Data Science in your pocket
5 min read · Jun 19, 2023



Adding one more interesting piece to the longest blog series I have ever written (almost 15 posts now; you can check them below), this post covers Multi-Agent environments in Reinforcement Learning and the Q-Learning-based algorithms used to train agents in them.

My debut book “LangChain in your Pocket” is out now

Most of the games we play, be it Contra, PUBG, or Need for Speed, involve multiple players/characters. These players can be of three types:

Actual player: Like you and me

NPC: Non-Player Character. Such characters aren’t competing with you but can be taken as secondary characters in the game. For example, in Mario, the dragon can be taken as an NPC. Do note that an NPC is usually not controlled by a human but by some computer logic.

Bots: Such characters are also not controlled by humans but by some computer logic; however, they compete against actual players. For example, you must have heard people mention ‘Bots’ in PUBG, which aren’t as intelligent as human competitors, or seen a ‘vs Computer’ mode in many games.

Note: Folks usually get confused between NPCs and Bots. The only difference is that Bots compete against the actual player while NPCs don’t.

You might need a few references (Temporal Difference, Q Learning, and DQNs) from my previous posts on Reinforcement Learning to understand this one.

Be it NPCs or Bots, some predefined algorithm is usually used to operate them, but high-end games, to provide a more realistic experience, use AI to train these characters as well. Multi-Agent Reinforcement Learning finds its application in such cases, as you will usually have multiple NPCs/Bots at the same time.

Training in a Multi-Agent environment brings in a big challenge that we should consider before jumping ahead:

An Action taken by each Agent will affect the Action taken by other agents.

Hence, the environment isn’t stationary anymore, and the current state alone is no longer enough context to choose an action/q-value for an agent.

Before moving ahead, you need to know about DQNs, which I have covered in my previous posts.

We will start off with a naive approach:

Independent Q-Learning (IQL)

Going with a baseline approach, the easiest method one can think of is to have a separate DQN for each agent involved: feed it the state, get the q-values, and eventually choose an action for each agent. In this solution, we assume that the other agents are part of the environment and don’t have any thinking capabilities, i.e. they are treated as stationary elements.
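To make this concrete, below is a minimal sketch of the idea, assuming PyTorch; the class names (QNet, IndependentQLearner) and the dimensions are illustrative placeholders, not from any particular library.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """A small DQN mapping an agent's state to Q-values, one per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

class IndependentQLearner:
    """One separate DQN per agent; every agent treats the others as part of the environment."""
    def __init__(self, n_agents, state_dim, n_actions):
        self.n_actions = n_actions
        self.nets = [QNet(state_dim, n_actions) for _ in range(n_agents)]

    def act(self, states, epsilon=0.1):
        actions = []
        for net, state in zip(self.nets, states):
            if torch.rand(1).item() < epsilon:             # explore
                actions.append(torch.randint(self.n_actions, (1,)).item())
            else:                                          # exploit this agent's own Q-values
                with torch.no_grad():
                    q = net(torch.as_tensor(state, dtype=torch.float32))
                actions.append(int(q.argmax()))
        return actions

# Example: 3 agents, 8-dimensional states, 4 actions each
learner = IndependentQLearner(n_agents=3, state_dim=8, n_actions=4)
print(learner.act([torch.randn(8) for _ in range(3)]))
```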

This may work for a few cases, but the problem is evident: it assumes the other agents are stationary, which is not true. Also, if the number of NPCs is very high, training so many DQNs isn’t feasible. Hence, for most environments, this approach might not work.

We somehow need to incorporate the potential actions that other agents present in the environment can take before choosing an action for any agent.

Also, we need to use a smaller number of DQNs.

To incorporate other agents’ actions while calculating the q-value for an agent, another baseline idea is to provide the other agents’ actions alongside the current state as input, rather than just the current state as we do in a traditional DQN. The diagram below explains things better:

A) traditional DQN, B) updated DQN for MARL
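Below is a rough sketch of option B, again assuming PyTorch: the other agents’ one-hot actions are concatenated with the current state before going into the network. The drawback discussed next is already visible in the input_dim line, which grows with every extra agent.

```python
import torch
import torch.nn as nn

class JointActionDQN(nn.Module):
    """A DQN whose input is the current state concatenated with the one-hot
    actions of all the other agents (diagram B above)."""
    def __init__(self, state_dim, n_other_agents, n_actions):
        super().__init__()
        input_dim = state_dim + n_other_agents * n_actions   # grows with every extra agent
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state, other_actions_onehot):
        x = torch.cat([state, other_actions_onehot.flatten()], dim=-1)
        return self.net(x)

# Example: 8-dim state, 2 other agents, 4 actions each
state = torch.randn(8)
others = torch.nn.functional.one_hot(torch.tensor([2, 0]), num_classes=4).float()
q_values = JointActionDQN(state_dim=8, n_other_agents=2, n_actions=4)(state, others)
print(q_values.shape)   # torch.Size([4])
```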

This method, though it should work fine for some environments, has major drawbacks:

If the number of NPCs is big, the input dimension can be huge (say, in war games where we can have an army of thousands of soldiers, each being an NPC/Bot).

If you wish to add a new NPC, the whole model may require retraining, as architecture changes are needed.

The idea of adding other agents’ actions alongside the state information to a DQN is nice, but this extra input has to be of a constant size and not too big either.

Nearest Neighborhood Q-Learning

The idea of Nearest Neighborhood takes inspiration from KNN: we assume that an agent is influenced most by its K nearest neighbors, and the rest of the NPCs can be ignored.

What does this mean?

If you’re playing PUBG and you land in ‘Pochinki’, you would choose your actions based on the enemies in ‘Pochinki’, not on the players in ‘Hospital’ or ‘School’. You won’t care about them until they come closer to you.

In short, most of the time we don't need to consider all the NPCs/Bots while making a decision; the closest ones are the most important, and hence comes Nearest Neighbor Q-Learning. So even if there are hundreds of bots, depending on ‘K’, we would feed the potential actions of the K nearest neighbors of an agent, alongside the current state, to the DQN for predicting the q-values for the current agent.

As you must have guessed, this helps in keeping the size of the input vector fixed: even if we add or remove NPCs/Bots, the number of neighbors considered stays the same, so the input vector size stays the same. The size of the input vector can also be controlled through ‘K’.
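Here is a minimal sketch of the neighbor selection, assuming PyTorch and assuming the environment exposes each agent's position (the function name and shapes are illustrative): only the K nearest agents' actions are concatenated with the state, so the input size stays fixed no matter how many bots exist.

```python
import torch

def k_nearest_actions(agent_pos, other_positions, other_actions_onehot, k):
    """Return the one-hot actions of the k agents closest to agent_pos,
    flattened into a fixed-size vector of length k * n_actions."""
    distances = torch.norm(other_positions - agent_pos, dim=-1)
    nearest = torch.topk(distances, k, largest=False).indices
    return other_actions_onehot[nearest].flatten()

# 100 other agents, but only the k=3 nearest influence the DQN input
agent_pos = torch.tensor([0.0, 0.0])
other_positions = torch.randn(100, 2)
other_actions = torch.nn.functional.one_hot(torch.randint(0, 4, (100,)), num_classes=4).float()

neighbor_feats = k_nearest_actions(agent_pos, other_positions, other_actions, k=3)
state = torch.randn(8)
dqn_input = torch.cat([state, neighbor_feats])   # size 8 + 3*4 = 20, constant even if bots are added/removed
print(dqn_input.shape)
```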

Mean Field Q-Learning

In some games,

  • We might need to consider all NPCs/Bots
  • ‘K’ can be big

In such cases, Nearest Neighbors might not be of much help. So Mean Field Q-Learning can be used, where we go with the baseline assumption that all other agents behave in the same average or mean way as the current agent. Hence, instead of feeding each agent’s action separately to the DQN, we calculate a single mean action vector over all the other agents by:

  • Converting the actions taken by the different agents into vectors
  • Adding these vectors
  • Normalizing the summed vector so that all its elements add up to one

So assume agent A’s action = [1,0,0], B’s = [0,0,1], and C’s = [0,0,1]; then the normalized mean action vector = ([1,0,0] + [0,0,1] + [0,0,1]) / 3 = [1,0,2]/3 = [0.33, 0, 0.67].

Now feed this vector to the DQN alongside the state to get the q-values. Though the approach is quite simple, it helps in keeping the input size constant while considering more neighbors, and hence more information. Also, do remember that the assumption that agents behave in the same average way might not hold at times, leading to approximation error.
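As a short sketch (again assuming PyTorch), the snippet below reproduces the worked example above and shows how the DQN input size now depends only on the size of the action space, not on the number of agents.

```python
import torch

def mean_action(other_actions_onehot):
    """Average the other agents' one-hot action vectors; since each row sums
    to one, the averaged vector's elements also add up to one."""
    return other_actions_onehot.float().mean(dim=0)

# Worked example: A = [1,0,0], B = [0,0,1], C = [0,0,1]
actions = torch.tensor([[1, 0, 0],
                        [0, 0, 1],
                        [0, 0, 1]])
mean_vec = mean_action(actions)
print(mean_vec)                            # tensor([0.3333, 0.0000, 0.6667])

# Feed [state, mean action] to the DQN; input size = state_dim + n_actions,
# independent of how many agents are in the environment
state = torch.randn(8)
dqn_input = torch.cat([state, mean_vec])   # shape: (11,)
print(dqn_input.shape)
```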

You can explore a few multi-agent environments in PettingZoo below.
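For instance, a minimal loop over one of PettingZoo's multi-agent environments with random actions might look like the sketch below; the environment module and version suffix (simple_spread_v3 here) are assumptions that may differ depending on your installed PettingZoo release.

```python
# Random-action rollout in a PettingZoo environment (agent-by-agent API)
from pettingzoo.mpe import simple_spread_v3

env = simple_spread_v3.env()            # cooperative environment with 3 agents by default
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None                   # a finished agent must be stepped with None
    else:
        action = env.action_space(agent).sample()   # replace with your trained DQN's greedy action
    env.step(action)

env.close()
```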
