Custom Gym environment with agents that collaborate

Mathieu Cesbron
Published in Analytics Vidhya
Dec 29, 2019 · 4 min read

In the previous article, I explained how to make an OpenAI Gym environment with multiple agents. In this one, I will explain how to make them collaborate to pursue a common goal: exit the maze!

All the code for this article is available on my GitHub.

The custom environment

The custom environment will be a maze (similar to the one in the previous article) but with some changes to it.

The starting maze

0: Empty area; the agents can go there

1: Agent 1, who will try to find the exit

2: Agent 2, who will also try to find the exit

3: Trap; if an agent goes there, he loses the game

4: Teleporter; if an agent moves onto a teleporter, the other agent instantly moves 3 cells to the north

5: Exit; reach it to exit the maze and win the game

In a game, agent 1 moves first, then agent 2, then agent 1 again, and so on. The game ends when an agent falls into a trap or finds the exit. It is a turn-based game.
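To make the setup concrete, here is a minimal sketch of how the starting grid could be encoded with the legend above. The exact layout is in the GitHub repository; the array below is only an illustrative guess.

import numpy as np

# Hypothetical 5x5 layout using the legend above; the real maze in the
# repository may differ.
# 0 = empty, 1 = agent 1, 2 = agent 2, 3 = trap, 4 = teleporter, 5 = exit
START_MAZE = np.array([
    [0, 0, 0, 3, 5],
    [0, 3, 0, 0, 0],
    [1, 0, 4, 0, 3],
    [0, 0, 0, 0, 0],
    [2, 3, 0, 4, 0],
])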

The reward function

The reward function is essential for finding the best policy for our agents efficiently. Our reward function depends on the result of the action (win the game == 'W', lose the game == 'L', continue the game == 'P').

Calculation of the reward in env.py

If an agent finds the exit, he is heavily rewarded. Notice that the reward also depends on current_steps, because we want our agents to find the exit in as few steps as possible. So the fewer steps we take to win, the bigger the reward.

If you check the code, you will also see that we implemented a bonus_reward. That bonus is 1 if an agent moves to a cell that no agent has already visited, and 0 otherwise. It is there to promote discovery.

If an agent moves to an empty cell, he receives -2 if an agent has already been there and -1 if no one has been there before (notice that the reward is still negative).
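As a rough sketch of the logic described above (not the exact values from env.py, which you can check on GitHub; the win/lose amounts and the function name are assumptions), the reward could be computed like this:

def compute_reward(result, current_steps, visited, new_pos):
    # result is 'W' (win), 'L' (lose) or 'P' (the game continues).
    # visited is the set of cells any agent has already been to.
    # Exploration bonus: 1 for a never-visited cell, 0 otherwise.
    bonus_reward = 0 if new_pos in visited else 1
    if result == 'W':
        # Big reward that shrinks with the number of steps already taken,
        # so shorter wins pay more (the 100 base is an assumed value).
        return 100 - current_steps
    if result == 'L':
        # Falling into a trap ends the game with a large penalty.
        return -100
    # 'P': small negative step reward, -2 on an already-visited cell and
    # -1 on a new one thanks to the bonus, to keep the agents moving while
    # still encouraging discovery.
    return -2 + bonus_reward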

The Training

Here comes the part we were waiting for: will our agents find the best strategy?
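The training code itself lives in the repository and is not reproduced here. As a very rough sketch, assuming a tabular Q-learning setup and an environment whose step() takes the id of the agent whose turn it is (both are assumptions, not the repository's actual interface), the loop might look like this:

import numpy as np
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # assumed hyperparameters
N_ACTIONS = 4  # up, down, left, right

def train(env, n_episodes=25_000):
    # One Q-table per agent, keyed by the observation, which is assumed to
    # be hashable (e.g. a tuple of both agents' positions).
    q_tables = [defaultdict(lambda: np.zeros(N_ACTIONS)) for _ in range(2)]

    for episode in range(n_episodes):
        state = env.reset()
        done = False
        agent = 0  # agent 1 always moves first
        while not done:
            q = q_tables[agent]
            # Epsilon-greedy action selection.
            if np.random.rand() < EPSILON:
                action = np.random.randint(N_ACTIONS)
            else:
                action = int(np.argmax(q[state]))

            next_state, reward, done, _ = env.step(agent, action)

            # Standard Q-learning update for the acting agent.
            best_next = np.max(q[next_state])
            q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])

            state = next_state
            agent = 1 - agent  # hand the turn to the other agent
    return q_tables

train(env) would then be called with an instance of the custom maze environment.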

The starting maze

Episode 0–50

At the start of training, the agents usually fail to find the exit; games often end in fewer than 3 steps, mostly because an agent fell into a trap.

Episode 51–1000

We start to see some successes. A success in 22 steps is far from optimal, and the agents clearly do not use the teleporter efficiently, but it is still an improvement.

Episode 1001–20 000

Ahh! It is almost only successes in this range. Let's see if the agents find the shortest path now.

Episode 25000 and beyond

The shortest path can be done in 4 steps by using the teleporter. After 25,000 episodes, the trained model finds the optimal solution.

Conclusion

Collaboration works in reinforcement learning, and it's quite interesting to watch. However, the major downside is the amount of training needed to obtain this result, which is very long.

Did you watch the hide-and-seek video made by OpenAI? They used computing power that only a company can afford (the result is still really impressive).
