My Reinforcement Learning Adventures
Can you imagine how cool it would be to play out military strategies in games? How about a game that is a master of military tactics, the art of war, and everything in between, and that you can play at any moment of the day?
Two years ago, after reading a bit of the hit manga Kingdom (amazing story, by the way), I decided I wanted to see what I could do to make a game like that. A game where, if I just position the troops the right way, it can come to the same conclusion on where to place the troops as the generals of 2,250-odd years ago did.
For all the models below, I used the proximal policy optimization (PPO) algorithm.
Version 1
For my first version, I drew some inspiration from chess.
Environment
- Action space
Every iteration, each player (the white/blue dots) chooses one of the 4 surrounding squares to attack. If the attacked square has an enemy player in it, the player gets +1 reward. The other output is the direction in which the player moves, again one of the 4 directions. So, the dimension of the action space is [number of players, 8].
- Observation space
Holds the x position, y position, rank, side, and alive status for all the players within every player’s vision. So, the dimension of the observation space is [number of players, number of players the player can see, 6]. I also had a hierarchy observation space, but I won’t go into that here.
- Basic model explanation
I took the observation, used an LSTM to compress it into [number of players, 256], and then applied some fully connected layers to output [number of players, 8]. The action gets chosen by taking a softmax over the first 4 indices to pick an attack direction, then doing the same with the latter 4 indices to pick a movement direction. A simplified sketch is below.
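Roughly, the policy head looks like this (a simplified PyTorch sketch; the player counts, layer widths, and vision size are placeholders, not my exact code):

```python
import torch
import torch.nn as nn

NUM_PLAYERS, VISIBLE, FEATURES = 10, 5, 6   # placeholder sizes

class PolicyHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Compress each player's view of the players it can see into 256 features.
        self.lstm = nn.LSTM(FEATURES, 256, batch_first=True)
        # Map the compressed features to 8 logits: 4 attack + 4 movement directions.
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 8))

    def forward(self, obs):
        # obs: (NUM_PLAYERS, VISIBLE, FEATURES); keep the last LSTM hidden state per player.
        _, (h, _) = self.lstm(obs)
        logits = self.fc(h[-1])                        # (NUM_PLAYERS, 8)
        attack = torch.softmax(logits[:, :4], dim=-1)  # distribution over attack directions
        move = torch.softmax(logits[:, 4:], dim=-1)    # distribution over movement directions
        return attack, move

obs = torch.randn(NUM_PLAYERS, VISIBLE, FEATURES)
attack_probs, move_probs = PolicyHead()(obs)
```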
I also made it so that the model does not only fight against itself but also fights against older versions of itself, which I thought would help.
Results
Basically, as you can see from the plots and the animation, they don’t seem to be learning much. The main problems with this environment were:
- The biggest one is that there doesn’t seem to be much coordination between players.
- The environment is too simple for military tactics to show up. For example, moving diagonally down-left is no faster than moving left and then down.
- The policy and the value stopped learning after a while. At the time I made this version, I assumed that the environment was too simple, but now I think it is because the observation space was encoded in such an overcomplicated way.
Version 2
Environment
- Action space
I changed it so that the action outputs a movement angle, movement length, orientation, and attack angle. The orientation and attack angle together specify the range of the attack: the smaller the range, the larger the damage. Since each player now has HP, I thought this might lead to some duels between players. The above animation is an example of this.
The wings signify the attack range.
- Observation space
No important change from before.
- Basic model explanation
The main change is outputting a mean and a log standard deviation for the action space and sampling a value from that distribution. That’s how I got the values for the actions; a simplified sketch is below.
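In code, the continuous head looks roughly like this (a simplified PyTorch sketch; the feature size is a placeholder and my actual parameterization differs in the details):

```python
import torch
import torch.nn as nn

FEATURES, N_ACTIONS = 256, 4   # placeholder feature size; 4 actions: movement angle,
                               # movement length, orientation, attack angle

mean_head = nn.Linear(FEATURES, N_ACTIONS)
log_std = nn.Parameter(torch.zeros(N_ACTIONS))   # learned, state-independent log std

features = torch.randn(1, FEATURES)              # stand-in for the LSTM output
mean = mean_head(features)
dist = torch.distributions.Normal(mean, log_std.exp())
action = dist.sample()                           # the 4 continuous action values
log_prob = dist.log_prob(action).sum(-1)         # used in the PPO objective
```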
Results
While the value loss seems to improve, the policy loss seems to plateau. One thing I noticed is that while the wings/attack orientation seem to improve, coordination between players doesn’t seem to be emerging. So, the main problems are:
- Players can move randomly without coordination
- The wings attack mechanism is too complicated
- Like before, the observation space is too complicated
- And finally (this applies to version 1 too), the action space is a bit too complicated to be human-understandable.
Version 3
Environment
- Action space
I made the action space a 4-by-4 vector field so that it is easy to see which direction the model is pushing the players. I also made it so that the vector field represents the force acting on the players.
- Observation space
I made the observation space a simplified version of the render above so that it is human-understandable and possibly easier for the model to learn from than before. And now, running multiple games at once is possible!
- QOL changes
To encourage cooperation, I made the players stick together by giving them a hierarchy and attaching springs between them, like so:
Also, for attacking, I just made it so that the direction of attack is the player’s velocity, like so, and there’s a triangle of attack. A rough sketch of the force-field and spring mechanics is included below.
- Basic model explanation
I wrote a separate series on Medium called Understanding OpenAI baseline source code and making it do self-play, where I modified OpenAI Baselines so that you can do self-play with it! I remember it taking about a month, maybe two, but I’m happy with the results. I did this mainly because I wasn’t sure whether the problem was in my model or in the environment, and since I knew OpenAI is one of the best AI companies out there, I figured I couldn’t go wrong by using their code. The code still doesn’t work for anything other than ppo2 for now.
For the model, I just used a CNN-LSTM with a Box (continuous) action space, which seemed quite logical to me because the observation space is an image and the output is just forces, which are continuous.
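To make the force field and the springs from the QOL changes a bit more concrete, here’s a rough sketch of how the physics can work. Every constant and function name here is made up for illustration; the real environment differs in the details.

```python
import numpy as np

GRID = 4            # the action is a GRID x GRID field of 2D force vectors
ARENA = 1.0         # the arena spans [0, ARENA) in both x and y
K_SPRING = 5.0      # spring stiffness between a player and its hierarchy parent
REST_LEN = 0.05     # natural spring length
DT = 0.1            # simulation timestep

def field_force(field, pos):
    """Look up the force vector for a position from the GRID x GRID field."""
    ix = min(int(pos[0] / ARENA * GRID), GRID - 1)
    iy = min(int(pos[1] / ARENA * GRID), GRID - 1)
    return field[iy, ix]                     # field has shape (GRID, GRID, 2)

def spring_force(pos, parent_pos):
    """Pull a player toward its hierarchy parent with a linear (Hooke) spring."""
    delta = parent_pos - pos
    dist = np.linalg.norm(delta) + 1e-8
    return K_SPRING * (dist - REST_LEN) * (delta / dist)

def step(positions, velocities, parents, field):
    """One physics step: field force plus spring force, then integrate."""
    forces = np.zeros_like(positions)
    for i, pos in enumerate(positions):
        forces[i] += field_force(field, pos)
        if parents[i] >= 0:                  # -1 means "no parent in the hierarchy"
            forces[i] += spring_force(pos, positions[parents[i]])
    velocities = velocities + forces * DT    # unit mass
    positions = np.clip(positions + velocities * DT, 0.0, ARENA)
    return positions, velocities
```

The spring term is what is supposed to keep a subordinate near its superior, which is how I tried to encourage the players to stick together.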
Results
While the policy seems to be improving, when I checked the videos I noticed that there’s no significant change in what is being done across the 100 epochs of the run. One problem that led to this was that the model fought against itself, so at any moment in time it didn’t make much sense for one side to consistently win against the other. Basically, the main problem with this version was that I had no idea whether the model was actually learning.
Version 4
- Action space
For the action space, I changed it so that I can specify what size I want; for example, the above animation uses a 2x2 vector field. I also made the vector field a velocity field rather than an acceleration/force field, and the environment just applies a force that pushes each player in the direction its vector points. I just thought that would be easier to comprehend. A rough sketch of this is below.
- Observation space
No change.
- QOL changes
Since I didn’t know whether the model was learning, I decided to go with the easiest setup: fighting against a stationary opponent. That way, I know the model is improving if it manages to defeat the enemy faster and faster.
Another thing I did was remove the spring mechanic because, if possible, I wanted the model to learn to stick together by itself.
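For illustration, the movement can be thought of as something like this (a simplified sketch with made-up constants, including a drag term I added so the example settles; it is not the exact environment code):

```python
import numpy as np

PUSH = 2.0    # how hard players are pushed along the commanded direction
DRAG = 0.5    # simple damping so the speed doesn't blow up in this example
DT = 0.1

def step_velocity(vel, command):
    # Treat the policy's vector as the direction to move in and push that way.
    direction = command / (np.linalg.norm(command) + 1e-8)
    force = PUSH * direction - DRAG * vel
    return vel + force * DT

vel = np.array([1.0, 0.0])        # player currently drifting right
command = np.array([0.0, 1.0])    # the policy says "move up"
for _ in range(50):
    vel = step_velocity(vel, command)
print(vel)                        # the velocity has mostly turned to point upward
```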
Results
As you might have seen from the video, it didn’t go too great. The first plot above is basically the reward: in the beginning, with a random policy, it worked for a while, but then it just stopped doing anything productive and went to a corner.
For the longest time I was confused about why this was happening, but the cause was quite simple. For my action space, I output a continuous action space with shape (2, 2, 2), where action_space[0][0] represents the top-left vector. The problem with this was that I used an actor-critic that outputs a mean and a log standard deviation. So let’s say the model learns that shifting the mean to, say, -5 with a standard deviation of 1 is quite a good idea for all actions. What happens then is that that part of the action space only ever points toward the bottom left, and it will stay there, because all the numbers sampled will essentially always be negative. Just to note, this problem occurred on a 1x1 action space as well.
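A quick numerical illustration of why that’s a dead end (assuming the sampled action gets clipped into a bounded range, which is the usual setup):

```python
import numpy as np

# With a mean of -5 and a standard deviation of 1, essentially every sample is
# negative, so after clipping the vector in that grid cell points the same way forever.
rng = np.random.default_rng(0)
samples = rng.normal(loc=-5.0, scale=1.0, size=100_000)
print((samples < 0).mean())                # ~1.0: virtually every sample is negative
print(np.clip(samples, -1.0, 1.0).mean())  # ~-1.0: the clipped action is pinned at -1
```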
The reason the policy loss seems fine is that the value loss is initially really high, so the value loss improves and eventually the value function just predicts 0, which makes the policy loss appear to be improving.
Version 5 (Current)
- Action space
Instead of the continuous action space I used before, I decided to just use a multi-discrete action space. The model then just does a softmax and chooses an action, without running into the mean-shifting problem from before. For this model, it’s just a 1x1 action space; I’ve currently started training on a 2x2 action space. A simplified sketch of the head is after this list.
- Observation space
No change.
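Here’s a simplified sketch of what a multi-discrete head for the 1x1 field looks like (PyTorch for illustration only; baselines implements this differently internally, and the sizes are placeholders):

```python
import torch
from torch.distributions import Categorical

N_BINS = 5                                  # placeholder number of force levels per axis
batch, obs_dim = 8, 32                      # made-up sizes for the example
policy_head = torch.nn.Linear(obs_dim, 2 * N_BINS)

features = torch.randn(batch, obs_dim)      # stand-in for the CNN/LSTM features
logits = policy_head(features).view(batch, 2, N_BINS)
dist = Categorical(logits=logits)           # one categorical per axis, no mean to drift off
action = dist.sample()                      # (batch, 2): one bin index per axis
log_prob = dist.log_prob(action).sum(-1)    # summed over the two axes for the PPO update
```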
Results
While the model doesn’t seem to improve with every network type, I can see that for networks like impala_cnn and cnn_lstm, the rewards are increasing. It’s a bit noisy, but the problem of converging to an invalid policy seems to have stopped happening, which I’m happy about! The policy loss also looks normal. So I’m finally getting somewhere after 2 years!
Next Steps
Since I have a self-play environment set up, I plan to just have a model fight an older version of itself: for example, letting one model learn for 1 epoch, then making the other model fight against it, and seeing how it improves. Roughly, the loop would look like the sketch below.
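In pseudo-Python (`train_one_epoch` is a stand-in for whatever the training step ends up being):

```python
import copy

def train_against_older_self(policy, train_one_epoch, n_rounds=50):
    # The learning side trains for one epoch against a frozen copy of itself,
    # then the frozen opponent is refreshed, so it always lags by one epoch.
    opponent = copy.deepcopy(policy)
    for _ in range(n_rounds):
        train_one_epoch(policy, opponent)   # only `policy` gets gradient updates
        opponent = copy.deepcopy(policy)
```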