Learning to Play Doom from Demonstrations
by Nick Bergh
Ever since I learned about OpenAI’s toolkit for reinforcement learning, Gym, I have been hooked on trying to get deep learning agents to solve some of the environments it offers, with somewhat mixed results. Lately I’d been exploring some of Gym’s Doom environments, trying to think of ways to apply new algorithms to perform better on these tasks than some of the approaches people have posted on the Gym site. Specifically, I’d been thinking about how deep learning agents could watch demonstrations of how to complete tasks and leverage what they learn to develop high-performing policies faster than they would if they had to figure out the tasks entirely on their own.
A lot of deep reinforcement learning algorithms rely on some sort of exploration by the agent in order to develop a good policy. This isn’t necessarily a bad thing: many good algorithms benefit from trying random things and seeing what works and what doesn’t. As the environments one tries to solve become increasingly complex, however, the number of suboptimal policies grows much faster than the number of optimal ones. Think about it this way: if you were to teach an agent to play Tic-Tac-Toe, after a number of random moves it would almost certainly stumble upon a winning game and would be able to work out how it got there. If you were to teach an agent to play chess, however, an agent playing randomly would never get close to beating an agent with even a simple strategy, due to the sheer number of bad moves compared to good moves in chess. And although computers can try things much faster than humans, some environments are complex enough that exploring randomly is inefficient even for a computer.
In my search for ways to use demonstrations to jump-start deep learning agents, I found the algorithm Deep Q-learning from Demonstrations (DQfD). It is essentially Q-learning with some added terms in the loss function and a pre-training phase that uses only transitions from demonstrations. Very briefly, the algorithm is as follows:
- Pre-training: sample a mini-batch from the demonstrations
- Calculate the loss on those transitions and update the network based on the gradients of the loss w.r.t. the model’s parameters
- Online training: sample an action from the model’s epsilon-greedy policy, play that action, and store the resulting transition in the replay buffer
- Sample a mini-batch from the replay buffer, with demonstration transitions given some sampling priority
- Calculate the loss on those transitions and update the network as before
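The loop above can be sketched in miniature. The sketch below substitutes a tiny tabular Q function and a toy chain environment for the Doom setup (both are stand-ins of my own, not from the original training code), but it follows the same two-phase structure: pre-train on demonstration mini-batches, then act epsilon-greedily while keeping the demonstrations in the replay buffer.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 0.99, 0.1

# Toy deterministic chain environment standing in for the Doom frames:
# action 1 moves right, action 0 moves left, reward on reaching the end.
def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

Q = np.zeros((N_STATES, N_ACTIONS))

# Demonstration transitions: the demonstrator always moves right.
demos = []
s = 0
for _ in range(10):
    s2, r = step(s, 1)
    demos.append((s, 1, r, s2))
    s = s2 if s2 < N_STATES - 1 else 0

def q_update(batch):
    # One-step Q-learning update (only the DQ term of the DQfD loss).
    for s, a, r, s2 in batch:
        target = r + GAMMA * Q[s2].max()
        Q[s, a] += ALPHA * (target - Q[s, a])

# Phase 1: pre-train on demonstration mini-batches only.
for _ in range(200):
    idx = rng.integers(0, len(demos), size=4)
    q_update([demos[i] for i in idx])

# Phase 2: act epsilon-greedily, storing transitions alongside the demos,
# which stay in the replay buffer permanently.
replay = list(demos)
eps, s = 0.1, 0
for _ in range(200):
    a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
    s2, r = step(s, a)
    replay.append((s, a, r, s2))
    s = 0 if s2 == N_STATES - 1 else s2
    idx = rng.integers(0, len(replay), size=4)
    q_update([replay[i] for i in idx])

print(int(Q[0].argmax()))  # the learned greedy action in the start state
```

A real implementation would also use prioritized sampling and the full four-term loss, but the phase structure is the same.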
DQfD’s loss function is a combination of four losses:
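In the notation of the DQfD paper, the combined loss is:

```latex
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
```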
The first term, J_DQ, is the standard one-step Q-learning loss; the second is an n-step return loss; the third is the supervised loss; and the last is an L2 regularization on the parameters of the model. The importance of each term is weighted by a constant coefficient lambda. The supervised loss is as follows:
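As defined in the DQfD paper, the large-margin supervised loss is:

```latex
J_E(Q) = \max_{a \in A} \left[ Q(s, a) + l(a_E, a) \right] - Q(s, a_E)
```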
Where l is a margin function that is 0 when a is the action the demonstrator chose for state s and positive otherwise. This loss is only applied to transitions from the demonstrations and is meant to ensure that the Q values for all actions not chosen by the demonstrator in state s sit at least a margin below the Q value for the action the demonstrator did choose. This is a somewhat heavy-handed way of incentivizing the agent to act like the demonstrator, which is not necessarily a bad thing. The more incentivized an agent is to act like the demonstrator, the quicker it may come upon a reasonable policy. However, if the demonstrator’s policy is suboptimal, the agent may have a harder time finding the optimal policy while it is focused on imitation. That was my hypothesis, anyway. The environments I used in this project were not too complex, and the optimal policy was not far from any policy that reasonably completed the tasks, so I was happy to use this loss term to hopefully speed up training.
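As a concrete illustration, here is a minimal NumPy version of that margin term (my own sketch, not the training code; the `margin` parameter plays the role of the positive value of l):

```python
import numpy as np

def margin_loss(q_values, expert_action, margin=1.0):
    """Large-margin supervised loss from DQfD:
    max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) is `margin` for a != a_E and 0 otherwise."""
    l = np.full_like(q_values, margin)
    l[expert_action] = 0.0
    return np.max(q_values + l) - q_values[expert_action]

# Zero when the expert action's Q value already exceeds every other
# action's Q value by at least the margin, positive otherwise.
q = np.array([2.0, 5.0, 1.0])
print(margin_loss(q, expert_action=1))  # → 0.0
print(margin_loss(q, expert_action=0))  # → 4.0
```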
There are nine unique Doom environments accessible through Gym right now. The first two are very simple control environments with a limited action space where the solution is obvious. The next six are slightly more complex environments that individually test different skills required to play Doom, such as aiming, dodging projectiles, prioritizing enemies, finding items, and pathfinding. These all have limited action spaces as well, around 3–4 actions each. The last environment allows every action available in Doom (~30) and is meant to test all skills at once. The results I’ll be showing are from the first two environments and two environments from the second group (but I’m working towards the last one!). Here’s a brief description of each:
- DoomBasic: the agent is in a room with a monster in it, the agent must move either left or right and then shoot the monster
- DoomCorridor: the agent is in a straight corridor with enemies and must get to the end of the hallway as quickly as possible without dying
- DoomDefendCenter: the agent is stuck in the middle of a room and may turn left and right and shoot the enemies coming towards it. The agent must shoot as many enemies as possible with its 26 available bullets (dying ends the episode)
- DoomMyWayHome: the agent is placed in an environment with eight connected rooms and must find a vest. The rooms are always the same and the location of the vest is always the same but the agent’s starting location is random.
The first two are of course very simple; DefendCenter is slightly more complex; and MyWayHome is the most difficult due to its sparse rewards and long episodes.
Frames from the environment were scaled down to a 1×80×80 (grayscale) image, which served as the representation of a state. The agent used an epsilon-greedy policy, with epsilon linearly decreasing from 1 to 0 over some number of steps. The network used to estimate the Q values was a convolutional neural network with 3 convolutional layers and 3 linear layers (the full model can be seen here). The number of different objects the agent had to recognize in these environments was relatively small, so it didn’t make sense to have any more convolutional layers, and the agent didn’t have to do much complex reasoning, so a larger network didn’t seem necessary. The L2 regularization was not used, as it did not improve training results in my experiments. More implementation details, including many hyperparameter settings, can be seen in the code used for training.
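For concreteness, the state preprocessing and epsilon schedule can be sketched as below. The exact resize method and decay horizon from my training code aren’t reproduced here, so the nearest-neighbour resize and the 100,000-step horizon are illustrative assumptions.

```python
import numpy as np

def preprocess(frame, size=80):
    """Convert an RGB Doom frame (H, W, 3) to a 1 x 80 x 80 grayscale
    state via luminance weighting and nearest-neighbour downsampling."""
    gray = frame @ np.array([0.299, 0.587, 0.114])  # (H, W)
    h, w = gray.shape
    rows = np.arange(size) * h // size              # subsample row indices
    cols = np.arange(size) * w // size              # subsample col indices
    small = gray[rows][:, cols] / 255.0             # (80, 80), scaled to [0, 1]
    return small[None, :, :]                        # (1, 80, 80)

def epsilon(step, total_steps=100_000):
    """Epsilon linearly decreasing from 1 to 0 over `total_steps`."""
    return max(0.0, 1.0 - step / total_steps)

# A 480 x 640 frame becomes a 1 x 80 x 80 state tensor.
state = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))
print(state.shape)  # (1, 80, 80)
```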
This is the total reward achieved by the agent in each successive episode of training. Gym considers an episode successful if the agent shoots the monster with one shot within 3 seconds, for a reward of 10. The agent was able to do this almost every time after about episode 400, which took ~3.5 minutes of training on my GTX 1060 GPU. This wasn’t as good as the best algorithms on Gym’s leaderboard, but for such a trivial task I wasn’t too worried about having the fastest-training agent. The one hyperparameter that had to be changed for each environment was the margin used in the supervised loss (the positive value of l). I found that if it was too large, the model couldn’t estimate Q values very accurately. For this environment the margin was set to 10.
1,000 points is considered success in this environment, which the agent achieved consistently after about 200 episodes (~5.5 minutes on my machine). This was about as good as the best algorithms on Gym’s leaderboard for this environment. The margin here was set to 10 as well. The training for this agent was fairly interesting, as the agent developed a policy different from the demonstrator’s (mine), probably because I had forgotten to set the margin higher to match the scale of the rewards in this environment. When I was recording demonstrations for the agent to use in training, I shot some of the enemies in the corridor on the way to the vest in order to prolong my survival, but as training went on, the agent found it did better by just running past the enemies. Here are some recordings of one episode from earlier in the training and one from later. Notice how the agent started out by attempting to shoot some enemies along the way in episode 64, but beelined towards the vest in episode 216.
Success in this environment was shooting 11 monsters (~10 points). The agent was able to do this consistently just after 300 episodes (~22 minutes on my machine). This was better than the best algorithm with a write-up on Gym’s leaderboard. The margin was set to 1.0 for this environment. One hyperparameter that was the same for each training run was the frame skip (set to 4), which is the number of frames each action is repeated for. This is used because these environments don’t require a new action to be chosen every frame (0.02857 seconds) for an agent to do well, and skipping 75% of the frames speeds up training significantly. It made recording demonstrations harder, however, as Doom is more difficult to play when 75% of the frames are skipped, and it led to me wasting a lot of ammo when recording demonstrations for this environment. The agent also seemed to waste a lot of ammo, even when doing well, which I suspect is because it saw me shooting inaccurately. Here are some recordings of one episode from earlier in the training and one from later.
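The frame-skip mechanic is simple enough to sketch as a wrapper (illustrative only; the Gym Doom environments apply the skip internally rather than through a wrapper like this, and the toy environment below is my own stand-in):

```python
class FrameSkip:
    """Repeat each chosen action for `skip` frames and sum the reward,
    so the policy only has to pick an action every `skip`-th frame."""

    def __init__(self, env, skip=4):
        self.env, self.skip = env, skip

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.skip):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating if the episode ends mid-skip
                break
        return obs, total_reward, done

# Toy environment: reward 1 per frame, episode ends after 10 frames.
class ToyEnv:
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10

env = FrameSkip(ToyEnv(), skip=4)
obs, r, done = env.step(0)
print(obs, r, done)  # 4 4.0 False
```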
Success in this environment was finding the vest (at least 0.50 points). The agent was able to do this consistently after about 2,000 episodes (~5 hours on my machine). None of the algorithms on Gym’s leaderboard that attempted this environment were successful. As mentioned before, this environment was very difficult due to its sparse reward structure. Unlike DoomCorridor, where the agent received reward for moving towards the vest, here the agent received no reward until it reached the vest. Additionally, the agent had to learn how to reach the vest from any of the rooms, as it was placed in a random room each time, and it was very difficult to stumble upon the vest; acting randomly, the agent was much more likely to get stuck in a corner. After a long time of training, though, the agent was able to learn how to identify each room by the unique textures patterning its walls and learn the shortest path to the vest from that room. Here are some recordings which demonstrate the agent’s progression as it trained. For this environment, a clear progression wasn’t always evident when watching the agent perform: the agent usually made it to the vest if it was close enough; the true progression was in consistency and efficiency.
My next attempt will probably be on DoomDeathMatch (the environment which tests all skills at once). Since this environment uses a continuous action space, I will probably have to implement an algorithm like DDPG, and will try it with demonstrations kept permanently in the replay buffer to see if they help.
I hope you’ve enjoyed learning about my attempt to teach an agent to play Doom! If you’re still curious about this project, here’s a link to the code repo again and feel free to ask me any questions you have about my work.