Deep reinforcement learning (DRL) is at the forefront of AI. By combining recent advances in deep learning with reinforcement learning, which has its roots in behavioral psychology, we’ve been able to build agents that outperform humans in complex environments, all without any labelled data!
To experiment with DRL, I built a program that lets an agent play Doom deathmatch through self-play. In this article, I’ll break down the techniques used to create my agent, along with a few algorithms at the forefront of DRL.
GitHub link for program: Doom
Summary: Deep Q Networks
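In Q Learning, the Q value for a particular state and action is normally updated with the temporal-difference (TD) rule: Q(s, a) ← Q(s, a) + α [r + γ max_a’ Q(s’, a’) − Q(s, a)]. The bracketed term is the update value, and the goal of learning is to minimize it.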
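To make the tabular update concrete, here is a minimal sketch of a single Q Learning step. The function name, the learning rate `alpha`, the discount `gamma`, and the state and action labels are all illustrative assumptions, not taken from the Doom program.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q Learning (TD) update in place and return the TD error."""
    # Best Q value reachable from the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    # TD error: the "update value" that learning tries to shrink.
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return td_error


# Hypothetical toy setup: unseen (state, action) pairs default to 0.0.
q_table = defaultdict(float)
actions = ["left", "right", "shoot"]
td = q_learning_update(q_table, "s0", "shoot", 1.0, "s1", actions)
```

Note that the table needs one entry per state-action pair, which is exactly the storage cost that becomes impractical for image-based games.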
In contrast, Deep Q Learning uses similar logic to find Q, but rather than calculating it with Bellman’s equation, we approximate it directly with a neural network that takes the state as input and outputs a Q value for each action.
Rather than updating each individual Q value for every new state-action pair, however, a Deep Q Network is trained with backpropagation on a loss built from the update value in Bellman’s equation: loss = (r + γ max_a’ Q(s’, a’) − Q(s, a))².
(The first term, r + γ max_a’ Q(s’, a’), is the target value.)
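As a rough illustration of that loss, the sketch below computes the mean squared difference between the target value and the predicted Q over a small batch. The `toy_q_network` function is a hypothetical stand-in for a real neural network, and the `done` flag (which stops bootstrapping at the end of an episode) is a common convention assumed here rather than something described in the article.

```python
def dqn_loss(q_network, batch, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next, done) transitions."""
    total = 0.0
    for state, action, reward, next_state, done in batch:
        # Target value: reward plus the discounted best next Q value.
        # At the end of an episode there is no next state to bootstrap from.
        target = reward if done else reward + gamma * max(q_network(next_state))
        prediction = q_network(state)[action]
        total += (target - prediction) ** 2
    return total / len(batch)


# Hypothetical stand-in for a neural network: one scalar state, two actions.
def toy_q_network(state):
    return [0.5 * state, 1.0 - state]


batch = [
    (0.0, 1, 1.0, 1.0, False),  # (state, action, reward, next_state, done)
    (1.0, 0, 0.0, 0.0, True),
]
loss = dqn_loss(toy_q_network, batch, gamma=0.9)
```

In a real implementation the target is treated as a constant during backpropagation, so gradients only flow through the predicted Q value.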
Using Deep Q Learning in place of Q Learning improves both performance and memory use! Because the network generalizes across similar states, good Q values are found much more quickly, and Q values don’t need to be stored for every individual state-action pair (the state is instead fed directly into the Deep Q Network, which computes them on demand).
Creating the Brain of a Doom Deathmatch Agent
Deep Q Networks alone are enough to create a self-playing Doom agent, but they are suboptimal and can still be improved upon! While the program is built on Deep Q Learning, it makes two main modifications:
- Target Deep Q Networks
- Dueling Deep Q Networks
Target Deep Q Networks
In a normal Deep Q Network, the same network predicts both Q(s,a) and the target value. As a result, every time the network is updated, Q and the target shift together. This makes the loss harder to minimize, because the Q value ends up ‘chasing’ a target that keeps moving.
Here’s an analogy:
Let’s say the Q value and the target are represented by Bob and Joe respectively. Bob and Joe are playing a game of tag, and Bob wants to minimize the distance between himself and Joe (i.e. the loss). However, because Joe moves every time Bob moves, it’s hard for Bob to catch up.
To prevent the Q value from ‘chasing’ the target value, we can freeze the network that predicts the target value and only update it occasionally, all while continuing to update the network that estimates the Q value (the frozen network is therefore called the Target Network). In the analogy, Joe gets tired much more quickly than Bob and has to stop and rest every so often, which lets Bob close the gap between the two.
*The action used for finding the target Q might seem complex, but it’s just a fancier way of writing the a’ used in the update value: the normal Q network picks the next action, and the target network then evaluates that action to produce the target Q, i.e. target = r + γ Q_target(s’, argmax_a’ Q(s’, a’)).
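Here is a minimal sketch of that target calculation, in which one network chooses the next action and the other scores it (a scheme usually known as Double DQN). Both `toy_online_q` and `toy_target_q` are hypothetical stand-ins with made-up outputs, and the `done` flag is an assumed convention.

```python
def double_dqn_target(online_q, target_q, reward, next_state, done, gamma=0.99):
    """Target value where the online network picks a' and the target network scores it."""
    if done:
        return reward  # no next state to bootstrap from
    online_values = online_q(next_state)
    # The online (normal) Q network chooses the next action...
    best_action = max(range(len(online_values)), key=online_values.__getitem__)
    # ...and the frozen target network supplies that action's Q value.
    return reward + gamma * target_q(next_state)[best_action]


# Hypothetical networks returning one Q value per action for any state.
def toy_online_q(state):
    return [1.0, 3.0, 2.0]


def toy_target_q(state):
    return [5.0, 0.5, 4.0]


target = double_dqn_target(toy_online_q, toy_target_q,
                           reward=1.0, next_state=None, done=False, gamma=0.9)
```

In this toy case the online network prefers action 1, so the target uses the target network's value for action 1 (0.5) rather than the target network's own maximum (5.0).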
Implementation in Code:
In the program, a separate target network predicts the target values, and it copies the weights of the model that predicts the Q value every 10000 steps. This in turn substantially boosts model performance.
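The periodic copy can be sketched as follows. This is not the program's actual code: the `TargetSync` class and the dictionary of weights are hypothetical, and only the 10000-step interval comes from the article (the demo uses a much smaller interval so the effect is visible).

```python
import copy


class TargetSync:
    """Keeps a frozen copy of the online weights, refreshed every `sync_every` steps."""

    def __init__(self, online_weights, sync_every=10000):
        self.sync_every = sync_every
        self.target_weights = copy.deepcopy(online_weights)
        self.steps = 0

    def step(self, online_weights):
        """Call once per training step; returns True when the target was refreshed."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target_weights = copy.deepcopy(online_weights)
            return True
        return False


# Hypothetical weights; a real agent would hold the network's parameters here.
online = {"w": [0.0, 0.0]}
sync = TargetSync(online, sync_every=3)  # small interval just for the demo
refreshed = []
for _ in range(7):
    online["w"][0] += 1.0  # stand-in for a gradient update on the online network
    refreshed.append(sync.step(online))
```

Between refreshes the target weights stay fixed while the online weights keep changing, which is what keeps the target from moving on every update.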
Dueling Deep Q Networks
Given that the Q value measures the benefit of taking an action in a given state, Q can be broken down into:
Q(s, a) = V(s) + A(s, a)
Here V measures the benefit of being in state s, while A measures the benefit of each action in state s. These two components are called the value and the advantage. By evaluating the advantage and the state value separately, our network can better determine the optimal value of Q.
Take this as an example: an agent is put in a car game, where it has to move to avoid incoming cars. When there aren’t any incoming cars on the screen, the action doesn’t matter much: the agent is in a very good state and doesn’t need to move. However, when a car gets very close, the choice of action matters far more than the state itself.
By prioritizing values like this, we can find better Q values much more quickly in training!
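The way the two streams combine can be sketched in a few lines. The version below subtracts the mean advantage before adding the state value, which is the aggregation proposed in the original Dueling DQN paper; whether the Doom program does the same is an assumption. The function name and all numbers are made up to mirror the car example.

```python
def dueling_q_values(state_value, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into Q values."""
    # Subtracting the mean advantage keeps V and A identifiable: adding a
    # constant to every advantage no longer changes the resulting Q values.
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


# Empty road: a good state where every action is worth about the same.
calm = dueling_q_values(10.0, [0.1, 0.0, -0.1])

# Car approaching: a poor state where one action (dodging) stands out.
danger = dueling_q_values(2.0, [6.0, -3.0, -3.0])
```

In the calm case the Q values are dominated by V and differ by only 0.2 across actions, while in the dangerous case the advantages spread them far apart, so the choice of action is what matters.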
Deep Reinforcement Learning is still a young field of machine learning, and it keeps improving. While algorithms such as DQNs may be among the most widely known, we’re still far from unlocking the true potential of DRL. Target and Dueling DQNs may contribute towards unlocking this potential, but we ultimately still have a long way to go.