There has been a lot of hype around reinforcement learning with deep neural networks recently, but not a lot of clear, simple explanations of these topics. So we here at the University of Waterloo are trying to change that, starting with Deep Q.
This is the application that made Deep Q so popular: Google’s DeepMind built a system that could learn most Atari games from just pixels using Deep Q learning. To understand how it works, let’s break it down.
** this article expects basic knowledge of reinforcement learning and deep learning **
Every reinforcement learning system works something like this: An agent observes an environment, makes a decision based on its observations, and receives some reward that it tries to maximize. Since rewards are usually delayed from the actions that caused them, it’s important to figure out how to assign a reward to time steps where there was no immediate reward. This is where Q learning comes in.
Q learning works like this: given a state (s), score every possible action (a1 … ai … an) using a function R = Q(s|ai), where R is the expected future reward. Then play the action with the highest score. Keep doing this until some reward is received. Say the reward arrives at time step T: update the targets for all the earlier time steps t1 … tn (where tn = T) with that same reward, discounting it a little more for each step further back in time. Then train the Q function to predict the correct reward for the state at every time step. You probably want to read this paragraph again to make sure it makes sense. :p
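To make the discounting concrete, here’s a tiny sketch of how a single end-of-episode reward gets spread back over the earlier time steps. The function name and the discount value are just illustrative choices, not anything from DeepMind’s code:

```python
# Sketch of the update described above: a reward received at the final
# step T is propagated back to earlier steps, discounted once per step.

def discounted_targets(final_reward, num_steps, discount=0.9):
    """Training target for each of num_steps states, where only the
    last step produced a reward. Index 0 is the earliest step."""
    targets = []
    for steps_from_end in reversed(range(num_steps)):
        targets.append(final_reward * discount ** steps_from_end)
    return targets

targets = discounted_targets(1.0, 3)
# earliest step gets the most heavily discounted value: [0.81, 0.9, 1.0]
```

The Q function is then trained so that its score for each earlier state matches these discounted targets.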
Q learning is usually done with some kind of linear function approximator, because non-linear ones, like neural networks, tend to overfit or diverge. DeepMind found a way to improve the convergence of these deep networks significantly by using two main techniques.
1. Experience replay
After every action, the current state (s’), the previous state (s), the previous action (a), and the reward for the previous action (r) are stored in what’s called an experience replay (a list). Then, at the end of each episode, a random minibatch drawn from the experience replay is used for training. This helps prevent overfitting significantly. It also means a single training example can be reused more than once.
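A replay buffer is only a few lines of code. This is a minimal sketch, not DeepMind’s implementation; the capacity and batch size are arbitrary, and the tuple layout just mirrors the quantities named above:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # deque with maxlen: the oldest experiences fall off automatically
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random minibatch; the same transition can land in many
        # different batches, so each example gets reused for training.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for i in range(100):
    buf.store(i, 0, 0.0, i + 1, False)  # dummy transitions for illustration
batch = buf.sample(32)
```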
2. Figuring out the correct reward
So once a random minibatch is drawn from the experience replay, the Q function needs to be trained. To figure out how to train it, we must first figure out what the best possible Q function, let’s call it Q*(s|a), would do. Q*(s|a) would predict the total future reward for any action (a) at any state (s). In other words Q*(s|a) = r + discount * max(Q*(s’|a’1) … Q*(s’|a’n)), where r is the immediate reward, s’ is the state that occurs after action a, and a’ is an action available at that future state. So you set every training target of Q(s|a) to r + discount * max(Q(s’|a’1) … Q(s’|a’n)). This lets your algorithm learn to predict total future reward instead of just immediate reward.
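Here’s that target rule sketched for a whole minibatch at once. The `q_next` array stands in for whatever your network outputs for each next state (one row per transition, one column per action); everything else is an assumption for the example:

```python
import numpy as np

def q_targets(rewards, q_next, dones, discount=0.99):
    """target = r + discount * max_a' Q(s', a') for non-terminal steps,
    and just r when the episode ended at that transition."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    best_next = np.max(q_next, axis=1)          # max over actions a'
    return rewards + discount * best_next * (~dones)

q_next = np.array([[1.0, 2.0],    # transition 1: next-state Q-values
                   [0.5, 0.0]])   # transition 2 (episode ended here)
targets = q_targets([0.0, 1.0], q_next, [False, True])
# first target: 0.0 + 0.99 * 2.0 = 1.98; second is just the reward, 1.0
```

The Q network is then regressed toward these targets for the actions that were actually taken.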
3. Improvements
- Frozen Network:
Instead of only having one Q network, we keep two: a frozen network and a training network. We use the frozen network to calculate the training targets, and the training network for everything else, copying the training network’s weights into the frozen one every so often. This keeps the targets stable, so the network isn’t chasing its own constantly shifting predictions, and helps it avoid getting stuck.
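The mechanics are simple enough to sketch with plain weight arrays standing in for real networks. The learning rate, gradient, and sync interval here are all made-up placeholders:

```python
import numpy as np

class TwoNetworks:
    def __init__(self, num_weights=4, sync_every=1000):
        self.online = np.zeros(num_weights)   # trained every step
        self.frozen = self.online.copy()      # used only to compute targets
        self.sync_every = sync_every
        self.steps = 0

    def train_step(self, gradient):
        self.online -= 0.01 * gradient        # hypothetical SGD step
        self.steps += 1
        if self.steps % self.sync_every == 0:
            # periodically replace the frozen weights with a fresh copy
            self.frozen = self.online.copy()

nets = TwoNetworks(sync_every=3)
for _ in range(3):
    nets.train_step(np.ones(4))
# after the third step the frozen network catches up with the online one
```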
- Prioritized Replay:
Instead of just taking random minibatches from the replay, pick the ones that will let the network learn the most. The idea is to assign each replay a probability of being picked instead of sampling from a uniform distribution, with the probability based on the replay’s absolute TD-error.
- Double Deep Q:
This one is simple yet clever at the same time. Instead of calculating the target of a non-terminal step as r + discount * max(Q(s’|a’)), you calculate it as r + discount * Q_frozen(s’ | argmax(Q_training(s’|a’))). In plain words: the training network chooses which action looks best at the next state, but the frozen network supplies the score for that action. Separating who picks the action from who scores it reduces the overestimation you get from always taking a max over noisy estimates. You probably want to read that a second time too!
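The selection/evaluation split above can be sketched in a few lines. Both Q-value arrays are assumed to be the two networks’ outputs for the next states (one row per transition, one column per action):

```python
import numpy as np

def double_q_targets(rewards, q_next_online, q_next_frozen, dones, discount=0.99):
    """Double Deep Q target: the training (online) network picks the action
    at s', the frozen network scores it."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    best_actions = np.argmax(q_next_online, axis=1)                   # selection
    evaluated = q_next_frozen[np.arange(len(rewards)), best_actions]  # evaluation
    return rewards + discount * evaluated * (~dones)

q_online = np.array([[1.0, 3.0]])   # online net prefers action 1...
q_frozen = np.array([[2.0, 0.5]])   # ...but the frozen net scores it lower
targets = double_q_targets([0.0], q_online, q_frozen, [False])
# target = 0.0 + 0.99 * 0.5 = 0.495, not 0.99 * 2.0 as a plain max would give
```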
4. Implementations and Resources:
Thanks for reading!