Deep Q Learning

Sandeep Anand
Jun 3, 2018


This article is an introductory overview of Deep Q-Learning, covering its methodology and update steps. Equations are written using Word/LaTeX.

These are my notes to come back to whenever I need to revise what Deep Q-Learning is. They are based on my learning from papers and from the Deep Learning section of Udacity.

Overview

What is Deep Q-Learning? Deep Q-Learning = Deep Neural Networks + Q-Learning (Reinforcement Learning).

This is not a very old concept: as you see below, AlphaGo beat Lee Sedol in 2016 using some of these techniques, and DeepMind's famous Atari game-playing agent is also based on these concepts.

AlphaGo beating Lee Sedol in 2016!

Understanding the Update step for Deep Q-Learning

We describe the details below. Deep Q-Learning consists of the following parts:

Steps to Update the Value function in Deep Q Learning

Neural networks are universal function approximators, so we can use a neural network as the value function and then use the methods shown above to work out how to update it (a small sketch of such a network follows the figure below).

Neural Networks as Value Functions
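
To make this concrete, here is a minimal sketch of a neural network playing the role of the action-value function. PyTorch, the layer sizes, and the class name are my own assumptions for illustration, not details from the original post.

```python
# A minimal sketch (assumptions: PyTorch, a vector-valued state, a discrete
# action set). The network maps a state to one Q-value per action, i.e. it
# stands in for Q(s, a; w).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_size: int, action_size: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_size),  # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# usage: q = QNetwork(state_size=8, action_size=4)
#        q(torch.randn(1, 8))  -> tensor of shape (1, 4), one value per action
```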

There are different ways to do the update step with gradient descent. Below I will show the Monte-Carlo update step.

Overview of Monte-Carlo Update Step

Update Steps using Monte Carlo Learning
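
Since the update-step figure above is an image, here is the usual Monte-Carlo update for a differentiable action-value estimate, written out in LaTeX as a reconstruction of the standard form (the target is the actual return G_t observed at the end of an episode):

\Delta w = \alpha \, \bigl( G_t - \hat{Q}(S_t, A_t, w) \bigr) \, \nabla_w \hat{Q}(S_t, A_t, w)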

SARSA is an on-policy algorithm, which means we are updating the same policy that we are following to carry out actions. It converges fast because we are using the most up-to-date policy to take each decision, but it has its issues: the policy being learned and the policy being followed are intimately tied to each other. Hence we use an off-policy learning algorithm like Q-Learning, which is a variant of TD learning.

  • On-policy learning: the policy used to update the action values is the same policy the agent is following
  • Good for online learning
  • Q-values are affected by exploration
  • Less accurate when there is less exploration

Overview of SARSA Update Step

TD-Learning [SARSA Update]
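
The SARSA update shown in the figure is usually written as follows (my reconstruction of the standard form; note that the TD target uses A_{t+1}, the action actually chosen by the same epsilon-greedy policy the agent is following):

\Delta w = \alpha \, \bigl( R_{t+1} + \gamma \, \hat{Q}(S_{t+1}, A_{t+1}, w) - \hat{Q}(S_t, A_t, w) \bigr) \, \nabla_w \hat{Q}(S_t, A_t, w)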

Q-Learning is an off-policy variant of TD learning. We sometimes need the policy to learn more intelligently from the environment, using Greedy in the Limit with Infinite Exploration (GLIE), and we can adapt it to work with function approximation. The main difference is in the update step: instead of picking the next action from the same epsilon-greedy policy, we choose the action greedily, i.e. the one that maximizes the expected value going forward. Note that we do not actually take this action; it is only used for performing the update. In fact, we do not even need to pick this action explicitly, we can simply use the max Q-value for the next state. This is why Q-Learning is called an off-policy method. A minimal sketch of this update appears after the figures below.

  • We use one policy to take actions, the epsilon-greedy policy π, and another policy to perform the value update, a greedy policy
  • Although the underlying principles are the same, these two are indeed different policies. Note: the policy the agent is learning (greedy) is different from the policy it is following while acting (epsilon-greedy)
  • There are two ways we apply Q-Learning: we can use the concept of episodes, or we can use it for continuous tasks
  • This is bad for online learning, but it is more accurate, and the Q-values are unaffected by exploration
  • Why is this so popular? The agent uses a more exploratory policy while acting and yet still learns the optimal value function; at some point we can stop exploring and follow the optimal policy for better results
  • It supports offline, batch learning, which is important for training neural networks
Q-Learning (off-policy learning) using episodic tasks
Q-Learning for continuous tasks
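
A minimal tabular sketch of the update described above (NumPy; the Q-table layout, the helper names, and the hyperparameters are my assumptions). The point to notice is that the agent acts with the epsilon-greedy behaviour policy but updates with a greedy max over the next state's values:

```python
# Sketch of the off-policy Q-Learning update (tabular, NumPy). Q is assumed
# to be a dict mapping each state to an array of action values.
import numpy as np

def epsilon_greedy(q_values: np.ndarray, eps: float) -> int:
    """Behaviour policy: explore with probability eps, otherwise act greedily."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Target policy is greedy: use the max over next-state Q-values,
    # regardless of which action the behaviour policy will actually take next.
    # The done flag handles the end of an episode; for continuous tasks the
    # update simply never sees done=True.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# acting:   a = epsilon_greedy(Q[s], eps=0.1)            # exploratory policy
# updating: q_learning_update(Q, s, a, r, s_next, done)  # greedy target
```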

Deep Q-Learning and its Implementation in the Atari Agent: Neural Network Structure

  • Screen images are converted to grayscale and processed by convolutional layers; this way the system can exploit spatial relationships
  • 4 frames are stacked and provided as input
  • The original DQN used 3 convolutional layers with ReLU activations, followed by one fully connected hidden layer with ReLU activation and a final fully connected linear output layer that produces the vector of action values (a sketch follows the figure below)
  • Each game was learned from scratch with a freshly initialized network
  • But there are situations where the network weights can oscillate or diverge, due to the high correlation between actions and weights, and as a result the policy can become ineffective
  • This can be solved using two different methods: Experience Replay and Fixed Q-targets
Atari DQN
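
Below is a sketch of the network described in the bullets above, written in PyTorch. The exact filter counts, kernel sizes, and strides follow the published DQN papers and should be treated as assumptions here rather than details of this post.

```python
# Sketch of the Atari DQN: 3 conv layers with ReLU, one fully connected
# hidden layer with ReLU, and a linear output producing one value per action.
# Input is assumed to be a stack of 4 grayscale 84x84 frames.
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # fully connected hidden layer
            nn.Linear(512, n_actions),              # linear output: action values
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84) stack of grayscale screens
        return self.head(self.conv(frames))
```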

Overview of Experience Replay

  • This is similar to memoization, where we store values that will be used in the future: we store the experience tuples in a replay buffer (memory) and sample from it in order to learn. As a result we are able to learn from individual tuples multiple times, recall rare occurrences, and make better use of our experience
  • We can sample from this ring buffer at random, so it is not like a stack or a queue. Samples do not have to be in sequence; this helps break the correlation between actions and weights and keeps the action values from catastrophically diverging
  • Things are simple with a discrete state space; problems always arise with a continuous state space, hence it is important not to learn while practicing. The agent should choose random tuples and explore the state space for rare tuples, and it should not reinforce the same action over and over in a loop
  • After a while, the agent can go back with its updated value function, collect more experience, and learn using batch learning. In this way Experience Replay avoids the inherent correlation observed in consecutive experience tuples by sampling them out of order (a small replay-buffer sketch follows the figure below)
Experience Tuple
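
A minimal replay-buffer sketch in plain Python (uniform random sampling; the field names and the added done flag are my assumptions, mirroring the experience tuple in the figure above):

```python
# A ring-buffer-like replay memory: old tuples are evicted once capacity is
# reached, and learning samples are drawn uniformly at random.
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # behaves like a ring buffer

    def add(self, *args):
        self.memory.append(Experience(*args))

    def sample(self, batch_size: int = 64):
        # Random, out-of-order sampling breaks the correlation between
        # consecutive experience tuples.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```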

Overview of Fixed Q Targets

  • Fix the function parameters used to generate the TD target, in order to decouple the TD target from the parameters being updated
  • Let's go over this in more detail by looking at the fixed Q-target equations used in the gradient-descent weight-update step of Q-Learning
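
The fixed Q-target equation referred to above is an image in the original post; the usual form, with a separate frozen copy of the weights w^- used only inside the TD target, is:

\Delta w = \alpha \, \bigl( R + \gamma \, \max_{a} \hat{Q}(S', a, w^-) - \hat{Q}(S, A, w) \bigr) \, \nabla_w \hat{Q}(S, A, w)

Every so often w^- is overwritten with the latest w; in between, the target stays fixed, so the estimate is not chasing a moving target.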

Overview of Deep Q Learning Algorithm

Algorithm for Deep Q learning
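
The algorithm figure above is an image, so here is a compact outline of the usual DQN training loop combining the two ideas, experience replay and fixed Q-targets. It assumes an older Gym-style environment with reset()/step() and reuses the QNetwork and ReplayBuffer sketched earlier; treat it as an illustrative sketch under those assumptions, not a transcription of the figure.

```python
# Sketch of a DQN training loop: act with epsilon-greedy, store experience,
# learn from random batches, and compute TD targets with a frozen target net.
import copy
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, buffer, episodes=500, batch_size=64,
              gamma=0.99, eps=0.1, lr=1e-3, target_sync=1000):
    target_net = copy.deepcopy(q_net)                      # fixed Q-target network (w^-)
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False                   # assumes old Gym API
        while not done:
            # Act with the exploratory epsilon-greedy behaviour policy
            s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            if torch.rand(1).item() < eps:
                action = env.action_space.sample()
            else:
                action = int(q_net(s).argmax(dim=1))
            next_state, reward, done, _ = env.step(action)
            buffer.add(state, action, reward, next_state, done)
            state = next_state
            step += 1

            if len(buffer) >= batch_size:
                # Experience replay: learn from a random batch of past tuples
                batch = buffer.sample(batch_size)
                states = torch.as_tensor([e.state for e in batch], dtype=torch.float32)
                actions = torch.as_tensor([e.action for e in batch]).unsqueeze(1)
                rewards = torch.as_tensor([e.reward for e in batch], dtype=torch.float32)
                next_states = torch.as_tensor([e.next_state for e in batch], dtype=torch.float32)
                dones = torch.as_tensor([e.done for e in batch], dtype=torch.float32)

                # Fixed Q-targets: the TD target uses the frozen target network
                with torch.no_grad():
                    max_next = target_net(next_states).max(dim=1).values
                    targets = rewards + gamma * max_next * (1.0 - dones)
                q_sa = q_net(states).gather(1, actions).squeeze(1)
                loss = F.mse_loss(q_sa, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step % target_sync == 0:
                target_net.load_state_dict(q_net.state_dict())  # w^- <- w
```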

Conclusion

This is where I will end this overview, as it is already quite comprehensive. I will continue the DQN series with:

  • Some of the improvements on DQN, like Double DQN, Prioritized Experience Replay, and Dueling Networks
  • How to solve the Atari game, with code from OpenAI
  • Going through some of the papers in these areas, as listed in the Readings section below

Readings
