Deep Q Learning — Explained

NancyJemimah
6 min read · Apr 24, 2020


Image Source: MIT Introduction to Deep Learning

From my previous blog, I hope you got an understanding of what Q-learning does. This blog focuses on combining deep learning with Q-learning to build a more robust algorithm. As mentioned earlier, we used the Q-table as a lookup table to pick the optimal action from the current state, and the Q-table is updated each time with the formula below.

Source: Hackernoon
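For reference, the update rule in the figure above is the standard tabular Q-learning update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the immediate reward, and $s'$ is the next state.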

Q-learning with a Q-table works for environments whose state spaces are discrete. For example, if we take Atari games (like Breakout, Space Invaders, or Doom), we have a state space that is continuous, so the table-based algorithm for updating Q-values would not be suitable.

To solve this problem, we get help from deep learning models. The model best suited to images is the Convolutional Neural Network (CNN).

We use a CNN as our Q-function approximator to produce the Q-values for each state-action pair. You might now be wondering how exactly we are going to do this, so let me give a detailed explanation of each step we take to build this Deep Q Network.

1. Input to our CNN — Preprocessing images

First and foremost, the input to the CNN is going to be images of the game. We need to preprocess these images so that our model does not spend unnecessary time processing them.

Source: Wikipedia

Step 1: Convert the image to grayscale as colors don't contribute anything to our Q-values.

Step 2: Depending on the environment we are working with, we can crop and resize the image.

These two steps need to be done before feeding the input to our network; a minimal preprocessing sketch is shown below.
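Here is a minimal sketch of such a preprocessing step in Python using OpenCV. The exact crop boundaries are an assumption and depend on the game being played:

```python
import cv2
import numpy as np

def preprocess_frame(frame):
    """Grayscale, crop, resize and normalize a raw game frame."""
    # Step 1: drop the color channels, they carry no extra information for the Q-values
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    # Step 2: crop away the score bar / borders (the rows and columns here are illustrative)
    cropped = gray[30:-10, :]
    # Resize to 84x84 and scale pixel values to [0, 1]
    resized = cv2.resize(cropped, (84, 84), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0
```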

So, by giving images as inputs, we are telling our network about the states of the environment. As I mentioned earlier, the state space is continuous, so the agent learns from the images which state the game is currently in.

This raises a question: with the help of a single image, can we understand what our state is? Definitely not. Looking at the image above, we cannot tell whether the ball is heading up to hit the bricks or coming back down toward the paddle. What can we do now?

Solution: stack several frames so that the network can perceive the motion of the agent and infer the state. How many images do we need to stack? According to the DeepMind paper, four consecutive frames are stacked together.
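One common way to implement this is to keep the last four preprocessed frames in a deque and stack them along the channel axis. The function name and channels-last layout below are my own choices, matching the state_size listed later:

```python
from collections import deque
import numpy as np

stack_size = 4  # number of consecutive frames the agent sees at once

def stack_frames(stacked_frames, raw_frame, is_new_episode):
    """Maintain a rolling window of the last `stack_size` preprocessed frames."""
    frame = preprocess_frame(raw_frame)
    if is_new_episode:
        # No history exists at the start of an episode, so repeat the first frame
        stacked_frames = deque([frame] * stack_size, maxlen=stack_size)
    else:
        stacked_frames.append(frame)  # the oldest frame is dropped automatically
    # Shape (84, 84, 4): channels-last, matching state_size = [84, 84, 4]
    state = np.stack(stacked_frames, axis=2)
    return state, stacked_frames
```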

2. Create a Neural Network (CNN) that takes the preprocessed images as input

Handwritten

The above image shows how our images flow through the network. As mentioned, we take stacked images and feed them in as input, so the image height and width go into our first hidden layer, and the 4 represents the number of stacked frames.

In our example we have two hidden layers; the images pass through them, move on to the fully connected layer, and then we get the output as the Q-values for our state-action pairs. The number of outputs equals the number of actions available in the game.

The Atari Breakout game explained above has three actions: the paddle can move left, move right, or stay still. So our output will be three Q-values, and we take the action with the maximum value so that our agent plays better.

So, here, CNN is used as a function approximator to give our Q-values. This is because our state space is continuous.
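As a concrete illustration, here is a minimal sketch of such a Q-network in TensorFlow/Keras. The two convolutional layers mirror the diagram above; the specific filter sizes follow the original DeepMind setup and should be read as one reasonable choice, not the article's exact model:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(state_size=(84, 84, 4), action_size=3, learning_rate=0.00025):
    """Convolutional Q-network: a stack of frames in, one Q-value per action out."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=state_size),
        # Two convolutional hidden layers, as in the hand-drawn diagram
        layers.Conv2D(16, (8, 8), strides=4, activation="relu"),
        layers.Conv2D(32, (4, 4), strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        # Linear output layer: one Q-value per possible action
        layers.Dense(action_size),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate),
                  loss="mse")
    return model

# In practice the number of actions comes from the environment:
# q_network = build_q_network(action_size=env.action_space.n)
```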

3. Experience Replay or Replay Memory

As games are played through simulation, the observations are collected into a library of experiences, and training is performed by randomly sampling previous experiences in batches, so that the system does not overfit to one particular evolution of the simulation. Each experience tuple stores (current state, action, reward, next state), recording everything the agent has seen.

Note: we need to specify the capacity of the replay memory ourselves; its size is a hyperparameter we choose.
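A minimal sketch of such a replay memory is given below. The class name is my own, and the extra `done` flag marking terminal states is a common addition that is not part of the four-element tuple described above:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of experience tuples, sampled uniformly at random."""

    def __init__(self, capacity=1000000):
        # Once the buffer is full, the oldest experiences are discarded first
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```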

4. Training our Agent

We use the CNN to estimate the Q-value of the current state-action pair and of the next state, so the same network is used multiple times in each update. But as we train, we keep updating that network, so the target inside the loss function keeps changing, which causes stability problems. The fix is to keep a fixed copy of the network for computing the target and only update that copy, say, every 100 steps.

We train this in a similar way to our taxi environment, using the Bellman equation.

To compute the loss, we take our current Q-value, calculate the target Q-value with the help of our policy network, and the difference between them gives us the loss. We then use an optimizer such as Adam or RMSprop to reduce this loss.
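Concretely, writing $\theta$ for the weights of the network being trained and $\theta^{-}$ for the weights of the fixed copy mentioned above, the standard DQN target and loss are:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}), \qquad L(\theta) = \big(y - Q(s, a; \theta)\big)^{2}$$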

Because we use the same network to compute the Q-value for both the current state and the target, we end up with strong correlations between the two values. This is considered one of the downfalls of DQN, and it is improved with the help of DDQN (the solution to our correlation problem).
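Putting these pieces together, here is a minimal sketch of one training step under the assumptions of the earlier sketches; the names `q_network`, `target_network` and `memory` are mine:

```python
import numpy as np

def train_step(q_network, target_network, memory, batch_size=64, gamma=0.9):
    """One DQN update on a random minibatch drawn from the replay memory."""
    if len(memory) < batch_size:
        return  # not enough experience collected yet

    batch = memory.sample(batch_size)
    states      = np.array([e[0] for e in batch])
    actions     = np.array([e[1] for e in batch])
    rewards     = np.array([e[2] for e in batch])
    next_states = np.array([e[3] for e in batch])
    dones       = np.array([e[4] for e in batch], dtype=np.float32)

    # Targets come from the fixed copy of the network (updated only every N steps)
    next_q = target_network.predict(next_states, verbose=0)
    targets = rewards + gamma * (1.0 - dones) * np.max(next_q, axis=1)

    # Only the Q-value of the action actually taken is pushed toward its target
    q_values = q_network.predict(states, verbose=0)
    q_values[np.arange(batch_size), actions] = targets

    # The MSE loss between predictions and targets is minimized by the optimizer
    q_network.train_on_batch(states, q_values)
```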

5. Hyperparameters Used

The following are the hyperparameters we can use to train our model. There is no single correct value for each parameter; we need to tune them so that our agent plays the game well.

state_size = [84, 84, 4] — Our input is a stack of 4 frames hence 84x84x4 (Width, height, channels)
action_size = env.action_space.n — number of possible actions
learning_rate = 0.00025 — Alpha (aka learning rate)

total_episodes = 100 — training episode count
max_steps = 50000 — maximum steps taken per episode
batch_size = 64 — training sample in the batch

explore_start = 1.0 — exploration probability at start
explore_stop = 0.01 — minimum exploration probability
decay_rate = 0.00001 — exponential decay rate for exploration probability (see the decay sketch after this list)

gamma = 0.9 — Discounting rate

pretrain_length = batch_size — Number of experiences stored in the Memory when initialized for the first time
memory_size = 1000000 — Number of experiences the Memory can keep

stack_size = 4 — Number of frames stacked
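The three exploration parameters above are typically combined into an epsilon-greedy schedule. A minimal sketch, assuming the common exponential-decay formula used in many DQN tutorials (the exact formula is not spelled out in the article):

```python
import numpy as np

def epsilon_by_step(decay_step, explore_start=1.0, explore_stop=0.01, decay_rate=0.00001):
    """Exploration probability that decays exponentially with the number of steps taken."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)

def choose_action(q_network, state, epsilon, action_size):
    """Epsilon-greedy action selection over the Q-values predicted by the network."""
    if np.random.rand() < epsilon:
        return np.random.randint(action_size)                  # explore
    q = q_network.predict(state[np.newaxis, ...], verbose=0)
    return int(np.argmax(q[0]))                                # exploit
```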

6. Testing our Agent in the environment
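Testing simply means letting the trained agent act greedily (no exploration) and watching the score it achieves. A minimal sketch, assuming a classic Gym-style environment (where `reset` returns an observation and `step` returns a four-element tuple) and the hypothetical helpers from the earlier sketches:

```python
def evaluate(env, q_network, episodes=5):
    """Run the trained agent greedily and print the total reward per episode."""
    for episode in range(episodes):
        raw_frame = env.reset()
        state, stacked = stack_frames(None, raw_frame, is_new_episode=True)
        total_reward, done = 0.0, False
        while not done:
            # Always take the action with the highest predicted Q-value
            q = q_network.predict(state[np.newaxis, ...], verbose=0)
            action = int(np.argmax(q[0]))
            raw_frame, reward, done, _ = env.step(action)
            state, stacked = stack_frames(stacked, raw_frame, is_new_episode=False)
            total_reward += reward
        print(f"Episode {episode}: score {total_reward}")
```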

Conclusion

  • The network gets a preprocessed image as input.
  • Then we introduce the replay memory to store our experience tuples.
  • Then the CNN model gives us the Q-values for each state-action pair, and we select the best action from the output.
  • The hyperparameters mentioned above should be tuned to get an optimized output.
  • Then we train our agent and test it in the environment.

The above-mentioned steps summarize how the Deep Q Network works.

Check your knowledge

  1. Which function are we using to find the Q-values? (Q-table, CNN, user-defined function)
  2. Do we need to fix the capacity of our replay memory to store our experiences? (yes, no)
  3. What is the output of our deep learning model? (Q-values, state images, action values, rewards)

Thank you for reading the article. Please feel free to send some suggestions and questions about the article to nancyjemimah@gmail.com.

Happy Reading! Happy Learning!

