Deep Reinforcement Learning: DQN, Double DQN, Dueling DQN, Noisy DQN and DQN with Prioritized Experience Replay

Parsa Heidary Moghadam
Jul 21, 2019


Abstract

In this blog article we will discuss deep Q-learning and four of its most important supplements. Double DQN, Dueling DQN, Noisy DQN and DQN with Prioritized Experience Replay are these four supplements, and each of them handles a different aspect of an agent. An agent has different aspects, such as its mind, the experience it gets from the environment, and its exploration (learning new things regardless of what has already been learned). Each of the approaches mentioned handles one of these aspects. So the way they work, how they work and, the most important question of all, "why" they work will be explained in this article.

The link to the code: https://github.com/Parsa33033/Deep-Reinforcement-Learning-DQN

1 Introduction

To explain each of the methods, we first have to introduce the algorithm that started them all, the holy "Q-learning". In general, reinforcement learning has several parts: an agent, the agent's actions, an environment within which the agent takes actions, and the state or observation that the agent gets as a consequence of an action it takes. The mind of the agent in Q-learning is a table whose rows are the states or observations the agent gets from the environment and whose columns are the actions it can take. Each cell of the table is filled with a value called the Q-value, which is the value an action brings considering the state the agent is in. Let's call this table the Q-table. The Q-table is actually the brain of the agent.

Q-table
Reinforcement Learning

There are multiple steps to reinforcement learning:

1) The agent starts with a Q-table initialized with zeros in all of its cells and begins taking actions in the environment.

2) The agent gets to a new state or observation by taking an action from the table (a state is the information about the environment the agent is in, while an observation is an actual image the agent sees; depending on the environment, the agent gets either a state or an observation as input). The state corresponds to a row in the table containing Q-values for each action, and the agent takes the action with the highest value in that row (the column with the highest value in that specific row). It is like seeing a new landscape by turning around: there is an observation from the current state, which is the landscape you are seeing now, and an observation from the next state, obtained by taking the action of turning around.

3) The agent gets a reward for taking the action. The reward has a meaning: usually the higher the value the better, but sometimes a lower value can mean the agent has taken the better action. The reward comes from the environment, and the environment defines whether a lower or higher reward is better. The environment gives the agent the reward for taking an action in a specific state.

4) The agent keeps doing steps 1 to 3 and gathers information in its "memory". The memory contains tuples of state, next state, action, reward and a Boolean value indicating whether the agent has terminated. These steps keep repeating, and the agent memorizes this information until termination happens.

5) The agent sometimes has to renounce the Q-table and explore in order to learn new things, regardless of what has been learned. This is called exploration. The basic method is to explore with some probability that is lowered during training, so at first the agent learns new things regardless of what it has learned (the Q-table), but as time goes on, the agent relies more and more on what it has learned. Noisy DQN, explained later, does the job of exploration in a different way.

6) After the termination of the agent, which could mean completing the task or failing it, the agent starts replaying the experiences it gathered in its memory. A batch of a particular size is chosen from the memory and training is performed on it; basically, this is when the Q-table starts filling up. This is called Experience Replay.

What is the Q-value then? Basically, the Q-value is the reward obtained from the current state plus the maximum Q-value of the next state. That means the agent takes the reward and the next state from an experience in its memory, adds the reward to the highest Q-value found in the row of the next state in the Q-table, and writes the result into the row of the current state and the column of the action, both also taken from that experience.
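Written as a formula, this is the standard simplified one-step Q-learning target (in practice a learning rate is used to blend the new value with the old one):

$$Q(s, a) \leftarrow R + \gamma \max_{a'} Q(s', a')$$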

Here s is the state, a is the action, Q(s, a) is the value of a Q-table cell, R is the reward, s' is the next state, a' ranges over the actions available in the next state, and gamma (between zero and one, normally 0.9) is the discount factor, which basically tells the agent not to rely too much on the next state.
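The whole loop from steps 1 to 6, including this update, can be sketched as follows (a minimal, hypothetical example, not taken from the linked repository; it assumes a gym-style environment with discrete states and the classic reset()/step() API):

```python
import random
from collections import deque

import gym
import numpy as np

env = gym.make("FrozenLake-v1")   # any environment with discrete states and actions
q_table = np.zeros((env.observation_space.n, env.action_space.n))  # step 1: all zeros

gamma = 0.9                        # discount factor
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995   # step 5: exploration probability
memory = deque(maxlen=10000)       # step 4: (state, next state, action, reward, done) tuples

for episode in range(2000):
    state = env.reset()            # classic gym API; newer gym/gymnasium returns (obs, info)
    done = False
    while not done:
        # step 5: explore with probability epsilon, otherwise exploit the Q-table
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))
        # steps 2 and 3: take the action, observe the next state and the reward
        next_state, reward, done, _ = env.step(action)   # classic 4-tuple step API
        memory.append((state, next_state, action, reward, done))
        state = next_state

    # step 6: experience replay after termination
    batch = random.sample(list(memory), min(len(memory), 64))
    for s, s_next, a, r, terminal in batch:
        target = r if terminal else r + gamma * np.max(q_table[s_next])
        q_table[s, a] = target     # simplified update from the formula above

    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```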

2 Deep Q-Learning (DQN)

The only difference between Q-learning and DQN is the agent’s brain. The agent’s brain in Q-learning is the Q-table, but in DQN the agent’s brain is a deep neural network.

Deep Reinforcement Learning

The input of the neural network is the state or the observation, and the number of output neurons equals the number of actions the agent can take. To train the neural network, the inputs are the states the agent was in and the targets are the Q-values for each of the actions.
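A minimal sketch of such a network in Keras (hypothetical layer sizes; the linked repository may build its model differently):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

state_size, action_size = 4, 2    # e.g. CartPole: 4 state values, 2 actions

# the network maps a state to one Q-value per action
model = Sequential([
    Dense(64, activation="relu", input_shape=(state_size,)),
    Dense(64, activation="relu"),
    Dense(action_size, activation="linear"),   # raw Q-values, no squashing
])
model.compile(optimizer=Adam(learning_rate=1e-3), loss="mse")

# acting: feed the current state and pick the action with the highest predicted Q-value
state = np.zeros((1, state_size), dtype=np.float32)
action = int(np.argmax(model.predict(state, verbose=0)[0]))
```

During experience replay, the target for the action that was actually taken is the same R + gamma * max Q(next state) value as in tabular Q-learning, while the outputs for the other actions are kept at their predicted values so only the taken action is pushed towards the target.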

3 Double DQN

Double DQN uses two identical neural network models. One learns during experience replay, just like DQN does, and the other one is a copy of the first model from the last episode, and the Q-value is actually computed with this second model. Now, why is that? In DQN, the Q-value is calculated as the reward added to the maximum Q-value of the next state. If this calculation keeps producing a high number for a certain state, the value obtained from the output of the neural network for that specific state gets higher every time, and the difference between the output values for the different actions grows. Now if, say, action a has a higher value than action b for state s, then action a gets chosen every time for state s. If, for some memory experience, action b becomes the better action for state s, it is difficult to train the network to learn this, because it has been trained to give a much higher value to action a when given state s. So what do we do to bring down the difference between the output values (actions)? Use a secondary model that is a copy of the main model from the last episode, and since the differences between the values of this second model are smaller than those of the main model, we use this second model to obtain the Q-value:
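In standard notation this is the usual Double DQN target, where Q_main is the model being trained and Q_copy is the copied model:

$$Q(s, a) \leftarrow R + \gamma \, Q_{\text{copy}}\Big(s', \operatorname*{arg\,max}_{a'} Q_{\text{main}}(s', a')\Big)$$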

That is the way the Q-value gets calculated in Double DQN: we find the index of the highest Q-value from the main model and use that index to pick which of the second model's Q-values to take. And the rest is history.
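A sketch of that target computation during replay (hypothetical helper names; it assumes model and target_model are two Keras networks with the same architecture as above, and that the batch arrays are NumPy arrays):

```python
import numpy as np

def double_dqn_targets(model, target_model, states, actions, rewards,
                       next_states, dones, gamma=0.99):
    """Compute Double DQN training targets for a replay batch."""
    q_values = model.predict(states, verbose=0)          # start from current predictions
    # the main model chooses the best next action...
    best_next = np.argmax(model.predict(next_states, verbose=0), axis=1)
    # ...but its value is taken from the copied model
    next_q = target_model.predict(next_states, verbose=0)
    for i in range(len(states)):
        q_values[i, actions[i]] = (rewards[i] if dones[i]
                                   else rewards[i] + gamma * next_q[i, best_next[i]])
    return q_values

# after each episode (or every N steps) the copy is refreshed:
# target_model.set_weights(model.get_weights())
```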

4 Dueling DQN

The difference in Dueling DQN is in the structure of the model. The model is built so that its output follows the formula below:
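This is the standard dueling aggregation, with the mean subtraction described in the next paragraph:

$$Q(s, a) = V(s) + \Big(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\Big)$$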

Here, V(s) stands for the Value of state s and A(s, a) is the Advantage of taking action a while in state s. The Value of a state is independent of action: it measures how good it is to be in a particular state. But what is an Advantage? How is it different from the Q-value? Let's explain it with an example. The agent might be in a state where every action gives the same Q-value, so there is no preferable action in this state. What happens if we decompose the Q-value into the Value of the state and the Advantage of each action? If every action has the same result, then the Advantage of every action has the same value. Now, if we subtract the mean of all the Advantages from each Advantage, we get zero (or close to zero), and the Q-value is actually just the Value of the state.

So over time the Q-value would not overshoot: states whose outcome is independent of the action would not build up a high Q-value to train on.

Dueling DQN

Mind you, the output of the model is the state Value plus the (mean-subtracted) Advantage of each action. But for training the model we use the same Q-value targets as before:
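$$Q(s, a) \leftarrow R + \gamma \max_{a'} Q(s', a')$$

(the same one-step target shown in the introduction)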

We can even combine this with the Double DQN method for computing the targets.
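A sketch of a dueling head in Keras (hypothetical; the linked repository may structure it differently), using the functional API so the value and advantage streams can be split and then recombined:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

state_size, action_size = 4, 2

inputs = layers.Input(shape=(state_size,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(64, activation="relu")(x)

value = layers.Dense(1)(x)                  # V(s): how good the state is, regardless of action
advantage = layers.Dense(action_size)(x)    # A(s, a): how much better each action is

# Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a'))
q_values = layers.Lambda(
    lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
)([value, advantage])

model = Model(inputs, q_values)
model.compile(optimizer="adam", loss="mse")   # trained on the same Q-value targets as before
```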

5 Noisy Net (Noisy DQN)

As you might recall from the introduction, the agent sometimes explores the environment regardless of what the Q-table or neural network (its brain) tells it to do. As we mentioned, exploration was driven by a probability that reduced over time, for instance starting at 1 and decaying to 0.01. Noisy Net does this job in a different way: it injects noise into the neural network, so the agent explores the environment whenever the noise in the network's output makes a different action come out with a higher value than the one that would otherwise be taken.

The way to do this is by defining the weights of the neural network differently. As you recall from neural networks, the output of a neuron is the sum of its inputs multiplied by the weights of their connections to that neuron, so the weights (W for short) are the parameters to learn. But in Noisy Net, W = Mu + Sigma * epsilon, where Mu is a variable with a random initialization, Sigma is a variable with a small constant initialization, and epsilon is the noise, sampled fresh on every forward pass (typically from a Gaussian distribution). So the fully connected layers have weights like this to learn, and over time, if exploration is not needed anymore, the value of Sigma gets close to zero and neutralizes the epsilon.
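A simplified sketch of such a noisy fully connected layer as a custom Keras layer (hypothetical; the original NoisyNet paper uses factorised Gaussian noise and also adds noise to the bias, which this sketch leaves out for brevity):

```python
import tensorflow as tf
from tensorflow.keras import layers

class NoisyDense(layers.Layer):
    """Fully connected layer with learnable noise: W = mu + sigma * epsilon."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.w_mu = self.add_weight(
            name="w_mu", shape=(in_dim, self.units), trainable=True,
            initializer=tf.keras.initializers.RandomUniform(-0.1, 0.1))   # Mu: random init
        self.w_sigma = self.add_weight(
            name="w_sigma", shape=(in_dim, self.units), trainable=True,
            initializer=tf.keras.initializers.Constant(0.017))            # Sigma: small constant init
        self.b_mu = self.add_weight(
            name="b_mu", shape=(self.units,), trainable=True, initializer="zeros")

    def call(self, inputs):
        # fresh noise on every forward pass; if training drives sigma towards zero,
        # the layer behaves like an ordinary Dense layer and exploration fades out
        epsilon = tf.random.normal(tf.shape(self.w_mu))
        w = self.w_mu + self.w_sigma * epsilon
        return tf.matmul(inputs, w) + self.b_mu
```

Replacing the last Dense layers of the DQN model with layers like this is then enough to drop the hand-tuned epsilon-greedy exploration schedule.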

6 DQN with Prioritized Experience Replay

As mentioned in the introduction, the agent starts taking actions in an environment and memorizes each experience as a tuple of state, next state, action, reward and a Boolean value indicating the termination of the agent. Furthermore, in the Experience Replay step, a batch of a certain size is chosen from the memory and the neural network is trained on that particular batch. But what if the batch chosen from the memory is not a good batch? What if the experience memories chosen are not that important to learn from? In that case we have to choose the important experience memories to put in the batch.

So how do we recognize importance? If the target Q-value computed from the reward and the next state is a lot different from the Q-value currently predicted for the current state and action, the importance is high, whether the Q-value should go up or down. This difference is called the Temporal Difference error (TD error).

So for each memory we have TD error as below:
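In standard notation, this is the difference between the target and the current prediction:

$$\delta_i = R + \gamma \max_{a'} Q(s', a') - Q(s, a)$$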

Then the probability of an experience memory being chosen would be:
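In standard notation, with the priority p_i built from the TD error above:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon$$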

Now, here epsilon is a small value that keeps every priority above zero, so that an experience whose TD error happens to be zero still has some chance of being chosen, and alpha is a hyperparameter between zero and one: if it is one, the most important memories are chosen, and if it is zero the batch gets filled uniformly at random.

So p is the priority that determines how likely an experience is to be chosen, and the batch gets filled according to these probabilities. But there is one more thing. As you might know, training the network happens stochastically, which means each experience is used for training individually. So, if an experience has a high probability, it gets chosen over and over again and the neural network would overfit on this particular experience (for more information about overfitting refer to https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76). To overcome this problem we multiply the training loss by the value below:
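This is the usual importance-sampling weight, where N is the number of experiences in memory and b is the exponent described next (it is usually written beta in the literature):

$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^{b}$$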

Where b is a value starting from zero and gradually reaching 1. In this formula the importance correction is computed from the distribution the experience was sampled from, so experiences with a high probability contribute less to each update and do not dominate the training. Therefore, the training loss is calculated as:
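A common form of the weighted loss, averaged over a batch of size B:

$$L = \frac{1}{B} \sum_{i} w_i \, \delta_i^{2}$$

Putting the sampling and the weights together, a minimal sketch (hypothetical helper; memory is a list of experience tuples and td_errors holds the last known TD error of each one):

```python
import numpy as np

def sample_prioritized(memory, td_errors, batch_size, alpha=0.6, b=0.4, eps=1e-6):
    """Sample a replay batch proportionally to |TD error| and return IS weights."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(memory), size=batch_size, p=probs)

    # importance-sampling weights correct for the non-uniform sampling;
    # dividing by the maximum keeps the weights <= 1, a common normalization
    weights = (1.0 / (len(memory) * probs[indices])) ** b
    weights /= weights.max()

    batch = [memory[i] for i in indices]
    return batch, indices, weights
```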

7 Results

The results of the methods, shown as the mean score over every 200 episodes in CartPole-v1
The results of the methods, shown as the mean score over every 500 episodes in CartPole-v1

The link to the code: https://github.com/Parsa33033/Deep-Reinforcement-Learning-DQN
