Anindya Sarkar · Jan 15, 2019

Solution to the Navigation Project Using Double DQN in the Udacity Nanodegree Program:

Problem Definition:

The simulation contains a single agent that navigates a large environment. At each time step, it has four actions at its disposal:

  • 0 - walk forward
  • 1 - walk backward
  • 2 - turn left
  • 3 - turn right

The state space has 37 dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 is provided for collecting a blue banana.

Our goal is to choose the optimal action given the state at each time step so as to maximize the cumulative reward. That is, I need to move the agent so that it collects as many yellow bananas as possible in as few steps as possible.

The environment is considered solved once our trained agent reaches an average episodic reward of +13 over the last 100 episodes.
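For orientation, here is a minimal interaction loop, assuming the unityagents-based starter code Udacity provides for this project; the file name "Banana.app" is a placeholder for your local environment build, and the agent acts randomly here just to show the API:

```python
import numpy as np
from unityagents import UnityEnvironment   # shipped with the Udacity starter code

# The file name below is a placeholder -- point it at your local Banana build.
env = UnityEnvironment(file_name="Banana.app")
brain_name = env.brain_names[0]
action_size = env.brains[brain_name].vector_action_space_size   # 4 actions

env_info = env.reset(train_mode=True)[brain_name]
state = env_info.vector_observations[0]     # 37-dimensional state

score = 0
while True:
    action = np.random.randint(action_size)   # random policy, just to show the loop
    env_info = env.step(action)[brain_name]
    reward = env_info.rewards[0]              # +1 yellow banana, -1 blue banana
    done = env_info.local_done[0]
    state = env_info.vector_observations[0]
    score += reward
    if done:
        break

print("Episode score:", score)
env.close()
```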

Solution:

I solved this problem with the Double DQN learning algorithm. Let me briefly describe the algorithm I used and the result I achieved.

There are two broad divisions in model-free RL algorithms. One is the policy-based family, e.g. REINFORCE, where we directly learn and optimize the policy parameters. The other is the value-based family, e.g. DQN, where we learn an optimal value function V(s) or Q-function Q(s,a), which indirectly gives us the optimal policy: for each state-action pair we calculate the Q-value, and take the action which maximizes the Q-value at that state.

Let's derive the training objective for the DQN algorithm first. We can write the Bellman equation as follows:

Q(s,a) = r(s,a) + γ · V(s')

where r(s,a) is the reward received in state s when taking action a,

γ is the discount factor,

V(s') is the expected future reward at the next state s', i.e., the value of the next state, and

Q(s,a) is the expected future reward when taking action a in state s.

So we can say that V(s), the value function, is the expectation of Q(s,a) (the Q-function), where the expectation is taken over actions.

Let's define the advantage function: A(s,a) = Q(s,a) - V(s).

The advantage function measures how much better action a is than the average action taken in state s.

Now we can define our optimal policy as follows:

choose a = argmax_a A(s,a) with probability 1

This guarantees that our policy is at least as good as a random policy, and it becomes better and better as the approximations of the Q-function and value function approach their true values later in training.

To calculate argmax_a A(s,a), notice that V(s) does not depend on a. So finding argmax_a Q(s,a) suffices.

So our optimal policy becomes:

choose a = argmax_a Q(s,a) with probability 1
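As a toy illustration with made-up Q-values for the four navigation actions, the greedy action picked from Q(s,a) is the same as the one picked from A(s,a):

```python
import numpy as np

# Hypothetical Q-values for one state and the four navigation actions
q = np.array([0.8, 2.5, -0.3, 1.1])   # Q(s, a) for a = 0..3

v = q.max()            # V(s) under the greedy policy (see V(s) = max_a Q(s,a) below)
advantage = q - v      # A(s, a) = Q(s, a) - V(s)

print(q.argmax())          # 1 -- greedy action from Q
print(advantage.argmax())  # 1 -- same greedy action from A
```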

Now, how do we train Q(s,a)?

The formula to calculate Q(s,a) is as follows:

Q(s,a) = r(s,a) + γ · V(s')

where r(s,a) is the reward, which we get from the environment, and

γ is the discount factor, a hyper-parameter of our network, typically set to 0.99.

Since our policy takes only one action in state s with probability 1 (the action corresponding to the maximum Q(s,a)), we can write V(s) = max_a Q(s,a).

Along the same lines, we can write:

V(s') = max_a' Q(s',a')

Now, we can write Q(s,a) as:

Q(s,a) = r(s,a) + γ · max_a' Q(s',a') ……(A)
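For concreteness, here is a single one-step target from equation (A), computed with made-up numbers:

```python
gamma = 0.99
reward = 1.0                        # r(s, a): the agent just collected a yellow banana
q_next = [0.4, 2.0, 1.5, -0.2]      # hypothetical Q(s', a') for the four actions

target = reward + gamma * max(q_next)   # equation (A)
print(target)                           # 1.0 + 0.99 * 2.0 = 2.98
```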

We now formulate this as a regression problem.

We design a neural network which, given a state s, outputs an estimate of Q(s,a) for each action; this is the predicted Q-value. The true Q-value (the regression target) is obtained from equation (A).

Now we train the parameters of that neural network by minimizing the MSE loss between the predicted Q and the true Q.
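Here is a minimal PyTorch sketch of that regression step. The layer sizes, the variable names, and the (1 - dones) term that zeroes the bootstrap at terminal states are illustrative assumptions, not necessarily what the original implementation uses:

```python
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    """Maps a 37-dimensional state to one Q-value per action."""
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        return self.net(state)

qnetwork = QNetwork()
optimizer = optim.Adam(qnetwork.parameters(), lr=5e-4)

def train_step(states, actions, rewards, next_states, dones, gamma=0.99):
    """One regression step on a mini-batch.

    states: (B, 37) float, actions: (B, 1) int64,
    rewards/dones: (B, 1) float, next_states: (B, 37) float.
    """
    # Predicted Q-values for the actions that were actually taken
    q_pred = qnetwork(states).gather(1, actions)
    # "True" Q-values from equation (A); no gradient flows through the target
    with torch.no_grad():
        q_next = qnetwork(next_states).max(dim=1, keepdim=True)[0]
        q_true = rewards + gamma * q_next * (1 - dones)   # no bootstrap at terminal states
    loss = nn.functional.mse_loss(q_pred, q_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```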

The overall training procedure is depicted in fig. 2.

Fig. 2: Flow of training for the Double DQN architecture

As we know, DQN overestimates Q-values. Double DQN addresses this issue by introducing another neural network called the "target network". The target network decorrelates two noise processes: one due to the action-selection process and the other due to the calculation of the Q-value. Both of these can be noisy, especially in the initial part of training, and because these noises are correlated with each other, their multiplicative effect causes DQN to overestimate the Q-value.

Double DQN uses the following trick to decorrelate these two noise processes, as shown in fig. 1 below.

Fig. 1: How Double DQN uses the target network

To choose the action at the next state we use the DQN network as before, but to calculate the Q-value of that action at the next state we use the target network. As a result, the two noise processes become detached from each other, which resolves the Q-value overestimation issue.
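In code, this only changes how the target from the earlier sketch is computed. A hedged sketch, assuming two networks named qnetwork_local (the online network) and qnetwork_target:

```python
import torch

def double_dqn_target(qnetwork_local, qnetwork_target,
                      rewards, next_states, dones, gamma=0.99):
    """Double DQN: the local net selects the next action, the target net evaluates it."""
    with torch.no_grad():
        # Action selection with the online (local) network
        best_actions = qnetwork_local(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        q_next = qnetwork_target(next_states).gather(1, best_actions)
        # No bootstrap term for terminal transitions
        return rewards + gamma * q_next * (1 - dones)
```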

To learn the target network parameters, we follow a "soft update" rule, where we change the target network parameters very slowly compared to the original Q-network parameters:

T(n) ← 0.99 · T(n) + 0.01 · Q(n), where T(n) denotes the target network parameters and Q(n) the original Q-network parameters.

That is, after every update step we fully update the original Q-network parameters, but we keep the target network parameters almost fixed, moving them only by a factor of 0.01 toward Q(n).
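A minimal sketch of that soft update in PyTorch, with tau = 0.01 matching the 0.99 / 0.01 mixing above:

```python
def soft_update(qnetwork_local, qnetwork_target, tau=0.01):
    """T(n) <- (1 - tau) * T(n) + tau * Q(n), applied parameter by parameter."""
    for target_param, local_param in zip(qnetwork_target.parameters(),
                                         qnetwork_local.parameters()):
        target_param.data.copy_((1.0 - tau) * target_param.data
                                + tau * local_param.data)
```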

This slow-updating mechanism for the target network solves the moving-target problem that we encounter when training DQN.

Replay buffer:

A convergence requirement for a supervised regression problem is that the samples in a batch should be i.i.d. The replay buffer is introduced for this purpose. It works in the following manner:

For instance, we put the last million transitions (or video frames) into a buffer and sample a mini-batch of size 32 from this buffer to train the deep network. This forms an input dataset which is stable enough for training. Because we sample randomly from the replay buffer, the transitions are more independent of each other and closer to i.i.d.
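A minimal replay-buffer sketch; the buffer size and batch size here are illustrative defaults, not necessarily the ones used in the original implementation:

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, buffer_size=int(1e5), batch_size=32):
        self.memory = deque(maxlen=buffer_size)   # oldest transitions drop out first
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, so a mini-batch is closer to i.i.d.
        return random.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)
```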

Exploration and the epsilon-greedy policy:

DQN uses an epsilon-greedy policy to choose its action. To improve training, epsilon starts at 1 and is annealed to 0.01 or 0.05 over the first million frames using the following schedule.

The epsilon-greedy policy can be formalized as shown in fig. 3:

Fig. 3: Epsilon-greedy policy for better exploration, especially in the initial part of training

where m is the number of possible actions.

At the beginning of training we select among the possible actions uniformly, but as training progresses we select the optimal action a* more and more frequently. This allows maximum exploration at the beginning, which eventually switches over to exploitation.
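A sketch of epsilon-greedy action selection with a simple per-episode decay; the decay constants are illustrative, and the "explore uniformly with probability epsilon" form below assigns the same probabilities as the ε/m formulation in fig. 3:

```python
import random
import numpy as np

def epsilon_greedy_action(q_values, eps):
    """With probability eps explore uniformly over the m actions, else exploit a*."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

# Anneal epsilon from 1.0 toward a small floor as training progresses
eps, eps_end, eps_decay = 1.0, 0.01, 0.995
for episode in range(2000):
    # ... run one episode, choosing actions via epsilon_greedy_action(q_values, eps) ...
    eps = max(eps_end, eps * eps_decay)
```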

The result I achieved with Double DQN:

Fig. 4 shows the increasing trend of episodic reward for the navigation environment.

Implementation Link: