GoAi #4: Continuous Deep Q-Learning with Model-based Acceleration

Reference: Continuous Deep Q-Learning with Model-based Acceleration (Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, Sergey Levine; ICML 2016)

Introduction

Model-free reinforcement learning methods such as DQN have had great success in environments with discrete action spaces. However, when DQN is applied to high-dimensional action spaces, and in particular to continuous action spaces, it performs poorly because maximizing Q over the actions becomes computationally expensive. DDPG, which learns a policy network (an actor) directly, was therefore proposed to make continuous control tasks tractable.

In other words, continuous control can be implemented with two neural networks: a critic that evaluates the Q function (as in DQN) and an actor that outputs the policy (as in DDPG).

This paper explores an algorithm that uses only a single neural network for continuous control tasks and performs better than the two-network DQN + DDPG approach described above.
To further improve the efficiency of their approach, the authors also explore a model-based algorithm for accelerating model-free reinforcement learning.

Algorithm

Continuous Q-Learning with Normalized Advantage Functions

First, we define the advantage function:
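
In the paper's notation, with x the state and u the action, it is the action value minus the state value:

$$A^{\pi}(x, u) = Q^{\pi}(x, u) - V^{\pi}(x)$$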

and we implement a single neural network to evaluate it:
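
The NAF network outputs a state value V, a mean action μ, and a lower-triangular matrix L, and the advantage becomes a quadratic function of the action, so the Q value decomposes as:

$$Q(x, u \mid \theta^{Q}) = A(x, u \mid \theta^{A}) + V(x \mid \theta^{V})$$

$$A(x, u \mid \theta^{A}) = -\tfrac{1}{2}\,\bigl(u - \mu(x \mid \theta^{\mu})\bigr)^{\top} P(x \mid \theta^{P}) \bigl(u - \mu(x \mid \theta^{\mu})\bigr), \qquad P(x \mid \theta^{P}) = L(x \mid \theta^{P})\, L(x \mid \theta^{P})^{\top}$$

Below is a minimal sketch of such a head in PyTorch. The class name, layer sizes, and activations are my own illustration, not the authors' implementation; it only shows how V, μ, and L combine into Q.

```python
import torch
import torch.nn as nn

class NAFHead(nn.Module):
    """One network that outputs V(x), mu(x), and L(x), with
    Q(x, u) = V(x) - 0.5 * (u - mu)^T (L L^T) (u - mu)."""

    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.action_dim = action_dim
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(x)
        self.mu = nn.Linear(hidden, action_dim)           # mu(x), the greedy action
        self.l_entries = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, x, u):
        h = self.trunk(x)
        v = self.value(h)
        mu = torch.tanh(self.mu(h))

        # Fill a lower-triangular matrix L and exponentiate its diagonal
        # so that P = L L^T is positive definite.
        rows, cols = torch.tril_indices(self.action_dim, self.action_dim)
        L = torch.zeros(x.shape[0], self.action_dim, self.action_dim, device=x.device)
        L[:, rows, cols] = self.l_entries(h)
        diag = torch.diag_embed(torch.exp(torch.diagonal(L, dim1=1, dim2=2)))
        L = torch.tril(L, diagonal=-1) + diag

        P = L @ L.transpose(1, 2)
        diff = (u - mu).unsqueeze(-1)                     # (batch, action_dim, 1)
        advantage = -0.5 * (diff.transpose(1, 2) @ P @ diff).squeeze(-1)
        return v + advantage, v, mu                       # Q(x, u), V(x), mu(x)
```

Because the advantage is never positive and is zero exactly at u = μ(x), both the greedy action and the maximum of Q are available in closed form from a single forward pass.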

Because the advantage term is a quadratic in the action that is always non-positive and equals zero at u = μ(x), the action that maximizes Q is simply μ(x); unlike DQN, we never need an expensive search over actions.
The experimental results show that it performs better. The algorithm is summarized below.
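
The training procedure (Algorithm 1 in the paper) is essentially DQN with this head: act with μ(x) plus exploration noise, store transitions in a replay buffer, regress Q(x, u) toward the target r + γ·V(x') computed from a slowly updated target network, and softly update that target network. Here is a hedged sketch of the update step, reusing the hypothetical NAFHead above; the function name and hyperparameter values are my own placeholders, and the minibatch tensors are assumed to be already sampled from the replay buffer.

```python
import torch
import torch.nn.functional as F

def naf_update(q_net, target_net, optimizer, batch, gamma=0.99, tau=0.001):
    """One Q-learning step: regress Q(x, u) toward r + gamma * V'(x')."""
    x, u, r, x_next, done = batch             # tensors; r and done have shape (batch, 1)

    with torch.no_grad():
        _, v_next, _ = target_net(x_next, u)  # only V(x') from the target network is used
        target = r + gamma * (1.0 - done) * v_next

    q, _, _ = q_net(x, u)
    loss = F.mse_loss(q, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft target-network update, as in DDPG/NAF.
    for p_t, p in zip(target_net.parameters(), q_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

    return loss.item()
```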

Accelerating Learning with Imagination Rollouts

There are two goals: first, to explore more efficiently than epsilon-greedy; second, to produce more training data from a learned model.

To improve the efficiency of exploration, the authors try a continuous version of softmax exploration and an iLQG-based exploration policy.

To produce more data, the paper fits a linear model of the environment dynamics and uses a technique called imagination rollouts: on-policy samples generated under the learned model. This is similar to the Dyna-Q method but can be applied to larger-scale tasks.
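
Here is a rough sketch of what an imagination rollout could look like, assuming a single globally fitted linear model x' ≈ [x, u, 1]·W (the paper actually refits time-varying linear models locally around recent trajectories; the function names, `policy`, and `reward_fn` here are hypothetical stand-ins):

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of x' ~ A x + B u + c from real transitions."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W  # stacked [A; B; c], shape (state_dim + action_dim + 1, state_dim)

def imagination_rollout(W, x0, policy, reward_fn, horizon=10):
    """Generate synthetic on-policy transitions under the learned model."""
    transitions, x = [], np.asarray(x0, dtype=float)
    for _ in range(horizon):
        u = policy(x)
        x_next = np.hstack([x, u, 1.0]) @ W
        transitions.append((x, u, reward_fn(x, u), x_next))
        x = x_next
    return transitions
```

The imagined transitions are mixed into the replay buffer together with real ones, which is where the gain in sample efficiency comes from.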

The spirit of the algorithm is:

  1. Pick an exploration policy
  2. Execute actions and collect data from the real environment
  3. Generate additional data from the learned environment model via imagination rollouts
  4. Train the Q network using the NAF parameterization
  5. Update the replay buffer
  6. Refit the learned environment model

Discussion Questions

  1. What is the mathematical meaning of the advantage function equation? (the longer one, involving μ(x|θ^μ) and P(x|θ^P))
  2. What is iLQG?
  3. Why does the algorithm use a linear model instead of a nonlinear one?
  4. What are the advantages of using model-based reinforcement learning?

Thinking More about Model-based RL

One big question in my mind has been: what is the advantage of using model-based reinforcement learning? I now have some ideas about it, and I will share them here.

  1. Collecting data is expensive. In real-world settings such as robotics, every interaction with the environment (starting up the robot, running it, resetting it) takes real effort.
  2. The agent can converge faster on less data.
  3. We can take safe actions in the real environment and try risky actions in the learned model. A bad action may break the robot, but we still want the agent to learn from the experience of bad actions, so we try them in the learned model to avoid damaging the real agent.

This paper from Google DeepMind is awesome and also very difficult; there is a lot of knowledge and mathematics I still have to digest. I find it very interesting and will do my best to understand it.

#Reinforcement Learning
#AI