[Paper Notes 1] QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
This paper proposes QT-Opt, a novel deep reinforcement learning algorithm for vision-based robotic manipulation. The main idea is to handle continuous control by training only a deep Q-network, without a separate policy network. Instead of training an actor, which tends to make training unstable, QT-Opt uses the Cross-Entropy Method (CEM) to find the action that maximizes the Q-value. With asynchronous training of the Q-network on both off-policy and on-policy data, it achieves a 96% success rate on grasping unseen objects, which is a huge improvement on this problem.
2 What is the contribution?
- A scalable and more stable off-policy deep reinforcement learning algorithm for continuous control. DDPG and its variants such as D4PG are state-of-the-art off-policy algorithms for continuous control, but they are notoriously unstable. QT-Opt does not use a policy network and trains only a deep Q-network. To obtain actions, QT-Opt uses the Cross-Entropy Method (CEM), an evolutionary-style, derivative-free optimization approach, to find the action that maximizes the Q-network's output. This is a smart design, and the paper shows that it is more stable. Because action selection with CEM needs only the current Q-network, it is very easy to train asynchronously. This is why QT-Opt is described as scalable.
- It achieves remarkable results on vision-based robotic manipulation tasks. Its ability to generalize to unseen objects is impressive.
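To make the action-selection idea concrete, here is a minimal sketch of how CEM can approximate the action that maximizes Q(s, a). It is not the paper's implementation: the function names (`cem_argmax`, `toy_q`), the action bounds, and the default sample counts and iteration counts are all assumptions chosen for illustration, and the real Q-function would be a deep network over images rather than the toy quadratic used here.

```python
import numpy as np

def cem_argmax(q_fn, state, action_dim, n_iters=3, n_samples=64,
               n_elite=6, low=-1.0, high=1.0, seed=0):
    """Approximate argmax_a Q(state, a) with the Cross-Entropy Method.

    q_fn(state, actions) is a hypothetical callable that takes a batch of
    actions with shape (n_samples, action_dim) and returns Q-values with
    shape (n_samples,). All defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Start from a broad Gaussian covering the action box.
    mean = np.full(action_dim, (low + high) / 2.0)
    std = np.full(action_dim, (high - low) / 2.0)
    best_action, best_q = None, -np.inf

    for _ in range(n_iters):
        # Sample candidate actions and keep them inside the bounds.
        actions = rng.normal(mean, std, size=(n_samples, action_dim))
        actions = np.clip(actions, low, high)
        q_values = q_fn(state, actions)

        # Keep the highest-scoring candidates (the "elites").
        elite_idx = np.argsort(q_values)[-n_elite:]
        elites = actions[elite_idx]

        # Refit the sampling distribution to the elites.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # avoid collapsing to zero width

        # Track the best action seen across all iterations.
        top = elite_idx[-1]
        if q_values[top] > best_q:
            best_q = float(q_values[top])
            best_action = actions[top].copy()

    return best_action, best_q


def toy_q(state, actions):
    """Toy stand-in for a Q-network: the best action equals the state."""
    return -np.sum((actions - state) ** 2, axis=1)


# Usage: the returned action should land close to the state vector.
state = np.array([0.3, -0.5])
action, q_value = cem_argmax(toy_q, state, action_dim=2, n_iters=5)
```

Because this routine only needs forward passes through the Q-function, the same procedure can serve both for picking actions on a robot and for estimating the maximum over next actions when building training targets, which is what makes the single-network design workable.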
3 What can we learn from this paper?
This paper shows a very promising deep reinforcement learning approach for vision-based robotic manipulation, and potentially for other continuous control tasks as well. The algorithm is scalable, which means the skill can be trained with many robots simultaneously. Even though only Google can afford to run experiments at this scale, I still find it very inspiring.
4 Some Conclusions
1 A large-scale robot learning system is a must for achieving better results.
2 Evolution strategies show their power again. Combining evolution strategies with reinforcement learning could work even better.
3 The real world calls for off-policy RL algorithms.