Summary: Deep Deterministic Policy Gradients

Zac Wellmer
Arxiv Bytes
Published in
2 min readNov 10, 2017

This post is a summary of Continuous Control With Deep Reinforcement Learning.

This basic goal of this paper was to transfer the success from deep Q learning achieved in discrete action domain to a continuous action domain. In addition it shows that deep function approximators are the most promising approach to successful reinforcement learning in large, high dimensional domains. Now I will go through the key ingredients that make this a successful approach.

Deterministic Actions

Popular approaches in reinforcement learning use the Bellman equation(below). Notice the expectation over the action space.

By making our policy deterministic we can avoid this expectation over the action space.

Now the expectation only depends on the environment. This means Q can be learned off policy using transitions from a different policy.

This is important because now we are able to use a replay buffer and separate target/evaluator networks. The replay buffer prevents temporally correlated updates and the target/evaluator networks provide stability in the training process.

Soft Updates

Deep Q learning has proven to be unstable in many environments, because the Q being updated is also being bootstrapped to calculate the target value. In this paper it is proposed to use soft updates, which update the weights of the target network rather than directly copying the weights of the evaluator.

soft updates constrain the target networks to change slowly and improves the stability of learning. Without soft updates the critic is difficult to train without diverging.

Batch Normalization

Finding hyperparameters that generalize across environments. This is because different environments have different scales for their state and action space. This was overcome with batch normalization. Batch normalization normalizes each dimension of a samples in a minibatch to have unit mean and standard deviation.

Architecture

The Deep Deterministic Policy Gradient(DDPG) algorithm follows an actor critic architecture. Below is an illustration taken from Sutton and Barto’s textbook.

--

--