Demystifying Deep Deterministic Policy Gradient (DDPG) and its implementation in ChainerRL and OpenAI Baselines

An in-depth explanation of DDPG, a popular reinforcement learning technique, and its breezy implementation using ChainerRL and TensorFlow.

Ujwal Tewari
Analytics Vidhya
5 min read · Nov 26, 2019


Deep Deterministic Policy Gradient, commonly known as DDPG, is an off-policy method that concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, which is in turn used to learn the policy.

Learning Process:

The learning process is closely related to Q-learning: if you know the optimal action-value function Q*(s, a), then the optimal action to take in any state s can be recovered as a*(s):

    a*(s) = argmax_a Q*(s, a)

Continuous Action Space Derivation:

DDPG was developed specifically for environments with continuous action spaces; in essence, the challenge there is estimating the max over actions in max_a Q*(s, a).

  1. In the case of discrete action spaces, the Q-value of each action can be estimated separately, so finding the maximum is a simple comparison.
  2. In the case of continuous action spaces, there are infinitely many actions, so computing and comparing a Q-value for every candidate action is not feasible; running an expensive search at every step is computationally prohibitive and leads to unstable learning (see the toy contrast sketched below).
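To make the contrast concrete, here is a tiny illustration of my own (not from ChainerRL or baselines): with a discrete action space the max is a literal argmax over a vector of Q-values, while with a continuous action space you can only approximate it without an actor network, e.g. by random search over candidate actions. The toy q() function and the sample sizes are made-up placeholders.

    import numpy as np

    # Discrete case: Q(s, .) is just a vector, so the max is a lookup.
    q_values = np.array([1.2, 3.4, 0.7])          # hypothetical Q(s, a) for 3 actions
    best_discrete_action = int(np.argmax(q_values))

    # Continuous case: Q(s, a) is a function of a real-valued action.
    # Without an actor network you are stuck approximating the max,
    # e.g. by random search, which is expensive and noisy.
    def q(state, action):                          # stand-in for a learned critic
        return -(action - 0.3) ** 2                # toy function, peak at a = 0.3

    candidates = np.random.uniform(-1.0, 1.0, size=1000)
    best_continuous_action = candidates[np.argmax(q(None, candidates))]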

Q-learning-based algorithms, and DDPG in particular, use the following ideas to deal with a continuous action space:

  • Make use of the Bellman equation, which the optimal action-value function Q*(s, a) must satisfy:

    Q*(s, a) = E_{s' ~ P} [ r(s, a) + γ max_{a'} Q*(s', a') ]

In this equation, s' ~ P means that the next state s' is sampled by the environment from the transition distribution P(·|s, a), and γ is the discount factor.

  • DDPG employs the mean-squared Bellman error (MSBE), which measures how closely an approximator Q_φ (a network with parameters φ) comes to satisfying the Bellman equation over a set D of transitions (s, a, r, s', d), where d = 1 if s' is terminal and 0 otherwise:

    L(φ, D) = E_{(s,a,r,s',d) ~ D} [ ( Q_φ(s, a) - ( r + γ (1 - d) max_{a'} Q_φ(s', a') ) )^2 ]
  • Make use of an experience replay buffer, a store of previous transitions (s, a, r, s', d); sampling minibatches from it gives Q-learning-based approximators a much more stable learning behavior (a minimal buffer sketch appears below).
  • DDPG also deploys a target network to deal with non-stationary target values and make learning more stable. The "target" is the quantity below: when we minimize the MSBE loss, we are trying to make the Q-function more like this target, which is computed with a separate, slowly changing set of parameters φ_targ:

    y = r + γ (1 - d) max_{a'} Q_{φ_targ}(s', a')
  • Whereas a DQN-style target network is simply copied over from the main network every fixed number of steps, DDPG's target network is updated once per main-network update by Polyak averaging, with ρ close to 1:

    φ_targ ← ρ φ_targ + (1 - ρ) φ

Thus DDPG deals with this huge continuous action space and the expensive max computation by using a target policy network to compute an action that approximately maximizes Q_{φ_targ}, so the max over actions reduces to a single forward pass of the actor.
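Since the replay buffer does a lot of the quiet work here, a minimal sketch in plain Python/NumPy may help; the capacity, defaults, and uniform sampling below are illustrative choices of mine, not the actual ChainerRL or baselines implementation.

    import numpy as np

    class ReplayBuffer:
        """Fixed-size FIFO store of (s, a, r, s', done) transitions."""

        def __init__(self, capacity=100_000):
            self.capacity = capacity
            self.storage = []
            self.next_idx = 0

        def add(self, state, action, reward, next_state, done):
            transition = (state, action, reward, next_state, done)
            if len(self.storage) < self.capacity:
                self.storage.append(transition)
            else:                                   # overwrite the oldest transition
                self.storage[self.next_idx] = transition
            self.next_idx = (self.next_idx + 1) % self.capacity

        def sample(self, batch_size=64):
            idx = np.random.randint(0, len(self.storage), size=batch_size)
            batch = [self.storage[i] for i in idx]
            states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
            return states, actions, rewards, next_states, dones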

Time to note down some key points.

Key Takeaways:

  • Q-learning in DDPG is performed by minimizing the MSBE loss above with stochastic gradient descent.
  • DDPG is essentially an off-policy learning method.
  • It is basically Q-learning adapted to continuous action spaces.
  • It uses a target network together with experience replay for stable and efficient learning (a schematic update follows right after this list).
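Putting the takeaways together, here is a schematic single critic update in plain NumPy: fake a minibatch with random arrays, form the target with the target networks (note how the max over a' is replaced by the target actor's output, as described above), take one SGD step on the MSBE loss, and Polyak-average the target parameters. Linear stand-ins replace the actual actor/critic networks, and every dimension and hyperparameter is an illustrative assumption, not ChainerRL's or baselines' configuration.

    import numpy as np

    obs_dim, act_dim, batch = 3, 1, 64
    gamma, rho, lr = 0.99, 0.995, 1e-3

    # Linear stand-ins for the critic Q_phi(s, a) and actor mu_theta(s).
    phi = np.random.randn(obs_dim + act_dim) * 0.1      # critic weights
    theta = np.random.randn(obs_dim, act_dim) * 0.1     # actor weights
    phi_targ, theta_targ = phi.copy(), theta.copy()     # target networks

    def critic(w, s, a):
        return np.concatenate([s, a], axis=1) @ w        # Q(s, a)

    def actor(w, s):
        return np.tanh(s @ w)                            # action in [-1, 1]

    # A random minibatch standing in for samples from the replay buffer.
    s = np.random.randn(batch, obs_dim)
    a = np.random.uniform(-1, 1, (batch, act_dim))
    r = np.random.randn(batch)
    s2 = np.random.randn(batch, obs_dim)
    d = np.zeros(batch)                                  # done flags

    # Target: y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s')).
    y = r + gamma * (1 - d) * critic(phi_targ, s2, actor(theta_targ, s2))

    # MSBE loss gradient for the linear critic, then one SGD step.
    q = critic(phi, s, a)
    error = q - y
    grad_phi = np.concatenate([s, a], axis=1).T @ error / batch
    phi -= lr * grad_phi

    # Polyak averaging of the target parameters.
    phi_targ = rho * phi_targ + (1 - rho) * phi
    theta_targ = rho * theta_targ + (1 - rho) * theta

    print("MSBE loss:", np.mean(error ** 2))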
Understood everything? If not, there is no harm in taking a few steps back.

Implementation:

We shall implement DDPG using two frameworks:

ChainerRL

Chainer is a deep learning framework whose specialty is speed: it runs on CuPy (roughly, a GPU-backed version of NumPy) and supports parallelization across GPUs. ChainerRL, its reinforcement learning library, can be installed directly via pip:
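A minimal install, assuming a working Python environment (Chainer itself is pulled in as a dependency):

    pip install chainerrl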

After installation, the RL algorithms can be run directly with the provided example scripts. Here we will run DDPG with the train_ddpg script given below.

Training code for DDPG in MuJoCo (train_ddpg.py, linked below)

Or you can use the training code directly from here: train_ddpg.py.

make_env function for custom environments

In the script, you can modify the make_env() function to train on an environment of your own choice, or you can add your own environment files by importing them and instantiating them inside the function (a simplified sketch is shown below).
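As a rough illustration of that idea, here is a simplified stand-in rather than the exact make_env() from train_ddpg.py; MyCustomEnv and my_envs are hypothetical placeholders for your own environment code.

    import gym
    # from my_envs import MyCustomEnv       # hypothetical custom environment module

    def make_env(env_id="HalfCheetah-v2", seed=0):
        """Build and seed the training environment."""
        env = gym.make(env_id)               # swap in MyCustomEnv() here for a custom env
        env.seed(seed)
        return env

    env = make_env()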

The happiness of understanding and seeing a running code

TensorFlow / OpenAI Baselines

Running training setups from OpenAI Baselines is also relatively simple. First install TensorFlow, then clone and install the baselines package:
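A minimal setup, assuming pip and git are available (TensorFlow 1.x was the supported version at the time of writing):

    pip install tensorflow
    git clone https://github.com/openai/baselines.git
    cd baselines
    pip install -e .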

To run DDPG, use the following command:

python -m baselines.run --alg=ddpg --env=HalfCheetah-v2 --num_timesteps=1e6

This runs the algorithm for 1e6 (1M) timesteps on the MuJoCo HalfCheetah-v2 environment. See help (-h) for more options; for more information, check this link.

You have done a good job. Now go run some RL algorithms!

