DEEP DETERMINISTIC POLICY GRADIENT FOR CONTINUOUS ACTION SPACE

Astarag Mohapatra · Published in Analytics Vidhya · Jun 23, 2021

In the previous article on policy gradient methods, we discussed the shortcomings of PG-based methods. They are not sample-efficient because the experience gathered under the previous policy is discarded after every update. They are on-policy learning methods: the policy being improved is the same policy that generates the actions, so data from older policies cannot be reused. They also make little use of learned value-function information, and exploiting a value function together with stored experience calls for an off-policy setting. This is where the Actor-Critic family of algorithms comes in, and DDPG is one of them.
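To make the sample-efficiency point concrete, here is a minimal sketch of the experience replay buffer that off-policy methods such as DDPG use to reuse past transitions. The class name, capacity, and batch size are illustrative assumptions, not details from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so an off-policy learner can reuse them."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        # Transitions collected under *any* past policy are kept,
        # unlike on-policy PG methods that discard them after each update.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniformly sample a mini-batch for an off-policy update.
        return random.sample(self.buffer, batch_size)
```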

  • Policy gradient methods are preferred over value-based methods in continuous action domains because they do not rely solely on the value function of the next state (as in 1-step TD learning) for optimization. They optimize a scoring (objective) function directly, improving it by taking actions that maximize expected reward. But to build strong algorithms we want to mix the best of both worlds, value-based and policy-based, as in the actor-critic sketch below.
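A minimal sketch of that actor-critic split, in the DDPG style: the actor maps a state to one continuous action, and the critic scores that state-action pair with Q(s, a). The layer sizes and the max_action scaling are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to one continuous action vector."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash output to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)  # rescale to the action bounds

class Critic(nn.Module):
    """Action-value function Q(s, a): scores the action chosen by the actor."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```

The actor plays the role of the policy-based method and the critic plays the role of the value-based method; the critic's estimate of Q(s, a) is what supplies the gradient signal for improving the actor.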

LIMITATIONS OF VALUE-BASED METHODS

  • Algorithms like DQN utilize the state-value function V(s) or the action-value function Q(s, a). These algorithms work in discrete action domains, so a continuous problem must first be discretized, and the curse of dimensionality quickly kicks in. In the DDPG paper, the authors give the example of a human arm:

For example, a 7 degree of freedom system (as in the human arm) with the coarsest discretization aᵢ ∈ {−k, 0, k} for each joint leads to an action space with dimensionality: 3⁷ = 2187.
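A quick check of that number: with 3 discrete choices per joint and 7 independent joints, the joint action space is the Cartesian product of the per-joint choices. The snippet below reproduces the count from the quote and shows how fast it grows under a slightly finer discretization (the 11-level figure is an illustrative assumption).

```python
# Size of a discretized action space: choices_per_joint ** num_joints
num_joints = 7               # degrees of freedom of the arm
coarse = 3                   # a_i in {-k, 0, k}
print(coarse ** num_joints)  # 2187, as in the paper

# A slightly finer discretization already explodes:
finer = 11                   # e.g. 11 torque levels per joint
print(finer ** num_joints)   # 19487171 discrete actions
```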
