Deep Deterministic Policy Gradient (DDPG) — an off-policy Reinforcement Learning algorithm

Dhanoop Karunakaran
Intro to Artificial Intelligence
4 min read · Nov 23, 2020
(Header image source: [6])

Let’s go through a few concepts before diving into DDPG.

Deterministic Policy Gradient (DPG)

Traditionally, policy gradient algorithms have been used with a stochastic policy function π(.|s). That means the policy π(.|s) is represented as a distribution over actions: for a given state, there is a probability assigned to each action in the action space. In DPG, instead of the stochastic policy π, a deterministic policy μ is followed. For a given state s, there is a deterministic decision a = μ(s) instead of a distribution over actions.

The gradient of the objective function for the stochastic policy gradient algorithm can be written as below:
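In its standard form (following [2]), it is the expectation of the score function weighted by the action-value, where ρ^π denotes the state distribution under π:

\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\big]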

The policy gradient can also be written in other equivalent forms, for example in terms of the return G or using a baseline function. Source: [2]

We can rewrite the equation for a deterministic policy by replacing π with μ. Until 2014, it was believed that a deterministic policy could not be used with a policy gradient algorithm. In [2], David Silver et al. conceived the idea of DPG and provided the proof.

Now we can rewrite the above equation with the deterministic policy as shown below:
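Under the deterministic policy gradient theorem of [2], the expectation is taken over states only, and the action-value is differentiated with respect to the action:

\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\big]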

Source: [2]

Q-learning

Q-learning is a value-based, off-policy, temporal difference (TD) reinforcement learning algorithm. Off-policy means the agent follows a behaviour policy to choose the action that takes it from state s_t to the next state s_t+1. From s_t+1, it uses a policy π that is different from the behaviour policy. In Q-learning, that policy π is the absolutely greedy policy: from the next state s_t+1, we take the action with the maximum Q-value.

Computation of the Q-value in Q-learning. Source: [4]
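In symbols, with discount factor γ, the target value is:

Q(s_t, a_t) = r_t + \gamma \max_{a} Q(s_{t+1}, a)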

As we discussed for the action-value function, the above equation indicates how the Q-value of an action a taken from state s is computed in Q-learning. It is the sum of the immediate reward, obtained by following a behaviour policy (ϵ-soft, ϵ-greedy, or softmax), and the discounted value of the absolutely greedy action from state s_t+1 (the action with the maximum Q-value over all other actions).

Basic update rule in Q-learning
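With learning rate α and TD error δ_t, this is:

Q_{new}(s_t, a_t) = Q_{old}(s_t, a_t) + \alpha\, \delta_t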

It is important to mention the update rule in Q-learning. The new Q-value is the sum of the old Q-value and the TD error (scaled by the learning rate α).

Expanding the TD error in Q-learning
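The TD error expands to:

\delta_t = r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)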

The TD error is computed by subtracting the old Q-value from the TD target (the immediate reward plus the discounted maximum Q-value of the next state).

Updating rule in Q-learning. Source: [3]
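Substituting the TD error into the update gives the familiar Q-learning rule:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big]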

The above equation shows the expanded view of the update rule.

If you want to know more about Q-learning, read this blog.

Actor-critic

In simple terms, Actor-Critic is a temporal difference (TD) version of policy gradient [3]. It has two networks: the actor and the critic. The actor decides which action should be taken, and the critic informs the actor how good the action was and how it should adjust. The learning of the actor is based on the policy gradient approach, while the critic evaluates the action produced by the actor by computing the value function.
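A common one-step TD formulation (with a state-value critic V_w, a stochastic actor π_θ, and illustrative learning rates α_θ and α_w) is:

\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)
\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t)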

If you want to know more about actor-critic, please read this blog.

Deep Deterministic Policy Gradient (DDPG)

DDPG is a model-free, off-policy, actor-critic algorithm that combines Deep Q-Learning (DQN) and DPG. The original DQN works in a discrete action space; DDPG extends it to the continuous action space while learning a deterministic policy [1].

As it is an off-policy algorithm, DDPG uses two separate policies for exploration and for updates [5]: a stochastic behaviour policy for exploration and a deterministic policy for the target update [5].
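As a rough sketch of the behaviour policy (the DDPG paper uses Ornstein-Uhlenbeck noise; plain Gaussian noise is a common, simpler choice, and the function and parameter names here are illustrative):

```python
import numpy as np

def select_action(actor, state, noise_std=0.1, low=-1.0, high=1.0):
    """Behaviour policy: deterministic action from the actor plus exploration noise."""
    action = np.asarray(actor(state))                             # deterministic part: a = mu(s)
    noise = np.random.normal(0.0, noise_std, size=action.shape)   # exploration noise
    return np.clip(action + noise, low, high)                     # keep the action within bounds
```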

DDPG is an actor-critic algorithm, so it has two networks: an actor and a critic. The actor produces the action used for exploration. During the actor's update, feedback from the critic is used: the actor is adjusted in the direction that increases the Q-value the critic assigns to its action. The critic network itself is updated based on a TD error, similar to the Q-learning update rule.
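Concretely, in the usual DDPG notation (Q′ and μ′ are the target networks discussed below), the critic minimises the squared TD error against the target y_t, and the actor follows the deterministic policy gradient through the critic:

y_t = r_t + \gamma\, Q'(s_{t+1}, \mu'(s_{t+1}))
L(\theta^{Q}) = \big(Q(s_t, a_t) - y_t\big)^2
\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{a} Q(s, a)\big|_{a = \mu(s)}\, \nabla_{\theta^{\mu}} \mu(s)\big]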

We learned earlier that an instability issue can arise in Q-learning when a deep neural network is used as the function approximator. To solve this, experience replay and target networks are used, just as in DQN.
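Putting the pieces together, here is a minimal, self-contained DDPG update sketch in PyTorch; the network sizes, hyperparameters, and replay-buffer handling are illustrative assumptions rather than the exact setup from [1]:

```python
# Minimal DDPG update sketch (illustrative; hyperparameters and architectures are assumptions).
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, action_dim = 3, 1
gamma, tau = 0.99, 0.005          # discount factor and soft-update rate

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor = mlp(state_dim, action_dim, nn.Tanh())         # deterministic policy mu(s)
critic = mlp(state_dim + action_dim, 1)               # action-value Q(s, a)
target_actor = mlp(state_dim, action_dim, nn.Tanh())  # target networks start as copies
target_critic = mlp(state_dim + action_dim, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

replay_buffer = deque(maxlen=100_000)                 # stores (s, a, r, s_next, done) tuples

def ddpg_update(batch_size=64):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    r, done = r.unsqueeze(1), done.unsqueeze(1)

    # Critic: regress Q(s, a) onto the TD target built with the target networks.
    with torch.no_grad():
        q_next = target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. increase Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```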

If you like my write-up, follow me on GitHub, LinkedIn, and/or my Medium profile.
